2.3 Comparing two data frames (tibbles)

https://sharla.party/post/comparing-two-dfs/

A summary table from the blog compares four functions, dplyr::all_equal(), janitor::compare_df_cols(), vetr::alike() and diffdf::diffdf(), on the following criteria:

  • iris is iris
  • column swapped iris is iris
  • missing columns
  • extra columns
  • missing and extra columns
  • difference in class
  • different columns and classes
  • nice strings to use for messages
  • returns data on differences

First, take iris data as a reference for comparison:

Then create some iris variants for the purpose of comparison:

  • df_missing and df_extra for fewer or more columns
  • df_class for wrong class
  • df_order for new order of same set of columns
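
The code that builds these data frames is not shown here, so the following is a minimal sketch of one way to do it. The exact construction is an assumption, chosen to be consistent with the diffdf output later in this section (a dropped Species column, an added column named extra, and Species converted to character).

```r
library(dplyr)

# Reference data frame: iris as a tibble
df <- as_tibble(iris)

# Variants (assumed construction, see the note above)
df_missing <- df %>% select(-Species)                         # one column fewer
df_extra   <- df %>% mutate(extra = "extra")                  # one column more
df_class   <- df %>% mutate(Species = as.character(Species))  # factor -> character
df_order   <- df %>% select(Species, everything())            # same columns, new order
```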

2.3.1 dplyr::all_equal()

dplyr::all_equal(target, current) checks whether current and target are identical. It can only compare two data frames at a time, and it takes several other arguments:

  • ignore_col_order = TRUE: Should order of columns be ignored?
  • ignore_row_order = TRUE: Should order of rows be ignored?
  • convert = FALSE: Should similar classes be converted? Currently this will convert factor to character and integer to double.

What if there are missing or extra columns?
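
A minimal sketch of this case, assuming the iris variants are built as described above. all_equal() returns TRUE on a match and a character description otherwise, so the results are simply printed here. Note that all_equal() is deprecated in recent dplyr releases (1.1.0 and later, as far as I know), hence the suppressWarnings() wrapper.

```r
library(dplyr)

df         <- as_tibble(iris)
df_missing <- select(df, -Species)          # assumed construction
df_extra   <- mutate(df, extra = "extra")   # assumed construction

# TRUE on a match, otherwise a character vector describing the problem
res_missing <- suppressWarnings(all_equal(df, df_missing))
res_extra   <- suppressWarnings(all_equal(df, df_extra))

res_missing
res_extra
```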

What if a variable has the wrong class?
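
A sketch of the class case, again assuming df_class is iris with Species converted to character. The second call sets convert = TRUE, which per the argument description above should let a factor be compared with a character column. As before, suppressWarnings() hides the deprecation warning that recent dplyr versions emit.

```r
library(dplyr)

df       <- as_tibble(iris)
df_class <- mutate(df, Species = as.character(Species))  # assumed construction

# Strict comparison: factor vs. character is reported as a difference
res_strict <- suppressWarnings(all_equal(df, df_class))

# Relaxed comparison: similar classes are converted before comparing
res_convert <- suppressWarnings(all_equal(df, df_class, convert = TRUE))

res_strict
res_convert
```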

2.3.2 janitor::compare_df_cols()

Unlike dplyr::all_equal(), janitor::compare_df_cols() returns a comparison of the columns in the data frames being compared (which columns each data frame has, and their classes in each). It does not care about rows, since it is meant to show whether several data frames can be row-bound, not whether they are identical (although here all the data frames have the same rows).
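
A minimal sketch of the basic call, assuming the iris variants described earlier. The result has one row per column name and one column per input data frame, holding that column's class (or NA where the column is absent).

```r
library(dplyr)
library(janitor)

df         <- as_tibble(iris)
df_missing <- select(df, -Species)                        # assumed construction
df_extra   <- mutate(df, extra = "extra")
df_class   <- mutate(df, Species = as.character(Species))
df_order   <- select(df, Species, everything())

# Any number of data frames can be compared in one call
cols <- compare_df_cols(df, df_missing, df_extra, df_class, df_order)
cols
```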

We can set the return argument to return only the columns that don't match (or only those that do):
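
A sketch of the return argument, under the same assumptions about the iris variants. With return = "mismatch", only the rows whose classes disagree are kept; return = "match" would keep the others.

```r
library(dplyr)
library(janitor)

df         <- as_tibble(iris)
df_missing <- select(df, -Species)                        # assumed construction
df_extra   <- mutate(df, extra = "extra")
df_class   <- mutate(df, Species = as.character(Species))
df_order   <- select(df, Species, everything())

# Keep only the columns whose classes differ across the inputs
mismatch <- compare_df_cols(
  df, df_missing, df_extra, df_class, df_order,
  return = "mismatch"
)
mismatch
```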

Here only the wrong-class case is returned, and df_missing, df_extra and df_order are considered matching when compared to df. That is because compare_df_cols() is not affected by the order of columns, and it uses either dplyr::bind_rows() or rbind() as the criterion for matching (set through the bind_method argument). bind_rows() is looser in the sense that a column missing from a data frame is still considered a match (i.e., select() on a data frame will not generate a “new” one). With rbind(), a column missing from a data frame is considered a mismatch.
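
The stricter rbind() criterion can be sketched as follows, with the same assumed iris variants. Columns that are absent from any input should now show up as mismatches too.

```r
library(dplyr)
library(janitor)

df         <- as_tibble(iris)
df_missing <- select(df, -Species)                        # assumed construction
df_extra   <- mutate(df, extra = "extra")
df_class   <- mutate(df, Species = as.character(Species))

# bind_method = "rbind" treats a missing column as a mismatch
mismatch_rbind <- compare_df_cols(
  df, df_missing, df_extra, df_class,
  return = "mismatch",
  bind_method = "rbind"
)
mismatch_rbind
```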

Note that janitor::compare_df_cols() returns a data frame, which can easily be incorporated into a custom message using the glue package:
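
One way this could look, as a sketch: compare df with the assumed df_class variant and turn each mismatching row into a sentence. The wording of the message is purely illustrative.

```r
library(dplyr)
library(janitor)
library(glue)

df       <- as_tibble(iris)
df_class <- mutate(df, Species = as.character(Species))  # assumed construction

mismatch <- compare_df_cols(df, df_class, return = "mismatch")

# glue() is vectorised, so this yields one message per mismatching column
msg <- glue(
  "Column `{mismatch$column_name}` is {mismatch$df} in df ",
  "but {mismatch$df_class} in df_class."
)
msg
```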

The resulting data frame can also be filtered manually when the filters offered by return aren't what I want, for example to see all differences:
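
A sketch of such a manual filter, using base R on the full comparison table and the assumed iris variants. A row counts as a difference here when its class entries are not all the same, with NA (column absent) treated as a value of its own.

```r
library(dplyr)
library(janitor)

df         <- as_tibble(iris)
df_missing <- select(df, -Species)                        # assumed construction
df_extra   <- mutate(df, extra = "extra")
df_class   <- mutate(df, Species = as.character(Species))

all_cols <- compare_df_cols(df, df_missing, df_extra, df_class)

# Count distinct class entries per row; unique() keeps NA as its own value
class_cols  <- setdiff(names(all_cols), "column_name")
n_classes   <- apply(all_cols[class_cols], 1, function(x) length(unique(x)))
differences <- all_cols[n_classes > 1, ]
differences
```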

To get a binary answer on whether a set of data frames is row-bindable, use janitor::compare_df_cols_same().
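
A short sketch, assuming the same iris variants. The function returns a single TRUE or FALSE and, by default, prints the mismatching columns when there are any.

```r
library(dplyr)
library(janitor)

df       <- as_tibble(iris)
df_class <- mutate(df, Species = as.character(Species))  # assumed construction
df_order <- select(df, Species, everything())

same_order <- compare_df_cols_same(df, df_order)  # column order is ignored
same_class <- compare_df_cols_same(df, df_class)  # Species class differs

same_order
same_class
```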

2.3.3 vetr::alike()

vetr::alike(target, current) is similar to base::all.equal() (the base R counterpart of dplyr::all_equal()), but it only compares object structure. In the case of data frames, vetr::alike() compares columns and ignores rows. It is useful for all kinds of objects, but we focus on comparing data frames here.

As it turns out, vetr::alike() detects all differences, and makes a declarative comparison.
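
A minimal sketch of these comparisons, assuming the iris variants described earlier. alike() returns TRUE when the structures match and a character description of the first difference otherwise, so the results are just printed rather than annotated with expected output.

```r
library(dplyr)
library(vetr)

df         <- as_tibble(iris)
df_missing <- select(df, -Species)                        # assumed construction
df_extra   <- mutate(df, extra = "extra")
df_class   <- mutate(df, Species = as.character(Species))
df_order   <- select(df, Species, everything())

res_same    <- vetr::alike(df, df)
res_missing <- vetr::alike(df, df_missing)
res_extra   <- vetr::alike(df, df_extra)
res_class   <- vetr::alike(df, df_class)
res_order   <- vetr::alike(df, df_order)

res_same
res_missing
res_extra
res_class
res_order
```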

2.3.4 diffdf::diffdf()

diffdf is a package dedicated to providing tools for working with data frame differences. diffdf(base, compare) compares two data frames (compare against base) and outputs any differences:

library(diffdf)
diffdf(df, df_missing)
#> Warning in diffdf(df, df_missing): 
#> There are columns in BASE that are not in COMPARE !!
#> Differences found between the objects!
#> 
#> A summary is given below.
#> 
#> There are columns in BASE that are not in COMPARE !!
#> All rows are shown in table below
#> 
#>   =========
#>    COLUMNS 
#>   ---------
#>    Species 
#>   ---------
diffdf(df, df_extra)
#> Warning in diffdf(df, df_extra): 
#> There are columns in COMPARE that are not in BASE !!
#> Differences found between the objects!
#> 
#> A summary is given below.
#> 
#> There are columns in COMPARE that are not in BASE !!
#> All rows are shown in table below
#> 
#>   =========
#>    COLUMNS 
#>   ---------
#>     extra  
#>   ---------
diffdf(df, df_class)
#> Warning in diffdf(df, df_class): 
#> There are columns in BASE and COMPARE with different modes !!
#> There are columns in BASE and COMPARE with different classes !!
#> Differences found between the objects!
#> 
#> A summary is given below.
#> 
#> There are columns in BASE and COMPARE with different modes !!
#> All rows are shown in table below
#> 
#>   ================================
#>    VARIABLE  MODE.BASE  MODE.COMP 
#>   --------------------------------
#>    Species    numeric   character 
#>   --------------------------------
#> 
#> There are columns in BASE and COMPARE with different classes !!
#> All rows are shown in table below
#> 
#>   ==================================
#>    VARIABLE  CLASS.BASE  CLASS.COMP 
#>   ----------------------------------
#>    Species     factor    character  
#>   ----------------------------------
diffdf(df, df_order)
#> No issues were found!

diffdf() is sensitive to missing or extra columns and to wrong classes, but not to column order.

This function also returns a list of data frames with issues invisibly, similar to janitor::compare_df_cols():
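
A sketch of capturing that return value, assuming df_class is built as described earlier. The element names of the list are not spelled out here; str() shows whatever the installed diffdf version returns. diffdf_has_issues() is the package's helper for a simple TRUE/FALSE check.

```r
library(dplyr)
library(diffdf)

df       <- as_tibble(iris)
df_class <- mutate(df, Species = as.character(Species))  # assumed construction

# Assigning the result captures the invisibly returned object;
# suppressWarnings() silences the warning shown in the output above
res <- suppressWarnings(diffdf(df, df_class))

diffdf_has_issues(res)     # TRUE when any difference was found
str(res, max.level = 1)    # top-level structure of the returned list
```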