2.3 Comparing two data frames (tibbles)

https://sharla.party/post/comparing-two-dfs/

A summary table from the blog:

	dplyr::all_equal()	janitor::compare_df_cols()	vetr::alike()	diffdf::diffdf()
iris is iris	✅	✅	✅	✅
column swapped iris is iris	✅	✅	❌	✅
missing columns	✅	✅	✅	✅
extra columns	✅	✅	✅	✅
missing and extra columns	✅	✅	❌	✅
difference in class	✅	✅	✅	✅
different columns and classes	❌	✅	❌	✅
nice strings to use for messages	✅	❌	✅❌	❌
returns data on differences	❌	✅	❌	✅

First, take iris data as a reference for comparison:

df <- iris %>% as_tibble()

Then create some iris variants for the purpose of comparison:

df_missing and df_extra for less or more columns
df_class for wrong class
df_order for new order of same set of columns

df_missing <- df %>% select(-Species)
df_extra <- df %>% mutate(extra = "extra")
df_class <- df %>% mutate(Species = as.character(Species))
df_order <- df %>% select(Species, everything())

2.3.1 `dplyr::all_equal()`

dplyr::all_equal(target, current) compare if current and target are identical ,and it could only compares 2 data frames at the same time, with several other arguments:

ignore_col_order = TRUE: Should order of columns be ignored?
ignore_row_order = TRUE: Should order of rows be ignored?
convert = FALSE: Should similar classes be converted? Currently this will convert factor to character and integer to double.

if there are missing and extra columns?

all_equal(df, df_missing)
#> [1] "Cols in x but not y: `Species`. "
all_equal(df, df_extra)
#> [1] "Cols in y but not x: `extra`. "

if there’s an incorrect variable class?

all_equal(df, df_class)
#> [1] "Incompatible type for column `Species`: x factor, y character"
all_equal(df, df_class, convert = TRUE)
#> [1] TRUE

2.3.2 `janitor::compare_df_cols()`

Unlike dplyr::all_equal, janitor::compare_df_cols() returns a comparison of the columns in data frames being compared (what’s in both data frames, and their classes in each). It does not cares about rows, since it mean to show wheather several data frames can be row-binded, instead of identity(Although here we have the same rows).

library(janitor)
compare_df_cols(df, df_missing, df_extra, df_class, df_order)
#>    column_name      df df_missing  df_extra  df_class df_order
#> 1        extra    <NA>       <NA> character      <NA>     <NA>
#> 2 Petal.Length numeric    numeric   numeric   numeric  numeric
#> 3  Petal.Width numeric    numeric   numeric   numeric  numeric
#> 4 Sepal.Length numeric    numeric   numeric   numeric  numeric
#> 5  Sepal.Width numeric    numeric   numeric   numeric  numeric
#> 6      Species  factor       <NA>    factor character   factor

We can set an option return only to return things that don’t match (or things that do):

compare_df_cols(df, df_missing, df_extra, df_class, df_order, return = "mismatch")
#>   column_name     df df_missing df_extra  df_class df_order
#> 1     Species factor       <NA>   factor character   factor

Here only the wrong class case is returned, and df_missing, df_extra, df_order are considered matching when compared to df.That is because compare_df_cols() won’t be affected by order of columns, and it use either of dplyr::bind_rows() or rbind() to decide mathcing. bind_rows() are looser in the sense that columns missing from a data frame would be considered a matching (i.e, select() on a data frame will not generate a “new” one). with rbind(), columns missing from a data.frame would be considered a mismatch

# missing column is considered a sort of "matching" when bind_method = "bind_rows"
compare_df_cols(df, df_missing, df_extra, df_class, df_order, return = "match")
#>    column_name      df df_missing  df_extra df_class df_order
#> 1        extra    <NA>       <NA> character     <NA>     <NA>
#> 2 Petal.Length numeric    numeric   numeric  numeric  numeric
#> 3  Petal.Width numeric    numeric   numeric  numeric  numeric
#> 4 Sepal.Length numeric    numeric   numeric  numeric  numeric
#> 5  Sepal.Width numeric    numeric   numeric  numeric  numeric
# method = "rbind"
compare_df_cols(df, df_missing, df_extra, df_class, df_order, return = "match",
                bind_method = "rbind")
#>    column_name      df df_missing df_extra df_class df_order
#> 1 Petal.Length numeric    numeric  numeric  numeric  numeric
#> 2  Petal.Width numeric    numeric  numeric  numeric  numeric
#> 3 Sepal.Length numeric    numeric  numeric  numeric  numeric
#> 4  Sepal.Width numeric    numeric  numeric  numeric  numeric

Note that janitor::compare_df_cols() returns a data frame, which can be easily incorporated into custom message using the glue package:

compare_df_cols(df, df_missing, df_extra, df_class, df_order) %>% 
  mutate(comparison = glue::glue("Column: {column_name}: {df} in df,{df_missing} in df_missing, {df_extra} in df_extra,{df_order} in df_order")) %>% 
  select(comparison)
#>                                                                                           comparison
#> 1                     Column: extra: NA in df,NA in df_missing, character in df_extra,NA in df_order
#> 2 Column: Petal.Length: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 3  Column: Petal.Width: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 4 Column: Sepal.Length: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 5  Column: Sepal.Width: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 6              Column: Species: factor in df,NA in df_missing, factor in df_extra,factor in df_order

and the resulting data frame can be filtered manually when the filters from return aren’t what i want, to see all differences:

compare_df_cols(df, df_missing, df_extra, df_class, df_order) %>% 
  filter(is.na(df) | df_class != df_order)
#>   column_name     df df_missing  df_extra  df_class df_order
#> 1       extra   <NA>       <NA> character      <NA>     <NA>
#> 2     Species factor       <NA>    factor character   factor

To get a binary message to see whether a set of data.frames are row-bindable, use janitor::compare_df_cols_sames()

compare_df_cols_same(df, df_missing)
#> [1] TRUE
compare_df_cols_same(df, df_missing, bind_method = "rbind")
#>   column_name    ..1  ..2
#> 1     Species factor <NA>
#> [1] FALSE

2.3.3 `vetr::alike()`

vetr::alike(target, current) is similar to base::all.equal() (dplyr::all_equal()’s conuterparts in base R), but it only compares object structure. In the case of data frames, vetr::alike() compares columns and ignores rows. It is useful for all kinds of objects, but we focus on comparing data frames here.

library(vetr)
alike(df, df_missing)
#> [1] "`df_missing` should have 5 columns (has 4)"
alike(df, df_extra)
#> [1] "`df_extra` should have 5 columns (has 6)"
alike(df, df_class)
#> [1] "`df_class$Species` should be class \"factor\" (is \"character\")"
alike(df, df_order)
#> [1] "`names(df_order)[1]` should be \"Sepal.Length\" (is \"Species\")"

As it turns out, vetr::alike() detects all differences, and makes a declarative comparison.

2.3.4 `diffdf::diffdf()`

diffdf is a package dedicated to providing tools for working with data frame difference. diffdf(base, compare) comapres 2 data frames (compare against base) and outputs any differences :

library(diffdf)
diffdf(df, df_missing)
#> Warning in diffdf(df, df_missing): 
#> There are columns in BASE that are not in COMPARE !!
#> Differences found between the objects!
#> 
#> A summary is given below.
#> 
#> There are columns in BASE that are not in COMPARE !!
#> All rows are shown in table below
#> 
#>   =========
#>    COLUMNS 
#>   ---------
#>    Species 
#>   ---------
diffdf(df, df_extra)
#> Warning in diffdf(df, df_extra): 
#> There are columns in COMPARE that are not in BASE !!
#> Differences found between the objects!
#> 
#> A summary is given below.
#> 
#> There are columns in COMPARE that are not in BASE !!
#> All rows are shown in table below
#> 
#>   =========
#>    COLUMNS 
#>   ---------
#>     extra  
#>   ---------
diffdf(df, df_class)
#> Warning in diffdf(df, df_class): 
#> There are columns in BASE and COMPARE with different modes !!
#> There are columns in BASE and COMPARE with different classes !!
#> Differences found between the objects!
#> 
#> A summary is given below.
#> 
#> There are columns in BASE and COMPARE with different modes !!
#> All rows are shown in table below
#> 
#>   ================================
#>    VARIABLE  MODE.BASE  MODE.COMP 
#>   --------------------------------
#>    Species    numeric   character 
#>   --------------------------------
#> 
#> There are columns in BASE and COMPARE with different classes !!
#> All rows are shown in table below
#> 
#>   ==================================
#>    VARIABLE  CLASS.BASE  CLASS.COMP 
#>   ----------------------------------
#>    Species     factor    character  
#>   ----------------------------------
diffdf(df, df_order)
#> No issues were found!

diffdf() is sensitive to missing or extra columns, wrong classes and not to order.

This function also returns a list of data frames with issues invisibly, similar to janitor::compare_df_cols():

issues <- diffdf(df, df_missing)
#> Warning in diffdf(df, df_missing): 
#> There are columns in BASE that are not in COMPARE !!
issues$ExtColsBase
#> # A tibble: 1 x 1
#>   COLUMNS
#> * <chr>  
#> 1 Species

issues <- diffdf(df, df_extra)
#> Warning in diffdf(df, df_extra): 
#> There are columns in COMPARE that are not in BASE !!
issues$ExtColsComp
#> # A tibble: 1 x 1
#>   COLUMNS
#> * <chr>  
#> 1 extra

issues <- diffdf(df, df_class)
#> Warning in diffdf(df, df_class): 
#> There are columns in BASE and COMPARE with different modes !!
#> There are columns in BASE and COMPARE with different classes !!
issues$VarModeDiffs
#>   VARIABLE MODE.BASE MODE.COMP
#> 6  Species   numeric character
issues$VarClassDiffs %>% 
  unnest(CLASS.BASE) %>% 
  unnest(CLASS.COMP)
#> # A tibble: 1 x 3
#>   VARIABLE CLASS.BASE CLASS.COMP
#>   <chr>    <chr>      <chr>     
#> 1 Species  factor     character