2.3 Comparing two data frames (tibbles)
https://sharla.party/post/comparing-two-dfs/
A summary table from the blog:
dplyr::all_equal() | janitor::compare_df_cols() | vetr::alike() | diffdf::diffdf() | |
---|---|---|---|---|
iris is iris | ✅ | ✅ | ✅ | ✅ |
column swapped iris is iris | ✅ | ✅ | ❌ | ✅ |
missing columns | ✅ | ✅ | ✅ | ✅ |
extra columns | ✅ | ✅ | ✅ | ✅ |
missing and extra columns | ✅ | ✅ | ❌ | ✅ |
difference in class | ✅ | ✅ | ✅ | ✅ |
different columns and classes | ❌ | ✅ | ❌ | ✅ |
nice strings to use for messages | ✅ | ❌ | ✅❌ | ❌ |
returns data on differences | ❌ | ✅ | ❌ | ✅ |
First, take iris
data as a reference for comparison:
Then create some iris
variants for the purpose of comparison:
df_missing
anddf_extra
for less or more columnsdf_class
for wrong classdf_order
for new order of same set of columns
df_missing <- df %>% select(-Species)
df_extra <- df %>% mutate(extra = "extra")
df_class <- df %>% mutate(Species = as.character(Species))
df_order <- df %>% select(Species, everything())
2.3.1 dplyr::all_equal()
dplyr::all_equal(target, current)
compare if current
and target
are identical ,and it could only compares 2 data frames at the same time, with several other arguments:
ignore_col_order = TRUE
: Should order of columns be ignored?
ignore_row_order = TRUE
: Should order of rows be ignored?
convert = FALSE
: Should similar classes be converted? Currently this will convert factor to character and integer to double.
if there are missing and extra columns?
all_equal(df, df_missing)
#> [1] "Cols in x but not y: `Species`. "
all_equal(df, df_extra)
#> [1] "Cols in y but not x: `extra`. "
if there’s an incorrect variable class?
2.3.2 janitor::compare_df_cols()
Unlike dplyr::all_equal
, janitor::compare_df_cols()
returns a comparison of the columns in data frames being compared (what’s in both data frames, and their classes in each). It does not cares about rows, since it mean to show wheather several data frames can be row-binded, instead of identity(Although here we have the same rows).
library(janitor)
compare_df_cols(df, df_missing, df_extra, df_class, df_order)
#> column_name df df_missing df_extra df_class df_order
#> 1 extra <NA> <NA> character <NA> <NA>
#> 2 Petal.Length numeric numeric numeric numeric numeric
#> 3 Petal.Width numeric numeric numeric numeric numeric
#> 4 Sepal.Length numeric numeric numeric numeric numeric
#> 5 Sepal.Width numeric numeric numeric numeric numeric
#> 6 Species factor <NA> factor character factor
We can set an option return
only to return things that don’t match (or things that do):
compare_df_cols(df, df_missing, df_extra, df_class, df_order, return = "mismatch")
#> column_name df df_missing df_extra df_class df_order
#> 1 Species factor <NA> factor character factor
Here only the wrong class case is returned, and df_missing
, df_extra
, df_order
are considered matching when compared to df
.That is because compare_df_cols()
won’t be affected by order of columns, and it use either of dplyr::bind_rows()
or rbind()
to decide mathcing. bind_rows()
are looser in the sense that columns missing from a data frame would be considered a matching (i.e, select()
on a data frame will not generate a “new” one). with rbind()
, columns missing from a data.frame would be considered a mismatch
# missing column is considered a sort of "matching" when bind_method = "bind_rows"
compare_df_cols(df, df_missing, df_extra, df_class, df_order, return = "match")
#> column_name df df_missing df_extra df_class df_order
#> 1 extra <NA> <NA> character <NA> <NA>
#> 2 Petal.Length numeric numeric numeric numeric numeric
#> 3 Petal.Width numeric numeric numeric numeric numeric
#> 4 Sepal.Length numeric numeric numeric numeric numeric
#> 5 Sepal.Width numeric numeric numeric numeric numeric
# method = "rbind"
compare_df_cols(df, df_missing, df_extra, df_class, df_order, return = "match",
bind_method = "rbind")
#> column_name df df_missing df_extra df_class df_order
#> 1 Petal.Length numeric numeric numeric numeric numeric
#> 2 Petal.Width numeric numeric numeric numeric numeric
#> 3 Sepal.Length numeric numeric numeric numeric numeric
#> 4 Sepal.Width numeric numeric numeric numeric numeric
Note that janitor::compare_df_cols()
returns a data frame, which can be easily incorporated into custom message using the glue package:
compare_df_cols(df, df_missing, df_extra, df_class, df_order) %>%
mutate(comparison = glue::glue("Column: {column_name}: {df} in df,{df_missing} in df_missing, {df_extra} in df_extra,{df_order} in df_order")) %>%
select(comparison)
#> comparison
#> 1 Column: extra: NA in df,NA in df_missing, character in df_extra,NA in df_order
#> 2 Column: Petal.Length: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 3 Column: Petal.Width: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 4 Column: Sepal.Length: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 5 Column: Sepal.Width: numeric in df,numeric in df_missing, numeric in df_extra,numeric in df_order
#> 6 Column: Species: factor in df,NA in df_missing, factor in df_extra,factor in df_order
and the resulting data frame can be filtered manually when the filters from return
aren’t what i want, to see all differences:
compare_df_cols(df, df_missing, df_extra, df_class, df_order) %>%
filter(is.na(df) | df_class != df_order)
#> column_name df df_missing df_extra df_class df_order
#> 1 extra <NA> <NA> character <NA> <NA>
#> 2 Species factor <NA> factor character factor
To get a binary message to see whether a set of data.frames are row-bindable, use janitor::compare_df_cols_sames()
2.3.3 vetr::alike()
vetr::alike(target, current)
is similar to base::all.equal()
(dplyr::all_equal()
’s conuterparts in base R), but it only compares object structure. In the case of data frames, vetr::alike()
compares columns and ignores rows. It is useful for all kinds of objects, but we focus on comparing data frames here.
library(vetr)
alike(df, df_missing)
#> [1] "`df_missing` should have 5 columns (has 4)"
alike(df, df_extra)
#> [1] "`df_extra` should have 5 columns (has 6)"
alike(df, df_class)
#> [1] "`df_class$Species` should be class \"factor\" (is \"character\")"
alike(df, df_order)
#> [1] "`names(df_order)[1]` should be \"Sepal.Length\" (is \"Species\")"
As it turns out, vetr::alike()
detects all differences, and makes a declarative comparison.
2.3.4 diffdf::diffdf()
diffdf
is a package dedicated to providing tools for working with data frame difference. diffdf(base, compare)
comapres 2 data frames (compare
against base
) and outputs any differences :
library(diffdf)
diffdf(df, df_missing)
#> Warning in diffdf(df, df_missing):
#> There are columns in BASE that are not in COMPARE !!
#> Differences found between the objects!
#>
#> A summary is given below.
#>
#> There are columns in BASE that are not in COMPARE !!
#> All rows are shown in table below
#>
#> =========
#> COLUMNS
#> ---------
#> Species
#> ---------
diffdf(df, df_extra)
#> Warning in diffdf(df, df_extra):
#> There are columns in COMPARE that are not in BASE !!
#> Differences found between the objects!
#>
#> A summary is given below.
#>
#> There are columns in COMPARE that are not in BASE !!
#> All rows are shown in table below
#>
#> =========
#> COLUMNS
#> ---------
#> extra
#> ---------
diffdf(df, df_class)
#> Warning in diffdf(df, df_class):
#> There are columns in BASE and COMPARE with different modes !!
#> There are columns in BASE and COMPARE with different classes !!
#> Differences found between the objects!
#>
#> A summary is given below.
#>
#> There are columns in BASE and COMPARE with different modes !!
#> All rows are shown in table below
#>
#> ================================
#> VARIABLE MODE.BASE MODE.COMP
#> --------------------------------
#> Species numeric character
#> --------------------------------
#>
#> There are columns in BASE and COMPARE with different classes !!
#> All rows are shown in table below
#>
#> ==================================
#> VARIABLE CLASS.BASE CLASS.COMP
#> ----------------------------------
#> Species factor character
#> ----------------------------------
diffdf(df, df_order)
#> No issues were found!
diffdf()
is sensitive to missing or extra columns, wrong classes and not to order.
This function also returns a list of data frames with issues invisibly, similar to janitor::compare_df_cols()
:
issues <- diffdf(df, df_missing)
#> Warning in diffdf(df, df_missing):
#> There are columns in BASE that are not in COMPARE !!
issues$ExtColsBase
#> # A tibble: 1 x 1
#> COLUMNS
#> * <chr>
#> 1 Species
issues <- diffdf(df, df_extra)
#> Warning in diffdf(df, df_extra):
#> There are columns in COMPARE that are not in BASE !!
issues$ExtColsComp
#> # A tibble: 1 x 1
#> COLUMNS
#> * <chr>
#> 1 extra
issues <- diffdf(df, df_class)
#> Warning in diffdf(df, df_class):
#> There are columns in BASE and COMPARE with different modes !!
#> There are columns in BASE and COMPARE with different classes !!
issues$VarModeDiffs
#> VARIABLE MODE.BASE MODE.COMP
#> 6 Species numeric character
issues$VarClassDiffs %>%
unnest(CLASS.BASE) %>%
unnest(CLASS.COMP)
#> # A tibble: 1 x 3
#> VARIABLE CLASS.BASE CLASS.COMP
#> <chr> <chr> <chr>
#> 1 Species factor character