2  Reading & Manipulating Data

2.1 Table of contents for chapter 02

Chapter section list

2.2 Data Input

2.2.1 Accessing data from a package

There are two ways to access data from a package:

  • Loading the package with base::library(<packageName>)
  • Accessing with utils::data("<dataName>", package = "<packageName>") without first loading the package

R Code 2.1 : Load data and display a data extract with different methods

Listing / Output 2.1: Accessing data from a package and using different methods for displaying a data extract
Code
utils::data("Davis", package = "carData")

glue::glue("---------------------------------------------------------------------")
glue::glue("utils::str: Compactly Display the Structure of an Arbitrary R Object")
glue::glue("---------------------------------------------------------------------")
utils::str(Davis)

glue::glue(" ")
glue::glue("---------------------------------------------------------------------")
glue::glue("car::brief: Print Abbreviated Ouput")
glue::glue("---------------------------------------------------------------------")
car::brief(Davis)

glue::glue(" ")
glue::glue("---------------------------------------------------------------------")
glue::glue("my_glance_data: my own function")
glue::glue("---------------------------------------------------------------------")
my_glance_data(Davis)

glue::glue(" ")
glue::glue("---------------------------------------------------------------------")
glue::glue("dplyr::glimpse: Get a glimpse of your data")
glue::glue("---------------------------------------------------------------------")
dplyr::glimpse(Davis)
#> ---------------------------------------------------------------------
#> utils::str: Compactly Display the Structure of an Arbitrary R Object
#> ---------------------------------------------------------------------
#> 'data.frame':    200 obs. of  5 variables:
#>  $ sex   : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 2 2 2 2 ...
#>  $ weight: int  77 58 53 68 59 76 76 69 71 65 ...
#>  $ height: int  182 161 161 177 157 170 167 186 178 171 ...
#>  $ repwt : int  77 51 54 70 59 76 77 73 71 64 ...
#>  $ repht : int  180 159 158 175 155 165 165 180 175 170 ...
#>  
#> ---------------------------------------------------------------------
#> car::brief: Print Abbreviated Ouput
#> ---------------------------------------------------------------------
#> 200 x 5 data.frame (195 rows omitted)
#>     sex weight height repwt repht
#>     [f]    [i]    [i]   [i]   [i]
#> 1   M       77    182    77   180
#> 2   F       58    161    51   159
#> 3   F       53    161    54   158
#> . . .                                 
#> 199 M       90    181    91   178
#> 200 M       79    177    81   178
#>  
#> ---------------------------------------------------------------------
#> my_glance_data: my own function
#> ---------------------------------------------------------------------
#>          obs sex weight height repwt repht
#> 1          1   M     77    182    77   180
#> 49        49   F     54    174    56   173
#> 65        65   M     97    189    98   185
#> 74        74   F     56    163    57   159
#> 122      122   M     69    167    73   165
#> 128      128   F     45    157    45   153
#> 146      146   F     55    160    55   155
#> 153      153   F     47    150    45   152
#> 200...9  200   M     79    177    81   178
#> 200...10 200   M     79    177    81   178
#>  
#> ---------------------------------------------------------------------
#> dplyr::glimpse: Get a glimpse of your data
#> ---------------------------------------------------------------------
#> Rows: 200
#> Columns: 5
#> $ sex    <fct> M, F, F, M, F, M, M, M, M, M, M, F, F, F, F, F, M, F, M, F, M, …
#> $ weight <int> 77, 58, 53, 68, 59, 76, 76, 69, 71, 65, 70, 166, 51, 64, 52, 65…
#> $ height <int> 182, 161, 161, 177, 157, 170, 167, 186, 178, 171, 175, 57, 161,…
#> $ repwt  <int> 77, 51, 54, 70, 59, 76, 77, 73, 71, 64, 75, 56, 52, 64, 57, 66,…
#> $ repht  <int> 180, 159, 158, 175, 155, 165, 165, 180, 175, 170, 174, 163, 158…

There is a problem with the data files from the {carData} package: The files carry relevant information in the row names. But row names are an attribute of the data frame, not one of its columns, so functions like utils::str() or dplyr::glimpse() do not display them.

This is not a big deal here with the Davis data set, as its row names only contain the row numbers. But it is an annoying “feature” when the row names hold important information that belongs in a column of the data frame. We have already met such a data frame with Duncan in Listing / Output 1.1.

See Section 2.3 for a more detailed treatment and for why I need row names when using the {car} package.

Some packages require that you call the utils::data() function to use data frames in the package, even when the package is loaded. Other packages, like {carData}, use R’s lazy data mechanism to provide direct access to data frames when the package is loaded. On a package’s help page, accessed, for example, by help(package="carData"), click on the link for the package DESCRIPTION file. Data frames are automatically available if the line LazyData: yes appears in the package description.
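To check the LazyData field without clicking through the help pages, something like the following sketch might work. The helper name has_lazy_data() is my own invention (hypothetical, not part of {car} or base R); it only wraps utils::packageDescription() and treats a missing package or a missing field as "no lazy data".

```r
# Hypothetical helper (not from any package): does a package declare LazyData?
has_lazy_data <- function(pkg) {
  # packageDescription() returns NA (with a warning) if the package is not
  # installed, and NA for a field that is absent from the DESCRIPTION file.
  field <- suppressWarnings(
    utils::packageDescription(pkg, fields = "LazyData")
  )
  !is.na(field) && tolower(field) %in% c("yes", "true")
}

# Only meaningful if {carData} is installed on your machine.
if (requireNamespace("carData", quietly = TRUE)) {
  print(has_lazy_data("carData"))
}
```

If the result is FALSE for an installed package, calling utils::data() as in Listing / Output 2.1 is the safe way to get at its data frames.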

2.2.2 Entering a data frame directly (empty)

2.2.3 Reading data from plain-text files (empty)

2.2.4 File and Paths (empty)

2.2.5 Exporting or saving a data frame to a file (empty)

2.2.6 Reading and writing other file formats

The {car} package includes a function named car::Import() that calls the import() function in {rio} but uses different defaults and arguments, more closely matching the specifications that the authors expect users of the {car} package to prefer. In particular:

  1. Row names: The rio::import() function does not support row names, while car::Import() allows you to convert a column in the input data file into row names. By default, car::Import() assumes that the first character column with unique values in the input data set represents row names. If you do not want car::Import() to use row names, set the argument row.names = FALSE.
  2. Conversion to factors: The rio::import() function does not support automatic conversion of character data to factors, while car::Import() by default converts character variables to factors. To suppress converting character variables to factors, add the argument stringsAsFactors = FALSE.

The car::Export() function similarly writes data files in various formats. car::Export() is used in the same way as rio::export(), but it has an additional argument, keep.row.names: Setting keep.row.names = TRUE adds the row names of a data frame as an additional initial column of the output data set, with the column name id.

Warning 2.1: My preferences are {readr}, {haven} and {foreign}

As I am trying to stick with the {tidyverse} approach and not to use different packages for the same task, I will use neither the {rio} nor the {car} file import/export commands.

My preferences are {readr}, {haven} and {foreign} (in that order).

2.3 Other approaches to reading and managing data sets in R

Fox & Weisberg are not using the tidyverse file commands for two reasons:

  1. {tidyverse} packages are actively antagonistic to the use of row names.
  2. {tidyverse} packages do not automatically convert character variables to factors.

Neither of these criticisms convinces me:

ad 1) Why use row names for individual labelling?

Avoiding row names may be a reasonable strategy for huge data sets where the cases don’t have individual identities beyond their row numbers, but automatic labeling of individual cases by their row names can be very useful in regression diagnostics and for other methods that identify unusual or influential cases.

Labels for individual cases are important information and should be incorporated into the data file as a column. The strategy of using these individual names could be implemented by referencing this column or by applying a default (for instance, the first column in the data set).

As things stand, I need row names to use the advanced features of the {car} package. This excludes the use of tibbles: Whenever you add row names to a tibble, it loses its special features because it is automatically converted to a data frame.

Important 2.1: Working with row names

There are several tools for working with row names (see: Tools for working with row names). They will become important when applying the {car} package while still following the tidyverse approach as far as possible:

  • tibble::has_rownames(<dataFrame>)
  • tibble::remove_rownames(<dataFrame>)
  • tibble::rownames_to_column(<dataFrame>, var = "<rowNameColumn>")
  • tibble::rowid_to_column(<dataFrame>, var = "<rowNameId>")
  • tibble::column_to_rownames(<dataFrame>, var = "<rowNameColumn>")
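A minimal sketch of how these helpers could be combined, assuming {tibble} is installed. The small data frame is made up for illustration (it only imitates the structure of Duncan, with occupations as row names), and the column names occupation and obs are my own choices.

```r
library(tibble)

# Made-up example data: labels live in the row names, as in many {carData} sets.
occ <- data.frame(
  income = c(62, 72, 75),
  row.names = c("accountant", "pilot", "architect")
)

has_rownames(occ)                      # are there non-default row names?

# Move the row names into a regular column for tidyverse-style work ...
occ_col <- rownames_to_column(occ, var = "occupation")

# ... and restore them as row names before calling {car} functions.
occ_back <- column_to_rownames(occ_col, var = "occupation")

# Drop the labels entirely and add a running row id instead.
occ_id <- rowid_to_column(remove_rownames(occ), var = "obs")
```

The round trip (row names to column and back) is what I expect to need most often: the column version for {dplyr} work, the row-name version for {car} diagnostics.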

ad 2) Why use stringsAsFactors = TRUE as the default?

Even base R changed its default to stringsAsFactors = FALSE with version 4.0.0, and “hence by default [R] no longer converts strings to factors in calls to data.frame() and read.table()” (R 4.0.0 is released).

Besides, it is easy to change this default. In my opinion it is therefore not a big enough issue to justify giving up the advantages of the {tidyverse} packages.
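A short base R sketch of what changing the default looks like in practice. The vector of occupation names is made up for illustration; the explicit factor() call at the end is the conversion step I would use after a tidyverse-style import.

```r
occupations <- c("accountant", "pilot", "architect")

# R >= 4.0.0: character columns stay character by default.
d_default <- data.frame(type = occupations)
class(d_default$type)

# Request the old behaviour explicitly for a single call.
d_factor <- data.frame(type = occupations, stringsAsFactors = TRUE)
class(d_factor$type)

# Or convert explicitly, one column at a time, after reading the data.
d_converted <- d_default
d_converted$type <- factor(d_converted$type)
```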

Remark 2.1. : My strategy to harmonize {car} with {tidyverse}

As far as I understand, the {car} package provides many important enhancements for regression analysis that are not included in base R or in other packages, including the {tidyverse}. But I have more trust in the future development of the {tidyverse} approach than in that of {car}, whose authors, the professors Fox and Weisberg, are both already retired (professors emeriti).

So here is what I am trying to do with this book:

  1. First of all, to learn the regression tools in {car} for an advanced regression analysis.
  2. To look around whether these tools could be incorporated into the {tidyverse} approach. At the moment I am thinking of two ways:
    1. Are there packages following the {tidyverse} approach that already implement these advanced regression analysis tools? I am thinking here of the huge list of ggplot2 extensions. Currently (2024-05-19) there are 133 registered extensions available to explore!
    2. Is it possible to code the desired feature myself? At the moment I am not thinking of developing a new package. Maybe it is possible to add the feature with 2-3 lines of code to a sequence of R commands? An example that inspired me is the formula by D. Freedman and Diaconis (1981) that could be added in calls to ggplot2::geom_histogram().
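As a sketch of the "2-3 lines of code" idea: the Freedman-Diaconis rule sets the bin width to 2 × IQR(x) / n^(1/3). The function name fd_binwidth() is my own (hypothetical), and I assume here a {ggplot2} version in which the binwidth argument of geom_histogram() accepts a function. The heights are the first ten height values of Davis shown in Listing / Output 2.1.

```r
# Freedman-Diaconis rule: bin width = 2 * IQR(x) / n^(1/3)
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]                      # ignore missing values
  2 * stats::IQR(x) / length(x)^(1 / 3)
}

heights <- c(182, 161, 161, 177, 157, 170, 167, 186, 178, 171)
fd_binwidth(heights)

# Assumption: binwidth may be given as a function of the x values.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(height = heights), ggplot2::aes(x = height)) +
    ggplot2::geom_histogram(binwidth = fd_binwidth)
}
```

Base R already offers the same rule as a number of bins via grDevices::nclass.FD(), so the hand-written function is mainly useful where a bin width is required.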

2.4 Working with data frames

2.4.1 How the R interpreter finds objects (empty)

2.4.2 Missing data

Resource 2.1 : Packages for working with missing data



Code
pkgs_dl(c("Amelia", "ggmice", "mi", "mice", "naniar", "norm"))
#> # A tibble: 6 × 4
#>   package average from       to        
#>   <chr>     <dbl> <date>     <date>    
#> 1 mice       1819 2024-05-12 2024-05-18
#> 2 mi          877 2024-05-12 2024-05-18
#> 3 naniar      585 2024-05-12 2024-05-18
#> 4 norm        525 2024-05-12 2024-05-18
#> 5 Amelia      323 2024-05-12 2024-05-18
#> 6 ggmice       19 2024-05-12 2024-05-18

The most downloaded package in my selection is {mice}. It also has the most extensive documentation.

An interesting choice — not mentioned in the {car} companion — is {naniar}, because it implements NA imputation following the {tidyverse} approach.

Missing data raise a profound statistical issue: how best to use the available information (“imputation of missing data”) when missing values are encountered. I will ignore these issues here and postpone them for later treatment (learning sessions). Here I will just list general commands to exclude missing data.

Commands to exclude missing data (NA’s)

  • na.rm: Many R functions that calculate simple statistical summaries, such as base::mean(), stats::var(), stats::sd(), and stats::quantile(), have an na.rm argument to compute the summaries without NA’s. The default value is always na.rm = FALSE, so you will get an NA result if you forget to remove the missing values.
  • na.action: Most statistical-modeling functions in R have an na.action argument, which controls how missing data are handled: na.action is set to a function that takes a data frame as an argument and returns a similar data frame composed entirely of valid data.
    • na.omit(): Default value, which removes all cases with missing data on any variable in the computation. You can change the default na.action with the options() command, for example options(na.action = "na.exclude").
    • na.exclude(): Similar to na.omit() but saves information about missing cases that are removed. This is the recommended option!
    • na.fail(): Produces an error message.
  • stats::complete.cases(): Returns a logical vector indicating which cases are complete, i.e., have no missing values.
  • tidyr::drop_na(): Drops rows containing missing values. If no columns are mentioned then the function applies to the whole data set.

Don’t forget that the proper way to test for missing values is base::is.na(). The equality operator == will not work, because any comparison with NA returns NA!

Bullet List 2.1: Commands to exclude missing data (NA’s)
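To remind myself how the base R commands from the list behave, here is a small sketch with made-up numbers. The object names (x, dat, fit_omit, fit_excl) are only for illustration.

```r
x <- c(2, 4, NA, 6)

mean(x)                 # NA, because na.rm = FALSE is the default
mean(x, na.rm = TRUE)   # 4
x == NA                 # all NA: the comparison itself is missing
is.na(x)                # FALSE FALSE TRUE FALSE

# Made-up data frame with one NA in each variable.
dat <- data.frame(
  y = c(1.0, 2.1, NA, 3.9, 5.2),
  z = c(1, 2, 3, NA, 5)
)

stats::complete.cases(dat)   # TRUE TRUE FALSE FALSE TRUE
nrow(stats::na.omit(dat))    # 3

# na.omit drops the incomplete cases; na.exclude remembers where they were,
# so residuals and fitted values are padded with NA to the original length.
fit_omit <- lm(y ~ z, data = dat, na.action = na.omit)
fit_excl <- lm(y ~ z, data = dat, na.action = na.exclude)
length(residuals(fit_omit))  # 3
length(residuals(fit_excl))  # 5
```

The padding is the practical reason for preferring na.exclude(): the residuals can be added back to the original data frame as a column without any index bookkeeping.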

2.4.3 Modifying and transforming data

For modifying data I will routinely use the {dplyr} package. Here I will only list other options, as a reminder for when I come across these functions.

Modifying and transforming data without using {dplyr}

  • base::transform(): Creates or modifies several variables in a data frame at once.
  • base::cut(): Creates bins for numeric variables. (For this function I didn’t find an equivalent in the {tidyverse} ecosystem.)
  • car::Recode(): Similar to base::cut() but more flexible.
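A minimal sketch of the two base R functions, using the first four weight and height values of Davis from Listing / Output 2.1. The new column names (bmi, height_m, height_class) as well as the break points and labels are my own choices for illustration.

```r
d <- data.frame(
  weight = c(77, 58, 53, 68),      # kg
  height = c(182, 161, 161, 177)   # cm
)

# transform(): create several new variables in one call.
d <- transform(
  d,
  height_m = height / 100,
  bmi      = weight / (height / 100)^2
)

# cut(): bin a numeric variable; intervals are right-closed by default,
# i.e. (150,165], (165,180], (180,195].
d$height_class <- cut(
  d$height,
  breaks = c(150, 165, 180, 195),
  labels = c("short", "medium", "tall")
)

d
```

Note that inside a single transform() call a new variable cannot refer to another variable created in the same call, which is why bmi is computed from height and not from height_m.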

2.4.4 Binding rows and columns (empty)

2.4.5 Aggregating data frames (empty)

2.4.6 Merging data frames (empty)

2.4.7 Reshaping data (empty)

2.5 Working with matrices, arrays, and lists (empty)

2.6 Dates and times (empty)

2.7 Character data (empty)

2.8 Large datasets in R (empty)