Chapter 5 Importing data: plain-text files

In this chapter:

Importing data from delimited text files
Importing data from fixed-width text files

5.1 Delimited plain-text files

Plain-text files (sometimes called “ASCII files”, after the character encoding standard they use) are often used to share data. They are limited in what they can contain, which has both downsides and upsides. On the downside, they can’t carry any additional information with them, such as variable types and labels. But on the upside, they don’t carry any additional information that requires additional interpretation by the software. This means they can be read consistently by a wide variety of software tools.

Plain-text files come in two varieties: delimited and fixed-width. “Delimited” is a reference to the fact that the files have a character that marks the boundary between two variables. A very common delimited format is the CSV file; the letters in the file name stand for “Comma Separated Values”, and use a comma as the variable delimiter. Another delimited type, somewhat less common, uses the tab character to separate the variables, and will have the extension “TSV” for, you guessed it, “Tab Separated Values”. Occasionally you will find files that use semi-colons, colons, or spaces as the delimiters.

There are base R functions to read this type of file. The functions within the {readr} (Wickham and Hester 2020) package has some advantages over the base R functions when it comes to plain-text files. (The {readr} package is part of the tidyverse.) Compared to the equivalent base R functions, the {readr} functions are quite a bit faster, and the package’s functions provide some useful flexibility when it comes to defining variable types as part of the read operation (rather than reading in the dat and then altering the variable types). As well, it returns a tibble instead of a dataframe. (For information about the difference, see (Wickham 2019a, 3.6 Data frames and tibbles). For working with very large files, you may want to investigate the fread() function in the {data.table} package(Dowle et al. 2023).)

We activate {readr} by using the library() function:

library(readr)

In this chapter, we will be using two of the {readr} functions. Each has a variety of arguments that allow us to control the behaviour of the function; those will be dealt with in the examples.

function	purpose
`read_csv()`	reads the contents of a CSV file
`read_fwf()`	reads the contents of a fixed-width text file

5.2 Using {readr} to read a CSV file

In the following examples, we will use versions of the data in the {palmerpenguins} data (A. M. Horst, Hill, and Gorman 2022; A. Horst 2020)

In the code chunks below, an R character string object penguins_path is created that contains the text string with the location of the file “penguins.csv”.

In the first example, the file has been loaded to our computer using the {dpjr} package, and uses the dpjr_data() function from that package to access the file (no matter its location on our computer).

In the second example, the file is located in a directory called “data” that is in our RStudio project folder. The {here} package will generate the full path, with slash or backslash separators (depending on your computer’s operating system), starting at the location of the RStudio project.

The penguins_path object is then used in the function read_csv() to create a dataframe object called penguins_data.

# Example 2
# create object path using the {here} package
penguins_path <- here::here("data", "penguins.csv")

# read the contents of the CSV file
penguins_data <- read_csv(penguins_path)

read_csv message.

The function returns a message letting us know the type that each variable is assigned.

The arguments of the {readr} package allow a lot of control over how the file is read. Of particular utility are the following:

argument	purpose
`col_types = cols()`	define variable types
`na = ""`	specify which values you want to be turned into `NA`
`skip = 0`	specify how many rows to skip; the default is 0
`n_max = Inf`	the maximum number of records to read; the default is `Inf` for infinity, interpreted as the last row of the file

Adding the col_types = cols() parameter has two benefits. First, defining the variable type avoids errors (such as numeric fields will be read as such, and not as characters) and speeds up the read process since R doesn’t have to check and infer the variable type.

Second, it allows us to alter what {readr} has decided for us. For example, we could set the species variable to be a factor type variable.

When we show the entire table, we can see that the variable species is now a factor type.

penguins_data <- read_csv(penguins_path,
                   col_types =
                     cols(species = col_factor()))
penguins_data

## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm
##    <fct>   <chr>              <dbl>         <dbl>
##  1 Adelie  Torgersen           39.1          18.7
##  2 Adelie  Torgersen           39.5          17.4
##  3 Adelie  Torgersen           40.3          18  
##  4 Adelie  Torgersen           NA            NA  
##  5 Adelie  Torgersen           36.7          19.3
##  6 Adelie  Torgersen           39.3          20.6
##  7 Adelie  Torgersen           38.9          17.8
##  8 Adelie  Torgersen           39.2          19.6
##  9 Adelie  Torgersen           34.1          18.1
## 10 Adelie  Torgersen           42            20.2
## # ℹ 334 more rows
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, year <dbl>

If we were working with a very large file and wanted to read the first five rows, just to see what’s there, we could use the n_max = argument:

read_csv(penguins_path, 
         n_max = 5,
         show_col_types = FALSE   # option to turn off message
         )

## # A tibble: 5 × 8
##   species island    bill_length_mm bill_depth_mm
##   <chr>   <chr>              <dbl>         <dbl>
## 1 Adelie  Torgersen           39.1          18.7
## 2 Adelie  Torgersen           39.5          17.4
## 3 Adelie  Torgersen           40.3          18  
## 4 Adelie  Torgersen           NA            NA  
## 5 Adelie  Torgersen           36.7          19.3
## # ℹ 4 more variables: flipper_length_mm <dbl>,
## #   body_mass_g <dbl>, sex <chr>, year <dbl>

5.3 Fixed-width files

Fixed-width files don’t use a delimiter, and instead specify which column(s) each variable occupies, consistently for every row in the entire file.

Fixed-width files are a hold-over from the days when storage was expensive and/or on punch cards. This meant that specific columns in the table (or card) were assigned to a particular variable, and precious space was not consumed with a delimiter. Compression methods have since meant that a CSV file with unfixed variable lengths is more common, but in some big data applications, fixed-width files can be much more efficient.

If you ever have to deal with a fixed-width file, you will (or should!) receive a companion file letting you know the locations of each variable in every row.

In this example, we will use the one provided in the {dpjr} package, “authors_fwf.txt”. This code chunk assigns the path to the file location, which we can use in our code later.

authors_path <- dpjr::dpjr_data("authors_fwf.txt")

This simple file has four (or as we will see, sometimes three, if we combine first and last name as one) variables, and three records (or rows).

first name
last name
U.S. state of birth (two-letter abbreviation)
randomly generated unique ID

If we open the file in a text editor, we see this:

authors_fwf.txt.

We could also use the {base R} function readLines() to see the lines:

readLines(authors_path)

## [1] "Toni Morrison       IL        DJ-1944-QF96"
## [2] "Kurt Vonnegut       IN        XN-5632-TP58"
## [3] "Walt Whitman        NY        LW-1752-TD08"

The arguments within the readr::read_fwf() function include those listed above for the read_csv() function, and some others that are specific to fixed-width files.

argument	purpose
`col_names = c()`	defines a list of the names for the variables
`fwf_widths(widths = c(), col_names = c())`	a list of the character length of each variable, and their names
`fwf_positions(start = , end = , col_names = c())`	character position of the start and end of each variable, and their names
`fwf_cols(variable1_name = c(), variable2_name = c())`	name followed by start and end position of each variable)
`fwf_cols(variable1_name = width, variable2_name = width)`	name followed by width of each variable)

The examples below will elaborate on these arguments.

The first approach would be to allow {readr} to guess where the column breaks are. The fwf_empty() function looks through the specified file and returns the beginning and ending locations it has guessed, as well as the skip value that the read_fwf() function uses.

Note that the column names are specified in a list.

fwf_empty(
  authors_path, 
  col_names = c("first", "last", "state", "unique_id")
  )

## $begin
## [1]  0  5 20 30
## 
## $end
## [1]  4 13 22 NA
## 
## $col_names
## [1] "first"     "last"      "state"     "unique_id"

That information can then be used by the read_fwf() function:

read_fwf(authors_path,
         fwf_empty(
           authors_path,
           col_names = c("first", "last", "state", "unique_id")
         ))

## # A tibble: 3 × 4
##   first last     state unique_id   
##   <chr> <chr>    <chr> <chr>       
## 1 Toni  Morrison IL    DJ-1944-QF96
## 2 Kurt  Vonnegut IN    XN-5632-TP58
## 3 Walt  Whitman  NY    LW-1752-TD08

Note that {readr} will impute the variable type, as it did with the CSV file. And although we won’t implement it in these examples, in the same way read_fwf() allows us to use the col_types specification, as well as na, skip, and others. See the read_fwf() reference at https://readr.tidyverse.org/reference/read_fwf.html for all the details.

Reading this fixed-width file with these three author names worked, but it could break quite easily. We just need one person with three or more components to their name (initials, spaces, or hyphens, as in Ursula K. Le Guin or Ta-Nehisi Coates), or some missing values, and the inconsistent structure throws off the read_fwf() parser.

In the following example, we read a longer list of author names:

authors2_path <- dpjr::dpjr_data("authors2_fwf.txt")

fwf_empty(authors2_path,
          col_names = c("first", "last", "state", "unique_id"))

## $begin
## [1]  0 20 30
## 
## $end
## [1] 19 22 NA
## 
## $col_names
## [1] "first"     "last"      "state"     "unique_id"

The fwf_empty() function found only three columns, as shown in the “begin” and “end” values that are returned.

When we use the read function, it finds three columns:

read_fwf(dpjr::dpjr_data("authors2_fwf.txt"))

## # A tibble: 10 × 3
##    X1                  X2    X3          
##    <chr>               <chr> <chr>       
##  1 Toni Morrison       IL    DJ-1944-QF96
##  2 Kurt Vonnegut       IN    XN-5632-TP58
##  3 Walt Whitman        NY    LW-1752-TD08
##  4 Ursula K. Le Guin   CA    EZ-9789-EA77
##  5 Ta-Nehisi Coates    NY    YN-5151-NV82
##  6 W. E. B. Du Bois    MA    HN-6134-NF80
##  7 F. Scott Fitzgerald MN    YH-4405-TR02
##  8 N.K. Jemisin        IA    EF-7340-DW20
##  9 Flannery O'Connor   GA    HB-8269-XC88
## 10 Henry David Thoreau MA    RL-8200-SU83

read_fwf message.

Adding the col_names = argument now mis-identifies the variables.

read_fwf(authors2_path,
         fwf_empty(
           authors2_path,
           col_names = c("first", "last", "state", "unique_id")
         ))

## # A tibble: 10 × 3
##    first               last  state       
##    <chr>               <chr> <chr>       
##  1 Toni Morrison       IL    DJ-1944-QF96
##  2 Kurt Vonnegut       IN    XN-5632-TP58
##  3 Walt Whitman        NY    LW-1752-TD08
##  4 Ursula K. Le Guin   CA    EZ-9789-EA77
##  5 Ta-Nehisi Coates    NY    YN-5151-NV82
##  6 W. E. B. Du Bois    MA    HN-6134-NF80
##  7 F. Scott Fitzgerald MN    YH-4405-TR02
##  8 N.K. Jemisin        IA    EF-7340-DW20
##  9 Flannery O'Connor   GA    HB-8269-XC88
## 10 Henry David Thoreau MA    RL-8200-SU83

read_fwf message.

A more reliable approach is to specify exactly the width of each column. Note that in the example below, we specify only “name” without splitting it into first and last.

The variables and their widths are as follows:

variable	width	start position	end position
name	20	1	20
state	10	21	30
uniqueID	12	31	42

The widths of each column can be added using the fwf_widths argument:

read_fwf(authors2_path,
         fwf_widths(widths = c(20, 10, 12), 
                    col_names = c("name", "state", "unique_id")))

## # A tibble: 10 × 3
##    name                state unique_id   
##    <chr>               <chr> <chr>       
##  1 Toni Morrison       IL    DJ-1944-QF96
##  2 Kurt Vonnegut       IN    XN-5632-TP58
##  3 Walt Whitman        NY    LW-1752-TD08
##  4 Ursula K. Le Guin   CA    EZ-9789-EA77
##  5 Ta-Nehisi Coates    NY    YN-5151-NV82
##  6 W. E. B. Du Bois    MA    HN-6134-NF80
##  7 F. Scott Fitzgerald MN    YH-4405-TR02
##  8 N.K. Jemisin        IA    EF-7340-DW20
##  9 Flannery O'Connor   GA    HB-8269-XC88
## 10 Henry David Thoreau MA    RL-8200-SU83

read_fwf message.

A third option is to provide two lists of locations using fwf_positions(), the first with the start positions, and the second with the end positions. The first variable name starts at position 1 and ends at position 20, and the second variable unique_id starts at 30 and ends at 42. Note that we won’t read the state variable which occupies the ten columns from 21 through 30.

read_fwf(authors2_path,
         fwf_positions(start = c(1, 31), end = c(20, 42), 
                       col_names = c("name", "unique_id")))

## # A tibble: 10 × 2
##    name                unique_id   
##    <chr>               <chr>       
##  1 Toni Morrison       DJ-1944-QF96
##  2 Kurt Vonnegut       XN-5632-TP58
##  3 Walt Whitman        LW-1752-TD08
##  4 Ursula K. Le Guin   EZ-9789-EA77
##  5 Ta-Nehisi Coates    YN-5151-NV82
##  6 W. E. B. Du Bois    HN-6134-NF80
##  7 F. Scott Fitzgerald YH-4405-TR02
##  8 N.K. Jemisin        EF-7340-DW20
##  9 Flannery O'Connor   HB-8269-XC88
## 10 Henry David Thoreau RL-8200-SU83

read_fwf message.

The fourth option is a syntactic variation on the third, with the same values but in a different order. This time, all of the relevant information about each variable is aggregated, with the variable name followed by the start and end locations.

read_fwf(authors2_path, 
         fwf_cols(name = c(1, 20), unique_id = c(31, 42)))

## # A tibble: 10 × 2
##    name                unique_id   
##    <chr>               <chr>       
##  1 Toni Morrison       DJ-1944-QF96
##  2 Kurt Vonnegut       XN-5632-TP58
##  3 Walt Whitman        LW-1752-TD08
##  4 Ursula K. Le Guin   EZ-9789-EA77
##  5 Ta-Nehisi Coates    YN-5151-NV82
##  6 W. E. B. Du Bois    HN-6134-NF80
##  7 F. Scott Fitzgerald YH-4405-TR02
##  8 N.K. Jemisin        EF-7340-DW20
##  9 Flannery O'Connor   HB-8269-XC88
## 10 Henry David Thoreau RL-8200-SU83

read_fwf message.

And finally, {readr} provides a fifth way to read in a fixed-width file that is a variation on the second approach we saw, with the name and the width values aggregated.

read_fwf(authors2_path, 
         fwf_cols(name = 20, state = 10, unique_id = 12))

## # A tibble: 10 × 3
##    name                state unique_id   
##    <chr>               <chr> <chr>       
##  1 Toni Morrison       IL    DJ-1944-QF96
##  2 Kurt Vonnegut       IN    XN-5632-TP58
##  3 Walt Whitman        NY    LW-1752-TD08
##  4 Ursula K. Le Guin   CA    EZ-9789-EA77
##  5 Ta-Nehisi Coates    NY    YN-5151-NV82
##  6 W. E. B. Du Bois    MA    HN-6134-NF80
##  7 F. Scott Fitzgerald MN    YH-4405-TR02
##  8 N.K. Jemisin        IA    EF-7340-DW20
##  9 Flannery O'Connor   GA    HB-8269-XC88
## 10 Henry David Thoreau MA    RL-8200-SU83

read_fwf message.

5.3.1 An extreme example of a fixed-width file

A particularly interesting research question is the relationship between education level and different health outcomes. In this example, we will start the process of importing a large file that contains data that will allow us to explore whether there is a correlation.

Statistics Canada has made available a Public-Use Microdata File (PUMF) of the Joint Canada/United States Survey of Health (JCUSH), a telephone survey conducted in late 2002 and early 2003. There were 8,688 respondents to the survey, 3,505 Canadians and 5,183 Americans. The data file that is available is anonymized, so we have access to the individual responses, which will facilitate additional analysis.

The webpage for the survey, including the PUMF file, data dictionary, and methodological notes, is here: https://www150.statcan.gc.ca/n1/pub/82m0022x/2003001/4069119-eng.htm

The PUMF is a fixed-width file named “JCUSH.txt”. This file is quite a lot larger than the author names example above. There are the 8,688 records, one for each survey respondent. The data also consists of 366 variables, a combination of assigned variables (such as the unique respondent identification number and the country), survey question responses, and derived variables that are calculated as part of the post-survey data processing.

Here’s what the first 40 characters of the first record looks like:

readLines(dpjr::dpjr_data("JCUSH.txt"), n = 1)

## [1] "1000033308351200211170432144421333431126"

There’s not a bit of white space anywhere in this data file. The 366 variables are stored in only 552 columns—an average of 1.5 columns per variable! This is a good example of the efficiency associated with this approach to encoding the data.

Let’s imagine that our research goal is to determine if there is a relationship between a person’s level of education and their health outcomes. We’ve reviewed the data documentation, and it is clear that the JCUSH data has what we need.

Here’s how one variable, highest level of post-secondary education achieved, appears in the data dictionary:

JCUSH health survey example.

The variable SDJ1GHED is 1 character long, in position 502 of the data. The variable might be only 1 character long, but when coupled with the information in the Content table, it becomes a very powerful piece of information.

For our analysis question, we will read in four variables: the unique household identification number, the country, the overall health outcomes, and education level. For the variable names, we will use the same ones used in the data dictionary:

name	variable	length	position
SAMPLEID	Household identifier	12	1 - 12
SPJ1_TYP	Sample type [country]	1	13
GHJ1DHDI	Health Description Index	1	32
SDJ1GHED	Highest level of post-secondary education attained	1	502

You will note in the code below that in the case of the variables that are of length “1”, the value to indicate the start and end positions are the same.

As well, the SAMPLEID variable is a 12-digit number; the read_fwf() function will interpret this as variable type double, and represent it in scientific notation. To be useful, we want to be able to see and evaluate the entire string. As a result, we use col_types() to specify SAMPLEID as a character.

jcush <- readr::read_fwf(dpjr::dpjr_data("JCUSH.txt"), 
         fwf_cols(
           SAMPLEID = c(1, 12),
           SPJ1_TYP = c(13, 13), 
           GHJ1DHDI = c(32, 32),
           SDJ1GHED = c(502, 502)
           ),
         col_types = list(
           SAMPLEID = col_character()
         ))

head(jcush)

## # A tibble: 6 × 4
##   SAMPLEID     SPJ1_TYP GHJ1DHDI SDJ1GHED
##   <chr>           <dbl>    <dbl>    <dbl>
## 1 100003330835        1        3        4
## 2 100004903392        1        3        2
## 3 100010137168        1        2        1
## 4 100010225523        1        3        3
## 5 100011623697        1        2        2
## 6 100013652729        1        4        3

Imagine, though, the challenge of handling this amount of data at one time! Between the many variables and the complex value labels, the “data” is more than just the fixed-width file. This is a circumstance where a different data storage solution (as we will see later) has some strengths.

References

Dowle, Matt, Arun Srinivasan, Jan Gorecki, Michael Chirico, Pasha Stetsenko, Tom Short, Steve Lianoglou, et al. 2023. data.table: Extension of ’Data.frame’. https://CRAN.R-project.org/package=data.table.

Horst, Allison. 2020. palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://allisonhorst.github.io/palmerpenguins/.

Horst, Allison M., Alison Presmanes Hill, and Kristen B. Gorman. 2022. “Palmer Archipelago Penguins Data in the palmerpenguins R Package - An Alternative to Anderson’s Irises.” The R Journal 14 (1): 244–54. https://doi.org/10.32614/RJ-2022-020.

———. 2019a. Advanced R. 2nd ed. CRC Press. https://adv-r.hadley.nz/index.html.

Wickham, Hadley, and Jim Hester. 2020. readr: Read Rectangular Text Data. https://CRAN.R-project.org/package=readr.