3 Importing Data

3.1 What is Data Importing

For the first R activity of this module, we will practice reading in, or importing, data from a file. After all, if we can’t get the data into R, we won’t be able to perform any analyses on it. Importing data into R may seem like a trivial task when the data is already relatively clean. However, as an actuary or data scientist, you will likely encounter situations where you have to find ways to import messy data into your R environment before you can analyze it. This chapter covers the basics of importing a relatively clean file.

3.1.1 Prerequisites

In this chapter we’re going to focus on how to use the readr package to read, or import, files into R. readr is a core member of a suite of packages called the tidyverse. The tidyverse includes the packages that you’re most likely to use in every data analysis, and we will use several other tidyverse packages as we work our way through the module.

library(tidyverse)  # Or alternatively, use library(readr)

When you begin your actuarial careers, you may encounter data in many different forms, including Excel files (.xlsx) and comma-separated values files (.csv). You may also encounter plain text files (.txt), which may use a specific delimiter to mark the boundaries between entries in the data. Or you may need to connect your R environment to a database and write queries to pull selected data from it. Suffice it to say, data can be organized in many different ways, and each requires its own method for loading the data into your R environment.
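As a minimal sketch of the delimiter idea, the snippet below writes a tiny pipe-delimited text file to a temporary location and reads it back with readr's read_delim(). The file contents, the column names, and the choice of "|" as the delimiter are made up for illustration.

```r
library(readr)

# Hypothetical pipe-delimited data, written to a temporary file for illustration
txt_path <- tempfile(fileext = ".txt")
writeLines(c("NAME|AGE", "JOHN|17", "MARY|24"), txt_path)

# delim tells read_delim() which character separates the entries
delimited <- read_delim(txt_path, delim = "|", show_col_types = FALSE)
delimited
```

The same call should work for other single-character delimiters, such as tabs or semicolons, by changing the delim argument.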

In this chapter, we will focus on one of the most common file layouts, the comma-separated values (CSV) layout. A CSV file is a type of delimited text file that uses commas to separate the values in the data. The raw data will typically look something like the example image below, where the first row of the file specifies the column names and each data entry is separated from the next by a comma.

If we were to read this example CSV file into R, it would look like this in our environment:

FIRST_NAME  LAST_NAME  AGE  SEX
JOHN        SMITH       17  M
MARY        JOHNSON     24  F
JAMES       WILLIAMS    35  M
ANNA        BROWN       50  F
SARAH       JONES       67  F
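To make this concrete, here is a small sketch that recreates the example as a CSV file and reads it with read_csv(). The raw text lines are reconstructed from the table above, so the exact contents of the original file are an assumption, and the temporary file path is only there to make the example self-contained.

```r
library(readr)

# Raw CSV text reconstructed from the table above (assumed file contents)
csv_lines <- c(
  "FIRST_NAME,LAST_NAME,AGE,SEX",
  "JOHN,SMITH,17,M",
  "MARY,JOHNSON,24,F",
  "JAMES,WILLIAMS,35,M",
  "ANNA,BROWN,50,F",
  "SARAH,JONES,67,F"
)

# Write the lines to a temporary .csv file so there is something to import
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# The first row becomes the column names; each later row becomes one record
people <- read_csv(csv_path, show_col_types = FALSE)
people
```

Printing people should show a tibble with five rows and four columns, with AGE read as a numeric column.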

3.2 Learning the Code

Now that we know the basics of data importing and have seen an example of one of the most common file formats, the CSV format, we will practice reading in our own CSV file. The file that we will import is a subset of the 2018 MEPS Household Component. The real MEPS consolidated file, which can be found here, has over 1,500 columns. We will use this example dataset for the rest of the module.

Please download the example MEPS data below, and save the file onto your local computer. Once you have done this, please read Sections 11.1 and 11.2 of R for Data Science to learn the basics of how to import a CSV file into R. Please note the exercises at the end of these two sections are not required.

We will use the read_csv() function from the readr package. Note that you can call a function in R in two ways:

  • Use the “full name” of the function, with the syntax some_package::some_function(), which in our case is readr::read_csv().
  • Or use just the name of the function, for example read_csv(). With this approach, you must have already loaded the package that houses the function you intend to use. In this case, that means loading the readr package with either library(readr) or library(tidyverse). This option allows for shorter syntax, but two functions from two different packages can share the same name, in which case you need the full name to specify exactly which package’s function you want. Using the full name is the safest way to ensure you’re calling your intended function.
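The two calling styles can be sketched side by side as follows. The small CSV written here is a hypothetical stand-in so that the example runs on its own; it is not the MEPS file.

```r
# Hypothetical small CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("FIRST_NAME,AGE", "JOHN,17", "MARY,24"), csv_path)

# Option 1: full name. The package must be installed, but need not be loaded.
people_full <- readr::read_csv(csv_path, show_col_types = FALSE)

# Option 2: load the package first, then call the function by its short name.
library(readr)
people_short <- read_csv(csv_path, show_col_types = FALSE)

# Both calls run the same function, so the imported values match
identical(people_full$AGE, people_short$AGE)
```

For the MEPS data, csv_path would instead be the location where you saved the downloaded file on your computer.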