3 Data

Data in R can come in many formats, but a csv file is the easiest to read in. In the tidyverse approach, we do this with a function called read_csv (from the readr package, which is loaded with the tidyverse), giving it the file path.

In R, datasets, vectors, matrices, lists, etc. are all stored as “R objects”. The syntax to create an object is “name <- function()”. Below, I create an object called data that holds the result of reading in a csv file. You will see a list of your R objects in the top right panel of the RStudio window.

Note: read_csv will automatically take the first row of the csv file and use it as the variable names.

data <- read_csv(file = "") # put the path to your csv file between the quotes
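
To make the header behavior concrete, here is a minimal sketch. The file contents and the names csv_path and example_data are made up for illustration; the file is written to a temporary path so the example runs without any data of your own.

```r
library(readr)  # read_csv() lives in readr, which library(tidyverse) also loads

# Hypothetical example file: write three lines of csv text to a temporary path
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "Ana,34", "Ben,27"), csv_path)

# Read the file; the first row ("name,age") becomes the variable names
example_data <- read_csv(file = csv_path)

names(example_data)  # the variable names taken from the first row
nrow(example_data)   # the number of data rows, not counting the header row
```

With your own data, you would replace csv_path with the path to your file.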

Some packages have built-in datasets. For this tutorial, I’m going to use the dataset “flights” from the package “nycflights13”.

library(nycflights13)

You can look at the data using View():

View(flights)

You can also click on data objects in the list in the top right. They will open as an additional tab in the top left quadrant of the RStudio window.
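
View() opens an interactive viewer, so it is less useful inside a script. As a rough sketch, the functions below print information about flights to the console instead. None of them appear in the text above; they are simply common alternatives, and glimpse() assumes the dplyr package is installed.

```r
library(nycflights13)  # provides the flights dataset
library(dplyr)         # provides glimpse()

dim(flights)      # number of rows and number of columns
names(flights)    # variable names
head(flights, 5)  # the first 5 rows
glimpse(flights)  # one line per variable, with its type and first few values
```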

3.1 Data formats

There are many different types of R objects. So far I’ve talked about data objects generally, but data objects can take several forms.

Note: many coding errors happen because a function that needs its input in a certain form is run on an object that is in the wrong form.
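
A small sketch of this kind of error, using made-up values: numbers stored as text cannot be averaged until they are converted to numeric form. The object names here are hypothetical.

```r
# A set of numbers that was accidentally stored as text
ages_text <- c("34", "27", "41")
class(ages_text)  # "character"

# mean() needs numeric input, so on text it returns NA (with a warning)
wrong_result <- suppressWarnings(mean(ages_text))

# Converting the object to the form the function expects fixes the problem
ages_number <- as.numeric(ages_text)
class(ages_number)  # "numeric"
right_result <- mean(ages_number)
```

When a function fails or returns something odd, checking class() on its input is a quick first step.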

3.1.1 Tibbles

Tidyverse data is stored as a “tibble”. The function read_csv will automatically read in data as a tibble.

If your data are in tibble form, you can run the dataset name by itself as a command to see more information about the dataset. This is a benefit of the tibble form vs. the data frame form. You’ll see the dataset dimensions, the variable names and types, and the first 10 rows of the data.

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1  2013     1     1      517            515         2      830            819        11 UA     
##  2  2013     1     1      533            529         4      850            830        20 UA     
##  3  2013     1     1      542            540         2      923            850        33 AA     
##  4  2013     1     1      544            545        -1     1004           1022       -18 B6     
##  5  2013     1     1      554            600        -6      812            837       -25 DL     
##  6  2013     1     1      554            558        -4      740            728        12 UA     
##  7  2013     1     1      555            600        -5      913            854        19 B6     
##  8  2013     1     1      557            600        -3      709            723       -14 EV     
##  9  2013     1     1      557            600        -3      838            846        -8 B6     
## 10  2013     1     1      558            600        -2      753            745         8 AA     
## # … with 336,766 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>

3.1.2 Data frames

In base R, tabular data is held as a “data frame.”

You can convert tibbles (and matrices, etc.) to data frames and vice versa.

flights_df <- as.data.frame(flights)
flights_tibble <- as_tibble(flights) #this is redundant, since flights is a tibble by default

For the most part, you should be able to work with your data in either tibble or data frame format. The only time you may need one or the other is when certain packages or functions require your data to be in a certain form.
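
As one illustration of where the two forms behave differently, the sketch below selects a single column with square brackets. The small tibble is hand-made for this example, and the object names are hypothetical.

```r
library(tibble)  # tibble(), as_tibble(), is_tibble(); also loaded by library(tidyverse)

# A small hand-made tibble, used here only for illustration
small_tibble <- tibble(carrier = c("UA", "AA", "B6"), dep_delay = c(2, 2, -1))
small_df <- as.data.frame(small_tibble)

class(small_tibble)  # includes "tbl_df"
class(small_df)      # "data.frame"

# Selecting one column with [ , ] keeps a tibble as a tibble,
# but turns a data frame into a plain vector
tibble_column <- small_tibble[, "dep_delay"]
df_column <- small_df[, "dep_delay"]

is_tibble(tibble_column)  # TRUE
is.numeric(df_column)     # TRUE
```

A function that expects a plain vector may therefore work on the data frame version and fail on the tibble version, or the other way around.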

3.2 Opening non-csv data in R

The package “rio” is really helpful for importing data. It infers the file format from the file extension and opens the file accordingly, or you can specify the format manually.

library(rio)
install_formats() #this makes sure we have all the formats installed
?import # see the help page
data <- import("", format="")
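
Here is a minimal sketch of both ways of calling import(). It first writes a small, made-up data frame to a temporary csv file with rio’s export() function (not covered above) so that there is a file to read. The object names are hypothetical.

```r
library(rio)

# A small hand-made data frame, used here only for illustration
example_df <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Write it to a temporary file; export() also picks the format from the extension
csv_path <- tempfile(fileext = ".csv")
export(example_df, csv_path)

# Let rio infer the format from the ".csv" extension
from_extension <- import(csv_path)

# Or state the format explicitly
from_format <- import(csv_path, format = "csv")
```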

Another helpful package is “haven”, which reads in Stata .dta files.

library(haven)
data <- read_dta(file = "") # put the path to your .dta file between the quotes
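
If you do not have a .dta file at hand, the sketch below creates one first with haven’s write_dta() function (not covered above) and then reads it back. The data frame and object names are made up for illustration.

```r
library(haven)

# A small hand-made data frame, used here only for illustration
example_df <- data.frame(id = c(1, 2, 3), group = c("a", "b", "a"))

# Write a temporary .dta file so there is something to read back
dta_path <- tempfile(fileext = ".dta")
write_dta(example_df, dta_path)

# read_dta() takes the file path and returns the data as a tibble
stata_data <- read_dta(dta_path)
stata_data
```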

3.3 Activity

  1. Install and load the nycflights13 package.

  2. Look at the flights data. Try running “?flights”. What do you see?

  3. If you have your own data, try reading it in.