Chapter 2 Data Types and Structures

2.1 Introduction

Hi! Today, we’re talking about data structures and coding in R.

For this tutorial, we will be working with the mtcars dataset in the dataset package, so make sure to add that package to your library. Let’s also add the tidyverse package, a package we will be using a lot in this course.

library(tidyverse)
library(datasets)
mtcars #this line will print out the dataset
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Don’t forget–you want to make sure that the dataset you are working with is in your environment. You can do this by assigning the mtcars dataset to an object using <-

new_file <- mtcars #this will save mtcars as an object named new_file

You can also see (but not edit) the dataset using the View() function.

#View(new_file) #run this line in your own console

Notice how I used an underscore in the object name? Using lowercased letters and underscores is the most common method for naming objects in R. Learn more about style in Wickham’s style guide here. For more about naming conventions, read this two-page guide.

2.2 Understanding the Data

Now that we have the dataset mtcars saved to an object, let’s learn more about this data. new_file is a dataset with 11 variables (columns) and 32 observations (rows).

We can print out the name of the variables using names(). Try this out yourself!

names(new_file)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

You can save these names in a new object using the <- assignment. If you recall from last week, we used <- to create a new object.

variable_names <- names(new_file)

Next, use the str() function to learn more about each variable. (You should be familiar with str() and summary() from last week).

str(new_file)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

As you can see, the object new_file is a data.frame type (we’ll return to this later). There are 11 variables in this dataset. In this situation, all the variables are numeric (num). But, let’s talk about some of the other data types we are likely to see.

2.3 Data Types

“Numerics” (num) and “integers” (int) are both numbers. You can consider integers to be a special type of numeric. Integers only contain whole numbers, while “Numerics” are more like “floats” or “doubles.” You can learn more about them here.

“Characters” are letters (and numbers and symbols) that collectively constitute a string (i.e., a string of letters). Character objects/vectors/variables are the backbone of Natural Language Processing. You can tell when a value is a character type because it will be in double quotation marks (6 is a number but "6" is a string).

“Logical” objects are binaries: TRUE OR FALSE.

“Factor” objects tend to be categorical variables (e.g., race). Factors are actually integers, but they have labels (called levels) that are characters. In other words, each integer is identifiable by its level or label. Most of the variables in this dataset are factors. Berkeley posted a great tutorial on factors. We’ll also talk a lot about factors when we deal with public opinion data!

To figure out what type your object is, you can use summary(), which is a really useful function for getting some descriptive information about your variables. You can also use class(), which is a function specifically to get the class-type of the object. Try these now in the space below!

summary(new_file)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
class(new_file)
## [1] "data.frame"

As you can see, because all the variables are numeric, summary() will provide information like the mean and median. This is slightly different for other data types.

2.4 Data Coercion

Sometimes, you will use a variable or object that is not the type that you want. For example, you may read in something that’s a number, but R sees as a character, like the number_variable we will now create.

number_variable <- "10"

class(number_variable)
## [1] "character"

How do we change this to a numeric? We can do so using the as.numeric() function, a base R function.

number_variable <- as.numeric(number_variable)
#if you don't assign your result to an object, it will not be saved.

class(number_variable)
## [1] "numeric"

Notice now that the variable class has changed from character to numeric. The process of changing a variable type this way is called coercion. Learn more about coercion here. For each data type, there is a parallel transformation function. For example, if you wanted to do the opposite of what we just did (turn a number into a character), you can use the as.character() function. If a variable cannot be transformed, it will convert into an NA, which is R’s way of telling you that a variable is missing.

4
## [1] 4
as.character(4)
## [1] "4"
as.numeric("4")
## [1] 4
as.numeric("four")
## Warning: NAs introduced by coercion
## [1] NA

Producing NAs can be especially problematic if you are transforming characters into numbers, so it’s especially important for you to be careful when you are cleaning data, because you don’t want to accidentally turn a string into an NA by turning it into a numeric.

Try this out yourself. Create a string variable “hello” and turn it into a numeric. What is the output?

Now that you know what strings, numerics, factors, and logical are, let’s talk a little more data structures.

Data structures are ways of organizing values. You’ve already heard me reference one, “data frames”. This is the structure of the new_file variable we created earlier. In R, there are a lot of different data structures (one blog post describes 5 basic structures, 4 of which we will talk about here).

2.5 Vectors and Lists

Vectors and lists are the simplest data structures containing more than one value (1-dimensional structures). Of the two, vectors (and more specifically atomic vectors) are the ones you’ll probably use the most in R. Objects in atomic vectors all have the same data type.

You can construct a vector using the c() function, like below:

new_logic_vector <- c(TRUE, TRUE, FALSE, TRUE)
new_string_vector <- c("hello", "world", "hello", "class")

Using class() you can figure out what type of vectors these are. Try that out here!

class(new_logic_vector)
## [1] "logical"

When you create a vector with different types, the c() function will coerce everything into the same data-type. This is an ordered process, as this blog. notes:

There’s a hierarchy of types: the more primitive ones cheerfully and silently convert to those higher up in the food chain. Here’s the order: 1) logical, 2) interger, 3) double, 4) character

That means that if you have a vector of logical and integer values, your logical values will become integer values. If you have a vector of logical, integer, and character values, everything will be coerced into a character value.

new_vector <-c(1, 2, "three", "four", FALSE)
new_vector
## [1] "1"     "2"     "three" "four"  "FALSE"

Vectors and lists contain one row of multiple values. When you save a vector, it will appear in the “Values” part of the enviornment. You can isolate items in a vector or list using brackets, [].

new_logic_vector[1]
## [1] TRUE
new_string_vector[2]
## [1] "world"

Unlike other programming languages (which may start their arrays with 0), vectors and lists in R start with 1.

Though you can use c() to create vectors, the more likely time you’ll encounter them is in more complex data structures. For example, certain columns that you extract from a data frame, data table, tibble, or matrix can be vectors.

You can check whether an object is a vector using the is.vector() function (there are also functions for other data types and structures, like the is.character() function)

is.vector(new_vector)
## [1] TRUE

It’s worth noting that there are only numeric, character, and logical vectors. Factors are considered numeric vectors with corresponding character values for each level.

Lists are like more generalized vectors. We can create lists using the list()function, but (like vectors), it is more common that you will encouter them in larger datasets. Unlike vectors, lists allow you to hold more complex code. For example, you can have a list with differently sized vectors in them, like in the example we create below:

new_list <- list(1:5, c("six", "seven", "eight"), c(9:12), FALSE, c(TRUE))
new_list
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "six"   "seven" "eight"
## 
## [[3]]
## [1]  9 10 11 12
## 
## [[4]]
## [1] FALSE
## 
## [[5]]
## [1] TRUE
new_vector2 <- c(1:5, c("six", "seven", "eight"), c(9:12), FALSE, c(TRUE))
new_vector2
##  [1] "1"     "2"     "3"     "4"     "5"     "six"   "seven" "eight" "9"    
## [10] "10"    "11"    "12"    "FALSE" "TRUE"

Note the differences between the new list, new_list, you created versus the new vector, new_vector2. In the former, the list retained the list structure embedded in the data. As a result,new_list[2] contains a character vector with three character objects, “six”, “seven”, and “eight”. In new_vector2, though, you’ll notice that all the objects have been converted into characters, and the internal list structure hasn been removed, or “flattened” so you have 14 observations, rather than a list of 5 vectors of various lengths.

One thing you may have also noticed is the double bracket structure in lists ([[]]). With both vectors and lists, you can use single brackets to extract information from one or more variables

new_vector2[1:3] #this means I want to see the first through third objects in new_vector
## [1] "1" "2" "3"
new_list[1:3]
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "six"   "seven" "eight"
## 
## [[3]]
## [1]  9 10 11 12

Remember, these results will not save unless you assign it to an object using <-!

Double brackets are unique to lists and can be used to access a single observation or component. One way to think of it is that new_list[1] contains a list ([[1]]), from which you can extract additional information.

new_list[1]
## [[1]]
## [1] 1 2 3 4 5
new_list[[1]]
## [1] 1 2 3 4 5
new_list[[1]][[2]]
## [1] 2

This is probably one of the harder concepts to understand in R programming. A really good explanation of this bracket structure can be found in our R4DS textbook.

If you would like more reading on vectors and lists:

2.6 Matricies and Data Frames

Most of the data you’ll be working with will be “two-dimensional”: a matrix or (most likely) a data frame. Lots of other data structures, like tibbles and data tables, are built on top of the data frame foundation, but also have additional features.

A UC Riverside tutorial’s description may be helpful: > In essence, the easiest way to think of a data frame is as an Excel worksheet that contains columns of different types of data but are all of equal length rows.

Both matrices and data frames are considered more “complex” data structures because they can have multiple rows and columns (as opposed to lists and vectors, which can have many “rows”, but only one “column”).

You can construct matrices using the matrix() function.

matrix(1:12, ncol = 6, nrow = 2)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    3    5    7    9   11
## [2,]    2    4    6    8   10   12
matrix(1:12, nrow = 3, ncol = 5)
## Warning in matrix(1:12, nrow = 3, ncol = 5): data length [12] is not a sub-
## multiple or multiple of the number of columns [5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    4    7   10    1
## [2,]    2    5    8   11    2
## [3,]    3    6    9   12    3

In the above examples, I create two matrices containing numbers between 1 and 12. In the first matrix, this is a 2x6 matrix (2 rows, 6 columns). The second matrix is a 3x5. Because there are only 12 numbers, though, the last column is populated with repeated numbers (1 to 3).

You’ll notice that a matrix is populated by row, and then by column.

If you wanted to extract one value from a matrix, you need to know it’s row and column number. You include these numbers in single brackets, [], with a comma in between. This is called subscripting.

new_matrix <- matrix(1:12, ncol = 6, nrow = 2)
new_matrix[2,5] #row number first, column number second
## [1] 10

If you subscript just the row or column, you will get all information for that row or column.

new_matrix <- matrix(1:80, nrow = 20, ncol = 5) #this will save over the matrix you wrote above
## Warning in matrix(1:80, nrow = 20, ncol = 5): data length differs from size of
## matrix: [80 != 20 x 5]
new_matrix
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1   21   41   61    1
##  [2,]    2   22   42   62    2
##  [3,]    3   23   43   63    3
##  [4,]    4   24   44   64    4
##  [5,]    5   25   45   65    5
##  [6,]    6   26   46   66    6
##  [7,]    7   27   47   67    7
##  [8,]    8   28   48   68    8
##  [9,]    9   29   49   69    9
## [10,]   10   30   50   70   10
## [11,]   11   31   51   71   11
## [12,]   12   32   52   72   12
## [13,]   13   33   53   73   13
## [14,]   14   34   54   74   14
## [15,]   15   35   55   75   15
## [16,]   16   36   56   76   16
## [17,]   17   37   57   77   17
## [18,]   18   38   58   78   18
## [19,]   19   39   59   79   19
## [20,]   20   40   60   80   20
new_matrix[15,]
## [1] 15 35 55 75 15
new_matrix[,3]
##  [1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
new_matrix[15,3]
## [1] 55

Notice that if I wanted to get all the rows for the 3th variable (the fourth line in the above code chunk), I would leave the first number (the row number) blank.

One cool thing about matrices is the ability to perform matrix algebra (which is essential for many statistical tests and models, including multiple regressions). For example, we can transpose (pivot) a matrix using t().

new_matrix
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1   21   41   61    1
##  [2,]    2   22   42   62    2
##  [3,]    3   23   43   63    3
##  [4,]    4   24   44   64    4
##  [5,]    5   25   45   65    5
##  [6,]    6   26   46   66    6
##  [7,]    7   27   47   67    7
##  [8,]    8   28   48   68    8
##  [9,]    9   29   49   69    9
## [10,]   10   30   50   70   10
## [11,]   11   31   51   71   11
## [12,]   12   32   52   72   12
## [13,]   13   33   53   73   13
## [14,]   14   34   54   74   14
## [15,]   15   35   55   75   15
## [16,]   16   36   56   76   16
## [17,]   17   37   57   77   17
## [18,]   18   38   58   78   18
## [19,]   19   39   59   79   19
## [20,]   20   40   60   80   20
t(new_matrix)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    1    2    3    4    5    6    7    8    9    10    11    12    13    14
## [2,]   21   22   23   24   25   26   27   28   29    30    31    32    33    34
## [3,]   41   42   43   44   45   46   47   48   49    50    51    52    53    54
## [4,]   61   62   63   64   65   66   67   68   69    70    71    72    73    74
## [5,]    1    2    3    4    5    6    7    8    9    10    11    12    13    14
##      [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]    15    16    17    18    19    20
## [2,]    35    36    37    38    39    40
## [3,]    55    56    57    58    59    60
## [4,]    75    76    77    78    79    80
## [5,]    15    16    17    18    19    20

Data frames are like fancier matrices with list/vector properties. Unlike a matrix, which requires every value to be the same type (like a vector), columns in a data frame can be different values. This is especially useful in the event that you have a dataset–like social media data–where some variables are numbers (number of followers), some are strings (usernames), and some are logicals (verified account).

The new_file object is a data frame (as we saw from the str() function). Like a matrix, you can also use brackets to pull out relevant information.

new_file[1,10] #1st column (mpg), 10th observation
## [1] 4

Another very important thing you can do with data frames is pull out individual columns using the $. For example, if you wanted to see the mpg column of new_file, run the following line:

new_file$mpg
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4

You can also look at the first 6 observations of this variable using head(). Let’s try this below!

head(new_file$mpg)
## [1] 21.0 21.0 22.8 21.4 18.7 18.1

You can combine this with brackets to identify the n’th row of an observation.

new_file$cyl[10]
## [1] 6

And, you can run functions on individual columns in a data frame, like the as.character() coercion function we learned about previously.

new_file$cyl_string <- as.character(new_file$cyl)
#you can save information to a new column using the <- assignment

new_file$cyl_string[10]
## [1] "6"

And, if you run str() on just a specific column of a data frame, R will return the type of variable the column is.

#str(new_file)
str(new_file$cyl)
##  num [1:32] 6 6 4 6 8 6 8 4 4 6 ...

What happens when you use class()? See for yourself:

class(new_file$cyl)
## [1] "numeric"

If you save the columns as a totally new object, you will notice that it is saved as a vector. In other words, each column in a data frame is a vector.

cyl_variable <- new_file$cyl
is.vector(cyl_variable)
## [1] TRUE

You can play with the is.vector() or is.data.frame() functions to see what R produces:

is.vector(new_file$cyl)
## [1] TRUE
is.numeric(new_file$cyl)
## [1] TRUE
is.data.frame(new_file$cyl) 
## [1] FALSE
is.vector(new_file)
## [1] FALSE
is.numeric(new_file)
## [1] FALSE
is.data.frame(new_file)
## [1] TRUE

One thing you may notice is that new_file$cyl has a finite number of options (4, 6, or 8 cylinders).

head(new_file$cyl)
## [1] 6 6 4 6 8 6

We may want to treat it as a categorical variable (i.e., a factor) rather than keeping it as a numeric. Using one of the coercion functions we learned about above, let us try changing this variable into a factor:

as.factor(new_file$cyl)
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8

This will return the output of the function, but it won’t save the results! To save this, you will want to use the <- assignment we have been using this whole time. Try these three sightly different ways of saving variables:

new_file$cyl <- as.factor(new_file$cyl) #saves the results of this coercion to the original variable name
new_file$cyl2 <- as.factor(new_file$cyl) #saves the results in a new column in the data frame
cyl2 <- as.factor(new_file$cyl) #saves the result in a totally new object

Notice that when we turn this numeric into a factor, it will have three “levels. Levels refer to the number of options that the categorical variable could be. For example, if your gender variable includes four options, man, woman, trans/non-binary, no response, you would have 4 levels.

new_file$cyl <- as.factor(new_file$cyl) 

str(new_file$cyl)
##  Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...

Notice that when the column is a factor, R will also indicate the levels of that factor (this is not displayed for numeric).

If you would like more reading on matrices and data frames: