3 R Data Types and Structures
This lesson is a brief overview of the major data structures and data types we’ll be working with in this course. This isn’t an exhaustive list, and other languages have additional data structures, but we don’t need to worry about those here. Instead, we’re going to focus on the ones that are most commonly used in data science applications.
3.1 Data structures in R - the data frame
Data structures in a way refer to the ‘shape’ or ‘dimensions’ of your data. At this point in your academic career you’re probably familiar with an excel file. As you then know, excel files are organized into rows and columns. You can think of this as a 2-dimensional set of data that has a size of \(m\) rows x \(n\) columns. This square form of row x column data forms a data frame in R. I would argue that the data frame is THE core data structure of data science. I’m going to get to why this is in a bit, but for now I want to show you what a data frame looks like.
I’m going to import some data using the read_csv() function. This is part of the tidyverse package, which can be loaded using library(tidyverse) after it’s installed (this is covered in the other lesson). We’re assigning the data to the object income, as the data is about individual yearly incomes (FYI, this is simulated data).
income <- read_csv("https://docs.google.com/spreadsheets/d/1ymtFXNEmEc6bOePFgvoHns_x20t16FWJONBuSCieVgY/gviz/tq?tqx=out:csv")
We can look at the first six rows of our data frame by calling the function head() on our object income. Doing so we can see that since this is a data frame we have a series of columns and rows, just like an excel spreadsheet.
head(income)
## # A tibble: 6 x 4
## age income group rich_parents
## <dbl> <dbl> <chr> <chr>
## 1 14.2 51912 in_school yes
## 2 16.4 53149. in_school yes
## 3 14.4 43959. in_school yes
## 4 18.8 46167. in_school yes
## 5 19.4 57103. in_school yes
## 6 14.7 42523. in_school yes
We can get information about how big this is using the functions nrow() and ncol() to get the number of rows and columns, respectively. We know how many columns there are, but let’s get how many rows.
nrow(income)
## [1] 155
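ncol() works the same way for columns, and dim() returns both dimensions at once. Here’s a quick sketch on a small made-up data frame (the toy object and its columns are just for illustration):

```r
# a small made-up data frame, just to demonstrate the dimension functions
toy <- data.frame(a = c(1, 2, 3), b = c('x', 'y', 'z'))

nrow(toy) # number of rows: 3
ncol(toy) # number of columns: 2
dim(toy)  # both at once, rows then columns: 3 2
```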
3.1.1 Why data frames?
So why are data frames the predominant data structure of data science? It all comes down to being able to organize data into a series of observations. In a data frame, each row is a unique observation. In the case of income, each row is a unique individual. These observations then have data associated with them… in this case how old that individual is, their income, whether they’re in school or working, and whether they have rich parents.
The goal of data mining and data science is to understand the relationship between these different variables, each of which is a column, by assessing your sample of individual data stored in each row. Thus, this square structure of data represents how data is frequently stored, organized, and used for modeling.
3.2 Down a dimension into vectors
2-dimensional data frames in R are actually just a bunch of 1-dimensional vectors bound together. What do I mean by a 1-d vector? I mean a data object that contains a series of values all of the same type. You could make one called xxx using the c() function. Let’s make a numeric vector of the values from 1 to 10. We can use 1:10 in c() as follows. Entering the object name xxx into the console right afterwards shows how it’s 1-dimensional (it’s a line of values, not a square organized into rows and columns).
xxx <- c(1:10)
xxx
## [1] 1 2 3 4 5 6 7 8 9 10
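Since a vector is just a line of values, you can also pull out individual entries by position using square brackets. A quick sketch with the same xxx vector:

```r
xxx <- c(1:10)

xxx[1]           # first value: 1
xxx[3:5]         # third through fifth values: 3 4 5
xxx[length(xxx)] # last value: 10
```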
3.2.1 Why do we care about vectors?
As we progress in this class you’ll frequently work with a data frame to get a prediction for each observation (row). What this means is that you have a vector of predictions that you might want to associate back with its observation. We can do that by joining our vector back to a data frame.
Remember how we used nrow(income) to show that there are 155 rows in income? Let’s just make a vector that same length called id and add it back to the data frame. We can use the function cbind(), which means ‘column bind’, to bind columns of equal length together. Here we have a set of columns in our data frame income that we want to bind with the single vector id that’ll become a column.
id <- 1:155 # first make a vector of values
income <- cbind(income, id)
head(income)
## age income group rich_parents id
## 1 14.18 51912.00 in_school yes 1
## 2 16.43 53149.32 in_school yes 2
## 3 14.43 43958.78 in_school yes 3
## 4 18.79 46166.81 in_school yes 4
## 5 19.37 57103.41 in_school yes 5
## 6 14.73 42523.45 in_school yes 6
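As an aside, base R also lets you add a vector as a new column directly with $ assignment, which saves you from matching lengths by hand. This sketch uses a small made-up data frame rather than income:

```r
# a tiny stand-in data frame, just for illustration
df <- data.frame(age = c(14, 16, 19))

# assigning to a column name that doesn't exist yet creates that column
df$id <- 1:nrow(df)
df
```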
We’ll be using vectors for various things throughout the semester. The most frequent use will be for storing predictions. We’ll either want to compare this to another vector, or instead add it back to the data, so being generally familiar with both now will prepare you for later!
3.3 Data types in R
You have probably noticed that under each column name above there is either <dbl> or <chr>. These are the data types within those columns. <dbl> is another way of saying numeric, while <chr> means the column contains character strings.
I’m going to start by making a data frame of different data types. I’ll first make a bunch of vectors with different data, and then I’ll make them into a data frame using data.frame(). Let’s assign this data frame to the object students. Calling this object shows that we have some different data types. Note how the length of each vector is the same… data frames can’t handle different length columns!
major <- c('art', 'history', 'math', 'data science')
age <- c(22, 20, 17, 19)
debt <- c('23000', '1000', '45000', '5000')
gender <- c('m', 'f', 'm', 'nr')
first_gen <- c(0, 0, 1, 1)
registered <- c(FALSE, TRUE, TRUE, FALSE)

students <- data.frame(major, age, debt, gender, first_gen, registered, stringsAsFactors = F)
students
## major age debt gender first_gen registered
## 1 art 22 23000 m 0 FALSE
## 2 history 20 1000 f 0 TRUE
## 3 math 17 45000 m 1 TRUE
## 4 data science 19 5000 nr 1 FALSE
3.3.1 Numeric data
Our age column is clearly numeric data, and data that should be numeric. What I mean by that is that age in itself is continuous. People can range from 0 to whatever years old and anywhere in between. These values are not necessarily discrete. You can call basic statistical properties on numeric columns. Remember we call columns using the data_frame_name$column_name format. Let’s get the mean, max, and standard deviation.
mean(students$age)
## [1] 19.5
max(students$age)
## [1] 22
sd(students$age)
## [1] 2.081666
But, what happens if we start calling those functions on our debt column? That was imported as a character column because the data was specified using quotation marks (e.g. ‘23000’). Whenever R sees quotation marks it automatically makes that entry a string of characters rather than a numeric value. So calling a mathematical function on what is just a string of characters gives us a warning and an NA.
mean(students$debt)
## Warning in mean.default(students$debt): argument is not numeric or logical:
## returning NA
## [1] NA
What we need to do is convert those values to numeric. R has built in functions to convert between data types. We can use as.numeric() to convert a character column to numeric. Note that just because there’s a function doesn’t mean it’s smart… if you tried this on, say, major, it’ll just give you NA values (numbers stored as characters can be converted, but words can’t).
Anyway, let’s convert students$debt to numeric and assign it to a test object xxx. We’ll then call mean() on xxx to make sure it worked. If it did, we should get a value back.
xxx <- as.numeric(students$debt)
mean(xxx)
## [1] 18500
Great, so let’s overwrite the character column for debt with one that we converted to numeric. We can do so using the code students$debt <- as.numeric(students$debt). By putting data_frame_name$column_name on the left side of the assignment <-, we are telling R to overwrite whatever is in that column with the result of the expression on the right side of <-.
Doing so and then checking our data frame shows that our debt column is now numeric. Success!
students$debt <- as.numeric(students$debt)
students
## major age debt gender first_gen registered
## 1 art 22 23000 m 0 FALSE
## 2 history 20 1000 f 0 TRUE
## 3 math 17 45000 m 1 TRUE
## 4 data science 19 5000 nr 1 FALSE
3.3.2 Character strings
Lots of data is in the form of just text. Text in R gets imported as character strings, which are just a series of letters, numbers, or symbols that have no mathematical meaning. This isn’t to say they don’t have meaning. Indeed, character strings most often represent distinct groups, and we might want to know how observations that belong to one group are different from observations that belong to another.
It’s always good to explore what types of unique strings are present in a column. We can use the unique() function to do this. Trying this on our students$major column shows that we have four unique character strings. In a model it might be useful to use these to see how a student’s major influences debt load.
unique(students$major)
## [1] "art" "history" "math" "data science"
This is obvious as we created that column and the values in it. But, in this class we’ll often be working with datasets that are millions of rows long, and thus you can’t simply look at the first few rows of the data frame and know what unique values are there.
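When the data are too big to eyeball, table() is a handy companion to unique(): it counts how many times each unique string appears. A quick sketch using the same major vector:

```r
major <- c('art', 'history', 'math', 'data science')

table(major) # counts of each unique string; here each appears once
```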
Also, there are going to be times when we might have hundreds or even thousands of unique character strings. These become much less useful than the ones above for a couple reasons.
First, you can’t easily model hundreds and hundreds of effects like this. For example, it would be difficult to model how debt load is influenced by the hundreds of unique majors at U of A. This is why you often hear statistics reported instead on more general majors… ‘engineering’ vs ‘art’, rather than ‘mechanical engineering with electrical minor’ vs. ‘art history with classics minor’.
Second, lots of unique levels are often redundant. There are four different bachelor level art degrees, each with their own name and thus character string. Is it really worth having them be unique, or instead lumping them all into the same string ‘art’ to reduce the number of unique levels?
It’s ok if you don’t see the application of this now… it’ll make more sense as the semester progresses!
3.3.3 Factors
Factors are groups with distinct levels where there’s no ‘bridge’ between them. You can classify our majors into factors… ‘art’, ‘history’, ‘math’. There is no way for an observation to fall between factor levels. So, if you are ever asking yourself whether data is numeric or a factor, ask yourself whether it forms distinct groups, or whether values could exist in between the observations you’ve seen.
In our example above, we can see that our students$first_gen column has 0 and 1 values, which were imported as numeric.
students$first_gen
## [1] 0 0 1 1
But, this column is to note which students are first generation college students. It isn’t possible for a person to have a 0.5 here or something like that. You can only be a first generation student or not. So, despite this being a number, we actually want it as a factor!
We can convert using the function as.factor(), just like we did with as.numeric(). Let’s try it first by assigning to a test vector.
xxx <- as.factor(students$first_gen)
xxx
## [1] 0 0 1 1
## Levels: 0 1
So it looks similar, but now states which levels, or distinct groups, are present. Why is this useful? Well, in the case of variables that are truly factors but coded as numeric, R will automatically treat them as numeric in a model, which isn’t what we want. So we’ll frequently have to convert variables to factors before making a model.
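You can also ask a factor for its levels directly with levels(), or count them with nlevels(). A quick sketch rebuilding the xxx factor from above:

```r
first_gen <- c(0, 0, 1, 1)
xxx <- as.factor(first_gen)

levels(xxx)  # the distinct groups: "0" "1"
nlevels(xxx) # how many distinct groups: 2
```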
Let’s convert this column and our major column to factors.
students$major <- as.factor(students$major)
students$first_gen <- as.factor(students$first_gen)
students
## major age debt gender first_gen registered
## 1 art 22 23000 m 0 FALSE
## 2 history 20 1000 f 0 TRUE
## 3 math 17 45000 m 1 TRUE
## 4 data science 19 5000 nr 1 FALSE
3.3.3.1 Other perks to factors
One other nice thing about factors is that R will count up how many times it sees each factor level. This is really useful when exploring a dataset. Say you want to know how many people are in each major across UA. Converting that column to a factor and using summary() on our data frame will do that.
summary(students)
## major age debt gender first_gen
## art :1 Min. :17.0 Min. : 1000 Length:4 0:2
## data science:1 1st Qu.:18.5 1st Qu.: 4000 Class :character 1:2
## history :1 Median :19.5 Median :14000 Mode :character
## math :1 Mean :19.5 Mean :18500
## 3rd Qu.:20.5 3rd Qu.:28500
## Max. :22.0 Max. :45000
## registered
## Mode :logical
## FALSE:2
## TRUE :2
##
##
##
So there’s only one observation in each major, and we have two not-first-gen students and two first-gen students. gender we didn’t convert, so it was left as a character string, which doesn’t get these summary statistics.
3.3.4 Factors vs. character
You’re probably wondering by now: what’s the difference between a factor and a character string in R? Often, not much. Factors are made up of character strings, after all. Furthermore, R will often go and convert character strings to factors ‘under the hood’ when fitting a model. Factors are a bit more restrictive in that if you try to add a value that’s not an existing factor level in the column, R won’t allow it.
There are some computational differences that we won’t get into. Honestly, most of the time you’re 100% ok leaving things as character strings. There are some models that don’t do the conversion to factors, so you need to know it’s a thing. Factors are useful for quickly exploring our data, which is where I (and you will!) use them most.
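To see that restrictiveness in action: assigning a value that isn’t one of a factor’s existing levels produces an NA (along with a warning), rather than silently adding the new value. A minimal sketch:

```r
f <- factor(c('art', 'math'))

# 'history' is not one of f's existing levels, so R inserts NA (with a warning)
f[1] <- 'history'
f
```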
3.3.5 Logical
Logical data types… TRUE and FALSE, are surprisingly useful. They’re useful because they actually have a numeric value associated with them: TRUE = 1 and FALSE = 0. So they can work like a factor if you want them to in a model. But, you can also count how many TRUE and FALSE values are present. Here we can see that only two students have registered for classes.
sum(students$registered)
## [1] 2
3.3.5.1 Evaluation using logicals
Logicals are also used for evaluating statements. When I say statements I mean things like “does this value equal this value”, or “which values in this vector are greater than 5”. When R evaluates these it kicks back TRUE or FALSE.
A couple simple examples:
7 == 3
## [1] FALSE
7 - 4 == 3
## [1] TRUE
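The same idea works with the other comparison operators, such as > (greater than), != (not equal), and >= (greater than or equal):

```r
7 > 3   # greater than: TRUE
7 != 3  # not equal: TRUE
7 >= 7  # greater than or equal: TRUE
```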
This works on vectors too. Let’s make a vector of pet types that people own.
pets <- c('cat', 'cat', 'dog', 'cat', 'dog', 'snake', 'cat')
pets
## [1] "cat" "cat" "dog" "cat" "dog" "snake" "cat"
You can see how many of them match ‘cat’. R gives us TRUE everywhere a value in our vector pets matches ‘cat’.
pets == 'cat'
## [1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE
But, since these are 1’s and 0’s under the hood, we can do math with them.
How many people have cats? Call sum() on that vector!
sum(pets == 'cat')
## [1] 4
What proportion of people have cats? Remember that the proportion is the fraction of cats out of total observations. So we would sum the number of cats and then divide by the total number of observations. We could do this several ways. We know sum() gives us the number of cats. Using length() on the pets vector gives us how many entries there are in total.
sum(pets == 'cat')/length(pets)
## [1] 0.5714286
Or, we could just take the mean as all places where there is ‘cat’ are TRUE and therefore equal 1, while everything else is a zero. Thus taking a mean on a TRUE/FALSE vector will give the proportion of that evaluation.
mean(pets == 'cat')
## [1] 0.5714286
Logical statements like this underlie a lot of R operations that we’ll be using frequently!
3.4 Conclusion
I just want to emphasize again that you don’t have to have these ideas mastered now. They’ll make more sense as we work with them over the class. And for now, you have to just trust me when I say we’ll use them. It’s a bit hard to see their relevance now, but just give me time :).