Chapter 3 Lists and Dataframes
Here, we will cover a bit more about R: lists and some dataframe operations. For Quant or A&M students, you will get more R in your 604 class; this is not intended to be an exhaustive delivery of information. Rather, it is intended to be enough for you to be successful in 605.
3.1 Lists
Sometimes you will receive data (or generate it) as a series of lists. Or, perhaps you need to make a list of fake ID numbers, or options to draw from. Whatever the reason, there are a number of different ways to accomplish this.
One way is to simply assign a series of values or words to an object with c(), making a list (strictly speaking, R calls the result of c() a vector; this chapter uses "list" informally):
#Make a list
odd <- c(1, 3, 5, 7, 9) #A list of numbers (stored as doubles; write 1L, 3L, ... for true integers)
gender <- c("male", "female", "nonbinary", "prefer not to respond") #A list of strings (words)
While this is simple, it can become time consuming, particularly if you have many values to input. For example, you wouldn’t want to type out the numbers 1 through 5000 one at a time! In a case like that, you could make use of the seq() function, which creates a sequence of numbers.
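A minimal sketch of such a call is shown here. The object name numbers is taken from the surrounding text; the exact call used originally is an assumption.

```r
# Assumption: the object is called `numbers`, as in the text that follows
numbers <- seq(1, 10)  # the whole numbers 1 through 10

# Print the list to the console
numbers
```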
The above will create a list, numbers, of the numbers 1 through 10, inclusive of both 1 and 10. If we needed to go from 1 to 5000 instead, we would simply adjust our ending number: seq(1, 5000), or the shorthand 1:5000. (Writing seq(1:5000) returns the same values, but it passes an already-built sequence to seq(), so the two forms above are clearer.) You can also use the seq() function to count by a value other than one: by 10s, or only odd or even numbers (counting by 2). We accomplish this by adding an additional argument to the seq() function: by = x. In the parentheses after seq, we give our starting value, our ending value, and the interval by which we want R to generate numbers: seq(start, end, by = interval).
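As a sketch of counting by 10s, the call below should reproduce the output that follows. The name numbers_v2 comes from the text; the start and end values are inferred from that output.

```r
# Count from 10 to 100 in steps of 10
numbers_v2 <- seq(10, 100, by = 10)

# Print the list to the console
numbers_v2
```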
## [1] 10 20 30 40 50 60 70 80 90 100
Although numbers_v2 is printed here as an example, you will typically not print your list to the console. Instead, you will perform an operation on it, add it to your dataframe, or save it for later calculations.
Lastly, it can be useful to pick a single value out of a list by its position. In the example below, we select the 14th number from our odd_v2 list. This prints the value to the console; you can also save it as an object if needed.
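A minimal sketch of this selection follows. The definition of odd_v2 is not shown in this section, so the upper bound of 99 is an assumption chosen only for illustration; any list of odd numbers starting at 1 with at least 14 entries behaves the same way.

```r
# Assumption: odd_v2 holds the odd numbers from 1 up to an illustrative bound of 99
odd_v2 <- seq(1, 99, by = 2)

# Select the 14th value by its position, using square brackets
odd_v2[14]

# Or save it as an object for later use (hypothetical name)
fourteenth <- odd_v2[14]
```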
## [1] 27
3.2 Dataframes
Dataframes are the most common data format you will be working with. There is a wide range of things that can be done with them, but we will focus on just a few below. As we’ve seen before, we can load in a dataset from either a pre-existing R dataset or an external source (see 2.8 for a refresher), and assign that to an object in R:
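The heights and weights printed later in this chapter match R's built-in women dataset, so the sketch below assumes that is the dataset being loaded. The object name df is taken from the renaming code in 3.3.

```r
# Assumption: the example data is R's built-in `women` dataset
# (15 rows, with columns height and weight)
df <- women
```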
3.2.1 Look at first or last few rows
Once a dataset is assigned to an object, we can look at it, perform operations on it, and run statistical tests. Two operations that come in handy right after loading data are looking at the first or last 6 rows. After performing an operation or creating a variable, it is wise to check that what you think you did actually worked. This is accomplished by ‘taking a peek’ at your dataset. To look at the first 6 rows, use the head() function; to look at the last 6 rows, use the tail() function. Both are used in place of printing your entire dataset to the console.
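A short sketch of both calls, assuming df holds the built-in women dataset (an assumption inferred from the output shown in this chapter):

```r
# Assumption: df is the built-in `women` dataset
df <- women

# First 6 rows
head(df)

# Last 6 rows
tail(df)
```

Both functions also accept a second argument for a different number of rows, for example head(df, 10).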
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
## height weight
## 10 67 142
## 11 68 146
## 12 69 150
## 13 70 154
## 14 71 159
## 15 72 164
3.2.2 Referencing Specific Columns
Sometimes, you want to perform an operation on just one column of your dataframe. To reference a specific column, use the $ operator: df$name means “the column name from the dataframe df”. We can also reference a column by its place in the dataframe: column 1, column 2, etc. We do this using the df[row,column] convention. That is to say, if we wanted all rows of the first column, we would write df[,1]. We are referencing the dataframe df, saying we want all rows by leaving that part blank, and saying we want column 1. Both of these options return the same values, and it is often a matter of personal preference which you choose when selecting a single column.
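The two approaches can be sketched as follows, again assuming df holds the built-in women dataset, whose first column is height:

```r
# Assumption: df is the built-in `women` dataset
df <- women

# Select the height column by name
df$height

# Select the same column by position (all rows, column 1)
df[, 1]
```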
## [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [1] 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
What if you forget what your variables are called? You can look in your ‘Environment’ window and click on your object, which will open it in a new tab in the Source window. Alternatively, you can use the names() function to print the column names.
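For example, assuming df holds the built-in women dataset:

```r
# Assumption: df is the built-in `women` dataset
df <- women

# Print the column names
names(df)
```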
## [1] "height" "weight"
3.2.3 Referencing Specific Values
In a list, we could reference a specific value by where it fell in the list (3.1). In a dataframe, there is both a row and a column to reference. Above, we referenced an entire column by its location in the dataframe. We can use the same df[row,column] convention to reference a specific value by filling in both the row and the column.
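As a sketch, the call below selects the value in row 4 of column 1. The choice of row 4 is inferred from the output that follows, and df is assumed to hold the built-in women dataset.

```r
# Assumption: df is the built-in `women` dataset
df <- women

# Row 4, column 1: the fourth height in the dataframe
df[4, 1]
```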
## [1] 61
3.3 Change Variable Name
Sometimes you receive data, but you need (or want!) to change some things about it. Perhaps you need to change column names to match other data, or to help you remember what each column represents. There are two ways to change a variable (i.e., column) name: reference it by number or reference it by name. Both ways make use of the names() function used above.
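Renaming by column number can be sketched as below. The new name "height(in)" comes from the output and explanation that follow; df is assumed to hold the built-in women dataset.

```r
# Assumption: df is the built-in `women` dataset
df <- women

#Change the name of the first column to 'height(in)'
names(df)[1] <- "height(in)"

#Check our work
head(df)
```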
## height(in) weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
In the function, we are calling the names of the variables, as we did earlier, with names(df). We are then saying that the first entry in that list ([1]) should be replaced with “height(in)” (<- "height(in)").
We can also change a variable name by referencing its current name.
#Change 'height(in)' back to 'height'
names(df)[names(df) == "height(in)"] <- "height"
#Check our work
head(df)
## height weight
## 1 58 115
## 2 59 117
## 3 60 120
## 4 61 123
## 5 62 126
## 6 63 129
Just like when we changed the name using the column number, we start by calling the names of all the variables with names(df). Then, inside the square brackets, names(df) == "height(in)" picks out the column whose name exactly matches (==) ‘height(in)’. Lastly, we replace that name with ‘height’ (<- "height").
3.4 Generating a Count Table
If you want to know how many times each unique value of a variable occurs, use the table() function. It counts how many entries share each value in a given variable.
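A minimal sketch, assuming df holds the built-in women dataset and that the variable being counted is height (inferred from the output that follows):

```r
# Assumption: df is the built-in `women` dataset
df <- women

# Count how many times each height occurs
table(df$height)
```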
##
## 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
You’ll notice in this case that the count table is not terribly useful; there’s only one entry for each height. To better illustrate this, we can use another built-in dataset, mtcars.
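The calls below sketch how this could look with mtcars. Counting the cyl (cylinders) column is inferred from the output and explanation that follow.

```r
# List the column names of the built-in mtcars dataset
names(mtcars)

# Count how many cars have each number of cylinders
table(mtcars$cyl)
```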
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
##
## 4 6 8
## 11 7 14
From this, we can see that 11 entries have 4 cylinders, 7 have 6 cylinders, and 14 have 8 cylinders.