8.4 Dataframe column names

One of the nice things about dataframes is that each column has a name. You can use these names to access specific columns without having to know which column number each one is.

To access the column names of a dataframe, use the function names(). This will return a character vector with the names of the columns. Let’s use names() to get the column names of the ToothGrowth dataframe:

# What are the names of columns in the ToothGrowth dataframe?
names(ToothGrowth)
## [1] "len"    "supp"   "dose"   "len.cm" "index"

To access a specific column in a dataframe by name, you use the $ operator in the form df$name where df is the name of the dataframe, and name is the name of the column you are interested in. This operation will then return the column you want as a vector.

Let’s use the $ operator to get a vector of just the length column (called len) from the ToothGrowth dataframe:

# Return the len column of ToothGrowth
ToothGrowth$len
##  [1]  4.2 11.5  7.3  5.8  6.4 10.0 11.2 11.2  5.2  7.0 16.5 16.5 15.2 17.3
## [15] 22.5 17.3 13.6 14.5 18.8 15.5 23.6 18.5 33.9 25.5 26.4 32.5 26.7 21.5
## [29] 23.3 29.5 15.2 21.5 17.6  9.7 14.5 10.0  8.2  9.4 16.5  9.7 19.7 23.3
## [43] 23.6 26.4 20.0 25.2 25.8 21.2 14.5 27.3 25.5 26.4 22.4 24.5 24.8 30.9
## [57] 26.4 27.3 29.4 23.0
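Because the result is an ordinary vector, anything you already know about vectors applies to it. Here is a small sketch, using the ToothGrowth data that ships with R, that indexes the vector and counts its values:

```r
# The result of $ behaves like any other vector
ToothGrowth$len[1:5]       # First five tooth lengths
## [1]  4.2 11.5  7.3  5.8  6.4

length(ToothGrowth$len)    # Number of values (one per row of the dataframe)
## [1] 60
```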

Because the $ operator returns a vector, you can easily calculate descriptive statistics on columns of a dataframe by applying your favorite vector function (like mean() or table()) to a column using $. Let’s calculate the mean tooth length with mean(), and the frequency of each supplement with table():

# What is the mean of the len column of ToothGrowth?
mean(ToothGrowth$len)
## [1] 18.81333

# Give me a table of the supp column of ToothGrowth.
table(ToothGrowth$supp)
## 
## OJ VC 
## 30 30

If you want to access several columns by name, you can forgo the $ operator and put a character vector of column names in brackets after the name of the dataframe. The result is a dataframe containing just those columns:

# Give me the len AND supp columns of ToothGrowth
head(ToothGrowth[c("len", "supp")])
##    len supp
## 1  4.2   VC
## 2 11.5   VC
## 3  7.3   VC
## 4  5.8   VC
## 5  6.4   VC
## 6 10.0   VC
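To make the difference from $ concrete, here is a minimal sketch using the ToothGrowth data that ships with R. The name tooth.sub is just a hypothetical name chosen for this illustration:

```r
# Select two columns by name; the order of the names sets the column order
tooth.sub <- ToothGrowth[c("supp", "len")]   # tooth.sub is a hypothetical name

class(tooth.sub)         # Brackets with names give back a dataframe
## [1] "data.frame"

names(tooth.sub)         # Columns appear in the order requested
## [1] "supp" "len"

class(ToothGrowth$len)   # By contrast, $ gives back a plain numeric vector
## [1] "numeric"
```

So if you want to pass a single column to a vector function like mean(), use $. If you want a smaller dataframe, use brackets.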

8.4.1 Adding new columns

You can add new columns to a dataframe using the $ and assignment <- operators. To do this, just use the df$name notation and assign a new vector of data to it.

For example, let’s create a dataframe called survey with two columns: index and age:

# Create a new dataframe called survey
survey <- data.frame("index" = c(1, 2, 3, 4, 5),
                     "age" = c(24, 25, 42, 56, 22))

survey
##   index age
## 1     1  24
## 2     2  25
## 3     3  42
## 4     4  56
## 5     5  22

Now, let’s add a new column called sex with a vector of sex data:

# Add a new column called sex to survey
survey$sex <- c("m", "m", "f", "f", "m")

Here’s the result:

# survey with new sex column
survey
##   index age sex
## 1     1  24   m
## 2     2  25   m
## 3     3  42   f
## 4     4  56   f
## 5     5  22   m

As you can see, survey now has a new column named sex that contains the values we assigned.
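The same notation also works when the new values are calculated from a column that already exists. The following is only a sketch: survey.demo, age.months, and study are hypothetical names made up for this illustration, and the example uses a separate dataframe so that survey itself is left untouched:

```r
# A copy of the survey data under a hypothetical name
survey.demo <- data.frame("index" = c(1, 2, 3, 4, 5),
                          "age" = c(24, 25, 42, 56, 22))

# Add a column computed from an existing column
survey.demo$age.months <- survey.demo$age * 12

# Add a column from a single value; R repeats it for every row
survey.demo$study <- "pilot"

survey.demo
##   index age age.months study
## 1     1  24        288 pilot
## 2     2  25        300 pilot
## 3     3  42        504 pilot
## 4     4  56        672 pilot
## 5     5  22        264 pilot
```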

8.4.2 Changing column names

To change the name of a column in a dataframe, just use a combination of the names() function, indexing, and reassignment.

# Change name of 1st column of df to "a"
names(df)[1] <- "a"

# Change name of 2nd column of df to "b"
names(df)[2] <- "b"

For example, let’s change the name of the first column of survey from index to participant.number:

# Change the name of the first column of survey to "participant.number"
names(survey)[1] <- "participant.number"
survey
##   participant.number age sex
## 1                  1  24   m
## 2                  2  25   m
## 3                  3  42   f
## 4                  4  56   f
## 5                  5  22   m

Warning: Change column names with logical indexing to avoid errors!

Now, there is one major potential problem with my method above: I had to manually enter the value of 1. But what if the column I want to change isn’t the first column (either because I typed the number wrong or because the order of the columns changed)? R would rename whichever column is in that position without any warning, which could lead to serious problems later on.

To avoid these issues, it’s better to change column names with a logical vector, using the format names(df)[names(df) == "old.name"] <- "new.name". Here’s how to read this: “Change the names of df, but only where the original name was "old.name", to "new.name".”

Let’s use logical indexing to change the name of the column survey$age to survey$years:

# Change the column name from age to years
names(survey)[names(survey) == "age"] <- "years"
survey
##   participant.number years sex
## 1                  1    24   m
## 2                  2    25   m
## 3                  3    42   f
## 4                  4    56   f
## 5                  5    22   m
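To see why this is safer, it helps to look at the logical vector itself. The sketch below uses a small hypothetical dataframe called survey.demo, with the columns deliberately in a different order, and also shows what happens when the old name does not exist:

```r
# A small dataframe for illustration (survey.demo is a hypothetical name)
survey.demo <- data.frame("sex" = c("m", "f"),
                          "age" = c(24, 42),
                          "index" = c(1, 2))

# The logical vector marks which column name matches "age"
names(survey.demo) == "age"
## [1] FALSE  TRUE FALSE

# Rename age to years, wherever the column happens to sit
names(survey.demo)[names(survey.demo) == "age"] <- "years"
names(survey.demo)
## [1] "sex"   "years" "index"

# If no column has the old name, nothing changes and R gives no error
names(survey.demo)[names(survey.demo) == "height"] <- "cm"
names(survey.demo)
## [1] "sex"   "years" "index"
```

Note the last case: a mistyped old name will not rename the wrong column, but it will not tell you anything went wrong either, so it is worth checking names() afterwards.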