2.4 More complex data structures in R

In order to analyse data in R we need to be able to represent it and this is achieved through the range of different data types and structures available. So let’s pick up on using variables in R from the previous chapter.

We will explore two very useful structures, namely:

  1. vectors
  2. data frames

Variable names need to be descriptive but not excessively long. So you will need a convention for variable names that comprise multiple words. I suggest either snake_case or camelCase where you separate lowercase words with an underscore or through use of capitalisation. The more systematic you are, the easier for everyone.

# Different variable naming conventions
i_use_snake_case <- 0
otherPeopleUseCamelCase <- 1
some.people.use.periods <- 2
aFew.People_AREconTra_rians <- 3  # Don't be one of them!

2.4.1 Creating and using vectors

So far we have mainly focused on atomic variables, that is, ones that hold a single value; for example, otherPeopleUseCamelCase contains the single numeric value 1. However, it’s often useful to store and analyse multiple values, such as the heights of all the people in a sample. To do this we can use a vector, which holds several values of the same type; for our height example these would most likely be numeric (e.g., 189cm, 176cm, …).

For simplicity, we populate the vector height with random height data using the built-in function rnorm(). This function takes various arguments to tell it what to do. For clarity I have named the arguments, so n = 20 refers to the number of random numbers we require. Since rnorm() uses the normal distribution (hence rnorm as opposed to, say, rbinom(), which uses a binomial distribution) we need to provide two other pieces of information: the mean or centre of the distribution and the standard deviation (sd), which is a measure of dispersion or spread of the values (see note 1 at the end of this section).
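As a minimal sketch, the vector might be generated along the following lines. The mean and sd values are taken from the example call in note 1; the call to set.seed() and the rounding to three decimal places are assumptions added here so that the result is repeatable and easy to read. The numbers produced will not match the output shown in this section.

```r
# Sketch only: the seed and the rounding are illustrative assumptions
set.seed(2024)                                   # makes the random draw repeatable
height <- rnorm(n = 20, mean = 1.65, sd = 0.15)  # 20 draws from a normal distribution
height <- round(height, 3)                       # keep three decimal places
length(height)
## [1] 20
```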

Imagine we have generated the height (in metres) of 20 students in a vector height. When we call the print() function we see the following.

print(height)        # Output all the elements of the height vector
##  [1] 1.856 1.565 1.704 1.745 1.711 1.634 1.877 1.636 1.953 1.641 1.846 1.993
## [13] 1.442 1.608 1.630 1.745 1.607 1.252 1.284 1.848
length(height)       # Return how many elements in the vector
## [1] 20
head(height)         # Returns the first 6 elements of the vector
## [1] 1.856 1.565 1.704 1.745 1.711 1.634

Note that all 20 values or elements are output and, as we’d expect from our understanding of vectors from maths, each value has a position. The first element is indicated by a [1], the 13th by [13] and so forth. It’s easy to imagine that if the vector were long then print(height) would become pretty unwieldy, but there’s another useful R function called head() which by default allows us to peek at the first six elements. There is a similar function called tail() which returns the final six elements. We can check the length of a vector using length().
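To illustrate tail() and the optional n argument of both functions, here is a small sketch. The 20 values are re-entered by hand (using c(), which is introduced later in this section) purely so that the example runs on its own.

```r
# The 20 values printed above, re-entered so the example is self-contained
height <- c(1.856, 1.565, 1.704, 1.745, 1.711, 1.634, 1.877, 1.636, 1.953, 1.641,
            1.846, 1.993, 1.442, 1.608, 1.630, 1.745, 1.607, 1.252, 1.284, 1.848)
tail(height)           # Returns the last 6 elements of the vector
## [1] 1.630 1.745 1.607 1.252 1.284 1.848
head(height, n = 3)    # Ask for the first 3 elements instead of the default 6
## [1] 1.856 1.565 1.704
tail(height, n = 2)    # Ask for the last 2 elements
## [1] 1.284 1.848
```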

What is particularly convenient, and computationally efficient, is that we can apply functions to the entire vector, rather than having to write a loop as one would for a traditional language such as Java or C++.
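A brief sketch of what this means in practice, using just the first four heights from the output above. The variable name height_cm is introduced here only for illustration.

```r
height <- c(1.856, 1.565, 1.704, 1.745)  # first four values from the output above
height_cm <- height * 100     # multiplies every element, no loop needed
print(height_cm)
## [1] 185.6 156.5 170.4 174.5
print(height > 1.7)           # comparisons are applied element by element too
## [1]  TRUE FALSE  TRUE  TRUE
mean(height)                  # a single function call summarises the whole vector
## [1] 1.7175
```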

# Two examples of functions applied to a numeric vector (height)
sd(height)         # This function returns the standard deviation
## [1] 0.1968807
summary(height)    # This function provides min/max, mean and quartile information
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.252   1.608   1.673   1.679   1.847   1.993

The very useful function summary() can be applied to many different objects, including vectors. We refer to it as generic. It can also be extensively customised.

For now, the final function we wish to consider for vectors is c(), which is an extremely useful way to combine values into a vector. Indeed, you will find yourself using this function a great deal.

# Create a string vector of greetings
greetingList <- c("Hi!", "Hello", "Good morning", "Saludo")
print(greetingList)
## [1] "Hi!"          "Hello"        "Good morning" "Saludo"

In the above example, we have made a vector of string elements (see note 2 at the end of this section) by assigning the result of combining four string literals (“Hi!”, … , “Saludo”) to a vector that we name greetingList. Since these are character strings, R deduces that the data type should be character and stores the four elements as a vector. Then we output the whole vector by means of the print() function.

Being able to reference sub-selections of a vector is very useful. To do this we use the square bracket notation [<n>] where \(n\) is the position of the required vector element.

# Different ways to index vector elements
print(greetingList[2])
## [1] "Hello"
n <- 3
print(greetingList[n])    # You can use a variable as an index 
## [1] "Good morning"
print(greetingList[1:3])  # Or you can specify a range of vector elements
## [1] "Hi!"          "Hello"        "Good morning"

Notice in the above R chunk that we can access more than one element at a time by using the : operator to return a range of elements. The range can be specified by literals, as in [1:3], or by integer variables (see note 3 at the end of this section), e.g., [m:n].
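The following sketch shows a range given by two variables, and what happens when an index is not a whole number. The values chosen for m and n, and the name selected, are illustrative only.

```r
greetingList <- c("Hi!", "Hello", "Good morning", "Saludo")
m <- 2
n <- 4
selected <- greetingList[m:n]   # elements 2, 3 and 4
length(selected)
## [1] 3
print(greetingList[2.7])        # a non-integer index is truncated to 2
## [1] "Hello"
```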

Be careful to ensure your index only points to vector elements that exist, otherwise R will return NA which is probably not what you intend. This is a classic type of coding error :-)

# Here is an example of an invalid vector index
greetingList[5]
## [1] NA
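One way to guard against this, sketched below, is to compare the index with length() before using it. The index variable i and the wording of the message are illustrative assumptions, not a standard idiom you must follow.

```r
greetingList <- c("Hi!", "Hello", "Good morning", "Saludo")
i <- 5                                    # hypothetical index to check
if (i >= 1 && i <= length(greetingList)) {
  print(greetingList[i])
} else {
  message("Index ", i, " is outside 1 to ", length(greetingList))
}
is.na(greetingList[i])                    # TRUE signals a missing value
## [1] TRUE
```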

Java and C programmers should note that indices start from one, not zero.

2.4.2 Data Frames

In order to complete our overview of the different types of variable that you can use in R, we consider the situation of more than one dimension and mixed (heterogeneous) data types.

Table of more complex data structures in R

Dimensions   Homogeneous type   Heterogeneous types
1            vector             list
2            matrix             data frame
n            array              n.a.

Data frames, in particular, tend to be something of a workhorse for the R data analyst, so it’s important to become comfortable using them. A data frame is typically organised so that the columns comprise different variables (that may have varying data types e.g., numeric and character) and rows comprise individual observations. Sometimes this is referred to as ‘rectangular’ data because each row is the same length.
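To make the row and column idea concrete, here is a minimal sketch that builds a tiny data frame by hand with the base R function data.frame(). The data frame students and all the values in it are invented purely for illustration.

```r
# Hypothetical data: three observations (rows) of three variables (columns)
students <- data.frame(
  name   = c("Ana", "Ben", "Chloe"),   # character column
  height = c(1.71, 1.83, 1.65),        # numeric column (metres)
  smoker = c(FALSE, TRUE, FALSE),      # logical column
  stringsAsFactors = FALSE             # keep name as plain character strings
)
print(students)
nrow(students)    # number of observations
## [1] 3
ncol(students)    # number of variables
## [1] 3
```

Each column is itself a vector, so every column holds a single data type, but different columns may hold different types.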

In the example below we use the built-in dataset mtcars which contains data concerning 1974 car road tests from the US magazine Motor Trend. For more information you can type ?mtcars, as you can for any other built-in data set or function.

head(mtcars)          # Show the first six (default) rows of the dataframe mtcars
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
summary(mtcars$mpg)   # Produce summary stats for the mpg column of the mtcars data set
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Note that each row has a name, specifically a type of car e.g., Datsun 710. Then there are 11 named columns, each one of which corresponds to a variable. The meaning is pretty intuitive so we see, for instance, that a Datsun 710 has a fuel consumption (mpg, miles per gallon) of 22.8. This is a flexible and convenient way to store data which is equivalent, for example, to the Data View in SPSS.

In the above fragment of R, I have used the head() function since the actual data set comprises data on 32 different cars, which is more awkward to display. The other important point to note is the use of the $ operator when we refer to mpg so that it’s clear which data frame we are dealing with. If you don’t specify the data frame a variable is contained within then R will not recognise which variable you are referring to.

If you are manipulating just one data frame frequently, it is possible to dispense with the $ operator by using the attach() function. In the case of mtcars we would have attach(mtcars) and detach(mtcars) when we have finished.
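A short sketch of this pattern follows. The name mpg_mean is introduced here for illustration. One caveat worth knowing: a variable of the same name in your own workspace takes precedence over an attached column, so attach() is best kept for short interactive sessions.

```r
attach(mtcars)            # columns of mtcars can now be used by name
mpg_mean <- mean(mpg)     # same result as mean(mtcars$mpg)
print(mpg_mean)
print(range(hp))          # smallest and largest horsepower
detach(mtcars)            # remember to detach when finished
```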

Some useful functions to manipulate data frames include:

  • str() which usefully reveals the structure and the first few values for each variable in the data frame.
  • dim() shows the dimensions of the data frame: row count followed by column (variable) count.
  • summary() since, as we have already noted, this is a generic function we can not only apply it to vectors but to data frames as well.
  • View() allows you to see a spreadsheet-like display of the entire data frame. NB this function, rather inconsistently, starts with upper case V.
  • [r,c] allows us to access the \(c^{th}\) column of the \(r^{th}\) row where \(r\) and \(c\) are positive integers. See below.
# Changing the fuel consumption of the Datsun 710 and the number of cylinders

mtcars["Datsun 710","mpg"] <- 99  # Matching by value
mtcars[3,2] <- 16                 # Indexing by row and column number
head(mtcars,4)                    # Display first 4 rows only, note the extra argument
##                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710     99.0  16  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

The above code fragment shows two ways of changing an element in a data frame: matching by value (here, the row and column names) and indexing by row and column number. If we match by value, for instance with a condition that several rows satisfy, then we might update more than one element. On the other hand, absolute indexing can be fiddly or impractical, particularly for large data frames.
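The inspection functions listed earlier can be tried out as in the sketch below. The call to data(mtcars) is an addition here: it reloads the built-in copy so that the two edits made above are undone before we look at the data.

```r
data(mtcars)           # reload the built-in copy, undoing the edits above
dim(mtcars)            # row count followed by column count
## [1] 32 11
nrow(mtcars)           # just the number of rows
## [1] 32
str(mtcars)            # structure and first few values of each variable
summary(mtcars[, c("mpg", "cyl")])   # summary() applied to two chosen columns
```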

Obviously we are only scratching the surface of manipulating data sets. There are many sources for more detail on managing data with R, but a good starting point is Chapter 2 of (Kabacoff 2015).

For more background on mtcars you can type help(mtcars) and R will provide a detailed data description of this built-in data set. There is an interesting note regarding the data quality. What is it? Why do you think this is important?

References

Kabacoff, Robert. 2015. R in Action: Data Analysis and Graphics with R. 2nd ed. Manning.


  1. The function defines what arguments it is expecting, what their names are and in what order they are expected. If you are unsure you can always check by entering ?<function_name> to access the help information. Often we call functions without explicitly naming the arguments but if we do so, we must provide them in the order they are defined or expected. Thus rnorm(n = 20, mean = 1.65, sd = 0.15) and rnorm(20, 1.65, 0.15) are equivalent.

  2. In R, there’s no fundamental distinction between a string and a character. A “string” is a character variable that contains one or more characters.

  3. If the index variables are not integers R will do its best to coerce the values to integers.