2.4 More complex data structures in R
In order to analyse data in R we need to be able to represent it and this is achieved through the range of different data types and structures available. So let’s pick up on using variables in R from the previous chapter.
We will explore two very useful structures, namely vectors and data frames.
Variable names need to be descriptive but not excessively long. It helps to have a convention for variable names that comprise multiple words. I suggest either snake_case or camelCase where you separate lowercase words with an underscore or through use of capitalisation. The more systematic you are, the easier for everyone.
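As a small illustration of the two conventions, here is the same quantity named both ways. The variable names are made up for this example and are not used elsewhere in the chapter.

```r
# snake_case: lowercase words separated by underscores
mean_height_m <- 1.65

# camelCase: each word after the first starts with a capital letter
meanHeightM <- 1.65

# Both are valid R names; what matters is picking one style and sticking to it.
identical(mean_height_m, meanHeightM)
```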
2.4.1 Creating and using vectors
So far we have mainly focused on atomic variables, that is, ones that comprise a single instance, such as otherPeopleUseCamelCase, which contains a single numeric value, specifically 1. However, it’s often useful to store and analyse multiple instances such as the heights of all the people in a sample. To do this we can use a vector of atomic values of the same type, which for our height example would most likely be numeric (e.g., 189cm, 176cm, …).
# This R code generates random height data using the function rnorm()
# and assigns it to the numeric vector height
set.seed(42) # Set the random number generator seed
# so the results are repeatable.
height <- round(rnorm(n = 20, mean = 1.65, sd = 0.15), 3) # Round to 3 decimal places for clarity.
For simplicity, we populate the vector height with random height data using the built-in function rnorm(). This function takes various arguments to tell it what to do. For clarity I have named the arguments, so n = 20 refers to the number of random numbers we require. Since rnorm() uses the normal distribution (hence rnorm as opposed to, say, rbinom() which uses a binomial distribution) we need to provide two other pieces of information: the mean or centre of the distribution and the standard deviation (sd), which is a measure of dispersion or spread of the values.²³
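A quick way to convince yourself that naming arguments and relying on their position amount to the same thing is to reset the seed before each call and compare the results. The variable names by_name and by_position are only for illustration.

```r
# Named arguments: the order in which we write them does not matter.
set.seed(42)
by_name <- rnorm(sd = 0.15, n = 20, mean = 1.65)

# Positional arguments: these must follow the defined order n, mean, sd.
set.seed(42)
by_position <- rnorm(20, 1.65, 0.15)

# Because the seed was reset, both calls draw exactly the same numbers.
identical(by_name, by_position)
```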
Notice also that we have nested two functions: rnorm() is nested as the first argument to another function, round(). This means the results of the first (inner) function are passed directly to the second (outer) function. This is more concise and sometimes (not always) easier to read. Alternatively we could write:
# Create a vector without nesting function calls
set.seed(42)
height <- rnorm(n = 20, mean = 1.65, sd = 0.15)
height <- round(height, 3)
Imagine we have generated the height (in metres) of 20 students in a vector height. Printing it with print(height), then calling length(height) and head(height), gives the following.
## [1] 1.856 1.565 1.704 1.745 1.711 1.634 1.877 1.636 1.953 1.641 1.846 1.993
## [13] 1.442 1.608 1.630 1.745 1.607 1.252 1.284 1.848
## [1] 20
## [1] 1.856 1.565 1.704 1.745 1.711 1.634
Note that all 20 values or elements are output and, as we’d expect from our understanding of vectors from maths, each value has a position. The position of the first element on each output line is shown in square brackets, so the first element is indicated by [1], the 13th by [13] and so forth. It’s easy to imagine that if the vector were long then print(height) would become pretty unwieldy, but fortunately there’s another useful R function called head() which by default allows us to peek at the first six elements. There is a similar function called tail() which returns the final six elements. We can check the length of a vector using length().
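The functions just described can be tried together as in the sketch below, which first recreates height so that it runs on its own. The optional n argument to head() is included as an extra to show that six is only the default.

```r
set.seed(42)
height <- round(rnorm(n = 20, mean = 1.65, sd = 0.15), 3)

print(height)        # all 20 elements, with position markers such as [1]
length(height)       # the number of elements
head(height)         # the first six elements by default
tail(height)         # the final six elements by default
head(height, n = 3)  # n changes how many elements are returned
```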
What is particularly convenient, and computationally efficient,²⁴ is that we can apply functions and operators to the entire vector, rather than having to write a loop as one would in a traditional language such as Java or C++. Imagine we wish to convert height from metres to inches. We can simply apply the arithmetic to the whole vector, here multiplying by 39.37 (the number of inches in a metre), as follows.
## [1] 73.07072 61.61405 67.08648 68.70065 67.36207 64.33058 73.89749 64.40932
## [9] 76.88961 64.60617 72.67702 78.46441 56.77154 63.30696 64.17310 68.70065
## [17] 63.26759 49.29124 50.55108 72.75576
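The output above is consistent with multiplying every element by 39.37. The sketch below does this and, purely for comparison, repeats the calculation with an explicit loop. The names inches_per_metre, height_in and height_in_loop are my own choices for illustration.

```r
set.seed(42)
height <- round(rnorm(n = 20, mean = 1.65, sd = 0.15), 3)

inches_per_metre <- 39.37

# Vectorised: a single expression operates on every element.
height_in <- height * inches_per_metre

# The equivalent explicit loop, as one might write in Java or C++.
height_in_loop <- numeric(length(height))
for (i in seq_along(height)) {
  height_in_loop[i] <- height[i] * inches_per_metre
}

# Both approaches give the same values.
all.equal(height_in, height_in_loop)
```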
Two further examples of functions applied to a numeric vector (height), the standard deviation sd() and summary():
## [1] 0.1968807
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.252 1.608 1.673 1.679 1.847 1.993
A very useful function, summary(), can be applied to many different data objects including vectors. We refer to it as generic. It can also be extensively customised.
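As a sketch of these two functions in use, and of what 'generic' means in practice, the example below calls summary() on both a numeric and a character vector. The character vector is invented for the example.

```r
set.seed(42)
height <- round(rnorm(n = 20, mean = 1.65, sd = 0.15), 3)

sd(height)       # standard deviation of the 20 heights
summary(height)  # minimum, quartiles, median, mean and maximum

# Because summary() is generic, it adapts to the type of object it is given.
# For a character vector it reports length, class and mode instead.
summary(c("Hi!", "Hello"))
```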
For now, the final function we wish to consider for vectors is c(), which is an extremely powerful way to combine values into a vector. Most likely, you will find yourself using this function a great deal.
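A few typical uses of c() are sketched below. The values and variable names are invented for illustration.

```r
# Combine individual values into a numeric vector.
new_heights <- c(1.70, 1.82, 1.59)

# c() also joins existing vectors, so it can be used to append elements.
more_heights <- c(new_heights, 1.75, c(1.68, 1.91))
length(more_heights)  # 6

# All elements of a vector share one type, so mixing numbers and strings
# turns the numbers into character strings.
mixed <- c(1.70, "tall")
class(mixed)          # "character"
```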
Remember vectors need not be limited to numeric elements.
# Create a string vector of greetings
greetings <- c("Hi!", "Hello", "Good morning", "Saludo", "Hej!")
In the above example, we have made a vector of string elements²⁵ by assigning the result of combining five string literals (“Hi!”, …, “Hej!”) into a vector that we name greetings. Since these are character strings, R deduces that the data type should be character and that it’s a vector because there’s more than one element. Then we output a single element of the vector by means of the print() function and specify which element using the square brackets notation.
Being able to reference sub-selections of a vector is very useful. To do this we use the square bracket notation [<i>] where \(i\) is the position of the required vector element.
## [1] "Hello"
## [1] "Good morning"
## [1] "Hi!" "Hello" "Good morning"
## [1] "Hi!" "Hello" "Good morning" "Saludo" "Hej!"
Notice in the above R chunk that we can access more than one element at a time by using the : operator to return a range of elements. The range can be specified by literals as in [1:3] or integer²⁶ variables e.g., [m:n].
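The indexing forms described above can be sketched as follows. The values chosen for m and n are arbitrary, and the last line illustrates the coercion of a non-integer index.

```r
greetings <- c("Hi!", "Hello", "Good morning", "Saludo", "Hej!")

greetings[2]    # "Hello"
greetings[1:3]  # "Hi!" "Hello" "Good morning"

# The limits of the range can be held in variables.
m <- 2
n <- 4
greetings[m:n]  # "Hello" "Good morning" "Saludo"

# A non-integer index is truncated towards zero, so 2.7 is treated as 2.
greetings[2.7]  # "Hello"
```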
Be careful to ensure your index only points to vector elements that exist, otherwise R will return NA, which is probably not what you intend. This is a classic type of coding error :-)
## [1] NA
Java and C programmers should note that R indices start from one, not zero.
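The sketch below shows both pitfalls, an index beyond the end of the vector and an index of zero, together with one possible defensive check. The check is only a suggestion and the variable i is made up for the example.

```r
greetings <- c("Hi!", "Hello", "Good morning", "Saludo", "Hej!")

greetings[6]         # NA, since there is no sixth element
is.na(greetings[6])  # TRUE

# One way to guard against an out-of-range index before using it.
i <- 6
if (i >= 1 && i <= length(greetings)) {
  print(greetings[i])
} else {
  message("Index ", i, " is outside the range 1 to ", length(greetings))
}

# Index 0 does not give the first element; it returns an empty vector.
length(greetings[0])  # 0
```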
2.4.2 Data Frames
In order to complete our overview of the different types of variable that you can use in R, we consider the situation of more than one dimension and mixed (heterogeneous) data types.
| Dimensions | Homogeneous type | Heterogeneous types |
|---|---|---|
| 1 | vector | list |
| 2 | matrix | data frame |
| n | array | n.a. |
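To make the table concrete, here is a minimal sketch that creates one small object of each kind. All the values and variable names are invented for illustration.

```r
# One dimension, one type: a vector
v <- c(1.65, 1.80, 1.72)

# One dimension, mixed types: a list
l <- list(name = "Datsun 710", mpg = 22.8)

# Two dimensions, one type: a matrix
m <- matrix(1:6, nrow = 2, ncol = 3)

# Two dimensions, mixed types: a data frame
df <- data.frame(car = c("Mazda RX4", "Datsun 710"),
                 mpg = c(21.0, 22.8))

# n dimensions, one type: an array (here 2 x 3 x 4)
a <- array(1:24, dim = c(2, 3, 4))

str(df)  # shows one character column and one numeric column
```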
Data frames, in particular, tend to be something of a workhorse for the R data analyst, so it’s important to become comfortable using them. A data frame is typically organised so that the columns comprise different variables (that may have varying data types e.g., numeric and character) and rows comprise individual observations. Sometimes this is referred to as ‘rectangular’ data because every row has the same length, as does every column.
In the example below we use the built-in dataset mtcars, which contains data concerning 1974 car road tests from the US magazine Motor Trend. For more information you can type ?mtcars, as you can for any other built-in dataset or function.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
Note that each row has a name, specifically a type of car e.g., Datsun 710. Then there are 11 named columns, each of which corresponds to a variable. The meaning is pretty intuitive so we see, for instance, that a Datsun 710 has a fuel consumption (mpg) of 22.8. This is a flexible and convenient way to store data which is equivalent, for example, to the Data View in SPSS.
In the above fragment of R, I have used the head() function since the actual data set comprises data on 32 different cars, which is more awkward to display. The other important point to note is the use of the $ operator when we refer to mpg (as in mtcars$mpg) so that it’s clear which data frame we are dealing with. If you don’t specify the data frame that a variable is contained within, R will not recognise which variable you are referring to.
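The output shown earlier is consistent with the calls sketched below, which assume an unmodified copy of the built-in mtcars data set.

```r
head(mtcars)         # the first six rows (cars) and all 11 columns
summary(mtcars$mpg)  # $ picks out the mpg column of the mtcars data frame

nrow(mtcars)         # 32 cars in total
mean(mtcars$mpg)     # the mean fuel consumption, matching the summary above
```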
If you are manipulating just one data frame frequently, it is possible to dispense with the $ operator by using the attach() function. In the case of mtcars we would have attach(mtcars), and detach(mtcars) when we have finished.
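A minimal sketch of the attach() and detach() pattern follows. I have also shown with(), a base R function not discussed above, as one alternative that avoids having to remember to detach. The variable names are for illustration only.

```r
attach(mtcars)             # make the columns of mtcars visible by name
attached_mean <- mean(mpg) # no need to write mtcars$mpg
detach(mtcars)             # tidy up when we have finished

# An alternative: with() evaluates an expression inside the data frame
# without altering the search path.
with_mean <- with(mtcars, mean(mpg))

all.equal(attached_mean, with_mean)
```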
Some useful functions to manipulate data frames include:
- str() which usefully reveals the structure and the first few values for each variable in the data frame.
- dim() shows the dimensions of the data frame: row count followed by column (variable) count.
- summary() since, as we have already noted, this is a generic function we can not only apply it to vectors but to data frames as well.
- View() allows you to see a spreadsheet-like display of the entire data frame. NB this function, rather inconsistently, starts with upper case V.
- [r,c] allows us to access the \(c^{th}\) column of the \(r^{th}\) row where \(r\) and \(c\) are positive integers. See below.
# Changing the fuel consumption of the Datsun 710 and the number of cylinders
mtcars["Datsun 710", "mpg"] <- 99 # Matching by row and column name
mtcars[3,2] <- 16 # Indexing by row and column number
head(mtcars,4) # Display first 4 rows only, note the extra argument
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 99.0 16 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
The above code fragment shows two equivalent ways of manipulating elements in a data frame: matching by row and column name, and indexing by position. If we match on values, for example with a logical condition, then we might update more than one element. On the other hand, if we use absolute indexing this can be fiddly or impractical, particularly for large data frames. It’s also vulnerable if the data frame changes, for example through adding new observations (rows) or variables (columns).
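The inspection functions and the indexing styles can be tried safely on a copy of the data, as in the sketch below. The names cars_copy and four_cyl are my own, and datasets::mtcars is used so that we start from the original values.

```r
cars_copy <- datasets::mtcars  # work on a copy of the original data

str(cars_copy)  # structure: 32 observations of 11 numeric variables
dim(cars_copy)  # 32 11

# The same element accessed by name and by position.
cars_copy["Datsun 710", "mpg"]  # 22.8
cars_copy[3, 1]                 # 22.8

# A logical condition can match many rows at once, so an assignment
# using it would change all of them.
four_cyl <- cars_copy$cyl == 4
sum(four_cyl)                   # 11 cars have four cylinders
```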
Obviously we are only scratching the surface of manipulating data sets. There are many sources for more detail on managing data with R, but a good starting point is Chapter 2 of (Kabacoff 2015).
For more background on mtcars you can type help(mtcars) and R will provide a detailed description of this built-in data set. There is an interesting note regarding the data quality. What is it? Why do you think this is important?
References
23. The function defines what arguments it is expecting, what their names are and in what order they are expected. If you are unsure you can always check by entering ?<function_name> to access Help Information. Often we call functions without explicitly naming the parameters but if we do so, we must provide them in the order they are defined or expected. Thus rnorm(n = 20, mean = 1.65, sd = 0.15) and rnorm(20, 1.65, 0.15) are equivalent.
24. The difference in execution time is quite noticeable for large vectors.
25. In R, there’s no fundamental distinction between a string and a character. A “string” is a character variable that contains one or more characters.
26. If the index variables are not integers R will do its best to coerce the values to integers.