2.3 Data Structure
In R, you can use different types of data structures, depending on your needs:
- Vector,
- Matrix,
- Array,
- Dataframe, and
- List.
These data structures differ in how information is stored and displayed, and in how it is accessed or edited.
2.3.1 Vector
A vector is the most common and basic data structure in R. It is also the building block for the more complicated data structures mentioned later (such as matrices, arrays, and dataframes).
A vector is simply an ordered collection of values of the same data type. Note that order matters. Thus, (1,2) is different from (2,1).
The simplest vector is an integer vector. For a vector of integers from 1 to 5, we just need to use 1:5.
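The call itself is not shown in the text. A minimal sketch, where the name `x` is only an illustrative choice:

```r
# The colon operator creates consecutive integers from 1 to 5
x <- 1:5
x
```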
## [1] 1 2 3 4 5
Sometimes, we want to repeat the same number several times. We use the function rep(). The following code repeats 0 five times.
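A sketch of such a call, with `zeros` as an assumed variable name:

```r
# rep(value, times): repeat the value 0 five times
zeros <- rep(0, times = 5)
zeros
```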
## [1] 0 0 0 0 0
This is usually a good way to initialize a vector. A slightly more general way to create a vector of integers is to use seq().
For integers from 4 to 10 with each step being 2, we have seq(4,10,by=2). If the step is negative, the sequence is decreasing.
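The two calls can be sketched as follows. The arguments of the decreasing sequence (from 21 down to 6 in steps of 3) are not given in the text and are assumed here so that they agree with the second line of output.

```r
# Increasing sequence: start at 4, stop at 10, step 2
up <- seq(4, 10, by = 2)
up

# Decreasing sequence: a negative step requires start > end (assumed values)
down <- seq(21, 6, by = -3)
down
```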
## [1] 4 6 8 10
## [1] 21 18 15 12 9 6
To create a generic vector, we use the function c(). The elements are placed inside the parentheses, separated by commas.
We first create a generic integer vector.
## [1] 1 -1 3 2
Then we create a generic numeric vector.
## [1] 1.0 -1.0 3.5 2.0
Finally, we create a generic character vector.
## [1] "Apple" "Banana"
We can also create a logical vector.
## [1] TRUE FALSE
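The four vectors above could be created along the following lines. The variable names are hypothetical. Note that R stores `c(1, -1, 3, 2)` as numeric (double) even though it prints like integers; the `L` suffix is one way to request true integers.

```r
int_vec <- c(1L, -1L, 3L, 2L)    # integer vector (L suffix forces integer storage)
num_vec <- c(1, -1, 3.5, 2)      # numeric (double) vector
chr_vec <- c("Apple", "Banana")  # character vector
lgl_vec <- c(TRUE, FALSE)        # logical vector

# class() reports the type of each vector
class(int_vec)
class(num_vec)
class(chr_vec)
class(lgl_vec)
```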
It is easy to add a new element to a vector.
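One common way, assumed here since the original call is not shown, is to pass the old vector and the new element to c():

```r
x <- c(1, -1, 3.5, 2)

# c() concatenates the existing vector with the new element 5
x_longer <- c(x, 5)
x_longer
```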
## [1] 1.0 -1.0 3.5 2.0 5.0
Since R is a vectorized language, applying a mathematical operator to a vector takes effect on every element of the vector.
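For illustration, assuming the operations were adding and subtracting 2 (which agrees with the output shown):

```r
x <- c(1, -1, 3.5, 2)

# The scalar 2 is applied to each element in turn
x + 2
x - 2
```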
## [1] 3.0 1.0 5.5 4.0
## [1] -1.0 -3.0 1.5 0.0
Elements are extracted with the operator []. To get the third element, we have x[3]. To get the first three elements, we use x[1:3]. To get the first, third and fourth elements, we use x[c(1,3,4)].
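With the numeric vector used earlier (assumed to still be stored in `x`), the three extractions look like this:

```r
x <- c(1, -1, 3.5, 2)

x[3]           # third element
x[1:3]         # first three elements
x[c(1, 3, 4)]  # first, third and fourth elements
```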
## [1] 3.5
## [1] 1.0 -1.0 3.5
## [1] 1.0 3.5 2.0
The size of a vector can be found by using length():
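The vector behind the output below is not shown. As an assumption, any vector with ten elements would do, for example:

```r
# A hypothetical vector with ten elements
v <- 1:10
length(v)
```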
## [1] 10
2.3.2 Factor
A factor vector is an integer vector converted from a character vector or a numeric vector. It was designed to save memory: duplicated long strings are converted to numbers, and only the mapping between the numbers and the strings needs to be stored. For example, instead of storing FEMALE and MALE repeatedly, the computer records them as 1 and 2, together with how these two numbers relate to FEMALE and MALE.
A factor vector is created from a character vector using factor().
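A minimal sketch, with `fruit` as an assumed name. The integer codes and the level mapping are stored in separate objects so that only the factor itself is printed.

```r
fruit <- factor(c("Apple", "Banana", "Apple"))

codes <- as.integer(fruit)  # underlying integer codes: 1 2 1
mapping <- levels(fruit)    # mapping from codes to labels: "Apple" "Banana"

fruit
```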
## [1] Apple Banana Apple
## Levels: Apple Banana
While memory is usually not an issue these days, a numeric vector is sometimes converted to a factor in order to construct a categorical variable. In financial analysis, we often divide observations into different groups according to some numeric measure (e.g., the top 10 per cent performing stocks).
The following divides the data into three groups whose intervals have the same width.
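The data and the call are not shown in the text. A sketch using cut(), where the vector `c(1, 4, 6, 10)` is an assumption chosen to be consistent with the output:

```r
# Hypothetical data; the original values are not given
x <- c(1, 4, 6, 10)

# A single number as the second argument asks cut() for that many
# equal-width intervals spanning the range of x
groups3 <- cut(x, 3)
groups3
```

cut() widens the outermost breaks slightly so that the minimum and maximum fall inside an interval, which is why the first interval starts just below 1.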
## [1] (0.991,4] (0.991,4] (4,7] (7,10]
## Levels: (0.991,4] (4,7] (7,10]
Sometimes we want to label the groups to make them easier to read.
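Labels can be supplied through the `labels` argument of cut(). The sketch below assumes two groups, since the output has only the levels L and H, and reuses the hypothetical data `c(1, 4, 6, 10)`.

```r
x <- c(1, 4, 6, 10)  # hypothetical data

# Two equal-width groups, labelled L (low) and H (high)
groups_lh <- cut(x, 2, labels = c("L", "H"))
groups_lh
```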
## [1] L L H H
## Levels: L H
Sometimes we may want to cut by quantiles instead. In this case, we need the option include.lowest=TRUE. Otherwise, the smallest observation falls outside the first interval and becomes missing (NA).
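A sketch of a quantile-based cut. The data and the probabilities (minimum, median, maximum) are assumptions chosen to agree with the two intervals in the output.

```r
x <- c(1, 4, 6, 10)  # hypothetical data

# Break points at the minimum, median and maximum of x
breaks <- quantile(x, probs = c(0, 0.5, 1))

# Without include.lowest the first interval is (1,5], so the minimum becomes NA
without_lowest <- cut(x, breaks)

# With include.lowest = TRUE the first interval is closed on the left
with_lowest <- cut(x, breaks, include.lowest = TRUE)
with_lowest
```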
## [1] [1,5] [1,5] (5,10] (5,10]
## Levels: [1,5] (5,10]
We can add a label to each group.
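Combining the two options might look as follows, again with hypothetical data and break points:

```r
x <- c(1, 4, 6, 10)  # hypothetical data

# Quantile break points plus readable labels for the two groups
labelled <- cut(x, quantile(x, probs = c(0, 0.5, 1)),
                include.lowest = TRUE, labels = c("L", "H"))
labelled
```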
## [1] L L H H
## Levels: L H
2.3.3 Matrix
A matrix is a two-dimensional array. Alternatively, it can be viewed as a stack of multiple vectors of the same length.
To define a matrix from a vector, the syntax is matrix(vector, nrow, ncol, byrow). byrow controls how the matrix is filled: TRUE fills it row by row, and FALSE (the default) fills it column by column.
The following code fills the matrix by column.
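A sketch consistent with the output, assuming the input vector is 1:20 and the name `m_col`:

```r
# byrow = FALSE fills the first column top to bottom, then the second, and so on
m_col <- matrix(1:20, nrow = 5, ncol = 4, byrow = FALSE)
m_col
```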
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
The following code fills the matrix by row.
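The same input (assumed to be 1:20) filled by row:

```r
# byrow = TRUE fills the first row left to right, then the second, and so on
m_row <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)
m_row
```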
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
## [5,] 17 18 19 20
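Extracting elements from a matrix is similar to extracting from a vector, except that we give a row index and a column index, separated by a comma. Leaving an index empty selects the whole row or column.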
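The calls are not shown in the text. One set of calls that agrees with the output, assuming the row-filled matrix from above:

```r
m <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)

m[2, ]   # second row, all columns
m[, 1]   # first column, all rows
m[1, 2]  # element in row 1, column 2
```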
## [1] 5 6 7 8
## [1] 1 5 9 13 17
## [1] 2
One useful operation on a matrix is to swap its rows and columns with t(), which stands for transpose.
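Applied to the row-filled matrix (assumed to be the one transposed in the text):

```r
m <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE)

# t() turns the 5 x 4 matrix into a 4 x 5 matrix
t(m)
```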
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
The size of a matrix is slightly more complicated since it has two dimensions. There are three basic operations:
- length(): total number of elements
- ncol(): total number of columns
- nrow(): total number of rows
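The matrix used for the output below is not shown. A matrix with 3 rows and 2 columns, assumed here, gives the same numbers:

```r
# Hypothetical 3 x 2 matrix
small <- matrix(1:6, nrow = 3, ncol = 2)

length(small)  # total number of elements
ncol(small)    # number of columns
nrow(small)    # number of rows
```

dim(small) returns the number of rows and columns together in one call.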
## [1] 6
## [1] 2
## [1] 3
2.3.4 Array
An array behaves like a matrix but can have more than two dimensions. To define an array from a vector, the syntax is array(vector, c(nrow, ncol, nmatrix)).
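A sketch that agrees with the output, assuming the input vector is 1:12:

```r
# Two 2 x 3 matrices stacked along a third dimension; each is filled by column
arr <- array(1:12, dim = c(2, 3, 2))
arr
```

Elements are addressed with three indices, so arr[2, 3, 1] is the element in row 2, column 3 of the first matrix.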
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
2.3.5 Dataframe
The dataframe is the most useful data structure in R. It behaves like a matrix but can contain vectors of different types. That is, we can have character and numeric vectors together, which is not feasible in a matrix or an array.
To visualize a dataframe, one may consider a spreadsheet: each column is a vector and each spreadsheet is a dataframe, that is, a collection of columns of cells.
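A dataframe is created with data.frame(). Passing unnamed vectors, as assumed in this sketch, leads to the automatically generated column names seen in the output.

```r
# Two unnamed columns: one numeric, one character
df <- data.frame(c(1, 2), c("Good", "Bad"))
df
```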
## c.1..2. c..Good....Bad..
## 1 1 Good
## 2 2 Bad
Note that the first row shows the column names. R automatically names each column based on the expression used to create that vector. However, it looks ugly. We should give names to the rows and columns to improve the readability of the data.
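One way to do this, assumed here, is with colnames() and rownames():

```r
df <- data.frame(c(1, 2), c("Good", "Bad"))

# Replace the automatic names with readable ones
colnames(df) <- c("GPA", "outcomes")
rownames(df) <- c("John", "Mary")
df
```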
## GPA outcomes
## John 1 Good
## Mary 2 Bad
To avoid having to rename the columns, we can simply specify the column names when creating the dataframe.
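In data.frame(), each argument name becomes a column name:

```r
# name = vector pairs set the column names at creation time
df <- data.frame(GPA = c(1, 2), outcomes = c("Good", "Bad"))
df
```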
## GPA outcomes
## 1 1 Good
## 2 2 Bad
One can first define the vectors and then define the dataframe based on them.
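A sketch with values taken from the output:

```r
x <- c(2, 4)
y <- c(1, 3)

# The columns take their names from the variable names
df <- data.frame(x, y)
df
```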
## x y
## 1 2 1
## 2 4 3
More compact code defines the vectors and the dataframe at the same time.
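The same dataframe in a single call, as a sketch:

```r
# No intermediate variables x and y are created in the workspace
df <- data.frame(x = c(2, 4), y = c(1, 3))
df
```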
## x y
## 1 2 1
## 2 4 3
Calling particular elements in a dataframe shares the same syntax as in a matrix or an array.
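The five calls are not shown in the text. The following is one plausible set that gives output of the same shape; the exact calls used originally are an assumption.

```r
df <- data.frame(x = c(2, 4), y = c(1, 3))

df$x      # column x as a vector
df["x"]   # column x as a one-column dataframe
df[1, ]   # first row, all columns
df[, 1]   # first column as a vector
df[2, 1]  # element in row 2, column 1
```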
## [1] 2 4
## x
## 1 2
## 2 4
## x y
## 1 2 1
## [1] 2 4
## [1] 4
To remove a particular vector from a dataframe, simply assign NULL to it.
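For example, removing the column x:

```r
df <- data.frame(x = c(2, 4), y = c(1, 3))

# Assigning NULL drops the column; the other columns are unaffected
df$x <- NULL
df
```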
## y
## 1 1
## 2 3
New columns can be created directly.
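The definition of the new column is not stated in the text. The sketch below assumes z is the sum of x and y, which agrees with the values in the output.

```r
df <- data.frame(x = c(2, 4), y = c(1, 3))

# Assigning to a column name that does not exist yet creates the column
df$z <- df$x + df$y  # assumed definition of z
df
```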
## x y z
## 1 2 1 3
## 2 4 3 7
2.3.6 List
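A list is the most comprehensive data structure. It can contain anything: vectors, arrays, matrices and even dataframes.
x <- c(2, 3, 5)                                   # numeric vector
df <- data.frame(y = c(2, 3, 4), z = c(1, 3, 5))  # dataframe
name <- c("NUS", "NTU", "SMU")                    # character vector
x <- list(x, df, name)                            # x now refers to the list holding all three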
To access the first element in the list x, use x[[1]]. Similarly, use x[[2]] for the second element.
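Putting the pieces together, the list and the two accesses look like this:

```r
x <- c(2, 3, 5)
df <- data.frame(y = c(2, 3, 4), z = c(1, 3, 5))
name <- c("NUS", "NTU", "SMU")
x <- list(x, df, name)

x[[1]]  # the numeric vector
x[[2]]  # the dataframe
```

Double brackets return the element itself, whereas single brackets such as x[1] return a list of length one that contains the element.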
## [1] 2 3 5
## y z
## 1 2 1
## 2 3 3
## 3 4 5