Chapter 3 Data
This chapter gives an overview of the main data storage objects in base R, including their creation, indexing, and key characteristics.
Note that in R every data point is considered a vector, whose mode can be numeric, character, or logical.
Data Type | Creation | Indexing | Characteristics |
---|---|---|---|
Vector | c(1,2,3) | [1] | Homogeneous, 1D |
List | list(1, 'A') | [[1]], [1] | Heterogeneous, mutable |
Matrix | matrix(1:9,3,3) | [row, col] | Homogeneous, 2D |
Data Frame | data.frame() | [row, col], $col | Heterogeneous, 2D |
Factor | factor(c('A','B')) | [1], levels() | Categorical, levels |
Array | array(1:12, c(3,2,2)) | [row,col,dim] | Homogeneous, multi-dimensional |
3.1 Vector
The vector is the fundamental data type in R. R has no separate scalar type: a single value is simply a vector of length 1.
3.1.3 Key Characteristics
- Homogeneous (all elements must be of the same type)
x <- c(44,55,66) # creating
x[2] # Indexing
x[1:2] # Indexing
x <- c(x,x,2*x)
x
length(x) # length of x
y <- vector(length=3)
y
y[1] <- 2 # assigning a number coerces the logical vector to double, so the FALSE entries become 0
typeof(y) # to check the vector type
y[3] <- 'A'
typeof(y) # once a character is inserted everything else becomes the character type
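The coercion seen above follows a fixed ordering of types, from least to most general: logical, integer, double, character. A minimal sketch of that ordering (the names v_int, v_dbl and v_chr are only illustrative):

```r
# Combining types promotes every element to the most general type present
v_int <- c(TRUE, 2L)      # logical + integer   -> integer
v_dbl <- c(v_int, 3.5)    # integer + double    -> double
v_chr <- c(v_dbl, "A")    # double  + character -> character
typeof(v_int)
typeof(v_dbl)
typeof(v_chr)
v_chr                     # every element is now a string
```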
Numeric values come in two types: integer and double.
x <- 10.0; typeof(x)
y <- 10L; typeof(y)
In R, explicitly declaring the integer type (with the L suffix) can be beneficial in certain scenarios, even though R defaults to double for numbers. Here is why you might want to declare integers explicitly:
### 1. Memory efficiency
object.size(x) # dominated by the fixed per-object overhead
object.size(y) # same size for a length-1 vector, but the difference matters in large datasets
### 2. Performance optimization
##### integer operations can be cheaper, but the gain depends on the function, so time it to check
ii <- as.integer(1:1e7)
dd <- as.numeric(1:1e7)
system.time(mean(ii))
system.time(mean(dd))
system.time(sd(ii))
system.time(sd(dd))
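The memory point is easier to see on long vectors, where the fixed overhead no longer dominates. A small sketch, assuming a length of one million (the names n, int_vec, dbl_vec and size_ratio are illustrative):

```r
n <- 1e6
int_vec <- integer(n)   # integers are stored in 4 bytes each
dbl_vec <- numeric(n)   # doubles are stored in 8 bytes each
object.size(int_vec)
object.size(dbl_vec)
# ratio of the two sizes; close to 2 for long vectors
size_ratio <- as.numeric(object.size(dbl_vec)) / as.numeric(object.size(int_vec))
size_ratio
```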
Let’s check the recycling property of vectors in R:
c(1,2,3) + c(3,4,5)
c(1,2,3) + c(3,4,5,6) # with warning
c(1,2,3) + c(3,4,5,6,7) # with warning
c(1,2,3) + c(3,4,5,6,7,8)
c(1,2,3) + 5
c(1,2,3) * c(3,4,5)
c(1,2,3) * c(3,4,5,6) # with warning
c(1,2,3) * c(3,4,5,6,7) # with warning
c(1,2,3) * c(3,4,5,6,7,8)
c(1,2,3) * 5
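So R automatically recycles (repeats) the shorter vector until it matches the length of the longer one, and it issues a warning when the longer length is not a multiple of the shorter one. The same rule applies to the other arithmetic and comparison operators.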
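To make the rule concrete, the recycling can be written out by hand with rep_len(). This is only a sketch of what R does implicitly; the names a, b, a_recycled, explicit and implicit are illustrative:

```r
a <- c(1, 2, 3)
b <- c(3, 4, 5, 6, 7, 8)
a_recycled <- rep_len(a, length(b))  # 1 2 3 1 2 3
explicit <- a_recycled + b           # addition after manual recycling
implicit <- a + b                    # R recycles a automatically
identical(explicit, implicit)
```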
Here are some examples of matrix multiplication on vectors:
x <- c(1,2,3)
x %*% x
x %*% t(x)
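The two products above are the inner and the outer product. A short check, assuming the same x (inner and outer_prod are illustrative names):

```r
x <- c(1, 2, 3)
inner <- x %*% x            # 1 x 1 matrix holding sum(x^2)
outer_prod <- x %*% t(x)    # 3 x 3 matrix with entries x[i] * x[j]
dim(inner)
dim(outer_prod)
drop(inner) == sum(x^2)             # drop() turns the 1 x 1 matrix into a plain number
all.equal(outer_prod, outer(x, x))  # outer() builds the same matrix
```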
Negative indexing in R is used to exclude elements:
x <- c(1,2,3)
x[-2]
x[-length(x)] # exclude the last element
3.1.4 Generating vector sequences
There are useful functions to generate sequences.
1:10
seq(1,10)
seq(1,10,by=2)
seq(10)
We can also repeat elements with rep():
rep(c(1,2,3),3)
rep(c(1,2,3),each=3)
rep(c(1,2,3),c(3,3,3))
3.1.5 Working with logical vectors
Logical vectors are very useful in conditional filtering. Let’s think about extracting numbers greater than 5 from a sequence.
x <- 1:10
# Method 1 : using indexing with logical vector #
x>5 # It produces logical vectors after comparing each element of x with 5
which(x>5) # getting the indices
x[which(x>5)] # getting the elements
# Method 2
x[x>5]
In a lower-level language without vectorized logical indexing, we would need to write a loop such as:
gN <- function(x,cut) {
  # keep the elements of x that are greater than cut
  stopifnot(is.numeric(x), !anyNA(x))
  out <- numeric()
  for (i in seq_along(x)) {
    if (x[i]>cut) out <- c(out,x[i])
  }
  out
}
gN(1:10,5L)
gN(c(1:10,'A'),3L) # error: x is coerced to character, so is.numeric(x) fails
gN(c(1:10,NA),3L) # error: x contains NA
We can also filter and replace. For a vector x, build a vector z that is 0 where the element of x is less than 5 and 1 otherwise.
x <- 1:20
z <- rep(1,length(x))
z[x<5] = 0
z
However, the ifelse() function is more convenient:
ifelse(x>=5,1,0)
ifelse(x>=5,x*10,exp(x)) # see another example
Two other useful functions, any() and all(), take a logical vector as input:
any(x>5)
all(x>5)
3.1.6 Using NA and NULL Values
In R, missing values are denoted as NA (not “NA”). Many R functions include options to handle those missing values.
x<- c(3,NA,2)
xx <- c(3,"NA",2)
is.na(x)
is.na(xx)
mean(x)
mean(x,na.rm=T)
NULL, on the other hand, means that the value simply does not exist, rather than existing but being unknown.
x <- c(3,NULL,2)
mean(x) # c() drops NULL, so x is just c(3,2)
Let’s see what happens when we use NA and NULL to initialize a vector.
out <- NULL
for (i in 1:10) if (i%%2==0) out <- c(out,i)
out
out <- NA
for (i in 1:10) if (i%%2==0) out <- c(out,i)
out
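The difference between the two starting values shows up in the length of the result. A minimal sketch running both loops side by side (out_null and out_na are illustrative names):

```r
out_null <- NULL
out_na <- NA
for (i in 1:10) {
  if (i %% 2 == 0) {
    out_null <- c(out_null, i)  # NULL contributes no element
    out_na <- c(out_na, i)      # the initial NA stays as the first element
  }
}
length(out_null)
length(out_na)
out_na[1]
```

Starting from NULL therefore gives a clean result, while starting from NA leaves one missing value that has to be removed afterwards.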
3.1.7 Misc
You can name the vector elements:
x = 1:10
names(x)
names(x) = LETTERS[1:10]
x
names(x)
x['H'] # indexing by name
3.1.8 DIY
Implement Kendall’s tau rank correlation coefficient. Consider two vectors x and y of length n. Write a function that calculates Kendall’s tau, and check it by comparing its results with those of the built-in R function, cor(x, y, method='kendall'). Do not use the built-in R function for extracting the sign.
3.2 Matrix and Array
In R, a matrix is essentially a vector with two additional attributes: rows and columns. Matrices are a special case of a more general R object type, the array, which represents multidimensional data. For example, a three-dimensional array consists of rows, columns, and layers. Such structures are also referred to as tensors.
3.2.1 Creation
Vector in, Matrix out:
mat <- matrix(1:9, nrow = 3, ncol = 3) # 3x3 matrix
mat
mat <- matrix(1:9,ncol=3) # do not need to specify both ncol and nrow
mat_byrow <- matrix(1:9,ncol=3,byrow=T)
mat_byrow
Compare how the two calls arrange the elements: by default R fills a matrix in column-major order, and byrow=T switches to row-by-row filling. Matrices are combined with the cbind() and rbind() functions:
a <- matrix(0,2,3)
b <- matrix(1,3,3)
rbind(a,b) # row merge
a <- matrix(0,2,3)
b <- matrix(1,2,2)
cbind(a,b) #column merge
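Column-major order also determines how a single index addresses a matrix. A small sketch, assuming a 2 x 3 matrix m (the names m, i, j and pos are illustrative):

```r
m <- matrix(1:6, nrow = 2)   # filled column by column
m
i <- 2; j <- 3
# element [i, j] sits at position (j - 1) * nrow(m) + i of the underlying vector
pos <- (j - 1) * nrow(m) + i
m[i, j]
m[pos]                       # same element, reached with one index
```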
3.2.2 Indexing
mat[1, 2] # Row 1, Column 2
mat[, 2] # Entire second column (vector type)
mat[1, ] # Entire first row (vector type)
mat[,2,drop=F] # important when we make functions using matrix
mat[1,,drop=F]
# Negative indexing (filtering out)
mat[-1,]
mat[-c(1,3),-c(1,2)] # a single element remains, returned as a vector
mat[-c(1,3),-c(1,2),drop=F]
# Logical Filtering
x = matrix(rnorm(20),ncol=4)
x>0
x[x>0] # vector type
which(x>0)
which(x>0,arr.ind=T)
# Exercise: set negative values to 0
xx = matrix(0,ncol=ncol(x),nrow=nrow(x))
w = which(x>0)
xx[w] = x[w]
xx
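The same result can be reached more directly. Two possible alternatives are sketched here, with a fixed seed added only so the example is reproducible (xx1 and xx2 are illustrative names):

```r
set.seed(1)                      # assumption: seed chosen only for reproducibility
x <- matrix(rnorm(20), ncol = 4)
xx1 <- x
xx1[xx1 < 0] <- 0                # logical indexing on a copy of x
xx2 <- pmax(x, 0)                # elementwise maximum with 0, keeps the matrix shape
all.equal(xx1, xx2)
```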
We can also specify each element:
mat <- matrix(rep(NA,9),ncol=3)
mat[2,3] <- 5
mat[3,1] <- 9
3.2.3 Key Characteristics
- Homogeneous (all elements must be of the same type)
- 2D structure (rows and columns)
- Supports arithmetic operations
We can check the homogeneous characteristics of a matrix:
mat <- matrix(1:9, ncol=3)
typeof(mat)
mat[3,3] <- 'A'
typeof(mat)
We can perform linear algebra operations on matrices, although the operators work a little differently from standard mathematical notation: * multiplies elementwise, and matrix multiplication uses %*%.
x <- matrix(1:4,ncol=2)
y <- 2*x # multiplication with scalar
x * y # elementwise multiplication
x%*%y # matrix multiplication
t(x) # transpose
x + y # addition
# check how it works with a vector (useful for efficient coding in R) #
cy = c(y) # vectorize
x*cy
x+cy
# check the recycling property
x * cy[1:2]
x + cy[1:2]
x/cy[1:1]
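Because a matrix is stored column by column, a short vector is recycled down the columns, so element i of the vector meets row i. If the intent is instead to apply element j to column j, sweep() is one option. A sketch with illustrative names v, by_recycling and by_column:

```r
x <- matrix(1:4, ncol = 2)
v <- c(10, 20)
by_recycling <- x + v                     # row i gets v[i]
by_column <- sweep(x, 2, v, FUN = "+")    # column j gets v[j]
by_recycling
by_column
```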
3.2.4 Misc
Instead of writing for loops in R, use the apply() function. Recall that sapply() was used for vectors.
?apply
mat <- matrix(rnorm(1000),ncol=10)
apply(mat,1,mean) # rowwise mean
apply(mat,2,mean) # column wise mean
colMeans(mat)
# matrix with missing
mat2 <- mat
mat2[sample(1:1000,10)] = NA
apply(mat2,2,mean)
apply(mat2,2,mean,na.rm=T)
# apply on your own functions
absMean <- function(x) {mean(abs(x),na.rm=T)} # Compute mean of absolute values
apply(mat2,2,function(i) absMean(i))
apply(mat2,2, function(i) mean(abs(i),na.rm=T)) # make it in one line of code
A matrix is just a vector but with two additional attributes: the number of rows and number of columns.
vx <- 1:6 # a vector
length(vx)
class(vx)
attributes(vx)
str(vx)
x <- matrix(vx, ncol=2)
class(x)
attributes(x)
dim(x)
ncol(x)
nrow(x)
nrow # Check the function code
ncol # Check the function code
str(x)
mvx <- as.matrix(vx)
class(mvx)
We can name matrix rows and columns.
colnames(x)
colnames(x) <- LETTERS[1:ncol(x)]
colnames(x)
x
rownames(x)
rownames(x) <- LETTERS[1:nrow(x)]
rownames(x)
x
3.2.5 Higher-Dimensional Arrays
In a statistical context, a typical matrix in R has rows corresponding to observations (sample units) and columns corresponding to variables, such as weight and blood pressure. The matrix is then a 2D data structure. But suppose we also have data taken at different times, one data point per person per variable per time. Time then becomes the third dimension. In R, such data sets are called arrays.
Let’s take a simple example of students’ test scores from two different tests:
# first test scores for three students and two questions
firsttest <- matrix(c(46,21,50,30,25,50),ncol=2)
secondtest <- matrix(c(46,41,50,43,35,50),ncol=2)
firsttest
secondtest
# The two same dimension matrices saved in an array object
scores <- array(data=c(firsttest,secondtest),dim=c(3,2,2))
class(scores)
dim(scores)
scores # it is displayed layer by layer
# get the score for the second student and first question at second test
scores[2,1,2]
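apply() works on arrays as well, with the margin naming the dimensions to keep. A minimal sketch that rebuilds the scores array and sums over questions (totals is an illustrative name):

```r
firsttest <- matrix(c(46,21,50,30,25,50), ncol = 2)
secondtest <- matrix(c(46,41,50,43,35,50), ncol = 2)
scores <- array(data = c(firsttest, secondtest), dim = c(3, 2, 2))
# keep dimensions 1 (student) and 3 (test), sum over dimension 2 (question)
totals <- apply(scores, c(1, 3), sum)
totals   # rows: students, columns: tests
```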
3.2.6 Example: Imaging data
DIY: Download an image of the Mount Rushmore National Memorial in the United States and convert it to a greyscale image file using the pixmap library.
library(pixmap)
library(jpeg)
img <- readJPEG("files/Mount_Rushmore_National_Memorial.jpeg")
gray_img <- (img[,,1] + img[,,2] + img[,,3]) / 3
pgm_image <- pixmapGrey(gray_img)
write.pnm(pgm_image, "files/Mount_Rushmore_National_Memorial.pgm", type = "pgm")
Let’s check the pgm file.
pgm_image
str(pgm_image)
plot(pgm_image)
The class here is of the S4 type, whose components are designated by ‘@’, rather than ‘$’ (we will be discussing it in detail later). The key item here is ‘pgm_image@grey’.
is.matrix(pgm_image@grey)
dim(pgm_image@grey) # 1800 x 2400
hist(c(pgm_image@grey))
# blot out the image
tmp <- pgm_image
tmp@grey[1:500,1:500] <- 1
tmp@grey[1000:1500,1000:1200] <- 0
plot(tmp)
# Blur a part
make_blur <- function(img, rows, cols, q) {
  #### Input
  #### - img : image file (class pixmap)
  #### - rows: row indexes to blur (vector)
  #### - cols: col indexes to blur (vector)
  #### - q: a scalar for intensity of blur
  nrows <- length(rows)
  ncols <- length(cols)
  outimg <- img
  outimg@grey[rows,cols] <- (1-q) * img@grey[rows,cols] + q * matrix(runif(nrows*ncols),nrow=nrows,ncol=ncols)
  return(outimg)
}
# Disguise President Roosevelt's identity by adding random noise
plot(make_blur(pgm_image,900:1500,1200:1600,0.6)) # change the q value
3.3 List
In contrast to a vector, in which all elements must be of the same mode, R’s list structure can combine objects of different types. The list plays a central role in R, forming the basis for data frames, object-oriented programming, and so on. Many R functions return their results as a list; the object returned by lm() is one example.
3.3.1 Creation
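The calls that produce the output below are not shown. A minimal sketch consistent with that output follows. The names lst and named_lst are assumptions, and only alst appears in the text:

```r
# Unnamed list: components are reached by position
lst <- list(1, "Hello", TRUE, c(2, 4, 6))
lst
# Named list: components can also be reached by name
named_lst <- list(number = 1, character = "Hello", logical = TRUE, vector = c(2, 4, 6))
named_lst
# Start a list with one named component, then read it back
alst <- list(a = 3)
alst
alst$a
```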
## [[1]]
## [1] 1
##
## [[2]]
## [1] "Hello"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 2 4 6
## $number
## [1] 1
##
## $character
## [1] "Hello"
##
## $logical
## [1] TRUE
##
## $vector
## [1] 2 4 6
## $a
## [1] 3
## [1] 3
alst$b = 4
alst[3:6] = c(F,T,F,T)
# Delete an element by setting it to NULL
alst$b = NULL
# getting the length of a list
length(alst)
## [1] 5
3.3.2 Indexing
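The indexing calls behind the output below are not shown either. The following is one set of calls that would print the same values, offered as a guess. It assumes the hypothetical names lst and named_lst for an unnamed and a named list:

```r
lst <- list(1, "Hello", TRUE, c(2, 4, 6))
named_lst <- list(number = 1, character = "Hello", logical = TRUE, vector = c(2, 4, 6))
lst[[1]]                  # double brackets return the component itself
lst[1]                    # single brackets return a sub-list of length 1
lst[[4]][2]               # second element of the fourth component
named_lst$character       # access by name with $
named_lst[["logical"]]    # access by name with [[ ]]
named_lst$vector[3]       # third element of the component named vector
named_lst[["character"]]
```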
## [1] 1
## [[1]]
## [1] 1
## [1] 4
## [1] "Hello"
## [1] TRUE
## [1] 6
## [1] "Hello"
3.3.3 Key Characteristics
- Heterogeneous (elements can have different data types)
- Mutable (can modify elements)
- Can store other lists (nested lists)
3.3.4 Misc
Transform a list to a vector.
lst <- list(name='Joe',salary='50000',union='TRUE')
lst
names(lst)
unlist(lst) # becomes a vector type
We can also do looping using the apply() family, lapply() and sapply()
lst <- list(rnorm(10),rnorm(20),rnorm(30),rnorm(40))
#compute median of each element of the list
lapply(lst,median) # list output
sapply(lst,median) # vector output
We can make recursive lists
a = list(u=5,w=30)
b = list(k='K',l='L')
ll = list(aa=a,bb=b)
ll[[1]][['u']]
ll$aa$u
c(ll)
c(ll,recursive=T)
3.4 Data Frame
A data frame is like a matrix, with a two-dimensional structure of rows and columns. However, it differs from a matrix in that each column may have a different mode. Just as lists are the heterogeneous analogs of vectors in one dimension, data frames are the heterogeneous analogs of matrices for two-dimensional data.
Technically, a data frame is a list whose components are equal-length vectors. It is the most typical form for data on n samples and p variables of different types, such as age, sex, BMI, and disease status.
3.4.2 Indexing
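The output below comes from a small data frame whose creation is not shown. A sketch that rebuilds it from the printed values and indexes it in several ways follows. The name df and the exact calls are assumptions, and the chr column type in the str() output assumes R 4.0 or later:

```r
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(90, 85))
df[1, 2]       # row 1, column 2
df$Name        # a column by name
df$Age
df[1, ]        # the first row, still a data frame
df[["Name"]]   # list-style access to a column
df[, "Age"]    # matrix-style access to a column
str(df)
```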
## [1] 25
## [1] "Alice" "Bob"
## [1] 25 30
## Name Age Score
## 1 Alice 25 90
## [1] "Alice" "Bob"
## [1] 25 30
## 'data.frame': 2 obs. of 3 variables:
## $ Name : chr "Alice" "Bob"
## $ Age : num 25 30
## $ Score: num 90 85
3.4.3 Key Characteristics
- Heterogeneous (columns can have different types)
- 2D structure similar to a table
- Can be converted to tibble (modern version)
3.4.4 Matrix like operations
Operations used on matrices can also be applied to data frames. Let’s look at the iris data.
?iris
class(iris)
str(iris)
iris[1:3,1:4]
iris[,1]
iris[,1,drop=F]
head(iris)
tail(iris)
names(iris)
# extract rows with petal length > 6
iris[iris$Petal.Length>6,]
# extract setosa data #
setosa = iris[iris$Species=='setosa',]
# extract virginica data #
virginica = iris[iris$Species=='virginica',]
# merge setosa + virginica
newdat = rbind(setosa,virginica)
# add columns
n = nrow(iris)
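newdat = cbind(iris,l = rep(letters,length.out=n),L = rep(LETTERS,length.out=n)) # recycle the 26 letters over all n rows; letters[1:n] would give NA beyond 26
dim(newdat)
head(newdat)
# maybe this is more convenient treating as a list (you can update data anytime)
iris$l = rep(letters,length.out=n)
iris$L = rep(LETTERS,length.out=n)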
head(iris)
3.4.5 Using apply() on columns of the same type
apply(iris[,1:4],2,mean)
lst = lapply(iris,sort) # column-wise sorting, does not make statistical sense
lst
data.frame(lst)
3.4.6 Treatment of NA values
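data(iris) # reload the original iris data, dropping the columns added above
# Assign NA values randomly
n = nrow(iris)
p = ncol(iris)
w = cbind(sample(1:n, 10),rep(sample(1:p, p),2)) # 10 (row, column) pairs, two per column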
iris[w] = NA
# count the NA values in each column
colSums(is.na(iris))
# getting complete samples
complete.cases(iris)
new_iris = iris[complete.cases(iris),]
sum(is.na(new_iris))
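# extracting data
iris[iris$Petal.Length>6,] # rows where Petal.Length is NA come back as all-NA rows
subset(iris,Petal.Length>6) # subset() drops rows where the condition is NA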
3.5 Factors and Tables
Factors form the basis for many of R’s powerful operations, including many of those performed on tabular data. The motivation for factors comes from the notion of nominal, or categorical, variables in statistics. These values are nonnumerical in nature, corresponding to categories such as Male and Female, although they may be coded using numbers. Manipulating factors is extremely important in statistical modeling such as regression.
An R factor might be viewed simply as a vector with a bit more information added. That extra information consists of a record of the distinct values in that vector, called levels.
3.5.1 Creation
x <- c(3,4,5,6)
xf <- factor(x,levels=c(3,6,4,5))
x
xf
str(x)
str(xf)
length(xf)
xf[2] <- 88 # 88 is not one of the levels: R warns and stores NA
str(c(xf,55)) # becomes numeric
fct <- factor(c("Low", "Medium", "High", "Low"))
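Internally a factor stores integer codes that point into its levels. A short sketch with a freshly created xf shows the two pieces, plus the usual way to get the original numbers back:

```r
x <- c(3, 4, 5, 6)
xf <- factor(x, levels = c(3, 6, 4, 5))
levels(xf)                    # the distinct values, stored as strings
as.integer(xf)                # the codes: position of each value within levels(xf)
# To recover the original numbers, convert the labels, not the codes
as.numeric(as.character(xf))
```

Calling as.numeric(xf) directly would return the codes rather than the values, which is a common source of silent errors.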
3.5.3 Key Characteristics
- Categorical (used for grouping data)
- Levels define the possible values
- Improves efficiency in statistical modeling
3.5.4 Common Functions Used with Factors
Let’s look at another member of the apply family, the tapply() function.
ages <- c(25,26,55,37,21,42)
affils <- c('R','D','D','R','U','D') # works as a character vector
tapply(ages,affils,mean)
# if you want to customize the order of the factor levels
f.affils <- factor(affils,levels=c('U','D','R'))
tapply(ages,f.affils,mean)
Let’s use the iris data to calculate the mean sepal length for each combination of species and whether Petal.Length exceeds 3.75.
comb <- with(iris,list(Species,ifelse(Petal.Length>3.75,'long petal','short petal')))
table(comb) # count the number of observations in each combination
with(iris,tapply(Sepal.Length,comb,mean)) # mean of Sepal.Length
# Divide data corresponding to factor levels
with(iris, split(iris[,1:4],comb))
# Use 'by' function
by(iris[,1:4],iris$Species, function(m) mean(m$Sepal.Length,na.rm=T))
by(iris[,1:4],comb, function(m) mean(m$Sepal.Length,na.rm=T))
3.5.5 Working with Tables
Calculate a contingency table for the comb object from above.
str(comb)
ta = table(comb)
ta
str(ta)
class(ta)
table(comb[[1]],comb[[2]])
table(data.frame(comb))
# works for one variable
table(iris$Species)
# Array like operations on table object
apply(iris[,1:4],2,median)
ta <- with(iris,table(Species, ifelse(Sepal.Width>3,'wide','narrow'), ifelse(Sepal.Length>5.8,'long','short')))
ta/sum(ta) # proportions of all observations (nrow(ta) is only the number of species)
dim(ta)
apply(ta,1,sum) # marginal frequency of Species
apply(ta,2,sum) # marginal frequency of Sepal.Width
apply(ta,c(1,2),sum) # marginal frequency of Species and Sepal.Width
apply(ta,c(1,3),sum) # marginal frequency of Species and Sepal.Length
apply(ta,c(2,3),sum) # marginal frequency of Sepal.Width and Sepal.Length
In statistics, marginal values of a variable are those obtained when this variable is held constant while others are summed.
addmargins(ta)
dimnames(ta)
dimnames(ta)[[3]] = c('L','S')
ta
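The marginal counts computed with apply() can be cross-checked with margin.table(). This sketch uses a fresh copy of the built-in data so that it does not depend on earlier modifications. The names ir, ta3, m_apply and m_margin, and the dimension labels Width and Length, are illustrative:

```r
ir <- datasets::iris   # untouched copy of the built-in data
ta3 <- with(ir, table(Species,
                      Width = ifelse(Sepal.Width > 3, "wide", "narrow"),
                      Length = ifelse(Sepal.Length > 5.8, "long", "short")))
m_apply <- apply(ta3, 1, sum)       # sum over the other two dimensions
m_margin <- margin.table(ta3, 1)    # same marginal counts for Species
m_apply
m_margin
sum(ta3) == nrow(ir)                # every observation is counted once
```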
#check the table function by typing
table
3.5.6 Misc
R includes a number of other functions that are handy for working with tables and factors.
aggregate(iris[,1:4],list(iris$Species,ifelse(iris$Petal.Width>median(iris$Petal.Width),1,0)),mean)
x = rnorm(100)
summary(x)
cut(x,breaks=c(min(x),median(x),max(x)),include.lowest=T) # include.lowest keeps min(x) from becoming NA
3.5.7 Example: working with texts
DIY: Compute word counts from a text file (use the split() function)
tx <- scan('files/Research_summary.txt',what='character', quiet=T)
str(tx)
tx <- gsub('[.,()-]','',tx) # remove . , ( ) and -
words <- split(seq_along(tx),tx) # split() coerces tx to a factor; each component holds the positions of one word
ta.words <- table(tx) # compare this with the result from the split function
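The same steps can be tried without the text file. The sketch below uses a made-up sentence (txt is a hypothetical stand-in for the file contents) and adds tolower(), which is not part of the original steps, so that "The" and "the" count as one word:

```r
txt <- "the cat sat on the mat. The dog (a big dog) sat too."  # made-up text
tx <- scan(text = txt, what = "character", quiet = TRUE)
tx <- gsub("[.,()-]", "", tx)     # strip punctuation
tx <- tolower(tx)                 # assumption: treat upper and lower case alike
positions <- split(seq_along(tx), tx)   # word -> positions where it occurs
counts_split <- lengths(positions)      # number of positions = word count
counts_table <- table(tx)               # the same counts from table()
counts_split
counts_table
```

The split() result carries more information than the table, since it records where each word occurs as well as how often.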