Chapter 3 Data

This chapter gives an overview of the main data storage objects in base R, including their creation, indexing, and key characteristics.

Note that in R every data point is considered to be a vector, which has a mode such as numeric, character, or logical.

Table 3.1: Summary of Data Storage Object Types in R
Data Type	Creation	Indexing	Characteristics
Vector	c(1,2,3)	[1]	Homogeneous, 1D
List	list(1, 'A')	[[1]], [1]	Heterogeneous, mutable
Matrix	matrix(1:9,3,3)	[row, col]	Homogeneous, 2D
Data Frame	data.frame()	[row, col], $col	Heterogeneous, 2D
Factor	factor(c('A','B'))	[1], levels()	Categorical, levels
Array	array(1:12, c(3,2,2))	[row,col,dim]	Homogeneous, multi-dimensional

3.1 Vector

The vector is the fundamental data type in R. R has no separate concept of a scalar; a single value is simply a vector of length 1.

3.1.1 Creation

vec <- c(1, 2, 3, 4, 5)  # Numeric vector
char_vec <- c("A", "B", "C")  # Character vector
log_vec <- c(TRUE, FALSE, TRUE)  # Logical vector

3.1.2 Indexing

vec[1]  # First element

## [1] 1

vec[length(vec)]  # Last element

## [1] 5

vec[2:4]  # Slicing elements from index 2 to 4

## [1] 2 3 4

3.1.3 Key Characteristics

Homogeneous (all elements must be of the same type)

x <- c(44,55,66) # creating
x[2] # Indexing
x[1:2] # Indexing
x <- c(x,x,2*x)
x
length(x) # length of x

y <- vector(length=3) # defaults to a logical vector of FALSE
y
y[1] <- 2 # assigning a number coerces the vector to double; the remaining FALSE values become 0
typeof(y) # to check the vector type
y[3] <- 'A'
typeof(y) # once a character is inserted everything else becomes the character type
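
The coercion seen above follows a fixed order: logical, then integer, then double, then character. Combining values always promotes them to the most general type present. A minimal sketch (the variable names are illustrative only):

```r
# c() promotes its arguments to the most general type present
a <- c(TRUE, 2L)            # logical + integer
b <- c(TRUE, 2L, 3.5)       # ... + double
d <- c(TRUE, 2L, 3.5, "A")  # ... + character
typeof(a)
typeof(b)
typeof(d)
d   # note that TRUE is now stored as the string "TRUE"
```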

Numeric values come in two types: integer and double.

x <- 10.0; typeof(x)
y <- 10L; typeof(y)

In R, explicitly declaring the integer type (by appending L to a number) can be beneficial in certain scenarios, even though R defaults to double for numbers. Here is why you might want to declare integers explicitly:

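### 1. Memory efficiency
object.size(x) # a length-1 vector is mostly fixed overhead
object.size(y) # same overhead here, but the difference matters in large datasets

### 2. Performance Optimization
##### integer operations can be cheaper, but the gain depends on the function and the machine, so time it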
ii <- as.integer(1:1e7)
dd <- as.numeric(1:1e7)
system.time(mean(ii))
system.time(mean(dd))
system.time(sd(ii))
system.time(sd(dd))
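
The memory difference only becomes visible for long vectors. The sketch below, using illustrative variable names, compares one million zeros stored as integer and as double; the exact byte counts depend on the R build, so only the ratio is worth reading.

```r
# One million zeros stored as integer and as double
n <- 1e6
int_vec <- integer(n)
dbl_vec <- numeric(n)
object.size(int_vec)  # roughly 4 bytes per element plus a small header
object.size(dbl_vec)  # roughly 8 bytes per element plus a small header
# the ratio approaches 2 as n grows
as.numeric(object.size(dbl_vec)) / as.numeric(object.size(int_vec))
```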

Let’s check the recycling property of vectors in R.

c(1,2,3) + c(3,4,5) 
c(1,2,3) + c(3,4,5,6) # with warning
c(1,2,3) + c(3,4,5,6,7) # with warning
c(1,2,3) + c(3,4,5,6,7,8)
c(1,2,3) + 5

c(1,2,3) * c(3,4,5) 
c(1,2,3) * c(3,4,5,6) # with warning
c(1,2,3) * c(3,4,5,6,7) # with warning
c(1,2,3) * c(3,4,5,6,7,8)
c(1,2,3) * 5

So R automatically recycles (repeats) the shorter vector until it is long enough to match the longer one, and warns when the longer length is not a multiple of the shorter one. The same rule applies to other operators.
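
Recycling can be written out by hand to see what R does internally. This is a small sketch with illustrative names; rep_len() repeats a vector up to a requested length.

```r
a <- c(1, 2, 3)
b <- c(3, 4, 5, 6, 7, 8)
# Recycling written out by hand: repeat a until it has length(b) elements
a_long <- rep_len(a, length(b))
a_long
identical(a + b, a_long + b)
# When the longer length is not a multiple of the shorter one,
# R still recycles but raises a warning, which we catch here
res <- tryCatch(a + c(3, 4, 5, 6), warning = function(w) "warning raised")
res
```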

Here are some examples of matrix multiplication on vectors:

x <- c(1,2,3)
x %*% x # inner product (a 1 x 1 matrix)
x %*% t(x) # outer product (a 3 x 3 matrix)

Negative indexing in R excludes elements:

x <- c(1,2,3)
x[-2]
x[-length(x)] # exclude the last element

3.1.4 Generating vector sequences

There are useful functions to generate sequences.

1:10
seq(1,10)
seq(1,10,by=2)
seq(10)

We can also repeat elements with rep():

rep(c(1,2,3),3)
rep(c(1,2,3),each=3)
rep(c(1,2,3),c(3,3,3))
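
Two related base R helpers, seq_len() and seq_along(), are worth knowing when a sequence is used as a loop index, because the colon operator counts downwards when the upper bound is 0. A short sketch:

```r
x <- numeric(0)            # an empty vector
1:length(x)                # 1 0, so a loop over it would run twice
seq_along(x)               # integer(0), so a loop over it would not run at all
seq_len(3)                 # 1 2 3
seq(0, 1, length.out = 5)  # five equally spaced values from 0 to 1
```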

3.1.5 Working with logical vectors

Logical vectors are very useful in conditional filtering. Let’s think about extracting numbers greater than 5 from a sequence.

x <- 1:10

# Method 1 : using indexing with logical vector #
x>5 # It produces logical vectors after comparing each element of x with 5
which(x>5) # getting the indices
x[which(x>5)] # getting the elements

# Method 2
x[x>5]

In languages without vectorized logical indexing, the same task would require an explicit loop, such as:

gN <- function(x,cut) {
  # x must be numeric and contain no missing values
  stopifnot(is.numeric(x), !anyNA(x))
  out <- numeric()
  for (i in seq_along(x)) {
    if (x[i]>cut) out <- c(out,x[i])
  }
  out
}
gN(1:10,5L)
gN(c(1:10,'A'),3L) # error: the input is coerced to character
gN(c(1:10,NA),3L) # error: the input contains NA

We can also filter and replace. Given a vector x, build a vector that is 0 where the element is less than 5 and 1 otherwise.

x <- 1:20
z <- rep(1,length(x))
z[x<5] = 0
z

However, the ifelse() function is more convenient:

ifelse(x>=5,1,0)
ifelse(x>=5,x*10,exp(x)) # see another example

Two other useful functions, any() and all(), take a logical vector as input:

any(x>5)
all(x>5)
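
Because TRUE counts as 1 and FALSE as 0 in arithmetic, a logical vector can also be summed or averaged to count matches. A small sketch:

```r
x <- 1:20
sum(x > 5)    # number of elements greater than 5
mean(x > 5)   # proportion of elements greater than 5
```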

3.1.6 Using NA and NULL Values

In R, missing values are denoted as NA (not “NA”). Many R functions include options to handle those missing values.

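x<- c(3,NA,2)
xx <- c(3,"NA",2) # "NA" is an ordinary string here, not a missing value
is.na(x)
is.na(xx)
mean(x) # NA: a missing value propagates
mean(x,na.rm=TRUE) # remove NAs before averaging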

NULL, on the other hand, indicates that the value simply does not exist, rather than existing but being unknown.

x <- c(3,NULL,2)
mean(x) # c() drops NULL, so x is c(3, 2) and the mean is 2.5

Let’s see what happens when we use NA and NULL to initialize a vector.

out <- NULL
for (i in 1:10) if (i%%2==0) out <- c(out,i)
out

out <- NA
for (i in 1:10) if (i%%2==0) out <- c(out,i)
out
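
The difference between the two results comes down to length: NULL has length 0 and disappears inside c(), whereas NA is a real element of length 1. The sketch below checks this and, as an alternative that is an assumption of this example rather than part of the text above, builds the same result without growing a vector in a loop.

```r
length(NULL)   # 0: NULL contributes nothing to c()
length(NA)     # 1: NA is a real element that stays in the result
is.null(NULL)
is.na(NA)

# Alternative: select the even numbers by logical indexing
i <- 1:10
evens <- i[i %% 2 == 0]
evens
```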

3.1.7 Misc

You can name the vector elements:

x = 1:10
names(x)
names(x) = LETTERS[1:10]
x
names(x)
x['H'] # indexing by name

3.1.8 DIY

Implement Kendall’s tau rank correlation coefficient. Consider two vectors of length $n$, $x$ and $y$. Write a function that calculates Kendall’s tau coefficient, and check it by comparing its results with those of the built-in R function (cor() with method = "kendall"). Do not use the built-in R function for extracting the sign (sign()).
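
For reference, one common definition, often called tau-a and valid when neither vector contains ties, is sketched below. Here $\operatorname{sgn}(a)$ denotes the sign of $a$ (1, 0, or -1). When ties are present, the built-in function may apply a tie correction, so the two results can differ.

$$\tau = \frac{2}{n(n-1)} \sum_{i<j} \operatorname{sgn}(x_i - x_j)\,\operatorname{sgn}(y_i - y_j)$$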

3.2 Matrix and Array

In R, a matrix is essentially a vector with two additional attributes: the number of rows and the number of columns. Matrices are a special case of a more general R object type, the array, which represents multidimensional data. For example, a three-dimensional array consists of rows, columns, and layers. Such structures are also referred to as tensors.

3.2.1 Creation

Vector in, Matrix out:

mat <- matrix(1:9, nrow = 3, ncol = 3)  # 3x3 matrix
mat
mat <- matrix(1:9,ncol=3) # do not need to specify both ncol and nrow
mat_byrow <- matrix(1:9,ncol=3,byrow=T)
mat_byrow

Note how the elements are arranged: by default R fills a matrix in column-major order (column by column) unless byrow=TRUE is given. Matrices are combined with the cbind() and rbind() functions:

a  <- matrix(0,2,3)
b <- matrix(1,3,3)
rbind(a,b) # row merge

a <- matrix(0,2,3)
b <- matrix(1,2,2)
cbind(a,b) #column merge
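
The column-major layout mentioned above also explains what a single index does on a matrix: it counts down the columns. A minimal sketch with illustrative names:

```r
m <- matrix(1:6, nrow = 2)   # filled column by column
m
m[3]                         # third stored value, which is row 1, column 2
m[1, 2]
as.vector(m)                 # storage order: 1 2 3 4 5 6
# With byrow = TRUE only the filling order changes; storage stays column-major
mb <- matrix(1:6, nrow = 2, byrow = TRUE)
as.vector(mb)                # 1 4 2 5 3 6
```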

3.2.2 Indexing

mat[1, 2]  # Row 1, Column 2
mat[, 2]   # Entire second column (vector type)
mat[1, ]   # Entire first row (vector type)
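mat[,2,drop=FALSE] # keeps the result a matrix; important inside functions that expect a matrix
mat[1,,drop=FALSE]

# Negative indexing (filtering out)
mat[-1,]
mat[-c(1,3),-c(1,2)] # a single remaining element collapses to a vector
mat[-c(1,3),-c(1,2),drop=FALSE] # stays a 1 x 1 matrix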

# Logical Filtering
x = matrix(rnorm(20),ncol=4)
x>0
x[x>0] # vector type
which(x>0)
which(x>0,arr.ind=T)

# Exercise: set negative values to 0
xx = matrix(0,ncol=ncol(x),nrow=nrow(x))
w = which(x>0)
xx[w] = x[w]
xx
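
To see why the drop argument matters inside functions, consider the sketch below. The helper count_cols() is hypothetical and exists only for this illustration: it selects some columns and reports how many columns the result has.

```r
# Hypothetical helper: number of columns left after selecting some of them
count_cols <- function(m, cols, drop) {
  sub <- m[, cols, drop = drop]
  ncol(sub)   # ncol() returns NULL when sub has collapsed to a plain vector
}
m <- matrix(1:9, ncol = 3)
count_cols(m, 1:2, drop = TRUE)   # 2
count_cols(m, 2, drop = TRUE)     # NULL: the single column became a vector
count_cols(m, 2, drop = FALSE)    # 1: still a 3 x 1 matrix
```

Code that later calls ncol(), nrow(), or %*% on the result is safer with drop = FALSE, since the selection may contain only one row or column.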

We can also specify each element:

mat <- matrix(rep(NA,9),ncol=3)
mat[2,3] <- 5
mat[3,1] <- 9

3.2.3 Key Characteristics

Homogeneous (all elements must be of the same type)
2D structure (rows and columns)
Supports arithmetic operations

We can check the homogeneous characteristics of a matrix:

mat <- matrix(1:9, ncol=3)
typeof(mat)
mat[3,3] <- 'A'
typeof(mat)

We can perform linear algebra operations on matrices, although some operators behave a little differently from standard linear algebra notation.

x <- matrix(1:4,ncol=2)
y <- 2*x # multiplication with scalar
x * y # elementwise multiplication
x%*%y # matrix multiplication
t(x) # transpose
x + y # addition

# check how it works with a vector (useful for efficient coding in R) #
cy = c(y) # vectorize
x*cy
x+cy

# check the recycling property
x * cy[1:2]
x + cy[1:2]


x/cy[1:1]
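
Because a matrix is stored column by column, a recycled vector runs down the columns, which means that it scales rows rather than columns. The sketch below contrasts this with sweep(), a base R function that applies a vector along a chosen margin; the names x and v are illustrative.

```r
x <- matrix(1:4, ncol = 2)
v <- c(10, 100)
x * v                 # v is recycled down each column, so row i is scaled by v[i]
sweep(x, 2, v, "*")   # scale column j by v[j] instead
x %*% diag(v)         # the same column scaling written as a matrix product
```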

3.2.4 Misc

Instead of writing for loops in R, use the apply() function. Recall that sapply() was used for vectors.

?apply
mat <- matrix(rnorm(1000),ncol=10)
apply(mat,1,mean) # rowwise mean
apply(mat,2,mean) # column wise mean
colMeans(mat)

# matrix with missing
mat2 <- mat
mat2[sample(1:1000,10)] = NA
apply(mat2,2,mean) 
apply(mat2,2,mean,na.rm=T) 

# apply on your own functions
absMean <- function(x) {mean(abs(x),na.rm=TRUE)} # Compute mean of absolute values
apply(mat2,2,function(i) absMean(i))
apply(mat2,2, function(i) mean(abs(i),na.rm=T)) # make it in one line of code

A matrix is just a vector with two additional attributes, the number of rows and the number of columns, which R stores together in the dim attribute.

vx <- 1:6 # a vector
length(vx)
class(vx)
attributes(vx)
str(vx)

x <- matrix(vx, ncol=2)
class(x)
attributes(x)
dim(x)
ncol(x)
nrow(x)
nrow # Check the function code
ncol # Check the function code
str(x)

mvx <- as.matrix(vx)
class(mvx)
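
Since the only difference is the dim attribute, setting or removing that attribute converts between the two forms. A minimal sketch (the names v and m_flag are illustrative):

```r
v <- 1:6
dim(v)                # NULL: a plain vector has no dim attribute
dim(v) <- c(3, 2)     # adding the attribute turns v into a 3 x 2 matrix
m_flag <- is.matrix(v)
m_flag
v
dim(v) <- NULL        # removing it gives back the plain vector
is.matrix(v)
```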

We can name matrix rows and columns.

colnames(x)
colnames(x) <- LETTERS[1:ncol(x)]
colnames(x)
x

rownames(x)
rownames(x) <- LETTERS[1:nrow(x)]
rownames(x)
x

3.2.5 Higher-Dimensional Arrays

In a statistical context, a typical matrix in R has rows corresponding to observations (sample units) and columns corresponding to variables, such as weight and blood pressure. The matrix is then a 2D data structure. But suppose we also have data taken at different times, one data point per person per variable per time. Time then becomes the third dimension. In R, such data sets are called arrays.

Let’s take a simple example of students’ test scores from two different tests:

# first test scores for three students and two questions
firsttest <- matrix(c(46,21,50,30,25,50),ncol=2)
secondtest <- matrix(c(46,41,50,43,35,50),ncol=2)
firsttest
secondtest

# Store the two matrices, which have the same dimensions, in one array object
scores <- array(data=c(firsttest,secondtest),dim=c(3,2,2))
class(scores)
dim(scores)
scores # it is displayed layer by layer
# get the score for the second student and first question at second test
scores[2,1,2]
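
The apply() function works on arrays as well, with the margin argument naming the dimensions to keep. The sketch below rebuilds the scores array from above so that it runs on its own; the name totals is illustrative.

```r
firsttest  <- matrix(c(46,21,50,30,25,50), ncol = 2)
secondtest <- matrix(c(46,41,50,43,35,50), ncol = 2)
scores <- array(c(firsttest, secondtest), dim = c(3, 2, 2))

scores[, , 1]                          # the whole first test (a 3 x 2 matrix)
totals <- apply(scores, c(1, 3), sum)  # total per student (rows) and test (columns)
totals
apply(scores, 3, mean)                 # average score on each test
```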

3.2.6 Example: Imaging data

DIY: Download an image of the Mount Rushmore National Memorial in the United States and convert it to a greyscale image file using the pixmap library.

library(pixmap)
library(jpeg)

img <- readJPEG("files/Mount_Rushmore_National_Memorial.jpeg")
gray_img <- (img[,,1] + img[,,2] + img[,,3]) / 3
pgm_image <- pixmapGrey(gray_img)
write.pnm(pgm_image, "files/Mount_Rushmore_National_Memorial.pgm", type = "pgm")

Let’s check the pgm file.

pgm_image
str(pgm_image)
plot(pgm_image)

This object belongs to an S4 class, whose components (slots) are accessed with @ rather than $ (S4 is discussed in detail later). The key slot here is pgm_image@grey.

is.matrix(pgm_image@grey)
dim(pgm_image@grey) # 1800 x 2400
hist(c(pgm_image@grey))

# blot out parts of the image
tmp <- pgm_image
tmp@grey[1:500,1:500] <- 1
tmp@grey[1000:1500,1000:1200] <- 0
plot(tmp)

# Blur a part of the image by mixing in random noise
make_blur <- function(img, rows, cols, q) {
  #### Input
  #### - img : greyscale image (pixmapGrey object)
  #### - rows: row indexes to blur (vector)
  #### - cols: col indexes to blur (vector)
  #### - q:  a scalar in [0, 1] for the intensity of the blur
  nrows <- length(rows)
  ncols <- length(cols)
  outimg <- img
  outimg@grey[rows,cols] <- (1-q) * img@grey[rows,cols] + q * matrix(runif(nrows*ncols),nrow=nrows,ncol=ncols)
  return(outimg)
}

# Disguise President Roosevelt's identity by adding random noise
plot(make_blur(pgm_image,900:1500,1200:1600,0.6)) # change the q value
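
The mixing step inside make_blur() can be tried without any image file. The sketch below applies the same formula to a toy 4 x 4 matrix standing in for the grey slot; the matrix, the region, and the seed are assumptions made only for this illustration.

```r
set.seed(1)                            # only to make the sketch reproducible
g <- matrix(0.5, nrow = 4, ncol = 4)   # a toy 4 x 4 "image", all mid-grey
rows <- 1:2; cols <- 1:2; q <- 0.6
noise <- matrix(runif(length(rows) * length(cols)), nrow = length(rows))
out <- g
# convex combination: (1 - q) of the original plus q of uniform noise
out[rows, cols] <- (1 - q) * g[rows, cols] + q * noise
range(out)          # stays within [0, 1] because both inputs do
out[3:4, 3:4]       # pixels outside the region are untouched
```

With q = 0 the region is unchanged and with q = 1 it is replaced by pure noise, which is why q acts as an intensity setting.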

3.3 List

In contrast to a vector, in which all elements must be of the same mode, R’s list structure can combine objects of different types. The list plays a central role in R, forming the basis for data frames, object-oriented programming, and so on. Many R functions return their results as a list; the output of the lm() function is one example.
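
As a quick check of that last point, the sketch below fits a small regression on cars, a dataset that ships with R, and inspects the result. The model itself is arbitrary and chosen only for illustration.

```r
fit <- lm(dist ~ speed, data = cars)
is.list(fit)          # TRUE: an lm object is a list with a class attribute
names(fit)            # component names such as "coefficients" and "residuals"
fit$coefficients      # components are extracted like any other list element
```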

3.3.1 Creation

lst <- list(1, "Hello", TRUE, c(2, 4, 6))  
lst

## [[1]]
## [1] 1
## 
## [[2]]
## [1] "Hello"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 2 4 6

nlst <-  list(number=1, character="Hello", logical=TRUE, vector=c(2, 4, 6))  
nlst

## $number
## [1] 1
## 
## $character
## [1] "Hello"
## 
## $logical
## [1] TRUE
## 
## $vector
## [1] 2 4 6

alst <- vector(mode='list')
alst[['a']] = 3
alst

## $a
## [1] 3

# access a component by position
alst[[1]]

## [1] 3

# adding to an existing list
alst$b = 4
alst[3:6] = c(F,T,F,T)

# Delete an element by setting it to NULL
alst$b = NULL

# getting the length of a list
length(alst)

## [1] 5

3.3.2 Indexing

lst[[1]]  # Access first element (returns value)

## [1] 1

lst[1]    # Access first element (returns a list)

## [[1]]
## [1] 1

lst[[4]][2]  # Access second element of the fourth item (which is a vector)

## [1] 4

nlst$character

## [1] "Hello"

nlst$logical

## [1] TRUE

nlst$vector[3]

## [1] 6

nlst[['character']]

## [1] "Hello"
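
The difference between single and double brackets is easiest to see from the class of the result. A short sketch that recreates the unnamed list from above:

```r
lst <- list(1, "Hello", TRUE, c(2, 4, 6))
class(lst[1])      # "list": a sub-list of length 1
class(lst[[1]])    # "numeric": the element itself
lst[c(1, 4)]       # [ ] can select several components at once
# lst[1] + 1 would fail, because arithmetic is not defined for lists
lst[[1]] + 1
```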

3.3.3 Key Characteristics

Heterogeneous (elements can have different data types)
Mutable (can modify elements)
Can store other lists (nested lists)

3.3.4 Misc

Transform a list to a vector.

lst <- list(name='Joe',salary='50000',union='TRUE')
lst
names(lst)
unlist(lst) # becomes a named character vector

We can also loop over a list with members of the apply() family, lapply() and sapply():

lst <- list(rnorm(10),rnorm(20),rnorm(30),rnorm(40))
#compute median of each element of the list
lapply(lst,median) # list output
sapply(lst,median) # vector output

We can make recursive (nested) lists:

a = list(u=5,w=30)
b = list(k='K',l='L')
ll = list(aa=a,bb=b)
ll[[1]][['u']]
ll$aa$u

c(ll) # still a nested list
c(ll,recursive=TRUE) # flattened into a single named vector

3.4 Data Frame

A data frame is like a matrix, with a two-dimensional structure of rows and columns. However, it differs from a matrix in that each column may have a different mode. Just as lists are the heterogeneous analogs of vectors in one dimension, data frames are the heterogeneous analogs of matrices for two-dimensional data.

A data frame is a list whose components are equal-length vectors. It is the most typical format for data on $n$ samples and $p$ variables of different types, such as age, sex, BMI, and disease status.
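
The list nature of a data frame can be verified directly. The sketch below uses a small made-up data frame; the column names are illustrative.

```r
df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))
is.list(df)          # TRUE
length(df)           # number of columns, as for any list
sapply(df, class)    # each component keeps its own type
# A list of equal-length vectors converts directly to a data frame
as.data.frame(list(id = 1:3, ok = c(TRUE, FALSE, TRUE)))
```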

3.4.1 Creation

df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30), Score = c(90, 85))

3.4.2 Indexing

df[1, 2]   # Row 1, Column 2

## [1] 25

df$Name    # Access "Name" column

## [1] "Alice" "Bob"

df[ , "Age"]  # Access "Age" column

## [1] 25 30

df[1, ]    # First row; bracket indexing is preferable when writing general-purpose code

##    Name Age Score
## 1 Alice  25    90

df[[1]] # a data frame is a list of equal-length vectors, so [[1]] returns the first column

## [1] "Alice" "Bob"

df$Age # preferable when you analyze data

## [1] 25 30

str(df)

## 'data.frame':    2 obs. of  3 variables:
##  $ Name : chr  "Alice" "Bob"
##  $ Age  : num  25 30
##  $ Score: num  90 85

3.4.3 Key Characteristics

Heterogeneous (columns can have different types)
2D structure similar to a table
Can be converted to a tibble (a modern variant of the data frame from the tibble package)

3.4.4 Matrix like operations

Matrix-style operations also work on data frames. Let’s look at the iris data.

?iris
class(iris)
str(iris)
iris[1:3,1:4]
iris[,1]
iris[,1,drop=F]
head(iris)
tail(iris)
names(iris)

# extract rows with petal length > 6
iris[iris$Petal.Length>6,]

# extract setosa data #
setosa = iris[iris$Species=='setosa',]

# extract virginica data #
virginica = iris[iris$Species=='virginica',]

# merge setosa + virginica
newdat = rbind(setosa,virginica)

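# add columns (letters has only 26 entries, so recycle it to n values)
n = nrow(iris)
newdat = cbind(iris,l = rep(letters,length.out=n),L = rep(LETTERS,length.out=n))
dim(newdat)
head(newdat)

# treating the data frame as a list is often more convenient (columns can be added or updated at any time)
iris$l = rep(letters,length.out=n)
iris$L = rep(LETTERS,length.out=n)
head(iris)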

3.4.5 Using apply() on columns of the same type

apply(iris[,1:4],2,mean)
lst = lapply(iris,sort) # column-wise sorting, does not make statistical sense
lst
data.frame(lst)

3.4.6 Treatment of NA values

# Assign NA values to 10 randomly chosen cells
n = nrow(iris)
p = ncol(iris)
w = cbind(sample(1:n, 10),rep(sample(1:p, p),length.out=10)) # (row, column) index pairs
iris[w] = NA

# count the NA values in each column
colSums(is.na(iris))

# getting complete samples
complete.cases(iris)
new_iris = iris[complete.cases(iris),]
sum(is.na(new_iris))

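# extracting data
iris[iris$Petal.Length>6,] # rows where Petal.Length is NA come back as all-NA rows
subset(iris,Petal.Length>6) # subset() drops them and lets you refer to columns by name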
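
The different treatment of NA by bracket indexing and subset() is easier to see on a tiny data frame. The data below are made up for this illustration.

```r
d <- data.frame(id = 1:4, len = c(7, NA, 5, 6.5))
d[d$len > 6, ]           # the NA comparison yields an extra row of NAs
d[which(d$len > 6), ]    # which() keeps only the TRUE positions
subset(d, len > 6)       # subset() also drops the NA row
```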

3.5 Factors and Tables

Factors form the basis for many of R’s powerful operations, including many of those performed on tabular data. The motivation for factors comes from the notion of nominal, or categorical, variables in statistics. These values are nonnumerical in nature, corresponding to categories such as Male and Female, although they may be coded using numbers. Manipulating factors is extremely important in statistical modeling such as regression.

An R factor might be viewed simply as a vector with a bit more information added. That extra information consists of a record of the distinct values in that vector, called levels.

3.5.1 Creation

x <- c(3,4,5,6)
xf <- factor(x,levels=c(3,6,4,5))
x
xf 
str(x)
str(xf)
length(xf)
xf[2] <- 88 # 88 is not an existing level: NA is assigned with a warning
str(c(xf,55)) # becomes numeric (the integer codes, not the labels)
fct <- factor(c("Low", "Medium", "High", "Low"))
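
Internally a factor stores integer codes that point into its levels, which matters when converting a factor back to numbers. A minimal sketch with made-up values:

```r
f <- factor(c("10", "20", "10", "30"))
levels(f)                      # the distinct values, sorted: "10" "20" "30"
unclass(f)                     # the underlying integer codes 1 2 1 3
as.numeric(f)                  # the codes again, not 10 20 10 30
as.numeric(as.character(f))    # convert through the labels to recover the numbers
```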

3.5.2 Indexing

fct[1]  # First element
levels(fct)  # View factor levels

3.5.3 Key Characteristics

Categorical (used for grouping data)
Levels define the possible values
Improves efficiency in statistical modeling

3.5.4 Common Functions Used with Factors

Let’s look at tapply(), a member of the apply family.

ages <- c(25,26,55,37,21,42)
affils <- c('R','D','D','R','U','D') # a character vector also works as the grouping variable
tapply(ages,affils,mean)

# if you want to customize the order of the factor levels
f.affils <- factor(affils,levels=c('U','D','R'))
tapply(ages,f.affils,mean)

Let’s use the iris data to calculate the mean sepal length for each combination of species and petal length category (Petal.Length > 3.75 or not).

comb <- with(iris,list(Species,ifelse(Petal.Length>3.75,'long petal','short petal')))
table(comb) # count the number of observations in each combination
with(iris,tapply(Sepal.Length,comb,mean)) # mean of Sepal.Length

# Divide data corresponding to factor levels
with(iris, split(iris[,1:4],comb))

# Use 'by' function
by(iris[,1:4],iris$Species, function(m) mean(m$Sepal.Length,na.rm=T))
by(iris[,1:4],comb, function(m) mean(m$Sepal.Length,na.rm=T))

3.5.5 Working with Tables

Calculate the contingency table for the comb object from above.

str(comb)
ta = table(comb)
ta
str(ta)
class(ta)
table(comb[[1]],comb[[2]])
table(data.frame(comb))

# works for one variable
table(iris$Species)

# Array like operations on table object
apply(iris[,1:4],2,median)
ta <- with(iris,table(Species, ifelse(Sepal.Width>3,'wide','narrow'), ifelse(Sepal.Length>5.8,'long','short')))
ta/sum(ta) # cell proportions
dim(ta)
apply(ta,1,sum) # marginal frequency of Species
apply(ta,2,sum) # marginal frequency of Sepal.Width
apply(ta,c(1,2),sum) # marginal frequency of Species and Sepal.Width
apply(ta,c(1,3),sum) # marginal frequency of Species and Sepal.Length
apply(ta,c(2,3),sum) # marginal frequency of Sepal.Width and Sepal.Length

In statistics, the marginal values of a variable are obtained by fixing that variable at each of its levels while summing over the other variables.

addmargins(ta)
dimnames(ta)
dimnames(ta)[[3]] = c('L','S')
ta

#check the table function by typing
table
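
The marginal sums computed with apply() above can also be obtained with margin.table(), and proportions with prop.table(), both from base R. The sketch uses the built-in mtcars data so that it runs independently of the modified iris object.

```r
# A small two-way table of cylinders by gears
tb <- with(mtcars, table(cyl, gear))
tb
margin.table(tb, 1)     # row totals, same as apply(tb, 1, sum)
margin.table(tb, 2)     # column totals
prop.table(tb)          # cell proportions: tb / sum(tb)
prop.table(tb, 1)       # proportions within each row
```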

3.5.6 Misc

R includes a number of other functions that are handy for working with tables and factors.

aggregate(iris[,1:4],
      list(iris$Species,ifelse(iris$Petal.Width>median(iris$Petal.Width),1,0)),mean)

x = rnorm(100)
summary(x)
cut(x,breaks=c(min(x),median(x),max(x)),include.lowest=TRUE) # include.lowest keeps min(x) in the first interval

3.5.7 Example: working with texts

DIY: Compute word counts from a text file (use the split() function)

tx <- scan('files/Research_summary.txt',what='character', quiet=T)
str(tx)
tx <- gsub('[.,()-]','',tx) # remove periods, commas, parentheses, and hyphens
words <- split(1:length(tx),tx) # the split function automatically treats tx as a factor
ta.words <- table(tx) # compare this with the result from the split function
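
The same idea can be tried without a file. In the sketch below a made-up sentence stands in for the text file; strsplit() and lengths() are base R functions, and the variable names are illustrative.

```r
txt <- "the cat sat on the mat and the dog sat too"
tx <- strsplit(txt, " ")[[1]]          # one word per element
positions <- split(seq_along(tx), tx)  # positions at which each word occurs
counts <- lengths(positions)           # how often each word occurs
counts[["the"]]
sort(table(tx), decreasing = TRUE)     # same counts via table(), most frequent first
```

Unlike table(), the split() result also records where each word occurs, which is useful when positions matter as well as counts.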