Chapter 9 Data Analysis

After cleaning the data (correcting variable classes with dplyr::mutate() and, to catch anomalies, summarising data features with dplyr::select() and dplyr::summarise()), it is time to move on to data analysis. Typical tasks look like:

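How does borrowing volume differ across class years and colleges?

For each year, what is the range of shelving dates (year and month) of the books that were borrowed?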

9.1 Data

9.1.1 Library data

library(readr)
library(dplyr) # provides %>% and the dplyr verbs used in this chapter
libraryData <- read_csv("https://raw.githubusercontent.com/tpemartin/github-data/master/library100_102.csv")

9.2 Analysis by groups

9.2.1 Example: a categorical variable

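During the preliminary tidying of categorical data (see Summary on non-numerical variables), the frequency counts for some storage locations (the 書籍館藏地 variable) were not listed, so we now want to:

For each storage location, count its total number of borrowings.

Suppose there are N storage locations, and let i denote the i-th one, $i=1,\dots,N$.
For the i-th storage location, let borrowCount_i denote its total number of borrowings.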

9.2.2 Control Flow: `for` loops

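To complete a task of the "for every one of them" type, start with just one of them:

Loop concept A

Start with i<-1:

What is the value of the i-th storage location? (Store it in the object storageLocation_i.)
How do we select the rows that match storage location i? (Store them in the object subsample_i.)
Once the subsample is selected, how do we compute borrowCount_i?

For the first step, we first need to answer the following question:

Which categories of storage location appear in the data? Store them in the object storageLocations. [hint: use levels()]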

libraryData$書籍館藏地 %>% as.factor %>% levels -> storageLocations
storageLocations

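i<-1
# 1: value of the i-th storage location
storageLocation_i <- storageLocations[i] 
# 2: rows whose storage location equals storageLocation_i
numericalIndexForSubsample<-which(libraryData$書籍館藏地==storageLocation_i)
subsample_i <- libraryData[numericalIndexForSubsample,]
# 3: the number of rows in the subsample is the borrow count
borrowCount_i <- nrow(subsample_i)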

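Since there are 18 possible storage locations in total, you could change i<-1 to i<-2 and run the code above again, repeating this 18 times until i<-18. Two potential problems need to be solved here:

1. Changing i<-1 step by step up to i<-18 is tedious. Is there a faster way?
2. The content of borrowCount_i keeps being overwritten by the value from the most recent i. Is there a way to keep all 18 values?

Start with the second problem. The solution is to create an object for storing the information. To create the right object, first consider:

2.1. What is the mode of each stored element? (Roughly speaking, mode is a classification one level above class; it can be queried with the `mode()` function.)

2.2. How many elements need to be stored in total?

In our case, the answer to 2.1 is numeric and the answer to 2.2 is 18.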
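As a quick illustration of how mode() differs from class(), the sketch below queries both for a few base R values. The values are arbitrary examples chosen for illustration, not taken from the library data.

```r
# mode() reports the basic storage type; class() can be more specific.
x_int <- 5L                      # an integer
x_fac <- factor(c("a", "b"))     # a factor (integer codes plus labels)
x_chr <- "hello"                 # a character string

mode(x_int);  class(x_int)   # "numeric"   vs "integer"
mode(x_fac);  class(x_fac)   # "numeric"   vs "factor"
mode(x_chr);  class(x_chr)   # "character" for both

# nrow() returns an integer, so a row count has mode "numeric"
mode(nrow(data.frame(a = 1:3)))
```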

A storage container can be created with the vector() function, as follows:

allBorrowCount <- vector("numeric",18)
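To see what such a container looks like before anything is stored in it, here is a small sketch using shorter containers. The object names numContainer, chrContainer and lstContainer are hypothetical, introduced only for this illustration.

```r
# vector(mode, length) returns a container pre-filled with that mode's "empty" value
numContainer <- vector("numeric", 3)     # 0 0 0
chrContainer <- vector("character", 2)   # "" ""
lstContainer <- vector("list", 2)        # a list holding two NULLs

numContainer
length(numContainer)

# Filling one slot leaves the other slots at their initial value
numContainer[[2]] <- 42
numContainer   # 0 42 0
```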

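Loop concept B

Create a container for storing the information (allBorrowCount).

Start with i<-1:

What is the value of the i-th storage location? (Store it in the object storageLocation_i.)
How do we select the rows that match storage location i? (Store them in the object subsample_i.)
Once the subsample is selected, how do we compute borrowCount_i?
Store borrowCount_i in the i-th position of allBorrowCount.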

# 0
allBorrowCount <- vector("numeric",18)
i<-1
  # 1
  storageLocation_i <- storageLocations[i] 
  # 2
  numericalIndexForSubsample<-which(libraryData$書籍館藏地==storageLocation_i)
  subsample_i <- libraryData[numericalIndexForSubsample,]
  # 3
  borrowCount_i <- nrow(subsample_i)
  # 4
  allBorrowCount[[i]]<-borrowCount_i

Step 4 here uses [[]], a third form of object extraction besides $ and [].

Differences between [] and [[]]:

[] can extract many elements at once; [[]] can extract only one.

a<-c("A","D","E")
# Both return the same element
a[1]
a[[1]]
# Extract several elements
a[c(1,2)]
# ERROR a[[c(1,2)]]

If the elements are named, [] keeps the name, whereas [[]] drops it.

b<-c(element1="A",element2="D",element3="E")
b[1]
b[[1]]

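If the object is a recursive object (such as the objects shown with a small arrow in the RStudio Environment panel), [] keeps the class of the parent object, whereas [[]] returns the class of the selected element.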

libraryData["書籍館藏地"] %>% class

libraryData[["書籍館藏地"]] %>% class

If you are sure that you want exactly one element, using [[]] guards better against programming errors.
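A minimal sketch of that safeguard for an atomic vector: suppose an index that was meant to hold one position accidentally holds two. The names counts, idx and result are hypothetical.

```r
counts <- c(10, 20, 30)
idx <- c(1, 2)   # we meant a single position, but idx holds two

# [] silently returns two elements, so the mistake goes unnoticed
counts[idx]

# [[]] stops with an error, which exposes the mistake immediately;
# tryCatch() is used here only so that the script keeps running
result <- tryCatch(
  counts[[idx]],
  error = function(e) "error: more than one element selected"
)
result
```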

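The i in loop concept B needs to run from 1 to 18, which we can write as the following loop:

The for loop in R:

# Step 0: create the storage object
for(i in ...){

 ... other steps

}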

# 0
allBorrowCount <- vector("numeric",18)
for(i in c(1:18)){
  # 1
  storageLocation_i <- storageLocations[i] 
  # 2
  numericalIndexForSubsample<-which(libraryData$書籍館藏地==storageLocation_i)
  subsample_i <- libraryData[numericalIndexForSubsample,]
  # 3
  borrowCount_i <- nrow(subsample_i)
  # 4
  allBorrowCount[[i]]<-borrowCount_i  
}

Because the index i follows the number of elements in storageLocations, we can replace c(1:18) with seq_along(storageLocations), that is:

for(i in seq_along(storageLocations)){
...
}
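The following sketch runs the whole loop with seq_along() on a small made-up vector, and then shows one reason seq_along() is often preferred over 1:length(). The vector location and the objects toyLocations, toyCounts and emptyVec are hypothetical stand-ins, not part of libraryData.

```r
# Toy stand-in for libraryData$書籍館藏地 (values are made up)
location <- c("A", "B", "A", "C", "A", "B")
toyLocations <- levels(as.factor(location))

# Step 0: container sized to the number of categories
toyCounts <- vector("numeric", length(toyLocations))
for (i in seq_along(toyLocations)) {
  # count the positions whose value equals the i-th category
  toyCounts[[i]] <- length(which(location == toyLocations[i]))
}
names(toyCounts) <- toyLocations
toyCounts

# With an empty vector, 1:length(x) counts down from 1 to 0,
# so a loop over it would run twice; seq_along(x) is empty instead
emptyVec <- character(0)
1:length(emptyVec)
seq_along(emptyVec)
```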

Exercise: use a for loop to compute the borrowing volume of each college (學院).

9.2.3 `dplyr::group_by()`

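9.2.3.1 Basic concepts

Loops are for similar tasks that must be repeated. When the task is to

split the data into groups by some variable, then analyse each group's subsample,

dplyr::group_by() can set up the subsample_i grouping structure of the for loop for you. The analyst only needs to chain on the definition of step 3 of the loop concept (that is, what to do with subsample_i).

Back to the original question:

For each storage location, count its total number of borrowings.

dplyr::group_by() builds subsample_i, the subsample of borrowing records for each storage location.
[%>% chaining] states what to do with each subsample_i, such as computing its total borrow count borrowCount_i.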

libraryData %>%
  group_by(書籍館藏地) %>% # form one subsample_i per storage location
  summarise(
    borrowCount=n() # for each subsample_i, borrowCount=nrow(subsample_i)
  ) -> result
result

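dplyr::n() counts the number of observations in the data it receives. If that data is grouped, n() counts the observations in each group separately.

We cannot use nrow(.) here, because it is unaware of the grouping and would only count the complete, ungrouped sample.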
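The difference can be checked on a small made-up data frame. This is a sketch that assumes the magrittr pipe %>% loaded with dplyr; toyData and its location column are hypothetical stand-ins for libraryData and 書籍館藏地.

```r
library(dplyr)

toyData <- data.frame(
  location = c("A", "B", "A", "C", "A", "B")
)

toyResult <- toyData %>%
  group_by(location) %>%
  summarise(
    byGroup = n(),        # number of rows in the current group
    wholeData = nrow(.)   # number of rows in the entire piped data frame
  )
toyResult
```

In toyResult the byGroup column differs by location, while wholeData repeats the same total in every row.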

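Inside summarise(), each setting takes the form LHS=RHS, where:

LHS: the name of the summary variable to create (here borrowCount).
RHS: the computation. (1) It must return exactly one value; (2) if a variable of the data frame appears in it, it is read as subsample_i$variable, that is, the values of that variable within the group.

The chained step here uses dplyr::summarise():

The right-hand side of = may only refer to variable names that exist in each storage location's subsample.
Writing borrowCount=length(學號) in summarise() would be equivalent to using borrowCount_i=length(subsample_i$學號) in the for loop, that is: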

# 0
allBorrowCount <- vector("numeric",18)
for(i in c(1:18)){
  # 1
  storageLocation_i <- storageLocations[i] 
  # 2
  numericalIndexForSubsample<-which(libraryData$書籍館藏地==storageLocation_i)
  subsample_i <- libraryData[numericalIndexForSubsample,]
  # 3
  borrowCount_i <- length(subsample_i$學號)
  # 4
  allBorrowCount[[i]]<-borrowCount_i  
}
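To check that the two approaches agree, the sketch below runs both on a small made-up data frame. The data frame toyData and its columns location and studentId are hypothetical stand-ins for libraryData, 書籍館藏地 and 學號.

```r
library(dplyr)

toyData <- data.frame(
  location  = c("A", "B", "A", "C", "A", "B"),
  studentId = c("s1", "s2", "s3", "s1", "s2", "s3"),
  stringsAsFactors = FALSE
)

# for-loop version
toyLocations <- levels(as.factor(toyData$location))
loopCounts <- vector("numeric", length(toyLocations))
for (i in seq_along(toyLocations)) {
  subsample_i <- toyData[which(toyData$location == toyLocations[i]), ]
  loopCounts[[i]] <- length(subsample_i$studentId)
}

# group_by() version
groupedCounts <- toyData %>%
  group_by(location) %>%
  summarise(borrowCount = length(studentId))

# For this toy data both list the locations in the same order,
# so the counts can be compared position by position
all(loopCounts == groupedCounts$borrowCount)
```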

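Exercise: use dplyr::group_by() to compute the borrowing volume of each college (學院).

9.2.3.2 Grouping by multiple variables

When several variables define the subsample groups, group_by() handles this easily. For example, to compute each college's borrowing volume for each enrollment year (入學年):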

libraryData %>%
  group_by(學院,入學年) %>%
  summarise(
    borrowCount=length(學號)
  ) -> result2
result2
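The shape of such a result is easier to see on a small made-up data frame. In this sketch toyData and its columns are hypothetical; the .groups argument assumes dplyr 1.0.0 or later, and count() is shown only as an alternative shorthand, not as the method used in this chapter.

```r
library(dplyr)

toyData <- data.frame(
  college   = c("Law", "Law", "Law", "Business", "Business"),
  entryYear = c(100, 100, 101, 100, 100),
  studentId = c("s1", "s2", "s3", "s4", "s4")
)

# One output row per (college, entryYear) combination present in the data
toyResult2 <- toyData %>%
  group_by(college, entryYear) %>%
  summarise(
    borrowCount = length(studentId),
    .groups = "drop"   # return an ungrouped result
  )
toyResult2

# count() is a shorthand for the same group-and-tally pattern
toyCount <- toyData %>% count(college, entryYear, name = "borrowCount")
toyCount
```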

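Exercise: compute each student ID's (學號) total borrowing volume while enrolled.
Exercise: compute each student ID's borrowing volume in each semester.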

9.3 Filter observations

9.3.1 `dplyr::filter()`

Usage:

.data %>% filter(Logical predicates defined in terms of the variables in .data. )

9.3.1.1 Example: keep only observations whose enrollment year is 100 and whose college is 社會科學院

libraryData %>%
  filter(入學年==100, 學院=="社會科學院")
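Conditions separated by commas in filter() are combined with a logical AND. A small sketch on made-up data, where toyData and its columns entryYear and college are hypothetical:

```r
library(dplyr)

toyData <- data.frame(
  entryYear = c(100, 100, 101, 102),
  college   = c("Law", "Business", "Law", "Law")
)

# Comma-separated conditions ...
withComma <- toyData %>% filter(entryYear == 100, college == "Law")
# ... keep the same rows as a single condition joined by &
withAnd   <- toyData %>% filter(entryYear == 100 & college == "Law")

withComma
identical(withComma, withAnd)
```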

9.3.1.2 Example: keep observations whose enrollment year is 100 to 102 and whose college is 社會科學院

libraryData %>%
  filter(between(入學年,100,102), 學院=="社會科學院")

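An argument of most dplyr functions can also be written as a chain of pipes; the value at the end of the chain is what is passed as the argument.

The following form is easier to read: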

libraryData %>%
  filter(入學年 %>% between(100,102), 學院=="社會科學院")
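As a small check of what between() returns, the sketch below applies it to a made-up vector; entryYear is a hypothetical stand-in for the 入學年 column. Both bounds are inclusive.

```r
library(dplyr)

entryYear <- c(99, 100, 101, 102, 103)

# TRUE for values from 100 to 102, endpoints included
between(entryYear, 100, 102)

# Same result as the explicit comparison
entryYear >= 100 & entryYear <= 102

# The piped form passes entryYear as the first argument of between()
entryYear %>% between(100, 102)
```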

9.4 Exercises

Online practice

https://garylkl.shinyapps.io/Chapter4/#section-practice-8