第 8 章 Data Transformation

In this chapter, we are going to learn pipe operator and some data processing functions from dplyr package.

8.1 主要參考資料

RStudio針對常用的套件有設計cheatsheet(見:https://www.rstudio.com/resources/cheatsheets/),建議同學下載我們有教過的:

8.2 Pipe operator

程式設計有很多部份像是個生產線加工流程,pipe operator讓流程更透明易懂。

Consider the following task:

converting the following Taiwan calendar dates into western calendar dates:

民國年月<-c("099/01","099/02","099/03")

We normally layout our strategy in sequential statement such as:

  1. Append “1” to the front of 民國年月, and “/01” to the back of 民國年月。

  2. Given updated 民國年月, use lubridate::ymd() to convert it to a Date(POSIXct) object.

  3. Add 911 years to the step 2 object (using lubridate::years()).

library(stringr)
library(lubridate)
民國年月<-c("099/01","099/02","099/03")

# 1
str_c("1",民國年月,"/01")->民國年月

# 2
lubridate::ymd(民國年月) -> 民國年月

# 3
民國年月 + years(911)
民國年月<-c("099/01","099/02","099/03")

library(dplyr)
# 1
民國年月 %>% 
  str_c("1", . ,"/01") %>%
  ymd %>%
  { . + years(911) }

%>% pipe operator:

  1. 預設:%>% 把它左邊(LHS)的物件流入它右邊(RHS)函數的argument第一個位置。 )

  2. 若要流入右邊函數的「非」第一位置argument,則用.在右邊函數中標示。

  3. %>% 可進行串接,則下一個%>%會以上個函數輸出結果當做新的LHS物件。

  4. 若右側是一串運算,而非函數,則用{...}把右側運算括住,並把LHS物件要出現的位置用.取代。

8.3 Data cleaning

In data science, it is very common to

  • subsample (取子樣本,即原樣本的一部份觀察值) based on some criteria: filter()

  • select certain variables (選取某些變數): select()

  • create a new variables (創新變數): mutate()

  • compute a statistic value based on observations (計算某些樣本特徵值,如一群觀察值的平均值、最大值等): summarise()

  • analysis by groups: group_by()

These can easily be achieved through the dplyr package.

library(dplyr)

When we first import a dataset, we usually have to check variable classes, NAs and other anomalies before we continue to exploratory data analysis. Tools provided by dplyr are very handy for this task.

範例資料

政府開放平台:106學年大專院校校別學生數

library(readr)
collegeData <- read_csv("https://raw.githubusercontent.com/tpemartin/github-data/master/103_student.csv")

8.3.1 dplyr::mutate

library(dplyr)
mutate(
  collegeData,
  學校代碼=as.factor(學校代碼),
  學校名稱=as.factor(學校名稱)
  ) -> collegeData2
  1. 上述的程式如果用以前的寫法,要如何寫呢?

  2. 日間/進修別,等級別改成factor,同時一年級男生,一年級女生改成numeric。

  3. 請使用%>%方式寫。

8.3.2 Parsing characters

要將character class轉成不同class,雖然可以用as.XXX(), XXX為class name, 但它有時會判斷失敗,而把該筆資料轉成NA。

collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=as.numeric(一年級男生),
    一年級女生=as.numeric(一年級女生)
  ) -> collegeData3

請檢查前面一年級男/女生的結果。

as.numeric("1,234")

Package readr comes with a variety of character (vector) parsers. They are names as parse_*() where * is the value type to parse to.

Package readr parsers usages:

parse_logical(x, na = c("", "NA"), locale = default_locale())

parse_integer(x, na = c("", "NA"), locale = default_locale())

parse_double(x, na = c("", "NA"), locale = default_locale())

parse_character(x, na = c("", "NA"), locale = default_locale())

where

  • na=... defines characters that should be parsed as NA.

  • locale =... is used to setup an environment that describes certain patterns of writing used in some part of the world, where … can be supplied by locale() function. (Note:locale means 場景)

Reference: RDS, Ch11.3 Parsing a vector

We can use locale=locale(xxx=...) to set xxx pattern of writing to different styles. For example, xxx can be

  • grouping_mark (千分位符號)

  • decimal_mark (小數點符號)

library(readr)
cl<-locale(grouping_mark = ",") # 創造cl來描述千分位符號使用","的習慣場景
collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl)
  ) -> collegeData4

When there are parsing problem warnings, it is good to use problems(.the_parsed_result_object) such as

problems(collegeData4$一年級男生)
  1. 選出collegeData4中一年級男生有parsing問題的樣本之學校名稱,日間/進修別,等級別

  2. collegeData4裡存不存在觀測值是「一年級男生,或一年級女生只有一個有parsing問題」。(hint: setequal(.x,.y)可比對兩向量的內容集合是否相同,setdiff(.x,.y)會留下.x集合中不在.y集合的元素)

關於set***(.x,.y)一系列的說明可見:

New variables generated within mutate() can be sequentially called–variables defined at the front can be called immediately at the back. The following two examples are equivalent.

collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl)
  ) -> collegeData4

collegeData4 %>%
  mutate(
    一年級男生每千人=一年級男生/1000,
    一年級女生每千人=一年級女生/1000
    ) -> collegeData5
collegeData %>% 
  mutate(
    `日間∕進修別`=as.factor(`日間∕進修別`),
    等級別=as.factor(等級別),
    一年級男生=parse_number(一年級男生,locale=cl),
    一年級女生=parse_number(一年級女生,locale=cl),
    一年級男生每千人=一年級男生/1000,
    一年級女生每千人=一年級女生/1000
    ) -> collegeData6
head(collegeData5[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])

head(collegeData6[,c("一年級男生","一年級女生","一年級男生每千人","一年級女生每千人")])

8.3.3 Apply to multiple variables

If you want to mutate a group of variables you can use


  • mutate_all(.data, .funs): for all variables 對所有變數做.funs的動作。.funs可使用funs()函數,使用範例如下:

範例 一個函數

library(stringr)
collegeData %>%
  select(contains("一年級")) -> college1stYearData
college1stYearData %>%
  mutate_all(funs(parse_number(.,locale=cl))) ->
  college1stYearDataNumeric

範例 二個函數

college1stYearDataNumeric %>%
  mutate_all(
    funs(
      每百人=./100,
      每千人=./1000
    )
  )

執行以下程式:

library(googlesheets)
gs_auth(new_user = TRUE)
gsSurvey<-gs_key("1mC9bnxj11NCNoOCw0Vmn4nxERbHtLjeGo9v9C9b2GDE")  #download sheet information as an object
classSurvey<-gs_read(gsSurvey,ws=1) #access the 1st worksheet
classSurvey %>% select(居住地行政區,性別) ->
  classSurveyPartial

將classSurveyPartial的所有變數轉成factor class。


  • mutate_at(.data, .vars, .funs): for some variables 對.vars所定義選出的變數做.funs的動作, 其中.vars可透過vars()來定義。

範例 直接使用變數名稱:

collegeData %>% mutate_at(
  vars(一年級男生,一年級女生),
  funs(parse_number(.,locale=cl))
) -> college1stYearDataA

college1stYearDataA %>%
  select(一年級男生,一年級女生)

vars() can work with the following select helper functions:

  • 使用文字特徵:
    starts_with():以什麼文字開頭
    ends_with():以什麼文字結尾
    contains():包含什麼文字
    matches():其他複雜可透過regular expression表示的文字規則

  • 使用數字特徵:想選出變數x001,x002,…,x083
    num_range():num_range("x",1:83)

範例 利用變數名稱規則

collegeData %>% mutate_at(
  vars(starts_with("一年級")),
  funs(parse_number(.,locale=cl))
) -> college1stYearDataB

college1stYearDataB %>% 
  select(一年級男生,一年級女生)

將原始collegeData資料裡的所有男/女生變數都轉成numeric class。

8.4 Data anomaly

參考資料My favourite R package for: summarising data

It is crucial that we check our data before any further investigation. There are mainly two potential threats at this stage:

  • outliers(離群值)

  • NAs(無輸入值)

To check if this is the case, we start with data summary for each variable. The dfSummary() from summarytools is very handy for the firsthand investigation.

範例資料 台北大學100-102學年入學學生的圖書借閱資料(部份樣本,且不含續借)

library(readr)
libraryData <- read_csv("https://raw.githubusercontent.com/tpemartin/github-data/master/library100_102.csv")

請將上述資料class整理正確。

libraryData %>%
  mutate(
    學院=as.factor(學院),
    書籍館藏地=as.factor(書籍館藏地),
    書籍類別=as.factor(書籍類別),
    借閱時間=ymd_hms(借閱時間,tz="Asia/Taipei")
  ) -> libraryData

8.4.1 function101

參考資料RDS, Chapter 19

範例:記算數值變數的最大及最小值

## 1
max(libraryData$讀者借閱冊數)-> maxNumber
min(libraryData$讀者借閱冊數)-> minNumber
paste0("最大值: ",maxNumber, " 最小值: ",minNumber)

## 2
max(libraryData$入學年)-> maxNumber
min(libraryData$入學年)-> minNumber
paste0("最大值: ",maxNumber, " 最小值: ",minNumber)

## 3
max(libraryData$書籍上架年月)-> maxNumber
min(libraryData$書籍上架年月)-> minNumber
paste0("最大值: ",maxNumber, " 最小值: ",minNumber)

以上三段程式會變的部份是libraryData$讀者借閱冊數libraryData$入學年、或libraryData$書籍上架年月。也就是說讓我們把它想成是變數x,則x有三個可能值,且在每個x值下,我們在進行以下三行運算:

max(x)-> maxNumber
min(x)-> minNumber
paste0("最大值: ",maxNumber, " 最小值: ",minNumber)

由於會反覆用在不同可能x,我們可以把以下三行寫成function:

minMaxFun <- function(x){
  max(x)-> maxNumber
  min(x)-> minNumber
  paste0("最大值: ",maxNumber, " 最小值: ",minNumber)
}

其中minMaxFun及argument: x可自行命名。

minMaxFun(libraryData$讀者借閱冊數)

minMaxFun(libraryData$入學年)

minMaxFun(libraryData$書籍上架年月)

自定函數output:

若函數要有value return,有以下兩個做法:(1)最後一行並「不以物件儲存」; 或(2)程式中的return(想儲存物件名稱) 來決定。

(1)最後一行並「不以物件儲存」

minMaxFun1 <- function(x){
  max(x)-> maxNumber
  min(x)-> minNumber
  paste0("最大值: ",maxNumber, " 最小值: ",minNumber)
}
minMaxFun1(libraryData$讀者借閱冊數)-> outputValue1
outputValue1

(2)程式中的return(想儲存物任名稱) 來決定

minMaxFun2 <- function(x){
  max(x)-> maxNumber
  min(x)-> minNumber
  paste0("最大值: ",maxNumber, " 最小值: ",minNumber)-> output
  return(maxNumber)
}
minMaxFun2(libraryData$讀者借閱冊數)-> outputValue2
outputValue2
請設計一個函數not_numeric(x), 它可以用來確認物件x不是數值class,即「不是數值class」return value為TRUE,反之為FALSE。
not_numeric<-function(x){
  !is.numeric(x)
}

8.4.2 dplyr::select

dplyr::select( ) 用來選出資料集中所要的變數。

libraryData %>%
  select(讀者借閱冊數,書籍出版日期) 

8.4.3 dplyr::summarise

dplyr::summarise( )用來計算變數的值所能產生的敍述統計值,如「平均值」,「最大/小值」等。

libraryData %>%
  select(讀者借閱冊數,書籍出版日期) %>%
  summarise(
    讀者平均借閱冊數=mean(讀者借閱冊數),
    最小書籍出版日期=min(書籍出版日期),
    最大書籍出版日期=max(書籍出版日期)
  )
上述寫法會有NA,你會如何修正?
libraryData %>%
  select(讀者借閱冊數,書籍出版日期) %>%
  summarise(
    讀者平均借閱冊數=mean(讀者借閱冊數),
    最小書籍出版日期=min(書籍出版日期,na.rm=T),
    最大書籍出版日期=max(書籍出版日期,na.rm=T)
  )

Sometimes it is more convenient to apply it with some conditions. Here we look at select_if():

  • select_if(.data, .preidcate, .funs)

where

**.predicate** is a function that returns a **single** TRUE or FALSE.

**.funs** is a collection of functions to be applied to variable <span style="color:tomato;">**names**</span>; if not specified, nothing is done on varible names.

A predicate(謂語邏輯) is a function that returns a single TRUE or FALSE.

以下取自Wiki:

謂語(predicate of grammar)與主語相對,兩者是陳述與被陳述的關係。
謂語則用來對主語或主詞加以陳述,來說明主語「是什麼」「做什麼」或
「怎麼樣」。

譬如:

在「這個球是黃色的」這句中,「這個球」是主詞,「是黃色的」是謂語,
「是」字是動詞,「黃色的」是表語。

在「他正在打籃球」這句中,「他」是主詞,「正在打籃球」是謂語,
「打」是動詞,「籃球」是賓語。

dplyr::select(),dplyr::mutate()均可做_at, _all, _if的彈性延伸,用法一致:

…_at(.vars,.funs)

…_all(.funs)

…_if(.predicate, .funs)

其中:

.vars表示可用vars(...)把變數名稱要求條件寫在…

.funs表示可用funs(...)把要做的函數操作寫在…

.predicate表示可用謂詞邏輯函數來決定變數選擇與否

更多訊息可見Summarise and mutate multiple columns

8.4.4 summarytools::dfSummary

Package summarytools has a very nice summary function dfSummary() for us to summarise variables in a data frame.

At the first stage of data summarisation, attentions are on:

  • extreme values/outliers(極值)

  • NAs(缺失資料)

  • distribution(分配): mode(眾數), unusual pattern(左/右長尾,雙峰), etc.

Numerical variables

library(summarytools)
libraryData %>%
  select_if(
    is.numeric
  ) %>% summarytools::dfSummary() %>% summarytools::view(.)

Summary on numerical variables

有發現什麼奇怪現象嗎?

Non-numerical variables

library(summarytools)
libraryData %>%
  select_if(
   not_numeric
  ) %>% summarytools::dfSummary() %>% summarytools::view(.)

Summary on non-numerical variables

有發現什麼奇怪現象嗎?

8.4.5 NA correction

在libraryData有部份書籍出版日期應為NA而誤植為9999,所以我們必需進行以下的修改

書籍出版日期為9999, 則書籍出版日期改成NA

要更動的:libraryData$書籍出版日期

(libraryData$書籍出版日期==9999) -> logiIndex # 回傳每筆書籍出版日期是否為9999
 libraryData$書籍出版日期[logiIndex] <-  # 選出為9999的樣本
   NA # 將它換成NA

當變數值存在NA時,上述做法並不佳,原因如下:

當變數的值原本就有NA時,

  • 邏輯判斷會回覆NA。
a <- c(1,3,4,NA,5,6)
(a>3)

[1] FALSE FALSE TRUE NA TRUE TRUE

  • 用邏輯判斷產生的logical vector來選觀測值時, NA會被選出來。
a[(a>3)]

[1] 4 NA 5 6

  • 使用which把TRUE的位置選出來,使用位置來選元素。
a[which((a>3))]

[1] 4 5 6

較佳的做法是使用which()取出logiIndex為TRUE的位置。

(libraryData$書籍出版日期==9999) -> logiIndex # 回傳每筆書籍出版日期是否為9999
 libraryData$書籍出版日期[which(logiIndex)] <-  # 選出為9999的樣本
   NA # 將它換成NA

8.4.6 練習資料

100-102學屆學生去識別化成績資料

library(readr)
transcriptData <- read_csv("https://raw.githubusercontent.com/tpemartin/github-data/master/transcript100_102.csv")

8.5 練習題

期中考訂正

完成以下所有程序,期中考分數加上15分。

  1. 依你的寫法訂正錯誤(並非copy paste正確答案),也就是用你期中考相同的符號及邏輯想法去改,如果非pipe寫法,也請維持非pipe寫法。

空白試題及資料檔點此下載

  • 試題與資料檔要在同目錄下才能成功引入資料

  • 訂正完,請把檔名加上你的學號,檔案內容開頭的id, name也要寫你的學號,姓名;並上傳到以下網址: 期中考訂正作業上傳網址

  1. 在訂正過程中,每一個錯誤題目請照Hypothes.is電子標籤說明在課程電子書或其他網路電子書標「2」個解此題所需要的重要觀念。

以上任何步驟未完成或不完整,將不會給予任何加分。

8.6 Logical

Regarding Midterm ans14,

!is.na(hwGrades$hw1) + !is.na(hwGrades$hw2) + !is.na(hwGrades$Exercise3)

won’t give you the right count of hw submissions. But the following will:

(!is.na(hwGrades$hw1)) + (!is.na(hwGrades$hw2)) + (!is.na(hwGrades$Exercise3))
To be short,

If you want ! to give you number for operation, make sure to put parenthesis around the logical contents that you want to negate on.

To understand why, here is the detail explanation regarding logicals and ! operator.


F has an intrinsic value of 0, while T has an intrinsic value of 1. Therefore, the following will return TRUE:

0==F
1==T

! indicates logical negation which returns T/F. Since 0==F and 1==T, the following two are the same,

!0
!F

Also the following two:

!1
!T

! takes negation of everything behinds it. Therefore, these are the same

!F  
! F
!   F

They are all like

!(F)
!F+!T+!F
!(F+!(T+!F))

To limit its inclusion, put ) at the place to stop it.

(!F)+(!T)+(!F)
hwGrades <- read_csv("https://raw.githubusercontent.com/tpemartin/github-data/master/hwGrades.csv")

To see the results of the following,

!is.na(hwGrades$hw1) + !is.na(hwGrades$hw2) + !is.na(hwGrades$Exercise3)

Let’s take only the 1st observation to understand how it works.

!is.na(hwGrades$hw1[1]) + !is.na(hwGrades$hw2[1]) + !is.na(hwGrades$Exercise3[1])

which is the same as

!is.na(NA)+!is.na(NA)+!is.na(NA)

is the same as

!(is.na(NA) + !(is.na(NA)+ !is.na(NA)))

is equivalent to !(T + !(T+ !T))=!(T + !(T+ F))=!(T+!(1))=!(T+F)=!1=F. In order to ensure the summation works as we want, we need

(!is.na(hwGrades$hw1)) + (!is.na(hwGrades$hw2)) + (!is.na(hwGrades$Exercise3))
To be short,

If you want ! to give you number for operation, make sure to put parenthesis around the logical contents that you want to negate on.