Chapter 10 Exploratory Data Analysis, Part One

10.1

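The author frames this set of micro-books as the complete walk-through of one case study. In Chapter 1, the author showed how the statistical programming language R can download a data set published on the Internet, even one that looks like a jumble of numbers with a few lines of explanatory text in the page header. After downloading, inspecting, and trimming, the result was identical to the Boston data set "Boston" shipped in MASS, one of R's recommended packages.

So why was this data set created in the first place?

In Chapter 1 we found that several packages include the Boston data set: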

  1. MASS
  2. mlbench
  3. spData

Opening their documentation turns up some clues:

  1. Housing data for 506 census tracts of Boston from the 1970 census. The dataframe BostonHousing contains the original data by Harrison and Rubinfeld (1978), the dataframe BostonHousing2 the corrected version with additional spatial information (see references below).
  2. It contains the Harrison and Rubinfeld (1978) data corrected for a few minor errors and augmented with the latitude and longitude of the observations.

Both clues mention the same article, Harrison and Rubinfeld (1978):

Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.

The author followed the trail to the article's landing page. The abstract there is public and free; anyone can read it and copy it:

10.1.1 Abstract of Harrison and Rubinfeld (1978)

This paper investigates the methodological problems associated with the use of housing market data to measure the willingness to pay for clean air. With the use of a hedonic housing price model and data for the Boston metropolitan area, quantitative estimates of the willingness to pay for air quality improvements are generated. Marginal air pollution damages (as revealed in the housing market) are found to increase with the level of air pollution and with household income. The results are relatively sensitive to the specification of the hedonic housing price equation, but insensitive to the specification of the air quality demand equation.

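The author studied mathematics, statistics, and computing as an undergraduate. Faced with a passage of specialist English like the one above, the author's way of getting past the mental block is to

look first for the English words that belong to mathematics, statistics, and computing.

Those words are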

  1. data
  2. model
  3. quantitative
  4. estimates
  5. increase
  6. sensitive
  7. equation
  8. insensitive

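After staring at these eight English words and thinking for a few minutes, the author decided to find out

what a "hedonic housing price model" is.

10.1.2 What Is a "Hedonic Housing Price Model"?

Hand these words

hedonic housing price model

to Google Search and a definition of "Hedonic Price Model" turns up. It is an entry in the Encyclopedia of Quality of Life and Well-Being Research, published by Springer. So what exactly is a "Hedonic Price Model"?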

Hedonic pricing treats a marketed good, usually a house, as a sum of individual goods (characteristics or attributes) that cannot be sold separately in the market. The main objective of a hedonic pricing model is to estimate the contribution of such characteristics or attributes to the price of house. This is why they have become a core strategy to estimate the implicit prices of nonmarketable goods.

Again there are several English words tied to the author's own field:

  1. characteristics
  2. attributes
  3. model
  4. estimate
  5. strategy
  6. implicit

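Having gone through the steps above, and drawing on the author's own library of micro-books, the author decided to frame this case-study micro-book, "BOSTON", as

a search for the best "hedonic housing price model" for the Boston metropolitan area in the 1970s.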
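As a rough orientation only, a hedonic price model in its simplest linear form can be written as below. This is a generic sketch, not the specification used by Harrison and Rubinfeld (1978); the symbols $P_i$, $x_{ij}$, $\beta_j$, and $\varepsilon_i$ are notation assumed here for illustration.

```latex
P_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i ,
\qquad
\frac{\partial P_i}{\partial x_{ij}} = \beta_j
```

Here $P_i$ is the price of house $i$ and $x_{ij}$ is its $j$-th attribute. In this linear form, $\beta_j$ plays the role of the implicit price of attribute $j$, which is the quantity the definition quoted above says a hedonic model tries to estimate.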

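10.2 Learning Objectives

10.3 Preparation: Another Look at the Boston Data Set as a Table

To reach the "best hedonic housing price model", and to let R do the work for us, the author first takes another look at how the Boston data set lives in R, and along the way introduces the environment R provides for a table. Readers, please remember first that

R calls a table a "data.frame".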
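To make the idea concrete, here is a minimal sketch that builds a small table by hand. The name `small` is hypothetical; the numbers are the first eight rows of crim, zn, and medv as printed later in this chapter.

```r
# A hand-built table with three columns and eight rows.
small <- data.frame(
  crim = c(0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455),
  zn   = c(18, 0, 0, 0, 0, 0, 12.5, 12.5),
  medv = c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1)
)

class(small)   # "data.frame"
dim(small)     # 8 rows, 3 columns
str(small)     # one line per column: name, type, first values
```

Each column is a vector of one type, and all columns have the same length; that is all a data.frame requires.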

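10.3.1 Preparing the Boston Data Set

This time we do not download, scrape, or rake the data ourselves. We ask the package "MASS" for help and use "data" to call up the Boston data set "Boston". To leave the original data set untouched, we keep a working copy in "boston".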

require(MASS)
data(Boston)
boston <- Boston

For a small demonstration, we keep only the first eight observations of "boston":

boston <- boston[1:8,]

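10.3.2 Getting to Know R's Table, the data.frame

First, confirm whether "boston" is a table (data.frame).

While writing this, it occurred to the author that, in keeping with a personal conviction that learning to develop is the essence of university education, this micro-book on the Boston house-price model will show not only correct code but also incorrect code, as well as code illustrating that many roads lead to Rome.

# First, confirm whether `boston` is a table (`data.frame`).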
class(boston)
## [1] "data.frame"
class(boston) == "data.frame"
## [1] TRUE
is.data.frame(boston)
## [1] TRUE

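Having confirmed that it is a table (data.frame), we want to know how big it is. The next snippet is meant to demonstrate "if" and "else", the most basic "if the condition is true do this, otherwise do that" construct. The author assumes that we

  1. are handling "Boston" from the package "MASS" for the first time and do not know whether it is a table;
  2. do not know that the function "is.data.frame" exists;
  3. know only the function "class" (introduced in Chapter 1).

This is a defensive way to write the code. Moreover, "boston" is not a reserved word: in an R session, "boston" need not have only one meaning or refer to only one thing.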

if(class(boston) == "data.frame"){ # assumes class(boston) returns a single value
  dim(boston)
} else {
  cat("\"boston\" is not a table.")
}
## [1]  8 14
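One caveat on the comparison `class(x) == "data.frame"`: `class()` may return more than one value, and then the comparison yields more than one TRUE/FALSE, which `if` cannot use as a single condition. A minimal sketch, using a hypothetical object `x` and a hypothetical helper `table_size` that are not part of the original text:

```r
# A hypothetical table that carries an extra class label.
x <- data.frame(a = 1:3)
class(x) <- c("my_table", "data.frame")

class(x) == "data.frame"    # two values: FALSE TRUE
inherits(x, "data.frame")   # one value: TRUE

# Hypothetical helper: report the size only when x really is a table.
table_size <- function(x) {
  if (inherits(x, "data.frame")) dim(x) else NULL
}
table_size(x)         # 3 1
table_size("boston")  # NULL, because a character string is not a table
```

`inherits()` always returns a single TRUE or FALSE, so it is a safer condition for `if` when the origin of an object is unknown.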

If we know exactly where "boston" came from, this alone tells us how big the table is:

dim(boston)
## [1]  8 14

The two numbers, in order, say that the table "boston" holds 8 observations and 14 variables. The labels of these 8 observations are

rownames(boston)
## [1] "1" "2" "3" "4" "5" "6" "7" "8"

Observation labels of this form are usually R's defaults. Next, the column names of "boston", from left to right, are

colnames(boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

As for what each column, that is, each variable, holds, the candidates are

  1. integer
  2. real number (numeric)
  3. complex
  4. text (character)
  5. logical
  6. factor

So which one is it?

sapply(boston, class)
##      crim        zn     indus      chas       nox        rm       age       dis 
## "numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric" "numeric" 
##       rad       tax   ptratio     black     lstat      medv 
## "integer" "numeric" "numeric" "numeric" "numeric" "numeric"

Output like this, neither large nor small, is not pleasant or easy for most people to read. The variables in "boston" hold only numeric or integer values:

unique(sapply(boston, class))
## [1] "numeric" "integer"

So which variables are integers?

colnames(boston)[which(sapply(boston, class) == "integer")]
## [1] "chas" "rad"

Which variables are real numbers?

colnames(boston)[which(sapply(boston, class) == "numeric")]
##  [1] "crim"    "zn"      "indus"   "nox"     "rm"      "age"     "dis"    
##  [8] "tax"     "ptratio" "black"   "lstat"   "medv"
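The two lookups above can also be done in one step by grouping the column names by class. A minimal sketch on a hypothetical four-column stand-in table `df` (in the chapter, `boston` would take its place); it assumes every column has exactly one class, as is the case here:

```r
# A small stand-in table with two numeric and two integer columns.
df <- data.frame(
  crim = c(0.00632, 0.02731),
  chas = c(0L, 0L),
  rad  = c(1L, 2L),
  medv = c(24.0, 21.6)
)

cls <- sapply(df, class)            # named character vector: column -> class
by_class <- split(names(cls), cls)  # group the column names by class
by_class$integer                    # "chas" "rad"
by_class$numeric                    # "crim" "medv"
```

`split()` returns a list with one element per class, so a table with more kinds of columns needs no extra code.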

10.3.3 Extracting or Changing Part of the Table "boston"

  1. The first observation
boston[1,]
##      crim zn indus chas   nox    rm  age  dis rad tax ptratio black lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.09   1 296    15.3 396.9  4.98   24
boston["1",]
##      crim zn indus chas   nox    rm  age  dis rad tax ptratio black lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.09   1 296    15.3 396.9  4.98   24
  2. The contents (measurements) of the first column, the first variable
boston[,1]
## [1] 0.00632 0.02731 0.02729 0.03237 0.06905 0.02985 0.08829 0.14455
boston[,"crim"]
## [1] 0.00632 0.02731 0.02729 0.03237 0.06905 0.02985 0.08829 0.14455
boston[[1]]
## [1] 0.00632 0.02731 0.02729 0.03237 0.06905 0.02985 0.08829 0.14455
boston[["crim"]]
## [1] 0.00632 0.02731 0.02729 0.03237 0.06905 0.02985 0.08829 0.14455
boston$crim
## [1] 0.00632 0.02731 0.02729 0.03237 0.06905 0.02985 0.08829 0.14455
  3. The first column, the first variable, as a sub-table
boston[1]
##      crim
## 1 0.00632
## 2 0.02731
## 3 0.02729
## 4 0.03237
## 5 0.06905
## 6 0.02985
## 7 0.08829
## 8 0.14455
boston["crim"]
##      crim
## 1 0.00632
## 2 0.02731
## 3 0.02729
## 4 0.03237
## 5 0.06905
## 6 0.02985
## 7 0.08829
## 8 0.14455
  4. The measurement in the first cell
boston[1,1]
## [1] 0.00632
boston["1",1]
## [1] 0.00632
boston[1,"crim"]
## [1] 0.00632
boston["1","crim"]
## [1] 0.00632
  5. Extracting a sub-table
boston[1:3]
##      crim   zn indus
## 1 0.00632 18.0  2.31
## 2 0.02731  0.0  7.07
## 3 0.02729  0.0  7.07
## 4 0.03237  0.0  2.18
## 5 0.06905  0.0  2.18
## 6 0.02985  0.0  2.18
## 7 0.08829 12.5  7.87
## 8 0.14455 12.5  7.87
boston[c(1,2,3)]
##      crim   zn indus
## 1 0.00632 18.0  2.31
## 2 0.02731  0.0  7.07
## 3 0.02729  0.0  7.07
## 4 0.03237  0.0  2.18
## 5 0.06905  0.0  2.18
## 6 0.02985  0.0  2.18
## 7 0.08829 12.5  7.87
## 8 0.14455 12.5  7.87
boston[colnames(boston)[1:3]]
##      crim   zn indus
## 1 0.00632 18.0  2.31
## 2 0.02731  0.0  7.07
## 3 0.02729  0.0  7.07
## 4 0.03237  0.0  2.18
## 5 0.06905  0.0  2.18
## 6 0.02985  0.0  2.18
## 7 0.08829 12.5  7.87
## 8 0.14455 12.5  7.87
boston[colnames(boston)[c(1,2,3)]]
##      crim   zn indus
## 1 0.00632 18.0  2.31
## 2 0.02731  0.0  7.07
## 3 0.02729  0.0  7.07
## 4 0.03237  0.0  2.18
## 5 0.06905  0.0  2.18
## 6 0.02985  0.0  2.18
## 7 0.08829 12.5  7.87
## 8 0.14455 12.5  7.87
boston[c("crim","zn","indus")]
##      crim   zn indus
## 1 0.00632 18.0  2.31
## 2 0.02731  0.0  7.07
## 3 0.02729  0.0  7.07
## 4 0.03237  0.0  2.18
## 5 0.06905  0.0  2.18
## 6 0.02985  0.0  2.18
## 7 0.08829 12.5  7.87
## 8 0.14455 12.5  7.87
  6. Changing the order of observations
boston[c(3,2,1),]
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
##   medv
## 3 34.7
## 2 21.6
## 1 24.0
  7. Changing the order of columns
boston[c("zn","crim","indus")]
##     zn    crim indus
## 1 18.0 0.00632  2.31
## 2  0.0 0.02731  7.07
## 3  0.0 0.02729  7.07
## 4  0.0 0.03237  2.18
## 5  0.0 0.06905  2.18
## 6  0.0 0.02985  2.18
## 7 12.5 0.08829  7.87
## 8 12.5 0.14455  7.87
  8. Stacking tables vertically
boston[3,]
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
##   medv
## 3 34.7
boston[2,]
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio black lstat
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.9  9.14
##   medv
## 2 21.6
boston[1,]
##      crim zn indus chas   nox    rm  age  dis rad tax ptratio black lstat medv
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.09   1 296    15.3 396.9  4.98   24
rbind(boston[3,], boston[2,], boston[1,])
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
##   medv
## 3 34.7
## 2 21.6
## 1 24.0
  9. Joining single columns side by side
cbind(boston[,3,drop=FALSE],
      boston[,2,drop=FALSE],
      boston[,1,drop=FALSE])
##   indus   zn    crim
## 1  2.31 18.0 0.00632
## 2  7.07  0.0 0.02731
## 3  7.07  0.0 0.02729
## 4  2.18  0.0 0.03237
## 5  2.18  0.0 0.06905
## 6  2.18  0.0 0.02985
## 7  7.87 12.5 0.08829
## 8  7.87 12.5 0.14455
  10. Joining multiple columns side by side
cbind(boston[,c(1,2,3)],boston[,c(2,3,1)],boston[,c(3,2,1)])
##      crim   zn indus   zn indus    crim indus   zn    crim
## 1 0.00632 18.0  2.31 18.0  2.31 0.00632  2.31 18.0 0.00632
## 2 0.02731  0.0  7.07  0.0  7.07 0.02731  7.07  0.0 0.02731
## 3 0.02729  0.0  7.07  0.0  7.07 0.02729  7.07  0.0 0.02729
## 4 0.03237  0.0  2.18  0.0  2.18 0.03237  2.18  0.0 0.03237
## 5 0.06905  0.0  2.18  0.0  2.18 0.06905  2.18  0.0 0.06905
## 6 0.02985  0.0  2.18  0.0  2.18 0.02985  2.18  0.0 0.02985
## 7 0.08829 12.5  7.87 12.5  7.87 0.08829  7.87 12.5 0.08829
## 8 0.14455 12.5  7.87 12.5  7.87 0.14455  7.87 12.5 0.14455
  11. Changing an observation label
rownames(boston)[1] <- "-1"
boston
##       crim   zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## -1 0.00632 18.0  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2  0.02731  0.0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3  0.02729  0.0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4  0.03237  0.0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5  0.06905  0.0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6  0.02985  0.0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
## 7  0.08829 12.5  7.87    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60 12.43
## 8  0.14455 12.5  7.87    0 0.524 6.172 96.1 5.9505   5 311    15.2 396.90 19.15
##    medv
## -1 24.0
## 2  21.6
## 3  34.7
## 4  33.4
## 5  36.2
## 6  28.7
## 7  22.9
## 8  27.1
  12. Changing a column name, that is, a variable name (the new name "犯罪率" means "crime rate")
colnames(boston)[1] <- "犯罪率"
boston
##     犯罪率   zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## -1 0.00632 18.0  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2  0.02731  0.0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3  0.02729  0.0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4  0.03237  0.0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5  0.06905  0.0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6  0.02985  0.0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
## 7  0.08829 12.5  7.87    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60 12.43
## 8  0.14455 12.5  7.87    0 0.524 6.172 96.1 5.9505   5 311    15.2 396.90 19.15
##    medv
## -1 24.0
## 2  21.6
## 3  34.7
## 4  33.4
## 5  36.2
## 6  28.7
## 7  22.9
## 8  27.1
  13. Changing the measurement in one cell
boston[1,1] <- -1.23
boston
##      犯罪率   zn indus chas   nox    rm  age    dis rad tax ptratio  black
## -1 -1.23000 18.0  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2   0.02731  0.0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3   0.02729  0.0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4   0.03237  0.0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5   0.06905  0.0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6   0.02985  0.0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
## 7   0.08829 12.5  7.87    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60
## 8   0.14455 12.5  7.87    0 0.524 6.172 96.1 5.9505   5 311    15.2 396.90
##    lstat medv
## -1  4.98 24.0
## 2   9.14 21.6
## 3   4.03 34.7
## 4   2.94 33.4
## 5   5.33 36.2
## 6   5.21 28.7
## 7  12.43 22.9
## 8  19.15 27.1
  14. Changing the contents (measurements) of one variable
boston$zn <- as.logical(boston$zn)
boston
##      犯罪率    zn indus chas   nox    rm  age    dis rad tax ptratio  black
## -1 -1.23000  TRUE  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2   0.02731 FALSE  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3   0.02729 FALSE  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4   0.03237 FALSE  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5   0.06905 FALSE  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6   0.02985 FALSE  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
## 7   0.08829  TRUE  7.87    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60
## 8   0.14455  TRUE  7.87    0 0.524 6.172 96.1 5.9505   5 311    15.2 396.90
##    lstat medv
## -1  4.98 24.0
## 2   9.14 21.6
## 3   4.03 34.7
## 4   2.94 33.4
## 5   5.33 36.2
## 6   5.21 28.7
## 7  12.43 22.9
## 8  19.15 27.1
  15. Dropping one observation
boston[-1,]
##    犯罪率    zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 2 0.02731 FALSE  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729 FALSE  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237 FALSE  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905 FALSE  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985 FALSE  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
## 7 0.08829  TRUE  7.87    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60 12.43
## 8 0.14455  TRUE  7.87    0 0.524 6.172 96.1 5.9505   5 311    15.2 396.90 19.15
##   medv
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
## 7 22.9
## 8 27.1
  16. Dropping one variable
boston[,-1]
##       zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat medv
## -1  TRUE  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
## 2  FALSE  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
## 3  FALSE  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
## 4  FALSE  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
## 5  FALSE  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
## 6  FALSE  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
## 7   TRUE  7.87    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60 12.43 22.9
## 8   TRUE  7.87    0 0.524 6.172 96.1 5.9505   5 311    15.2 396.90 19.15 27.1
  17. Producing an empty table
boston[integer(0),]
##  [1] 犯罪率  zn      indus   chas    nox     rm      age     dis     rad    
## [10] tax     ptratio black   lstat   medv   
## <0 rows> (or 0-length row.names)
  18. Sorting by house price, from low to high
boston[order(boston$medv),]
##      犯罪率    zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 2   0.02731 FALSE  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 7   0.08829  TRUE  7.87    0 0.524 6.012 66.6 5.5605   5 311    15.2 395.60
## -1 -1.23000  TRUE  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 8   0.14455  TRUE  7.87    0 0.524 6.172 96.1 5.9505   5 311    15.2 396.90
## 6   0.02985 FALSE  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
## 4   0.03237 FALSE  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 3   0.02729 FALSE  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 5   0.06905 FALSE  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
##    lstat medv
## 2   9.14 21.6
## 7  12.43 22.9
## -1  4.98 24.0
## 8  19.15 27.1
## 6   5.21 28.7
## 4   2.94 33.4
## 3   4.03 34.7
## 5   5.33 36.2
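The examples above mix several extraction styles that return different kinds of objects. A minimal sketch on a hypothetical two-column table `tb` (not the chapter's data) shows which ones return a plain vector and which ones keep a table:

```r
tb <- data.frame(a = c(3, 1, 2), b = c("x", "y", "z"))

v1 <- tb[, 1]                # vector: the comma form drops to a vector for one column
v2 <- tb[[1]]                # vector
v3 <- tb$a                   # vector
t1 <- tb[1]                  # data.frame with one column
t2 <- tb[, 1, drop = FALSE]  # data.frame with one column

is.data.frame(v1)  # FALSE
is.data.frame(t1)  # TRUE
is.data.frame(t2)  # TRUE

# Negative indices and order() return new tables; tb itself is unchanged
# unless the result is assigned back.
tb[-1, ]
tb[order(tb$a), ]
nrow(tb)           # still 3
```

This is also why the side-by-side joins above use `drop = FALSE`: without it, a single column would arrive at `cbind` as a bare vector and lose its column name.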

Of course R can do far more with a table than this. Readers who want to learn more about tables, that is, "data.frame", can start from the following page:

knitr::include_url("https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame")

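10.3.4 Classroom Exercises

10.4 A Closer Look at the House-Price Variable in the Boston Data Set

This case-study micro-book, "BOSTON", is framed as "a search for the best 'hedonic housing price model' for the Boston metropolitan area in the 1970s". In keeping with the author's firm belief that

learning to "develop" is the real essence of higher education,

the author recalled a proof technique learned and often used as an undergraduate,

This "1-2-3" means

start everything from the simple cases.

We want "a search for the best 'hedonic housing price model' for the Boston metropolitan area in the 1970s", yet the author has never lived in Boston, let alone the Boston of the 1970s, and knows nothing at all about Boston house prices.

What to do?

So we first extract and look at the 14th column of "MASS::Boston", medv, the "median house value in thousands of US dollars". For the column description of medv, readers should consult the MASS documentation. We then define the first question

No solution! Writing this, the author recalled what was written in the preface (旭、續、序):

What does "statistics" actually teach, and what do we learn from it? The author believes it is about "collecting evidence, examining evidence, and confirming evidence".

"Collecting evidence" means "gathering data".

"Examining evidence" means "analyzing data".

"Confirming evidence" means "settling what is and is not so".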

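This micro-book is being written in 2021, as the pandemic keeps flaring up again. We pay tribute to the Americans of the 1970s, our predecessors, who carefully collected the census data; and thanks to the efforts of many scholars and experts, we can still make use of the Boston data set more than 50 years, half a century, later. In other words, in 2021 we do not have to go through the first stage of statistics,

gathering data,

and can go straight to the second stage,

analyzing data.

It is as if "we of the present" and "we of the past" were dividing the work: "we of the past" worked hard to collect and organize the data, and "we of the present" go on to get to know, understand, and see through the data. So the first question is

Q1: Get to know the data in the variable medv.

Which knife, which sword, should the author prepare in order to get to know the data collected in medv? Besides this BOSTON case-study micro-book, the author is also compiling a micro-book library on learning statistics and R together (學統計同時學R), which tries to set out on the road of learning and teaching elementary statistics with the stem-and-leaf plot proposed by Arthur Bowley in the 1900s. Readers who are not familiar with the stem-and-leaf plot should visit 話說從莖葉圖 to meet it and get acquainted. Once we have the stem-and-leaf plot of medv, we can revise the wording of the first question once more:

Q1: Get to know the distribution of the variable medv.

Why can it be rewritten this way? Why is "data" replaced by "distribution"? And what is a "distribution"? Readers, please read on.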

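10.4.1 Extracting the Variable medv and Checking Its Basic Information in R

  1. Load the Boston data set from the package MASS
require(MASS) # "require" returns FALSE with a warning when MASS is not installed, whereas "library" stops with an error; neither one reloads a package that is already attached.
data(Boston)
boston <- Boston 
# Leaving the original data set untouched, however "original" it may be, is an excellent habit in R programming,
# even though an ordinary user cannot casually change the contents of the package MASS anyway.
  2. Extract the data of the variable medv
medv <- boston[,"medv"] # Inside the brackets, what comes before the comma selects rows and what comes after it selects columns.
medv # Print medv to see what it looks like.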
##   [1] 24.0 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.0 18.9 21.7 20.4 18.2
##  [16] 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8 18.4 21.0
##  [31] 12.7 14.5 13.2 13.1 13.5 18.9 20.0 21.0 24.7 30.8 34.9 26.6 25.3 24.7 21.2
##  [46] 19.3 20.0 16.6 14.4 19.4 19.7 20.5 25.0 23.4 18.9 35.4 24.7 31.6 23.3 19.6
##  [61] 18.7 16.0 22.2 25.0 33.0 23.5 19.4 22.0 17.4 20.9 24.2 21.7 22.8 23.4 24.1
##  [76] 21.4 20.0 20.8 21.2 20.3 28.0 23.9 24.8 22.9 23.9 26.6 22.5 22.2 23.6 28.7
##  [91] 22.6 22.0 22.9 25.0 20.6 28.4 21.4 38.7 43.8 33.2 27.5 26.5 18.6 19.3 20.1
## [106] 19.5 19.5 20.4 19.8 19.4 21.7 22.8 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3
## [121] 22.0 20.3 20.5 17.3 18.8 21.4 15.7 16.2 18.0 14.3 19.2 19.6 23.0 18.4 15.6
## [136] 18.1 17.4 17.1 13.3 17.8 14.0 14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4
## [151] 21.5 19.6 15.3 19.4 17.0 15.6 13.1 41.3 24.3 23.3 27.0 50.0 50.0 50.0 22.7
## [166] 25.0 50.0 23.8 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2
## [181] 39.8 36.2 37.9 32.5 26.4 29.6 50.0 32.0 29.8 34.9 37.0 30.5 36.4 31.1 29.1
## [196] 50.0 33.3 30.3 34.6 34.9 32.9 24.1 42.3 48.5 50.0 22.6 24.4 22.5 24.4 20.0
## [211] 21.7 19.3 22.4 28.1 23.7 25.0 23.3 28.7 21.5 23.0 26.7 21.7 27.5 30.1 44.8
## [226] 50.0 37.6 31.6 46.7 31.5 24.3 31.7 41.7 48.3 29.0 24.0 25.1 31.5 23.7 23.3
## [241] 22.0 20.1 22.2 23.7 17.6 18.5 24.3 20.5 24.5 26.2 24.4 24.8 29.6 42.8 21.9
## [256] 20.9 44.0 50.0 36.0 30.1 33.8 43.1 48.8 31.0 36.5 22.8 30.7 50.0 43.5 20.7
## [271] 21.1 25.2 24.4 35.2 32.4 32.0 33.2 33.1 29.1 35.1 45.4 35.4 46.0 50.0 32.2
## [286] 22.0 20.1 23.2 22.3 24.8 28.5 37.3 27.9 23.9 21.7 28.6 27.1 20.3 22.5 29.0
## [301] 24.8 22.0 26.4 33.1 36.1 28.4 33.4 28.2 22.8 20.3 16.1 22.1 19.4 21.6 23.8
## [316] 16.2 17.8 19.8 23.1 21.0 23.8 23.1 20.4 18.5 25.0 24.6 23.0 22.2 19.3 22.6
## [331] 19.8 17.1 19.4 22.2 20.7 21.1 19.5 18.5 20.6 19.0 18.7 32.7 16.5 23.9 31.2
## [346] 17.5 17.2 23.1 24.5 26.6 22.9 24.1 18.6 30.1 18.2 20.6 17.8 21.7 22.7 22.6
## [361] 25.0 19.9 20.8 16.8 21.9 27.5 21.9 23.1 50.0 50.0 50.0 50.0 50.0 13.8 13.8
## [376] 15.0 13.9 13.3 13.1 10.2 10.4 10.9 11.3 12.3  8.8  7.2 10.5  7.4 10.2 11.5
## [391] 15.1 23.2  9.7 13.8 12.7 13.1 12.5  8.5  5.0  6.3  5.6  7.2 12.1  8.3  8.5
## [406]  5.0 11.9 27.9 17.2 27.5 15.0 17.2 17.9 16.3  7.0  7.2  7.5 10.4  8.8  8.4
## [421] 16.7 14.2 20.8 13.4 11.7  8.3 10.2 10.9 11.0  9.5 14.5 14.1 16.1 14.3 11.7
## [436] 13.4  9.6  8.7  8.4 12.8 10.5 17.1 18.4 15.4 10.8 11.8 14.9 12.6 14.1 13.0
## [451] 13.4 15.2 16.1 17.8 14.9 14.1 12.7 13.5 14.9 20.0 16.4 17.7 19.5 20.2 21.4
## [466] 19.9 19.0 19.1 19.1 20.1 19.9 19.6 23.2 29.8 13.8 13.3 16.7 12.0 14.6 21.4
## [481] 23.0 23.7 25.0 21.8 20.6 21.2 19.1 20.6 15.2  7.0  8.1 13.6 20.1 21.8 24.5
## [496] 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9 22.0 11.9
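# medv now stands on its own and is no longer part of boston; boston, of course, still has a column named medv holding the same data.

Is the median house value we extracted into medv really identical to boston$medv? Could R have made a mistake? After all, someone once said something like

there is no such thing as "human error", only "program error".

Here is our check:

medv == boston$medv # We will see a "sea of TRUE", meaning R has made no "program error".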
##   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
##  [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [256] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [286] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [316] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [346] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [376] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [391] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [406] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [421] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [436] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [451] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [466] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [481] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [496] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
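sum(medv == boston$medv) # If they are identical, we will see 506.
## [1] 506
sum(medv != boston$medv) # If they are identical, we will see 0.
## [1] 0
  3. Check the basic information of medv in R
class(medv) # Get the type of medv's contents. The answer is "numeric": real numbers with a decimal part.
## [1] "numeric"
dim(medv) # medv is just one column, one variable, of the Boston table, so it is a plain vector of numbers with no dimensions. The result is NULL.
## NULL
NROW(medv) # Get the sample size of medv. The answer is 506.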
## [1] 506
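Scanning 506 TRUE values by eye is one way to check; a more compact sketch is given below. It uses a short hypothetical vector `x` and a copy `y` in place of `boston$medv` and `medv`.

```r
x <- c(24.0, 21.6, 34.7, 33.4)  # hypothetical stand-in for boston$medv
y <- x                          # hypothetical stand-in for the extracted medv

all(x == y)       # TRUE only if every element matches
identical(x, y)   # TRUE: same type, same length, same values

# For numbers produced by arithmetic, exact comparison can be too strict;
# all.equal() compares within a small tolerance instead.
0.1 + 0.2 == 0.3                   # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE

# A plain vector has no dim(); length() and NROW() both give its size.
dim(x)      # NULL
length(x)   # 4
NROW(x)     # 4
```

Copied values, as in this chapter, compare exactly; the tolerance-based check matters only once the numbers have been through calculations.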

10.4.2 Drawing the Stem-and-Leaf Plot of medv

require(aplpack)
stem.leaf(medv, unit = 0.1) # Every medv value carries one decimal place, so we set "unit = 0.1".
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 506
##     3     5 | 006
##     4     6 | 3
##    11     7 | 0022245
##    21     8 | 1334455788
##    24     9 | 567
##    34    10 | 2224455899
##    43    11 | 035778899
##    52    12 | 013567778
##    76    13 | 011112333444455668888899
##    94    14 | 011123344555668999
##   110    15 | 0001222344666667
##   126    16 | 0111223455667788
##   148    17 | 0111222344455567888889
##   173    18 | 0122233444555566777889999
##   210    19 | 0011112233333444444555566666778889999
##   246    20 | 000001111122333344445556666667788899
##   (31)   21 | 0001122222444445566777777788999
##   229    22 | 00000001222223344555666667788889999
##   194    23 | 0000111111122223333445667777888899999
##   157    24 | 0011123334444555667778888
##   132    25 | 00000000123
##   121    26 | 24456667
##   113    27 | 011555599
##   104    28 | 0124456777
##    94    29 | 0011466889
##    84    30 | 1113578
##    77    31 | 01255667
##    69    32 | 0024579
##    62    33 | 011223448
##    53    34 | 67999
##    48    35 | 1244
##    44    36 | 012245
##    38    37 | 02369
## HI: 38.7 39.8 41.3 41.7 42.3 42.8 43.1 43.5 43.8 44 44.8 45.4 46 46.7 48.3 48.5 48.8 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
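The package aplpack has to be installed separately. If it is not available, base R's `stem()` draws a plainer stem-and-leaf plot, without the depth column or the HI line. A minimal sketch, using only the first fifteen medv values printed earlier as a stand-in vector `x`; the `scale` value is an arbitrary choice for illustration:

```r
# First fifteen medv values, copied from the printout of medv.
x <- c(24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1,
       16.5, 18.9, 15.0, 18.9, 21.7, 20.4, 18.2)

stem(x)             # default layout chosen by R
stem(x, scale = 2)  # scale = 2 makes the plot roughly twice as long
```

With the full data, `stem(medv)` would be the corresponding call; its layout may differ from the aplpack plot because the two functions choose stems differently.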

Seeing the stem-and-leaf plot above, the author feels at peace, as if a weight had been lifted. Why? Look at

medv
##   [1] 24.0 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.0 18.9 21.7 20.4 18.2
##  [16] 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8 18.4 21.0
##  [31] 12.7 14.5 13.2 13.1 13.5 18.9 20.0 21.0 24.7 30.8 34.9 26.6 25.3 24.7 21.2
##  [46] 19.3 20.0 16.6 14.4 19.4 19.7 20.5 25.0 23.4 18.9 35.4 24.7 31.6 23.3 19.6
##  [61] 18.7 16.0 22.2 25.0 33.0 23.5 19.4 22.0 17.4 20.9 24.2 21.7 22.8 23.4 24.1
##  [76] 21.4 20.0 20.8 21.2 20.3 28.0 23.9 24.8 22.9 23.9 26.6 22.5 22.2 23.6 28.7
##  [91] 22.6 22.0 22.9 25.0 20.6 28.4 21.4 38.7 43.8 33.2 27.5 26.5 18.6 19.3 20.1
## [106] 19.5 19.5 20.4 19.8 19.4 21.7 22.8 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3
## [121] 22.0 20.3 20.5 17.3 18.8 21.4 15.7 16.2 18.0 14.3 19.2 19.6 23.0 18.4 15.6
## [136] 18.1 17.4 17.1 13.3 17.8 14.0 14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4
## [151] 21.5 19.6 15.3 19.4 17.0 15.6 13.1 41.3 24.3 23.3 27.0 50.0 50.0 50.0 22.7
## [166] 25.0 50.0 23.8 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2
## [181] 39.8 36.2 37.9 32.5 26.4 29.6 50.0 32.0 29.8 34.9 37.0 30.5 36.4 31.1 29.1
## [196] 50.0 33.3 30.3 34.6 34.9 32.9 24.1 42.3 48.5 50.0 22.6 24.4 22.5 24.4 20.0
## [211] 21.7 19.3 22.4 28.1 23.7 25.0 23.3 28.7 21.5 23.0 26.7 21.7 27.5 30.1 44.8
## [226] 50.0 37.6 31.6 46.7 31.5 24.3 31.7 41.7 48.3 29.0 24.0 25.1 31.5 23.7 23.3
## [241] 22.0 20.1 22.2 23.7 17.6 18.5 24.3 20.5 24.5 26.2 24.4 24.8 29.6 42.8 21.9
## [256] 20.9 44.0 50.0 36.0 30.1 33.8 43.1 48.8 31.0 36.5 22.8 30.7 50.0 43.5 20.7
## [271] 21.1 25.2 24.4 35.2 32.4 32.0 33.2 33.1 29.1 35.1 45.4 35.4 46.0 50.0 32.2
## [286] 22.0 20.1 23.2 22.3 24.8 28.5 37.3 27.9 23.9 21.7 28.6 27.1 20.3 22.5 29.0
## [301] 24.8 22.0 26.4 33.1 36.1 28.4 33.4 28.2 22.8 20.3 16.1 22.1 19.4 21.6 23.8
## [316] 16.2 17.8 19.8 23.1 21.0 23.8 23.1 20.4 18.5 25.0 24.6 23.0 22.2 19.3 22.6
## [331] 19.8 17.1 19.4 22.2 20.7 21.1 19.5 18.5 20.6 19.0 18.7 32.7 16.5 23.9 31.2
## [346] 17.5 17.2 23.1 24.5 26.6 22.9 24.1 18.6 30.1 18.2 20.6 17.8 21.7 22.7 22.6
## [361] 25.0 19.9 20.8 16.8 21.9 27.5 21.9 23.1 50.0 50.0 50.0 50.0 50.0 13.8 13.8
## [376] 15.0 13.9 13.3 13.1 10.2 10.4 10.9 11.3 12.3  8.8  7.2 10.5  7.4 10.2 11.5
## [391] 15.1 23.2  9.7 13.8 12.7 13.1 12.5  8.5  5.0  6.3  5.6  7.2 12.1  8.3  8.5
## [406]  5.0 11.9 27.9 17.2 27.5 15.0 17.2 17.9 16.3  7.0  7.2  7.5 10.4  8.8  8.4
## [421] 16.7 14.2 20.8 13.4 11.7  8.3 10.2 10.9 11.0  9.5 14.5 14.1 16.1 14.3 11.7
## [436] 13.4  9.6  8.7  8.4 12.8 10.5 17.1 18.4 15.4 10.8 11.8 14.9 12.6 14.1 13.0
## [451] 13.4 15.2 16.1 17.8 14.9 14.1 12.7 13.5 14.9 20.0 16.4 17.7 19.5 20.2 21.4
## [466] 19.9 19.0 19.1 19.1 20.1 19.9 19.6 23.2 29.8 13.8 13.3 16.7 12.0 14.6 21.4
## [481] 23.0 23.7 25.0 21.8 20.6 21.2 19.1 20.6 15.2  7.0  8.1 13.6 20.1 21.8 24.5
## [496] 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9 22.0 11.9

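Which way of presenting the numbers is more likely to put you and me at ease? From the author's experience in 話說從莖葉圖, the "HI" at the bottom of the stem-and-leaf plot means that the stem-and-leaf function found values it treats as "suspected outliers". This finding echoes a passage in the background notes of spData: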

Gilley and Pace (1996) also point out that MEDV is censored, in that median values at or over USD 50,000 are set to USD 50,000.

So we draw the stem-and-leaf plot again, but this time we ask it to include the "suspected outliers" in the plot itself.

stem.leaf(medv, unit = 0.1, trim.outliers = FALSE)
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 506
##     3     5 | 006
##     4     6 | 3
##    11     7 | 0022245
##    21     8 | 1334455788
##    24     9 | 567
##    34    10 | 2224455899
##    43    11 | 035778899
##    52    12 | 013567778
##    76    13 | 011112333444455668888899
##    94    14 | 011123344555668999
##   110    15 | 0001222344666667
##   126    16 | 0111223455667788
##   148    17 | 0111222344455567888889
##   173    18 | 0122233444555566777889999
##   210    19 | 0011112233333444444555566666778889999
##   246    20 | 000001111122333344445556666667788899
##   (31)   21 | 0001122222444445566777777788999
##   229    22 | 00000001222223344555666667788889999
##   194    23 | 0000111111122223333445667777888899999
##   157    24 | 0011123334444555667778888
##   132    25 | 00000000123
##   121    26 | 24456667
##   113    27 | 011555599
##   104    28 | 0124456777
##    94    29 | 0011466889
##    84    30 | 1113578
##    77    31 | 01255667
##    69    32 | 0024579
##    62    33 | 011223448
##    53    34 | 67999
##    48    35 | 1244
##    44    36 | 012245
##    38    37 | 02369
##    33    38 | 7
##    32    39 | 8
##          40 | 
##    31    41 | 37
##    29    42 | 38
##    27    43 | 158
##    24    44 | 08
##    22    45 | 4
##    21    46 | 07
##          47 | 
##    19    48 | 358
##          49 | 
##    16    50 | 0000000000000000

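This is a "single-stem stem-and-leaf plot", and what we see in it is the "distribution". The author has said before:

Mathematics has fractions; statistics has distributions. See the distribution, then master the distribution, and you have mastered statistics!

Having come this far, this is the question we really want to answer and clarify:

Q1: Get to know the distribution of the variable medv.

Based on the argument of Gilley and Pace (1996) and the single-stem stem-and-leaf plot above, the author offers one more reminder, both as a personal note and for readers: on the way to answering "Q1: Get to know the distribution of the variable medv",

remember that any "median house value" at or over USD 50,000 is recorded as USD 50,000.

That would already violate Taiwan's actual-price registration law! We do not deal with the legal question here; what we need to answer is the statistical question. So the author first notes a few things:

  1. A "median house value" is the median of a group of house values, so about half of the houses in that group are worth more than this number, and we cannot tell by how much.
  2. Any "median house value" at or over USD 50,000 is recorded as USD 50,000. In statistics this is "censored data", and it also means that the values recorded as USD 50,000 may carry larger recording errors.
  3. The distribution of the "median house value" may be right-skewed (skewed to the right), or even extremely skewed to the right.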
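The censoring can be quantified by counting the values that sit exactly at the ceiling. A minimal sketch on seven consecutive medv values copied from the printout earlier in this section (observations 161 to 167); the names `x`, `ceiling_value`, and `at_ceiling` are hypothetical, and on the full data the same lines would be run on `medv`:

```r
x <- c(27.0, 50.0, 50.0, 50.0, 22.7, 25.0, 50.0)  # medv, observations 161 to 167
ceiling_value <- 50                               # USD 50,000 in units of USD 1,000

at_ceiling <- x >= ceiling_value
sum(at_ceiling)    # number of values at the ceiling: 4
mean(at_ceiling)   # their share of this small sample
which(at_ceiling)  # their positions within x: 2 3 4 7
```

In the full data, the stem-and-leaf plot above shows 16 such values on the stem for 50.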

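10.4.3 Building the Frequency Table of medv

From the single-stem stem-and-leaf plot above we can read off the following (for the background needed, readers should visit 話說從莖葉圖 and get to know the stem-and-leaf plot well):

  1. The range (4.9, 5.9] holds 3 median house values.
  2. The range (5.9, 6.9] holds 4 - 3 = 1 median house value.
  3. The range (6.9, 7.9] holds 11 - 4 = 7 median house values.
  4. The range (47.9, 48.9] holds 19 - 16 = 3 median house values.
  5. The range (48.9, 49.9] holds 0 median house values.
  6. The range (49.9, 50.9] holds 16 median house values.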
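The subtractions above follow one rule: above the median line the depth column is a running total counted from the top, and below it a running total counted from the bottom, so successive differences recover the count on each stem. A minimal sketch with the depths copied from the plot (the variable names are hypothetical):

```r
# Depths of the three lowest stems (5, 6, 7), counted from the top.
top_depths <- c(3, 4, 11)
diff(c(0, top_depths))       # counts per stem: 3 1 7

# Depths of stems 48 and 50, counted from the bottom; stem 49 is empty
# and shows no depth in the plot.
bottom_depths <- c(19, 16)
-diff(c(bottom_depths, 0))   # counts for stems 48 and 50: 3 16
```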

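With these observations we can put together a frequency table in R; see 次數分配表 for details. But

can it be less trouble than that?

So the author called on Google Search once more and typed in

how to get the frequency table of a continuous variable in r

and found the following reference code: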

df$value.cut = cut(df$value, breaks=c(0, 25, 100))
with(df, table(value.cut, status, useNA='ifany'))

The author will not explain this "borrowed passage" for now, and instead adapts it based on experience:

table(cut(medv, breaks = seq(4.9, 50.9, 1.0)))
## 
##   (4.9,5.9]   (5.9,6.9]   (6.9,7.9]   (7.9,8.9]   (8.9,9.9]  (9.9,10.9] 
##           3           1           7          10           3          10 
## (10.9,11.9] (11.9,12.9] (12.9,13.9] (13.9,14.9] (14.9,15.9] (15.9,16.9] 
##           9           9          24          18          16          16 
## (16.9,17.9] (17.9,18.9] (18.9,19.9] (19.9,20.9] (20.9,21.9] (21.9,22.9] 
##          22          25          37          36          31          35 
## (22.9,23.9] (23.9,24.9] (24.9,25.9] (25.9,26.9] (26.9,27.9] (27.9,28.9] 
##          37          25          11           8           9          10 
## (28.9,29.9] (29.9,30.9] (30.9,31.9] (31.9,32.9] (32.9,33.9] (33.9,34.9] 
##          10           7           8           7           9           5 
## (34.9,35.9] (35.9,36.9] (36.9,37.9] (37.9,38.9] (38.9,39.9] (39.9,40.9] 
##           4           6           5           1           1           0 
## (40.9,41.9] (41.9,42.9] (42.9,43.9] (43.9,44.9] (44.9,45.9] (45.9,46.9] 
##           2           2           3           2           1           2 
## (46.9,47.9] (47.9,48.9] (48.9,49.9] (49.9,50.9] 
##           0           3           0          16

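It works! The counts match exactly what we read off the stem-and-leaf display above.

Here is the author's step-by-step breakdown:

seq(4.9, 50.9, 1.0) # Define the cut points of the numeric range.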
##  [1]  4.9  5.9  6.9  7.9  8.9  9.9 10.9 11.9 12.9 13.9 14.9 15.9 16.9 17.9 18.9
## [16] 19.9 20.9 21.9 22.9 23.9 24.9 25.9 26.9 27.9 28.9 29.9 30.9 31.9 32.9 33.9
## [31] 34.9 35.9 36.9 37.9 38.9 39.9 40.9 41.9 42.9 43.9 44.9 45.9 46.9 47.9 48.9
## [46] 49.9 50.9
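# Use the cut points to define left-open, right-closed intervals, then label each value with the interval it falls into.
# "Interval" is an English word the author learned in first-year calculus.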
cut(medv, breaks = seq(4.9, 50.9, 1.0)) 
##   [1] (23.9,24.9] (20.9,21.9] (33.9,34.9] (32.9,33.9] (35.9,36.9] (27.9,28.9]
##   [7] (21.9,22.9] (26.9,27.9] (15.9,16.9] (17.9,18.9] (14.9,15.9] (17.9,18.9]
##  [13] (20.9,21.9] (19.9,20.9] (17.9,18.9] (18.9,19.9] (22.9,23.9] (16.9,17.9]
##  [19] (19.9,20.9] (17.9,18.9] (12.9,13.9] (18.9,19.9] (14.9,15.9] (13.9,14.9]
##  [25] (14.9,15.9] (12.9,13.9] (15.9,16.9] (13.9,14.9] (17.9,18.9] (20.9,21.9]
##  [31] (11.9,12.9] (13.9,14.9] (12.9,13.9] (12.9,13.9] (12.9,13.9] (17.9,18.9]
##  [37] (19.9,20.9] (20.9,21.9] (23.9,24.9] (29.9,30.9] (33.9,34.9] (25.9,26.9]
##  [43] (24.9,25.9] (23.9,24.9] (20.9,21.9] (18.9,19.9] (19.9,20.9] (15.9,16.9]
##  [49] (13.9,14.9] (18.9,19.9] (18.9,19.9] (19.9,20.9] (24.9,25.9] (22.9,23.9]
##  [55] (17.9,18.9] (34.9,35.9] (23.9,24.9] (30.9,31.9] (22.9,23.9] (18.9,19.9]
##  [61] (17.9,18.9] (15.9,16.9] (21.9,22.9] (24.9,25.9] (32.9,33.9] (22.9,23.9]
##  [67] (18.9,19.9] (21.9,22.9] (16.9,17.9] (19.9,20.9] (23.9,24.9] (20.9,21.9]
##  [73] (21.9,22.9] (22.9,23.9] (23.9,24.9] (20.9,21.9] (19.9,20.9] (19.9,20.9]
##  [79] (20.9,21.9] (19.9,20.9] (27.9,28.9] (22.9,23.9] (23.9,24.9] (21.9,22.9]
##  [85] (22.9,23.9] (25.9,26.9] (21.9,22.9] (21.9,22.9] (22.9,23.9] (27.9,28.9]
##  [91] (21.9,22.9] (21.9,22.9] (21.9,22.9] (24.9,25.9] (19.9,20.9] (27.9,28.9]
##  [97] (20.9,21.9] (37.9,38.9] (42.9,43.9] (32.9,33.9] (26.9,27.9] (25.9,26.9]
## [103] (17.9,18.9] (18.9,19.9] (19.9,20.9] (18.9,19.9] (18.9,19.9] (19.9,20.9]
## [109] (18.9,19.9] (18.9,19.9] (20.9,21.9] (21.9,22.9] (17.9,18.9] (17.9,18.9]
## [115] (17.9,18.9] (17.9,18.9] (20.9,21.9] (18.9,19.9] (19.9,20.9] (18.9,19.9]
## [121] (21.9,22.9] (19.9,20.9] (19.9,20.9] (16.9,17.9] (17.9,18.9] (20.9,21.9]
## [127] (14.9,15.9] (15.9,16.9] (17.9,18.9] (13.9,14.9] (18.9,19.9] (18.9,19.9]
## [133] (22.9,23.9] (17.9,18.9] (14.9,15.9] (17.9,18.9] (16.9,17.9] (16.9,17.9]
## [139] (12.9,13.9] (16.9,17.9] (13.9,14.9] (13.9,14.9] (12.9,13.9] (14.9,15.9]
## [145] (10.9,11.9] (12.9,13.9] (14.9,15.9] (13.9,14.9] (16.9,17.9] (14.9,15.9]
## [151] (20.9,21.9] (18.9,19.9] (14.9,15.9] (18.9,19.9] (16.9,17.9] (14.9,15.9]
## [157] (12.9,13.9] (40.9,41.9] (23.9,24.9] (22.9,23.9] (26.9,27.9] (49.9,50.9]
## [163] (49.9,50.9] (49.9,50.9] (21.9,22.9] (24.9,25.9] (49.9,50.9] (22.9,23.9]
## [169] (22.9,23.9] (21.9,22.9] (16.9,17.9] (18.9,19.9] (22.9,23.9] (22.9,23.9]
## [175] (21.9,22.9] (28.9,29.9] (22.9,23.9] (23.9,24.9] (28.9,29.9] (36.9,37.9]
## [181] (38.9,39.9] (35.9,36.9] (36.9,37.9] (31.9,32.9] (25.9,26.9] (28.9,29.9]
## [187] (49.9,50.9] (31.9,32.9] (28.9,29.9] (33.9,34.9] (36.9,37.9] (29.9,30.9]
## [193] (35.9,36.9] (30.9,31.9] (28.9,29.9] (49.9,50.9] (32.9,33.9] (29.9,30.9]
## [199] (33.9,34.9] (33.9,34.9] (31.9,32.9] (23.9,24.9] (41.9,42.9] (47.9,48.9]
## [205] (49.9,50.9] (21.9,22.9] (23.9,24.9] (21.9,22.9] (23.9,24.9] (19.9,20.9]
## [211] (20.9,21.9] (18.9,19.9] (21.9,22.9] (27.9,28.9] (22.9,23.9] (24.9,25.9]
## [217] (22.9,23.9] (27.9,28.9] (20.9,21.9] (22.9,23.9] (25.9,26.9] (20.9,21.9]
## [223] (26.9,27.9] (29.9,30.9] (43.9,44.9] (49.9,50.9] (36.9,37.9] (30.9,31.9]
## [229] (45.9,46.9] (30.9,31.9] (23.9,24.9] (30.9,31.9] (40.9,41.9] (47.9,48.9]
## [235] (28.9,29.9] (23.9,24.9] (24.9,25.9] (30.9,31.9] (22.9,23.9] (22.9,23.9]
## [241] (21.9,22.9] (19.9,20.9] (21.9,22.9] (22.9,23.9] (16.9,17.9] (17.9,18.9]
## [247] (23.9,24.9] (19.9,20.9] (23.9,24.9] (25.9,26.9] (23.9,24.9] (23.9,24.9]
## [253] (28.9,29.9] (41.9,42.9] (20.9,21.9] (19.9,20.9] (43.9,44.9] (49.9,50.9]
## [259] (35.9,36.9] (29.9,30.9] (32.9,33.9] (42.9,43.9] (47.9,48.9] (30.9,31.9]
## [265] (35.9,36.9] (21.9,22.9] (29.9,30.9] (49.9,50.9] (42.9,43.9] (19.9,20.9]
## [271] (20.9,21.9] (24.9,25.9] (23.9,24.9] (34.9,35.9] (31.9,32.9] (31.9,32.9]
## [277] (32.9,33.9] (32.9,33.9] (28.9,29.9] (34.9,35.9] (44.9,45.9] (34.9,35.9]
## [283] (45.9,46.9] (49.9,50.9] (31.9,32.9] (21.9,22.9] (19.9,20.9] (22.9,23.9]
## [289] (21.9,22.9] (23.9,24.9] (27.9,28.9] (36.9,37.9] (26.9,27.9] (22.9,23.9]
## [295] (20.9,21.9] (27.9,28.9] (26.9,27.9] (19.9,20.9] (21.9,22.9] (28.9,29.9]
## [301] (23.9,24.9] (21.9,22.9] (25.9,26.9] (32.9,33.9] (35.9,36.9] (27.9,28.9]
## [307] (32.9,33.9] (27.9,28.9] (21.9,22.9] (19.9,20.9] (15.9,16.9] (21.9,22.9]
## [313] (18.9,19.9] (20.9,21.9] (22.9,23.9] (15.9,16.9] (16.9,17.9] (18.9,19.9]
## [319] (22.9,23.9] (20.9,21.9] (22.9,23.9] (22.9,23.9] (19.9,20.9] (17.9,18.9]
## [325] (24.9,25.9] (23.9,24.9] (22.9,23.9] (21.9,22.9] (18.9,19.9] (21.9,22.9]
## [331] (18.9,19.9] (16.9,17.9] (18.9,19.9] (21.9,22.9] (19.9,20.9] (20.9,21.9]
## [337] (18.9,19.9] (17.9,18.9] (19.9,20.9] (18.9,19.9] (17.9,18.9] (31.9,32.9]
## [343] (15.9,16.9] (22.9,23.9] (30.9,31.9] (16.9,17.9] (16.9,17.9] (22.9,23.9]
## [349] (23.9,24.9] (25.9,26.9] (21.9,22.9] (23.9,24.9] (17.9,18.9] (29.9,30.9]
## [355] (17.9,18.9] (19.9,20.9] (16.9,17.9] (20.9,21.9] (21.9,22.9] (21.9,22.9]
## [361] (24.9,25.9] (18.9,19.9] (19.9,20.9] (15.9,16.9] (20.9,21.9] (26.9,27.9]
## [367] (20.9,21.9] (22.9,23.9] (49.9,50.9] (49.9,50.9] (49.9,50.9] (49.9,50.9]
## [373] (49.9,50.9] (12.9,13.9] (12.9,13.9] (14.9,15.9] (12.9,13.9] (12.9,13.9]
## [379] (12.9,13.9] (9.9,10.9]  (9.9,10.9]  (9.9,10.9]  (10.9,11.9] (11.9,12.9]
## [385] (7.9,8.9]   (6.9,7.9]   (9.9,10.9]  (6.9,7.9]   (9.9,10.9]  (10.9,11.9]
## [391] (14.9,15.9] (22.9,23.9] (8.9,9.9]   (12.9,13.9] (11.9,12.9] (12.9,13.9]
## [397] (11.9,12.9] (7.9,8.9]   (4.9,5.9]   (5.9,6.9]   (4.9,5.9]   (6.9,7.9]  
## [403] (11.9,12.9] (7.9,8.9]   (7.9,8.9]   (4.9,5.9]   (10.9,11.9] (26.9,27.9]
## [409] (16.9,17.9] (26.9,27.9] (14.9,15.9] (16.9,17.9] (16.9,17.9] (15.9,16.9]
## [415] (6.9,7.9]   (6.9,7.9]   (6.9,7.9]   (9.9,10.9]  (7.9,8.9]   (7.9,8.9]  
## [421] (15.9,16.9] (13.9,14.9] (19.9,20.9] (12.9,13.9] (10.9,11.9] (7.9,8.9]  
## [427] (9.9,10.9]  (9.9,10.9]  (10.9,11.9] (8.9,9.9]   (13.9,14.9] (13.9,14.9]
## [433] (15.9,16.9] (13.9,14.9] (10.9,11.9] (12.9,13.9] (8.9,9.9]   (7.9,8.9]  
## [439] (7.9,8.9]   (11.9,12.9] (9.9,10.9]  (16.9,17.9] (17.9,18.9] (14.9,15.9]
## [445] (9.9,10.9]  (10.9,11.9] (13.9,14.9] (11.9,12.9] (13.9,14.9] (12.9,13.9]
## [451] (12.9,13.9] (14.9,15.9] (15.9,16.9] (16.9,17.9] (13.9,14.9] (13.9,14.9]
## [457] (11.9,12.9] (12.9,13.9] (13.9,14.9] (19.9,20.9] (15.9,16.9] (16.9,17.9]
## [463] (18.9,19.9] (19.9,20.9] (20.9,21.9] (18.9,19.9] (18.9,19.9] (18.9,19.9]
## [469] (18.9,19.9] (19.9,20.9] (18.9,19.9] (18.9,19.9] (22.9,23.9] (28.9,29.9]
## [475] (12.9,13.9] (12.9,13.9] (15.9,16.9] (11.9,12.9] (13.9,14.9] (20.9,21.9]
## [481] (22.9,23.9] (22.9,23.9] (24.9,25.9] (20.9,21.9] (19.9,20.9] (20.9,21.9]
## [487] (18.9,19.9] (19.9,20.9] (14.9,15.9] (6.9,7.9]   (7.9,8.9]   (12.9,13.9]
## [493] (19.9,20.9] (20.9,21.9] (23.9,24.9] (22.9,23.9] (18.9,19.9] (17.9,18.9]
## [499] (20.9,21.9] (16.9,17.9] (15.9,16.9] (21.9,22.9] (19.9,20.9] (22.9,23.9]
## [505] (21.9,22.9] (10.9,11.9]
## 46 Levels: (4.9,5.9] (5.9,6.9] (6.9,7.9] (7.9,8.9] (8.9,9.9] ... (49.9,50.9]
# Count how many values fall into each interval.
# That count is the "frequency" (頻次 in Chinese).
table(cut(medv, breaks = seq(4.9, 50.9, 1.0))) 
## 
##   (4.9,5.9]   (5.9,6.9]   (6.9,7.9]   (7.9,8.9]   (8.9,9.9]  (9.9,10.9] 
##           3           1           7          10           3          10 
## (10.9,11.9] (11.9,12.9] (12.9,13.9] (13.9,14.9] (14.9,15.9] (15.9,16.9] 
##           9           9          24          18          16          16 
## (16.9,17.9] (17.9,18.9] (18.9,19.9] (19.9,20.9] (20.9,21.9] (21.9,22.9] 
##          22          25          37          36          31          35 
## (22.9,23.9] (23.9,24.9] (24.9,25.9] (25.9,26.9] (26.9,27.9] (27.9,28.9] 
##          37          25          11           8           9          10 
## (28.9,29.9] (29.9,30.9] (30.9,31.9] (31.9,32.9] (32.9,33.9] (33.9,34.9] 
##          10           7           8           7           9           5 
## (34.9,35.9] (35.9,36.9] (36.9,37.9] (37.9,38.9] (38.9,39.9] (39.9,40.9] 
##           4           6           5           1           1           0 
## (40.9,41.9] (41.9,42.9] (42.9,43.9] (43.9,44.9] (44.9,45.9] (45.9,46.9] 
##           2           2           3           2           1           2 
## (46.9,47.9] (47.9,48.9] (48.9,49.9] (49.9,50.9] 
##           0           3           0          16
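Which end of each interval is closed matters whenever a value sits exactly on a cut point. The sketch below uses made-up values and integer cut points (both are assumptions chosen so that the boundary cases are unambiguous) to compare the default right-closed intervals with left-closed ones via the right argument of cut.

```r
# Hypothetical toy values; several sit exactly on a cut point.
toy <- c(10, 10, 15, 20, 25)
breaks <- c(0, 10, 20, 30)

right_closed <- table(cut(toy, breaks = breaks))                 # (a, b], the default
left_closed  <- table(cut(toy, breaks = breaks, right = FALSE))  # [a, b)

print(right_closed)
print(left_closed)
```

The chapter sidesteps this question by placing the cut points at 4.9, 5.9, and so on, where no recorded value with one decimal is meant to fall.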

Next, the author turns the table above into a data.frame.

x <- table(cut(medv, breaks = seq(4.9, 50.9, 1.0)))
x <- as.data.frame(x) # To convert something, think of the `as.` family.
x
##           Var1 Freq
## 1    (4.9,5.9]    3
## 2    (5.9,6.9]    1
## 3    (6.9,7.9]    7
## 4    (7.9,8.9]   10
## 5    (8.9,9.9]    3
## 6   (9.9,10.9]   10
## 7  (10.9,11.9]    9
## 8  (11.9,12.9]    9
## 9  (12.9,13.9]   24
## 10 (13.9,14.9]   18
## 11 (14.9,15.9]   16
## 12 (15.9,16.9]   16
## 13 (16.9,17.9]   22
## 14 (17.9,18.9]   25
## 15 (18.9,19.9]   37
## 16 (19.9,20.9]   36
## 17 (20.9,21.9]   31
## 18 (21.9,22.9]   35
## 19 (22.9,23.9]   37
## 20 (23.9,24.9]   25
## 21 (24.9,25.9]   11
## 22 (25.9,26.9]    8
## 23 (26.9,27.9]    9
## 24 (27.9,28.9]   10
## 25 (28.9,29.9]   10
## 26 (29.9,30.9]    7
## 27 (30.9,31.9]    8
## 28 (31.9,32.9]    7
## 29 (32.9,33.9]    9
## 30 (33.9,34.9]    5
## 31 (34.9,35.9]    4
## 32 (35.9,36.9]    6
## 33 (36.9,37.9]    5
## 34 (37.9,38.9]    1
## 35 (38.9,39.9]    1
## 36 (39.9,40.9]    0
## 37 (40.9,41.9]    2
## 38 (41.9,42.9]    2
## 39 (42.9,43.9]    3
## 40 (43.9,44.9]    2
## 41 (44.9,45.9]    1
## 42 (45.9,46.9]    2
## 43 (46.9,47.9]    0
## 44 (47.9,48.9]    3
## 45 (48.9,49.9]    0
## 46 (49.9,50.9]   16
colnames(x) <- c("範圍", "頻次")
head(x)
##         範圍 頻次
## 1  (4.9,5.9]    3
## 2  (5.9,6.9]    1
## 3  (6.9,7.9]    7
## 4  (7.9,8.9]   10
## 5  (8.9,9.9]    3
## 6 (9.9,10.9]   10
tail(x)
##           範圍 頻次
## 41 (44.9,45.9]    1
## 42 (45.9,46.9]    2
## 43 (46.9,47.9]    0
## 44 (47.9,48.9]    3
## 45 (48.9,49.9]    0
## 46 (49.9,50.9]   16
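Once the counts live in a data.frame, cumulative and percentage columns are one line each. The following is a sketch on made-up values; the column names CumFreq, Percent, and CumPercent are illustrative assumptions, not names used elsewhere in this chapter. The cumulative counts play the same role as the depths in the stem-and-leaf display.

```r
# Hypothetical toy values, binned with integer cut points.
toy <- c(3, 8, 12, 14, 15, 18, 19, 22, 25, 27)
x <- as.data.frame(table(cut(toy, breaks = seq(0, 30, by = 10))))

x$CumFreq    <- cumsum(x$Freq)                 # running total of counts
x$Percent    <- 100 * x$Freq / sum(x$Freq)     # share of each interval, in percent
x$CumPercent <- cumsum(x$Percent)              # running total of shares

print(x)
```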

The frequency table above mirrors the stem-and-leaf display. What if we want a frequency table over other intervals?

cbind(table(cut(medv, breaks = seq(4.9, 50.9, 3.0))))
##             [,1]
## (4.9,7.9]     11
## (7.9,10.9]    23
## (10.9,13.9]   42
## (13.9,16.9]   50
## (16.9,19.9]   84
## (19.9,22.9]  102
## (22.9,25.9]   73
## (25.9,28.9]   27
## (28.9,31.9]   25
## (31.9,34.9]   21
## (34.9,37.9]   15
## (37.9,40.9]    2
## (40.9,43.9]    7
## (43.9,46.9]    5
## (46.9,49.9]    3
cbind(table(cut(medv, breaks = seq(4.9, 50.9, 6.0))))
##             [,1]
## (4.9,10.9]    34
## (10.9,16.9]   92
## (16.9,22.9]  186
## (22.9,28.9]  100
## (28.9,34.9]   46
## (34.9,40.9]   17
## (40.9,46.9]   12
cbind(table(cut(medv, breaks = seq(4.9, 50.9, 8.0))))
##             [,1]
## (4.9,12.9]    52
## (12.9,20.9]  194
## (20.9,28.9]  166
## (28.9,36.9]   56
## (36.9,44.9]   16

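One caution before moving on: each of these three tables leaves some observations out. With a step of 3.0, seq(4.9, 50.9, 3.0) stops at 49.9, so the 16 values of 50.0 fall outside every interval; cut returns NA for them and table drops NA by default. The counts therefore add up to 490, 487, and 484 for steps of 3.0, 6.0, and 8.0, not to 506.

After this round we have some feel for the ranges of the median home values in the Boston metropolitan area in the 1970s and their frequencies. Next, the author wants to look at the relative frequency table, to see the share of each range.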
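Whether a set of cut points silently drops observations can be checked directly. The sketch below uses made-up values; the repair shown (generating enough cut points to pass the maximum) is one possible approach and an assumption of this example, not the chapter's method.

```r
# Hypothetical toy values; two of them sit at the cap of 50.
toy <- c(5, 12, 20, 50, 50)

# Same construction as in the text: the last cut point is 49.9, not 50.9.
breaks_short <- seq(4.9, 50.9, by = 3.0)
bins_short <- cut(toy, breaks = breaks_short)
n_dropped <- sum(is.na(bins_short))      # values outside every interval

# useNA = "ifany" makes the dropped values visible as an <NA> entry.
print(table(bins_short, useNA = "ifany"))

# One possible repair: build enough cut points to pass the maximum.
width <- 3.0
n_bins <- ceiling((max(toy) - 4.9) / width)
breaks_full <- 4.9 + width * (0:n_bins)
bins_full <- cut(toy, breaks = breaks_full)

print(table(bins_full))
```

Comparing sum(table(...)) with length(toy) is a cheap habit that catches this kind of loss early.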

cbind(prop.table(table(cut(medv, breaks = seq(4.9, 50.9, 3.0)))))*100.0
##                   [,1]
## (4.9,7.9]    2.2448980
## (7.9,10.9]   4.6938776
## (10.9,13.9]  8.5714286
## (13.9,16.9] 10.2040816
## (16.9,19.9] 17.1428571
## (19.9,22.9] 20.8163265
## (22.9,25.9] 14.8979592
## (25.9,28.9]  5.5102041
## (28.9,31.9]  5.1020408
## (31.9,34.9]  4.2857143
## (34.9,37.9]  3.0612245
## (37.9,40.9]  0.4081633
## (40.9,43.9]  1.4285714
## (43.9,46.9]  1.0204082
## (46.9,49.9]  0.6122449

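It works! Keep the denominator in mind, though: prop.table divides by the total of the table it is given, so these percentages are shares of the 490 values that fall inside the intervals (for example, 11/490 is about 2.24%), not of all 506 observations. That one-liner may also be intimidating for beginners, so let us read it from the inside out, step by step:

  1. Set the cut points
  2. Define the intervals
  3. Get the counts
  4. Get the proportions
  5. Bind them together
  6. Multiply by 100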
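The six steps can be written out one assignment at a time. This is a sketch on made-up values with integer cut points; all variable names are illustrative assumptions.

```r
toy <- c(3, 8, 12, 18, 25, 27)        # hypothetical toy values

breaks <- seq(0, 30, by = 10)         # 1. set the cut points
bins   <- cut(toy, breaks = breaks)   # 2. assign each value to an interval
freq   <- table(bins)                 # 3. count the values in each interval
prop   <- prop.table(freq)            # 4. turn counts into proportions
column <- cbind(prop)                 # 5. bind into a one-column matrix
pct    <- column * 100                # 6. multiply by 100 to get percentages

print(pct)
```

Naming the intermediate results makes it easy to inspect each stage, which is harder with the nested one-liner.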

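10.4.4 Describing the frequency distribution of medv

With the help of

  1. stem.leaf
  2. seq
  3. cut
  4. table
  5. prop.table

we now have the distribution of the variable medv, the median home values of the Boston metropolitan area in the 1970s.

Here are the two answers once more:

  1. The stem-and-leaf display (one line per stem, so it is the smallest tree!)
  2. The frequency table (compared with the stem-and-leaf display, whoever builds or uses the table has more freedom in defining the counting intervals, but the sizes of the counts are not as easy to compare at a glance!)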
## 1 | 2: represents 1.2
##  leaf unit: 0.1
##             n: 506
##     3     5 | 006
##     4     6 | 3
##    11     7 | 0022245
##    21     8 | 1334455788
##    24     9 | 567
##    34    10 | 2224455899
##    43    11 | 035778899
##    52    12 | 013567778
##    76    13 | 011112333444455668888899
##    94    14 | 011123344555668999
##   110    15 | 0001222344666667
##   126    16 | 0111223455667788
##   148    17 | 0111222344455567888889
##   173    18 | 0122233444555566777889999
##   210    19 | 0011112233333444444555566666778889999
##   246    20 | 000001111122333344445556666667788899
##   (31)   21 | 0001122222444445566777777788999
##   229    22 | 00000001222223344555666667788889999
##   194    23 | 0000111111122223333445667777888899999
##   157    24 | 0011123334444555667778888
##   132    25 | 00000000123
##   121    26 | 24456667
##   113    27 | 011555599
##   104    28 | 0124456777
##    94    29 | 0011466889
##    84    30 | 1113578
##    77    31 | 01255667
##    69    32 | 0024579
##    62    33 | 011223448
##    53    34 | 67999
##    48    35 | 1244
##    44    36 | 012245
##    38    37 | 02369
##    33    38 | 7
##    32    39 | 8
##          40 | 
##    31    41 | 37
##    29    42 | 38
##    27    43 | 158
##    24    44 | 08
##    22    45 | 4
##    21    46 | 07
##          47 | 
##    19    48 | 358
##          49 | 
##    16    50 | 0000000000000000
##             [,1]
## (4.9,7.9]     11
## (7.9,10.9]    23
## (10.9,13.9]   42
## (13.9,16.9]   50
## (16.9,19.9]   84
## (19.9,22.9]  102
## (22.9,25.9]   73
## (25.9,28.9]   27
## (28.9,31.9]   25
## (31.9,34.9]   21
## (34.9,37.9]   15
## (37.9,40.9]    2
## (40.9,43.9]    7
## (43.9,46.9]    5
## (46.9,49.9]    3

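The next challenge is

how to describe clearly the distribution we see.

An observer of a distribution may want to answer questions like these:

  1. A representative value? A representative error?
  2. How spread out? How dispersed?
  3. Unimodal? Bimodal? Multimodal?
  4. Symmetric?
  5. Flat and wide? Tall and thin?
  6. A long tail?
  7. Long tail on the right? On the left?
  8. Outliers?

A picture is worth a thousand words.

The author believes that most people find these questions easier to answer while looking at a stem-and-leaf display, but answers obtained by telling a story from a picture can be subjective. Statisticians have therefore worked to "create numbers" that make the answers more objective. We will not answer the questions directly yet; first, let us see which descriptive (summary) statistics R itself and its contributors provide.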

10.4.4.1 The experts' opinions

  1. base
summary(medv)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00
  2. skimr
require(skimr)
skim(medv)
Table 10.1: Data summary
Name medv
Number of rows 506
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 22.53 9.2 5 17.02 21.2 25 50 ▂▇▅▁▁
  3. summarytools
require(summarytools)
descr(medv)
## Descriptive Statistics  
## medv  
## N: 506  
## 
##                       medv
## ----------------- --------
##              Mean    22.53
##           Std.Dev     9.20
##               Min     5.00
##                Q1    17.00
##            Median    21.20
##                Q3    25.00
##               Max    50.00
##               MAD     5.93
##               IQR     7.98
##                CV     0.41
##          Skewness     1.10
##       SE.Skewness     0.11
##          Kurtosis     1.45
##           N.Valid   506.00
##         Pct.Valid   100.00

10.4.4.2 DIY

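  1. Commonly computed numbers
mean(medv) # mean
## [1] 22.53281
mean(medv, trim = 0.05) # trimmed mean (drops 5% of the observations from each end)
## [1] 21.90592
mean(medv, trim = 0.10) # trimmed mean (drops 10% of the observations from each end)
## [1] 21.56232
median(medv) # median
## [1] 21.2
min(medv) # minimum
## [1] 5
max(medv) # maximum
## [1] 50
quantile(medv, probs = c(0.25, 0.5, 0.75)) # quartiles
##    25%    50%    75% 
## 17.025 21.200 25.000
sd(medv) # standard deviation
## [1] 9.197104
diff(range(medv)) # range (maximum minus minimum)
## [1] 45
IQR(medv) # interquartile range
## [1] 7.975
mad(medv) # median absolute deviation (scaled by 1.4826 by default), not a mean absolute error
## [1] 5.9304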
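Since mad is easy to misread, here is a small sketch of what it computes, on made-up values (the vector and variable names are assumptions for illustration). R's mad takes the median of the absolute deviations from the median and multiplies it by the constant 1.4826, so the value 5.9304 reported above for medv corresponds to an unscaled median absolute deviation of 4.0.

```r
x <- c(1, 2, 3, 4, 100)                    # hypothetical toy values with one outlier

raw_mad <- median(abs(x - median(x)))      # median absolute deviation, unscaled
scaled_mad <- 1.4826 * raw_mad             # apply R's default scaling constant

print(c(by_hand = scaled_mad, builtin = mad(x), unscaled = mad(x, constant = 1)))
print(sd(x))                               # for comparison: sd is pulled up by the outlier
```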
  2. Rarely used and home-made numbers
quantile(medv, probs = seq(0, 1, 0.1)) # deciles
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##  5.00 12.75 15.30 18.20 19.70 21.20 22.70 24.15 28.20 34.80 50.00
quantile(medv, probs = seq(0, 1, 0.01)) # percentiles
##     0%     1%     2%     3%     4%     5%     6%     7%     8%     9%    10% 
##  5.000  7.010  7.560  8.415  8.940 10.200 10.590 11.370 11.840 12.390 12.750 
##    11%    12%    13%    14%    15%    16%    17%    18%    19%    20%    21% 
## 13.100 13.360 13.500 13.800 13.975 14.280 14.500 14.890 15.000 15.300 15.600 
##    22%    23%    24%    25%    26%    27%    28%    29%    30%    31%    32% 
## 16.100 16.315 16.620 17.025 17.200 17.435 17.740 17.845 18.200 18.400 18.560 
##    33%    34%    35%    36%    37%    38%    39%    40%    41%    42%    43% 
## 18.765 18.900 19.100 19.300 19.400 19.490 19.600 19.700 19.900 20.000 20.100 
##    44%    45%    46%    47%    48%    49%    50%    51%    52%    53%    54% 
## 20.300 20.400 20.530 20.600 20.800 21.000 21.200 21.400 21.560 21.700 21.800 
##    55%    56%    57%    58%    59%    60%    61%    62%    63%    64%    65% 
## 22.000 22.000 22.200 22.400 22.600 22.700 22.900 23.000 23.100 23.200 23.300 
##    66%    67%    68%    69%    70%    71%    72%    73%    74%    75%    76% 
## 23.530 23.700 23.840 23.945 24.150 24.400 24.500 24.700 24.940 25.000 25.280 
##    77%    78%    79%    80%    81%    82%    83%    84%    85%    86%    87% 
## 26.585 27.090 27.500 28.200 28.700 29.100 29.800 30.140 31.025 31.600 32.270 
##    88%    89%    90%    91%    92%    93%    94%    95%    96%    97%    98% 
## 33.040 33.345 34.800 35.310 36.200 37.265 40.850 43.400 45.880 49.820 50.000 
##    99%   100% 
## 50.000 50.000
`十分位數` <- quantile(medv, probs = seq(0, 1, 0.1)) # deciles
`百分位數` <- quantile(medv, probs = seq(0, 1, 0.01)) # percentiles
`十分位數`["100%"] - `十分位數`["0%"] # a kind of range (0% to 100%)
## 100% 
##   45
`十分位數`["90%"] - `十分位數`["10%"] # a kind of range (10% to 90%)
##   90% 
## 22.05
`十分位數`["80%"] - `十分位數`["20%"] # a kind of range (20% to 80%)
##  80% 
## 12.9
`百分位數`["99%"] - `百分位數`["1%"] # a kind of range (1% to 99%)
##   99% 
## 42.99
`百分位數`["89%"] - `百分位數`["11%"] # a kind of range (11% to 89%)
##    89% 
## 20.245
`百分位數`["79%"] - `百分位數`["21%"] # a kind of range (21% to 79%)
##  79% 
## 11.9
  3. Higher-order numbers

While skewness and kurtosis are not as often calculated and reported as mean and standard deviation, they can be useful at times. Skewness is the 3rd moment around the mean, and characterizes whether the distribution is symmetric (skewness = 0). Kurtosis is a function of the 4th central moment, and characterizes peakedness, where the normal distribution has a value of 3 and smaller values correspond to thinner tails (less peakedness).

require(moments)
CV <- function(x){
  (sd(x) / mean(x)) * 100.0
}
CD <- function(x){
  (mad(x) / median(x)) * 100.0
}
# In this session the bare names skewness and kurtosis resolve to another package's generics,
# which fail with a UseMethod error on a numeric vector, so we call the moments versions explicitly.
moments::skewness(medv) # skewness
## [1] 1.104811
moments::kurtosis(medv) # kurtosis
## [1] 4.468629
moments::kurtosis(medv) - 3 # excess kurtosis
## [1] 1.468629
CV(medv) # coefficient of variation (a kind of risk measure)
## [1] 40.81651
CD(medv) # "coefficient of deviation" (a home-made number that will hopefully serve as a risk measure too)
## [1] 27.97358
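To see what the two higher-order numbers measure, they can be computed by hand from central moments. The sketch below assumes the moment-ratio definitions that divide by n, which is how the author of this note understands the moments package to work; the helper names and toy vectors are illustrative assumptions.

```r
# k-th central moment, dividing by n (not n - 1).
central_moment <- function(x, k) mean((x - mean(x))^k)

# Skewness: third central moment over the 1.5th power of the second.
skew_by_hand <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5
# Kurtosis: fourth central moment over the square of the second (normal = 3).
kurt_by_hand <- function(x) central_moment(x, 4) / central_moment(x, 2)^2

sym   <- c(1, 2, 3, 4, 5)   # hypothetical symmetric toy data
right <- c(1, 1, 1, 10)     # hypothetical toy data with a long right tail

print(c(skew = skew_by_hand(sym), kurt = kurt_by_hand(sym)))
print(c(skew = skew_by_hand(right), kurt = kurt_by_hand(right)))
```

A symmetric sample gives a skewness of zero, while a long right tail gives a positive value, which is the pattern the medv output shows.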

10.4.5 Putting it all together in a descriptive statistics report

10.4.5.1 DIY

mySummary <- data.frame(最小值 = min(medv),
              Q1 = quantile(medv, probs = 0.25),
              平均數 = mean(medv),
              中位數 = quantile(medv, probs = 0.50),
              Q3 = quantile(medv, probs = 0.75),
              最大值 = max(medv),
              標準差 = sd(medv),
              間距 = diff(range(medv)),
              偏度 = moments::skewness(medv),
              扁度 = moments::kurtosis(medv))
rownames(mySummary) <- "medv"
mySummary
##      最小值     Q1   平均數 中位數 Q3 最大值   標準差 間距     偏度     扁度
## medv      5 17.025 22.53281   21.2 25     50 9.197104   45 1.104811 4.468629

10.4.5.2 The experts' opinion

skim(medv)
Table 10.2: Data summary
Name medv
Number of rows 506
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 22.53 9.2 5 17.02 21.2 25 50 ▂▇▅▁▁

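At this point the author reminds himself once again that, whatever angle we take in solving a problem or in teaching, the hope is always to show

the process of developing the code that produces the answer, a process that inevitably touches statistics, computing, and the subject domain (housing prices).

Before developing anything, of course, understanding the problem clearly is the key. It takes deliberate practice, and small problems should not be looked down on. The three lines below, for example, simply verify the answers that the skimr package gives for the number of missing values and the complete rate.

sum(is.na(medv)) # number of missing values
## [1] 0
sum(!is.na(medv)) # number of complete observations
## [1] 506
sum(!is.na(medv))/NROW(medv) # proportion of complete observations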
## [1] 1
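Because medv has no missing values, the three checks above are not very revealing. The same checks on a made-up vector with one NA (an assumption for illustration) show what the numbers look like when something is missing.

```r
x <- c(1, 2, NA, 4)                  # hypothetical toy vector with one missing value

n_missing     <- sum(is.na(x))       # number of missing values
n_complete    <- sum(!is.na(x))      # number of complete observations
complete_rate <- mean(!is.na(x))     # proportion of complete observations
mean_na_rm    <- mean(x, na.rm = TRUE)  # mean over the complete observations only

print(c(n_missing = n_missing, n_complete = n_complete, complete_rate = complete_rate))
print(mean_na_rm)
```

Taking the mean of a logical vector is a compact way to get a proportion, and gives the same result as dividing the count by NROW(x).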

Next, skimr also offers a do-it-yourself option for the descriptive (summary) statistics report, so readers can design their own report to their heart's content.

my_skim <- skim_with(
  numeric = sfl(mean = mean, 
                median = median,
                'standard deviation' = sd,
                mad = mad,
                iqr = IQR, 
                p99 = ~ quantile(., probs = .99),
                skewness = moments::skewness,
                kurtosis = moments::kurtosis,
                CV = CV,
                CD = CD),
  append = FALSE
)
my_skim(medv)
Table 10.3: Data summary
Name medv
Number of rows 506
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean median standard deviation mad iqr p99 skewness kurtosis CV CD
data 0 1 22.53 21.2 9.2 5.93 7.98 50 1.1 4.47 40.82 27.97

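Finally, readers who want to dig into everything skimr::skim can do should read the following documentation:

knitr::include_url("https://www.rdocumentation.org/packages/skimr/versions/2.1.2/topics/skim")

10.4.5.3 Extracting a single statistic from the experts' report

sumMEDV <- skim(medv) # Save it first.
sumMEDV # Call it to display it.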
Table 10.4: Data summary
Name medv
Number of rows 506
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 22.53 9.2 5 17.02 21.2 25 50 ▂▇▅▁▁
names(sumMEDV) # To grab something you need its name; once you know the name, you can call it!
##  [1] "skim_type"     "skim_variable" "n_missing"     "complete_rate"
##  [5] "numeric.mean"  "numeric.sd"    "numeric.p0"    "numeric.p25"  
##  [9] "numeric.p50"   "numeric.p75"   "numeric.p100"  "numeric.hist"
sumMEDV$numeric.mean # grab the mean
## [1] 22.53281
sumMEDV$numeric.sd # grab the standard deviation
## [1] 9.197104

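As it turns out, this is quick and simple!

10.4.6 In-class exercises

10.5 A closer look at the background of the Boston dataset

That is enough computing on the median home value medv for now. Let us pause the calculations and, if not chat over a cup of coffee, at least pick up a book for a change of mood. The author seems to have hit a bottleneck, so he decides to look up the literature related to this case, likewise over a cup of coffee and in a different setting, to recharge and think about how to break through.

10.5.1 The manual page provided by MASS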

knitr::include_url("https://www.rdocumentation.org/packages/MASS/versions/7.3-53/topics/Boston")

10.5.2 The manual page provided by mlbench

knitr::include_url("https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/BostonHousing")

10.5.3 The manual page provided by spData

knitr::include_url("https://www.rdocumentation.org/packages/spData/versions/0.3.8/topics/boston")

10.5.4 Reference: Harrison and Rubinfeld (1978)

Harrison, D. and Rubinfeld, D.L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81–102.

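10.5.5 Background notes we download ourselves

The author has worked with statistical programming languages (S, R) for more than 30 years. For most of that time he had no particular ambition for them, until three years ago he decided to teach only R in class, whether for hand calculation, computer calculation, computer-assisted hand calculation, or even theoretical derivations. The aim is that students at every level can easily have a computing environment and computing skills, and can take the code they wrote during the course with them, whether they go on to further study or to work. Over the years the author has grown used to a fixed routine. To compute the mean of 1, 2, 3, NA, for example, he would do this: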

mean(c(1,2,3,NA))
## [1] NA

When that does not work, call up the help page for mean (in R, type ?mean or ??mean at the console):

knitr::include_url("https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean")

After a careful read, it turns out the call should be changed to

mean(c(1,2,3,NA), na.rm = TRUE)
## [1] 2

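This time, however, Chapter 1 of this booklet series gave us the chance to visit the StatLib data repository and develop code to download the Boston dataset. After the download succeeded, the author realized that a further piece of code should be added to obtain the brief description of the Boston dataset as well. Besides saving some cutting and pasting (when writing a passage of this booklet) and some typing of English (when writing a report, say), having the description available in the program opens the door to other applications. So, readers, please enjoy R's text-processing abilities here! Going back to the scraping script of Chapter 1, the author copies it over:

# Big step 1: download.
URL <- "http://lib.stat.cmu.edu/datasets/boston"
x <- read.delim(URL)
# Big step 2: grab the variable names.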
expBoston <- c(colnames(x), as.character(x[1:19,]))
expBoston
##  [1] "The.Boston.house.price.data.of.Harrison..D..and.Rubinfeld..D.L...Hedonic"       
##  [2] " prices and the demand for clean air', J. Environ. Economics & Management,"     
##  [3] " vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics" 
##  [4] " ...', Wiley, 1980.   N.B. Various transformations are used in the table on"    
##  [5] " pages 244-261 of the latter."                                                  
##  [6] " Variables in order:"                                                           
##  [7] " CRIM     per capita crime rate by town"                                        
##  [8] " ZN       proportion of residential land zoned for lots over 25,000 sq.ft."     
##  [9] " INDUS    proportion of non-retail business acres per town"                     
## [10] " CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"
## [11] " NOX      nitric oxides concentration (parts per 10 million)"                   
## [12] " RM       average number of rooms per dwelling"                                 
## [13] " AGE      proportion of owner-occupied units built prior to 1940"               
## [14] " DIS      weighted distances to five Boston employment centres"                 
## [15] " RAD      index of accessibility to radial highways"                            
## [16] " TAX      full-value property-tax rate per $10,000"                             
## [17] " PTRATIO  pupil-teacher ratio by town"                                          
## [18] " B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town"       
## [19] " LSTAT    % lower status of the population"                                     
## [20] " MEDV     Median value of owner-occupied homes in $1000's"
paste(expBoston[1:5], collapse = "")
## [1] "The.Boston.house.price.data.of.Harrison..D..and.Rubinfeld..D.L...Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980.   N.B. Various transformations are used in the table on pages 244-261 of the latter."

However, the beginning of the result, "The.Boston.house.price.data.of.Harrison..D..and.Rubinfeld..D.L...Hedonic", is littered with ugly dots. The dots appear because that piece of text comes from the column name of the downloaded table: read.delim treats the first line as a header and, by default, turns it into a syntactically valid R name, replacing spaces and every other character that is not allowed in a name (such as hyphens, commas, and quotes) with a period.
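The renaming can be reproduced without downloading anything. The sketch below applies make.names, the base R function behind this behavior, to the start of the header line, and then reads a tiny in-memory file (its content is a made-up assumption) three ways to compare the default with check.names = FALSE and header = FALSE.

```r
# The start of the header line, as it appears on the web page.
header_text <- "The Boston house-price data of Harrison, D."
as_name <- make.names(header_text)   # invalid characters become periods
print(as_name)

# The same mechanism on a tiny in-memory file (hypothetical content).
txt <- "house-price data\n1\n2"
with_header <- read.delim(text = txt)                       # first line becomes a column name
no_check    <- read.delim(text = txt, check.names = FALSE)  # column name kept verbatim
no_header   <- read.delim(text = txt, header = FALSE)       # first line kept as data

print(names(with_header))
print(names(no_check))
print(no_header)
```

With check.names = FALSE the text survives intact but is still consumed as a column name; only header = FALSE keeps it as an ordinary row of data.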

Is there a way to stop R from doing this?

So, as usual, the author looked up the documentation for read.delim:

knitr::include_url("https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table")

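Browsing it from top to bottom as usual, the author found that adding header = FALSE is enough to stop R from doing this on its own initiative.

# Big step 1: download.
URL <- "http://lib.stat.cmu.edu/datasets/boston"
x <- read.delim(URL, header = FALSE)
# Big step 2: grab the variable names.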
expBoston <- as.character(x[1:20,])
expBoston
##  [1] " The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic"      
##  [2] " prices and the demand for clean air', J. Environ. Economics & Management,"     
##  [3] " vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics" 
##  [4] " ...', Wiley, 1980.   N.B. Various transformations are used in the table on"    
##  [5] " pages 244-261 of the latter."                                                  
##  [6] " Variables in order:"                                                           
##  [7] " CRIM     per capita crime rate by town"                                        
##  [8] " ZN       proportion of residential land zoned for lots over 25,000 sq.ft."     
##  [9] " INDUS    proportion of non-retail business acres per town"                     
## [10] " CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"
## [11] " NOX      nitric oxides concentration (parts per 10 million)"                   
## [12] " RM       average number of rooms per dwelling"                                 
## [13] " AGE      proportion of owner-occupied units built prior to 1940"               
## [14] " DIS      weighted distances to five Boston employment centres"                 
## [15] " RAD      index of accessibility to radial highways"                            
## [16] " TAX      full-value property-tax rate per $10,000"                             
## [17] " PTRATIO  pupil-teacher ratio by town"                                          
## [18] " B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town"       
## [19] " LSTAT    % lower status of the population"                                     
## [20] " MEDV     Median value of owner-occupied homes in $1000's"
paste(expBoston[1:5], collapse = "")
## [1] " The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980.   N.B. Various transformations are used in the table on pages 244-261 of the latter."
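The joined text still carries a leading space and runs of several spaces where the original lines were padded. A possible clean-up, sketched here on two shortened fragments of the header (the fragments and variable names are assumptions for illustration), squeezes every run of whitespace to a single space and trims both ends.

```r
# Two shortened fragments in the style of the header lines above.
lines <- c(" vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics",
           " ...', Wiley, 1980.")

joined <- paste(lines, collapse = "")        # same joining step as in the text
squeezed <- gsub("[[:space:]]+", " ", joined) # collapse runs of whitespace to one space
tidy <- trimws(squeezed)                      # drop leading and trailing whitespace

print(tidy)
```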

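The dots are gone, and we now have exactly the same introductory text as on the web page. Next, the author combines the variable names and their brief descriptions into one table:

# Big step 1: download.
URL <- "http://lib.stat.cmu.edu/datasets/boston"
x <- read.delim(URL, header = FALSE)
# Big step 2: grab the variable names.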
colsBoston <- as.character(x[7:20,])
colsBoston
##  [1] " CRIM     per capita crime rate by town"                                        
##  [2] " ZN       proportion of residential land zoned for lots over 25,000 sq.ft."     
##  [3] " INDUS    proportion of non-retail business acres per town"                     
##  [4] " CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"
##  [5] " NOX      nitric oxides concentration (parts per 10 million)"                   
##  [6] " RM       average number of rooms per dwelling"                                 
##  [7] " AGE      proportion of owner-occupied units built prior to 1940"               
##  [8] " DIS      weighted distances to five Boston employment centres"                 
##  [9] " RAD      index of accessibility to radial highways"                            
## [10] " TAX      full-value property-tax rate per $10,000"                             
## [11] " PTRATIO  pupil-teacher ratio by town"                                          
## [12] " B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town"       
## [13] " LSTAT    % lower status of the population"                                     
## [14] " MEDV     Median value of owner-occupied homes in $1000's"
colsBoston[12] <- sub("B ", "BLACK ", colsBoston[12])
colsBoston
##  [1] " CRIM     per capita crime rate by town"                                        
##  [2] " ZN       proportion of residential land zoned for lots over 25,000 sq.ft."     
##  [3] " INDUS    proportion of non-retail business acres per town"                     
##  [4] " CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"
##  [5] " NOX      nitric oxides concentration (parts per 10 million)"                   
##  [6] " RM       average number of rooms per dwelling"                                 
##  [7] " AGE      proportion of owner-occupied units built prior to 1940"               
##  [8] " DIS      weighted distances to five Boston employment centres"                 
##  [9] " RAD      index of accessibility to radial highways"                            
## [10] " TAX      full-value property-tax rate per $10,000"                             
## [11] " PTRATIO  pupil-teacher ratio by town"                                          
## [12] " BLACK        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town"   
## [13] " LSTAT    % lower status of the population"                                     
## [14] " MEDV     Median value of owner-occupied homes in $1000's"
colsBostonFull <- colsBoston
colsBostonFull
##  [1] " CRIM     per capita crime rate by town"                                        
##  [2] " ZN       proportion of residential land zoned for lots over 25,000 sq.ft."     
##  [3] " INDUS    proportion of non-retail business acres per town"                     
##  [4] " CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"
##  [5] " NOX      nitric oxides concentration (parts per 10 million)"                   
##  [6] " RM       average number of rooms per dwelling"                                 
##  [7] " AGE      proportion of owner-occupied units built prior to 1940"               
##  [8] " DIS      weighted distances to five Boston employment centres"                 
##  [9] " RAD      index of accessibility to radial highways"                            
## [10] " TAX      full-value property-tax rate per $10,000"                             
## [11] " PTRATIO  pupil-teacher ratio by town"                                          
## [12] " BLACK        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town"   
## [13] " LSTAT    % lower status of the population"                                     
## [14] " MEDV     Median value of owner-occupied homes in $1000's"
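# Keep only the variable name: each line starts with a space, so the name is the second piece
colsBoston <- sapply(colsBoston, function(w){
  strsplit(w, " ")[[1]][2]
})
colsBoston <- unname(colsBoston)  # drop the names that sapply copies from its input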
colsBoston
##  [1] "CRIM"    "ZN"      "INDUS"   "CHAS"    "NOX"     "RM"      "AGE"    
##  [8] "DIS"     "RAD"     "TAX"     "PTRATIO" "BLACK"   "LSTAT"   "MEDV"
colsBostonFull <- data.frame(cbind(colsBoston, colsBostonFull),
                             stringsAsFactors = FALSE)
colsBostonFull
##    colsBoston
## 1        CRIM
## 2          ZN
## 3       INDUS
## 4        CHAS
## 5         NOX
## 6          RM
## 7         AGE
## 8         DIS
## 9         RAD
## 10        TAX
## 11    PTRATIO
## 12      BLACK
## 13      LSTAT
## 14       MEDV
##                                                                     colsBostonFull
## 1                                           CRIM     per capita crime rate by town
## 2        ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
## 3                        INDUS    proportion of non-retail business acres per town
## 4   CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
## 5                      NOX      nitric oxides concentration (parts per 10 million)
## 6                                    RM       average number of rooms per dwelling
## 7                  AGE      proportion of owner-occupied units built prior to 1940
## 8                    DIS      weighted distances to five Boston employment centres
## 9                               RAD      index of accessibility to radial highways
## 10                               TAX      full-value property-tax rate per $10,000
## 11                                            PTRATIO  pupil-teacher ratio by town
## 12     BLACK        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
## 13                                       LSTAT    % lower status of the population
## 14                        MEDV     Median value of owner-occupied homes in $1000's
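# For each row, split the full line on the variable name and keep the trimmed description
colsBostonFull <- apply(colsBostonFull, 1, function(w){
  # fixed = TRUE matches the name literally rather than as a regular expression
  trimws(strsplit(w[2], w[1], fixed = TRUE)[[1]][2])
})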
colsBostonFull
##  [1] "per capita crime rate by town"                                        
##  [2] "proportion of residential land zoned for lots over 25,000 sq.ft."     
##  [3] "proportion of non-retail business acres per town"                     
##  [4] "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"
##  [5] "nitric oxides concentration (parts per 10 million)"                   
##  [6] "average number of rooms per dwelling"                                 
##  [7] "proportion of owner-occupied units built prior to 1940"               
##  [8] "weighted distances to five Boston employment centres"                 
##  [9] "index of accessibility to radial highways"                            
## [10] "full-value property-tax rate per $10,000"                             
## [11] "pupil-teacher ratio by town"                                          
## [12] "1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town"       
## [13] "% lower status of the population"                                     
## [14] "Median value of owner-occupied homes in $1000's"
colsBostonFull <- data.frame(cbind(colsBoston, colsBostonFull),
                             stringsAsFactors = FALSE)
colsBostonFull
##    colsBoston
## 1        CRIM
## 2          ZN
## 3       INDUS
## 4        CHAS
## 5         NOX
## 6          RM
## 7         AGE
## 8         DIS
## 9         RAD
## 10        TAX
## 11    PTRATIO
## 12      BLACK
## 13      LSTAT
## 14       MEDV
##                                                           colsBostonFull
## 1                                          per capita crime rate by town
## 2       proportion of residential land zoned for lots over 25,000 sq.ft.
## 3                       proportion of non-retail business acres per town
## 4  Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
## 5                     nitric oxides concentration (parts per 10 million)
## 6                                   average number of rooms per dwelling
## 7                 proportion of owner-occupied units built prior to 1940
## 8                   weighted distances to five Boston employment centres
## 9                              index of accessibility to radial highways
## 10                              full-value property-tax rate per $10,000
## 11                                           pupil-teacher ratio by town
## 12        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
## 13                                      % lower status of the population
## 14                       Median value of owner-occupied homes in $1000's
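
Splitting each line on its own variable name works for these fourteen lines, but it keeps only the piece between the first and second occurrence of the name, so a name that reappeared inside its own description would cut the text short. A minimal alternative sketch, assuming every line has the layout "blanks, name, blanks, description", separates both parts with one regular expression. The names `raw`, `pattern`, and `codebook` are hypothetical and used only for this illustration.

```r
# Hypothetical input: three lines in the same layout as the raw descriptions.
raw <- c(" CRIM     per capita crime rate by town",
         " RM       average number of rooms per dwelling",
         " BLACK        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town")

# Optional leading blanks, the name (no blanks inside), blanks, then the rest.
pattern <- "^\\s*(\\S+)\\s+(.*)$"

codebook <- data.frame(
  name        = sub(pattern, "\\1", raw),          # first capture group
  description = trimws(sub(pattern, "\\2", raw)),  # second capture group
  stringsAsFactors = FALSE
)
codebook
```

Because the pattern is anchored at the start of the line, only the leading name is removed, whatever the description contains.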

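To make the Boston dataset more approachable for readers, the author ventures a Chinese translation of each description and proposes a matching Chinese column name for each variable, as shown below. Both are then added to the table colsBostonFull.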

`中文翻譯` <- c("每個城鎮人均犯罪率。",
            "超過25,000平方呎的住宅用地比例。",
            "每個城鎮非零售業務英畝的比例。",
            "是否鄰近Charles River。",
            "氮氧化合物濃度。",
            "每個住宅的平均房間數。",
            "1940年之前建造的自用住宅比例。",
            "距波士頓五大商圈的加權平均距離。",
            "環狀高速公路的可觸指標。",
            "財產稅占比(每一萬美元)。",
            "生師比。",
            "黑人的比例。",
            "(比較)低社經地位人口的比例。",
            "以千美元計的房價中位數。")

`中文變數名稱` <- c("犯罪率",
              "住宅用地比例",
              "非商業區比例",
              "河邊宅",
              "空汙指標",
              "平均房間數",
              "老房子比例",
              "加權平均距離",
              "交通便利性",
              "財產稅率",
              "生師比",
              "黑人指數",
              "低社經人口比例",
              "房價中位數")

colsBostonFull$中文翻譯 <- `中文翻譯`
colsBostonFull$中文變數名稱 <- `中文變數名稱`
colnames(colsBostonFull)[1] <- "變數名"
colnames(colsBostonFull)[2] <- "說明"
colsBostonFull
##     變數名
## 1     CRIM
## 2       ZN
## 3    INDUS
## 4     CHAS
## 5      NOX
## 6       RM
## 7      AGE
## 8      DIS
## 9      RAD
## 10     TAX
## 11 PTRATIO
## 12   BLACK
## 13   LSTAT
## 14    MEDV
##                                                                     說明
## 1                                          per capita crime rate by town
## 2       proportion of residential land zoned for lots over 25,000 sq.ft.
## 3                       proportion of non-retail business acres per town
## 4  Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
## 5                     nitric oxides concentration (parts per 10 million)
## 6                                   average number of rooms per dwelling
## 7                 proportion of owner-occupied units built prior to 1940
## 8                   weighted distances to five Boston employment centres
## 9                              index of accessibility to radial highways
## 10                              full-value property-tax rate per $10,000
## 11                                           pupil-teacher ratio by town
## 12        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
## 13                                      % lower status of the population
## 14                       Median value of owner-occupied homes in $1000's
##                            中文翻譯   中文變數名稱
## 1              每個城鎮人均犯罪率。         犯罪率
## 2  超過25,000平方呎的住宅用地比例。   住宅用地比例
## 3    每個城鎮非零售業務英畝的比例。   非商業區比例
## 4           是否鄰近Charles River。         河邊宅
## 5                  氮氧化合物濃度。       空汙指標
## 6            每個住宅的平均房間數。     平均房間數
## 7    1940年之前建造的自用住宅比例。     老房子比例
## 8  距波士頓五大商圈的加權平均距離。   加權平均距離
## 9          環狀高速公路的可觸指標。     交通便利性
## 10         財產稅占比(每一萬美元)。       財產稅率
## 11                         生師比。         生師比
## 12                     黑人的比例。       黑人指數
## 13     (比較)低社經地位人口的比例。 低社經人口比例
## 14         以千美元計的房價中位數。     房價中位數
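
One use for a table like this is relabeling the columns of the data itself. The sketch below is only an illustration under stated assumptions: `toy` is a hypothetical three-column data frame standing in for the real dataset, and `codebook` is a cut-down copy of the table with just the 「變數名」 and 「中文變數名稱」 columns.

```r
# Hypothetical stand-in for the data: three columns, two rows of toy values.
toy <- data.frame(CRIM = c(0.00632, 0.02731),
                  NOX  = c(0.538, 0.469),
                  MEDV = c(24.0, 21.6))

# Cut-down copy of the lookup table (three of the fourteen rows).
codebook <- data.frame(c("CRIM", "NOX", "MEDV"),
                       c("犯罪率", "空汙指標", "房價中位數"),
                       stringsAsFactors = FALSE)
colnames(codebook) <- c("變數名", "中文變數名稱")

# Find each data column in the lookup table and fail loudly if one is missing.
idx <- match(colnames(toy), codebook[["變數名"]])
stopifnot(!anyNA(idx))

# Swap the English names for the Chinese ones.
colnames(toy) <- codebook[["中文變數名稱"]][idx]
toy
```

Using `match()` rather than relying on position means the relabeling still works if the data columns appear in a different order from the rows of the table.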

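A more standard approach is shown below: because each variable name uniquely identifies its row, the 「變數名」 column can be used as the row names (the observation IDs):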

rownames(colsBostonFull) <- colsBostonFull$`變數名`
colsBostonFull <- colsBostonFull[,-1]
colsBostonFull
##                                                                          說明
## CRIM                                            per capita crime rate by town
## ZN           proportion of residential land zoned for lots over 25,000 sq.ft.
## INDUS                        proportion of non-retail business acres per town
## CHAS    Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
## NOX                        nitric oxides concentration (parts per 10 million)
## RM                                       average number of rooms per dwelling
## AGE                    proportion of owner-occupied units built prior to 1940
## DIS                      weighted distances to five Boston employment centres
## RAD                                 index of accessibility to radial highways
## TAX                                  full-value property-tax rate per $10,000
## PTRATIO                                           pupil-teacher ratio by town
## BLACK          1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
## LSTAT                                        % lower status of the population
## MEDV                          Median value of owner-occupied homes in $1000's
##                                 中文翻譯   中文變數名稱
## CRIM                每個城鎮人均犯罪率。         犯罪率
## ZN      超過25,000平方呎的住宅用地比例。   住宅用地比例
## INDUS     每個城鎮非零售業務英畝的比例。   非商業區比例
## CHAS             是否鄰近Charles River。         河邊宅
## NOX                     氮氧化合物濃度。       空汙指標
## RM                每個住宅的平均房間數。     平均房間數
## AGE       1940年之前建造的自用住宅比例。     老房子比例
## DIS     距波士頓五大商圈的加權平均距離。   加權平均距離
## RAD             環狀高速公路的可觸指標。     交通便利性
## TAX             財產稅占比(每一萬美元)。       財產稅率
## PTRATIO                         生師比。         生師比
## BLACK                       黑人的比例。       黑人指數
## LSTAT       (比較)低社經地位人口的比例。 低社經人口比例
## MEDV            以千美元計的房價中位數。     房價中位數
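
With the variable names as row names, an entry can be looked up by name instead of by position. The following is a small sketch, assuming a cut-down stand-in called `codebook` (a hypothetical name) with three of the fourteen rows and two of the three columns.

```r
# Cut-down stand-in for the table above.
codebook <- data.frame(
  c("per capita crime rate by town",
    "nitric oxides concentration (parts per 10 million)",
    "Median value of owner-occupied homes in $1000's"),
  c("犯罪率", "空汙指標", "房價中位數"),
  stringsAsFactors = FALSE
)
colnames(codebook) <- c("說明", "中文變數名稱")
rownames(codebook) <- c("CRIM", "NOX", "MEDV")

# A single cell, addressed by row name and column name.
nox_label <- codebook["NOX", "中文變數名稱"]

# Several rows at once, returned in the order requested.
subset_rows <- codebook[c("MEDV", "CRIM"), ]
subset_rows
```

Row names must be unique, which is why this only suits a column such as the variable name that identifies each row.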

10.5.6 In-class exercises

10.6 Term mini-project

10.7 Homework exercises