4.8 Frequency Tables

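Strictly speaking this is not a kind of plot, but when the data contain two or more qualitative variables it is also important to count how many observations fall into each combination of labels.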

4.8.1 Preparing the Data

We prepare hypothetical survey data. Suppose 100 people answered a questionnaire asking for their club, major, height, and income.

library(dplyr)  # provides %>%, group_by(), and count() used below

set.seed(1)

n <- 100

choices <- list(
  club = c("Football", "Baseball", "Basketball", "American-football"),
  major = c("Math", "History", "Pychology", "Physics", "Linguistics")
)
answers <- data.frame(
  id = 1:n,
  club = sample(x=choices$club, size=n, replace=TRUE),
  major = sample(x=choices$major, size=n, replace=TRUE),
  height = rnorm(n=n, mean=170, sd=5) %>% round(1),
  income = sample(x=seq(3, 5, 0.1)*100, size=n, replace=TRUE)
)

answers %>% head()

##   id              club       major height income
## 1  1          Football     Physics  163.8    370
## 2  2 American-football     Physics  170.0    320
## 3  3        Basketball        Math  177.6    480
## 4  4          Football Linguistics  167.6    480
## 5  5          Baseball Linguistics  174.0    300
## 6  6          Football        Math  165.1    490

As shown above, the data frame has the following columns:

id: student ID
club: club membership
major: major
height: height
income: income

4.8.2 Tabulation

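We introduce two implementations here. The first uses the table() function. We already used it for a single variable, but it also accepts two or more variables as arguments. Suppose we want to examine the two qualitative variables club and major, and tabulate them.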

As shown next, simply pass the two variables as arguments. Note that an error is raised if the two variables do not have the same number of elements (vector length).

table(answers$club, answers$major)

##                    
##                     History Linguistics Math Physics Pychology
##   American-football       3           2    4       7         4
##   Baseball                9           7    6       3         6
##   Basketball              5           1    5       2         9
##   Football                4           7    6       7         3
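The length requirement for table() can be checked with a minimal sketch. The vectors a and b are hypothetical toy data, not part of the survey, and tryCatch() is used here only so that the error is captured instead of stopping the script.

```r
# Hypothetical toy vectors (not from the survey data)
a <- c("x", "y", "x", "y")
b <- c("p", "q", "p")          # one element shorter than a

# table() requires all arguments to have the same length;
# tryCatch() captures the error message so the script keeps running.
result <- tryCatch(
  table(a, b),
  error = function(e) conditionMessage(e)
)
print(result)

# With equal lengths the call succeeds and returns a two-way table.
ok <- table(a, c(b, "q"))
print(dim(ok))
```

When both columns come from the same data frame, as with answers$club and answers$major, the lengths always match, so this error mainly appears when mixing vectors from different sources.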

Next, we introduce an approach based on dplyr, one of the packages in the tidyverse collection. It uses the functions group_by() and count().

answers %>% 
  group_by(club, major) %>% 
  count()

## # A tibble: 20 × 3
## # Groups:   club, major [20]
##    club              major           n
##    <chr>             <chr>       <int>
##  1 American-football History         3
##  2 American-football Linguistics     2
##  3 American-football Math            4
##  4 American-football Physics         7
##  5 American-football Pychology       4
##  6 Baseball          History         9
##  7 Baseball          Linguistics     7
##  8 Baseball          Math            6
##  9 Baseball          Physics         3
## 10 Baseball          Pychology       6
## 11 Basketball        History         5
## 12 Basketball        Linguistics     1
## 13 Basketball        Math            5
## 14 Basketball        Physics         2
## 15 Basketball        Pychology       9
## 16 Football          History         4
## 17 Football          Linguistics     7
## 18 Football          Math            6
## 19 Football          Physics         7
## 20 Football          Pychology       3

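As the output shows, the result is not a cross table like the one from table(), with the labels of one variable along the rows and the other along the columns. Instead, the first and second columns hold the labels of the two variables, and the third column holds the count for each combination.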
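The two layouts carry the same information. As a rough sketch of how they relate, base R's as.data.frame() turns a table() result into the one-row-per-combination layout, and xtabs() turns such counts back into a cross table. The data frame toy below is hypothetical and only serves as an illustration.

```r
# Hypothetical toy data (not the survey data)
toy <- data.frame(
  club  = c("Football", "Football", "Baseball", "Baseball", "Baseball"),
  major = c("Math", "Physics", "Math", "Math", "Physics")
)

# Cross-table layout: rows are clubs, columns are majors
wide <- table(toy$club, toy$major, dnn = c("club", "major"))

# Long layout: one row per (club, major) pair, counts in the column Freq
long <- as.data.frame(wide)
print(long)

# Back to the cross-table layout from the long counts
wide_again <- xtabs(Freq ~ club + major, data = long)
print(wide_again)
```

One difference worth keeping in mind: as.data.frame() on a table keeps combinations whose count is zero, whereas group_by() followed by count() only returns combinations that actually occur in the data.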

Exercise 4.4 (Tabulation) First, run the following code.

ex_data <- data.frame(
  x = sample(x=c("A","B","C"), size=200, replace=TRUE),
  y = sample(x=c("YES", "NO"), size=200, replace=TRUE)
)

ex_data %>% head()

##   x   y
## 1 C YES
## 2 B  NO
## 3 B  NO
## 4 B  NO
## 5 A YES
## 6 C YES

Tabulate the vectors x and y. You may use either the table() function or the dplyr package.