Chapter 3 Data transformation

3.1 Cleaning Trivial Variables

The dataset has the following variables.

##  [1] "id"             "rated"          "created_at"     "last_move_at"   "turns"          "victory_status" "winner"        
##  [8] "increment_code" "white_id"       "white_rating"   "black_id"       "black_rating"   "moves"          "opening_eco"   
## [15] "opening_name"   "opening_ply"

We wanted to remove id, created_at, last_move_at, increment_code, moves, opening_eco, opening_ply since some of them, like id or increment_code do not have pivotal information for our problems. Or they are providing similar information as other variables, such as opening_eco.

The rest variables are:

## [1] "rated"          "turns"          "victory_status" "winner"         "white_id"       "white_rating"   "black_id"      
## [8] "black_rating"   "opening_name"

3.2 Cleaning Rated Status

## 
## False FALSE  True  TRUE 
##  2048  1855  8723  7432

We changed all the strings to uppercase letter for consistency.

## 
## FALSE  TRUE 
##  3903 16155

3.3 Cleaning Turns

In this chess dataset, we assume that games which only have few turns are not meaningful since it is not usual to reach the final status by only few turns in a chess game.

As the plot shows, the majority of the games have turns over 10 so that we just directly deleted games whose number of turns are under 10.

3.4 Cleaning Users

For those players who only have played few games, their data may be outliers or cannot provide useful information for our analysis. Therefore, we keep the games played by users who only had played the chess game at least 3 times.

3.5 Cleaning Opening Names

## 
##        Sicilian Defense       Queen's Pawn Game          French Defense               Ruy Lopez        King's Pawn Game 
##                     452                     221                     198                     173                     170 
##            Italian Game         English Opening        Philidor Defense       Caro-Kann Defense    Scandinavian Defense 
##                     154                     138                     115                     104                      97 
##       Zukertort Opening       Four Knights Game             Scotch Game Queen's Gambit Declined             Indian Game 
##                      80                      71                      68                      66                      61 
##    Van't Kruijs Opening        Bishop's Opening          Modern Defense            Slav Defense       Hungarian Opening 
##                      56                      53                      51                      46                      37

The original opening names are really messy since some openings have very detailed explanation. By doing related research, we decided to use the more general definition of each opening for each game. This tidy version is more operable for visualization and conclusion.

3.6 Cleaned Dataset

## [1] 3393    9

The cleaned dataset have 3393 observations and 9 variables.

rated turns victory_status winner white_id white_rating black_id black_rating opening_name
4 TRUE 61 mate white daniamurashov 1439 adivanov2009 1454 Queen’s Pawn Game
5 TRUE 95 mate white nik221107 1523 adivanov2009 1469 Philidor Defense
15 FALSE 31 mate white shivangithegenius 1094 sureka_akshat 1141 Four Knights Game
16 FALSE 43 resign black sureka_akshat 1141 shivangithegenius 1094 Italian Game
17 FALSE 52 resign black shivangithegenius 1094 sureka_akshat 1141 Four Knights Game
18 FALSE 66 mate black sureka_akshat 1141 shivangithegenius 1094 Four Knights Game