Chapter 6 Restructuring Data

Sometimes data may not come organized or formatted in the way we would like. The good thing is there are lots of functions that can be used to restructure data.

6.1 Pivots

A pivot is useful for converting data to wide or long format. pivot_longer() will take multiple columns and turn it into one variable. pivot_wider() takes a column and converts it to multiple different variables. A step-by-step breakdown of pivots can be found in section 12.3 of R for Data Science.

We will practice pivoting in the example below.

6.2 Joins

Joins are helpful for combining two separate datasets. They are joined based on the keys, which can be specified by the coder. For now we will refer to the two data frames as x and y. There are four types of joins:

  • inner_join() : only keeps observations from x that have a matching key in y
  • left_join() : keeps all observations in x
  • right_join() : keeps all observations in y
  • full_join() : keeps all observations in x and y

Chapter 13 of R for Data Science has a more thorough explanation of joins.

The join functions are part of the dplyr library. This package is in the tidyverse, so it will automatically be loaded if the tidyverse is loaded.

library(dplyr)
set.seed(2023)

6.2.1 Example

We will practice pivoting and joins using the Lahman package.

library(Lahman)

Say we want to join the Batting and Fielding tables. Both of them have over 100,000 observations, so let’s filter it down to only look at the 2021 season.

batting21 <- Batting %>% 
  filter(yearID == 2021)
batting21 <- battingStats(batting21)

fielding21 <- Fielding %>% 
  filter(yearID == 2021)

Let’s take a look at both of these data frames.

batting21
playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP BA PA TB SlugPct OBP OPS BABIP
suterbr01 2021 1 MIL NL 61 4 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0.000 4 0 0.000 0.000 0.000 0.000
ennsdi01 2021 1 TBA AL 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA
selmasa01 2021 1 SFN NL 7 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0.000 1 0 0.000 0.000 0.000 NaN
rojasjo02 2021 1 LAA AL 61 168 26 35 14 0 6 15 2 1 15 50 0 1 0 0 2 0.208 184 67 0.399 0.277 0.676 0.259
vargail01 2021 2 PIT NL 7 13 0 1 0 0 0 1 0 0 0 3 0 0 0 0 0 0.077 13 1 0.077 0.077 0.154 0.100
litteza01 2021 1 SFN NL 63 4 0 1 0 0 0 0 0 0 0 3 0 0 0 0 0 0.250 4 1 0.250 0.250 0.500 1.000

The data has 29 variables and 1708 observations.

fielding21
playerID yearID stint teamID lgID POS G GS InnOuts PO A E DP PB WP SB CS ZR
pederjo01 2021 2 ATL NL OF 49 43 1120 79 1 0 0 NA NA NA NA NA
smithdr01 2021 1 NYN NL P 31 1 124 2 1 0 1 NA NA NA NA NA
lindofr01 2021 1 NYN NL SS 124 122 3088 160 275 10 58 NA NA NA NA NA
valaipa01 2021 1 BAL AL 2B 72 58 1516 109 121 0 24 NA NA NA NA NA
higgipj01 2021 1 CHN NL C 6 5 138 54 3 0 1 1 NA 1 1 NA
degotal01 2021 1 HOU AL 2B 2 2 48 2 0 0 0 NA NA NA NA NA

The data has 18 variables and 2314 observations.

Before joining these tables, we are going to compute a summarized version of the fielding data and adjust the format.

The pivot_wider() function changes the setup of the data by increasing the number of columns and decreasing the number of rows. This is in the tidyr package, which is part of the tidyverse.

library(tidyr)

We will use DJ LeMahieu as an example to help illustrate exactly what pivoting does.

lemahieu <- Fielding %>% 
  filter(playerID == "lemahdj01") %>% 
  arrange(desc(yearID))
playerID yearID stint teamID lgID POS G GS InnOuts PO A E DP PB WP SB CS ZR
lemahdj01 2022 1 NYA AL 1B 35 31 795 230 13 1 16 NA NA NA NA NA
lemahdj01 2022 1 NYA AL 2B 41 35 938 53 91 2 21 NA NA NA NA NA
lemahdj01 2022 1 NYA AL 3B 47 43 1157 23 106 1 7 NA NA NA NA NA
lemahdj01 2021 1 NYA AL 1B 55 33 963 286 13 1 21 NA NA NA NA NA
lemahdj01 2021 1 NYA AL 2B 83 77 1989 117 156 2 48 NA NA NA NA NA
lemahdj01 2021 1 NYA AL 3B 39 36 897 18 62 6 6 NA NA NA NA NA
lemahdj01 2020 1 NYA AL 1B 11 1 72 24 0 0 1 NA NA NA NA NA
lemahdj01 2020 1 NYA AL 2B 37 34 812 51 82 4 19 NA NA NA NA NA
lemahdj01 2020 1 NYA AL 3B 11 11 261 10 16 2 2 NA NA NA NA NA
lemahdj01 2019 1 NYA AL 1B 40 28 786 215 19 2 24 NA NA NA NA NA
lemahdj01 2019 1 NYA AL 2B 75 66 1739 118 155 2 32 NA NA NA NA NA
lemahdj01 2019 1 NYA AL 3B 52 47 1200 18 87 4 7 NA NA NA NA NA
lemahdj01 2018 1 COL NL 2B 128 127 3345 209 378 4 90 NA NA NA NA NA
lemahdj01 2017 1 COL NL 2B 153 151 3906 251 470 8 106 NA NA NA NA NA
lemahdj01 2016 1 COL NL 2B 146 144 3728 276 422 6 91 NA NA NA NA NA
lemahdj01 2015 1 COL NL 2B 149 146 3852 300 452 9 120 NA NA NA NA NA
lemahdj01 2014 1 COL NL 1B 1 0 3 0 0 0 0 NA NA NA NA NA
lemahdj01 2014 1 COL NL 2B 144 135 3539 257 413 6 99 NA NA NA NA NA
lemahdj01 2014 1 COL NL 3B 7 4 115 2 5 0 0 NA NA NA NA NA
lemahdj01 2014 1 COL NL SS 1 0 3 0 0 0 0 NA NA NA NA NA
lemahdj01 2013 1 COL NL 1B 1 0 3 2 0 0 0 NA NA NA NA NA
lemahdj01 2013 1 COL NL 2B 90 86 2250 168 271 3 57 NA NA NA NA NA
lemahdj01 2013 1 COL NL 3B 14 9 302 6 24 0 2 NA NA NA NA NA
lemahdj01 2013 1 COL NL SS 1 0 3 0 0 0 0 NA NA NA NA NA
lemahdj01 2012 1 COL NL 1B 1 0 9 1 0 0 0 NA NA NA NA NA
lemahdj01 2012 1 COL NL 2B 67 60 1527 105 204 2 33 NA NA NA NA NA
lemahdj01 2012 1 COL NL 3B 9 5 138 2 8 0 0 NA NA NA NA NA
lemahdj01 2012 1 COL NL SS 2 0 6 0 0 0 0 NA NA NA NA NA
lemahdj01 2011 1 CHN NL 1B 1 1 24 8 0 0 1 NA NA NA NA NA
lemahdj01 2011 1 CHN NL 2B 15 8 233 16 22 0 5 NA NA NA NA NA
lemahdj01 2011 1 CHN NL 3B 11 6 180 6 12 4 5 NA NA NA NA NA

Currently the data is set up so if a player plays multiple positions in one season there will be a separate row for each. There are three rows for the 2021 season; one for first base, one for second base, and one for third base. We want to pivot the data so each player only has one row per season, but create columns to show which positions they played.

This is what the data looks like when pivoted:
playerID yearID stint teamID Inn_1B Inn_2B Inn_3B Inn_SS
lemahdj01 2022 1 NYA 795 938 1157 0
lemahdj01 2021 1 NYA 963 1989 897 0
lemahdj01 2020 1 NYA 72 812 261 0
lemahdj01 2019 1 NYA 786 1739 1200 0
lemahdj01 2018 1 COL 0 3345 0 0
lemahdj01 2017 1 COL 0 3906 0 0
lemahdj01 2016 1 COL 0 3728 0 0
lemahdj01 2015 1 COL 0 3852 0 0
lemahdj01 2014 1 COL 3 3539 115 3
lemahdj01 2013 1 COL 3 2250 302 3
lemahdj01 2012 1 COL 9 1527 138 6
lemahdj01 2011 1 CHN 24 233 180 0

Each position LeMahieu has played at has it’s own column.

Now it’s time to pivot the entire ‘fielding21’ data frame.

names_from specifies which existing column to use for the new column names. In this case it is ‘POS’ - instead of having one column for position, each different position will now have their own column. values_from says what values should be in those new columns. In this example that is ‘InnOuts’ - the data values in the newly created columns are how many outs a player played in said position.

fielding21_wide <- fielding21 %>%  
  pivot_wider(names_from = POS, values_from = InnOuts) %>% 
  select(playerID:lgID, P:`2B`) %>% 
  replace_na(list(C = 0, `1B` = 0, `2B` = 0, `3B` = 0, 
                  SS = 0, P = 0, OF = 0)) %>%
  summarize(Inn_1B = sum(`1B`),            
            Inn_2B = sum(`2B`),            
            Inn_3B = sum(`3B`), 
            Inn_SS = sum(SS), 
            Inn_C = sum(C),
            Inn_OF = sum(OF), 
            Inn_P = sum(P), 
            .by = c(playerID, yearID, stint, teamID))
playerID yearID stint teamID Inn_1B Inn_2B Inn_3B Inn_SS Inn_C Inn_OF Inn_P
smithdo02 2021 1 NYN 212 0 0 0 0 2579 0
bummeaa01 2021 1 CHA 0 0 0 0 0 0 169
altuvjo01 2021 1 HOU 0 3789 0 0 0 0 0
woodal02 2021 1 SFN 0 0 0 0 0 0 416
dickeco01 2021 1 MIA 0 0 0 0 0 1275 0
perezhe01 2021 1 WAS 24 45 0 0 0 36 6

This data frame has 11 variables and 1699 observations.

Using a similar process we could find a player’s primary position, if that was desired.

fielding21 %>%
  select(playerID:InnOuts) %>%
  group_by(playerID, stint) %>%
  arrange(desc(InnOuts), .by_group = TRUE) %>%
  ungroup() %>%
  mutate(InnProp = InnOuts / sum(InnOuts), 
         POS_Q = ifelse(InnProp > 0.25, POS, NA),
         .by = c(playerID, stint)) %>%
  group_by(playerID, stint) %>%
  summarize(Pos = paste(na.omit(POS_Q), collapse = "/")) %>% 
  head(20) %>% 
  kable() %>% 
  kable_styling("striped") %>% 
  scroll_box(height = "300px")
playerID stint Pos
abadfe01 1 P
abbotco01 1 P
abreual01 1 P
abreubr01 1 P
abreujo02 1 1B
acevedo01 1 P
acunaro01 1 OF
adamewi01 1 SS
adamewi01 2 SS
adamja01 1 P
adamsau02 1 P
adamsma01 1 1B
adamsri03 1 C
adamsri03 2 C
adelljo01 1 OF
adonjo01 1 P
adriaeh01 1 OF/3B
aguilje01 1 1B
aguilmi01 1 P
ahmedni01 1 SS


Now we are ready to practice some joins!

Inner Join

Remember, a key must be present in both datasets to be joined. There are three columns that overlap between the two: ‘playerID’, ‘yearID’, and ‘stint’. We will specify that these variables are the keys in the join_by argument.

join_inner <- inner_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))
playerID yearID stint teamID.x lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP BA PA TB SlugPct OBP OPS BABIP teamID.y Inn_1B Inn_2B Inn_3B Inn_SS Inn_C Inn_OF Inn_P
tomka01 2021 2 PIT NL 39 92 9 14 2 1 2 11 1 0 17 30 1 5 0 3 1 0.152 117 24 0.261 0.308 0.569 0.190 PIT 0 0 0 0 0 734 0
cottojh01 2021 1 TEX AL 23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA TEX 0 0 0 0 0 0 92
freelky01 2021 1 COL NL 25 33 3 6 1 0 0 4 0 0 2 17 0 0 4 1 0 0.182 40 7 0.212 0.222 0.434 0.353 COL 0 0 0 0 0 3 362
fostema01 2021 1 CHA AL 37 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA CHA 0 0 0 0 0 0 117
detmere01 2021 1 LAA AL 5 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0.000 2 0 0.000 0.500 0.500 NaN LAA 0 0 0 0 0 0 62
slateau01 2021 1 SFN NL 129 274 39 66 12 1 12 32 15 2 28 84 2 4 0 0 7 0.241 306 116 0.423 0.320 0.743 0.303 SFN 0 0 0 0 0 1808 1

This creates a new data frame with 37 variables and 1699 observations. It is the same amount of rows as fielding21_wide, the smaller dataset.

Left Join

The code for each type of join is identical, besides the first word of the join function.

join_left <- left_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))
playerID yearID stint teamID.x lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP BA PA TB SlugPct OBP OPS BABIP teamID.y Inn_1B Inn_2B Inn_3B Inn_SS Inn_C Inn_OF Inn_P
pinedmi01 2021 1 MIN AL 22 3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0.000 3 0 0.000 0.000 0.000 0.000 MIN 0 0 0 0 0 0 328
leedy01 2021 1 ATL NL 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA ATL 0 0 0 0 0 0 6
bonifjo01 2021 1 PHI NL 7 11 0 1 0 0 0 2 0 0 1 6 0 0 0 0 0 0.091 12 1 0.091 0.167 0.258 0.200 PHI 0 0 0 0 0 63 0
carraca01 2021 1 NYN NL 12 15 0 0 0 0 0 0 0 0 0 10 0 0 4 0 0 0.000 19 0 0.000 0.000 0.000 0.000 NYN 0 0 0 0 0 0 161
milneho01 2021 1 MIL NL 19 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0.000 2 0 0.000 0.500 0.500 NaN MIL 0 0 0 0 0 0 65
kirilal01 2021 1 MIN AL 59 215 23 54 11 1 8 34 1 1 14 52 2 1 0 1 3 0.251 231 91 0.423 0.299 0.722 0.295 MIN 641 0 0 0 0 795 0

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as batting21, the left dataset.

Right Join

join_right <- right_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))
playerID yearID stint teamID.x lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP BA PA TB SlugPct OBP OPS BABIP teamID.y Inn_1B Inn_2B Inn_3B Inn_SS Inn_C Inn_OF Inn_P
guerrde01 2021 1 OAK AL 53 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA OAK 0 0 0 0 0 0 197
newcose01 2021 1 ATL NL 32 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0.000 2 0 0.000 0.000 0.000 NaN ATL 0 0 0 0 0 0 97
lowrije01 2021 1 OAK AL 139 457 55 112 28 0 14 69 0 0 49 108 1 2 0 4 8 0.245 512 182 0.398 0.318 0.716 0.289 OAK 0 1754 9 0 0 0 0
ruizke01 2021 2 WAS NL 23 81 9 23 3 0 2 14 0 0 6 4 0 2 0 0 3 0.284 89 32 0.395 0.348 0.743 0.280 WAS 0 0 0 0 531 0 0
brachbr01 2021 1 CIN NL 35 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA CIN 0 0 0 0 0 0 90
garzara01 2021 2 MIN AL 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA MIN 0 0 0 0 0 0 58

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as fielding21_wide, the right dataset.

Full Join

join_full <- full_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))
playerID yearID stint teamID.x lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP BA PA TB SlugPct OBP OPS BABIP teamID.y Inn_1B Inn_2B Inn_3B Inn_SS Inn_C Inn_OF Inn_P
milonto01 2021 1 TOR AL 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA TOR 0 0 0 0 0 0 42
uriaslu01 2021 1 MIL NL 150 490 77 122 25 1 23 75 5 1 63 116 3 10 1 3 9 0.249 567 218 0.445 0.345 0.790 0.280 MIL 0 409 1525 1607 0 0 0
antonte01 2021 1 CIN NL 23 3 1 0 0 0 0 0 0 0 1 3 0 0 1 0 0 0.000 5 0 0.000 0.250 0.250 NaN CIN 0 0 0 0 0 0 101
hoffmje02 2021 1 CIN NL 32 15 1 2 0 0 0 1 0 0 0 3 0 0 1 0 0 0.133 16 2 0.133 0.133 0.266 0.167 CIN 0 0 0 0 0 0 219
mazzach01 2021 1 TBA AL 14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA 0 0 NA NA NA NA TBA 0 0 0 0 0 0 82
bruihju01 2021 1 LAN NL 21 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 NA 1 0 NA 1.000 NA NA LAN 0 0 0 0 0 0 56

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as as batting21, the larger dataset.

6.2.2 Another Example

Let’s try another example. This time we will use the Pitching and Salaries tables. First we will filter both down to one season (2016).

pitching16 <- Pitching %>% 
  filter(yearID == 2016)

salaries16 <- Salaries %>% 
  filter(yearID == 2016)
pitching16
playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP
mccarbr01 2016 1 LAN NL 2 3 10 9 0 0 0 120 29 22 2 26 44 0.207 4.95 1 2 2 0 171 0 24 1 2 3
wimmeal01 2016 1 MIN AL 1 3 16 0 0 0 0 52 14 8 2 11 14 0.237 4.15 1 0 0 0 72 2 8 0 2 3
osullse01 2016 1 BOS AL 2 0 5 4 0 0 0 64 30 16 3 6 13 0.337 6.75 0 2 2 0 98 1 17 0 1 1
zastrro01 2016 1 CHN NL 1 0 8 1 0 0 0 48 12 2 0 5 17 0.207 1.13 0 0 1 0 66 1 3 0 2 1
salasfe01 2016 2 NYN NL 0 1 17 0 0 0 0 52 11 4 3 0 19 0.177 2.08 0 1 0 0 62 2 4 0 0 1
boydma01 2016 1 DET AL 6 5 20 18 0 0 0 292 97 49 17 29 82 0.258 4.53 0 1 4 0 412 1 51 0 3 8

The data has 30 variables and 824 observations.

salaries16
yearID teamID lgID playerID salary
2016 SEA AL lindad01 8000000
2016 OAK AL fuldsa01 1925000
2016 MIA NL jacksed01 507500
2016 LAA AL smithjo05 5250000
2016 BOS AL vazquch01 513000
2016 CIN NL cothaca01 509675

The data has 5 variables and 853 observations.

The common columns are ‘playerID’, ‘yearID’, and ‘teamID’, and ‘leagueID’. Let’s see what happens if we do not use all four of these variables as keys while joining.

join_half_keys <- inner_join(pitching16, salaries16, join_by(playerID, yearID))
playerID yearID stint teamID.x lgID.x W L G GS CG SHO SV IPouts H ER HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP teamID.y lgID.y salary
lobstky01 2016 1 PIT NL 2 0 14 0 0 0 0 75 25 11 2 12 15 0.263 3.96 1 2 2 0 110 3 11 1 0 4 PIT NL 520000
gonzagi01 2016 1 WAS NL 11 11 32 32 0 0 0 532 179 90 19 59 171 0.262 4.57 2 7 9 0 765 0 98 8 5 17 WAS NL 12100000
harriwi02 2016 1 HOU AL 1 2 66 0 0 0 12 192 52 16 3 15 69 0.220 2.25 1 4 1 0 255 19 17 1 2 6 HOU AL 525500
thornma01 2016 1 SDN NL 1 0 18 0 0 0 0 51 22 11 2 6 9 0.314 5.82 0 0 1 0 77 8 12 0 0 4 SDN NL 1600000
tonkimi01 2016 1 MIN AL 3 2 65 0 0 0 0 215 80 40 13 24 80 0.280 5.02 0 5 3 0 315 23 46 2 0 7 MIN AL 515000
lestejo01 2016 1 CHN NL 19 5 32 32 2 0 0 608 154 55 21 52 197 0.211 2.44 0 4 6 0 796 0 57 4 4 15 CHN NL 25000000

There are two columns for ‘teamID’ and two columns for ‘lgID’.

Now let’s look at what happens if we use all four common variables as keys.

join_all_keys <- inner_join(pitching16, salaries16, join_by(playerID, yearID, teamID, lgID))
playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP salary
morribr01 2016 1 MIA NL 0 0 24 0 0 0 1 53 15 6 4 10 13 0.238 3.06 1 0 1 0 74 4 7 0 0 2 1350000
matusbr01 2016 1 BAL AL 0 0 7 0 0 0 0 18 11 8 3 7 1 0.407 12.00 0 0 0 0 35 2 8 0 1 1 3900000
santihe01 2016 1 LAA AL 10 4 22 22 0 0 0 362 104 57 20 57 107 0.233 4.25 0 2 4 1 515 0 61 3 4 9 5000000
wislema01 2016 1 ATL NL 7 13 27 26 0 0 1 470 159 87 26 49 115 0.259 5.00 3 5 4 1 671 1 90 2 3 12 507500
floydga01 2016 1 TOR AL 2 4 28 0 0 0 0 93 23 14 4 8 30 0.205 4.06 1 4 3 0 124 10 14 0 1 2 1000000
sanchaa01 2016 1 TOR AL 15 2 30 30 0 0 0 576 161 64 15 63 161 0.224 3.00 0 5 5 0 790 0 69 1 2 14 517800

There are no repeat columns! Whenever the data frames you want to join share columns, make sure to specify all of them in the join_by argument.

Here’s a graph created from the joined data.

It should be noted that not all players have salary data. This may be because of a variety of reasons such as arbitration, lack of games played, or simply that their salary data is not available.