Chapter 6 Restructuring Data

Sometimes data may not come organized or formatted in the way we would like. The good thing is there are lots of functions that can be used to restructure data.

6.1 Pivots

A pivot is useful for converting data to wide or long format. pivot_longer() will take multiple columns and turn it into one variable. pivot_wider() takes a column and converts it to multiple different variables. A step-by-step breakdown of pivots can be found in section 12.3 of R for Data Science.

We will practice pivoting in the example below.

6.2 Joins

Joins are helpful for combining two separate datasets. They are joined based on the keys, which can be specified by the coder. For now we will refer to the two data frames as x and y. There are four types of joins:

inner_join() : only keeps observations from x that have a matching key in y
left_join() : keeps all observations in x
right_join() : keeps all observations in y
full_join() : keeps all observations in x and y

Chapter 13 of R for Data Science has a more thorough explanation of joins.

The join functions are part of the dplyr library. This package is in the tidyverse, so it will automatically be loaded if the tidyverse is loaded.

library(dplyr)
set.seed(2023)

6.2.1 Example

We will practice pivoting and joins using the Lahman package.

library(Lahman)

Say we want to join the Batting and Fielding tables. Both of them have over 100,000 observations, so let’s filter it down to only look at the 2021 season.

batting21 <- Batting %>% 
  filter(yearID == 2021)
batting21 <- battingStats(batting21)

fielding21 <- Fielding %>% 
  filter(yearID == 2021)

Let’s take a look at both of these data frames.

batting21

playerID	yearID	stint	teamID	lgID	G	AB	R	H	X2B	HR	RBI	SB	CS	BB	SO	HBP	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP
suterbr01	2021	1	MIL	NL	61	4	0	0	0	0	0	0	0	0	3	0	0	0.000	4	0	0.000	0.000	0.000	0.000
ennsdi01	2021	1	TBA	AL	9	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA
selmasa01	2021	1	SFN	NL	7	1	0	0	0	0	0	0	0	0	1	0	0	0.000	1	0	0.000	0.000	0.000	NaN
rojasjo02	2021	1	LAA	AL	61	168	26	35	14	6	15	2	1	15	50	1	2	0.208	184	67	0.399	0.277	0.676	0.259
vargail01	2021	2	PIT	NL	7	13	0	1	0	0	1	0	0	0	3	0	0	0.077	13	1	0.077	0.077	0.154	0.100
litteza01	2021	1	SFN	NL	63	4	0	1	0	0	0	0	0	0	3	0	0	0.250	4	1	0.250	0.250	0.500	1.000

The data has 29 variables and 1708 observations.

fielding21

playerID	yearID	stint	teamID	lgID	POS	G	GS	InnOuts	PO	A	E	DP	PB	WP	SB	CS	ZR
pederjo01	2021	2	ATL	NL	OF	49	43	1120	79	1	0	0	NA	NA	NA	NA	NA
smithdr01	2021	1	NYN	NL	P	31	1	124	2	1	0	1	NA	NA	NA	NA	NA
lindofr01	2021	1	NYN	NL	SS	124	122	3088	160	275	10	58	NA	NA	NA	NA	NA
valaipa01	2021	1	BAL	AL	2B	72	58	1516	109	121	0	24	NA	NA	NA	NA	NA
higgipj01	2021	1	CHN	NL	C	6	5	138	54	3	0	1	1	NA	1	1	NA
degotal01	2021	1	HOU	AL	2B	2	2	48	2	0	0	0	NA	NA	NA	NA	NA

The data has 18 variables and 2314 observations.

Before joining these tables, we are going to compute a summarized version of the fielding data and adjust the format.

The pivot_wider() function changes the setup of the data by increasing the number of columns and decreasing the number of rows. This is in the tidyr package, which is part of the tidyverse.

library(tidyr)

We will use DJ LeMahieu as an example to help illustrate exactly what pivoting does.

lemahieu <- Fielding %>% 
  filter(playerID == "lemahdj01") %>% 
  arrange(desc(yearID))

playerID	yearID	stint	teamID	lgID	POS	G	GS	InnOuts	PO	A	E	DP	PB	WP	SB	CS	ZR
lemahdj01	2022	1	NYA	AL	1B	35	31	795	230	13	1	16	NA	NA	NA	NA	NA
lemahdj01	2022	1	NYA	AL	2B	41	35	938	53	91	2	21	NA	NA	NA	NA	NA
lemahdj01	2022	1	NYA	AL	3B	47	43	1157	23	106	1	7	NA	NA	NA	NA	NA
lemahdj01	2021	1	NYA	AL	1B	55	33	963	286	13	1	21	NA	NA	NA	NA	NA
lemahdj01	2021	1	NYA	AL	2B	83	77	1989	117	156	2	48	NA	NA	NA	NA	NA
lemahdj01	2021	1	NYA	AL	3B	39	36	897	18	62	6	6	NA	NA	NA	NA	NA
lemahdj01	2020	1	NYA	AL	1B	11	1	72	24	0	0	1	NA	NA	NA	NA	NA
lemahdj01	2020	1	NYA	AL	2B	37	34	812	51	82	4	19	NA	NA	NA	NA	NA
lemahdj01	2020	1	NYA	AL	3B	11	11	261	10	16	2	2	NA	NA	NA	NA	NA
lemahdj01	2019	1	NYA	AL	1B	40	28	786	215	19	2	24	NA	NA	NA	NA	NA
lemahdj01	2019	1	NYA	AL	2B	75	66	1739	118	155	2	32	NA	NA	NA	NA	NA
lemahdj01	2019	1	NYA	AL	3B	52	47	1200	18	87	4	7	NA	NA	NA	NA	NA
lemahdj01	2018	1	COL	NL	2B	128	127	3345	209	378	4	90	NA	NA	NA	NA	NA
lemahdj01	2017	1	COL	NL	2B	153	151	3906	251	470	8	106	NA	NA	NA	NA	NA
lemahdj01	2016	1	COL	NL	2B	146	144	3728	276	422	6	91	NA	NA	NA	NA	NA
lemahdj01	2015	1	COL	NL	2B	149	146	3852	300	452	9	120	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	1B	1	0	3	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	2B	144	135	3539	257	413	6	99	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	3B	7	4	115	2	5	0	0	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	SS	1	0	3	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	1B	1	0	3	2	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	2B	90	86	2250	168	271	3	57	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	3B	14	9	302	6	24	0	2	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	SS	1	0	3	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	1B	1	0	9	1	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	2B	67	60	1527	105	204	2	33	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	3B	9	5	138	2	8	0	0	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	SS	2	0	6	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2011	1	CHN	NL	1B	1	1	24	8	0	0	1	NA	NA	NA	NA	NA
lemahdj01	2011	1	CHN	NL	2B	15	8	233	16	22	0	5	NA	NA	NA	NA	NA
lemahdj01	2011	1	CHN	NL	3B	11	6	180	6	12	4	5	NA	NA	NA	NA	NA

Currently the data is set up so if a player plays multiple positions in one season there will be a separate row for each. There are three rows for the 2021 season; one for first base, one for second base, and one for third base. We want to pivot the data so each player only has one row per season, but create columns to show which positions they played.

This is what the data looks like when pivoted:

playerID	yearID	stint	teamID	Inn_1B	Inn_2B	Inn_3B	Inn_SS
lemahdj01	2022	1	NYA	795	938	1157	0
lemahdj01	2021	1	NYA	963	1989	897	0
lemahdj01	2020	1	NYA	72	812	261	0
lemahdj01	2019	1	NYA	786	1739	1200	0
lemahdj01	2018	1	COL	0	3345	0	0
lemahdj01	2017	1	COL	0	3906	0	0
lemahdj01	2016	1	COL	0	3728	0	0
lemahdj01	2015	1	COL	0	3852	0	0
lemahdj01	2014	1	COL	3	3539	115	3
lemahdj01	2013	1	COL	3	2250	302	3
lemahdj01	2012	1	COL	9	1527	138	6
lemahdj01	2011	1	CHN	24	233	180	0

Each position LeMahieu has played at has it’s own column.

Now it’s time to pivot the entire ‘fielding21’ data frame.

names_from specifies which existing column to use for the new column names. In this case it is ‘POS’ - instead of having one column for position, each different position will now have their own column. values_from says what values should be in those new columns. In this example that is ‘InnOuts’ - the data values in the newly created columns are how many outs a player played in said position.

fielding21_wide <- fielding21 %>%  
  pivot_wider(names_from = POS, values_from = InnOuts) %>% 
  select(playerID:lgID, P:`2B`) %>% 
  replace_na(list(C = 0, `1B` = 0, `2B` = 0, `3B` = 0, 
                  SS = 0, P = 0, OF = 0)) %>%
  summarize(Inn_1B = sum(`1B`),            
            Inn_2B = sum(`2B`),            
            Inn_3B = sum(`3B`), 
            Inn_SS = sum(SS), 
            Inn_C = sum(C),
            Inn_OF = sum(OF), 
            Inn_P = sum(P), 
            .by = c(playerID, yearID, stint, teamID))

playerID	yearID	stint	teamID	Inn_1B	Inn_2B	Inn_OF	Inn_P
smithdo02	2021	1	NYN	212	0	2579	0
bummeaa01	2021	1	CHA	0	0	0	169
altuvjo01	2021	1	HOU	0	3789	0	0
woodal02	2021	1	SFN	0	0	0	416
dickeco01	2021	1	MIA	0	0	1275	0
perezhe01	2021	1	WAS	24	45	36	6

This data frame has 11 variables and 1699 observations.

Using a similar process we could find a player’s primary position, if that was desired.

fielding21 %>%
  select(playerID:InnOuts) %>%
  group_by(playerID, stint) %>%
  arrange(desc(InnOuts), .by_group = TRUE) %>%
  ungroup() %>%
  mutate(InnProp = InnOuts / sum(InnOuts), 
         POS_Q = ifelse(InnProp > 0.25, POS, NA),
         .by = c(playerID, stint)) %>%
  group_by(playerID, stint) %>%
  summarize(Pos = paste(na.omit(POS_Q), collapse = "/")) %>% 
  head(20) %>% 
  kable() %>% 
  kable_styling("striped") %>% 
  scroll_box(height = "300px")

playerID	stint	Pos
abadfe01	1	P
abbotco01	1	P
abreual01	1	P
abreubr01	1	P
abreujo02	1	1B
acevedo01	1	P
acunaro01	1	OF
adamewi01	1	SS
adamewi01	2	SS
adamja01	1	P
adamsau02	1	P
adamsma01	1	1B
adamsri03	1	C
adamsri03	2	C
adelljo01	1	OF
adonjo01	1	P
adriaeh01	1	OF/3B
aguilje01	1	1B
aguilmi01	1	P
ahmedni01	1	SS

Now we are ready to practice some joins!

Inner Join

Remember, a key must be present in both datasets to be joined. There are three columns that overlap between the two: ‘playerID’, ‘yearID’, and ‘stint’. We will specify that these variables are the keys in the join_by argument.

join_inner <- inner_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	X3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_OF	Inn_P
tomka01	2021	2	PIT	NL	39	92	9	14	2	1	2	11	1	0	17	30	1	5	0	3	1	0.152	117	24	0.261	0.308	0.569	0.190	PIT	734	0
cottojh01	2021	1	TEX	AL	23	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	TEX	0	92
freelky01	2021	1	COL	NL	25	33	3	6	1	0	0	4	0	0	2	17	0	0	4	1	0	0.182	40	7	0.212	0.222	0.434	0.353	COL	3	362
fostema01	2021	1	CHA	AL	37	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	CHA	0	117
detmere01	2021	1	LAA	AL	5	1	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0.000	2	0	0.000	0.500	0.500	NaN	LAA	0	62
slateau01	2021	1	SFN	NL	129	274	39	66	12	1	12	32	15	2	28	84	2	4	0	0	7	0.241	306	116	0.423	0.320	0.743	0.303	SFN	1808	1

This creates a new data frame with 37 variables and 1699 observations. It is the same amount of rows as fielding21_wide, the smaller dataset.

Left Join

The code for each type of join is identical, besides the first word of the join function.

join_left <- left_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	X3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_1B	Inn_OF	Inn_P
pinedmi01	2021	1	MIN	AL	22	3	0	0	0	0	0	0	0	0	0	1	0	0	0	0	2	0.000	3	0	0.000	0.000	0.000	0.000	MIN	0	0	328
leedy01	2021	1	ATL	NL	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	ATL	0	0	6
bonifjo01	2021	1	PHI	NL	7	11	0	1	0	0	0	2	0	0	1	6	0	0	0	0	0	0.091	12	1	0.091	0.167	0.258	0.200	PHI	0	63	0
carraca01	2021	1	NYN	NL	12	15	0	0	0	0	0	0	0	0	0	10	0	0	4	0	0	0.000	19	0	0.000	0.000	0.000	0.000	NYN	0	0	161
milneho01	2021	1	MIL	NL	19	1	0	0	0	0	0	0	0	0	1	1	0	0	0	0	0	0.000	2	0	0.000	0.500	0.500	NaN	MIL	0	0	65
kirilal01	2021	1	MIN	AL	59	215	23	54	11	1	8	34	1	1	14	52	2	1	0	1	3	0.251	231	91	0.423	0.299	0.722	0.295	MIN	641	795	0

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as batting21, the left dataset.

Right Join

join_right <- right_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	HR	RBI	BB	SO	IBB	HBP	SH	SF	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_2B	Inn_3B	Inn_C	Inn_P
guerrde01	2021	1	OAK	AL	53	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	OAK	0	0	0	197
newcose01	2021	1	ATL	NL	32	1	0	0	0	0	0	0	1	0	0	1	0	0	0.000	2	0	0.000	0.000	0.000	NaN	ATL	0	0	0	97
lowrije01	2021	1	OAK	AL	139	457	55	112	28	14	69	49	108	1	2	0	4	8	0.245	512	182	0.398	0.318	0.716	0.289	OAK	1754	9	0	0
ruizke01	2021	2	WAS	NL	23	81	9	23	3	2	14	6	4	0	2	0	0	3	0.284	89	32	0.395	0.348	0.743	0.280	WAS	0	0	531	0
brachbr01	2021	1	CIN	NL	35	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	CIN	0	0	0	90
garzara01	2021	2	MIN	AL	18	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	MIN	0	0	0	58

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as fielding21_wide, the right dataset.

Full Join

join_full <- full_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	X3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_2B	Inn_3B	Inn_SS	Inn_P
milonto01	2021	1	TOR	AL	6	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	TOR	0	0	0	42
uriaslu01	2021	1	MIL	NL	150	490	77	122	25	1	23	75	5	1	63	116	3	10	1	3	9	0.249	567	218	0.445	0.345	0.790	0.280	MIL	409	1525	1607	0
antonte01	2021	1	CIN	NL	23	3	1	0	0	0	0	0	0	0	1	3	0	0	1	0	0	0.000	5	0	0.000	0.250	0.250	NaN	CIN	0	0	0	101
hoffmje02	2021	1	CIN	NL	32	15	1	2	0	0	0	1	0	0	0	3	0	0	1	0	0	0.133	16	2	0.133	0.133	0.266	0.167	CIN	0	0	0	219
mazzach01	2021	1	TBA	AL	14	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	TBA	0	0	0	82
bruihju01	2021	1	LAN	NL	21	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	NA	1	0	NA	1.000	NA	NA	LAN	0	0	0	56

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as as batting21, the larger dataset.

6.2.2 Another Example

Let’s try another example. This time we will use the Pitching and Salaries tables. First we will filter both down to one season (2016).

pitching16 <- Pitching %>% 
  filter(yearID == 2016)

salaries16 <- Salaries %>% 
  filter(yearID == 2016)

pitching16

playerID	yearID	stint	teamID	lgID	W	L	G	GS	IPouts	H	ER	HR	BB	SO	BAOpp	ERA	IBB	WP	HBP	BFP	GF	R	SH	SF	GIDP
mccarbr01	2016	1	LAN	NL	2	3	10	9	120	29	22	2	26	44	0.207	4.95	1	2	2	171	0	24	1	2	3
wimmeal01	2016	1	MIN	AL	1	3	16	0	52	14	8	2	11	14	0.237	4.15	1	0	0	72	2	8	0	2	3
osullse01	2016	1	BOS	AL	2	0	5	4	64	30	16	3	6	13	0.337	6.75	0	2	2	98	1	17	0	1	1
zastrro01	2016	1	CHN	NL	1	0	8	1	48	12	2	0	5	17	0.207	1.13	0	0	1	66	1	3	0	2	1
salasfe01	2016	2	NYN	NL	0	1	17	0	52	11	4	3	0	19	0.177	2.08	0	1	0	62	2	4	0	0	1
boydma01	2016	1	DET	AL	6	5	20	18	292	97	49	17	29	82	0.258	4.53	0	1	4	412	1	51	0	3	8

The data has 30 variables and 824 observations.

salaries16

yearID	teamID	lgID	playerID	salary
2016	SEA	AL	lindad01	8000000
2016	OAK	AL	fuldsa01	1925000
2016	MIA	NL	jacksed01	507500
2016	LAA	AL	smithjo05	5250000
2016	BOS	AL	vazquch01	513000
2016	CIN	NL	cothaca01	509675

The data has 5 variables and 853 observations.

The common columns are ‘playerID’, ‘yearID’, and ‘teamID’, and ‘leagueID’. Let’s see what happens if we do not use all four of these variables as keys while joining.

join_half_keys <- inner_join(pitching16, salaries16, join_by(playerID, yearID))

playerID	yearID	stint	teamID.x	lgID.x	W	L	G	GS	CG	SV	IPouts	H	ER	HR	BB	SO	BAOpp	ERA	IBB	WP	HBP	BFP	GF	R	SH	SF	GIDP	teamID.y	lgID.y	salary
lobstky01	2016	1	PIT	NL	2	0	14	0	0	0	75	25	11	2	12	15	0.263	3.96	1	2	2	110	3	11	1	0	4	PIT	NL	520000
gonzagi01	2016	1	WAS	NL	11	11	32	32	0	0	532	179	90	19	59	171	0.262	4.57	2	7	9	765	0	98	8	5	17	WAS	NL	12100000
harriwi02	2016	1	HOU	AL	1	2	66	0	0	12	192	52	16	3	15	69	0.220	2.25	1	4	1	255	19	17	1	2	6	HOU	AL	525500
thornma01	2016	1	SDN	NL	1	0	18	0	0	0	51	22	11	2	6	9	0.314	5.82	0	0	1	77	8	12	0	0	4	SDN	NL	1600000
tonkimi01	2016	1	MIN	AL	3	2	65	0	0	0	215	80	40	13	24	80	0.280	5.02	0	5	3	315	23	46	2	0	7	MIN	AL	515000
lestejo01	2016	1	CHN	NL	19	5	32	32	2	0	608	154	55	21	52	197	0.211	2.44	0	4	6	796	0	57	4	4	15	CHN	NL	25000000

There are two columns for ‘teamID’ and two columns for ‘lgID’.

Now let’s look at what happens if we use all four common variables as keys.

join_all_keys <- inner_join(pitching16, salaries16, join_by(playerID, yearID, teamID, lgID))

playerID	yearID	stint	teamID	lgID	W	L	G	GS	SV	IPouts	H	ER	HR	BB	SO	BAOpp	ERA	IBB	WP	HBP	BK	BFP	GF	R	SH	SF	GIDP	salary
morribr01	2016	1	MIA	NL	0	0	24	0	1	53	15	6	4	10	13	0.238	3.06	1	0	1	0	74	4	7	0	0	2	1350000
matusbr01	2016	1	BAL	AL	0	0	7	0	0	18	11	8	3	7	1	0.407	12.00	0	0	0	0	35	2	8	0	1	1	3900000
santihe01	2016	1	LAA	AL	10	4	22	22	0	362	104	57	20	57	107	0.233	4.25	0	2	4	1	515	0	61	3	4	9	5000000
wislema01	2016	1	ATL	NL	7	13	27	26	1	470	159	87	26	49	115	0.259	5.00	3	5	4	1	671	1	90	2	3	12	507500
floydga01	2016	1	TOR	AL	2	4	28	0	0	93	23	14	4	8	30	0.205	4.06	1	4	3	0	124	10	14	0	1	2	1000000
sanchaa01	2016	1	TOR	AL	15	2	30	30	0	576	161	64	15	63	161	0.224	3.00	0	5	5	0	790	0	69	1	2	14	517800

There are no repeat columns! Whenever the data frames you want to join share columns, make sure to specify all of them in the join_by argument.

Here’s a graph created from the joined data.

It should be noted that not all players have salary data. This may be because of a variety of reasons such as arbitration, lack of games played, or simply that their salary data is not available.