5 Restructuring Data

Sometimes data may not come organized or formatted in the way we would like. The good thing is there are lots of functions that can be used to restructure data.

5.1 Pivots

A pivot is useful for converting data to wide or long format. pivot_longer() will take multiple columns and turn it into one variable. pivot_wider() takes a column and converts it to multiple different variables. A step-by-step breakdown of pivots can be found in section 12.3 of R for Data Science.

We will practice pivoting in the example below.

5.2 Joins

Joins are helpful for combining two separate datasets. They are joined based on the keys, which can be specified by the coder. For now we will refer to the two data frames as x and y. There are four types of joins:

inner_join() : only keeps observations from x that have a matching key in y
left_join() : keeps all observations in x
right_join() : keeps all observations in y
full_join() : keeps all observations in x and y

Chapter 13 of R for Data Science has a more thorough explanation of joins.

The join functions are part of the dplyr library. This package is in the tidyverse, so it will automatically be loaded if the tidyverse is loaded.

library(dplyr)
set.seed(2023)

5.2.1 Example

We will practice pivoting and joins using the Lahman package.

library(Lahman)

Say we want to join the Batting and Fielding tables. Both of them have over 100,000 observations, so let’s filter it down to only look at the 2021 season.

batting21 <- Batting %>% 
  filter(yearID == 2021)
batting21 <- battingStats(batting21)

fielding21 <- Fielding %>% 
  filter(yearID == 2021)

Let’s take a look at both of these data frames.

batting21

playerID	yearID	stint	teamID	lgID	G	AB	R	H	X2B	X3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP
swansda01	2021	1	ATL	NL	160	588	78	146	33	2	27	88	9	3	52	167	4	5	1	7	7	0.248	653	264	0.449	0.311	0.760	0.297
ennsdi01	2021	1	TBA	AL	9	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA
semiema01	2021	1	TOR	AL	162	652	115	173	39	2	45	102	15	1	66	146	0	3	0	3	9	0.265	724	351	0.538	0.334	0.872	0.276
romanjo03	2021	1	TOR	AL	62	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA
varshda01	2021	1	ARI	NL	95	284	41	70	17	2	11	38	6	0	30	67	3	0	1	0	4	0.246	315	124	0.437	0.318	0.755	0.286
llovema01	2021	1	PHI	NL	6	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA

The data has 29 variables and 1706 observations.

fielding21

playerID	yearID	stint	teamID	lgID	POS	G	GS	InnOuts	PO	A	PB	WP	SB	CS	ZR
penafe01	2021	1	LAA	AL	P	2	0	5	0	0	NA	NA	0	0	NA
smithjo05	2021	2	SEA	AL	P	23	0	54	1	1	NA	NA	2	0	NA
lintz01	2021	1	MIN	AL	OF	1	0	6	0	0	NA	NA	NA	NA	NA
valaipa01	2021	1	BAL	AL	OF	3	0	27	3	0	NA	NA	NA	NA	NA
hildetr01	2021	1	NYN	NL	P	2	0	7	0	0	NA	NA	0	0	NA
degotal01	2021	1	HOU	AL	2B	2	2	48	2	0	NA	NA	NA	NA	NA

The data has 18 variables and 2312 observations.

Before joining these tables, we are going to compute a summarized version of the fielding data and adjust the format.

The pivot_wider() function changes the setup of the data by increasing the number of columns and decreasing the number of rows. This is in the tidyr package, which is part of the tidyverse.

library(tidyr)

We will use DJ LeMahieu as an example to help illustrate exactly what pivoting does.

lemahieu <- Fielding %>% 
  filter(playerID == "lemahdj01") %>% 
  arrange(desc(yearID))

playerID	yearID	stint	teamID	lgID	POS	G	GS	InnOuts	PO	A	E	DP	PB	WP	SB	CS	ZR
lemahdj01	2021	1	NYA	AL	1B	55	33	963	286	13	1	21	NA	NA	NA	NA	NA
lemahdj01	2021	1	NYA	AL	2B	83	77	1989	117	156	2	48	NA	NA	NA	NA	NA
lemahdj01	2021	1	NYA	AL	3B	39	36	897	18	62	6	6	NA	NA	NA	NA	NA
lemahdj01	2020	1	NYA	AL	1B	11	1	72	24	0	0	1	NA	NA	NA	NA	NA
lemahdj01	2020	1	NYA	AL	2B	37	34	812	51	82	4	19	NA	NA	NA	NA	NA
lemahdj01	2020	1	NYA	AL	3B	11	11	261	10	16	2	2	NA	NA	NA	NA	NA
lemahdj01	2019	1	NYA	AL	1B	40	28	786	215	19	2	24	NA	NA	NA	NA	NA
lemahdj01	2019	1	NYA	AL	2B	75	66	1739	118	155	2	32	NA	NA	NA	NA	NA
lemahdj01	2019	1	NYA	AL	3B	52	47	1200	18	87	4	7	NA	NA	NA	NA	NA
lemahdj01	2018	1	COL	NL	2B	128	127	3345	209	378	4	90	NA	NA	NA	NA	NA
lemahdj01	2017	1	COL	NL	2B	153	151	3906	251	470	8	106	NA	NA	NA	NA	NA
lemahdj01	2016	1	COL	NL	2B	146	144	3728	276	422	6	91	NA	NA	NA	NA	NA
lemahdj01	2015	1	COL	NL	2B	149	146	3852	300	452	9	120	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	1B	1	0	3	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	2B	144	135	3539	257	413	6	99	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	3B	7	4	115	2	5	0	0	NA	NA	NA	NA	NA
lemahdj01	2014	1	COL	NL	SS	1	0	3	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	1B	1	0	3	2	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	2B	90	86	2250	168	271	3	57	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	3B	14	9	302	6	24	0	2	NA	NA	NA	NA	NA
lemahdj01	2013	1	COL	NL	SS	1	0	3	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	1B	1	0	9	1	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	2B	67	60	1527	105	204	2	33	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	3B	9	5	138	2	8	0	0	NA	NA	NA	NA	NA
lemahdj01	2012	1	COL	NL	SS	2	0	6	0	0	0	0	NA	NA	NA	NA	NA
lemahdj01	2011	1	CHN	NL	1B	1	1	24	8	0	0	1	NA	NA	NA	NA	NA
lemahdj01	2011	1	CHN	NL	2B	15	8	233	16	22	0	5	NA	NA	NA	NA	NA
lemahdj01	2011	1	CHN	NL	3B	11	6	180	6	12	4	5	NA	NA	NA	NA	NA

Currently the data is set up so if a player plays multiple positions in one season there will be a separate row for each. There are three rows for the 2021 season; one for first base, one for second base, and one for third base. We want to pivot the data so each player only has one row per season, but create columns to show which positions they played.

This is what the data looks like when pivoted:

playerID	yearID	stint	teamID	Inn_1B	Inn_2B	Inn_3B	Inn_SS
lemahdj01	2021	1	NYA	963	1989	897	0
lemahdj01	2020	1	NYA	72	812	261	0
lemahdj01	2019	1	NYA	786	1739	1200	0
lemahdj01	2018	1	COL	0	3345	0	0
lemahdj01	2017	1	COL	0	3906	0	0
lemahdj01	2016	1	COL	0	3728	0	0
lemahdj01	2015	1	COL	0	3852	0	0
lemahdj01	2014	1	COL	3	3539	115	3
lemahdj01	2013	1	COL	3	2250	302	3
lemahdj01	2012	1	COL	9	1527	138	6
lemahdj01	2011	1	CHN	24	233	180	0

Each position LeMahieu has played at has it’s own column.

Now it’s time to pivot the entire ‘fielding21’ data frame.

names_from specifies which existing column to use for the new column names. In this case it is ‘POS’ - instead of having one column for position, each different position will now have their own column. values_from says what values should be in those new columns. In this example that is ‘InnOuts’ - the data values in the newly created columns are how many outs a player played in said position.

fielding21_wide <- fielding21 %>%  
  pivot_wider(names_from = POS, values_from = InnOuts) %>% 
  select(playerID:lgID, P:`2B`) %>% 
  replace_na(list(C = 0, `1B` = 0, `2B` = 0, `3B` = 0, 
                  SS = 0, P = 0, OF = 0)) %>%
  summarize(Inn_1B = sum(`1B`),            
            Inn_2B = sum(`2B`),            
            Inn_3B = sum(`3B`), 
            Inn_SS = sum(SS), 
            Inn_C = sum(C),
            Inn_OF = sum(OF), 
            Inn_P = sum(P), 
            .by = c(playerID, yearID, stint, teamID))

playerID	yearID	stint	teamID	Inn_1B	Inn_2B	Inn_C	Inn_OF	Inn_P
smithjo05	2021	1	HOU	0	0	0	0	65
bummeaa01	2021	1	CHA	0	0	0	0	169
altuvjo01	2021	1	HOU	0	3789	0	0	0
woodhu01	2021	1	TEX	0	0	0	0	15
dickeco01	2021	1	MIA	0	0	0	1275	0
perezmi03	2021	1	PIT	3	0	1465	0	0

This data frame has 11 variables and 1697 observations.

Using a similar process we could find a player’s primary position, if that was desired.

fielding21 %>%
  select(playerID:InnOuts) %>%
  group_by(playerID, stint) %>%
  arrange(desc(InnOuts), .by_group = TRUE) %>%
  ungroup() %>%
  mutate(InnProp = InnOuts / sum(InnOuts), 
         POS_Q = ifelse(InnProp > 0.25, POS, NA),
         .by = c(playerID, stint)) %>%
  group_by(playerID, stint) %>%
  summarize(Pos = paste(na.omit(POS_Q), collapse = "/")) %>% 
  head(20) %>% 
  kable() %>% 
  kable_styling("striped") %>% 
  scroll_box(height = "300px")

playerID	stint	Pos
abadfe01	1	P
abbotco01	1	P
abreual01	1	P
abreubr01	1	P
abreujo02	1	1B
acevedo01	1	P
acunaro01	1	OF
adamewi01	1	SS
adamewi01	2	SS
adamja01	1	P
adamsau02	1	P
adamsma01	1	1B
adamsri03	1	C
adamsri03	2	C
adelljo01	1	OF
adonjo01	1	P
adriaeh01	1	OF/3B
aguilje01	1	1B
aguilmi01	1	P
ahmedni01	1	SS

Now we are ready to practice some joins!

Inner Join

Remember, a key must be present in both datasets to be joined. There are three columns that overlap between the two: ‘playerID’, ‘yearID’, and ‘stint’. We will specify that these variables are the keys in the join_by argument.

join_inner <- inner_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	RBI	BB	SO	SH	SF	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_OF	Inn_P
topaju01	2021	1	MIL	NL	4	1	0	0	0	0	0	1	0	0	0.000	1	0	0.000	0.000	0.000	NaN	MIL	0	10
cottojh01	2021	1	TEX	AL	23	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	TEX	0	92
freelky01	2021	1	COL	NL	25	33	3	6	1	4	2	17	4	1	0.182	40	7	0.212	0.222	0.434	0.353	COL	3	362
fostema01	2021	1	CHA	AL	37	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	CHA	0	117
detmere01	2021	1	LAA	AL	5	1	0	0	0	0	1	1	0	0	0.000	2	0	0.000	0.500	0.500	NaN	LAA	0	62
smeltde01	2021	1	MIN	AL	1	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	MIN	0	14

This creates a new data frame with 37 variables and 1697 observations. It is the same amount of rows as fielding21_wide, the smaller dataset.

Left Join

The code for each type of join is identical, besides the first word of the join function.

join_left <- left_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	HR	RBI	BB	SO	HBP	SH	SF	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_C	Inn_OF	Inn_P
pivetni01	2021	1	BOS	AL	31	2	0	0	0	0	0	0	2	0	0	0	0	0.000	2	0	0.000	0.000	0.000	NaN	BOS	0	0	465
leekh01	2021	1	NYN	NL	11	18	2	1	1	0	1	0	13	0	0	0	0	0.056	18	2	0.111	0.056	0.167	0.200	NYN	0	134	0
bonifjo01	2021	1	PHI	NL	7	11	0	1	0	0	2	1	6	0	0	0	0	0.091	12	1	0.091	0.167	0.258	0.200	PHI	0	63	0
carraca01	2021	1	NYN	NL	12	15	0	0	0	0	0	0	10	0	4	0	0	0.000	19	0	0.000	0.000	0.000	0.000	NYN	0	0	161
minayju01	2021	1	MIN	AL	29	0	0	0	0	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	MIN	0	0	120
kirkal01	2021	1	TOR	AL	60	165	19	40	8	8	24	19	22	3	0	2	7	0.242	189	72	0.436	0.328	0.764	0.234	TOR	1014	0	0

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as batting21, the left dataset.

Right Join

join_right <- right_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	RBI	BB	SO	SF	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_1B	Inn_3B	Inn_OF	Inn_P
guerrja01	2021	1	WAS	NL	6	1	0	0	0	0	0	1	0	0.000	1	0	0.0	0.000	0.000	NaN	WAS	0	0	0	18
newsolj01	2021	1	SEA	AL	7	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	SEA	0	0	0	44
lowthza01	2021	1	BAL	AL	10	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	BAL	0	0	0	89
ruizri01	2021	2	COL	NL	30	35	1	6	1	4	3	9	2	0.171	40	7	0.2	0.225	0.425	0.214	COL	24	75	9	0
brachbr01	2021	1	CIN	NL	35	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	CIN	0	0	0	90
garzara01	2021	2	MIN	AL	18	0	0	0	0	0	0	0	0	NA	0	0	NA	NA	NA	NA	MIN	0	0	0	58

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as fielding21_wide, the right dataset.

Full Join

join_full <- full_join(batting21, fielding21_wide, by = join_by(playerID, yearID, stint))

playerID	yearID	stint	teamID.x	lgID	G	AB	R	H	X2B	X3B	HR	RBI	SB	CS	BB	SO	IBB	HBP	SH	SF	GIDP	BA	PA	TB	SlugPct	OBP	OPS	BABIP	teamID.y	Inn_1B	Inn_C	Inn_P
minormi01	2021	1	KCA	AL	28	7	0	0	0	0	0	0	0	0	0	4	0	0	0	0	0	0.000	7	0	0.000	0.000	0.000	0.000	KCA	0	0	476
urquijo01	2021	1	HOU	AL	20	4	0	1	0	0	0	0	0	0	0	3	0	0	0	0	0	0.250	4	1	0.250	0.250	0.500	1.000	HOU	0	0	321
antonte01	2021	1	CIN	NL	23	3	1	0	0	0	0	0	0	0	1	3	0	0	1	0	0	0.000	5	0	0.000	0.250	0.250	NaN	CIN	0	0	101
holadbr01	2021	1	ARI	NL	13	31	2	6	2	0	0	1	0	0	1	15	0	2	0	0	0	0.194	34	8	0.258	0.265	0.523	0.375	ARI	0	255	3
mccanja02	2021	1	NYN	NL	121	375	29	87	12	1	10	46	1	2	32	115	1	2	0	3	12	0.232	412	131	0.349	0.294	0.643	0.304	NYN	135	2479	0
bruihju01	2021	1	LAN	NL	21	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	NA	1	0	NA	1.000	NA	NA	LAN	0	0	56

This creates a new data frame with 37 variables and 37 observations. It is the same amount of rows as as batting21, the larger dataset.

5.2.2 Another Example

Let’s try another example. This time we will use the Pitching and Salaries tables. First we will filter both down to one season (2016).

pitching16 <- Pitching %>% 
  filter(yearID == 2016)

salaries16 <- Salaries %>% 
  filter(yearID == 2016)

pitching16

playerID	yearID	stint	teamID	lgID	W	L	G	GS	IPouts	H	ER	HR	BB	SO	BAOpp	ERA	IBB	WP	HBP	BFP	GF	R	SH	SF	GIDP
mccarbr01	2016	1	LAN	NL	2	3	10	9	120	29	22	2	26	44	0.207	4.95	1	2	2	171	0	24	1	2	3
wimmeal01	2016	1	MIN	AL	1	3	16	0	52	14	8	2	11	14	0.237	4.15	1	0	0	72	2	8	0	2	3
osullse01	2016	1	BOS	AL	2	0	5	4	64	30	16	3	6	13	0.337	6.75	0	2	2	98	1	17	0	1	1
zastrro01	2016	1	CHN	NL	1	0	8	1	48	12	2	0	5	17	0.207	1.13	0	0	1	66	1	3	0	2	1
salasfe01	2016	2	NYN	NL	0	1	17	0	52	11	4	3	0	19	0.177	2.08	0	1	0	62	2	4	0	0	1
boydma01	2016	1	DET	AL	6	5	20	18	292	97	49	17	29	82	0.258	4.53	0	1	4	412	1	51	0	3	8

The data has 30 variables and 824 observations.

salaries16

yearID	teamID	lgID	playerID	salary
2016	SEA	AL	lindad01	8000000
2016	OAK	AL	fuldsa01	1925000
2016	MIA	NL	jacksed01	507500
2016	LAA	AL	smithjo05	5250000
2016	BOS	AL	vazquch01	513000
2016	CIN	NL	cothaca01	509675

The data has 5 variables and 853 observations.

The common columns are ‘playerID’, ‘yearID’, and ‘teamID’, and ‘leagueID’. Let’s see what happens if we do not use all four of these variables as keys while joining.

join_half_keys <- inner_join(pitching16, salaries16, join_by(playerID, yearID))

playerID	yearID	stint	teamID.x	lgID.x	W	L	G	GS	CG	SV	IPouts	H	ER	HR	BB	SO	BAOpp	ERA	IBB	WP	HBP	BFP	GF	R	SH	SF	GIDP	teamID.y	lgID.y	salary
lobstky01	2016	1	PIT	NL	2	0	14	0	0	0	75	25	11	2	12	15	0.263	3.96	1	2	2	110	3	11	1	0	4	PIT	NL	520000
gonzagi01	2016	1	WAS	NL	11	11	32	32	0	0	532	179	90	19	59	171	0.262	4.57	2	7	9	765	0	98	8	5	17	WAS	NL	12100000
harriwi02	2016	1	HOU	AL	1	2	66	0	0	12	192	52	16	3	15	69	0.220	2.25	1	4	1	255	19	17	1	2	6	HOU	AL	525500
thornma01	2016	1	SDN	NL	1	0	18	0	0	0	51	22	11	2	6	9	0.314	5.82	0	0	1	77	8	12	0	0	4	SDN	NL	1600000
tonkimi01	2016	1	MIN	AL	3	2	65	0	0	0	215	80	40	13	24	80	0.280	5.02	0	5	3	315	23	46	2	0	7	MIN	AL	515000
lestejo01	2016	1	CHN	NL	19	5	32	32	2	0	608	154	55	21	52	197	0.211	2.44	0	4	6	796	0	57	4	4	15	CHN	NL	25000000

There are two columns for ‘teamID’ and two columns for ‘lgID’.

Now let’s look at what happens if we use all four common variables as keys.

join_all_keys <- inner_join(pitching16, salaries16, join_by(playerID, yearID, teamID, lgID))

playerID	yearID	stint	teamID	lgID	W	L	G	GS	SV	IPouts	H	ER	HR	BB	SO	BAOpp	ERA	IBB	WP	HBP	BK	BFP	GF	R	SH	SF	GIDP	salary
morribr01	2016	1	MIA	NL	0	0	24	0	1	53	15	6	4	10	13	0.238	3.06	1	0	1	0	74	4	7	0	0	2	1350000
matusbr01	2016	1	BAL	AL	0	0	7	0	0	18	11	8	3	7	1	0.407	12.00	0	0	0	0	35	2	8	0	1	1	3900000
santihe01	2016	1	LAA	AL	10	4	22	22	0	362	104	57	20	57	107	0.233	4.25	0	2	4	1	515	0	61	3	4	9	5000000
wislema01	2016	1	ATL	NL	7	13	27	26	1	470	159	87	26	49	115	0.259	5.00	3	5	4	1	671	1	90	2	3	12	507500
floydga01	2016	1	TOR	AL	2	4	28	0	0	93	23	14	4	8	30	0.205	4.06	1	4	3	0	124	10	14	0	1	2	1000000
sanchaa01	2016	1	TOR	AL	15	2	30	30	0	576	161	64	15	63	161	0.224	3.00	0	5	5	0	790	0	69	1	2	14	517800

There are no repeat columns! Whenever the data frames you want to join share columns, make sure to specify all of them in the join_by argument.

Here’s a graph created from the joined data.

It should be noted that not all players have salary data. This may be because of a variety of reasons such as arbitration, lack of games played, or simply that their salary data is not available.