Chapter 6 Combining data

Often you’ll want to take two (or more) data sources and connect them. Want to know how community demographics impact crime rates? Or how the characteristics of the college someone attended impact their ability to repay their student loans? Anytime we have one data in one data set and want to associate it with data in another, what we want to do is merge those two data sets. We want to take the two data sets and make them into one.

As was said in Where to Find Data, it is rare to find all the information you need or want for an analysis in one single source. As such, you’ll probably be pulling form multiple sources. How do you combine them then?

This chapter will walk through the nuts and bolts of merging, along with some common mistakes that can be made.

6.1 Getting Started

We’ll begin that with some fake data I’m going to generate below, just to make the examples really really clear. Let’s pretend this data is for a state that has 6 counties.

set.seed(5) # set.seed tells R which iteration of random numbers I want to use below. I run it so that the examples below are consistent for everyone.
ID <- c(1,2,3,4,5,6) # unique ID for each entry
MedianIncome <- rnorm(6, 45000, 5000) # median income
Manufacturing <- rnorm(6, .43, .1) # % of workers in manufacturing
CollegeEdu <- rnorm(6, 45000, 5000)
Murder <- rnorm(6, 7.8, 4.3) # Muder rate
df1 <- data.frame(ID,CollegeEdu,Manufacturing) # creating data set with 3 variables
df2 <- data.frame(ID, Murder) # creating data set with 2 variables

Okay, so we have two data frames now. df1, which has information about a counties economy…

df1

##   ID CollegeEdu Manufacturing
## 1  1   39598.04     0.3827834
## 2  2   44212.33     0.3664629
## 3  3   39641.20     0.4014226
## 4  4   44305.07     0.4438108
## 5  5   42013.43     0.5527630
## 6  6   34080.17     0.3498221

and df2, which has information about the murder rate…

df2

##   ID    Murder
## 1  1  8.835514
## 2  2  6.684772
## 3  3 11.672201
## 4  4 11.850038
## 5  5 14.112236
## 6  6 10.839073

So if you wanted to study the impact of local economies on murder rates, you’d want to combine or merge these two data sets. You want to make sure that the information for county 1 is matched from both data sets, so that you have consistent information too. To do that we’ll use the merge command, merge(). The big hurdle with the merge command which we’ll keep practicing in this section is that we need to tell R what variable we want to use that connects the two data sets. How does R know that the information in a given row is for the same unit in the two data sets? In this case we have the variable ID which is unique to each entry and identifies which county is which. And we need to tell R what to name that new data set we’re creating when we merge the two together with the by.x and by.y. Below I’ll name the new data set df.merged for the sake of clarity.

df.merged <-merge(df1, df2, by.x="ID", by.y="ID")
df.merged

##   ID CollegeEdu Manufacturing    Murder
## 1  1   39598.04     0.3827834  8.835514
## 2  2   44212.33     0.3664629  6.684772
## 3  3   39641.20     0.4014226 11.672201
## 4  4   44305.07     0.4438108 11.850038
## 5  5   42013.43     0.5527630 14.112236
## 6  6   34080.17     0.3498221 10.839073

Where we once had two data sets, we now have one with both sets of information combined. You can look back above and see that R has correctly placed the values for Murder for each county with its matching information for the economy.

Okay, let’s make it a bit more complicated and create a little more data. Let’s say we have a data set with multiple entries for each county because we have an observation for each year.

set.seed(5)
ID <- c(1,1,2,2,3,3,4,4,5,5,6,6) # unique ID for each entry
Year <- c(2005, 2006,2005, 2006,2005, 2006,2005, 2006,2005, 2006,2005, 2006)
Poverty <- rnorm(12, .15, .03)
df3 <- data.frame(ID,Year,Poverty)
df3

##    ID Year   Poverty
## 1   1 2005 0.1247743
## 2   1 2006 0.1915308
## 3   2 2005 0.1123352
## 4   2 2006 0.1521043
## 5   3 2005 0.2013432
## 6   3 2006 0.1319128
## 7   4 2005 0.1358350
## 8   4 2006 0.1309389
## 9   5 2005 0.1414268
## 10  5 2006 0.1541432
## 11  6 2005 0.1868289
## 12  6 2006 0.1259466

So let’s say we want to combine df3 with df1. df1 only has 6 rows and df3 has 12. How many rows will our new data set have after merging?

df.merged2 <- merge(df1, df3, by.x="ID", by.y="ID")
nrow(df.merged2)

## [1] 12

We have 12 rows. Why? Because R matched each unique ID in df1 to each in df3. Each row that had the value “1” in df1 was simply matched multiple times to each row in df3 that had the value “1”. df1 only had 6 rows, so those repeated to match the 12 rows in df3. As such, the values for CollegeEdu and Manufacturing appear twice below, because we only have one value for ID 1, but we have two different values for Poverty.

df.merged2

##    ID CollegeEdu Manufacturing Year   Poverty
## 1   1   39598.04     0.3827834 2005 0.1247743
## 2   1   39598.04     0.3827834 2006 0.1915308
## 3   2   44212.33     0.3664629 2005 0.1123352
## 4   2   44212.33     0.3664629 2006 0.1521043
## 5   3   39641.20     0.4014226 2005 0.2013432
## 6   3   39641.20     0.4014226 2006 0.1319128
## 7   4   44305.07     0.4438108 2005 0.1358350
## 8   4   44305.07     0.4438108 2006 0.1309389
## 9   5   42013.43     0.5527630 2005 0.1414268
## 10  5   42013.43     0.5527630 2006 0.1541432
## 11  6   34080.17     0.3498221 2005 0.1868289
## 12  6   34080.17     0.3498221 2006 0.1259466

What if we only wanted to keep the 6 observations that were from the same year for the two data sets? We can do that, as long as we have a varaible identifying the year in both.

First let’s add a variable for year to df1

df1$Year <- 2006

We can use multiple columns to match our data to make sure that R only merges the ones we want. We’ll give the two by options a list of values to look for in merging below.

df.merged2 <- merge(df1, df3, by.x=c("ID","Year"), by.y=c("ID","Year"))
df.merged2

##   ID Year CollegeEdu Manufacturing   Poverty
## 1  1 2006   39598.04     0.3827834 0.1915308
## 2  2 2006   44212.33     0.3664629 0.1521043
## 3  3 2006   39641.20     0.4014226 0.1319128
## 4  4 2006   44305.07     0.4438108 0.1309389
## 5  5 2006   42013.43     0.5527630 0.1541432
## 6  6 2006   34080.17     0.3498221 0.1259466

Now the merged data has only 6 rows, because only the ones from teh year 2006 matched between the two data sets.

6.2 Unequal Observations

In the above examples we had an observation for each of the six values in ID, but R can handle it if that’s not true as well.

set.seed(5)
ID <- c(1,3,5,6) # unique ID for each entry
Population <- rnorm(4, 30000, 10000)
df4 <- data.frame(ID,Population)
df4

##   ID Population
## 1  1   21591.45
## 2  3   43843.59
## 3  5   17445.08
## 4  6   30701.43

We only have 4 rows in df4. So if we merge it with df1, how many rows will we have?

df.merged3 <- merge(df1, df4, by.x="ID", by.y="ID")
df.merged3

##   ID CollegeEdu Manufacturing Year Population
## 1  1   39598.04     0.3827834 2006   21591.45
## 2  3   39641.20     0.4014226 2006   43843.59
## 3  5   42013.43     0.5527630 2006   17445.08
## 4  6   34080.17     0.3498221 2006   30701.43

4 rows. Each observation that is present in df4 was matched to one in df1, and those that are absent are lost. If we want to make sure all of our observations from df1 are still included, we can set all.x to TRUE.

df.merged3 <- merge(df1, df4, by.x="ID", by.y="ID", all.x=TRUE)
df.merged3

##   ID CollegeEdu Manufacturing Year Population
## 1  1   39598.04     0.3827834 2006   21591.45
## 2  2   44212.33     0.3664629 2006         NA
## 3  3   39641.20     0.4014226 2006   43843.59
## 4  4   44305.07     0.4438108 2006         NA
## 5  5   42013.43     0.5527630 2006   17445.08
## 6  6   34080.17     0.3498221 2006   30701.43

The new data set still doesn’t have information for the rows that were absent in df4, but we still have all the data that was present in df1. We set all.x to true because we wanted to keep all of the observations in the first dataset in our merge. We could use all.y if we were worried about keeping all of the observations from df4, or just all if we wanted to retain all observations from both data sets.

I’ll demonstrate that with a few more examples.

set.seed(5)
ID.x <- c("A", "B", "C", "D", "E") # unique ID for each entry
Size <- rnorm(5, 30000, 10000)
ID.y <- c("A", "C", "D", "E", "F", "G")
Wealth <- rnorm(6, 60000, 30000)
x <-  data.frame(ID.x,Size)
y <-  data.frame(ID.y,Wealth)

x

##   ID.x     Size
## 1    A 21591.45
## 2    B 43843.59
## 3    C 17445.08
## 4    D 30701.43
## 5    E 47114.41

##   ID.y   Wealth
## 1    A 41912.76
## 2    C 45835.01
## 3    D 40938.86
## 4    E 51426.79
## 5    F 64143.25
## 6    G 96828.91

Only keeping those observations in both data sets.

xy.merge1 <- merge(x,y, by.x="ID.x", by.y="ID.y")
xy.merge1

##   ID.x     Size   Wealth
## 1    A 21591.45 41912.76
## 2    C 17445.08 45835.01
## 3    D 30701.43 40938.86
## 4    E 47114.41 51426.79

Keeping all observations in the first data set.

xy.merge2 <- merge(x,y, by.x="ID.x", by.y="ID.y", all.x=T)
xy.merge2

##   ID.x     Size   Wealth
## 1    A 21591.45 41912.76
## 2    B 43843.59       NA
## 3    C 17445.08 45835.01
## 4    D 30701.43 40938.86
## 5    E 47114.41 51426.79

Keeping all observations in the second data set.

xy.merge3 <- merge(x,y, by.x="ID.x", by.y="ID.y", all.y=T)
xy.merge3

##   ID.x     Size   Wealth
## 1    A 21591.45 41912.76
## 2    C 17445.08 45835.01
## 3    D 30701.43 40938.86
## 4    E 47114.41 51426.79
## 5    F       NA 64143.25
## 6    G       NA 96828.91

And keeping all observations in both data sets.

xy.merge3 <- merge(x,y, by.x="ID.x", by.y="ID.y", all=T)
xy.merge3

##   ID.x     Size   Wealth
## 1    A 21591.45 41912.76
## 2    B 43843.59       NA
## 3    C 17445.08 45835.01
## 4    D 30701.43 40938.86
## 5    E 47114.41 51426.79
## 6    F       NA 64143.25
## 7    G       NA 96828.91

6.3 More Practice

Those were some simplified examples with data sets we generated just to demonstrate the key concepts. Data is going to be a bit messier in the wild though. The following examples use the Lahman package, to show how this might work in the real world. If you haven’t used the Lahman package before you’ll need to install it with install.packages(“Lahman”). Since I’ve already got it installed on my computer, I can read it in with just the library command.

library(Lahman)

The Lahman data set has a bunch of different data sets about baseball. No knowledge about baseball is required to understand these examples though beyond knowing what a team is as a general concept. The data sets contain different information, some are focused on teams, some are focused on players, some are about stadiums, some are just about the playoffs. What’s great is that all of the data sets are linked in different ways, allowing you to mix and match.

data(package="Lahman")

One of the big challenges we face in merging data sets is having something that links the two rows that have the same information. When we give the by option in the merge command the name of a column, that is telling R where to look for matches between the two data sets to merge them in a coherent way. We don’t want information from one unit getting matched up with other units.

Let’s start with two data sets about teams that are in the Lahman package.

data(Teams)
head(Teams)

##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro
## 1         BS1
## 2         CH1
## 3         CL1
## 4         FW1
## 5         NY2
## 6         PH1

data(TeamsFranchises)
head(TeamsFranchises)

##   franchID                    franchName active NAassoc
## 1      ALT         Altoona Mountain City      N    <NA>
## 2      ANA Los Angeles Angels of Anaheim      Y    <NA>
## 3      ARI          Arizona Diamondbacks      Y    <NA>
## 4      ATH        Philadelphia Athletics      N     PNA
## 5      ATL                Atlanta Braves      Y     BNA
## 6      BAL             Baltimore Orioles      Y    <NA>

In both data frames we have a column called “franchID” which has a unique (shortened) name for every tame that has played professional baseball. If we want to combine these two data sets that’s probably the best way to tell R how to match them. Let’s keep an eye on how many matches we have by checking the number of rows in our data sets before and after we merge.

nrow(Teams)

## [1] 2895

nrow(TeamsFranchises)

## [1] 120

teams.merge <- merge(Teams,TeamsFranchises, by.x="franchID", by.y="franchID")
nrow(teams.merge)

## [1] 2895

Great. We startated with 2895 observations in Teams and 120 in TeamFranchises, and we ended up with 2895. That means that there were multiple observations in Teams, so the information in TeamsFranchises matched multiple times.

head(teams.merge)

##   franchID yearID lgID teamID divID Rank   G Ghome  W  L DivWin WCWin
## 1      ALT   1884   UA    ALT  <NA>   10  25    NA  6 19   <NA>  <NA>
## 2      ANA   1991   AL    CAL     W    7 162    81 81 81      N  <NA>
## 3      ANA   1998   AL    ANA     W    2 162    81 85 77      N     N
## 4      ANA   2011   AL    LAA     W    2 162    81 86 76      N     N
## 5      ANA   1994   AL    CAL     W    4 115    63 47 68   <NA>  <NA>
## 6      ANA   1977   AL    CAL     W    5 162    81 74 88      N  <NA>
##   LgWin WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER
## 1     N  <NA>  90  899  223  30   6   2  22  130  NA NA  NA NA 216 114
## 2     N     N 653 5470 1396 245  29 115 448  928  94 56  38 31 649 591
## 3     N     N 787 5630 1530 314  27 147 510 1028  93 45  48 41 783 720
## 4     N     N 667 5513 1394 289  34 155 442 1086 135 52  51 32 633 581
## 5  <NA>  <NA> 543 3943 1042 178  16 120 402  715  65 54  27 29 660 618
## 6     N     N 675 5410 1380 233  40 131 542  880 159 89  36 51 695 594
##    ERA CG SHO SV IPouts   HA HRA BBA  SOA   E  DP    FP
## 1 4.67 20   0  0    659  292   3  52   93 156   4 0.862
## 2 3.69 18  10 50   4325 1351 141 543  990 102 156 0.984
## 3 4.49  3   5 52   4332 1481 164 630 1091 106 146 0.983
## 4 3.57 12  11 39   4395 1388 142 476 1058  93 157 0.985
## 5 5.42 11   4 21   3081 1149 150 436  682  76 110 0.983
## 6 3.72 53  13 26   4313 1383 136 572  965 147 137 0.976
##                            name                       park attendance BPF
## 1         Altoona Mountain City                       <NA>         NA 101
## 2             California Angels            Anaheim Stadium    2416236  99
## 3                Anaheim Angels Edison International Field    2519280 102
## 4 Los Angeles Angels of Anaheim              Angel Stadium    3166321  93
## 5             California Angels            Anaheim Stadium    1512622 101
## 6             California Angels            Anaheim Stadium    1432633  96
##   PPF teamIDBR teamIDlahman45 teamIDretro                    franchName
## 1 109      ALT            ALT         ALT         Altoona Mountain City
## 2 100      CAL            CAL         CAL Los Angeles Angels of Anaheim
## 3 102      ANA            ANA         ANA Los Angeles Angels of Anaheim
## 4  93      LAA            ANA         ANA Los Angeles Angels of Anaheim
## 5 101      CAL            CAL         CAL Los Angeles Angels of Anaheim
## 6  96      CAL            CAL         CAL Los Angeles Angels of Anaheim
##   active NAassoc
## 1      N    <NA>
## 2      Y    <NA>
## 3      Y    <NA>
## 4      Y    <NA>
## 5      Y    <NA>
## 6      Y    <NA>

Let’s make the next one a little more complicated then. Let’s bring together the data on teams in the Teams data set with another data set named TeamsHalf.

data(TeamsHalf)
head(TeamsHalf)

##   yearID lgID teamID Half divID DivWin Rank  G  W  L
## 1   1981   NL    ATL    1     W      N    4 54 25 29
## 2   1981   NL    ATL    2     W      N    5 52 25 27
## 3   1981   AL    BAL    1     E      N    2 54 31 23
## 4   1981   AL    BAL    2     E      N    4 51 28 23
## 5   1981   AL    BOS    1     E      N    5 56 30 26
## 6   1981   AL    BOS    2     E      N    2 52 29 23

We don’t have a column in TeamsHalf called franchID, but both data sets do have a column called teamID. We can use that to merge them then.

nrow(Teams)

## [1] 2895

nrow(TeamsHalf)

## [1] 52

teams.merge2 <- merge(Teams,TeamsHalf, by.x="teamID", by.y="teamID")
nrow(teams.merge2)

## [1] 4090

Wait, we started with a data set containing 2895 rows, and one with 52 rows, and we combined them and got 4136 rows? What happened? This is a common challenge that occurs when combining data sets.

Let’s look back at the original data.

head(TeamsHalf)

##   yearID lgID teamID Half divID DivWin Rank  G  W  L
## 1   1981   NL    ATL    1     W      N    4 54 25 29
## 2   1981   NL    ATL    2     W      N    5 52 25 27
## 3   1981   AL    BAL    1     E      N    2 54 31 23
## 4   1981   AL    BAL    2     E      N    4 51 28 23
## 5   1981   AL    BOS    1     E      N    5 56 30 26
## 6   1981   AL    BOS    2     E      N    2 52 29 23

head(Teams)

##   yearID lgID teamID franchID divID Rank  G Ghome  W  L DivWin WCWin LgWin
## 1   1871   NA    BS1      BNA  <NA>    3 31    NA 20 10   <NA>  <NA>     N
## 2   1871   NA    CH1      CNA  <NA>    2 28    NA 19  9   <NA>  <NA>     N
## 3   1871   NA    CL1      CFC  <NA>    8 29    NA 10 19   <NA>  <NA>     N
## 4   1871   NA    FW1      KEK  <NA>    7 19    NA  7 12   <NA>  <NA>     N
## 5   1871   NA    NY2      NNA  <NA>    5 33    NA 16 17   <NA>  <NA>     N
## 6   1871   NA    PH1      PNA  <NA>    1 28    NA 21  7   <NA>  <NA>     Y
##   WSWin   R   AB   H X2B X3B HR BB SO SB CS HBP SF  RA  ER  ERA CG SHO SV
## 1  <NA> 401 1372 426  70  37  3 60 19 73 16  NA NA 303 109 3.55 22   1  3
## 2  <NA> 302 1196 323  52  21 10 60 22 69 21  NA NA 241  77 2.76 25   0  1
## 3  <NA> 249 1186 328  35  40  7 26 25 18  8  NA NA 341 116 4.11 23   0  0
## 4  <NA> 137  746 178  19   8  2 33  9 16  4  NA NA 243  97 5.17 19   1  0
## 5  <NA> 302 1404 403  43  21  1 33 15 46 15  NA NA 313 121 3.72 32   1  0
## 6  <NA> 376 1281 410  66  27  9 46 23 56 12  NA NA 266 137 4.95 27   0  0
##   IPouts  HA HRA BBA SOA   E DP    FP                    name
## 1    828 367   2  42  23 243 24 0.834    Boston Red Stockings
## 2    753 308   6  28  22 229 16 0.829 Chicago White Stockings
## 3    762 346  13  53  34 234 15 0.818  Cleveland Forest Citys
## 4    507 261   5  21  17 163  8 0.803    Fort Wayne Kekiongas
## 5    879 373   7  42  22 235 14 0.840        New York Mutuals
## 6    747 329   3  53  16 194 13 0.845  Philadelphia Athletics
##                           park attendance BPF PPF teamIDBR teamIDlahman45
## 1          South End Grounds I         NA 103  98      BOS            BS1
## 2      Union Base-Ball Grounds         NA 104 102      CHI            CH1
## 3 National Association Grounds         NA  96 100      CLE            CL1
## 4               Hamilton Field         NA 101 107      KEK            FW1
## 5     Union Grounds (Brooklyn)         NA  90  88      NYU            NY2
## 6     Jefferson Street Grounds         NA 102  98      ATH            PH1
##   teamIDretro
## 1         BS1
## 2         CH1
## 3         CL1
## 4         FW1
## 5         NY2
## 6         PH1

The issue was that the values in teamID weren’t unique to each row. Each row of data contains information about a team in a given year. So there are multiple entries for each team in TeamHalf and Teams, and so those matched multiple times when we combined them. But we can fix that by including multiple columns in our by=.

nrow(Teams)

## [1] 2895

nrow(TeamsHalf)

## [1] 52

teams.merge2 <- merge(Teams,TeamsHalf, by.x=c("teamID", "yearID"), by.y=c("teamID", "yearID"))
nrow(teams.merge2)

## [1] 52

Now we have 52 rows, because each of the 52 rows in TeamsHalf matched 1 to 1 with a row in Teams, and all other observations didn’t make it to the data set. You can adjust whether you want to keep all those rows with all= as we described above.

6.4 Crosswalks

Crosswalks are a mergers best friend. What all of the examples we’ve done above rely on us having a column in both data sets that contains the same information so that we can merge the data. That isn’t always true though. Sometimes we have to figure out a way to add the data we need in order to be able to merge the data we care about.

Let’s say we wanted to combine data on players and franchises or teams. Specifically, let’s merge the data set Appearances with TeamsFranchises.

data("Appearances")
head(Appearances)

##   yearID teamID lgID  playerID G_all GS G_batting G_defense G_p G_c G_1b
## 1   1871    TRO   NA abercda01     1  1         1         1   0   0    0
## 2   1871    RC1   NA  addybo01    25 25        25        25   0   0    0
## 3   1871    CL1   NA allisar01    29 29        29        29   0   0    0
## 4   1871    WS3   NA allisdo01    27 27        27        27   0  27    0
## 5   1871    RC1   NA ansonca01    25 25        25        25   0   5    1
## 6   1871    FW1   NA armstbo01    12 12        12        12   0   0    0
##   G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1    0    0    1    0    0    0    0    0    0    0
## 2   22    0    3    0    0    0    0    0    0    0
## 3    2    0    0    0   29    0   29    0    0    0
## 4    0    0    0    0    0    0    0    0    0    0
## 5    2   20    0    1    0    0    1    0    0    0
## 6    0    0    0    0   11    1   12    0    0    0

data("TeamsFranchises")
head(TeamsFranchises)

##   franchID                    franchName active NAassoc
## 1      ALT         Altoona Mountain City      N    <NA>
## 2      ANA Los Angeles Angels of Anaheim      Y    <NA>
## 3      ARI          Arizona Diamondbacks      Y    <NA>
## 4      ATH        Philadelphia Athletics      N     PNA
## 5      ATL                Atlanta Braves      Y     BNA
## 6      BAL             Baltimore Orioles      Y    <NA>

We have the franchID in the TeamsFranchises data set, and teamID in Appearances. What we need to do is find a data set that has both teamID and franchId so that we can merge Appearances and TeamsFranchses together. We had that information in the Teams dataset, so let’s extract it and create a unique crosswalk for our situation. And we only walk the unique combinations of those names, so we’ll clean out any duplicates. duplicated() will check if there are any duplicated entries in our data set, and the exclamation! tells the command that we want duplicates excluded.

crosswalk <- Teams[, c("teamID", "franchID") ]
nrow(crosswalk)

## [1] 2895

crosswalk <- crosswalk[!duplicated(crosswalk),]
nrow(crosswalk)

## [1] 151

Now our new object crosswalk is a list of each unique teamId and franchID. So we can merge it with TeamsFranchises and use it to merge TeamsFranchises to Appearances.

nrow(TeamsFranchises)

## [1] 120

nrow(crosswalk)

## [1] 151

TeamsFranchises2 <- merge(TeamsFranchises, crosswalk, by.x="franchID", by.y="franchID" )
nrow(TeamsFranchises2)

## [1] 151

nrow(Appearances)

## [1] 105789

TeamsFrancises2.Appearances <- merge(TeamsFranchises2, Appearances, by.x="teamID", by.y="teamID" )
nrow(TeamsFrancises2.Appearances)

## [1] 106945

There were a few players in the Appearances data set that had multiple entries because they played for multiple teams in a given year. That’s fine, the main thing is that we were able to merge the two data sets.

Creating a crosswalk in this case was fairly simple, because we were using information that was all built into a single package in R. Often times though that’s a step that can take a bit of elbow grease.

Sometimes, we won’t be able to find data so simply that helps us to match our different sources. For instance, let’s read in the data from the CPS we’ve used in previous chapters. And let’s read in some data about arrests rates in US States that you should recognize from other chapters

dat <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/RClass/cps_00003.csv")
USArrests <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/USArrestsAugmented.csv")

What I want to do is combine the information about state crime rates to see how that affects individuals in the CPS data. How can I do that, do I have a common column? First, let’s look at both data sets and look for common columns.

head(dat)

##   YEAR SERIAL MONTH  HWTFINL       CPSID REGION STATEFIP METRO METAREA
## 1 2018      1    11 1703.832 2.01708e+13     32        1     2    3440
## 2 2018      1    11 1703.832 2.01708e+13     32        1     2    3440
## 3 2018      3    11 1957.313 2.01809e+13     32        1     2    5240
## 4 2018      4    11 1687.784 2.01710e+13     32        1     2    5240
## 5 2018      4    11 1687.784 2.01710e+13     32        1     2    5240
## 6 2018      4    11 1687.784 2.01710e+13     32        1     2    5240
##   STATECENSUS FAMINC PERNUM   WTFINL      CPSIDP AGE SEX RACE EMPSTAT
## 1          63    830      1 1703.832 2.01708e+13  26   2  100      10
## 2          63    830      2 1845.094 2.01708e+13  26   1  100      10
## 3          63    100      1 1957.313 2.01809e+13  48   2  200      21
## 4          63    820      1 1687.784 2.01710e+13  53   2  200      10
## 5          63    820      2 2780.421 2.01710e+13  16   1  200      10
## 6          63    820      3 2780.421 2.01710e+13  16   1  200      10
##   LABFORCE EDUC VOTED VOREG
## 1        2  111    98    98
## 2        2  123    98    98
## 3        2   73     2    99
## 4        2   81     2    99
## 5        2   50    99    99
## 6        2   50    99    99

head(USArrests)

##        State Murder Assault UrbanPop Rape Region
## 1    Alabama   13.2     236       58 21.2  South
## 2     Alaska   10.0     263       48 44.5   West
## 3    Arizona    8.1     294       80 31.0   West
## 4   Arkansas    8.8     190       50 19.5  South
## 5 California    9.0     276       91 40.6   West
## 6   Colorado    7.9     204       78 38.7   West

So I’ve got the state for each individual in STATEFIP, but that’s a numeric code for each state. And in the USArrets data I have the name of each state. R wont recognize that 1 is Alabama, even if I know that from experience. So what can I do?

I need to find some file that can crosswalk the two values, so that I can either add the names of states to the CPS data or the numeric doe to the USArrests data.

I’ve spent a lot of my time creating new crosswalks, so I’ve built up a few strategies for now to develop them. A lot of these are built on using data cleaning strategies in R in order to generate a column in R with information that matches across the two data sets. And you’d probably have to build that for yourself. You’d need to find some website with that information (thank gosh for Wikipedia), copy the information, paste it into a csv file, save the file, load it into R, and then get to work on merging. So it’s not fun, but its necessary. Like I said, it takes some elbow grease.

6.5 What can go wrong

It’s good to save your file under a new name anytime you merge, because you’ll make mistakes, and you don’t want to overwrite your existing data.

What can go wrong? As I’ve said before, R is a stickler for spelling, but it won’t let your merge command process if it doesn’t find the variables you indicate in the command. So if you say to use Id instead of ID, it just wont process. That’s what it’s telling you if it says . It’s saying you either misspelled your column or that column isn’t in the data.

Alternatively, you might think you have common values in two columns, but if you don’t nothing will merge. What if we’d used STATEFIP and State from the CPS and USArrests data?

dat2 <- merge(dat, USArrests, by.x="STATEFIP", by.y="State")
nrow(dat2)

## [1] 0

No values matched for the two columns. We need to go back and check out the variables we’re trying to use, and figure out a way to get them merged. All the merging is happening out of sight, and you don’t want to lose data unnecessarily. Always keep an eye on the data going in and going out when merging.