Chapter 6 Combining data
Often you’ll want to take two (or more) data sources and connect them. Want to know how community demographics impact crime rates? Or how the characteristics of the college someone attended impact their ability to repay their student loans? Anytime we have data in one data set and want to associate it with data in another, what we want to do is merge those two data sets. We want to take the two data sets and make them into one.
As was said in Where to Find Data, it is rare to find all the information you need or want for an analysis in one single source. As such, you’ll probably be pulling from multiple sources. How do you combine them then?
This chapter will walk through the nuts and bolts of merging, along with some common mistakes that can be made.
6.1 Getting Started
We’ll begin with some fake data I’m going to generate below, just to make the examples as clear as possible. Let’s pretend this data is for a state that has 6 counties.
set.seed(5) # set.seed tells R which iteration of random numbers I want to use below. I run it so that the examples below are consistent for everyone.
ID <- c(1,2,3,4,5,6) # unique ID for each entry
MedianIncome <- rnorm(6, 45000, 5000) # median income
Manufacturing <- rnorm(6, .43, .1) # % of workers in manufacturing
CollegeEdu <- rnorm(6, 45000, 5000) # college education
Murder <- rnorm(6, 7.8, 4.3) # murder rate
df1 <- data.frame(ID,CollegeEdu,Manufacturing) # creating data set with 3 variables
df2 <- data.frame(ID, Murder) # creating data set with 2 variables
Okay, so we have two data frames now. df1, which has information about a county’s economy…
## ID CollegeEdu Manufacturing
## 1 1 39598.04 0.3827834
## 2 2 44212.33 0.3664629
## 3 3 39641.20 0.4014226
## 4 4 44305.07 0.4438108
## 5 5 42013.43 0.5527630
## 6 6 34080.17 0.3498221
and df2, which has information about the murder rate…
## ID Murder
## 1 1 8.835514
## 2 2 6.684772
## 3 3 11.672201
## 4 4 11.850038
## 5 5 14.112236
## 6 6 10.839073
So if you wanted to study the impact of local economies on murder rates, you’d want to combine or merge these two data sets. You want to make sure that the information for county 1 is matched across both data sets, so that your information stays consistent. To do that we’ll use the merge command, merge(). The big hurdle with the merge command, which we’ll keep practicing in this section, is that we need to tell R what variable connects the two data sets. How does R know that the information in a given row is for the same unit in the two data sets? In this case we have the variable ID, which is unique to each entry and identifies which county is which, and we point R at it with the by.x and by.y options (by.x names the matching column in the first data set, by.y the column in the second). Below I’ll name the new data set df.merged for the sake of clarity.
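The merge call itself isn’t echoed in the text; a minimal sketch, assuming the object name df.merged:

```r
# match rows on the ID column in each data set
df.merged <- merge(df1, df2, by.x = "ID", by.y = "ID")
df.merged
```

Because the column is named ID in both data sets, by = "ID" alone would also work here; by.x and by.y are needed when the names differ.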
## ID CollegeEdu Manufacturing Murder
## 1 1 39598.04 0.3827834 8.835514
## 2 2 44212.33 0.3664629 6.684772
## 3 3 39641.20 0.4014226 11.672201
## 4 4 44305.07 0.4438108 11.850038
## 5 5 42013.43 0.5527630 14.112236
## 6 6 34080.17 0.3498221 10.839073
Where we once had two data sets, we now have one with both sets of information combined. You can look back above and see that R has correctly placed the values for Murder for each county with its matching information for the economy.
Okay, let’s make it a bit more complicated and create a little more data. Let’s say we have a data set with multiple entries for each county because we have an observation for each year.
set.seed(5)
ID <- c(1,1,2,2,3,3,4,4,5,5,6,6) # unique ID for each entry
Year <- c(2005, 2006,2005, 2006,2005, 2006,2005, 2006,2005, 2006,2005, 2006)
Poverty <- rnorm(12, .15, .03)
df3 <- data.frame(ID,Year,Poverty)
df3
## ID Year Poverty
## 1 1 2005 0.1247743
## 2 1 2006 0.1915308
## 3 2 2005 0.1123352
## 4 2 2006 0.1521043
## 5 3 2005 0.2013432
## 6 3 2006 0.1319128
## 7 4 2005 0.1358350
## 8 4 2006 0.1309389
## 9 5 2005 0.1414268
## 10 5 2006 0.1541432
## 11 6 2005 0.1868289
## 12 6 2006 0.1259466
So let’s say we want to combine df3 with df1. df1 only has 6 rows and df3 has 12. How many rows will our new data set have after merging?
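A sketch of that merge and the row count, assuming the object name df1.3:

```r
# merge the 6-row df1 with the 12-row df3, matching on ID
df1.3 <- merge(df1, df3, by.x = "ID", by.y = "ID")
nrow(df1.3)
```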
## [1] 12
We have 12 rows. Why? Because R matched each unique ID in df1 to every row in df3 with the same ID. Each row that had the value “1” in df1 was simply matched to both rows in df3 that had the value “1”. df1 only had 6 rows, so those repeated to cover the 12 rows in df3. As such, the values for CollegeEdu and Manufacturing each appear twice below, because we only have one value of each per ID, but two different values for Poverty.
## ID CollegeEdu Manufacturing Year Poverty
## 1 1 39598.04 0.3827834 2005 0.1247743
## 2 1 39598.04 0.3827834 2006 0.1915308
## 3 2 44212.33 0.3664629 2005 0.1123352
## 4 2 44212.33 0.3664629 2006 0.1521043
## 5 3 39641.20 0.4014226 2005 0.2013432
## 6 3 39641.20 0.4014226 2006 0.1319128
## 7 4 44305.07 0.4438108 2005 0.1358350
## 8 4 44305.07 0.4438108 2006 0.1309389
## 9 5 42013.43 0.5527630 2005 0.1414268
## 10 5 42013.43 0.5527630 2006 0.1541432
## 11 6 34080.17 0.3498221 2005 0.1868289
## 12 6 34080.17 0.3498221 2006 0.1259466
What if we only wanted to keep the 6 observations that were from the same year in the two data sets? We can do that, as long as we have a variable identifying the year in both.
First let’s add a variable for year to df1.
We can use multiple columns to match our data to make sure that R only merges the ones we want. We’ll give the two by options a list of values to look for in merging below.
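A sketch of both steps, assuming df1 gets the year 2006 (the year shown in the output below) and the object name df.merged2:

```r
# add a Year column to df1 so both data sets share it
df1$Year <- 2006
# rows only match when both ID and Year agree
df.merged2 <- merge(df1, df3, by.x = c("ID", "Year"), by.y = c("ID", "Year"))
df.merged2
```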
## ID Year CollegeEdu Manufacturing Poverty
## 1 1 2006 39598.04 0.3827834 0.1915308
## 2 2 2006 44212.33 0.3664629 0.1521043
## 3 3 2006 39641.20 0.4014226 0.1319128
## 4 4 2006 44305.07 0.4438108 0.1309389
## 5 5 2006 42013.43 0.5527630 0.1541432
## 6 6 2006 34080.17 0.3498221 0.1259466
Now the merged data has only 6 rows, because only the ones from the year 2006 matched between the two data sets.
6.2 Unequal Observations
In the above examples we had an observation for each of the six values in ID, but R can handle it if that’s not true as well.
set.seed(5)
ID <- c(1,3,5,6) # unique ID for each entry
Population <- rnorm(4, 30000, 10000)
df4 <- data.frame(ID,Population)
df4
## ID Population
## 1 1 21591.45
## 2 3 43843.59
## 3 5 17445.08
## 4 6 30701.43
We only have 4 rows in df4. So if we merge it with df1, how many rows will we have?
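A sketch of the merge, assuming the object name df1.4 (note that df1 now carries the Year column added earlier):

```r
df1.4 <- merge(df1, df4, by.x = "ID", by.y = "ID")
df1.4
```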
## ID CollegeEdu Manufacturing Year Population
## 1 1 39598.04 0.3827834 2006 21591.45
## 2 3 39641.20 0.4014226 2006 43843.59
## 3 5 42013.43 0.5527630 2006 17445.08
## 4 6 34080.17 0.3498221 2006 30701.43
4 rows. Each observation that is present in df4 was matched to one in df1, and those that are absent are lost. If we want to make sure all of our observations from df1 are still included, we can set all.x to TRUE.
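A sketch of that call, assuming the same object name:

```r
# all.x = TRUE keeps every row of the first (x) data set, filling in NA
df1.4 <- merge(df1, df4, by.x = "ID", by.y = "ID", all.x = TRUE)
df1.4
```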
## ID CollegeEdu Manufacturing Year Population
## 1 1 39598.04 0.3827834 2006 21591.45
## 2 2 44212.33 0.3664629 2006 NA
## 3 3 39641.20 0.4014226 2006 43843.59
## 4 4 44305.07 0.4438108 2006 NA
## 5 5 42013.43 0.5527630 2006 17445.08
## 6 6 34080.17 0.3498221 2006 30701.43
The new data set still doesn’t have information for the rows that were absent in df4, but we still have all the data that was present in df1. We set all.x to TRUE because we wanted to keep all of the observations from the first data set in our merge. We could use all.y if we were worried about keeping all of the observations from df4, or just all if we wanted to retain all observations from both data sets.
I’ll demonstrate that with a few more examples.
set.seed(5)
ID.x <- c("A", "B", "C", "D", "E") # unique ID for each entry
Size <- rnorm(5, 30000, 10000)
ID.y <- c("A", "C", "D", "E", "F", "G")
Wealth <- rnorm(6, 60000, 30000)
x <- data.frame(ID.x,Size)
y <- data.frame(ID.y,Wealth)
x
## ID.x Size
## 1 A 21591.45
## 2 B 43843.59
## 3 C 17445.08
## 4 D 30701.43
## 5 E 47114.41
## ID.y Wealth
## 1 A 41912.76
## 2 C 45835.01
## 3 D 40938.86
## 4 E 51426.79
## 5 F 64143.25
## 6 G 96828.91
Only keeping those observations in both data sets.
## ID.x Size Wealth
## 1 A 21591.45 41912.76
## 2 C 17445.08 45835.01
## 3 D 30701.43 40938.86
## 4 E 47114.41 51426.79
Keeping all observations in the first data set.
## ID.x Size Wealth
## 1 A 21591.45 41912.76
## 2 B 43843.59 NA
## 3 C 17445.08 45835.01
## 4 D 30701.43 40938.86
## 5 E 47114.41 51426.79
Keeping all observations in the second data set.
## ID.x Size Wealth
## 1 A 21591.45 41912.76
## 2 C 17445.08 45835.01
## 3 D 30701.43 40938.86
## 4 E 47114.41 51426.79
## 5 F NA 64143.25
## 6 G NA 96828.91
And keeping all observations in both data sets.
## ID.x Size Wealth
## 1 A 21591.45 41912.76
## 2 B 43843.59 NA
## 3 C 17445.08 45835.01
## 4 D 30701.43 40938.86
## 5 E 47114.41 51426.79
## 6 F NA 64143.25
## 7 G NA 96828.91
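For reference, the four results above can be sketched with the all options on one base call (the original chunks aren’t echoed in the text):

```r
merge(x, y, by.x = "ID.x", by.y = "ID.y")                # only matches in both
merge(x, y, by.x = "ID.x", by.y = "ID.y", all.x = TRUE)  # keep all of x
merge(x, y, by.x = "ID.x", by.y = "ID.y", all.y = TRUE)  # keep all of y
merge(x, y, by.x = "ID.x", by.y = "ID.y", all = TRUE)    # keep everything
```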
6.3 More Practice
Those were some simplified examples with data sets we generated just to demonstrate the key concepts. Data is going to be a bit messier in the wild though. The following examples use the Lahman package to show how this might work in the real world. If you haven’t used the Lahman package before you’ll need to install it with install.packages("Lahman"). Since I’ve already got it installed on my computer, I can load it with just the library() command.
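The setup looks like this:

```r
# install.packages("Lahman") # run once if the package isn't installed
library(Lahman)
```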
The Lahman package contains a bunch of different data sets about baseball. No knowledge of baseball is required to understand these examples, beyond knowing what a team is as a general concept. The data sets contain different information: some are focused on teams, some on players, some on stadiums, some just on the playoffs. What’s great is that all of the data sets are linked in different ways, allowing you to mix and match.
One of the big challenges we face in merging data sets is having something that links the two rows that have the same information. When we give the by option in the merge command the name of a column, that is telling R where to look for matches between the two data sets to merge them in a coherent way. We don’t want information from one unit getting matched up with other units.
Let’s start with two data sets about teams that are in the Lahman package.
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
## 6 PH1
## franchID franchName active NAassoc
## 1 ALT Altoona Mountain City N <NA>
## 2 ANA Los Angeles Angels of Anaheim Y <NA>
## 3 ARI Arizona Diamondbacks Y <NA>
## 4 ATH Philadelphia Athletics N PNA
## 5 ATL Atlanta Braves Y BNA
## 6 BAL Baltimore Orioles Y <NA>
In both data frames we have a column called “franchID” which has a unique (shortened) name for every team that has played professional baseball. If we want to combine these two data sets, that’s probably the best way to tell R how to match them. Let’s keep an eye on how many matches we have by checking the number of rows in our data sets before and after we merge.
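A sketch of those checks and the merge, assuming the object name Teams.Franchises:

```r
nrow(Teams)
nrow(TeamsFranchises)
# franchID appears in both data sets, so we match on it
Teams.Franchises <- merge(Teams, TeamsFranchises, by.x = "franchID", by.y = "franchID")
nrow(Teams.Franchises)
```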
## [1] 2895
## [1] 120
## [1] 2895
Great. We started with 2895 observations in Teams and 120 in TeamsFranchises, and we ended up with 2895. That means there were multiple observations in Teams for each franchise, and the information in TeamsFranchises matched each of them.
## franchID yearID lgID teamID divID Rank G Ghome W L DivWin WCWin
## 1 ALT 1884 UA ALT <NA> 10 25 NA 6 19 <NA> <NA>
## 2 ANA 1991 AL CAL W 7 162 81 81 81 N <NA>
## 3 ANA 1998 AL ANA W 2 162 81 85 77 N N
## 4 ANA 2011 AL LAA W 2 162 81 86 76 N N
## 5 ANA 1994 AL CAL W 4 115 63 47 68 <NA> <NA>
## 6 ANA 1977 AL CAL W 5 162 81 74 88 N <NA>
## LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER
## 1 N <NA> 90 899 223 30 6 2 22 130 NA NA NA NA 216 114
## 2 N N 653 5470 1396 245 29 115 448 928 94 56 38 31 649 591
## 3 N N 787 5630 1530 314 27 147 510 1028 93 45 48 41 783 720
## 4 N N 667 5513 1394 289 34 155 442 1086 135 52 51 32 633 581
## 5 <NA> <NA> 543 3943 1042 178 16 120 402 715 65 54 27 29 660 618
## 6 N N 675 5410 1380 233 40 131 542 880 159 89 36 51 695 594
## ERA CG SHO SV IPouts HA HRA BBA SOA E DP FP
## 1 4.67 20 0 0 659 292 3 52 93 156 4 0.862
## 2 3.69 18 10 50 4325 1351 141 543 990 102 156 0.984
## 3 4.49 3 5 52 4332 1481 164 630 1091 106 146 0.983
## 4 3.57 12 11 39 4395 1388 142 476 1058 93 157 0.985
## 5 5.42 11 4 21 3081 1149 150 436 682 76 110 0.983
## 6 3.72 53 13 26 4313 1383 136 572 965 147 137 0.976
## name park attendance BPF
## 1 Altoona Mountain City <NA> NA 101
## 2 California Angels Anaheim Stadium 2416236 99
## 3 Anaheim Angels Edison International Field 2519280 102
## 4 Los Angeles Angels of Anaheim Angel Stadium 3166321 93
## 5 California Angels Anaheim Stadium 1512622 101
## 6 California Angels Anaheim Stadium 1432633 96
## PPF teamIDBR teamIDlahman45 teamIDretro franchName
## 1 109 ALT ALT ALT Altoona Mountain City
## 2 100 CAL CAL CAL Los Angeles Angels of Anaheim
## 3 102 ANA ANA ANA Los Angeles Angels of Anaheim
## 4 93 LAA ANA ANA Los Angeles Angels of Anaheim
## 5 101 CAL CAL CAL Los Angeles Angels of Anaheim
## 6 96 CAL CAL CAL Los Angeles Angels of Anaheim
## active NAassoc
## 1 N <NA>
## 2 Y <NA>
## 3 Y <NA>
## 4 Y <NA>
## 5 Y <NA>
## 6 Y <NA>
Let’s make the next one a little more complicated then. Let’s bring together the data on teams in the Teams data set with another data set named TeamsHalf.
## yearID lgID teamID Half divID DivWin Rank G W L
## 1 1981 NL ATL 1 W N 4 54 25 29
## 2 1981 NL ATL 2 W N 5 52 25 27
## 3 1981 AL BAL 1 E N 2 54 31 23
## 4 1981 AL BAL 2 E N 4 51 28 23
## 5 1981 AL BOS 1 E N 5 56 30 26
## 6 1981 AL BOS 2 E N 2 52 29 23
We don’t have a column in TeamsHalf called franchID, but both data sets do have a column called teamID. We can use that to merge them then.
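A sketch of the counts and the merge, assuming the object name teams.merge:

```r
nrow(Teams)
nrow(TeamsHalf)
teams.merge <- merge(Teams, TeamsHalf, by.x = "teamID", by.y = "teamID")
nrow(teams.merge)
```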
## [1] 2895
## [1] 52
## [1] 4090
Wait, we started with a data set containing 2895 rows and one with 52 rows, and we combined them and got 4090 rows? What happened? This is a common challenge that occurs when combining data sets.
Let’s look back at the original data.
## yearID lgID teamID Half divID DivWin Rank G W L
## 1 1981 NL ATL 1 W N 4 54 25 29
## 2 1981 NL ATL 2 W N 5 52 25 27
## 3 1981 AL BAL 1 E N 2 54 31 23
## 4 1981 AL BAL 2 E N 4 51 28 23
## 5 1981 AL BOS 1 E N 5 56 30 26
## 6 1981 AL BOS 2 E N 2 52 29 23
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA> N
## 2 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA> N
## 6 1871 NA PH1 PNA <NA> 1 28 NA 21 7 <NA> <NA> Y
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55 22 1 3
## 2 <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0 1
## 3 <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0 0
## 4 <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1 0
## 5 <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72 32 1 0
## 6 <NA> 376 1281 410 66 27 9 46 23 56 12 NA NA 266 137 4.95 27 0 0
## IPouts HA HRA BBA SOA E DP FP name
## 1 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5 879 373 7 42 22 235 14 0.840 New York Mutuals
## 6 747 329 3 53 16 194 13 0.845 Philadelphia Athletics
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 South End Grounds I NA 103 98 BOS BS1
## 2 Union Base-Ball Grounds NA 104 102 CHI CH1
## 3 National Association Grounds NA 96 100 CLE CL1
## 4 Hamilton Field NA 101 107 KEK FW1
## 5 Union Grounds (Brooklyn) NA 90 88 NYU NY2
## 6 Jefferson Street Grounds NA 102 98 ATH PH1
## teamIDretro
## 1 BS1
## 2 CH1
## 3 CL1
## 4 FW1
## 5 NY2
## 6 PH1
The issue was that the values in teamID weren’t unique to each row. Each row of data contains information about a team in a given year, so there are multiple entries for each team in both TeamsHalf and Teams, and those matched multiple times when we combined them. But we can fix that by including multiple columns in our by= options.
## [1] 2895
## [1] 52
teams.merge2 <- merge(Teams,TeamsHalf, by.x=c("teamID", "yearID"), by.y=c("teamID", "yearID"))
nrow(teams.merge2)
## [1] 52
Now we have 52 rows, because each of the 52 rows in TeamsHalf matched 1 to 1 with a row in Teams, and all other observations didn’t make it to the data set. You can adjust whether you want to keep all those rows with all= as we described above.
6.4 Crosswalks
Crosswalks are a merger’s best friend. All of the examples above rely on having a column in both data sets that contains the same information, so that we can merge the data. That isn’t always true though. Sometimes we have to figure out a way to add the data we need in order to be able to merge the data we care about.
Let’s say we wanted to combine data on players and franchises or teams. Specifically, let’s merge the data set Appearances with TeamsFranchises.
## yearID teamID lgID playerID G_all GS G_batting G_defense G_p G_c G_1b
## 1 1871 TRO NA abercda01 1 1 1 1 0 0 0
## 2 1871 RC1 NA addybo01 25 25 25 25 0 0 0
## 3 1871 CL1 NA allisar01 29 29 29 29 0 0 0
## 4 1871 WS3 NA allisdo01 27 27 27 27 0 27 0
## 5 1871 RC1 NA ansonca01 25 25 25 25 0 5 1
## 6 1871 FW1 NA armstbo01 12 12 12 12 0 0 0
## G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 1 0 0 1 0 0 0 0 0 0 0
## 2 22 0 3 0 0 0 0 0 0 0
## 3 2 0 0 0 29 0 29 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0
## 5 2 20 0 1 0 0 1 0 0 0
## 6 0 0 0 0 11 1 12 0 0 0
## franchID franchName active NAassoc
## 1 ALT Altoona Mountain City N <NA>
## 2 ANA Los Angeles Angels of Anaheim Y <NA>
## 3 ARI Arizona Diamondbacks Y <NA>
## 4 ATH Philadelphia Athletics N PNA
## 5 ATL Atlanta Braves Y BNA
## 6 BAL Baltimore Orioles Y <NA>
We have franchID in the TeamsFranchises data set, and teamID in Appearances. What we need to do is find a data set that has both teamID and franchID so that we can merge Appearances and TeamsFranchises together. We had that information in the Teams data set, so let’s extract it and create a unique crosswalk for our situation. We only want the unique combinations of those names, so we’ll clean out any duplicates. duplicated() checks whether there are any duplicated entries in our data set, and the exclamation point (!) tells the command that we want duplicates excluded.
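A sketch of building that crosswalk:

```r
nrow(Teams)
# keep just the two ID columns, then drop repeated teamID/franchID pairs
crosswalk <- Teams[, c("teamID", "franchID")]
crosswalk <- crosswalk[!duplicated(crosswalk), ]
nrow(crosswalk)
```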
## [1] 2895
## [1] 151
Now our new object crosswalk is a list of each unique teamID and franchID pairing. So we can merge it with TeamsFranchises, and use the result to merge TeamsFranchises to Appearances.
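The two counts below are presumably the rows in TeamsFranchises and in the crosswalk:

```r
nrow(TeamsFranchises)
nrow(crosswalk)
```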
## [1] 120
## [1] 151
TeamsFranchises2 <- merge(TeamsFranchises, crosswalk, by.x="franchID", by.y="franchID" )
nrow(TeamsFranchises2)
## [1] 151
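The next count is presumably the number of rows in Appearances before the merge:

```r
nrow(Appearances)
```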
## [1] 105789
TeamsFranchises2.Appearances <- merge(TeamsFranchises2, Appearances, by.x="teamID", by.y="teamID")
nrow(TeamsFranchises2.Appearances)
## [1] 106945
The merged data has slightly more rows than Appearances because a few teamID values in the crosswalk are associated with more than one franchise, so those rows matched multiple times. That’s fine; the main thing is that we were able to merge the two data sets.
Creating a crosswalk in this case was fairly simple, because we were using information that was all built into a single package in R. Oftentimes, though, that’s a step that can take a bit of elbow grease.
Sometimes we won’t be able to find data that so simply helps us match our different sources. For instance, let’s read in the data from the CPS we’ve used in previous chapters, along with some data about arrest rates in US states that you should recognize from other chapters.
dat <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/RClass/cps_00003.csv")
USArrests <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/USArrestsAugmented.csv")
What I want to do is attach the information about state crime rates to see how that affects individuals in the CPS data. How can I do that? Do I have a common column? First, let’s look at both data sets and search for columns they share.
## YEAR SERIAL MONTH HWTFINL CPSID REGION STATEFIP METRO METAREA
## 1 2018 1 11 1703.832 2.01708e+13 32 1 2 3440
## 2 2018 1 11 1703.832 2.01708e+13 32 1 2 3440
## 3 2018 3 11 1957.313 2.01809e+13 32 1 2 5240
## 4 2018 4 11 1687.784 2.01710e+13 32 1 2 5240
## 5 2018 4 11 1687.784 2.01710e+13 32 1 2 5240
## 6 2018 4 11 1687.784 2.01710e+13 32 1 2 5240
## STATECENSUS FAMINC PERNUM WTFINL CPSIDP AGE SEX RACE EMPSTAT
## 1 63 830 1 1703.832 2.01708e+13 26 2 100 10
## 2 63 830 2 1845.094 2.01708e+13 26 1 100 10
## 3 63 100 1 1957.313 2.01809e+13 48 2 200 21
## 4 63 820 1 1687.784 2.01710e+13 53 2 200 10
## 5 63 820 2 2780.421 2.01710e+13 16 1 200 10
## 6 63 820 3 2780.421 2.01710e+13 16 1 200 10
## LABFORCE EDUC VOTED VOREG
## 1 2 111 98 98
## 2 2 123 98 98
## 3 2 73 2 99
## 4 2 81 2 99
## 5 2 50 99 99
## 6 2 50 99 99
## State Murder Assault UrbanPop Rape Region
## 1 Alabama 13.2 236 58 21.2 South
## 2 Alaska 10.0 263 48 44.5 West
## 3 Arizona 8.1 294 80 31.0 West
## 4 Arkansas 8.8 190 50 19.5 South
## 5 California 9.0 276 91 40.6 West
## 6 Colorado 7.9 204 78 38.7 West
So I’ve got the state for each individual in STATEFIP, but that’s a numeric code for each state. And in the USArrests data I have the name of each state. R won’t recognize that 1 is Alabama, even if I know that from experience. So what can I do?
I need to find some file that can crosswalk the two values, so that I can either add the names of states to the CPS data or the numeric codes to the USArrests data.
I’ve spent a lot of my time creating new crosswalks, so I’ve built up a few strategies for how to develop them. A lot of them rely on using data cleaning strategies in R to generate a column with information that matches across the two data sets. And you’d probably have to build that for yourself. You’d need to find some website with that information (thank gosh for Wikipedia), copy the information, paste it into a csv file, save the file, load it into R, and then get to work on merging. It’s not fun, but it’s necessary. Like I said, it takes some elbow grease.
6.5 What can go wrong
It’s good to save your merged data under a new name anytime you merge, because you’ll make mistakes, and you don’t want to overwrite your existing data.
What can go wrong? As I’ve said before, R is a stickler for spelling, and it won’t let your merge command run if it can’t find the variables you indicate in the command. So if you tell it to use Id instead of ID, it just won’t process. That’s what R is telling you with an error like ’by.x’ must specify a uniquely valid column: you either misspelled your column or that column isn’t in the data.
Alternatively, you might think you have common values in two columns, but if you don’t, nothing will merge. What if we’d used STATEFIP and State from the CPS and USArrests data?
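A sketch of that mistaken merge:

```r
# the numeric codes in STATEFIP never equal the state names in State
nrow(merge(dat, USArrests, by.x = "STATEFIP", by.y = "State"))
```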
## [1] 0
No values matched between the two columns. We need to go back, check the variables we’re trying to use, and figure out a way to get them merged. All the merging happens out of sight, and you don’t want to lose data unnecessarily. Always keep an eye on the data going in and coming out when merging.