Chapter 7 Changing Levels

Oftentimes you’ll want to use data from one unit of analysis to estimate a value at another unit of analysis. For instance, data from IPUMS is at the individual level, but it is often employed to make estimates about the characteristics of State’s or metropolitan area’s. So how do we transform data from one unit of analysis to another?

7.1 Movin’ On Up

Let’s start with a simplified example and some fictional data on counties within states. To keep this simple, we have information on the population of different counties. There are just 2 states in the data, and those 2 states have 4 and 3 counties to keep the example straightforward.

set.seed(5)
State <- c(1,1,1,1,2,2,2) # unique ID for each entry
County <- c(1,2,3,4,1,2,3)
Population<- round(rnorm(7, 450000, 50000),0) # median income
df1 <- data.frame(State,County,Population)
df1

##   State County Population
## 1     1      1     407957
## 2     1      2     519218
## 3     1      3     387225
## 4     1      4     453507
## 5     2      1     535572
## 6     2      2     419855
## 7     2      3     426392

What I would like to get is the total population of the state based on the various values for counties. I can do that with the aggregate() command.

pop.agg <- aggregate(Population~State, data=df1, FUN=sum)
pop.agg

##   State Population
## 1     1    1767907
## 2     2    1381819

Let me annotate that command below.

We can do more than take the sum with the aggregate command. We can find the median…

pop.agg2 <- aggregate(Population~State, data=df1, FUN=median)
pop.agg2

##   State Population
## 1     1     430732
## 2     2     426392

Or the mean….

pop.agg3 <- aggregate(Population~State, data=df1, FUN=mean)
pop.agg3

##   State Population
## 1     1   441976.8
## 2     2   460606.3

We can also aggregate our data at multiple levels at the same time. For instance, let’s say we have data for counties in states for multiple years, like below.

set.seed(5)
State <- c(3,3,3,3,4,4,4,3,3,3,3,4,4,4) # unique ID for each entry
County <- c(1,2,3,4,1,2,3,1,2,3,4,1,2,3)
Year <- c(2005,2005,2005,2005,2005,2005,2005, 2006,2006,2006,2006,2006,2006,2006)
Population<- round(rnorm(14, 450000, 50000),0) # median income
df2 <- data.frame(State,County,Year, Population)
df2

##    State County Year Population
## 1      3      1 2005     407957
## 2      3      2 2005     519218
## 3      3      3 2005     387225
## 4      3      4 2005     453507
## 5      4      1 2005     535572
## 6      4      2 2005     419855
## 7      4      3 2005     426392
## 8      3      1 2006     418231
## 9      3      2 2006     435711
## 10     3      3 2006     456905
## 11     3      4 2006     511382
## 12     4      1 2006     409911
## 13     4      2 2006     395980
## 14     4      3 2006     442123

To get the total population for each state IN EACH YEAR we just need to add a plus sign and the name of the additional variable we want accounted for.

pop.agg4 <- aggregate(Population~State+Year, data=df2, FUN=sum)
pop.agg4

##   State Year Population
## 1     3 2005    1767907
## 2     4 2005    1381819
## 3     3 2006    1822229
## 4     4 2006    1248014

7.2 Weights

So the aggregate command allows us to estimate the values for higher levels of analysis. It can’t guarantee that our estimates will be accurate though. Let’s once again go back to the data we used in previous sections from the Current Population Survey.

dat <- read.csv("~/Desktop/RClass/cps_00003.csv")
head(dat)

##   YEAR SERIAL MONTH  HWTFINL       CPSID REGION STATEFIP METRO METAREA
## 1 2018      1    11 1703.832 2.01708e+13     32        1     2    3440
## 2 2018      1    11 1703.832 2.01708e+13     32        1     2    3440
## 3 2018      3    11 1957.313 2.01809e+13     32        1     2    5240
## 4 2018      4    11 1687.784 2.01710e+13     32        1     2    5240
## 5 2018      4    11 1687.784 2.01710e+13     32        1     2    5240
## 6 2018      4    11 1687.784 2.01710e+13     32        1     2    5240
##   STATECENSUS FAMINC PERNUM   WTFINL      CPSIDP AGE SEX RACE EMPSTAT
## 1          63    830      1 1703.832 2.01708e+13  26   2  100      10
## 2          63    830      2 1845.094 2.01708e+13  26   1  100      10
## 3          63    100      1 1957.313 2.01809e+13  48   2  200      21
## 4          63    820      1 1687.784 2.01710e+13  53   2  200      10
## 5          63    820      2 2780.421 2.01710e+13  16   1  200      10
## 6          63    820      3 2780.421 2.01710e+13  16   1  200      10
##   LABFORCE EDUC VOTED VOREG
## 1        2  111    98    98
## 2        2  123    98    98
## 3        2   73     2    99
## 4        2   81     2    99
## 5        2   50    99    99
## 6        2   50    99    99

nrow(dat)

## [1] 122744

And let’s say we want to estimate the population of individual states using this data. Can we do that? We only have 122744 observations, but the population of the United States is much larger than that. This is why the data comes with “weights”, for exactly these types of calculations. The weights are calculated so that individual observations can approximate their size within the given population. A more detailed discussion of weights can be found here.

Let’s see what we’d estimate the population of the United States to be using the column WTFINL (weight final).

sum(dat$WTFINL)

## [1] 323833483

Those weights add up to 323,833,483. The United States population in 2018 was estimated to be 327.876 Million, so that’s really close! And, to be clear, that’s not an accident; that’s the point of the weights. Okay, let’s take a look at each individuals state’s population in 2018, as estimated by the CPS. The variable for state in that data is given by STATEFIP. We want to use the weight variable as the variable we want aggregated.

state.agg <- aggregate(WTFINL~STATEFIP, dat, sum)
head(state.agg)

##   STATEFIP   WTFINL
## 1        1  4815491
## 2        2   709713
## 3        4  7059460
## 4        5  2971831
## 5        6 39335631
## 6        8  5616496

Let’s just check out math with the first state, Alabama (STATEFIP=1). We’ve estimated its population as 4,815,491. A quick Google search says that it’s population was 4.89 Million. That’s pretty close, but not exact. The CPS is just an approximation, but it’s going to give us close estimates for state level data and it’s margins of error should all be consistently close.

7.3 Merging Aggregations

Often times, when I need to aggregate one piece of data, I’ll want to add it back in to my original data. Notice that the column of data that’s generated for the total state population is the same as the name that was used for the individual weights in our original data. Let’s see what happens if I merge that back to my original data. I’d recommend reading the [section on merging] prior to going further. I’m going to create a subset of the data before going forward just to make this example a little clearer.

We can merge the state populations to the original data using STATEFIP which is in both sets of data. If you have the same variable with the same name in both sets of data you can just say by, rather than both by.x and by.y.

dat2 <- dat[, c("STATEFIP", "WTFINL")]
dat3 <- merge(dat2, state.agg, by="STATEFIP")
head(dat3)

##   STATEFIP WTFINL.x WTFINL.y
## 1        1 1703.832  4815491
## 2        1 1845.094  4815491
## 3        1 1957.313  4815491
## 4        1 1687.784  4815491
## 5        1 2780.421  4815491
## 6        1 2780.421  4815491

Both sets of data had a column with the same name, so in order to distinguish them R added .x and .y to the end of the variable name. That’s a bit messy, especially since they hold really different information now. We can rename the columns either before or after our merge using the colnames() command to make sure our data remains clear.

colnames(dat3) <- c("STATEFIP", "WTFINL", "State.Pop")
head(dat3)

##   STATEFIP   WTFINL State.Pop
## 1        1 1703.832   4815491
## 2        1 1845.094   4815491
## 3        1 1957.313   4815491
## 4        1 1687.784   4815491
## 5        1 2780.421   4815491
## 6        1 2780.421   4815491

That command renamed all of the columns in the data. We can also just rename on column by specifying the number of what we want to rename. Let’s rename the third column again to “ST.Pop”.

colnames(dat3)[3] <-  "ST.Pop"
head(dat3)

##   STATEFIP   WTFINL  ST.Pop
## 1        1 1703.832 4815491
## 2        1 1845.094 4815491
## 3        1 1957.313 4815491
## 4        1 1687.784 4815491
## 5        1 2780.421 4815491
## 6        1 2780.421 4815491

Okay, but let’s say we weren’t just interested in the total population in the state, but also what share of the population is college educated (or a given gender or gender or age groups, etc.). Do we have the tools to figure that out yet?

Let’s work on figuring out the gender breakdown of states. We have a variable for each respondents gender called SEX. Since there are only two levels to that variable, if we can figure out the share of women in a state, we’ll know the share of men as well (since it would just be 100- the percentage of women).

So let’s start by adding a new variable that accounts for whether the respondent is female. But, instead of making the value for the new variable 0 or 1, we’ll use the weights in the data, so that when we add up all of the females in the survey we’ll have an accurate proportion.

dat$Female <- ifelse(dat$SEX==2, dat$WTFINL, 0)

gender.agg <- aggregate(Female~STATEFIP, data=dat, FUN=sum)
head(gender.agg)

##   STATEFIP   Female
## 1        1  2502752
## 2        2   347774
## 3        4  3582658
## 4        5  1522759
## 5        6 19924925
## 6        8  2812541

Okay, that’s an accurate count for how many women are in each state, but how do we figure out the percentage of women? We need to know the total number of people in each state then. So let’s do a second aggregate, and merge that into that same file.

pop.agg <- aggregate(WTFINL~STATEFIP, data=dat, FUN=sum)
head(pop.agg)

##   STATEFIP   WTFINL
## 1        1  4815491
## 2        2   709713
## 3        4  7059460
## 4        5  2971831
## 5        6 39335631
## 6        8  5616496

gender.agg2 <- merge(gender.agg, pop.agg, by="STATEFIP")

colnames(gender.agg2) <- c("STATEFIP", "Females", "TotPop")
head(gender.agg2)

##   STATEFIP  Females   TotPop
## 1        1  2502752  4815491
## 2        2   347774   709713
## 3        4  3582658  7059460
## 4        5  1522759  2971831
## 5        6 19924925 39335631
## 6        8  2812541  5616496

Now we just need to add a new variable for the percentage, which we can do by dividing the number of females by the total population.

gender.agg2$FemPct <- gender.agg2$Females/gender.agg2$TotPop
head(gender.agg2)

##   STATEFIP  Females   TotPop    FemPct
## 1        1  2502752  4815491 0.5197293
## 2        2   347774   709713 0.4900206
## 3        4  3582658  7059460 0.5074975
## 4        5  1522759  2971831 0.5123976
## 5        6 19924925 39335631 0.5065363
## 6        8  2812541  5616496 0.5007644