2.1 Descriptive Statistics
Let’s do a quick summary of all of the variables. In R, we use the summary() command where Stata would use summ or tab.
What is the age range of the population? What is the average age? The bar chart produced below shows how many women there are at each age.
The relevant commands for each variable are laid out below. Run them and look at the summaries. Make a rough table of your findings, perhaps including percentages for each variable. R has a built-in calculator, so if you want to calculate a percentage by hand, you can type the arithmetic into the console pane on the lower left. Note any differences between variables that you think might be particularly important, and summarise in words any patterns that you observe.
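If you would rather not type each percentage into the console by hand, base R's table() and prop.table() can do the arithmetic. The sketch below is only an illustration: the vector toilet is a stand-in rebuilt from the toilet counts reported in this section, not the real tz$toilet column, and it shows how the percentages shift slightly depending on whether missing values are counted.

```r
# Stand-in for tz$toilet, rebuilt from the counts reported in this section
# (731 "No toilet", 2026 "Toilet", 5 missing). Not the real data.
toilet <- factor(c(rep("No toilet", 731), rep("Toilet", 2026), rep(NA, 5)))

# Percentages among women with a recorded answer (NAs dropped by default)
pct_observed <- round(100 * prop.table(table(toilet)), 1)

# Percentages with missing values kept as their own category
pct_with_na <- round(100 * prop.table(table(toilet, useNA = "ifany")), 1)

print(pct_observed)
print(pct_with_na)
```

With the real data you would replace toilet by tz$toilet. Whichever denominator you choose, state it in your table.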
Make sure you understand what each variable is measuring: think about whether it is a characteristic of the individual woman or of her household.
#--- Drop any factor levels that have no observations
tz <- droplevels(tz)
#--- Summaries of Age
summary(tz$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      15      16      19      19      21      24
ggplot(tz, aes(x = as.factor(age))) +
  geom_bar()
#--- Summaries of other variables
summary(tz$serostat)
## hiv negative hiv positive
##         2703           59
summary(tz$water)
##          Piped to dwelling      Piped to local source
##                        278                        847
##                  Open Well             Protected well
##                        712                        311
## Stream, river, lake, other                       NA's
##                        613                          1
summary(tz$toilet)
## No toilet    Toilet      NA's
##       731      2026         5
summary(tz$electricity)
##   no  yes NA's
## 2549  210    3
summary(tz$radio)
##   no  yes NA's
## 1020 1740    2
summary(tz$tv)
##   no  yes NA's
## 2592  169    1
summary(tz$fridge)
##   no  yes NA's
## 2668   93    1
summary(tz$bike)
##   no  yes NA's
## 1316 1444    2
summary(tz$car)
##   no  yes NA's
## 2718   43    1
summary(tz$floor)
## earth, sand      cement      carpet       other
##        2019         710          27           6
summary(tz$wall)
##            grass    poles and mud sun-dried bricks     baked bricks
##               19             1097              672              468
##     wood, timber    cement blocks           stones             NA's
##               29              295              179                3
summary(tz$roof)
## grass,thatch,mud      iron sheets  asbestos, other             NA's
##             1313             1417               29                3
summary(tz$educat)
## no education, preschool                 primary               secondary
##                     478                    1782                     502
summary(tz$married)
##     never married currently married  formerly married
##              1548              1085               129
summary(tz$partners)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     0.0     0.0     1.0     1.1     1.0    98.0       2
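One possible way to assemble the rough table suggested earlier is sketched here for the six yes/no household asset variables. The counts are copied by hand from the summary() output above. The data frame assets and its column names are made up for this illustration and are not part of the tz data.

```r
# Counts copied from the summary() output above; `assets` is a hypothetical
# helper table, not an object in the course data.
assets <- data.frame(
  variable = c("electricity", "radio", "tv", "fridge", "bike", "car"),
  no       = c(2549, 1020, 2592, 2668, 1316, 2718),
  yes      = c(210, 1740, 169, 93, 1444, 43),
  missing  = c(3, 2, 1, 1, 2, 1)
)

# Percentage answering "yes" among women with a recorded answer
assets$pct_yes <- round(100 * assets$yes / (assets$no + assets$yes), 1)

# Sort from most to least common
assets <- assets[order(-assets$pct_yes), ]

print(assets, row.names = FALSE)
```

Sorting by pct_yes makes it easy to see which assets are common in this sample and which are rare.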
The code below recodes age into two age groups, and the number of sexual partners into three categories: 0, 1, or 2+.
From a measurement perspective, do you think this categorisation is useful or is there a better way to handle these variables? In other words, what are the implications of categorising these variables? Do you think this will affect your analysis?
#--- Recoding the age variable and specifying it is a factor
#    (the youngest age in summary(tz$age) is 15, so the first group is labelled 15-19)
tz <- tz %>%
  mutate(age.group = ifelse(age < 20, "15-19", "20-24")) %>%
  mutate(age.group = as.factor(age.group))
#--- Recoding the partner variable
#    cut() uses right-closed intervals by default: (-Inf, 0], (0, 1], (1, Inf]
tz <- tz %>%
  mutate(partners.cat = cut(partners,
                            breaks = c(-Inf, 0, 1, Inf),
                            labels = c("0", "1", "2+")))
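After any recode it is worth cross-tabulating the new variable against the original to confirm that every value landed where you expected. The sketch below applies the same cut() call to a small made-up vector (the values are chosen for illustration and are not taken from tz, apart from 98, which is the maximum reported by summary(tz$partners)).

```r
# Made-up partner counts covering the cut points, an extreme value, and a missing value
partners <- c(0, 1, 2, 5, 98, NA)

# Same breaks and labels as the recode above
partners_cat <- cut(partners,
                    breaks = c(-Inf, 0, 1, Inf),
                    labels = c("0", "1", "2+"))

# Rows are the original values, columns the new categories; NAs are kept visible
check <- table(partners, partners_cat, useNA = "ifany")
print(check)
```

Note that an extreme value such as 98 is absorbed into the "2+" category and a missing value stays missing. Both points are relevant when you think about what information the categorisation discards.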