2.1 Descriptive Statistics

Let’s start with a quick summary of each of the variables. In R, the summary() function does the job that summ or tab would do in Stata.
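
As a rough illustration of how summary() adapts to the type of variable, the sketch below uses a small made-up data frame (`toy`, a hypothetical name, not the `tz` data). For a numeric column it returns the minimum, quartiles, mean and maximum, much like summ in Stata; for a factor it returns a count per level, much like tab.

```r
# Hypothetical toy data, used only for illustration (not the tz dataset)
toy <- data.frame(
  age   = c(15, 17, 19, 21, 24),
  radio = factor(c("no", "yes", "yes", NA, "yes"))
)

# Numeric variable: min, quartiles, median, mean, max (similar to Stata's summ)
age_summary <- summary(toy$age)
print(age_summary)

# Factor variable: count per level, plus NA's if any (similar to Stata's tab)
radio_summary <- summary(toy$radio)
print(radio_summary)

# table() gives the same counts; useNA = "ifany" keeps missing values visible
radio_table <- table(toy$radio, useNA = "ifany")
print(radio_table)

# Calling summary() on the whole data frame summarises every column at once
print(summary(toy))
```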

What is the age range of the women in the sample, and what is their average age? The bar chart produced by the ggplot() call below shows how many observations there are at each age.

The relevant commands for each of the variables are laid out below. Run them and take a look at the summaries. Make a rough table of your findings, perhaps including percentages for each variable. R has a built-in calculator, so if you want to calculate a percentage by hand, you can type the arithmetic directly into the console pane on the lower left. Note any differences in the variables that you think might be particularly important, and summarise in words any patterns that you observe.
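
A minimal sketch of the percentage calculation, using the two counts reported by summary(tz$serostat) in this section. The vector name `serostat_counts` is made up for the example; the first calculation is what you might type straight into the console, and the second does the same for every level at once.

```r
# Counts copied from the summary(tz$serostat) output in this section
serostat_counts <- c("hiv negative" = 2703, "hiv positive" = 59)

# Manual calculation, as you might type it into the console
pct_positive <- 59 / (2703 + 59) * 100
print(round(pct_positive, 1))

# Same idea for all levels: prop.table() turns counts into proportions
serostat_pct <- round(prop.table(serostat_counts) * 100, 1)
print(serostat_pct)
```

With the data loaded, prop.table(table(tz$serostat)) should give the same proportions without retyping the counts. Keep in mind that table() drops missing values by default, so the percentages are out of the non-missing observations unless you set useNA.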

Make sure you understand what each variable is measuring: think about whether it is a characteristic of the individual woman or of her household.

#--- Drop any factor levels that have no observations
tz <- droplevels(tz)

#--- Summaries of Age
summary(tz$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      15      16      19      19      21      24
#--- Bar chart of the number of observations at each age
ggplot(tz, aes(x = as.factor(age))) +
  geom_bar()

#--- Summaries of other variables
summary(tz$serostat)
##  hiv negative hiv  positive 
##          2703            59
summary(tz$water)
##          Piped to dwelling      Piped to local source 
##                        278                        847 
##                  Open Well             Protected well 
##                        712                        311 
## Stream, river, lake, other                       NA's 
##                        613                          1
summary(tz$toilet)
## No toilet    Toilet      NA's 
##       731      2026         5
summary(tz$electricity)
##   no  yes NA's 
## 2549  210    3
summary(tz$radio)
##   no  yes NA's 
## 1020 1740    2
summary(tz$tv)
##   no  yes NA's 
## 2592  169    1
summary(tz$fridge)
##   no  yes NA's 
## 2668   93    1
summary(tz$bike)
##   no  yes NA's 
## 1316 1444    2
summary(tz$car)
##   no  yes NA's 
## 2718   43    1
summary(tz$floor)
## earth, sand      cement      carpet       other 
##        2019         710          27           6
summary(tz$wall)
##            grass    poles and mud sun-dried bricks     baked bricks 
##               19             1097              672              468 
##     wood, timber    cement blocks           stones             NA's 
##               29              295              179                3
summary(tz$roof)
## grass,thatch,mud      iron sheets  asbestos, other             NA's 
##             1313             1417               29                3
summary(tz$educat)
## no education, preschool                 primary               secondary 
##                     478                    1782                     502
summary(tz$married)
##     never married currently married  formerly married 
##              1548              1085               129
summary(tz$partners)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0     1.0     1.1     1.0    98.0       2
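
The maximum of 98 in the partners summary sits far from the rest of the distribution (the third quartile is 1). Survey datasets sometimes use a value such as 98 as a code for "don't know" rather than a real count. Whether that applies here is an assumption that needs checking against the codebook. The sketch below uses a made-up vector (`partners_toy`, hypothetical, not the `tz` data) to show one way to inspect such a value and, only if the codebook justifies it, set it to missing.

```r
# Hypothetical values for illustration only (not taken from tz)
partners_toy <- c(0, 0, 1, 1, 1, 2, 3, 98, NA)

# Tabulate the values to see how isolated the largest one is
print(table(partners_toy, useNA = "ifany"))

# ASSUMPTION: 98 is a "don't know" code. Recode only if the codebook confirms it.
partners_clean <- ifelse(partners_toy == 98, NA, partners_toy)

# The summary now reports two NA's and a maximum taken from the remaining values
print(summary(partners_clean))
```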

The code below recodes age into two age groups, and the number of sexual partners into three categories: 0, 1, or 2+.

From a measurement perspective, do you think this categorisation is useful or is there a better way to handle these variables? In other words, what are the implications of categorising these variables? Do you think this will affect your analysis?

#--- Recoding the age variable and specifying it is a factor
#    (the youngest age in the data is 15, so "14-19" in practice covers ages 15-19)
tz <- tz %>%
  mutate(age.group = factor(ifelse(age < 20, "14-19", "20-24")))

#--- Recoding the partner variable
tz <- tz %>%
  mutate(partners.cat = cut(partners,
                            breaks = c(-Inf, 0, 1, Inf),
                            labels = c("0", "1", "2+")))
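
After recoding, it is worth cross-tabulating each new variable against the original to confirm that every value landed in the intended group. The sketch below does this on a small made-up data frame (`toy`, hypothetical, not the `tz` data), using the same ifelse() and cut() logic written in base R. With cut()'s default right = TRUE, the intervals are (-Inf, 0], (0, 1] and (1, Inf), and missing values stay missing.

```r
# Hypothetical toy data for illustration only (not the tz dataset)
toy <- data.frame(
  age      = c(15, 19, 20, 24, 17),
  partners = c(0, 1, 2, 98, NA)
)

# Same recoding logic as in this section, in base R so no packages are needed
toy$age.group <- factor(ifelse(toy$age < 20, "14-19", "20-24"))
toy$partners.cat <- cut(toy$partners,
                        breaks = c(-Inf, 0, 1, Inf),
                        labels = c("0", "1", "2+"))

# Cross-tabulate original against recoded values, keeping NAs visible
age_check <- table(toy$age, toy$age.group)
partners_check <- table(toy$partners, toy$partners.cat, useNA = "ifany")
print(age_check)
print(partners_check)
```

Note that any large value, such as 98, falls into the "2+" group under this recoding, which is one reason to look at the raw values before categorising.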