3.2 Descriptive Statistics
We first look at the first few rows of the dataset.
R’s summary output differs from Stata. We use the summary() function to examine a variable; this is equivalent to the summ function in Stata. To get the equivalent to summ, detail we can use the describe() function in the psych package.
We can quickly graphically look at the distribution of a variable using the hist() command. Let’s take a look at the bweight variable. Because R is capable of storing more than one dataset in its working memory, we have to tell R both the dataset (bab9) and the variable (bweight) that we are interested in. We do this by using a dollar sign - the dollar sign says from the dataset on the left, extract the variable on the right.
#--- Investigate the first few rows of the dataset
head(bab9)
## id matage ht gestwks sex bweight matagegp gestcat
## 1 1 33 no 37.7 female 2410 2 2
## 2 2 34 no 39.2 female 2977 2 2
## 3 3 34 no 35.7 female 2100 2 1
## 4 4 30 no 39.3 male 3270 2 2
## 5 5 35 no 38.4 female 2620 3 2
## 6 6 37 no 37.9 male 3260 3 2
#--- Get summary statistics
summary(bab9)
## id matage ht gestwks sex
## Min. : 1 Min. :23 no :552 Min. :24.7 male :326
## 1st Qu.:161 1st Qu.:31 yes: 89 1st Qu.:38.0 female:315
## Median :321 Median :34 Median :39.2
## Mean :321 Mean :34 Mean :38.7
## 3rd Qu.:481 3rd Qu.:37 3rd Qu.:40.2
## Max. :641 Max. :43 Max. :42.3
## bweight matagegp gestcat
## Min. : 630 Min. :1.00 Min. :1.00
## 1st Qu.:2850 1st Qu.:2.00 1st Qu.:2.00
## Median :3200 Median :2.00 Median :2.00
## Mean :3129 Mean :2.38 Mean :1.86
## 3rd Qu.:3550 3rd Qu.:3.00 3rd Qu.:2.00
## Max. :4650 Max. :4.00 Max. :2.00
#--- Get detailed summary statistics
describe(bab9)
## vars n mean sd median trimmed mad min max range
## id 1 641 321.00 185.19 321.0 321.00 237.22 1.0 641.0 640.0
## matage 2 641 33.97 3.87 34.0 34.10 4.45 23.0 43.0 20.0
## ht* 3 641 1.14 0.35 1.0 1.05 0.00 1.0 2.0 1.0
## gestwks 4 641 38.69 2.33 39.1 39.04 1.48 24.7 42.4 17.7
## sex* 5 641 1.49 0.50 1.0 1.49 0.00 1.0 2.0 1.0
## bweight 6 641 3129.14 652.78 3200.0 3181.91 518.91 630.0 4650.0 4020.0
## matagegp 7 641 2.38 0.81 2.0 2.40 1.48 1.0 4.0 3.0
## gestcat 8 641 1.86 0.35 2.0 1.95 0.00 1.0 2.0 1.0
## skew kurtosis se
## id 0.00 -1.21 7.31
## matage -0.27 -0.48 0.15
## ht* 2.08 2.35 0.01
## gestwks -2.05 6.19 0.09
## sex* 0.03 -2.00 0.02
## bweight -0.96 1.78 25.78
## matagegp -0.09 -0.58 0.03
## gestcat -2.08 2.35 0.01
#--- Plot birthweight
hist(bab9$bweight)
We now use the sapply() command to check the class of each variable in the data. The sapply() function iterates over each column of the dataset of interest and performs the given function, in this case, class(). In other words, sapply() is a way of performing the same action to each variable of the dataset. class() tells us what data type each of our variables is.
We pipe the dataset of interest into the sapply() command. Piping is fundamental to tidy R code. A standard pipe looks like %>%. You can insert a pipe with Ctrl-Shift-M. The pipe takes whatever is on the left of it and then uses it as the first argument of the function on the right. You can see below two equivalent ways of coding the sapply() command. Pipes are useful as they allow you to chain together multiple different functions. While in this case, the piping seems to have made little functional difference, its utility will become clear in future practicals as the coding becomes more complex.
We see from our sapply() explorations that matagegp & gestcat have been read in as numeric (i.e. numbers), but they should be what R calls factor variables. Factor variables are R’s way of representing categorical variables. We convert them to factors using the as.factor() command and assign this new factor variable to the old numeric variable, overwriting it in the process.
#--- Check the class of each variable
bab9 %>% sapply(class)
## id matage ht gestwks sex bweight matagegp
## "numeric" "numeric" "factor" "numeric" "factor" "numeric" "numeric"
## gestcat
## "numeric"
#--- Same code without a pipe:
# sapply(bab9, class)
#--- Convert factors
bab9$matagegp <- as.factor(bab9$matagegp)
bab9$gestcat <- as.factor(bab9$gestcat)