3.2 Initial Investigations

Before analysing the data, it helps to look at it. We can do that by clicking on the name of the dataset in the Environment pane, or with the following code:

#--- Look at the dataset (note the capital V in View: R is case sensitive!)
View(sdg)

#--- We can also just look at the first few rows (useful for large datasets)
head(sdg)

##       country code reg        gdp gini       pop delta.pop int.migrant
## 1 Afghanistan  AFG EMR   561.7787   NA  34656032     33.84        1.18
## 2      Angola  AGO AFR  3110.8080   NA  28813463     42.20        0.43
## 3   Argentina  ARG AMR 12449.2200 42.7  43847430     10.84        4.81
## 4     Armenia  ARM EUR  3606.1520 32.4   2924816     -1.14        6.34
## 5  Bangladesh  BGD SEA  1358.7800   NA 163000000     12.10        0.88
## 6      Belize  BLZ AMR  4810.5660   NA    366954     26.21       14.99
##     urb delta.urb emp.ratio slums pop.density largest.city sanitation
## 1 27.13      3.90     48.05  62.7       53.08        51.49       45.1
## 2 44.82      7.88     63.81  55.5       23.11        44.43       88.6
## 3 91.89      1.63     57.02  16.7       16.02        38.06       96.2
## 4 62.56     -1.59     52.90  14.4      102.73        56.85       96.2
## 5 35.04      7.52     59.68  55.1     1251.84        31.94       57.7
## 6 43.85     -2.20     62.22  10.8       16.09           NA       93.5
##   water million  tb urb.pov electric pollution urban.pov.hc  primary
## 1  78.2   13.97 189    27.6     98.7  48.01676         27.6       NA
## 2  75.4   24.55 370      NA     51.0  36.39543           NA 84.01231
## 3  99.0   43.94  25    13.6       NA  13.44397          4.7 99.34679
## 4 100.0   35.57  41      NA    100.0  25.50769         30.0 96.07425
## 5  86.5   14.66 225      NA     90.7  89.39291         21.3 90.50861
## 6  98.9      NA  25     4.7    100.0  27.04049           NA 96.14116
##   health.exp tb.cure case.d diarrhea.trt imm.dpt mat.mort nurse.mw beds
## 1   63.87641      87     58         40.7      65   1291.0    0.360  0.5
## 2   23.96155      34     64           NA      64       NA       NA   NA
## 3   30.72721      52     87         59.1      92     32.4       NA  4.7
## 4   53.51334      78     89           NA      94     19.0    4.994  3.9
## 5   66.97587      93     57         66.1      97    210.0    0.213  0.6
## 6   23.00839      35     87         42.5      95     45.0    1.959  1.1
##    ari                lmic
## 1 61.5          Low income
## 2   NA Lower middle income
## 3 94.3 Upper middle income
## 4 56.8 Lower middle income
## 5 42.0 Lower middle income
## 6 82.2 Upper middle income

#--- We can also just get the column names
names(sdg)

##  [1] "country"      "code"         "reg"          "gdp"         
##  [5] "gini"         "pop"          "delta.pop"    "int.migrant" 
##  [9] "urb"          "delta.urb"    "emp.ratio"    "slums"       
## [13] "pop.density"  "largest.city" "sanitation"   "water"       
## [17] "million"      "tb"           "urb.pov"      "electric"    
## [21] "pollution"    "urban.pov.hc" "primary"      "health.exp"  
## [25] "tb.cure"      "case.d"       "diarrhea.trt" "imm.dpt"     
## [29] "mat.mort"     "nurse.mw"     "beds"         "ari"         
## [33] "lmic"

Note the well-commented code telling you what each snippet does!
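
Beyond View(), head() and names(), a few other base R functions give a quick overview of a dataset. The sketch below is only illustrative: it uses a small stand-in data frame called toy (a hypothetical name), built from three columns of the first six rows printed above, so that it runs without the full sdg dataset loaded.

```r
#--- Toy stand-in for sdg: three columns from the first six rows shown above
toy <- data.frame(
  country = c("Afghanistan", "Angola", "Argentina",
              "Armenia", "Bangladesh", "Belize"),
  tb      = c(189, 370, 25, 41, 225, 25),
  gini    = c(NA, NA, 42.7, 32.4, NA, NA)
)

#--- Number of rows and number of columns
dim(toy)

#--- Type of each column, plus a preview of its values
str(toy)

#--- Choose how many rows head() shows (the default is 6)
head(toy, n = 3)
```

The same calls should work on sdg itself, for example dim(sdg) or str(sdg).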

Since we are interested in TB, we might want to look at some TB-specific data. The column ‘tb’ contains the TB incidence rate, expressed as the number of cases per 100,000 people.

#--- Look at the TB values for the first few observations
head(sdg$tb)

## [1] 189 370  25  41 225  25

#--- Get some summary statistics for the values of TB incidence
summary(sdg$tb)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    11.0    50.5   110.4   152.0   834.0       7

Note that here we use a dollar sign to select the ‘tb’ column by name. Typing “sdg$tb” tells R that we want the column tb from the dataset sdg.
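
The dollar sign is not the only way to pick out a column. As a minimal sketch, assuming the dataset is an ordinary data frame, the three forms below return the same vector. The data frame toy is a hypothetical stand-in built from the first six rows printed earlier.

```r
#--- Toy stand-in for sdg, using the first six TB values shown above
toy <- data.frame(
  country = c("Afghanistan", "Angola", "Argentina",
              "Armenia", "Bangladesh", "Belize"),
  tb      = c(189, 370, 25, 41, 225, 25)
)

#--- Three ways to extract the same column as a vector
toy$tb
toy[["tb"]]
toy[, "tb"]

#--- Check that all three give identical results
identical(toy$tb, toy[["tb"]])
identical(toy$tb, toy[, "tb"])
```

The double-bracket form is handy when the column name is stored in a variable, since toy$tb only works with the name typed out literally.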

EXERCISE: What is the mean GDP for all nations? How many nations have missing GDP data?

We can get some more detailed summary information using the psych package.

EXERCISE: Install the psych package and load it with library().

#--- Get a detailed summary (describe() comes from the psych package)
describe(sdg$tb)

##    vars   n   mean     sd median trimmed  mad min max range skew kurtosis
## X1    1 210 110.39 149.66   50.5   78.43 63.9   0 834   834 2.06     4.61
##       se
## X1 10.33
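
Some columns in this output can be reproduced from the others, which helps when reading it. As a quick check using only the numbers printed above: the range is the maximum minus the minimum, and se matches the usual standard error of the mean, the standard deviation divided by the square root of n. The variable names below are hypothetical and chosen just for this illustration.

```r
#--- Values copied from the describe(sdg$tb) output above
n_tb   <- 210
sd_tb  <- 149.66
min_tb <- 0
max_tb <- 834

#--- Range: maximum minus minimum
range_tb <- max_tb - min_tb
range_tb

#--- Standard error of the mean: sd divided by the square root of n
se_tb <- round(sd_tb / sqrt(n_tb), 2)
se_tb
```

Both results agree with the range (834) and se (10.33) columns. Note also that n counts only the non-missing values, which is why it is smaller than the number of rows when a column contains NAs.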

#--- Get a summary by group (here, by income group)
describeBy(sdg$tb, sdg$lmic)

## 
##  Descriptive statistics by group 
## group: High income
##    vars  n  mean    sd median trimmed  mad min max range skew kurtosis
## X1    1 71 18.88 26.34    8.2   13.26 6.23   0 164   164 2.99    11.43
##      se
## X1 3.13
## -------------------------------------------------------- 
## group: Low income
##    vars  n   mean     sd median trimmed    mad min max range skew kurtosis
## X1    1 31 205.84 136.22    189  189.56 139.36  35 561   526 0.95     0.43
##       se
## X1 24.47
## -------------------------------------------------------- 
## group: Lower middle income
##    vars  n   mean     sd median trimmed    mad min max range skew kurtosis
## X1    1 52 201.79 174.41  141.5  180.74 137.14 1.1 788 786.9 1.12     0.81
##       se
## X1 24.19
## -------------------------------------------------------- 
## group: Upper middle income
##    vars  n  mean     sd median trimmed   mad min max range skew kurtosis
## X1    1 56 88.72 146.91   40.5   53.21 40.77 4.6 834 829.4 3.16    10.99
##       se
## X1 19.63
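
The idea behind a grouped summary is to split the values by group and apply the same statistic to each piece. As a rough sketch of that idea in base R (not a replacement for describeBy), tapply() does this for one statistic at a time. The data frame toy is again a hypothetical stand-in built from the first six rows printed earlier.

```r
#--- Toy stand-in for sdg: TB incidence and income group for six countries
toy <- data.frame(
  tb   = c(189, 370, 25, 41, 225, 25),
  lmic = c("Low income", "Lower middle income", "Upper middle income",
           "Lower middle income", "Lower middle income", "Upper middle income")
)

#--- Mean TB incidence within each income group (ignoring missing values)
group_means <- tapply(toy$tb, toy$lmic, mean, na.rm = TRUE)
group_means

#--- Number of countries in each income group
group_counts <- table(toy$lmic)
group_counts
```

With only six rows these group means are not meaningful in themselves; the point is the pattern of values, grouping variable, then function.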

EXERCISE: Get detailed summary statistics for maternal mortality.