3.4 Exercises
True or false? A bar chart is an appropriate visualization for a continuous variable.
True or false? An appropriate numerical summary for a categorical variable is a frequency table.
For descriptive statistical functions such as mean() and sd(), if R returns a missing value NA, what option can you use to return a non-missing value (assuming the variable you are describing has some non-missing values)?
What is the default method of handling missing data when using a regression function in R?
Give a reason for removing cases with missing values before doing a regression analysis.
Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), numerically and visually examine the continuous variable hospitals_per_100k (number of hospitals per 100,000 persons) and the categorical variable CensusRegionName.
Using the United Nations Human Development Data (unhdd2020.rmph.rData, see Appendix A.2), create a “Table 1” of descriptive statistics (mean and standard deviation for continuous variables, frequency and proportion for categorical variables), overall and by Human Development Index group (hdi_group). Use a complete case analysis that includes the following variables:
hdi: Human Development Index (HDI)
life: Life expectancy at birth (years)
educ_expected: Expected years of schooling (years)
gii: Gender Inequality Index
urban: Urban population (%)
Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), create a “Table 1” of descriptive statistics, overall and by the number of hospitals per 100,000 persons (hospitals_per_100k) (use a median split to create a binary version of this “by” variable). Describe the following variables in this table:
pop.usafacts: County population
cases.usafacts.20210908: COVID-19 cumulative cases as of 2021-09-08
deaths.usafacts.20210908: COVID-19 cumulative deaths as of 2021-09-08
MedianAge2010: Median age of county in 2010
CensusRegionName: Name of census region
Hint: Remove the statistic option in tbl_summary() to display the default statistics (median and interquartile range), which are more appropriate for data that are skewed, such as county population. This is equivalent to replacing "{mean} ({sd})" with "{median} ({p25}, {p75})".
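As a rough illustration of the hint, the sketch below builds a median split and then summarizes a skewed variable two ways. It uses simulated data rather than the COVID dataset, and the names toy, hosp_rate, pop, and hosp_group are hypothetical stand-ins chosen for this example. It assumes the gtsummary package is installed; the table step is skipped otherwise.

```r
# Simulated stand-in data (NOT the COVID dataset; all names are hypothetical)
set.seed(1)
toy <- data.frame(
  hosp_rate = round(rexp(40, rate = 0.5), 2),               # skewed rate
  pop       = round(rlnorm(40, meanlog = 10, sdlog = 1.5))  # skewed population
)

# Median split: one group at or below the median, one group above it
med <- median(toy$hosp_rate, na.rm = TRUE)
toy$hosp_group <- factor(
  ifelse(toy$hosp_rate > med, "Above median", "At or below median"),
  levels = c("At or below median", "Above median")
)

if (requireNamespace("gtsummary", quietly = TRUE)) {
  library(gtsummary)

  # With the statistic option: mean (SD) for continuous variables
  tab_mean <- tbl_summary(
    toy,
    by = hosp_group,
    include = pop,
    statistic = all_continuous() ~ "{mean} ({sd})"
  )

  # Without the statistic option: the default median and quartiles
  tab_median <- tbl_summary(toy, by = hosp_group, include = pop)
}
```

Printing tab_mean and tab_median side by side shows how far the mean of a right-skewed variable can sit from its median, which is the motivation for the default statistics.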