3.4 Exercises

True or false? A bar chart is an appropriate visualization for a continuous variable.
True or false? An appropriate numerical summary for a categorical variable is a frequency table.
If a descriptive statistical function such as mean() or sd() returns a missing value (NA), what option can you use to return a non-missing value (assuming the variable you are describing has some non-missing values)?
What is the default method of handling missing data when using a regression function in R?
Give a reason for removing cases with missing values before doing a regression analysis.
Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), numerically and visually examine the continuous variable hospitals_per_100k (number of hospitals per 100,000 persons) and the categorical variable CensusRegionName.
Using the United Nations Human Development Data (unhdd2020.rmph.rData, see Appendix A.2), create a “Table 1” of descriptive statistics (mean and standard deviation for continuous variables, frequency and proportion for categorical variables), overall and by Human Development Index group (hdi_group). Use a complete case analysis that includes the following variables:

hdi: Human Development Index (HDI)
life: Life expectancy at birth (years)
educ_expected: Expected years of schooling (years)
gii: Gender Inequality Index
urban: Urban population (%)

Using the COVID dataset (covid_20210908_rmph.rData, see Appendix A.4), create a “Table 1” of descriptive statistics, overall and by the number of hospitals per 100,000 persons (hospitals_per_100k). Use a median split to create a binary version of this “by” variable. Describe the following variables in this table:

pop.usafacts: County population
cases.usafacts.20210908: COVID-19 cumulative cases as of 2021-09-08
deaths.usafacts.20210908: COVID-19 cumulative deaths as of 2021-09-08
MedianAge2010: Median age of county in 2010
CensusRegionName: Name of census region

Hint: Remove the statistic option in tbl_summary() to display the default statistics (median and interquartile range), which are more appropriate for skewed data such as county population. This is equivalent to replacing "{mean} ({sd})" with "{median} ({p25}, {p75})".
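
The mechanics mentioned in this exercise and hint can be sketched on simulated data. The example below is only an illustration under stated assumptions: the data frame toy and its columns population, rate, and rate_group are hypothetical names with made-up values (not the COVID dataset), and the gtsummary package is assumed to be installed. It shows one way to create a median split and how the statistic option would be written explicitly to request the median and quartiles.

```r
# Toy data: hypothetical names and simulated values, NOT the COVID dataset.
set.seed(42)
toy <- data.frame(
  population = rlnorm(200, meanlog = 10, sdlog = 1.5), # right-skewed variable
  rate       = runif(200, min = 0, max = 10)           # stand-in for a rate
)

# Median split: classify each row as above vs. at or below the sample median.
cutoff <- median(toy$rate, na.rm = TRUE)
toy$rate_group <- factor(
  toy$rate > cutoff,
  levels = c(FALSE, TRUE),
  labels = c("At or below median", "Above median")
)
table(toy$rate_group)

# Descriptive table, overall and by group (requires the gtsummary package).
if (requireNamespace("gtsummary", quietly = TRUE)) {
  library(gtsummary)
  tab <- tbl_summary(
    toy,
    by = rate_group,
    include = population,
    # Explicit request for the median and quartiles; per the hint above,
    # omitting this argument should give the same statistics by default.
    statistic = list(all_continuous() ~ "{median} ({p25}, {p75})")
  )
  tab <- add_overall(tab) # add a column summarizing all rows together
  # Type `tab` at the console to display the table.
}
```

A strict inequality (>) is used for the split here, so any values exactly equal to the median fall in the lower group; whether to use > or >= is a judgment call that matters mainly when many observations are tied at the median.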