Data exploration

Learning outcomes/objective: Learn…

1 Why Data Exploration?

  • Fundamental step before modeling
  • Ensures understanding of dataset characteristics
  • Identifies anomalies and outliers that could affect model performance

2 Objectives of Data Exploration

  • Understanding the dataset
    • Types of variables (numerical, categorical)
    • Distribution of variables
  • Identifying Issues
    • Missing values
    • Outliers and anomalies
    • Potential biases & non-representativeness (e.g., only males included)
  • Preparing for modeling (later)
    • Feature selection (e.g., drop vars with many missings)
    • Data transformation and normalization
    • Choosing the right model based on data characteristics

3 Data Exploration Techniques

  • Statistical summaries
    • Descriptive statistics (mean, median, mode, standard deviation)
    • Correlation analysis
  • Visualization tools
    • Histograms/barplots for distribution
    • Box plots for outliers
    • Scatter plots for relationships
  • Handling missing data
    • Techniques: imputation, deletion, and understanding the impact on the model
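
The techniques listed above can be sketched in a few lines of base R. The data frame `toy` and its columns are hypothetical and only serve as an illustration; mean imputation is shown as one simple option, not as a recommendation.

```r
# Hypothetical toy data (not the ESS data)
toy <- data.frame(
  satisfaction = c(7, 8, NA, 5, 9, 6, NA, 8),
  age          = c(25, 40, 31, 58, 47, 36, 62, 29),
  group        = factor(c("a", "b", "a", "b", "a", "a", "b", "b"))
)

# Descriptive statistics for a numeric variable (ignoring missing values)
desc <- c(
  mean   = mean(toy$satisfaction, na.rm = TRUE),
  median = median(toy$satisfaction, na.rm = TRUE),
  sd     = sd(toy$satisfaction, na.rm = TRUE)
)
print(round(desc, 2))

# Mode: the most frequent observed value
mode_value <- as.numeric(names(which.max(table(toy$satisfaction))))

# Correlation based on rows where both variables are observed
r <- cor(toy$satisfaction, toy$age, use = "complete.obs")

# Option 1: deletion (drop rows with any missing value)
toy_deleted <- toy[complete.cases(toy), ]

# Option 2: simple mean imputation
toy_imputed <- toy
toy_imputed$satisfaction[is.na(toy_imputed$satisfaction)] <- desc[["mean"]]

# Compare the number of rows kept under each choice
c(deleted = nrow(toy_deleted), imputed = nrow(toy_imputed))
```

Deletion shrinks the data, while mean imputation keeps all rows but reduces the variance of the imputed variable. Both choices can affect a model trained on the result.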

4 Lab: Exploring a dataset (ESS data)

4.1 The data (ESS)

We use the European Social Survey (ESS) [Round 10 - 2020. Democracy, Digital social contacts] to tackle ML regression problems with a continuous outcome. The ESS data (prepared by myself, see footnote 1) contain different outcomes amenable to both classification and regression as well as many variables that could be used as features (~380 variables). We are interested in predicting the outcome life_satisfaction using different potential predictors.

  • life_satisfaction = stflife: measures life satisfaction (How satisfied with life as a whole?).
  • unemployed = uempla: measures unemployment (Doing last 7 days: unemployed, actively looking for job).
  • education = eisced: measures education (Highest level of education, ES - ISCED).
  • age: measures age (in years).
  • country = cntry: measures a respondent’s country of origin (here held constant for France).
  • etc.
# install.packages("pacman")
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat)

We first import the data into R:

# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "173VVsu9TZAxsCF_xzBxsxiDQMVc_DdqS")))

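Because load() restores objects under the names they were saved with, it can help to check what was actually created. The following is a small sketch using a temporary file and a hypothetical object called `toy_data`; it makes no assumption about the contents of the ESS file beyond the fact that the lab code later refers to an object called data.

```r
# Save a hypothetical object to a temporary .RData file
toy_data <- data.frame(id = 1:3, score = c(7, NA, 9))
tmp <- tempfile(fileext = ".RData")
save(toy_data, file = tmp)

# Load into a separate environment so nothing in the workspace is overwritten
env <- new.env()
loaded_names <- load(tmp, envir = env)  # load() returns the restored object names

print(loaded_names)
str(env[[loaded_names[1]]])
```

Loading into a separate environment is a cautious design choice: an .RData file can otherwise silently replace objects of the same name in the current session.
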
4.2 Exploring using descriptive statistics

Here we use functions from the skimr and modelsummary packages.

Q: Please quickly go through the statistics below. How can we interpret them? What do they tell us about the data? Can you spot anything interesting?

library(skimr)
library(modelsummary)

# Data overview
# skim(data) # Run this in R (output is too long)
datasummary_skim(data, type = "numeric")
Variable Unique (#) Missing (%) Mean SD Min Median Max
respondent_id 1977 0 18947.9 5193.1 10005.0 18927.0 27908.0
life_satisfaction 12 10 7.0 2.2 0.0 8.0 10.0
unemployed_active 2 0 0.0 0.2 0.0 0.0 1.0
unemployed 2 0 0.0 0.1 0.0 0.0 1.0
education 8 1 3.1 1.9 0.0 3.0 6.0
news_politics_minutes 88 0 84.3 144.1 0.0 60.0 1200.0
internet_use_time 59 19 209.6 183.0 0.0 150.0 1380.0
trust_people 12 0 4.7 2.1 0.0 5.0 10.0
people_fair 12 0 6.0 2.0 0.0 6.0 10.0
people_helpful 12 0 4.8 2.1 0.0 5.0 10.0
trust_parliament 12 3 4.5 2.4 0.0 5.0 10.0
trust_legal_system 12 1 5.2 2.5 0.0 5.0 10.0
trust_police 12 0 6.4 2.2 0.0 7.0 10.0
trust_politicians 12 1 3.9 2.2 0.0 4.0 10.0
trust_political_parties 12 2 3.4 2.1 0.0 3.0 10.0
trust_european_parliament 12 6 4.4 2.4 0.0 5.0 10.0
trust_united_nations 12 7 5.2 2.4 0.0 5.0 10.0
voted_national_election 3 18 0.4 0.5 0.0 0.0 1.0
contacted_politician 3 0 0.9 0.3 0.0 1.0 1.0
donated_political_party 3 0 1.0 0.2 0.0 1.0 1.0
campaign_badge 3 0 0.9 0.3 0.0 1.0 1.0
signed_petition 3 0 0.7 0.4 0.0 1.0 1.0
public_demonstration 3 0 0.9 0.3 0.0 1.0 1.0
boycotted_products 3 1 0.7 0.5 0.0 1.0 1.0
posted_politics_online 3 0 0.8 0.4 0.0 1.0 1.0
volunteered_charity 2 0 0.7 0.4 0.0 1.0 1.0
feel_close_party 3 2 0.6 0.5 0.0 1.0 1.0
left_right_scale 12 11 5.1 2.2 0.0 5.0 10.0
satisfied_economy 12 3 4.6 2.2 0.0 5.0 10.0
satisfied_government 12 3 4.8 2.3 0.0 5.0 10.0
satisfied_democracy 12 2 5.2 2.4 0.0 5.0 10.0
state_education 12 3 5.1 2.2 0.0 5.0 10.0
state_health_services 12 0 6.3 2.3 0.0 7.0 10.0
eu_unification 12 7 5.5 2.6 0.0 5.0 10.0
immigration_economy 12 3 5.4 2.4 0.0 5.0 10.0
immigration_cultural_life 12 2 5.8 2.7 0.0 6.0 10.0
immigrants_country_impact 12 2 5.2 2.2 0.0 5.0 10.0
happiness 12 0 7.4 1.7 0.0 8.0 10.0
crime_victim_last_5_years 3 0 0.8 0.4 0.0 1.0 1.0
attachment_country 12 0 8.0 1.9 0.0 8.0 10.0
attachment_europe 12 1 6.1 2.5 0.0 6.0 10.0
religion_current 3 1 0.5 0.5 0.0 0.0 1.0
ever_religion 3 51 0.8 0.4 0.0 1.0 1.0
religiousness 12 1 4.7 3.5 0.0 5.0 10.0
discrimination_group_membership 3 1 0.9 0.3 0.0 1.0 1.0
discrimination_colour_race 2 0 0.0 0.2 0.0 0.0 1.0
discrimination_nationality 2 0 0.0 0.1 0.0 0.0 1.0
discrimination_religion 2 0 0.0 0.2 0.0 0.0 1.0
discrimination_language 2 0 0.0 0.1 0.0 0.0 1.0
discrimination_ethnic_group 2 0 0.0 0.1 0.0 0.0 1.0
discrimination_age 2 0 0.0 0.1 0.0 0.0 1.0
discrimination_gender 2 0 0.0 0.2 0.0 0.0 1.0
discrimination_sexuality 2 0 0.0 0.1 0.0 0.0 1.0
discrimination_disability 2 0 0.0 0.1 0.0 0.0 1.0
discrimination_other 2 0 0.0 0.2 0.0 0.0 1.0
discrimination_not_applicable 2 0 0.9 0.3 0.0 1.0 1.0
citizenship_country 3 0 0.1 0.2 0.0 0.0 1.0
born_in_country 2 0 0.1 0.3 0.0 0.0 1.0
year_first_live_in_country 74 88 1991.1 19.9 1937.0 1995.0 2019.0
feel_ethnic_group_part 3 3 0.1 0.4 0.0 0.0 1.0
father_born_in_country 3 1 0.2 0.4 0.0 0.0 1.0
mother_born_in_country 3 0 0.2 0.4 0.0 0.0 1.0
climate_change_personal_responsibility 12 1 7.5 2.1 0.0 8.0 10.0
energy_use_impact_climate_change 12 66 6.2 2.2 0.0 6.0 10.0
people_limit_energy_use_likelihood 12 66 4.2 1.9 0.0 4.0 10.0
government_action_reduce_climate_change 12 66 4.4 2.1 0.0 5.0 10.0
elections_free_fair 12 2 8.7 1.7 0.0 9.0 10.0
political_parties_clear_alternatives 12 2 8.4 1.8 0.0 9.0 10.0
media_free_criticism_government 12 1 7.9 2.3 0.0 8.0 10.0
minority_rights_protection 12 2 8.4 1.8 0.0 9.0 10.0
citizen_final_say_referendums 12 3 7.6 2.1 0.0 8.0 10.0
courts_equal_treatment 12 1 9.1 1.4 0.0 10.0 10.0
governing_parties_punished_bad_job 12 2 8.5 1.9 0.0 9.0 10.0
government_protection_against_poverty 12 1 8.7 1.7 0.0 9.0 10.0
government_income_level_measures 12 2 8.0 2.1 0.0 8.0 10.0
ordinary_people_views_prevail 12 4 7.4 2.0 0.0 8.0 10.0
people_will_cannot_be_stopped 12 2 7.4 2.2 0.0 8.0 10.0
key_decisions_by_national_government 12 4 7.4 2.0 0.0 8.0 10.0
free_fair_elections 12 4 7.2 2.3 0.0 8.0 10.0
clear_political_alternatives 12 3 5.1 2.2 0.0 5.0 10.0
media_freedom_criticism 12 2 6.4 2.6 0.0 7.0 10.0
minority_rights_protection_incountry 12 4 5.9 2.2 0.0 6.0 10.0
direct_voting_referendums 12 4 3.8 2.6 0.0 4.0 10.0
courts_equality 12 3 4.4 2.6 0.0 4.0 10.0
governing_party_punishment 12 6 5.0 2.7 0.0 5.0 10.0
government_protection_poverty 12 2 4.4 2.4 0.0 4.0 10.0
income_inequality_reduction 12 4 4.3 2.2 0.0 4.0 10.0
ordinary_people_influence 12 5 3.6 2.2 0.0 4.0 10.0
unstoppable_public_will 12 3 3.6 2.4 0.0 4.0 10.0
national_vs_eu_decisions 12 8 5.5 2.2 0.0 6.0 10.0
democracy_importance_policy_change 11 37 7.5 1.6 0.0 8.0 10.0
democracy_government_policy_change_country 12 23 4.4 2.3 0.0 5.0 10.0
democracy_importance_stick_to_policies 12 80 7.0 1.7 0.0 7.0 10.0
democracy_stick_to_policies_country 12 80 6.5 1.8 0.0 7.0 10.0
showcard_correct_version 1 0 1.0 0.0 1.0 1.0 1.0
importance_live_democracy 12 2 8.5 2.0 0.0 9.0 10.0
strong_leader_above_law_acceptable 12 2 2.1 2.6 0.0 1.0 10.0
household_members 10 0 2.7 1.4 1.0 2.0 10.0
female 2 0 0.5 0.5 0.0 1.0 1.0
year_of_birth 76 0 1971.5 18.7 1931.0 1971.0 2005.0
age 76 0 49.5 18.7 16.0 50.0 90.0
ever_lived_with_partner 3 12 0.6 0.5 0.0 1.0 1.0
ever_divorced 3 0 0.8 0.4 0.0 1.0 1.0
children_in_household_ever 3 38 0.4 0.5 0.0 0.0 1.0
education_years_fulltime 30 2 13.4 3.7 0.0 13.0 30.0
doing7days_paid_work 2 0 0.5 0.5 0.0 1.0 1.0
doing7days_education 2 0 0.1 0.3 0.0 0.0 1.0
doing7days_permanently_sick_or_disabled 2 0 0.0 0.2 0.0 0.0 1.0
doing7days_retired 2 0 0.3 0.4 0.0 0.0 1.0
doing7days_community_or_military_service 1 0 0.0 0.0 0.0 0.0 0.0
doing7days_housework_or_care 2 0 0.0 0.2 0.0 0.0 1.0
paid_work_control_last_week 3 55 1.0 0.2 0.0 1.0 1.0
ever_had_paid_job 3 57 0.2 0.4 0.0 0.0 1.0
number_of_employees 17 90 3.0 15.5 0.0 0.0 150.0
supervising_responsibility 3 8 0.6 0.5 0.0 1.0 1.0
number_supervised 57 66 17.7 50.5 0.0 5.0 500.0
work_organisation_decision 12 9 6.7 3.4 0.0 8.0 10.0
influence_policy_decisions 12 9 4.6 3.6 0.0 5.0 10.0
contracted_hours_per_week 62 16 35.9 10.4 1.0 35.0 155.0
total_hours_worked_per_week 69 11 39.5 12.4 0.0 39.0 168.0
work_abroad_more_than_6_months 3 8 0.9 0.2 0.0 1.0 1.0
unemployment_over_3_months 3 0 0.6 0.5 0.0 1.0 1.0
unemployment_over_12_months 3 65 0.5 0.5 0.0 1.0 1.0
unemployment_last_5_years 3 65 0.6 0.5 0.0 1.0 1.0
partner_paid_work_last_week 2 0 0.4 0.5 0.0 0.0 1.0
partner_education_last_week 2 0 0.0 0.1 0.0 0.0 1.0
partner_unemployed_looking 2 0 0.0 0.1 0.0 0.0 1.0
partner_unemployed_not_looking 2 0 0.0 0.1 0.0 0.0 1.0
partner_permanently_sick_disabled 2 0 0.0 0.1 0.0 0.0 1.0
partner_retired 2 0 0.2 0.4 0.0 0.0 1.0
partner_community_military_service 1 0 0.0 0.0 0.0 0.0 0.0
partner_housework_care 2 0 0.0 0.0 0.0 0.0 1.0
dngothp 2 0 0.0 0.2 0.0 0.0 1.0
partner_control_over_paid_work 3 76 1.0 0.2 0.0 1.0 1.0
partner_hours_worked_week 49 64 38.1 10.3 2.0 37.0 90.0
course_lecture_conference_attendance 3 1 0.7 0.5 0.0 1.0 1.0
internet_access_home 2 0 0.9 0.3 0.0 1.0 1.0
internet_access_work 2 0 0.5 0.5 0.0 0.0 1.0
internet_access_on_move 2 0 0.4 0.5 0.0 0.0 1.0
internet_access_other 2 0 0.4 0.5 0.0 0.0 1.0
internet_access_none 2 0 0.1 0.2 0.0 0.0 1.0
communication_feels_closer 12 1 6.1 2.7 0.0 7.0 10.0
communication_work_life_interrupt 12 4 6.7 2.2 0.0 7.0 10.0
communication_easy_coordination 12 2 7.3 2.1 0.0 8.0 10.0
communication_undermines_privacy 12 2 7.0 2.3 0.0 8.0 10.0
communication_exposes_misinformation 12 2 8.1 1.8 0.0 8.0 10.0
children_over_12_number 8 1 1.2 1.3 0.0 1.0 6.0
child_over_12_age 56 44 30.3 13.6 12.0 28.0 66.0
child_over_12_lives_in_household 3 44 0.7 0.5 0.0 1.0 1.0
travel_time_to_child_over_12 59 64 150.3 251.8 0.0 50.0 2880.0
parents_alive_mother_father 3 57 0.5 0.5 0.0 0.0 1.0
parent_age 55 35 68.2 13.1 36.0 69.0 90.0
parent_lives_in_household 3 34 0.8 0.4 0.0 1.0 1.0
travel_time_to_parent 70 47 147.4 255.4 0.0 40.0 2880.0
satisfied_with_main_job 12 44 7.6 2.0 0.0 8.0 10.0
manager_supports_work_life_balance 12 53 6.2 2.9 0.0 7.0 10.0
feel_part_of_team 11 54 8.6 1.7 0.0 9.0 10.0
take_extra_responsibilities_unpaid 12 54 4.6 3.4 0.0 5.0 10.0
work_from_home_eases_communication 12 73 5.9 3.4 0.0 7.0 10.0
limit_energy_impact_climate_change 12 68 5.8 2.3 0.0 6.0 10.0
likelihood_people_limit_energy_use 12 68 4.0 1.9 0.0 4.0 10.0
likelihood_gov_action_reduce_climate_change 12 68 4.2 2.0 0.0 4.0 10.0
respondent_overall_experience 11 1 7.9 1.6 0.0 8.0 10.0
tech_problem_starting_video 2 0 0.0 0.1 0.0 0.0 1.0
tech_problem_internet_connection 2 0 0.0 0.1 0.0 0.0 1.0
tech_problem_displaying_showcards 2 0 0.0 0.1 0.0 0.0 1.0
tech_problem_audio_clarity 2 0 0.0 0.1 0.0 0.0 1.0
tech_problem_video_clarity 2 0 0.0 0.1 0.0 0.0 1.0
tech_problem_other_issue 2 0 0.0 0.1 0.0 0.0 1.0
tech_problem_no_issues 2 0 0.0 0.1 0.0 0.0 1.0
tech_problem_not_applicable 2 0 1.0 0.2 0.0 1.0 1.0
tech_problem_refusal 1 0 0.0 0.0 0.0 0.0 0.0
tech_problem_dont_know 1 0 0.0 0.0 0.0 0.0 0.0
tech_problem_no_answer 1 0 0.0 0.0 0.0 0.0 0.0
interview_length_minutes 122 2 58.9 25.9 11.0 55.0 653.0
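
To make the columns of the table above concrete, the sketch below recomputes them by hand for a hypothetical toy data frame. One assumption worth checking on your own data: the counts in the table (e.g., 12 unique values for a 0-10 scale with missings, 3 for a 0/1 indicator with missings) suggest that NA is counted as its own unique value, and the helper below is written that way.

```r
# Hypothetical toy data (not the ESS data)
toy <- data.frame(
  satisfaction = c(0, 5, 8, 8, NA, 10),
  voted        = c(1, 0, NA, 1, 1, 0)
)

# Recompute the summary columns for one numeric vector
summarise_numeric <- function(x) {
  c(
    unique      = length(unique(x)),           # NA counts as one unique value
    missing_pct = round(100 * mean(is.na(x))),
    mean        = mean(x, na.rm = TRUE),
    sd          = sd(x, na.rm = TRUE),
    min         = min(x, na.rm = TRUE),
    median      = median(x, na.rm = TRUE),
    max         = max(x, na.rm = TRUE)
  )
}

# One row per variable, one column per statistic
overview <- t(sapply(toy, summarise_numeric))
print(round(overview, 1))
```

For a 0/1 indicator the mean is the share of ones among the non-missing values, so a mean of 0.6 for `voted` means that 60% of the observed respondents are coded 1.
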
datasummary_skim(data %>% select(1:50), type = "categorical")
Variable Level N %
country BE 0 0.0
BG 0 0.0
CH 0 0.0
CZ 0 0.0
EE 0 0.0
FI 0 0.0
FR 1977 100.0
GB 0 0.0
GR 0 0.0
HR 0 0.0
HU 0 0.0
IE 0 0.0
IS 0 0.0
IT 0 0.0
LT 0 0.0
ME 0 0.0
MK 0 0.0
NL 0 0.0
NO 0 0.0
PT 0 0.0
SI 0 0.0
SK 0 0.0
internet_use_frequency Never 196 9.9
Only occasionally 97 4.9
A few times a week 78 3.9
Most days 180 9.1
Every day 1426 72.1
political_interest Very interested 299 15.1
Quite interested 478 24.2
Hardly interested 800 40.5
Not at all interested 398 20.1
system_allows_say Not at all 489 24.7
Very little 673 34.0
Some 618 31.3
A lot 135 6.8
A great deal 25 1.3
active_role_politics Not at all able 786 39.8
A little able 575 29.1
Quite able 408 20.6
Very able 101 5.1
Completely able 86 4.4
system_allows_influence Not at all 600 30.3
Very little 686 34.7
Some 518 26.2
A lot 124 6.3
A great deal 12 0.6
confident_participate_politics Not at all confident 510 25.8
A little confident 756 38.2
Quite confident 532 26.9
Very confident 96 4.9
Completely confident 55 2.8
closeness_to_party Very close 56 2.8
Quite close 397 20.1
Not close 265 13.4
Not at all close 53 2.7
income_differences_government_action Agree strongly 725 36.7
Agree 745 37.7
Neither agree nor disagree 248 12.5
Disagree 166 8.4
Disagree strongly 67 3.4
gays_lesbians_freedom Agree strongly 1400 70.8
Agree 359 18.2
Neither agree nor disagree 111 5.6
Disagree 25 1.3
Disagree strongly 58 2.9
family_member_gay_shame Agree strongly 75 3.8
Agree 84 4.2
Neither agree nor disagree 114 5.8
Disagree 164 8.3
Disagree strongly 1515 76.6
gay_lesbian_adoption_rights Agree strongly 930 47.0
Agree 388 19.6
Neither agree nor disagree 243 12.3
Disagree 176 8.9
Disagree strongly 195 9.9
children_learns_obedience Agree strongly 940 47.5
Agree 631 31.9
Neither agree nor disagree 191 9.7
Disagree 141 7.1
Disagree strongly 63 3.2
loyalty_to_leaders Agree strongly 214 10.8
Agree 519 26.3
Neither agree nor disagree 539 27.3
Disagree 352 17.8
Disagree strongly 277 14.0
immigrants_same_ethnicity Allow many to come and live here 461 23.3
Allow some 1096 55.4
Allow a few 256 12.9
Allow none 75 3.8

4.3 Exploring using descriptive graphs

Good options for exploring data are the DataExplorer and visdat packages in R. The graphs below are taken from the official GitHub websites (DataExplorer, visdat). If the dataset is very wide, i.e., has a lot of variables, we can subset the data with data %>% select(1:10). Importantly, we should pay special attention to the outcome life_satisfaction and how it relates to other variables.

Q: Please take 10 minutes to go through the figures below. What do they tell us about the data? How can we read/interpret them? What is interesting about them?




library(ggplot2)
library(patchwork)
p1 <- ggplot(data = data, aes(x = life_satisfaction)) + 
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 0:10) + 
  theme_light()
p2 <- ggplot(data = data, aes(x = life_satisfaction)) + 
  geom_density(fill="gray", alpha=0.8) + # try bw = 0.4
  scale_x_continuous(breaks = 0:10) + 
  theme_light()
p1+p2 

Figure 1: Histogram and density of outcome life satisfaction
Insights
  • For the visualization, make sure to adapt the scales to the outcome variable. Also, play around with the binwidth and bandwidth arguments to fully grasp the distribution (see another example below).
  • The mass of the distribution lies in the upper half; hence, we would expect any predictive model to predict those values for most individuals. There is less data on the lower values. Hence, any predictive model we build will also have less training data in this area to learn from, which might result in worse predictions.
  • With binary outcomes, such graphs will also highlight potential imbalance, i.e., unequal sizes of the classes we want to predict.
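
The points about where the mass of the distribution lies, and about imbalance for binary outcomes, can also be checked numerically. This is a minimal sketch with hypothetical toy vectors (`outcome`, `binary_outcome`); the cutoff of 6 for the upper part of the scale is an arbitrary illustrative choice.

```r
# Hypothetical outcome on a 0-10 scale with one missing value
outcome <- c(8, 7, 9, 10, 6, 8, 7, 3, 8, 9, NA, 5)

# Relative frequency of each observed value (missing values shown separately)
freq <- prop.table(table(outcome, useNA = "ifany"))
print(round(freq, 2))

# Share of non-missing observations in the upper part of the scale
share_upper <- mean(outcome >= 6, na.rm = TRUE)
print(round(share_upper, 2))

# Hypothetical binary outcome: class shares reveal imbalance
binary_outcome <- c(0, 0, 0, 1, 0, 0, 1, 0, 0, 0)
class_share <- prop.table(table(binary_outcome))
print(class_share)
```
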



# Try playing around with variable age
ggplot(data = data, aes(x = age)) + 
  geom_histogram(binwidth = 2) +
  theme_light()




library(DataExplorer)
library(visdat)
# Overview of dataset
plot_intro(data)

Figure 2: Overview of dataset




# Missing value distribution
data %>% 
  select(where(~mean(is.na(.)) > 0.05)) %>% # select features with more than X % missing
    plot_missing()

Figure 3: Missing value distribution
Insights
  • Variables/features/predictors with a lot of missing data are generally not useful. Using them in our models would strongly decrease the size of our training data if rows with missing values are dropped. Consider deleting them from the dataset or maybe imputing them.
  • Also ask yourself why there are so many missing values for some variables. Does it point to a bias of some kind or to data errors?
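
How much a feature with many missing values shrinks the training data can be checked directly. The sketch below uses hypothetical toy data and an arbitrary threshold of 50% missing; it assumes that incomplete rows are dropped before modeling (listwise deletion).

```r
# Hypothetical toy data with one sparse feature
toy <- data.frame(
  outcome  = c(7, 8, 6, 9, 5, 7, 8, 6, 9, 7),
  x_full   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  x_some   = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  x_sparse = c(NA, NA, NA, 4, NA, NA, 7, NA, NA, NA)
)

# Share of missing values per variable
miss_share <- colMeans(is.na(toy))
print(miss_share)

# Rows usable under listwise deletion when all features are included
n_all_features <- sum(complete.cases(toy))

# Drop features above the (arbitrary) threshold and count again
keep <- names(miss_share)[miss_share <= 0.5]
toy_reduced <- toy[, keep, drop = FALSE]
n_reduced <- sum(complete.cases(toy_reduced))

c(all_features = n_all_features, without_sparse = n_reduced)
```

In this toy example, keeping the sparse feature leaves only the rows where it is observed, whereas dropping it retains almost all rows.
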




# Missing value distribution
data %>% 
  select(where(~mean(is.na(.)) > 0.8)) %>% # select features with more than X % missing
    plot_missing()

Figure 4: Missing value distribution (e.g., remove vars with > 80% missing)




# View variable types and missings across variables
vis_dat(data %>% 
          select(1:30) %>%
           sample_n(1000))

Figure 5: Missings across variables & variable types
Insights
  • Figure 5 would indicate if there is systematic missingness for certain variable types.




# Visualize missings across variables
vis_miss(data %>% 
           select(1:30) %>%
           sample_n(1000),
           sort_miss = TRUE) # alternatively, try the argument cluster = TRUE

Figure 6: Missings across variables
Insights
  • The legend in Figure 6 shows the overall share of missing values. High values would be problematic, as only a few features could sensibly be used for predictive modelling.
  • For each variable, we can also see the share of missing values on that variable.
  • We also don’t want to build a predictive (or explanatory) model that turns out to be based on very few data points, i.e., we should be the first to produce that graph for our data.




# Frequency distribution of discrete variables
data %>% 
  select(1:20) %>%
    plot_bar()

Figure 7: Frequency distribution of discrete variables
Insights
  • Figure 7 shows the number of observations across categorical variables.




data %>% 
  select(1:20) %>%
    plot_bar(with = "life_satisfaction")

Figure 8: Frequency (sum) across discrete variables
Insights
  • Figure 8 shows the sum of our outcome variable across categories of other variables. This makes more sense for a categorical outcome (see below), less so for life_satisfaction.
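
Why the sum of a continuous outcome across categories is hard to read can be seen in a small sketch: the sum mixes group size and level, while the group mean isolates the level. The toy data and the variable names `interest` and `satisfaction` are hypothetical.

```r
# Hypothetical toy data: a small group with high values, a large group with lower values
toy <- data.frame(
  interest     = c("high", "high", "low", "low", "low", "low", "low", "low"),
  satisfaction = c(9, 8, 6, 7, 6, 5, 7, 6)
)

# Group size, sum, and mean of the outcome per category
by_group <- sapply(
  split(toy$satisfaction, toy$interest),
  function(v) c(n = length(v), sum = sum(v), mean = mean(v))
)
print(round(by_group, 2))
```

Here the larger group has the larger sum even though its average satisfaction is lower, which is why a mean (or a boxplot) per category is usually easier to interpret for a continuous outcome.
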




# Frequency distribution by a discrete variable
data %>% 
  mutate(female = as.factor(female)) %>% # convert to factor
  select(female, country, political_interest, system_allows_say, 
         subjective_health, internet_use_frequency, unemployed) %>%
          plot_bar(by = "female")

Figure 9: Frequency distribution by a discrete variable (female)
Insights
  • Figure 9 is helpful to discover any systematic pattern between (categorical) socio-demographics.




# View histogram of continuous variables
data %>% 
  select(1:30) %>%
      plot_histogram()

Figure 10: Histograms




## View estimated density distribution of continuous variables
data %>% 
  select(1:35) %>%
    plot_density()

Figure 11: Density distributions




# View quantile-quantile plot of continuous variables
data %>% 
  sample_n(500) %>%
  select(1:11) %>%
    plot_qq()

Figure 12: Quantile-quantile plots
Insights
  • Figure 12: A quantile-quantile plot (Q-Q plot) compares two probability distributions by plotting their quantiles against each other. Here it is used to identify whether continuous variables are far from a normal distribution. Points on the line indicate a normal distribution; departures from the line indicate deviations from normality, e.g., news_politics_minutes has some outliers deviating from the normal distribution, as is also visible in Figure 11. Using a random sample may speed up computing the plot and should roughly preserve the distribution.
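
The logic of a Q-Q plot against the normal distribution can be reproduced with base R. The sketch uses simulated data (the variables `symmetric_var` and `skewed_var` are hypothetical stand-ins), and the correlation between sample and theoretical quantiles serves only as a rough numeric summary of how closely the points follow a straight line.

```r
set.seed(123)

# Simulated stand-ins: one roughly normal variable, one right-skewed variable
symmetric_var <- rnorm(500, mean = 5, sd = 2)
skewed_var    <- rexp(500, rate = 1 / 80)  # long right tail

# qqnorm() returns theoretical (x) and sample (y) quantiles
qq_sym  <- qqnorm(symmetric_var, plot.it = FALSE)
qq_skew <- qqnorm(skewed_var, plot.it = FALSE)

# Points close to a straight line give a correlation near 1
r_sym  <- cor(qq_sym$x, qq_sym$y)
r_skew <- cor(qq_skew$x, qq_skew$y)
print(round(c(symmetric = r_sym, skewed = r_skew), 3))
```

To draw the plot itself in base R, qqnorm(skewed_var) followed by qqline(skewed_var) adds the reference line.
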




# Quantile-quantile plot of continuous variables by discrete variable
data %>% 
  select(1:10) %>% 
  plot_qq(by = "internet_use_frequency")

Figure 13: Quantile-quantile plot of continuous variables by discrete variable
Insights
  • Figure 13 shows Q-Q plots of continuous variables for subsets defined by a categorical variable. The aim is to identify whether a continuous variable deviates from the normal distribution for certain subsets of our data. Potentially, we could identify which category is responsible for a deviation from the normal distribution (there are no clear patterns in Figure 13).




# Overall correlation heatmap
data %>% 
  select(life_satisfaction, education, age, female, subjective_health) %>%
      plot_correlation(cor_args = list("use" = "pairwise.complete.obs"))

Figure 14: Correlation heatmap
Insights
  • Figure 14 may point to important predictors, reflected in stronger correlations with the outcome.
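
The argument use = "pairwise.complete.obs" passed through cor_args matters when variables have missing values. The sketch below compares it with "complete.obs" on hypothetical toy data; the variable names mirror the lab but the values are made up.

```r
# Hypothetical toy data with missing values in different rows
toy <- data.frame(
  satisfaction = c(7, 8, NA, 5, 9, 6, 4, 8),
  age          = c(25, 40, 31, 58, 47, NA, 62, 29),
  education    = c(3, 5, 2, NA, 6, 3, 1, 4)
)

# Pairwise: each correlation uses all rows where that pair is observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

# Complete: every correlation uses only rows without any missing value
cor_complete <- cor(toy, use = "complete.obs")

# Number of rows behind each pairwise correlation
observed <- ifelse(is.na(as.matrix(toy)), 0, 1)
n_pairwise <- crossprod(observed)

print(round(cor_pairwise, 2))
print(n_pairwise)
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Pairwise deletion uses more data per coefficient, but each cell of the heatmap may then be based on a different subset of respondents, which is worth keeping in mind when comparing correlations.
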




# Bivariate continuous distributions based on cutting life_satisfaction
data %>% 
  select(1:22) %>%
      plot_boxplot(by = "life_satisfaction")

Figure 15: Bivariate continuous distributions based on cutting life_satisfaction
Insights
  • Figure 15 discretizes our outcome variable and shows how it changes across values of other variables. If other variables vary strongly across our categorized outcome life_satisfaction, it may indicate that they have predictive power. Figure 15 shows that there is no meaningful variation for respondent_id, which makes sense since the id variable does not carry any information.
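
The idea behind Figure 15, cutting the continuous outcome into bins and comparing other variables across those bins, can be sketched by hand. Five equal-width bins via cut() are an assumption for illustration and may not match the binning DataExplorer applies internally; the data are simulated and the variable names hypothetical.

```r
set.seed(42)

# Simulated stand-ins: `health` is related to the outcome, `id` is not
n <- 300
satisfaction <- sample(0:10, n, replace = TRUE)
health <- satisfaction / 2 + rnorm(n, sd = 1)
id <- sample(10000:30000, n)

# Discretize the outcome into 5 equal-width bins
satisfaction_bin <- cut(satisfaction, breaks = 5)

# Median of each variable within the outcome bins
median_health <- tapply(health, satisfaction_bin, median)
median_id     <- tapply(id, satisfaction_bin, median)

print(round(median_health, 2))
print(round(median_id))
```

A variable whose medians shift systematically across the bins (here `health`) is a candidate predictor, while an uninformative variable such as an id should show no systematic pattern. In base R, boxplot(health ~ satisfaction_bin) draws the corresponding plot.
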




# Scatterplot `life_satisfaction` with other continuous features
data %>% 
  sample_n(500) %>%
  select(1:12) %>%
      split_columns %>% # split according to variable type
        pluck(2) %>% # take numeric variables
            plot_scatterplot(by = "life_satisfaction",
                             geom_point_args = list(alpha = 0.1, size = 1),
                             geom_jitter_args = 
                               list(width = 0.3, height = 0.3),
                             ggtheme = theme_light())

Figure 16: Scatterplot life_satisfaction with other continuous features
Insights
  • Figure 16 provides an overview of the joint distribution of our outcome with other variables. It may help in discovering areas where we don’t have data in those joint distributions.
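
Areas of a joint distribution without data can also be located numerically by cross-tabulating binned variables. The sketch below uses simulated data in which, by construction, young respondents never report very low satisfaction; the bin boundaries and the cutoff of 5 observations are hypothetical choices.

```r
set.seed(1)

# Simulated stand-ins for age and an outcome on a 0-10 scale
n <- 400
age <- sample(16:90, n, replace = TRUE)
satisfaction <- pmin(10, pmax(0, round(rnorm(n, mean = 7, sd = 2))))
satisfaction[age < 30 & satisfaction < 4] <- 4  # create an empty region on purpose

# Bin both variables and count observations in each cell
age_bin <- cut(age, breaks = c(15, 29, 49, 69, 90))
sat_bin <- cut(satisfaction, breaks = c(-1, 3, 6, 10),
               labels = c("low (0-3)", "mid (4-6)", "high (7-10)"))
cell_counts <- table(age_bin, sat_bin)
print(cell_counts)

# Cells with few or no observations
sparse_cells <- which(cell_counts < 5, arr.ind = TRUE)
print(sparse_cells)
```

A model has little or nothing to learn from in sparse cells, so predictions for such combinations rest mostly on extrapolation.
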

5 Exercise: Exploring a dataset (COMPAS data)

Overview of COMPAS dataset variables
  • id: ID of prisoner, numeric
  • name: Name of prisoner, factor
  • compas_screening_date: Date of COMPAS screening, date
  • decile_score: the decile of the COMPAS score, numeric
  • is_recid: whether someone reoffended/recidivated (=1) or not (=0), numeric
  • is_recid_factor: same but factor variable
  • age: a continuous variable containing the age (in years) of the person, numeric
  • age_cat: age categorized
  • priors_count: number of prior crimes committed, numeric
  • sex: gender with levels “Female” and “Male”, factor
  • race: race of the person, factor
  • juv_fel_count: number of juvenile felonies, numeric
  • juv_misd_count: number of juvenile misdemeanors, numeric
  • juv_other_count: number of prior juvenile convictions that are not considered either felonies or misdemeanors, numeric
  1. To introduce classification, we use a second dataset on prisoners to predict whether they reoffend or not (recidivism). The data is based on software that scores prisoners regarding their probability of reoffending/recidivating and also records whether they actually reoffended (variable: is_recid/is_recid_factor where 1 = yes, 0 = no). Please import this dataset, which is stored in an object called data (see code below).
  2. Then install and load the following packages: skimr, modelsummary, DataExplorer, visdat, tidyverse and patchwork (see code below).
  3. Start by generating a few descriptive statistics to better understand the data. Use the skim() function from the skimr package to get a first overview. What kind of variables does the data include? What interesting aspects stand out?
  4. Use datasummary_skim() from the modelsummary package with type = "numeric" and type = "categorical" to produce tables for numeric and categorical variables.
  5. Use plot_intro() (Package: DataExplorer) to get a broad overview of the data. Is there anything particular about the dataset?
  6. Missings determine success and failure of predictive models. Use the functions plot_missing() (Package: DataExplorer) and vis_miss() (Package: visdat; use sort_miss = TRUE) to visualize missings. What stands out?
  7. Special attention should be given to the outcome we want to predict. Explore the outcome using table(..., useNA = "always") and graphs. Is there anything particular about it?
  8. Finally, please use functions such as plot_bar(), plot_histogram(), plot_density(), plot_qq(), plot_correlation() and plot_boxplot() to explore the data and whether certain predictors stand out in relation to is_recid. Since the dataset is small, you can simply apply those functions to the full dataset. What do you find?
# 1.
# Load the .RData file into R
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1gryEUVDd2qp9Gbgq8G0PDutK_YKKWWIk")))


# 2.
# install.packages("pacman")
pacman::p_load(skimr, 
               modelsummary, 
               DataExplorer, 
               visdat, 
               tidyverse, 
               patchwork)
Solutions
# 3.
# Summary statistics
#skim(data)

# 4.
datasummary_skim(data, type = "numeric")
datasummary_skim(data, type = "categorical")

# 5.
# Overview of data
plot_intro(data)


# 6.
# Visualize missings
plot_missing(data)
vis_miss(data, sort_miss = TRUE)



# 7.
## Explore the outcome
table(data$is_recid, useNA = "always")
data %>% 
  ggplot(aes(x = is_recid)) + 
    geom_histogram(binwidth = 1) +
    scale_x_continuous(breaks = 0:1) + 
    theme_light()



# 8.
## Frequency distribution of discrete variables
data %>% plot_bar()

## Distribution across discrete variables
data %>% plot_bar(with = "is_recid")


## View frequency distribution by a discrete variable
data %>% plot_bar(by = "is_recid_factor")


## View histogram of continuous variables
data %>% plot_histogram()



## View estimated density distribution of continuous variables
data %>% plot_density()



## View quantile-quantile plot of continuous variables
data %>%
  sample_n(1000) %>%
    plot_qq()



## View quantile-quantile plot of continuous variables by feature `is_recid`
data %>%
  sample_n(1000) %>%
        plot_qq(by = "is_recid")


## View overall correlation heatmap
data %>%
  plot_correlation(cor_args = list("use" = "pairwise.complete.obs"))


## View bivariate continuous distribution based on `cut`
data %>%
  plot_boxplot(by = "is_recid")

Footnotes

  1. I added some missings on the life satisfaction variable!