Chapter 2 Data understanding and preparation
In this chapter, we will analyse the database in more details in order to be able to build significant prediction models. Moreover, we will look for wrong values among observations and correct them.
Our database is composed of 1,000 observations and contains the following variables:
- obs: observation number
- chk_account: category defining the balance in the checking account in Deutsch Mark (DM)
- duration: loan duration in months
- history: credit history
- new_car: purpose of credit
- used_car: purpose of credit
- furniture: purpose of credit
- radio.tv: purpose of credit
- education: purpose of credit
- retraining: purpose of credit
- amount: credit amount
- sav_acct: category defining the average balance in savings account in DM
- employment: category defining the duration of current employment
- install_rate: installment rate as % of disposable income
- male_div: borrower is a male and divorced
- male_single: borrower is a male and single
- male_mar_or_wid: borrower is a male and married or a male and widower
- co.applicant: borrower has a co-applicant
- guarantor: borrower has a guarantor
- present_resident: category defining the duration of present resident in years
- real_estate: borrower owns real estate
- prop_unkn_none: borrower owns no property
- age: borrower’s age in years
- other_install: borrower has other installment plan credit
- rent: borrower has a rent
- own_res: borrower owns residence
- num_credits: number of existing credits at this bank
- job: category defining the nature of the borrower’s job
- num_dependents: number of people for whom liable to provide maintenance
- telephone: borrower has phone in his or her name
- foreign: borrower is a foreign worker
- risk: borrower’s credit rating is good or bad (response variable)
2.1 Exploratory data analysis (EDA)
We make some cleaning operations in our dataset which is shown below.
<- read.csv("Data/GermanCredit.csv", sep = ";")
german
#Renaming OBS. in OBS
colnames(german)[1] <- "OBS"
#Col names in lower case
colnames(german) <- tolower(colnames(german))
#Creating a new variable in order to know if credit risk is good (1) or bad (0)
$risk <- ifelse(german$response == 1, "good", "bad")
german$risk <- as.factor(german$risk)
german#We remove the binary variable response because we will use the caterogical one (risk) that we previously created
<- german %>% select(-response)
german
#Inspecting the data frame, there are numerical and categorical values
datatable(german, rownames=FALSE, fillContainer = TRUE)
2.1.1 Missing values
We check for missing data. No missing data is observed in our database.
missmap(german)
2.1.2 Bad vs Good
Our database is not balanced. There is much more good credits. Therefore, models will tend to predict more often good customers rather than bad ones. We will have to pay attention to the specificity. Nevertheless, data will be balanced in the modeling part.
<- inspect_cat(select(german, risk))
vs show_plot(vs)
2.1.3 Anomalies
We detected some anomalies in the database:
- guarantor: Max = 2 : It should be 1 instead of 2.
- education: Min = -1 : It should be 1 instead of -1.
- age: Max = 125 : It should be 75 instead of 125 (impossible value).
<- summary(german$age) %>% round(0) %>% as.data.frame.vector()
summary_age
<- summary(german$education) %>% as.data.frame.vector()
summary_education
<- summary(german$guarantor) %>% as.data.frame.vector()
summary_guarantor
<- summary(german$duration) %>% round(0) %>% as.data.frame.vector()
summary_duration
<- summary(german$chk_acct) %>% as.data.frame.vector()
summary_chk_acct
<- cbind(summary_age, summary_education, summary_guarantor, summary_duration, summary_chk_acct)
summary_final %>% kable(caption = "Summary", col.names = c("Age", "Education", "Guarantor", "Duration", "Chk_acct")) summary_final
Age | Education | Guarantor | Duration | Chk_acct | |
---|---|---|---|---|---|
Min. | 19 | -1.000 | 0.000 | 4 | 0.000 |
1st Qu. | 27 | 0.000 | 0.000 | 12 | 0.000 |
Median | 33 | 0.000 | 0.000 | 18 | 1.000 |
Mean | 36 | 0.048 | 0.053 | 21 | 1.577 |
3rd Qu. | 42 | 0.000 | 0.000 | 24 | 3.000 |
Max. | 125 | 1.000 | 2.000 | 72 | 3.000 |
We will therefore modify the database with the correct information in order to optimize our future models.
#Data transformation
$age[537] <- 75
german$education [37] <- 1
german$guarantor [234] <- 1 german
Also, in order to understand better the variables, we make some modifications in our database but only for the EDA part. Indeed, we rename the job categories as well as the categories of the savings account feature.
<- german
german_EDA
$job <-
german_EDAfactor(
$job,
german_EDAlevels = 0:3,
labels = c(
"unemployed/unskilled",
"unskilled - resident",
"skilled employee/official",
"management/self-employed/
highly qualifed employee/officer"
)
)
$sav_acct <-
german_EDAfactor(
$sav_acct,
german_EDAlevels = 0:4,
labels = c(
"< 100 DM",
"100 ≤ ... < 500 DM",
"500 ≤ ··· < 1000 DM",
"≥ 1000 DM",
"unknown/no savings account"
) )
2.1.4 Analysis of the clients
2.1.4.1 Type of jobs
Most of the clients have a qualified job. This variable may play a role in our future predictions.
ggplot(german_EDA, aes(x = job, fill = job)) +
geom_bar() +
ggtitle("Types of jobs", subtitle = "The majority are skilled employee") +
theme_bw() +
theme(
axis.text = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 12),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_blank()
)
2.1.4.2 Type of savings account
Most customers have less than 100 DM in their account. Indeed, the customer base consists mainly of small savers.
ggplot(german_EDA, aes(x = sav_acct, fill = sav_acct)) +
geom_bar() +
ggtitle("Types of savings accounts", subtitle = "The majority have less than 100 DM") +
theme_bw() +
theme(
axis.text = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 12),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_blank()
)
2.1.4.3 Purpose of credit
Below, we display the risk among purpose of credit. Radio TV is the most frequent purpose for a credit and people who borrow money to buy a new car are most often defined as at-risk customers.
<- german_EDA %>% select("new_car", "used_car", "furniture", "radio.tv", "education", "retraining", "risk")
purpose
#new_car
<- purpose %>% select("new_car", "risk") %>% filter(risk == "good", new_car == 1) %>% count(name = "good")
df_new_car_good <- purpose %>% select("new_car", "risk") %>% filter(risk == "bad", new_car == 1) %>% count(name = "bad")
df_new_car_bad <- cbind(df_new_car_good, df_new_car_bad)
results_new_carrownames(results_new_car)[rownames(results_new_car) == "1"] = "new_car"
#used_car
<- purpose %>% select("used_car", "risk") %>% filter(risk == "good", used_car == 1) %>% count(name = "good")
df_used_car_good <- purpose %>% select("used_car", "risk") %>% filter(risk == "bad", used_car == 1) %>% count(name = "bad")
df_used_car_bad <- cbind(df_used_car_good, df_used_car_bad)
results_used_car rownames(results_used_car)[rownames(results_used_car) == "1"] = "used_car"
#furniture
<- purpose %>% select("furniture", "risk") %>% filter(risk == "good", furniture == 1) %>% count(name = "good")
df_furniture_good <- purpose %>% select("furniture", "risk") %>% filter(risk == "bad", furniture == 1) %>% count(name = "bad")
df_furniture_bad <- cbind(df_furniture_good, df_furniture_bad)
results_furniture rownames(results_furniture)[rownames(results_furniture) == "1"] = "furniture"
#radio.tv
<- purpose %>% select("radio.tv", "risk") %>% filter(risk == "good", radio.tv == 1) %>% count(name = "good")
df_radio.tv_good <- purpose %>% select("radio.tv", "risk") %>% filter(risk == "bad", radio.tv == 1) %>% count(name = "bad")
df_radio.tv_bad <- cbind(df_radio.tv_good, df_radio.tv_bad)
results_radio.tv rownames(results_radio.tv)[rownames(results_radio.tv) == "1"] = "radio.tv"
#education
<- purpose %>% select("education", "risk") %>% filter(risk == "good", education == 1) %>% count(name = "good")
df_education_good <- purpose %>% select("education", "risk") %>% filter(risk == "bad", education == 1) %>% count(name = "bad")
df_education_bad <- cbind(df_education_good, df_education_bad)
results_education rownames(results_education)[rownames(results_education) == "1"] = "education"
#retraining
<- purpose %>% select("retraining", "risk") %>% filter(risk == "good", retraining == 1) %>% count(name = "good")
df_retraining_good <- purpose %>% select("retraining", "risk") %>% filter(risk == "bad", retraining == 1) %>% count(name = "bad")
df_retraining_bad <- cbind(df_retraining_good, df_retraining_bad)
results_retraining rownames(results_retraining)[rownames(results_retraining) == "1"] = "retraining"
<- rbind(results_new_car, results_used_car, results_furniture, results_radio.tv, results_education, results_retraining)
all_results
<- all_results %>% melt()
results_final
<- results_final %>% mutate(purpose = c("new_car", "used_car", "furniture", "radio.tv", "education", "retraining", "new_car", "used_car", "furniture", "radio.tv", "education", "retraining"))
results_final
colnames(results_final) <- c("risk", "value", "purpose")
%>% ggplot(aes(reorder(x = purpose, -value), y = value, fill = risk)) +
results_final geom_bar(position = 'dodge', stat = 'identity') + theme_bw() +
theme(
axis.text = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 12),
axis.title.y = element_blank()
+
) ggtitle("Risk analysis by purpose of credit", subtitle = "Radio TV is the most frequent purpose for a credit") + xlab('Purpose of credit')
2.1.4.4 Gender analysis
Most of the customers are men. We note, however, that the probability of being at risk among women is high. It will therefore be interesting to take gender into account during the construction of our models.
#Create a variable sex
<- german_EDA %>% mutate(sex = ifelse(male_div == 1 | male_single == 1 | male_mar_or_wid == 1, 1,0))
german_EDA
$sex <-
german_EDAfactor(german_EDA$sex,
levels = 0:1,
labels = c("female",
"male"))
%>% select(sex, risk) %>% group_by(sex, risk) %>% count() %>%
german_EDA ggplot(aes(x = sex, y = n, fill = risk)) +
geom_bar(position = 'dodge', stat = 'identity') +
geom_text(aes(label = n),
position = position_dodge(width = 0.9),
vjust = -0.25) +
ggtitle("Risk analysis by gender", subtitle = "The data set is not well balanced")
2.1.5 Boxplot - variable explanation
duration: At-risk clients have longer credit duration.
amount: The median amounts are slightly higher for the at-risk category.
age: The median for good payers is higher.
<- german %>% ggplot(aes(x = risk, y = duration)) +
b1 geom_boxplot(
outlier.colour = "red",
outlier.shape = 8,
outlier.size = 1
)
<- german %>% ggplot(aes(x = risk, y = amount)) +
b2 geom_boxplot(
outlier.colour = "red",
outlier.shape = 8,
outlier.size = 1
)
<- german %>% ggplot(aes(x = risk, y = age)) +
b3 geom_boxplot(
outlier.colour = "red",
outlier.shape = 8,
outlier.size = 1
)
grid.arrange(b1, b2, b3, nrow=1,
top = textGrob("Box Plot analysis", gp=gpar(fontsize=15, font=2)))
Below, we remark that variables age, amount and duration are right skewed. In order to modify it, we use the log transformation to change them into a normal distribution.
<- ggplot(german, aes(x = age)) +
g1 geom_histogram(bins = 20, fill = 'pink', colour = 'black') +
ggtitle('Age distribution') +
xlab('Age') +
ylab('Frequency')
<- ggplot(german, aes(x = amount)) +
g2 geom_histogram(bins = 20, fill = 'turquoise', colour = 'black') +
ggtitle('Amount distribution') +
xlab('Credit_Amount') +
ylab('Frequency')
<- ggplot(german, aes(x = duration)) +
g3 geom_histogram(bins = 20, fill = 'lightgoldenrod', colour = 'black') +
ggtitle('Duration distribution') +
xlab('Duration') +
ylab('Frequency')
grid.arrange(g1, g2, g3, nrow=1)
Even if age is still a bit right skewed, it is much better than before as we can see with the following plot. This modification is important in order to guarantee the good reliability of our models.
#Transforming variables
<- ggplot(german, aes(x = log1p(age))) +
g1 geom_histogram(bins = 20, fill = 'pink', colour = 'black') +
ggtitle('Age distribution') +
xlab('Age') +
ylab('Frequency')
<- ggplot(german, aes(x = log1p(amount))) +
g2 geom_histogram(bins = 20, fill = 'turquoise', colour = 'black') +
ggtitle('Amount distribution') +
xlab('Credit_Amount') +
ylab('Frequency')
<- ggplot(german, aes(x = log1p(duration))) +
g3 geom_histogram(bins = 20, fill = 'lightgoldenrod', colour = 'black') +
ggtitle('Duration distribution') +
xlab('Duration') +
ylab('Frequency')
grid.arrange(g1, g2, g3, nrow=1)
2.1.6 Correlation
Correlation for risk = good: There is a correlation between amount and duration. This makes sense.
<- german %>% filter(risk == "good") %>% select(amount, age, duration)
correlation_good
ggpairs(correlation_good)
Correlation for risk = bad: The amount is correlated with the duration of the contract and somewhat with the age of the person.
<- german %>% filter(risk == "bad") %>% select(amount, age, duration)
correlation_bad
ggpairs(correlation_bad)