## 5.3 Attribute Re-engineering

In the previous chapter, during data understanding, we found that, apart from the missing values, some attributes make little sense or have no prediction power with respect to survival. For example, we found that Name has little prediction power on its own: it is illogical to say that someone survived because he or she had a specific name. However, there is title information buried inside the name. The title is potentially useful because it indicates age group in addition to gender.

We have also found other useful information hidden inside some variables. For example, information about the deck is possibly hidden inside Cabin. Information about group travel is buried inside Ticket and Fare: passengers who share the same ticket number and fare must be traveling in a group, so the ticket appears to be a group ticket. Furthermore, we found that the groups sharing a ticket are mostly family members, which is further confirmed by the non-zero values in the SibSp and Parch attributes. This hidden information can be very important, and we can surface it through attribute re-engineering.

### Title from Name attribute

The Name attribute was initially believed to be useless for predicting a passenger’s fate. However, we found that it contains titles and perhaps even hints of marital relations. So our first task in attribute re-engineering is to create a new attribute called Title, which is extracted from the Name attribute using a regular expression.

# Extract Title from Name
data$Title <- gsub('(.*, )|(\\..*)', '', data$Name)
data %>%
group_by(Title) %>%
dplyr::count() %>%
arrange(desc(n))
## # A tibble: 18 x 2
## # Groups:   Title [18]
##    Title            n
##    <chr>        <int>
##  1 Mr             757
##  2 Miss           260
##  3 Mrs            197
##  4 Master          61
##  5 Dr               8
##  6 Rev              8
##  7 Col              4
##  8 Major            2
##  9 Mlle             2
## 10 Ms               2
## 11 Capt             1
## 12 Don              1
## 13 Dona             1
## 14 Jonkheer         1
## 15 Lady             1
## 16 Mme              1
## 17 Sir              1
## 18 the Countess     1
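To see what the regular expression does, here is a minimal sketch that applies the same `gsub()` call to a few hand-written names. The vector `sample_names` is a hypothetical stand-in that follows the "Surname, Title. Given names" layout of the Name attribute; it is not read from the data set.

```r
# Hypothetical names in the "Surname, Title. Given names" layout
sample_names <- c("Braund, Mr. Owen Harris",
                  "Heikkinen, Miss. Laina",
                  "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)")

# '(.*, )'  removes everything up to and including the comma and space
# '(\\..*)' removes the first period and everything after it
titles <- gsub('(.*, )|(\\..*)', '', sample_names)
print(titles)
```

Only the text between the comma and the first period survives, which is why a multi-word title such as "the Countess" is kept whole.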

We can see that there are 18 different titles in total. Four of them (Mr, Miss, Mrs and Master) are common; the others are rare, and most of them occur only once.

We will group the rare titles into a single Other category to balance the distribution.

# Group the less common titles into an "Other" category.
data$Title <- ifelse(data$Title %in% c("Mr", "Miss", "Mrs", "Master"), data$Title, "Other")
L <- table(data$Title, data$Sex)
knitr::kable(L, digits = 2, booktabs = TRUE, caption = "Title and sex confirmation")

Table 5.1: Title and sex confirmation

|        | female | male |
|:-------|-------:|-----:|
| Master |      0 |   61 |
| Miss   |    260 |    0 |
| Mr     |      0 |  757 |
| Mrs    |    197 |    0 |
| Other  |      9 |   25 |

Checking the table of Title against Sex shows nothing anomalous. A stacked bar graph of the newly created attribute suggests it could be quite useful: the difference in survival between ‘Master’ and ‘Mr’ is something that the Sex attribute does not capture.

data %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x = factor(Title), fill = factor(Survived))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
scale_fill_discrete(name = "Survived") +
labs(x = "Title", y = "Survival Percentage") +
ggtitle("Title attribute (Proportion Survived)")

### Deck from Cabin attribute

From our previous analysis, we found that the cabin numbers all start with a letter, which could be a deck designation of some sort. If we group cabin numbers by their initial letter, we can treat the records with missing cabin values as a separate group. So we group all cabin numbers according to their first letter in a new attribute named Deck, and assign U (no cabin number) as the Deck value for records that have no cabin number.

data$Cabin <- as.character(data$Cabin)
# Treat both empty strings and NA as "no cabin number"
data$Deck <- ifelse(is.na(data$Cabin) | data$Cabin == "", "U", substr(data$Cabin, 1, 1))
# plot our newly created attribute's relation with Survived
# (rows 1:891 are the labelled training records)
p1 <- ggplot(data[1:891,], aes(x = Deck, fill = factor(Survived))) +
geom_bar(width = 0.5) +
labs(x = "Deck number", y = "Total count") +
labs(fill = "Survived")

# plot percentage of survival
p2 <- data %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x = factor(Deck), fill = factor(Survived))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
scale_fill_discrete(name = "Survived") +
labs(x = "Deck number", y = "Percentage") +
ggtitle("Newly created Deck number (Proportion Survived)")

grid.arrange(p1, p2, ncol = 2)
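The Deck rule can be checked on a handful of values before looking at the plots. The following is a minimal sketch on a hypothetical vector `cabins`; it assumes that a missing cabin may show up either as an empty string or as `NA`, depending on how the file was read.

```r
# Hypothetical cabin values, including an empty string and an NA
cabins <- c("C85", "", "E46", NA, "B57 B59 B63 B66", "F G73")

# Missing cabins become "U"; otherwise keep the first character as the deck
deck <- ifelse(is.na(cabins) | cabins == "", "U", substr(cabins, 1, 1))
print(table(deck))
```

Note that a record listing several cabins takes the deck letter of the first cabin only.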

### Extract ticket class from ticket number

We know that Ticket values come in two major forms: ‘Letters Numbers’ or just ‘Numbers’. This could be worth extracting. However, just two classes is too coarse. As suggested during data understanding, we can group tickets by their first letter or digit. Let us create a Ticket_class attribute to replace Ticket.

data$Ticket <- as.character(data$Ticket)
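One way to group tickets by their first letter or digit is sketched below. This is only an illustration under our own assumptions, not necessarily the exact derivation of Ticket_class: the vector `tickets` is hypothetical, and upper-casing the first character is our choice.

```r
# Hypothetical ticket values in the two forms described above
tickets <- c("A/5 21171", "PC 17599", "113803", "STON/O2. 3101282", "349909")

# Assumed rule: the class is the first character, upper-cased for consistency
ticket_class <- toupper(substr(tickets, 1, 1))
print(table(ticket_class))
```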
data$Group_size <- pmax(data$Family_size, data$Friend_size)

Now let us look at the prediction power of our newly created attribute. The plot shows that most people traveled alone. Small and large groups had the least chance of survival, while medium-sized groups (3 and 4) seemed to have the best chance of living.

### Age in Groups

We have seen that age has a strong correlation with survival. However, it is too fine-grained; it is better to create demographic groups in a new attribute called Age_group.

Age_labels <- c('0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79')
# right = FALSE gives the intervals [0,10), [10,20), ... so that they match the labels;
# include.lowest = TRUE closes the last interval so that an age of exactly 80 is not dropped
data$Age_group <- cut(data$Age, breaks = seq(0, 80, 10), right = FALSE, include.lowest = TRUE, labels = Age_labels)

p1 <- data %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x = Age_group, fill = factor(Survived))) +
geom_bar() +
ggtitle("Survived value over newly created Age_group")

# plot percentage of survival
p2 <- data %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x = Age_group, fill = factor(Survived))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
scale_fill_discrete(name = "Survived") +
labs(x = "Age group", y = "Percentage") +
ggtitle("Survived percentage over newly created Age_group")

grid.arrange(p1, p2, ncol = 2)

We can see here that only the “0-9” age group has a better chance of survival.

### Fare per passenger

We used this concept when we filled the missing value of Embarked in Section 5.2. There we compared a record’s fare with other passengers’ fares, because we believed that the fare should reflect the journey and therefore indicate the port of embarkation. It was there that we found out that passengers could share a fare and a ticket number. So comparing raw Fare values between two passengers is misleading. We therefore introduce a new attribute, Fare_pp, which stands for fare per person; its value is what a passenger actually paid for the trip. So we have,

$$
Fare\_pp = Fare / Friend\_size. \tag{5.4}
$$

We compute it as follows:

data$Fare_pp <- data$Fare/data$Friend_size
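As a worked example of Equation (5.4), the sketch below computes a per-person fare on a small hypothetical data frame `toy`. It assumes that Friend_size is the number of passengers sharing the same ticket number, which is how we read the definition above; the ticket numbers and fares are illustrative only.

```r
# Hypothetical passengers: three share one ticket, two share another, one travels alone
toy <- data.frame(
  Ticket = c("T1", "T1", "T1", "T2", "T3", "T3"),
  Fare   = c(150, 150, 150, 7.25, 70, 70),
  stringsAsFactors = FALSE
)

# Assumption: Friend_size = number of passengers holding the same ticket number
toy$Friend_size <- as.integer(ave(seq_along(toy$Ticket), toy$Ticket, FUN = length))

# Equation (5.4): the shared fare is split evenly across the ticket holders
toy$Fare_pp <- toy$Fare / toy$Friend_size
print(toy)
```

The three passengers on ticket T1 each end up with a per-person fare of 50 rather than 150, which is the correction that Fare_pp is meant to capture.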

Let us examine the prediction power of our newly created attribute Fare_pp.

# plot Fare_pp against Survived
p1 <- data %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x = Fare_pp, fill = factor(Survived))) +
geom_histogram(binwidth = 2) +
scale_y_continuous(breaks = seq(0, 500, 50)) +
scale_fill_discrete(name = "Survived") +
labs(x = "Fare (per person)", y = "Count") +
ggtitle("Survived value over Fare_pp")
p1
# plot percentage of survival
p2 <- data %>%
filter(!is.na(Survived)) %>%
ggplot(aes(x = factor(Fare_pp), fill = factor(Survived))) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
scale_fill_discrete(name = "Survived") +
labs(x = "Fare per person", y = "Percentage") +
ggtitle("Survival rate over newly created Fare_pp")
p2
# plot in box plot
data %>%
filter(!is.na(Survived)) %>%
filter(Fare > 0) %>%
ggplot(aes(factor(Survived), Fare_pp)) +
geom_boxplot(alpha = 0.2) +
scale_y_continuous(trans = "log2") +
# jitter alone is enough; adding geom_point() as well would draw every passenger twice
geom_jitter(width = 0.2, alpha = 0.5)
# grid.arrange(p1, p2, ncol = 2)

The graphs confirm that Fare_pp is associated with a passenger’s survival. We can see that passengers who perished tended to pay less (around 8 pounds), while the typical surviving passenger paid around 14 pounds.
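Values read off a box plot can also be checked numerically. The sketch below shows one way to do so; `median_fare_by_survival()` is a hypothetical helper introduced here for illustration, and the numbers in `toy` are made up and are not results from the Titanic data.

```r
# Hypothetical helper: median fare per person for each Survived value,
# ignoring records with unknown survival or unknown fare
median_fare_by_survival <- function(df) {
  labelled <- df[!is.na(df$Survived) & !is.na(df$Fare_pp), ]
  tapply(labelled$Fare_pp, labelled$Survived, median)
}

# Made-up values for illustration only
toy <- data.frame(Survived = c(0, 0, 0, 1, 1, 1, NA),
                  Fare_pp  = c(5, 10, 15, 20, 30, 40, 50))
medians <- median_fare_by_survival(toy)
print(medians)
```

Calling the same helper on `data` would give the medians that correspond to the centre lines of the two boxes.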