5.3 Attribute Re-engineering

In the previous chapter when we do data understanding. Apart from the missing values, we also find some attributes do not make sense or have no prediction power when considering the relationship with survival. for example, we have found name has little prediction power. It is illogical to say some survived because he or she has a specific name. However there is title information buried inside the name, The title can potentially useful at least it shows the age addition to the gender.

We have also find other useful information hidden inside some variables. For example, the information about the deck is possibly hidden inside cabin. Information about group travel is buried inside of Ticket and Fare that passengers share same tickets number and fare must travel in a group. It seems that the ticket is a group ticket. Furthermore, we have also found that the group that shares tickets is mostly family members. This is further confirmed by the none 0 values in the SibSp and Parch attributes. That hidden information can be very important. We can surface them by attributes’ re-engineering.

Title from Name attribute

The Name is initially believed is useless for predict a passenger’s fate. But we have found in it there is information about titles even maybe marriage relations. So our first task in attribute re-engineering is to create a new attribute called Title. It is abstracted from Name. It is the title of the passenger, which can be extracted from the Name attribute using a regular expression.

# Abstract Title out
 data$Title <- gsub('(.*, )|(\\..*)', '', data$Name)
 data %>%
   group_by(Title) %>%
   dplyr::count() %>%
 arrange(desc(n))

## # A tibble: 18 x 2
## # Groups:   Title [18]
##    Title            n
##    <chr>        <int>
##  1 Mr             757
##  2 Miss           260
##  3 Mrs            197
##  4 Master          61
##  5 Dr               8
##  6 Rev              8
##  7 Col              4
##  8 Major            2
##  9 Mlle             2
## 10 Ms               2
## 11 Capt             1
## 12 Don              1
## 13 Dona             1
## 14 Jonkheer         1
## 15 Lady             1
## 16 Mme              1
## 17 Sir              1
## 18 the Countess     1

We can see that there is a total of 18 different titles. Some of them are commonly used titles and others are less common. Those less commonly used titles have a very small number, mostly, just 1.

We will group the less commonly used titles into other so to balance distribution.

# Group those less common title’s into an ‘Other’ category.
data$Title <- ifelse(data$Title %in% c("Mr", "Miss", "Mrs", "Master"), data$Title, "Other")

L<- table(data$Title, data$Sex)
knitr::kable(L, digits = 2, booktabs = TRUE, caption = "Title and sex confirmation")

Table 5.1: Title and sex confirmation
	female	male
Master	0	61
Miss	260	0
Mr	0	757
Mrs	197	0
Other	9	25

Checking the table of Title vs Sex shows nothing anomalous.

A stacked bar graph of the newly created attribute suggests it could be quite useful that the difference in survival between ‘Master’ and ‘Mr’ will be something that hasn’t been captured by the Sex attribute.

data %>%
  filter(!is.na(Survived)) %>%
ggplot(aes(x = factor(Title), fill = factor(Survived))) + 
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
  scale_fill_discrete(name = "Survived") +
  labs(x = "Title", y = "Survival Percentage") + 
  ggtitle("Title attribute (Proportion Survived)")

Figure 5.5: Survivial percentage onver Title

Deck from Cabin attribute

From our previous analysis, we have found out that the cabin numbers are all start with a letter. It could be a deck number of some sort. If we group cabin numbers with their initial letter, we can then treat the ordinal missing cabin’s value records as a separate group.

So, we group all cabin numbers into groups according to their first letter. Create a new attribute with the name Deck. and assign records with no cabin number as U (no cabin number) for its Deck value.

data$Cabin <- as.character(data$Cabin)
data$Deck <- ifelse((data$Cabin == ""), "U", substr(data$Cabin, 1, 1))
# plot our newly created attribute relation with Survive
p1 <- ggplot(data[1:891,], aes(x = Deck, fill = factor(Survived))) +
  geom_bar(width = 0.5) +
  labs(x = "Deck number", y = "Total account") + 
  labs(fill = "Survived")

# plot percentage of survive
p2 <- data %>%
  filter(!is.na(Survived)) %>%
ggplot(aes(x = factor(Deck), fill = factor(Survived))) + 
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
  scale_fill_discrete(name = "Survived") +
  labs(x = "Deck number", y = "Percentage") + 
  ggtitle("Newly created Deck number (Proportion Survived)")

grid.arrange(p1, p2, ncol = 2)

Figure 5.6: Survivla vlaue and percentage over newly created Deck attribute

Extract ticket class from ticket number

We knew that the values of Ticket appear has two major kinds ‘Letters Numbers’ or just ‘Numbers’. This could be worth extracting. However, just two class is too rough. As suggested during understanding data, we can group tickets by their first letter or number. let us create a Ticket_class to replace Ticket.

data$Ticket <- as.character(data$Ticket)
data$Ticket_class <- ifelse((data$Ticket != " "), substr(data$Ticket, 1, 1), "")
data$Ticket_class <- as.factor(data$Ticket_class)

# plot our newly created attribute relation with Survive
p1 <- data %>%
  filter(!is.na(Survived)) %>%
  ggplot(aes(x = Ticket_class, fill = factor(Survived))) +
  geom_bar(width = 0.5) +
  labs(x = "Ticket_class", y = "Total account") + 
  labs(fill = "Survived value over Ticket class")

# plot percentage of survive
p2 <- data %>%
  filter(!is.na(Survived)) %>%
ggplot(aes(x = factor(Ticket_class), fill = factor(Survived))) + 
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
  scale_fill_discrete(name = "Survived") +
  labs(x = "Ticket_class", y = "Percentage") + 
  ggtitle("Survived percentage over Newly created Ticket_class")

grid.arrange(p1, p2, ncol = 2)

Figure 5.7: Survival value and percentage over newly created Ticket class

Although the plot appeared to have a skewed bi-model shape, its prediction is clearly improved by ticket number.

Travel in Groups

We have seen that passengers shared ticket numbers and fares. It is a clear indication of the passenger traveling in groups. Travel in groups can be an important factor for survival in disasters. The Titanic movie impressed millions because of the love story about a couple, they want to stay together to live and to death. Generally, that is the spirit of grouping - stay together for worse or for better. Apart from two friends travel together, we have also seen the family travel together that as indicated by SibSp and Parch attributes.

To make it simple we can create a Group_size, which takes a minimum value of 1 to represent the passenger travel alone. otherwise in groups. The group size is defined as:

$\begin{equation} Group\_size = Max(Friend\_size, Family\_size). \tag{5.1} \end{equation}$

where, $\begin{equation} Friend\_size = Sum(PassengerID), \tag{5.2} \end{equation}$ that share the some ticket number and fare, which we have already created in the section @ref(fare_pp) when we create new data frame Fare_pp. $\begin{equation} Family\_size = SibSp + Parch + 1 \tag{5.3} \end{equation}$

So we do,

data$Family_size <- data$SibSp + data$Parch + 1
data$Group_size <- pmax(data$Family_size, data$Friend_size)

Now let us see our newly created attribute’s prediction power,

Figure 5.8: Survival value and percentage over newly created Group Size

The plot shows that most people traveled alone, small and large groups have the least chance of survival while Medium-sized groups (3 and 4) seemed to have the best chance of living.

Age in Groups

We have seen the age has a strong correlation with survival. However, it is too fine granted, it is better to create demographical groups called Age_group.

Age_labels <- c('0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79')

data$Age_group <- cut(data$Age, c(0, 10, 20, 30, 40, 50, 60, 70, 80), include.highest=TRUE, labels= Age_labels)

p1 <- data %>%
  filter(!is.na(Survived)) %>%
    ggplot(aes(x = Age_group, y = ..count.., fill = factor(Survived))) +
  geom_bar() +
  ggtitle("Survived value ove newly created Age_group")

# plot percentage of survive
p2 <- data %>%
  filter(!is.na(Survived)) %>%
ggplot(aes(x = Age_group, fill = factor(Survived))) + 
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
  scale_fill_discrete(name = "Survived") +
  labs(x = "Age group", y = "Percentage") + 
  ggtitle("Survived percentage ove newly created Age_group")

grid.arrange(p1, p2, ncol = 2)

Figure 5.9: Survival value and percentage over newly created Age Group

We can see here only age group “0-9” has a better chance of survive.

Fare per passenger

We have used this concept when we fill the missing value of Embarked in Section 5.2. We were comparing the records’ fare with other passengers’ fare because we believe the fare should reflect the journey that should indicate the embarked port. It is there we find out the passenger could share the fare and the ticket number. So it is faulty information if you only considering Fare values between two passengers. After we introduce a new attribute Fare_pp that stands for fare per person, its value is the true value a passenger paid for the travel.

So we have, $\begin{equation} Fare\_PP = Fare / Friend\_size. \tag{5.4} \end{equation}$

We do this,

data$Fare_pp <- data$Fare/data$Friend_size

Let us examine our newly created attribute Fare_PP’s prediction power,

# plot Fare_PP against Survived
p1<- data %>%
  filter(!is.na(Survived)) %>%
ggplot(aes(x = Fare_pp, fill = factor(Survived))) + 
 geom_histogram(binwidth = 2) +
  scale_y_continuous(breaks = seq(0, 500, 50)) + 
  scale_fill_discrete(name = "Survived") + 
  labs(x = "Fare (per person)", y = "Count") + 
  ggtitle("Survived value over Fare_pp")
p1

Figure 5.10: Survival value and percentage over newly created Fare per person

# plot percentage of survive
p2 <- data %>%
  filter(!is.na(Survived)) %>%
ggplot(aes(x = factor(Fare_pp), fill = factor(Survived))) + 
  geom_bar(position = "fill") + 
  scale_y_continuous(labels = scales::percent, breaks = seq(0, 1, 0.1)) +
  scale_fill_discrete(name = "Survived") +
  labs(x = "Fare per person", y = "Percentage") + 
  ggtitle("Survived rate over newly created Fare_PP")
p2

Figure 5.11: Survival value and percentage over newly created Fare per person

# plot in box plot
data %>%
  filter(!is.na(Survived)) %>%
  filter(Fare > 0) %>%
ggplot(aes(factor(Survived), Fare_pp)) +
  geom_boxplot(alpha = 0.2) +
  scale_y_continuous(trans = "log2") +
  geom_point(show.legend = FALSE) + 
  geom_jitter()

Figure 5.12: Survival value over newly created Fare per person by boxplot

# grid.arrange(p1, p2, ncol = 2)

The graph confirms the fare_PP associated with the passenger’s survival. We can see that the perished passenger tend to pay less (around 8 pounds) and the average survived passenger appeared paid something around 14 pounds.