5.4 Build Re-engineered Dataset

So far we have:

  • combined the test dataset with the training dataset
  • transformed some data types
  • imputed the missing values for some attributes
  • re-engineered some attributes, and
  • created some new attributes

Let us look at the attributes of our dataset:

glimpse(data)
## Rows: 1,309
## Columns: 21
## $ PassengerId  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Survived     <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, ...
## $ Pclass       <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, ...
## $ Name         <fct> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley...
## $ Sex          <fct> male, female, female, female, male, male, male, male, ...
## $ Age          <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 27.4...
## $ SibSp        <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, ...
## $ Parch        <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, ...
## $ Ticket       <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803",...
## $ Fare         <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8...
## $ Cabin        <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", "G6"...
## $ Embarked     <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, ...
## $ HasCabinNum  <chr> "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No...
## $ Friend_size  <int> 1, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Fare_pp      <dbl> 7.250000, 35.641650, 7.925000, 26.550000, 8.050000, 8....
## $ Title        <chr> "Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master"...
## $ Deck         <chr> "U", "C", "U", "C", "U", "U", "E", "U", "U", "U", "G",...
## $ Ticket_class <fct> A, P, S, 1, 3, 3, 1, 3, 3, 2, P, 1, A, 3, 3, 2, 3, 2, ...
## $ Family_size  <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Group_size   <dbl> 2, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Age_group    <fct> 20-29, 30-39, 20-29, 30-39, 30-39, 20-29, 50-59, 0-9, ...

We can see there are 21 attributes in total. Compared with the 12 attributes in the original raw dataset, there are 9 newly added attributes. They have enriched the original attributes, but some original attributes that have since been re-engineered are still present, such as Name and Cabin (which has too many missing values). Name has been transformed into Title, and Cabin has been transformed into HasCabinNum and Deck.

Clearly, we need to clean up or remove the redundant attributes. Deck is derived from Cabin, so with Deck in place Cabin no longer needs to exist; dropping Cabin loses no information. Fare can be misleading because it records the amount paid for a ticket without indicating whether that amount is a group fare or a single fare, so Fare_pp is a more accurate replacement for Fare. Family_size is derived from SibSp and Parch, so it contains their information; if you want a fine-grained analysis, you can keep all three. Friend_size was introduced when we calculated the ticket price per person. It differs from Family_size because it simply counts the passengers who share the same ticket number, and there is no way to know whether they are family members. Conversely, Family_size does not guarantee that the family members share a ticket. Ticket_class is derived from the Ticket number and is a kind of grouping of the tickets. Finally, Age_group is a similar concept that groups the Age attribute.
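To make the difference between Family_size and Friend_size concrete, here is a minimal sketch on a toy data frame. The values and ticket labels are hypothetical, and the two definitions used (Family_size as SibSp + Parch + 1, Friend_size as the number of passengers sharing a ticket number) are assumptions that are consistent with the values displayed above rather than the exact code used earlier.

```r
# Toy data (hypothetical values, not taken from the Titanic dataset)
toy <- data.frame(
  SibSp  = c(1, 1, 0, 0),
  Parch  = c(0, 0, 0, 2),
  Ticket = c("T1", "T1", "T1", "T2"),
  stringsAsFactors = FALSE
)

# Assumed definition: the passenger plus siblings/spouses plus parents/children
toy$Family_size <- toy$SibSp + toy$Parch + 1

# Assumed definition: number of passengers sharing the same ticket number
toy$Friend_size <- ave(seq_along(toy$Ticket), toy$Ticket, FUN = length)

print(toy)
```

In this sketch the third passenger travels without family but shares a ticket with two others, while the fourth has a family of three but holds a ticket alone, which is why neither attribute can stand in for the other.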

Therefore, we could keep our re-engineered dataset as follows:

RE_data <- subset(data, select = -c(Name, Cabin, Fare))

Our dataset now has the following attributes:

glimpse(RE_data)
## Rows: 1,309
## Columns: 18
## $ PassengerId  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Survived     <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, ...
## $ Pclass       <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, ...
## $ Sex          <fct> male, female, female, female, male, male, male, male, ...
## $ Age          <dbl> 22.00000, 38.00000, 26.00000, 35.00000, 35.00000, 27.4...
## $ SibSp        <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, ...
## $ Parch        <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, ...
## $ Ticket       <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803",...
## $ Embarked     <fct> S, C, S, S, S, Q, S, S, S, C, S, S, S, S, S, S, Q, S, ...
## $ HasCabinNum  <chr> "No", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No...
## $ Friend_size  <int> 1, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Fare_pp      <dbl> 7.250000, 35.641650, 7.925000, 26.550000, 8.050000, 8....
## $ Title        <chr> "Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr", "Mr", "Master"...
## $ Deck         <chr> "U", "C", "U", "C", "U", "U", "E", "U", "U", "U", "G",...
## $ Ticket_class <fct> A, P, S, 1, 3, 3, 1, 3, 3, 2, P, 1, A, 3, 3, 2, 3, 2, ...
## $ Family_size  <dbl> 2, 2, 1, 2, 1, 1, 1, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Group_size   <dbl> 2, 2, 1, 2, 1, 1, 2, 5, 3, 2, 3, 1, 1, 7, 1, 1, 6, 1, ...
## $ Age_group    <fct> 20-29, 30-39, 20-29, 30-39, 30-39, 20-29, 50-59, 0-9, ...
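As a quick sanity check, one could confirm that exactly the intended columns were dropped. The sketch below uses a small stand-in data frame (`data_demo`, a hypothetical name with made-up rows) so that it runs on its own; with the real objects, the same `setdiff()` call would be applied to `names(data)` and `names(RE_data)`.

```r
# Stand-in for `data`: a tiny data frame reusing a few of the column names above
data_demo <- data.frame(
  PassengerId = 1:3,
  Name    = c("a", "b", "c"),
  Cabin   = c("", "C85", ""),
  Fare    = c(7.25, 71.2833, 7.925),
  Fare_pp = c(7.25, 35.64165, 7.925),
  Deck    = c("U", "C", "U"),
  stringsAsFactors = FALSE
)

# Same subsetting call as in the text, applied to the stand-in
RE_demo <- subset(data_demo, select = -c(Name, Cabin, Fare))

# Columns present before but not after the subset
dropped <- setdiff(names(data_demo), names(RE_demo))
print(dropped)
```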

To preserve our re-engineered dataset, it is a good idea to save it to disk so that it can be used later in the data analysis.

write.csv(RE_data, file = "./data/RE_Data.CSV", row.names = FALSE)
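One thing to keep in mind is that a CSV file stores only text, so factor columns such as Sex or Age_group do not keep their type when the file is read back. The sketch below illustrates this with a small hypothetical data frame and temporary files, and shows `saveRDS()` / `readRDS()` as one possible alternative when column types need to survive the round trip. It is an illustration, not a required step.

```r
# Small hypothetical data frame with one factor column
demo <- data.frame(
  Sex     = factor(c("male", "female", "female")),
  Fare_pp = c(7.25, 35.64165, 7.925)
)

# Temporary files so the sketch does not touch ./data
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write.csv(demo, file = csv_path, row.names = FALSE)
saveRDS(demo, file = rds_path)

# stringsAsFactors is set explicitly because its default differs across R versions
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
from_rds <- readRDS(rds_path)

print(class(from_csv$Sex))  # factor type is not preserved by the CSV round trip
print(class(from_rds$Sex))  # factor type is preserved by the RDS round trip
```

If the CSV file is used, the factor columns can be converted again with `as.factor()` after loading.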