4.4 Actual Attributes Types Examination
Now that our raw data is loaded in RStudio, we can examine the attributes' types. From figure 4.4, we can see that the attributes fall into three types:
int types are: PassengerId, Survived, SibSp, Parch.
Factor types are: Name, Sex, Ticket, Cabin and Embarked.
num types are: Age and Fare.
We know that:
int is for an attribute that has integer values;
num is for a numeric attribute, whose values are real numbers;
Factor is the R language's way of saying categorical type. It is an attribute that can take on one of a limited, and usually fixed, number of possible values, such as blood type.
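As a minimal sketch of how these three types show up in R, the snippet below builds a tiny made-up data frame and asks for the class of each column. The name toy and its values are illustrative assumptions, not rows from the real file, and stringsAsFactors = TRUE is set explicitly so that text columns become factors.

```r
# Toy stand-in for the Titanic data; all values are invented for illustration.
toy <- data.frame(
  PassengerId = 1:3,                          # whole numbers -> integer
  Age         = c(22, 38.5, 26),              # real numbers  -> numeric
  Sex         = c("male", "female", "female"),
  stringsAsFactors = TRUE                     # text columns become factors
)

# Class of every column, as a named character vector
col_types <- sapply(toy, class)
print(col_types)

# A factor carries a fixed set of levels
sex_levels <- levels(toy$Sex)
print(sex_levels)
```

Calling str(toy) prints the same information with the abbreviated labels int, num and Factor that are used in this section.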
Attribute types affect the operations we can apply to an attribute. In other words, an inappropriate type can prevent us from analysing that attribute properly. For example, it does not make sense to calculate the average of Sex, so it is better typed as a category, which in R is a Factor. Similarly, Survived has only two values, 0 or 1, representing death or survival, so it makes sense for it to be a Factor too. As an int type, it cannot be used with the many methods that only work on a Factor type attribute.
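To make the effect of the type concrete, here is a small sketch that converts an integer 0/1 vector into a factor. The values and the labels "died" and "survived" are assumptions chosen for illustration; they are not taken from the data set.

```r
# Invented values standing in for the Survived column (0 = died, 1 = survived).
survived_int <- c(0L, 1L, 1L, 0L, 1L)

# As an integer, R treats the column as a quantity and allows arithmetic.
print(mean(survived_int))

# Convert to a factor with readable labels (the labels are an assumption).
survived_fct <- factor(survived_int, levels = c(0, 1),
                       labels = c("died", "survived"))

# Counting per category is the natural operation on a factor.
counts <- table(survived_fct)
print(counts)

# mean() no longer applies: it warns and returns NA for a factor.
avg <- suppressWarnings(mean(survived_fct))
print(avg)
```

The same call to factor(), applied to the real column, would be one way to retype Survived before using methods that expect a category.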
Another example is Name. Its original type is Factor, which treats each distinct name as a category level. However, the Factor type is not well suited to string processing: some string functions reject a factor, which gets in the way of applying regular expressions⁴ to it. So it is appropriate to change it to chr, the character type.
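The sketch below illustrates the point with three invented names. It assumes a "Surname, Title. Given name" layout, which is an assumption made for this example, and uses nchar() as one base R string function that is documented to reject factors.

```r
# Invented names in an assumed "Surname, Title. Given name" layout.
name_fct <- factor(c("Smith, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann"))

# nchar() refuses a factor; catch the error so the script keeps running.
result <- tryCatch(nchar(name_fct), error = function(e) "error")
print(result)

# Convert to character before any string processing.
name_chr <- as.character(name_fct)

# Regular expression: keep the text between the comma and the first period.
titles <- sub("^.*, *([^.]+)\\..*$", "\\1", name_chr)
print(titles)
```

Other functions, such as sub() itself, silently coerce a factor to character, so converting once up front keeps the behaviour predictable.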
There are other inappropriate or wrong attribute types too. For example, SibSp and Parch are currently typed int; maybe they should be considered Factor. It is common practice for data scientists to apply one set of analyses to an attribute, then change the attribute's type and apply different algorithms⁵. The goal is to dig insight out of the data.
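As a rough sketch of switching an attribute between types, the snippet below looks at the same invented counts first as an integer and then as a factor. The values are made up for illustration only.

```r
# Invented values standing in for SibSp (number of siblings/spouses aboard).
sibsp_int <- c(1L, 0L, 0L, 3L, 1L, 0L)

# Integer view: summary() reports quantiles and the mean.
print(summary(sibsp_int))

# Factor view: summary() reports how many rows fall in each level.
sibsp_fct <- factor(sibsp_int)
level_counts <- summary(sibsp_fct)
print(level_counts)

# Converting back: go through character first. Calling as.integer()
# directly on a factor returns the internal level codes instead.
back <- as.integer(as.character(sibsp_fct))
print(back)
```

Keeping both views available lets us use numeric methods on one and category-based methods on the other.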
So, looking at the attributes' types and comparing them with the original meaning of each attribute can help us spot any inappropriate or wrong types.
Is Survived typed