Summary

The analyses in this chapter demonstrate how to access the raw data and assess its quantity and quality. Note that understanding data is never a one-off action. You never fully understand a given dataset. As the analytical process moves on, you may need to come back and apply a new decomposition to some attributes to explore them further.

Our raw data is not large in either the number of records or the number of attributes, so it is relatively easy to assess its quality. In a real-world project, the raw data can be huge or too small. To perform an effective analysis, you may need to reduce the data size or, in other cases, increase it. That means sampling the given datasets and probably selecting attributes too. In other cases, you may need to create new attributes or combine several attributes into one. This is called attribute re-engineering. These are important tasks in data preprocessing, which is covered in the next chapter.
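
As a rough illustration of the three operations just mentioned (sampling, attribute selection, and attribute re-engineering), here is a minimal sketch in base R. It runs on a small made-up data frame rather than the chapter's actual data. The column names follow the public Titanic dataset and are assumptions here, and FamilySize is a hypothetical attribute name chosen only for this example.

```r
# Toy stand-in for the raw data. Column names are assumptions borrowed from
# the public Titanic dataset, not taken from this chapter's code.
set.seed(42)  # make the random draws reproducible
n <- 100
passengers <- data.frame(
  PassengerId = seq_len(n),
  Pclass      = sample(1:3, n, replace = TRUE),
  SibSp       = sample(0:3, n, replace = TRUE),
  Parch       = sample(0:2, n, replace = TRUE),
  Fare        = round(runif(n, min = 5, max = 100), 2)
)

# Sampling: keep a random 20% of the records to reduce the data size.
keep_rows <- sample(nrow(passengers), size = round(0.2 * nrow(passengers)))
sampled <- passengers[keep_rows, ]

# Attribute selection: keep only the columns needed for the analysis.
selected <- sampled[, c("PassengerId", "Pclass", "SibSp", "Parch")]

# Attribute re-engineering: combine two attributes into a new one.
# FamilySize (hypothetical) = siblings/spouses + parents/children + the passenger.
selected$FamilySize <- selected$SibSp + selected$Parch + 1

str(selected)
```

Sampling without replacement, as done here, shrinks a large dataset. Passing replace = TRUE to sample() with a size larger than the number of rows is one simple way to enlarge a small one, although duplicated records add no new information.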

The entire R code for this chapter is available in the file “TitanicDataAnalysis_UnderstandData.R”, which can be found in the appendix.