4.6 Data Record-Level Assessment

Although we have examined the data samples at the attribute level and know how many records the given datasets contain, it is still necessary to check the samples at the record level.

On the one hand, if some records have missing values for too many attributes⁹, they are not useful as samples and may be invalid. They should either be removed from the training dataset or have their missing values filled in before they are used.
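
The check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the toy records, the use of `None` to mark a missing value, the helper names `missing_count` and `split_by_missing`, and the threshold of one missing value per record are all hypothetical choices, not part of the Titanic analysis itself.

```python
# Hypothetical toy records; None marks a missing value (an assumption of this sketch).
records = [
    {"PassengerId": 1, "Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"PassengerId": 2, "Age": None, "Cabin": None, "Embarked": None},
    {"PassengerId": 3, "Age": 26.0, "Cabin": "C85", "Embarked": "C"},
]


def missing_count(record):
    """Return the number of attributes in one record whose value is None."""
    return sum(1 for value in record.values() if value is None)


def split_by_missing(records, max_missing):
    """Split records into (kept, flagged).

    A record is flagged when it has more than max_missing missing values.
    """
    kept, flagged = [], []
    for record in records:
        if missing_count(record) > max_missing:
            flagged.append(record)
        else:
            kept.append(record)
    return kept, flagged


# Illustrative threshold: tolerate at most one missing value per record.
kept, flagged = split_by_missing(records, max_missing=1)
print([r["PassengerId"] for r in flagged])
```

Flagged records are only candidates: whether to drop them or fill in their missing values remains a judgment that depends on the problem.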

On the other hand, some records may have identical values for most attributes. These could be considered duplicates. Depending on the problem to be solved, they could be problematic and would also need to be dealt with.
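
One possible way to look for such records is to compare every pair and measure the share of attributes on which they agree. The sketch below assumes hypothetical toy data, hypothetical helper names (`match_ratio`, `near_duplicates`) and an arbitrary cut-off of 80% agreement; none of these come from the Titanic datasets.

```python
from itertools import combinations

# Hypothetical toy records, used only to illustrate the idea.
records = [
    {"Name": "Smith, Mr. John", "Sex": "male", "Age": 30, "Pclass": 3, "Fare": 8.05},
    {"Name": "Smith, Mr. John", "Sex": "male", "Age": 30, "Pclass": 3, "Fare": 7.25},
    {"Name": "Brown, Miss. Ann", "Sex": "female", "Age": 24, "Pclass": 1, "Fare": 71.0},
]

# Attributes to compare; identifier columns would normally be left out.
attributes = ["Name", "Sex", "Age", "Pclass", "Fare"]


def match_ratio(a, b, attributes):
    """Return the fraction of the given attributes on which two records agree."""
    same = sum(1 for name in attributes if a[name] == b[name])
    return same / len(attributes)


def near_duplicates(records, attributes, min_ratio):
    """Return index pairs (i, j) of records that agree on at least min_ratio of the attributes."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if match_ratio(a, b, attributes) >= min_ratio:
            pairs.append((i, j))
    return pairs


# Illustrative cut-off: records agreeing on 4 of 5 attributes count as near-duplicates.
duplicate_pairs = near_duplicates(records, attributes, min_ratio=0.8)
print(duplicate_pairs)
```

The pairwise comparison grows quadratically with the number of records, so it is suitable only for small datasets. Note also that two missing values would compare as equal here, which may or may not be what is wanted.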

In our Titanic problem, record-level assessment is not an issue since most of the records are different. This does not mean that we should ignore this step completely and skip record-level checking.

⁹ The attribute-level missing value check cannot guarantee that there is no missing-value problem at the record level. It is possible for the missing values of several attributes to be concentrated in the same record.