3.2 Data description
Visual inspection of the data helps us to see patterns and other details. Deciding which type of plot/graph to use depends on the type of data and its purpose
The most useful plots are:
- Histogram – commonly used to visualize the distribution of a variable within consecutive intervals using vertical bars
- Line plot – commonly used to visualize how a variable changes over time (time-series data)
- Scatter plot – commonly used to visualize the relationship between two numerical variables using points
Along with plots, descriptive statistics should be provided for each variable (mean, standard deviation, skewness, kurtosis, \(\dots\)) and the normality of the dependent variable \(y\) should be checked
If the dependent variable \(y\) is not normally distributed, it is unlikely that the random variable \(u\) will follow a standard normal (Gaussian) distribution
The distribution of a variable often deviates from the normal curve due to the presence of extreme values, and therefore usually exhibits high kurtosis and nonzero skewness
A histogram, in particular, may indicate the presence of extreme values above the mean (the distribution has a long right tail), which means that the variable is positively skewed
A histogram may also indicate the presence of extreme values below the mean (the distribution has a long left tail), which means that the variable is negatively skewed
A normal distribution has a skewness of \(0\) and a kurtosis of \(3\)
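The benchmark values above can be checked numerically. The following is a minimal sketch, assuming the moment-based (population) definitions of skewness and kurtosis. The helper name `describe` and the simulated samples are illustrative assumptions, not part of the original material.

```python
import math
import random


def describe(values):
    """Return (mean, std, skewness, kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # non-excess kurtosis, so the normal benchmark is 3
    return mean, std, skewness, kurtosis


# Simulated standard normal sample (fixed seed for reproducibility)
rng = random.Random(0)
normal_sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
_, _, normal_skew, normal_kurt = describe(normal_sample)

# Small hypothetical sample with one extreme value above the mean
right_tailed = [1, 1, 2, 2, 3, 20]
_, _, right_skew, _ = describe(right_tailed)

print(f"normal sample: skewness={normal_skew:.3f}, kurtosis={normal_kurt:.3f}")
print(f"right-tailed sample: skewness={right_skew:.3f}")
```

For the simulated normal sample the two statistics should come out close to \(0\) and \(3\), while the sample with an extreme value above the mean gives a positive skewness. Statistical software often reports excess kurtosis (kurtosis minus \(3\)) instead, in which case the normal benchmark is \(0\).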
If a distribution is positively skewed, it is recommended to transform the variable by taking logs or inverse values; if it is negatively skewed, it is recommended to square the variable or apply a higher-degree power
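The effect of these transformations can be illustrated with a small sketch. The two data sets below are hypothetical, and `skewness` is an illustrative helper based on population central moments. Note that the log transformation requires strictly positive values.

```python
import math


def skewness(values):
    """Moment-based (population) skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed data (long right tail), all values > 0
right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
logged = [math.log(v) for v in right_skewed]

# Hypothetical negatively skewed data (long left tail)
left_skewed = [1, 6, 8, 9, 9, 10, 10, 10]
squared = [v ** 2 for v in left_skewed]

print(f"right-skewed: {skewness(right_skewed):.2f} -> after log: {skewness(logged):.2f}")
print(f"left-skewed: {skewness(left_skewed):.2f} -> after squaring: {skewness(squared):.2f}")
```

In both cases the skewness of the transformed variable is closer to zero than that of the original. With other data the improvement may be smaller, so the transformed variable should be re-examined rather than assumed to be normal.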