5 Describing Data in R
Describing data is the process of summarizing and interpreting information that has been collected through research or observation.
To describe data, we typically use statistical measures such as measures of central tendency (e.g., mean, median, mode) and measures of variability (e.g., range, variance, standard deviation). These measures help us to understand the distribution of the data, including how the data is spread out and whether it is skewed or symmetrical.
When describing data, it’s important to consider the research question or hypothesis that is being investigated. By focusing on the relevant aspects of the data and using appropriate statistical and visual tools, we can better understand the underlying trends and relationships in the data and draw meaningful conclusions that can inform future research or educational practice.
5.1 Central Tendency
Central tendency measures provide a single value that represents the center or “typical” value of a dataset. The primary measures of central tendency are the mean, median, and mode.
5.1.1 Sample Mean
The sample mean, often referred to as the average, is the sum of all data points in a sample divided by the number of data points. It is denoted by the symbol “x̄”.
Because every value contributes to the sum, the mean is sensitive to extreme values (outliers) and may not always represent the center of the data well.
\(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\)
where \(\bar{x}\) is the mean, \(n\) is the total number of data points, and \(x_i\) are the individual data points.
In R, we can calculate the mean using the mean() function.
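A minimal sketch of this calculation follows. The income vector is an assumption: the original data are not shown in this section, so a hypothetical vector was chosen that is consistent with the outputs quoted later in the chapter.

```r
# Hypothetical income vector (an assumption, not taken from the original text)
income <- c(25000, 30000, 35000, 40000, 45000, 50000)

# Sum of all values divided by the number of values
manual_mean <- sum(income) / length(income)
manual_mean
#> [1] 37500

# The built-in function gives the same result
mean(income)
#> [1] 37500
```

If the data contain missing values, mean(income, na.rm = TRUE) drops them before averaging.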
5.1.2 Median
The median is the middle value of a dataset when it is sorted in ascending or descending order. If the dataset has an odd number of data points, the median is the middle value; if it has an even number of data points, the median is the average of the two middle values. The median is less sensitive to extreme values compared to the mean.
With the data sorted in ascending order, it is represented as:
\(\text{median} = \begin{cases} x_{(n+1)/2} &\text{if }n\text{ is odd} \\ \frac{x_{n/2} + x_{(n/2)+1}}{2} &\text{if }n\text{ is even} \end{cases}\)
where \(n\) is the total number of data points, \(x_{(n+1)/2}\) is the middle value of the sorted data when \(n\) is odd, and \(x_{n/2}\) and \(x_{(n/2)+1}\) are the two middle values when \(n\) is even.
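In R, the median can be computed with the median() function. The sketch below uses a hypothetical income vector (an assumption, not data from the original text) to illustrate the odd and even cases and to show how a single outlier moves the mean far more than the median.

```r
# Hypothetical income vector (an assumption for illustration)
income <- c(25000, 30000, 35000, 40000, 45000, 50000)

# Even n: average of the two middle values, 35000 and 40000
median(income)
#> [1] 37500

# Odd n: the single middle value
median(c(25000, 30000, 35000, 40000, 45000))
#> [1] 35000

# Add one extreme value and compare the two measures
income_outlier <- c(income, 1000000)
mean(income_outlier)
#> [1] 175000
median(income_outlier)
#> [1] 40000
```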
5.2 Measures of Dispersion
Measures of dispersion are statistical values that describe the degree of spread or variability of a dataset. There are several measures of dispersion, including the range, interquartile range, variance, and standard deviation. In this section, we will discuss each measure of dispersion and its formula in detail.
5.2.1 Range
The range is the simplest measure of dispersion that gives the difference between the maximum and minimum values in a dataset. It provides an idea of how spread out the data is. However, it is highly sensitive to outliers and does not provide information about the distribution’s shape. The formula for the range is as follows:
\[\begin{equation} \text{Range} = \max(X) - \min(X) \end{equation}\]
In R, the range() function returns a vector containing the minimum and maximum values of a dataset, not their difference. To obtain the range as a single number, apply diff() to that result or subtract min() from max().
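The following sketch shows what range() returns and two ways to get the range as a single value. The income vector is hypothetical and is used only for illustration.

```r
# Hypothetical income vector (an assumption for illustration)
income <- c(25000, 30000, 35000, 40000, 45000, 50000)

# range() returns the minimum and the maximum
range(income)
#> [1] 25000 50000

# Difference between maximum and minimum
diff(range(income))
#> [1] 25000
max(income) - min(income)
#> [1] 25000
```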
5.2.2 Interquartile Range (IQR)
The interquartile range (IQR) is a measure of dispersion that indicates the spread of the middle 50% of the data. It is less sensitive to outliers than the range. The IQR is defined as the third quartile (Q3, the 75th percentile) minus the first quartile (Q1, the 25th percentile) of a dataset. The formula for IQR is as follows:
\[\begin{equation} IQR = Q3 - Q1 \end{equation}\]
where Q1 is the first quartile and Q3 is the third quartile of the dataset.
In R, the IQR() function computes Q3 - Q1 directly. By default it uses the same quartile definition as quantile() (type 7), so the result may differ slightly from quartiles computed by hand with another method.
IQR(income)
#> [1] 12500
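To see where this number comes from, the quartiles can be computed explicitly with quantile(). The sketch below assumes a hypothetical income vector, chosen because it is consistent with the output above. The vector actually used in the original is not shown in this section.

```r
# Hypothetical income vector (an assumption; reproduces the output above)
income <- c(25000, 30000, 35000, 40000, 45000, 50000)

# First and third quartiles (default quantile type 7)
q <- quantile(income, probs = c(0.25, 0.75))
unname(q)
#> [1] 31250 43750

# IQR computed by hand as Q3 - Q1
unname(q[2] - q[1])
#> [1] 12500

# Same value from the built-in function
IQR(income)
#> [1] 12500
```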
5.2.3 Variance
The variance is a measure of dispersion that quantifies the average squared deviation of data points from the mean. It measures how far a set of numbers is spread out from their average value. Because the deviations are squared, the variance is expressed in squared units and is strongly influenced by outliers. There are two types of variance: population variance and sample variance.
Population variance is used when the entire population is available, while sample variance is used when only a sample of the population is available. The formula for population variance is as follows:
\[\begin{equation} \sigma^2 = \frac{\sum(X - \mu)^2}{N} \end{equation}\]
where X is the data point, μ is the mean of the dataset, and N is the total number of data points in the population.
The formula for sample variance is slightly different and is as follows:
\[\begin{equation} s^2 = \frac{\sum(X - \bar{x})^2}{n - 1} \end{equation}\]
where X is the data point, \(\bar{x}\) is the mean of the sample, and n is the sample size.
The var() function is used to calculate the variance. It computes the sample variance, dividing by n - 1.
# Calculate variance
var(income)
#> [1] 87500000
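The sketch below applies the sample variance formula by hand and compares it with var(). It also computes the population variance, for which base R has no dedicated function. The income vector is hypothetical, chosen because it is consistent with the output above.

```r
# Hypothetical income vector (an assumption; reproduces the output above)
income <- c(25000, 30000, 35000, 40000, 45000, 50000)
n <- length(income)

# Sum of squared deviations from the mean
ss <- sum((income - mean(income))^2)

# Sample variance: divide by n - 1
sample_var <- ss / (n - 1)
sample_var
#> [1] 87500000

# var() returns the same value
var(income)
#> [1] 87500000

# Population variance: divide by n instead
pop_var <- ss / n
```

The population variance is always smaller than the sample variance by a factor of (n - 1) / n, which matters most for small samples.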
5.2.4 Standard Deviation
The standard deviation is the square root of the variance. It is a widely used measure of dispersion that indicates how far data points typically lie from the mean. It is expressed in the same units as the data and is sensitive to outliers, just like variance. There are two types of standard deviation: population standard deviation and sample standard deviation.
The formula for population standard deviation is as follows:
\[\begin{equation} \sigma = \sqrt{\sigma^2} \end{equation}\]
where \(\sigma^2\) is the population variance.
The formula for sample standard deviation is slightly different and is as follows:
\[\begin{equation} s = \sqrt{s^2} \end{equation}\]
where \(s^2\) is the sample variance.
The sd() function is used to calculate the standard deviation. It computes the sample standard deviation, the square root of the value returned by var().
# Calculate standard deviation
sd(income)
#> [1] 9354.143
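As a quick check of the relationship between the two measures, the sketch below compares sd() with the square root of var() and derives the population standard deviation by hand. The income vector is hypothetical, chosen because it is consistent with the output above.

```r
# Hypothetical income vector (an assumption; reproduces the output above)
income <- c(25000, 30000, 35000, 40000, 45000, 50000)
n <- length(income)

# Sample standard deviation is the square root of the sample variance
sqrt(var(income))
#> [1] 9354.143
sd(income)
#> [1] 9354.143

# Population standard deviation: square root of the population variance
pop_sd <- sqrt(sum((income - mean(income))^2) / n)
```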