# Chapter 2 Descriptive Statistics

Descriptive statistics is used to present the characteristics of a data set or a sample in a clear and structured way → *It describes the data.*

For that purpose, we use:

- frequency tables
- graphs (diagrams)
- measures of central tendency
- measures of dispersion

## 2.1 Levels of measurement (scales)

The scale level determines which operations are possible and which statistical test should be selected.

### 2.1.1 Nominal scale

Nominal scale means that the characteristics (=values) of a variable can be categorized but there is **no clear order between the categories**. We can compare frequencies, but there is no hierarchy.

Examples:

- gender
- occupation
- study field
- marital status

### 2.1.2 Ordinal scale

The ordinal scale is similar to the nominal scale, but the measured values can be brought into a logical order. While **ranking is possible**, the **distances between values are not necessarily equal**.

Examples:

- newspaper consumption
- frequency of social media use

→ e.g., often, sometimes, rarely, never

### 2.1.3 Quasi-metric ordinal scale

A quasi-metric ordinal scale is a special case of the ordinal scale in which we assume equal distances between values, which allows us to calculate specific statistics such as the mean. A typical example is a *Likert scale* (“On a scale from 1 to 7 indicate how much you agree with the following statements”).

### 2.1.4 Interval

On an interval scale, the **differences between two values are equal and meaningful**. We can not only make statements about the ranking of the objects, but also about the size of their distances, i.e., the difference between the two values (the interval).

Examples:

- standardized IQ points
- birth year

### 2.1.5 Ratio

In contrast to the interval scale, a ratio scale has an absolute zero point, which means a total lack of the quantity. For example, length is measured on a ratio scale because 0 cm means absence of length. There can also be no negative values (you cannot have a length of -X cm).

Examples:

- length
- weight
- age

## 2.2 Frequency tables

A frequency table describes how the values of a variable are distributed in the dataset, e.g., a frequency table for the variable *education*.

- *Absolute frequency* = how many times a value is observed
- *Relative frequency* = how many times a value is observed relative to the total number of observed values (i.e., percentage of the total)
- *Cumulative frequency* = sum of the frequencies of a class and all the classes below it
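As a minimal base R sketch of these three quantities, the following uses a small made-up `education` vector (the values are hypothetical and not taken from the course data):

```r
# Hypothetical answers to an education question (not the course data)
education <- factor(
  c("primary", "secondary", "secondary", "tertiary", "secondary", "tertiary"),
  levels = c("primary", "secondary", "tertiary")
)

abs_freq <- table(education)            # absolute frequency: counts per category
rel_freq <- prop.table(abs_freq) * 100  # relative frequency in percent
cum_freq <- cumsum(abs_freq)            # cumulative frequency: running total of the counts

# Combine the three columns into one frequency table
data.frame(
  Frequency  = as.vector(abs_freq),
  Percent    = round(as.vector(rel_freq), 1),
  Cumulative = as.vector(cum_freq),
  row.names  = names(abs_freq)
)
```

The cumulative column is only meaningful when the categories have an order, which is why the factor levels are set explicitly here.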

There are many ways to create frequency tables in R. The following examples use the dataset `Mobile_phone_use.sav`, which you can find on Moodle. To be able to work with this data set, we first need to load it into R. It is a .sav file, which can be loaded using the `haven` package:
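The loading step might look like the following sketch. It assumes that the file has been downloaded from Moodle into the working directory (the path is an assumption); the guard simply skips the step if the `haven` package or the file is missing.

```r
# Assumption: the file sits in the working directory (e.g., your R project folder)
path <- "Mobile_phone_use.sav"

if (requireNamespace("haven", quietly = TRUE) && file.exists(path)) {
  Mobile_phone_use <- haven::read_sav(path)  # read the SPSS (.sav) file into a data frame
  head(Mobile_phone_use)                     # inspect the first rows
}
```

In an interactive session you would typically call `library(haven)` once and then use `read_sav()` directly.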

### 2.2.1 Option 1: using the `tab1()` function from the `epiDisplay` package

```
install.packages("epiDisplay")
library(epiDisplay)
tab1(Mobile_phone_use$gender, graph = FALSE)
#If you remove the graph = FALSE argument, the function will also generate a bar chart
```

This function will give us the absolute frequency as well as the relative frequency (%) for the variable including (NA+) as well as excluding (NA-) the N/A values (not available), i.e., missing values:

### 2.2.2 Option 2: using the `group_by()` and `summarise()` functions from the `dplyr` package

```
install.packages("dplyr")
library(dplyr)
gender_freq_perc <- Mobile_phone_use %>%
  group_by(gender) %>%
  summarise(Frequency = n(), .groups = "drop") %>%
  mutate(Percent = Frequency / sum(Frequency) * 100)
gender_freq_perc
```

```
# A tibble: 3 × 3
gender Frequency Percent
<dbl+lbl> <int> <dbl>
1 1 [female] 24 52.2
2 2 [male] 20 43.5
3 NA 2 4.35
```

There are therefore slightly more women than men in our sample (note the coding in the questionnaire!). Two people did not specify their gender.

## 2.3 Graphs and charts

### 2.3.1 Pie Chart

Pie charts can be used for **nominal** and **ordinal** data. However, generally, it is **not recommended** to use the pie chart for displaying data.

Again, there are many ways to generate a pie chart in R.

#### 2.3.1.1 Option 1: Using the `pie()` function

We will use the `table()` function within the `pie()` function. These functions are part of base R and thus no packages are needed. By specifying `useNA = "ifany"`, we tell R to also display NAs, if there are any:
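A minimal sketch of this approach, using a short hypothetical vector of gender codes instead of the course data:

```r
# Hypothetical gender codes (1 = female, 2 = male, NA = no answer); not the course data
gender <- c(1, 2, 1, 1, 2, NA, 2, 1)

gender_counts <- table(gender, useNA = "ifany")  # count each category and keep the NAs
pie(gender_counts, main = "Gender")              # draw the pie chart from the counts
```

With the course data, the vector would be replaced by the corresponding column, e.g. `Mobile_phone_use$gender`.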

#### 2.3.1.2 Option 2: Using the `ggplot()` function from the `ggplot2` package

```
install.packages("ggplot2")
library(ggplot2)
ggplot(gender_freq_perc, aes(x = "", y = Frequency, fill = as.factor(gender))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
```

We could also include percentages instead of absolute frequencies:

```
ggplot(gender_freq_perc, aes(x = "", y = Percent, fill = as.factor(gender))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0)
```

There are many ways to customize your charts and graphs in R.

`ggplot2` is the most common solution, with which you can create countless different types of diagrams and customize them in every conceivable way. There is a lot of helpful information on the Internet, e.g., https://r-graph-gallery.com/index.html and https://r-charts.com/ggplot2/.

### 2.3.2 Bar Chart

Bar charts can be used for **nominal** and **ordinal** data. The height of the columns illustrates frequencies and the width has no meaning.

#### 2.3.2.1 Option 1: Using the `barplot()` function

We will use the `table()` function within the `barplot()` function. These functions are part of base R and thus no packages are needed. By specifying `useNA = "ifany"`, we tell R to also display NAs, if there are any:
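A minimal sketch of this approach, again with hypothetical gender codes rather than the course data (the category labels in `names.arg` assume the coding 1 = female, 2 = male):

```r
# Hypothetical gender codes (1 = female, 2 = male, NA = no answer); not the course data
gender <- c(1, 2, 1, 1, 2, NA, 2, 1)

gender_counts <- table(gender, useNA = "ifany")  # count each category and keep the NAs

# names.arg replaces the numeric codes with readable labels (check the coding first!)
barplot(gender_counts,
        names.arg = c("female", "male", "no answer"),
        main = "Frequency of gender", ylab = "Frequency")
```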

#### 2.3.2.2 Option 2: Using the `ggplot()` function from the `ggplot2` package

```
ggplot(Mobile_phone_use, aes(x = as.factor(gender))) +
  geom_bar() +
  # we can customize the axis labels and the chart title
  labs(title = "Frequency of Gender", x = "Gender", y = "Frequency") +
  # we can also rename the categories - ATTENTION: note the coding in the questionnaire!
  scale_x_discrete(labels = c("female", "male", "no answer")) +
  theme_minimal()
```

### 2.3.3 Histogram

Histograms can be used for **metric** and **quasi-metric** variables. The area of the columns (height x width) illustrates the frequencies, so, unlike in a bar chart, the width of the columns is meaningful.
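Base R also offers the `hist()` function for a quick histogram. The following sketch uses hypothetical ages (not the course data) and classes of equal width, so that the column heights are directly proportional to the frequencies:

```r
# Hypothetical ages in years (not the course data)
age_years <- c(19, 21, 22, 22, 24, 25, 27, 30, 34, 41, 45, 52)

# breaks sets the class boundaries; here classes of width 10 from 10 to 60
h <- hist(age_years, breaks = seq(10, 60, by = 10),
          main = "Distribution of age", xlab = "Age in years", ylab = "Frequency")

h$counts  # number of observations in each class
```

By default, each class includes its upper boundary, so a value of 30 is counted in the class from 20 to 30.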

#### 2.3.3.2 Option 2: many possibilities with `ggplot`

```
ggplot(Mobile_phone_use, aes(x = age_years)) +
  # with binwidth we can set the width of the "groups", i.e. the accuracy of the diagram
  geom_histogram(binwidth = 8, fill = "#69b3a2", color = "#e9ecef") +
  labs(title = "Distribution of age", x = "Age in years", y = "Frequency") +
  # with breaks we can set how many points we want to have on the axis
  scale_x_continuous(breaks = seq(10, 70, by = 2))
```

### 2.3.4 Boxplot

Boxplots can be used for **metric** and **quasi-metric** variables. They show the median, the quartiles and outliers.

**Outliers** are single unusually large/small measurements. They can be caused by measurement or survey input errors. Outliers should always be closely examined and possibly removed from the dataset for further analysis.
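As a sketch of how outliers can be spotted in base R: by default, `boxplot()` marks points that lie more than 1.5 times the box length beyond the box, and `boxplot.stats()` returns those points. The ages below are hypothetical, with one deliberately extreme value.

```r
# Hypothetical ages with one unusually large value (not the course data)
age_years <- c(19, 21, 22, 22, 24, 25, 27, 30, 34, 85)

boxplot(age_years, main = "Age in years")  # the extreme value is drawn as a separate point

outliers <- boxplot.stats(age_years)$out   # values flagged by the default 1.5 x box length rule
outliers
```

Whether a flagged value is an error or a genuine observation still has to be checked by hand.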

#### 2.3.4.2 Option 2: many possibilities with `ggplot`

It becomes more interesting when we look not just at one metric variable, but how it relates to different groups, e.g. how age is distributed according to gender.

```
ggplot(Mobile_phone_use, aes(x = as.factor(gender), y = age_years)) +
  geom_boxplot() +
  ggtitle("Age in years by gender") +
  labs(x = "Gender", y = "Age") +
  scale_x_discrete(labels = c("female", "male", "no answer"))
```

The men in our sample are on average slightly older than the women, and the age distributions of the two groups differ.

## 2.4 Measures of central tendency

We use three measures of central tendency: **mode**, **median** and **mean**.

### 2.4.1 Mode

Mode is the **most frequent score**, i.e., the value that appears the most often within a variable. Usually, it does not have to be calculated because it can be easily read in a frequency table or graphs.
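If the mode is needed as a value in R, it can be read off a frequency table. Note that base R's `mode()` function returns the storage type of an object, not the statistical mode. A minimal sketch with hypothetical answers:

```r
# Hypothetical answers (not the course data)
social_media_use <- c("often", "sometimes", "often", "rarely",
                      "often", "never", "sometimes")

freq <- table(social_media_use)  # frequency of each answer

# Category (or categories, in case of ties) with the highest frequency
mode_value <- names(freq)[as.vector(freq) == max(freq)]
mode_value
```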

### 2.4.2 Median

Median is the **middle score**, meaning that it divides the data right in the middle, so that 50% of the values are above the median and 50% below it. To determine the median, all values are sorted by size (the variable needs to be at least ordinal). In the first example, the middle score is 170. To determine the median for the second example, which has an even number of values, we need to add the two middle values and divide the sum by 2; the resulting value is then the median.

We can use the function `median()` to calculate the median in R.
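A short sketch with hypothetical height values (in cm, chosen for illustration only), one vector with an odd and one with an even number of values:

```r
# Hypothetical heights in cm (illustration only)
heights_odd  <- c(180, 160, 170, 175, 165)       # five values
heights_even <- c(180, 160, 170, 175, 165, 190)  # six values

median(heights_odd)   # middle value of the sorted data
median(heights_even)  # mean of the two middle values of the sorted data

# Missing values must be removed explicitly, otherwise the result is NA
median(c(heights_odd, NA), na.rm = TRUE)
```

`median()` sorts the values internally, so the data do not have to be sorted beforehand.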

### 2.4.3 Mean

Mean is the **average value**. We calculate it by summing up all the values and then dividing this sum by the number of values. Because every value enters the sum, the mean is quite vulnerable to outliers.

We can use the function `mean()` to calculate the mean in R.
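The following sketch uses hypothetical ages to show the calculation and how a single extreme value pulls the mean, but hardly the median:

```r
# Hypothetical ages (not the course data)
age_years <- c(20, 22, 24, 26, 28)

mean(age_years)                      # built-in function
sum(age_years) / length(age_years)   # the same calculation by hand

# Adding one extreme value shifts the mean strongly, the median only slightly
age_with_outlier <- c(age_years, 90)
mean(age_with_outlier)
median(age_with_outlier)
```

As with `median()`, use `na.rm = TRUE` if the variable contains missing values.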

We can also use the function `summary()` to see the minimum, maximum, quartiles, median and the mean value of a variable:
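A minimal sketch with hypothetical ages, including one missing value to show that `summary()` reports the number of NAs separately:

```r
# Hypothetical ages with one missing value (not the course data)
age_years <- c(19, 21, 22, 24, 25, 27, 30, NA)

age_summary <- summary(age_years)  # minimum, quartiles, median, mean, maximum and NA count
age_summary
```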

## 2.5 Measures of dispersion

We use four measures of dispersion: **range**, **interquartile range**, **variance**, and **standard deviation**.

### 2.5.1 Range

Range is the **difference between largest and smallest value**. Therefore, it provides information about the two values of a distribution between which all values lie. This measure is thus vulnerable to outliers.

The range of a set of values is calculated as \[ R = x_{\text{max}} - x_{\text{min}} \] where \(x_{\text{max}}\) is the maximum value in the data set, and \(x_{\text{min}}\) is the minimum value.
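In R, note that the `range()` function returns the minimum and the maximum rather than their difference. A short sketch with hypothetical ages:

```r
# Hypothetical ages (not the course data)
age_years <- c(19, 21, 22, 24, 25, 27, 30)

range(age_years)                      # returns the minimum and the maximum

R <- max(age_years) - min(age_years)  # range as defined by the formula
R

diff(range(age_years))                # equivalent one-liner
```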

### 2.5.2 Interquartile range

Interquartile range (IQR) is the range of the **middle 50% of the values**. It is relatively resistant to outliers. To calculate it, we use **quartiles**, which divide the data into four parts:

The interquartile range of a set of values is calculated as \[ IQR = Q_{.75} - Q_{.25} \] where \(Q_{.75}\) is the value of the third quartile (75th percentile), and \(Q_{.25}\) is the value of the first quartile (25th percentile).
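A sketch of the calculation in R with hypothetical ages. R's `quantile()` supports several definitions of quartiles; the default is used here, so results can differ slightly from a calculation by hand.

```r
# Hypothetical ages (not the course data)
age_years <- c(19, 21, 22, 24, 25, 27, 30)

q <- quantile(age_years, probs = c(0.25, 0.75))  # first and third quartile
q

iqr_manual <- q[["75%"]] - q[["25%"]]  # third quartile minus first quartile
iqr_manual

IQR(age_years)                         # built-in function, same result
```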

### 2.5.3 Variance and standard deviation

Variance and standard deviation are both measures used to determine the **spread of data around the mean**.

**Variance** is a statistical measurement that gauges the dispersion between values in a variable, essentially measuring how far the values are from the mean (that is, the average squared deviation of the observations from the mean). Variance uses *squared values* and is thus vulnerable to outliers.

We can use the `var()` function to calculate the variance of a variable.
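R's `var()` computes the sample variance, \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \), i.e., it divides by \(n - 1\) rather than \(n\). The following sketch with hypothetical ages reproduces the result step by step:

```r
# Hypothetical ages (not the course data)
age_years <- c(19, 21, 22, 24, 25, 27, 30)

n <- length(age_years)
deviations <- age_years - mean(age_years)   # distance of each value from the mean
manual_var <- sum(deviations^2) / (n - 1)   # sum of squared deviations divided by n - 1
manual_var

var(age_years)                              # built-in function, same result
```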

**Standard deviation (SD)** describes the average dispersion of the measured values of a variable around the mean value of a distribution (=tells us how much the data varies from the average).

If the SD is *small*, it means most of the numbers are *close to the average*. If the SD is *large*, the numbers are more spread out, *far from the average*. The closer the numbers are to the average, the more accurately the average represents the objects (data) as a whole, or the better the value can be generalized.

We can use the `sd()` function to calculate the standard deviation of a variable.
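The standard deviation is the square root of the variance, which brings the measure back to the original unit of the variable. A short sketch with hypothetical ages:

```r
# Hypothetical ages (not the course data)
age_years <- c(19, 21, 22, 24, 25, 27, 30)

sd_builtin <- sd(age_years)         # built-in function
sd_manual  <- sqrt(var(age_years))  # square root of the variance

sd_builtin
sd_manual
```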

We can also use the `describe()` function from the `psych` package to see the standard deviation (SD), range, standard error (SE) and other statistics of a variable. If we include the `IQR=TRUE` argument, we can also see the IQR.

## 2.6 Participation Exercise 2

In this exercise, you need to:

- create a frequency table
- calculate measures of central tendency and dispersion
- create a chart to visualize the descriptives

### 2.6.1 Step-by-step instructions

- Create a new R script in your STADA project: *File -> New File -> R Script*
- First, you need to install all the packages that you will need in this exercise. The following code will install the packages you do not have yet.

```
if (!require("haven")) install.packages("haven")
if (!require("psych")) install.packages("psych")
if (!require("epiDisplay")) install.packages("epiDisplay")
if (!require("summarytools")) install.packages("summarytools")
if (!require("dplyr")) install.packages("dplyr")
if (!require("ggplot2")) install.packages("ggplot2")
```

- Next, you also need to load the packages:

```
library("haven")
library("psych")
library("epiDisplay")
library("summarytools")
library("dplyr")
library("ggplot2")
```

If needed, you can of course install and load more packages later on.

- Now you need to load the data set into R. We will be working with the Media_use.sav file. Use the function `read_sav()`.
- Inspect the data set using the `head()` function.
- Create a frequency table for the variable *newspaper_use*.
- Calculate the mean age of the participants (variable *age*).
- For the variable *newspaper_use*, calculate the following measures. Copy and paste the following code (comments) into your script and fill in the resulting values.

- Lastly, create a meaningful graphic for the variables *gender* and *newspaper_use*.

Tip: take a closer look at the variables and their level of measurement first.

- Save the R script as *YourName_PE2.R* and upload it to Moodle.