# Analysing Data using Linear Models

*With applications in R*

*2024-06-04*

# Preface

## Target audience

This book is for bachelor students in social, behavioural and management sciences who want to learn how to analyse their data, with the specific aim of answering research questions. The book has a practical take on data analysis: how to do it, how to interpret the results, and how to report the results. All techniques are presented within the framework of linear models: this includes simple and multiple regression models, linear mixed models and generalised linear models. This approach is illustrated using R.

## Why linear models?

Linear models give students a great start in the world of data analytics. Starting from the linear regression model, there is a whole world of versatile models to explore that can handle almost any data problem in the social sciences. Linear models also give an entry into the world of machine learning, as many of them are used in that context too. Being familiar with linear regression and logistic regression gives students a head start, and something to build on, when moving to that field later on.

This book is unique in that it is about linear models but has a non-technical focus. Many texts on linear models start from linear algebra with complicated formulas for vectors and matrices. In contrast, this text sticks to relatively simple regression equations and as far as formulas are concerned, they do not go beyond addition, multiplication, division, taking the root or taking the square of something. Anybody with a diploma from secondary education should be able to follow the equations, maybe with a little assurance from a teacher.

## R

Although this book is about analysing data using R, it is not a book on R itself. Online there are many resources to get a first introduction to R. In this book we provide the student with example R code that can be copied and tweaked to make it usable for their own datasets. In order to understand the code and to be able to use it, the student should get acquainted with the tidyverse way of coding in R, and particularly the pipe operator `%>%`. The most basic functions that we use in this book are `mutate()`, `filter()`, `select()`, `pivot_longer()`, `pivot_wider()`, `group_by()`, `summarise()` and `ggplot()`. The student should also be familiar with the difference between numeric variables and factors, know how to read in data files, and how to install and work with R packages. The rest is explained in this book.
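To give an impression of what the tidyverse style looks like, here is a minimal sketch that chains a few of these functions with the pipe operator. It is an illustration only, not code taken from later chapters: it assumes the `dplyr` package is installed and uses `mtcars`, a data set that ships with R. The variable names `wt_kg`, `mean_mpg`, `n_cars` and `result` are made up for this example.

```r
library(dplyr)

# mtcars ships with R; it is used here purely as an illustration
result <- mtcars %>%
  filter(hp > 100) %>%                    # keep cars with more than 100 horsepower
  mutate(wt_kg = wt * 1000 * 0.4536) %>%  # wt is recorded in 1000 lbs; convert to kg
  select(cyl, mpg, wt_kg) %>%             # keep only the columns we need
  group_by(cyl) %>%                       # one group per number of cylinders
  summarise(mean_mpg = mean(mpg),         # average miles per gallon per group
            n_cars = n())                 # number of cars per group

result
```

Each line takes the result of the previous line and passes it on, so the code can be read from top to bottom as a sequence of steps.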

## How to read the book

Finally, some important tips on how to read this book. Although the focus is non-technical, some things need to be explained. But it is often hard for a student to figure out to what extent something should be understood in order to put the theory into daily practice. To guide the student, at the end of every chapter there is a list of take-away points. They form the essence of the learning goals. There is also a list of key concepts. If the student reads these take-away points and key concepts and feels they understand what is meant by them, they can stop reading and try to put the theory into practice by analysing some data on their own. If they get stuck, they can go back to the text again to see where they missed some key insights. Remember: analysing data is a skill that you learn by doing.

In order to further help the student to distinguish between what is essential and what is more detailed background information, we put the more detailed background information into grey boxes. For the day-to-day application of linear models, it is sufficient to understand what is in the main text. Whenever a student wants more explanation of how things actually work and why things are as they are, this can be found in the grey boxes. There are also a couple of sections that can be skipped entirely without losing track of the narrative of the book. These sections are indicated by having “(advanced)” in their title.

## Note on statistical reporting and scientific notation

When analysing data with R, the output shows a lot of numbers, with varying numbers of decimals. When reporting the output, we adhere to the style proposed by the American Psychological Association, which dictates that statistics should be reported to at most 2 decimals, and \(p\)-values to 2 or 3. Here we use 3 decimals for \(p\)-values.

Further, when numbers become either very large or very small, R shows output in scientific notation. If R shows a number like `8.77e-02`, this should be read as \(8.77 \times 10^{-2}\), which is equivalent to 0.0877. The easiest way to think about it is that you take the number before the \(e\) and then if you see a \(-2\) after the \(e\), you move the decimal dot in the number two places to the left. If you see a \(+2\) after the \(e\), you move the decimal dot two places to the right. In this way, `8.77e+01` should be read as 87.7, and `8.77e+02` should be read as 877. `8.77e+00` should be read as 8.77.
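The reading rule can be checked in R itself. The sketch below uses the base R function `format()` with its `scientific` argument to switch between the two notations; the variable names are only illustrative.

```r
x <- 8.77e-02

# The same number, written in scientific and in ordinary notation
sci   <- format(x, scientific = TRUE)    # "8.77e-02"
plain <- format(x, scientific = FALSE)   # "0.0877"

# Moving the decimal dot to the right for positive exponents
large <- format(8.77e+02, scientific = FALSE)  # "877"

print(c(sci, plain, large))
```

Whichever notation R chooses for its output, the underlying number is the same; only the way it is printed differs.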

## Disclaimer

Some of the data sets used in this book have been generated for the sole purpose of demonstrating statistical principles. These are not real data and no conclusions should be based on them.

## Acknowledgements

Earlier editions of this work, based on Sweave files, were supported by the Faculty of Behavioural, Management and Social Sciences (BMS) at the University of Twente, the Netherlands. The current bookdown edition was generously supported by a BMS WSV grant awarded to the author. Jolien van Straalen-Pas and Marian van Dijk offered many suggestions that improved the text substantially. Others, including students, spotted a lot of errors, big and small, and made helpful suggestions. For the errors remaining, the author takes all responsibility. Part of the work was done while hosted at the Biostatistics department at the University of Southern Denmark.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.