1 Introduction

When there is a relationship between two or more variables, it means that one variable has an effect on the other variable(s). Two techniques can be used to examine the relationship between variables: regression and correlation. Regression analysis is the process of constructing a mathematical equation (also known as a statistical model or regression model) that expresses the relationship between variables. This equation or model can also be used to predict one variable from another. Correlation analysis is used to measure the strength of the relationship between variables (i.e. how strongly the variables are associated).
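To make the distinction concrete, here is a minimal sketch in plain Python. It assumes ordinary least squares as the way to fit the line and the Pearson coefficient as the measure of correlation; the text has not yet named a specific method, so both are assumptions. The data (hours studied and exam marks) are invented purely for illustration.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares intercept a and slope b for the line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def correlation(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied (x) and exam mark (y).
hours = [2, 4, 6, 8, 10]
marks = [50, 55, 65, 70, 80]

a, b = fit_line(hours, marks)    # regression: the equation y = a + b*x
r = correlation(hours, marks)    # correlation: strength of the relationship
predicted = a + b * 7            # predicted mark for 7 hours of study

print(f"y = {a:.2f} + {b:.2f}x, r = {r:.3f}, prediction at x = 7: {predicted:.2f}")
```

Regression produces the equation (and with it the prediction), while correlation produces a single number summarising how strong the relationship is.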

1.1 General Examples of Relationships

There may be a relationship between the speed a car travels and the amount of fuel it uses.
There may be a relationship between how much studying you do and how well you do in your exams.
There may be a relationship between the inflation rate and the interest rate the banks charge.
There may be a relationship between the number of years of education a worker has and the worker’s income.

To study these kinds of relationships in statistics, we have to consider two variables at once. For the first example, the two variables are ‘speed’ and ‘amount of fuel’. For the second example, the two variables are ‘time studied’ and ‘marks achieved’; for the third, ‘inflation rate’ and ‘interest rate’; and for the last example, ‘years of education’ and ‘income’. So the variables are two things that are measured together.

1.2 More About the Variables

OK, now we know that when a relationship exists between two variables, it means that one variable has an influence (or effect) on the other variable, but the question is: which variable affects which one? Looking at the first example in the previous section, we may ask: does the ‘speed of the car’ affect the ‘amount of fuel used’ or does the ‘amount of fuel used’ affect the ‘speed of the car’? Clearly, in this case, the ‘speed of the car’ affects the ‘amount of fuel used’.

The variable that has an influence on the other variable, or the variable that causes the effect (speed of the car), is called the independent variable and is denoted by x. The other variable (amount of fuel used) is called the dependent variable and is denoted by y.

Here are some important characteristics about these two variables that you have to know:

The independent variable (x) always influences the dependent variable (y), and not the other way round. In other words, changes in the values of the x variable cause changes in the values of the y variable.
We want to predict the dependent variable y by making use of known values of the independent variable x.
Characteristics of the y variable (dependent variable) are that it can change often, its value is unknown and it cannot be controlled by the experimenter.
Characteristics of the x variable (independent variable) are that it is usually fixed at a certain time, its value is known and it can be controlled by the experimenter.
In each sample of data there is always only one dependent variable y.
In each sample of data there may be one or more independent variables. A single independent variable in a data set is denoted by x, whereas several independent variables (say there are four) are denoted by $x_{1}, x_{2}, x_{3}$ and $x_{4}$.

Example 1.1 Suppose we want to study the relationship between the sales revenue of a product and the amount of money spent on advertising the product. In this case y = sales revenue and x = amount spent on advertising. The aim is to predict the sales revenue (y) of the product by making use of a known amount to be spent on advertising (x). For example, if a company spends R20 000 on advertising per month, what will its sales revenue for that month be? But is the amount spent on advertising the only factor that influences the sales revenue of the product? Of course not! You will agree that sales depend on many factors other than advertising expenditure. For example, the time of year, the state of the general economy and the price may also influence sales. Therefore we have to take all of these factors into account when we want to predict the sales revenue. To make the most accurate prediction, we have to study the relationship between the dependent variable and all the independent variables mentioned above.

Through this study we want to find a statistical model or mathematical equation that best fits the data; this process is known as regression analysis. We then use this equation to predict the dependent variable y.
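As a rough illustration of what "finding the equation that best fits the data" can look like with more than one independent variable, the sketch below fits y = b0 + b1·x1 + b2·x2 by least squares in plain Python. Least squares is an assumption here, since the text has not yet named a fitting method. The helper names `solve` and `fit_multiple`, the choice of advertising and price as the two x variables, and all the numbers are hypothetical; the sales values are generated from a made-up rule so that the fit is easy to check.

```python
def solve(a, b):
    """Solve the linear system a * coef = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept b0
    k = len(design[0])
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data: (advertising in thousands of rand, price in rand) per month.
rows = [(10, 50), (15, 50), (20, 45), (25, 45), (30, 40), (20, 55)]
# Made-up rule used only to generate illustrative sales figures.
sales = [100 + 4 * adv - 1.5 * price for adv, price in rows]

b0, b1, b2 = fit_multiple(rows, sales)
# Predict sales for R20 000 of advertising (x1 = 20) at a price of 50.
predicted = b0 + b1 * 20 + b2 * 50
print(b0, b1, b2, predicted)
```

Because the illustrative sales figures contain no noise, the fitted coefficients come out close to the rule that generated them; with real data the fit would only approximate the observations.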

Example 1.2 Determine the dependent (y) and independent variables (x) for a possible relationship between

1. consumption (spending of money) and salary
2. a company’s television advertising expenditure, media advertising expenditure and sales
3. age, blood pressure, height of persons
4. monthly income for a household, number of persons in the household, electric bill at end of month, living area (in square meters) of the house

Solution It is usually easier to identify the y variable first.

1. y = consumption: spending is not controllable (you do not spend the same amount every month); it is usually not known (you don’t know what you are going to spend next month); it varies from one month to another.

x = salary: salary can be controlled; it is known; it is fixed (your salary does not vary from one month to another).

2. y = sales: a company has little control over sales; it varies from month to month; it is not known (the company does not know what its sales will be next month).

$x_{1}$ = television: the amount is controllable (the company decides how much to spend on television advertising); it is known; it is fixed (they usually spend the same amount every month).

$x_{2}$ = media: the same as the previous variable.

3. y = blood pressure: uncontrollable (you cannot tell your blood pressure what it must be); it is not fixed (it changes from one minute to another); it is not known (unless you take the trouble to measure it).

$x_{1}$ = age: a person’s age is known and fixed (at least for a year).

$x_{2}$ = height: a person’s height is known and fixed.

4. y = electric bill: you have little control over the amount of the bill; it changes from month to month (not fixed); it is usually not known (you do not know what the bill for next month will be).

$x_{1}$ = income: monthly income is usually fixed and known (controllable).

$x_{2}$ = number of persons: it is known and fixed (controllable).

$x_{3}$ = living area: known and fixed (controllable).

1.3 Collecting the Data for Regression

Once you have identified the variables that you are going to use in your study, the next step is to collect the sample data that will be used to determine the relationship between the variables. This entails collecting observations on the y variable as well as on the x variables (there may be more than one x variable that influences the dependent variable). The data for regression can be of two types: observational and experimental.

$\color{red}{NB!!!}$ It is very important to be able to distinguish between the two types of data, because some regression techniques can only be applied to observational data and others only to experimental data. So to choose the right regression technique, you have to know which type of data you are working with. Now let’s look at the characteristics of the two types of data.

1.3.1 Observational data

Observational data are obtained if

Data are recorded as they are observed
No attempt is made to control the independent (x’s) and dependent (y) variables
The values of the x and y variables can be observed at the same time

Example 1.3 Suppose you want to predict an executive’s monthly compensation (salary), y. One way to obtain the data for regression is to select a random sample of executives and record the value of y as well as the values of each of the predictor variables (the x variables that influence y). Data for five executives in the sample are displayed in Table 1. There are six x variables that have an influence on the compensation.

Solution The data in Table 1 can be considered observational data because the data are recorded as they are observed from the executives. We cannot control the values of x and y. If an executive indicates that his/her monthly salary is R55 420, then we have to record this value as it is given (we cannot change it and have no control over the value); if an executive indicates that he/she is 42 years old, we have to record the age as 42 (we cannot change the age and thus have no control over the value). Note that the values of y and the values of the x’s can be observed at the same time. You can ask the executive to give information about the compensation (y) and all the x variables at the same time.
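The collection step described in this example can be sketched as follows. This is only an illustration under stated assumptions: the population list and all its values are invented, and only age is used as an x variable because it is the only one of the six that the text names.

```python
import random

# Hypothetical population of executives (all values invented for illustration).
population = [
    {"compensation": 48300, "age": 39},
    {"compensation": 61250, "age": 51},
    {"compensation": 52900, "age": 44},
    {"compensation": 70100, "age": 58},
    {"compensation": 45750, "age": 36},
    {"compensation": 58600, "age": 47},
    {"compensation": 66400, "age": 55},
    {"compensation": 50200, "age": 41},
]

rng = random.Random(0)                 # fixed seed so the sample is repeatable
sample = rng.sample(population, k=5)   # random sample of five executives

# Observational data: x and y are recorded together, exactly as reported.
x = [executive["age"] for executive in sample]
y = [executive["compensation"] for executive in sample]

for age, pay in zip(x, y):
    print(age, pay)
```

Nothing in this process sets or changes a value: the analyst only chooses which executives to ask and then records whatever they report.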

1.3.2 Experimental data

The second type of data in regression, experimental data, are generated by designed experiments where the values of the independent variables (the x values) are set in advance, before the values of y are observed. In other words, the experimenter decides, before performing the experiment, which values of the independent variables are going to be used. Only after the experiment is performed can the values of the dependent variable, y, be observed. Thus, experimental data are obtained if

Data are recorded according to the results of the experiment;
The independent variables (x’s) are controlled (i.e. the experimenter decides beforehand what the values of the independent variables are going to be);
The x and y values cannot be observed at the same time (the values of x are chosen before the experiment and the values of y are observed after the experiment).

Example 1.4 A production supervisor wants to investigate the effect of two independent variables, temperature $x_{1}$ and pressure $x_{2}$, on the impurity (y) of a certain chemical. The supervisor decided to employ three values of temperature ( $100^{\circ}$ C, $125^{\circ}$ C, and $150^{\circ}$ C) and three values of pressure (50, 60, and 70 kg per square cm), and to produce the chemical and measure its impurity y for each of the $3 \times 3 = 9$ temperature-pressure combinations.

Solution The data in Table 2 can be considered experimental data because the supervisor decided, before performing the experiment, what the temperatures and pressures (the independent variables) were going to be (i.e. the independent variables are controlled; the experimenter can choose any values for the x’s). The y-values are recorded according to the results of the experiment, and only after the experiment is performed (i.e. the x and y-values are not observed at the same time).
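The design in Example 1.4 can be laid out before any measurement is taken. The short sketch below enumerates the nine temperature-pressure combinations; the variable names and the dictionary layout are illustrative choices, not part of the example.

```python
from itertools import product

# Values of the independent variables, chosen before the experiment.
temperatures_c = [100, 125, 150]   # degrees Celsius
pressures = [50, 60, 70]           # kg per square cm

# Every temperature paired with every pressure: 3 x 3 = 9 runs.
design = list(product(temperatures_c, pressures))

# The impurity y is unknown until each run has actually been performed.
runs = [{"temperature": t, "pressure": p, "impurity": None} for t, p in design]

print(len(runs))
for run in runs:
    print(run["temperature"], run["pressure"], run["impurity"])
```

The x values are complete before the experiment starts, while the impurity column stays empty until each run is carried out, which is exactly what separates experimental from observational data.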