Chapter 9 Week 9

9.1 Introduction to Probability Distributions and Revision

The essential reading for this week can be found in Navarro’s Chapter 9.

This week we will extend the probability content to probability distributions. We had a first look at coin flips last week, and we will build on that concept this week by talking more about sequences of events.

Last week we were working towards building your intuition about random and independent events. The beauty of independent events is that we can construct distributions of the values a variable can take (e.g. outcomes of a coin toss), which we can then study to evaluate the likelihood of certain events happening (e.g. Kobe and his ‘hot hand’).

We will continue today by working with discrete distributions: the task is to sample from the possible outcomes, visualise the results, and determine the likelihood of particular events from the distributions.

9.2 Discrete example (guessing homework answers)

Imagine that you are working on your weekly homework quiz on Learn, which has 10 multiple choice questions with 4 options each, and you want to see what the likelihood is of getting the answers right if you pick an answer at random each time with your eyes closed. Assume that there is only one correct option per question.

We can study the probability of getting all the questions right or, say, of getting only half of the questions right, by generating the distribution of the likely outcomes.

If you pick an answer at random, then the probability of getting it right should be \(\frac{1}{4}=0.25\). The same is true for question two, assuming that your answer to question one is independent of your answer to question two.
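
Because the answers are independent, the probabilities of joint outcomes simply multiply. For example, the chance of guessing both question one and question two correctly is:

0.25 * 0.25 # independent events, so the probabilities multiply
## [1] 0.0625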

We can generate samples of attempts (each attempt consisting of 10 questions) and check the probability of getting one, two, three, and so on questions out of 10 right. We will use the rbinom() function in R to generate the distribution over a number of attempts.

We can work with this directly in R using the following functions:

  • rbinom() to generate random draws from the distribution (‘r’ for random) for discrete outcomes
  • dbinom() to find the probability of an exact outcome (‘d’ for density)
  • pbinom() to find the cumulative probability (which can also be described as the area under the curve up to a given value)

The key arguments we will use are:

  • x is a vector of quantiles (here, the number of correct answers).

  • p is a vector of probabilities.

  • n is number of observations.

  • size is the number of trials.

  • prob is the probability of success of each trial.
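
One practical note: rbinom() draws random numbers, so the outputs you get when running the chunks below will not match these notes exactly. If you want your own results to be reproducible, you can set a seed first. A minimal sketch (the seed value 123 is arbitrary):

# Setting a seed makes the random draws below reproducible
set.seed(123)
rbinom(n = 5, size = 10, prob = 0.25)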

9.3 Generate a homework attempt

# We can generate an attempt for 10 questions
rbinom(n = 10, # 10 attempts
       size = 10, # 10 questions
       prob = 0.25) # Probability that one is correct (1/4)
##  [1] 2 4 3 0 1 0 1 2 5 2
# We can generate an attempt for 100 questions/trials
rbinom(n = 100, # 100 attempts of 10 questions each
       size = 10, 
       prob = 0.25) 
##   [1] 2 3 3 2 1 4 3 4 3 1 4 4 0 3 1 2 3 5 3 3 2 2 2 3 4 2 2 1 1 3 2 2 1 3 1
##  [36] 1 2 1 4 3 4 4 3 3 0 3 1 1 3 1 3 4 2 2 1 1 3 4 2 2 4 2 2 3 1 1 1 1 1 2
##  [71] 3 0 4 1 2 0 0 4 2 2 5 5 1 3 1 3 2 1 5 1 3 3 3 5 2 5 3 1 3 3

Let’s assign it to a tibble so we can calculate the proportions and make a plot.

library(tidyverse)
homework_guess <- tibble(right_guess = rbinom(n = 100, 
       size = 10, 
       prob = 0.25))
homework_guess %>%
  count(right_guess)
## # A tibble: 7 x 2
##   right_guess     n
##         <int> <int>
## 1           0     4
## 2           1    22
## 3           2    24
## 4           3    25
## 5           4    16
## 6           5     8
## 7           6     1
ggplot(data = homework_guess, aes(x = right_guess)) + 
  geom_bar(fill = 'lightblue') + 
  labs(x = 'Right Guess', y = 'Count') + 
  theme_minimal()

We can change the scale on the graph to see all the options:

ggplot(data = homework_guess, aes(x = right_guess)) + 
  geom_bar(fill = 'lightblue') + 
  xlim(0,10) +  #note how we add xlim()
  labs(x = 'Right Guess', y = 'Count') + 
  theme_minimal()
## Warning: Removed 1 rows containing missing values (geom_bar).

What’s more, we can also change the y-axis units to probability.

ggplot(data = homework_guess, aes(x = right_guess)) + 
  geom_bar(aes(y = (..count..)/sum(..count..)), # ..count.. is the count computed by geom_bar(), so dividing by the total gives proportions
           fill = 'lightblue') + 
  xlim(0,10) + 
  labs(x = 'Right Guess', y = 'Probability') + 
  theme_minimal()
## Warning: Removed 1 rows containing missing values (geom_bar).
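
A side note: in more recent versions of ggplot2 (3.4 and above) the ..count.. notation is deprecated in favour of after_stat(). If you see a deprecation warning, an equivalent sketch is:

ggplot(data = homework_guess, aes(x = right_guess)) + 
  geom_bar(aes(y = after_stat(count / sum(count))), # same proportion calculation, newer syntax
           fill = 'lightblue') + 
  xlim(0,10) + 
  labs(x = 'Right Guess', y = 'Probability') + 
  theme_minimal()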

9.4 Changing the probability (TRUE/FALSE)

What if the multiple choice quiz had only two options for each answer (i.e. TRUE or FALSE questions)? The right answer would now have a probability of 0.5 instead. Let’s reflect on that and create a new tibble():

#Homework guess distribution with TRUE/FALSE
homework_guess_true_false <- tibble(right_guess = rbinom(n = 100, 
       size = 10, 
       prob = 0.5)) # Note that the probability of getting the right answer has gone up
homework_guess_true_false %>%
  count(right_guess)
## # A tibble: 9 x 2
##   right_guess     n
##         <int> <int>
## 1           1     1
## 2           2     3
## 3           3    13
## 4           4    20
## 5           5    28
## 6           6    19
## 7           7    12
## 8           8     3
## 9           9     1

Your chances are higher! Can you see why?

ggplot(data = homework_guess_true_false, aes(x = right_guess)) + 
  geom_bar(fill = 'lightgreen')  +  
  xlim(0,10) + 
  labs(x = 'Right Guess', y = 'Count') + 
  theme_minimal()

We have fewer options now, hence a higher chance of picking the correct ones.

Let’s also plot with probabilities:

ggplot(data = homework_guess_true_false, aes(x = right_guess)) + 
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'lightgreen') + 
  xlim(0,10) + 
  labs(x = 'Right Guess', y = 'Probability') + 
  theme_minimal()

9.5 Studying the distribution

What’s great about learning probability distributions is that we can use the distribution above to find the exact probability of guessing different numbers of questions correctly. We will use dbinom().

Check what the probability is of getting exactly one question right when guessing at random:

dbinom(1, # Number of outcomes that we got right
       size=10, # 10 trials (we have 10 questions)
       prob=0.25) # Probability of getting the right answer (1/4) 
## [1] 0.1877117

Or for two correct answers:

dbinom(2, # Number of outcomes that we got right
       size = 10, # 10 trials (we have 10 questions)
       prob = 0.25) # Probability of getting the right answer (1/4) 
## [1] 0.2815676

Or three:

dbinom(3, # Number of outcomes that we got right
       size = 10, # 10 trials (we have 10 questions)
       prob = 0.25) # Probability of getting the right answer (1/4) 
## [1] 0.2502823

And so on….

What if we want to find the probability of guessing all 10 questions correctly? That’s a magical result!

dbinom(10, size = 10,  prob = 0.25)  
## [1] 9.536743e-07
#Round to 2 d.p.
round(dbinom(10, size = 10,  prob = 0.25), 2)
## [1] 0

So rare that it is almost zero, so never rely on luck alone when doing the quiz! :)
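
As a sanity check, this matches the direct calculation: all ten independent guesses must be right, so the probability is \(0.25^{10}\):

0.25^10 # same value as dbinom(10, size = 10, prob = 0.25)
## [1] 9.536743e-07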

9.6 Cumulative Probability (Advanced)

By adding up probabilities like the ones above, we can also get a cumulative probability. For example, we can study the chance of getting four or fewer questions right, which means including the chances of getting zero, one, two, and three right as well.

Graphically, we are really analysing the probability mass here, given that the probabilities of all possible outcomes sum to one. To know the chances of getting two or fewer questions right we can sum the probabilities of getting exactly zero, one, or two right:

dbinom(0, size = 10,  prob = 0.25) +
  dbinom(1, size = 10,  prob = 0.25) +
  dbinom(2, size = 10,  prob = 0.25)
## [1] 0.5255928

A faster way to see all the probabilities at once would be:

all_probs <- round(dbinom(x = c(0,1,2,3,4,5,6,7,8,9,10), prob = 0.25, size = 10),2) # Note that we use round to see the values to two decimal places 
all_probs
##  [1] 0.06 0.19 0.28 0.25 0.15 0.06 0.02 0.00 0.00 0.00 0.00
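
We can also confirm that these probabilities cover the whole distribution, i.e. that they sum to one (before rounding):

sum(dbinom(0:10, size = 10, prob = 0.25)) # every possible outcome, so the total probability is 1
## [1] 1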

You can then find out what the probability is of getting five or fewer answers right versus more than five answers right:

# Five or fewer
pbinom(q = 5, prob = 0.25, size = 10)
## [1] 0.9802723

Quite high chances :)

# More than five
1 - pbinom(q = 5, prob = 0.25, size = 10 )
## [1] 0.01972771

Not that much! Can you see what we did? We found the probability of getting five or fewer questions right and then subtracted it from 1 (the total probability), leaving the probability of getting more than five right.

Graphically, we can show this as follows:

knitr::include_graphics('prob.jpg')
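
One note of caution with the complement trick: 1 - pbinom(q = 5, ...) gives the probability of getting more than five right, i.e. six or above. If you instead wanted five or more (including five), you would subtract the cumulative probability up to four; a sketch:

# Five or more correct: subtract P(four or fewer) from the total
1 - pbinom(q = 4, prob = 0.25, size = 10)
## [1] 0.07812691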

There is an exercise at the end of this chapter for you to try this yourself with the distribution for TRUE/FALSE questions. Make sure that you revise this before coming back to the course in January, as it will remind you where we left off.

9.7 Revision Practice Rmd. Solutions

This week’s practice is built around the key material we covered during the past eight weeks. You will need to load the data from Learn and then work with the key variables to provide descriptive statistics and visualisations. There is an extra practice at the end for you to work on the discrete probability distribution example as well.

The dataset has information on participants who took part in a memory experiment: IDs, age, memory score, the task condition (Task A, Task B, or Placebo), and whether the participant saw the information/text twice.

We are trying to explore whether treatment (i.e. task) and seeing the information twice may affect the memory scores.

  • ID: 1 to 144
  • Age: 18-51
  • Memory Score: 1-100 (100 when remembered everything)
  • Task: Task A, Task B, Placebo
  • Saw_twice: Yes/No (if participant saw the text twice)
# Load tidyverse
library(tidyverse)
# Read data in
data <- read.csv('week_9.csv')
# Check what's inside
head(data)
##    ID Age Memory_score    Task Saw_twice
## 1 ID1  21           54  Task A        No
## 2 ID2  27           23  Task A       Yes
## 3 ID3  25           32 Placebo       Yes
## 4 ID4  25           38  Task B        No
## 5 ID5  49           43  Task B       Yes
## 6 ID6  47           32 Placebo        No

9.7.1 Provide descriptive statistics for age and memory score variables

There are different ways to do so using what we have learned so far:

Look at each variable separately:

# First age
data %>%
  summarise(mean = mean(Age),
            median = median(Age),
            sd = sd(Age))
##       mean median       sd
## 1 34.39583   33.5 9.219331
# Then memory
data %>%
  summarise(mean = mean(Memory_score),
            median = median(Memory_score),
            sd = sd(Memory_score))
##       mean median       sd
## 1 44.03472     41 20.54516

Or do it all in one go:

data %>%
  summarise(mean_age = mean(Age),
            mean_memory = mean(Memory_score),
            median_age = median(Age),
            median_memory = median (Memory_score),
            sd_age = sd(Age),
            sd_memory = sd(Memory_score))
##   mean_age mean_memory median_age median_memory   sd_age sd_memory
## 1 34.39583    44.03472       33.5            41 9.219331  20.54516
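
If you prefer something more compact, a sketch using dplyr’s across() (available from dplyr 1.0) computes the same summaries for both variables in one call:

# Same descriptives for both variables using across()
data %>%
  summarise(across(c(Age, Memory_score),
                   list(mean = mean, median = median, sd = sd)))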

9.7.2 Descriptives by groups

Provide descriptives of the memory scores by task, and by whether someone saw the information on the task twice:

# By treatment/task
data %>%
  group_by(Task) %>% #note that we are adding group_by() to differentiate by a variable
    summarise(mean = mean(Memory_score),
            median = median(Memory_score),
            sd = sd(Memory_score))
## # A tibble: 3 x 4
##   Task     mean median    sd
##   <fct>   <dbl>  <dbl> <dbl>
## 1 Placebo  43.2   41    21.9
## 2 Task A   45.7   41.5  19.8
## 3 Task B   43.4   42    20.2
# By whether someone saw information on the task twice
data %>%
  group_by(Saw_twice) %>%
    summarise(mean = mean(Memory_score),
            median = median(Memory_score),
            sd = sd(Memory_score))
## # A tibble: 2 x 4
##   Saw_twice  mean median    sd
##   <fct>     <dbl>  <dbl> <dbl>
## 1 No         44.1   42.5  20.6
## 2 Yes        44.0   41    20.6

9.7.3 Visualise

Provide distributions of age and memory scores.

# Age
ggplot(data = data, aes(x = Age)) + 
  geom_histogram(colour = 'grey', fill = 'cornsilk') + 
  labs(x = 'Age (Years)', y = 'Frequency', title = 'Histogram of Age') + 
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Memory Scores
ggplot(data = data, aes(x = Memory_score)) + 
  geom_histogram(colour = 'grey', fill = 'cornsilk') + 
  labs(x = 'Memory Score (1-100)', y = 'Frequency', title = 'Histogram of Memory Scores') + 
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
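
The message above is just ggplot2 suggesting that we choose a binwidth ourselves rather than rely on the default 30 bins. For example (binwidth = 5 is an arbitrary choice):

# Memory scores with an explicit binwidth of 5 points
ggplot(data = data, aes(x = Memory_score)) + 
  geom_histogram(binwidth = 5, colour = 'grey', fill = 'cornsilk') + 
  labs(x = 'Memory Score (1-100)', y = 'Frequency', title = 'Histogram of Memory Scores') + 
  theme_minimal()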

9.7.4 Visualise a subset

What about memory scores only for Task A?

# Memory Scores for Task A
ggplot(data = subset(data, Task %in% c('Task A')), aes(x = Memory_score)) + 
  geom_histogram(colour = 'grey', fill = 'cornsilk') + 
  labs(x = 'Memory Score (1-100)', y = 'Frequency', title = 'Histogram of Memory Scores (Task A)') + 
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can also use %>% to do the same. Check this out:

# Memory Scores for Task A
data %>%
  filter (Task == 'Task A') %>%
ggplot(data = ., #note how we replace the data with `.` which will allow us to use the specification above as our input
       aes(x = Memory_score)) + 
  geom_histogram(colour = 'grey', fill = 'cornsilk') + 
  labs(x = 'Memory Score (1-100)', y = 'Frequency', title = 'Histogram of Memory Scores (Task A)') + 
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

9.7.5 Variable by a group

What about a plot of Memory Score by Task?

ggplot(data = data, aes(x = Task, y = Memory_score, fill = Task)) + 
  geom_boxplot() 

Anything more advanced? Maybe you could try ggridges?

library(ggridges)
ggplot(data = data, aes(x = Memory_score, y = Task,  fill=Task)) + geom_density_ridges()
## Picking joint bandwidth of 8.01
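
If you prefer histograms to density ridges, another option is to facet by Task so that each group gets its own panel; a sketch:

# One histogram panel per task
ggplot(data = data, aes(x = Memory_score)) + 
  geom_histogram(binwidth = 5, colour = 'grey', fill = 'cornsilk') + 
  facet_wrap(~ Task) + 
  labs(x = 'Memory Score (1-100)', y = 'Frequency') + 
  theme_minimal()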

9.7.6 Create a new variable using mutate()

Let’s rescale the memory scores. How would you transform the scores (1-100) into proportions between 0 and 1?

# We can use mutate() (note that this overwrites Memory_score; try piping first, without assigning)
data %>%
  mutate(Memory_score=Memory_score/100)
##        ID Age Memory_score    Task Saw_twice
## 1     ID1  21         0.54  Task A        No
## 2     ID2  27         0.23  Task A       Yes
## 3     ID3  25         0.32 Placebo       Yes
## 4     ID4  25         0.38  Task B        No
## 5     ID5  49         0.43  Task B       Yes
## 6     ID6  47         0.32 Placebo        No
## 7     ID7  28         0.51  Task A        No
## 8     ID8  26         0.26 Placebo        No
## 9     ID9  28         0.56 Placebo        No
## 10   ID10  35         0.39  Task A        No
## 11   ID11  48         0.32  Task B       Yes
## 12   ID12  26         0.19 Placebo        No
## 13   ID13  50         0.46  Task B        No
## 14   ID14  22         0.26  Task B       Yes
## 15   ID15  21         0.28  Task A        No
## 16   ID16  44         0.48  Task A       Yes
## 17   ID17  45         0.28  Task B       Yes
## 18   ID18  40         0.76  Task A        No
## 19   ID19  23         0.62 Placebo        No
## 20   ID20  42         0.44 Placebo        No
## 21   ID21  20         0.26  Task B       Yes
## 22   ID22  34         0.30  Task A       Yes
## 23   ID23  48         0.24  Task A        No
## 24   ID24  48         0.82  Task B        No
## 25   ID25  44         0.29  Task A       Yes
## 26   ID26  44         0.47  Task B        No
## 27   ID27  25         0.55  Task A        No
## 28   ID28  48         0.56 Placebo        No
## 29   ID29  31         0.30 Placebo        No
## 30   ID30  22         0.44  Task B       Yes
## 31   ID31  32         0.66  Task A       Yes
## 32   ID32  22         0.12 Placebo        No
## 33   ID33  29         1.24 Placebo       Yes
## 34   ID34  33         0.59  Task A        No
## 35   ID35  38         0.34  Task A        No
## 36   ID36  31         0.81  Task A       Yes
## 37   ID37  49         0.36  Task B       Yes
## 38   ID38  25         0.08  Task B        No
## 39   ID39  28         0.66  Task B        No
## 40   ID40  28         0.62 Placebo        No
## 41   ID41  42         0.13  Task A       Yes
## 42   ID42  43         0.84  Task B        No
## 43   ID43  47         0.60 Placebo       Yes
## 44   ID44  28         0.05  Task B        No
## 45   ID45  34         0.46  Task B       Yes
## 46   ID46  33         0.72 Placebo        No
## 47   ID47  24         0.41  Task A       Yes
## 48   ID48  39         0.87 Placebo       Yes
## 49   ID49  44         0.56  Task A        No
## 50   ID50  49         0.22 Placebo        No
## 51   ID51  40         0.94  Task B        No
## 52   ID52  22         0.42  Task B       Yes
## 53   ID53  32         0.33  Task A        No
## 54   ID54  47         0.78  Task B       Yes
## 55   ID55  37         0.49 Placebo        No
## 56   ID56  51         0.19  Task A       Yes
## 57   ID57  36         0.67  Task B        No
## 58   ID58  22         0.28  Task B        No
## 59   ID59  37         0.31  Task A       Yes
## 60   ID60  42         0.38  Task B       Yes
## 61   ID61  20         0.53  Task B       Yes
## 62   ID62  25         0.29  Task B        No
## 63   ID63  29         0.38  Task B       Yes
## 64   ID64  44         0.53  Task B        No
## 65   ID65  28         0.37  Task B        No
## 66   ID66  19         0.77 Placebo        No
## 67   ID67  42         0.46 Placebo       Yes
## 68   ID68  32         0.14  Task B        No
## 69   ID69  42         0.89 Placebo        No
## 70   ID70  23         0.53 Placebo       Yes
## 71   ID71  46         0.36  Task B       Yes
## 72   ID72  35         0.55 Placebo       Yes
## 73   ID73  23         0.18 Placebo       Yes
## 74   ID74  25         0.31 Placebo        No
## 75   ID75  38         0.41 Placebo       Yes
## 76   ID76  33         0.42 Placebo       Yes
## 77   ID77  24         0.34 Placebo        No
## 78   ID78  47         0.80  Task A        No
## 79   ID79  37         0.20  Task A       Yes
## 80   ID80  40         0.25  Task B       Yes
## 81   ID81  19         0.75  Task B       Yes
## 82   ID82  39         0.42  Task A        No
## 83   ID83  24         0.31  Task B       Yes
## 84   ID84  27         0.53  Task B        No
## 85   ID85  31         0.35  Task B        No
## 86   ID86  41         0.28  Task B        No
## 87   ID87  29         0.75  Task A       Yes
## 88   ID88  43         0.51  Task B        No
## 89   ID89  42         0.24  Task A        No
## 90   ID90  31         0.31  Task A        No
## 91   ID91  22         0.32 Placebo        No
## 92   ID92  47         0.43 Placebo        No
## 93   ID93  24         0.28 Placebo       Yes
## 94   ID94  33         0.39  Task A       Yes
## 95   ID95  47         0.34  Task A        No
## 96   ID96  33         0.81  Task A        No
## 97   ID97  28         0.41  Task B       Yes
## 98   ID98  36         0.15  Task B        No
## 99   ID99  44         0.71  Task A        No
## 100 ID100  40         0.34  Task A       Yes
## 101 ID101  21         0.20  Task A        No
## 102 ID102  21         0.54  Task A        No
## 103 ID103  40         0.67  Task B        No
## 104 ID104  38         0.26 Placebo        No
## 105 ID105  28         0.57 Placebo        No
## 106 ID106  41         0.31 Placebo        No
## 107 ID107  23         0.14  Task B        No
## 108 ID108  50         0.65  Task A       Yes
## 109 ID109  38         0.54  Task A        No
## 110 ID110  40         0.34 Placebo        No
## 111 ID111  42         0.50  Task B       Yes
## 112 ID112  34         0.54  Task B        No
## 113 ID113  32         0.22 Placebo       Yes
## 114 ID114  46         0.82  Task A        No
## 115 ID115  35         0.58  Task B       Yes
## 116 ID116  31         0.30  Task B        No
## 117 ID117  39         0.63  Task A        No
## 118 ID118  29         0.55 Placebo        No
## 119 ID119  20         0.26 Placebo       Yes
## 120 ID120  26         0.31  Task A       Yes
## 121 ID121  22         0.25  Task B       Yes
## 122 ID122  50         0.37 Placebo        No
## 123 ID123  34         0.52 Placebo        No
## 124 ID124  44         0.46  Task B        No
## 125 ID125  45         0.33  Task A        No
## 126 ID126  22         0.33 Placebo       Yes
## 127 ID127  33         0.37 Placebo        No
## 128 ID128  43         0.27 Placebo        No
## 129 ID129  26         0.45 Placebo        No
## 130 ID130  47         0.59  Task B        No
## 131 ID131  32         0.19  Task B        No
## 132 ID132  42         0.53  Task A       Yes
## 133 ID133  44         0.25  Task A        No
## 134 ID134  42         0.03 Placebo        No
## 135 ID135  28         0.70  Task B       Yes
## 136 ID136  33         0.43  Task B        No
## 137 ID137  20         0.19 Placebo        No
## 138 ID138  46         0.29 Placebo        No
## 139 ID139  41         0.45  Task A       Yes
## 140 ID140  23         0.46 Placebo       Yes
## 141 ID141  45         0.61 Placebo        No
## 142 ID142  25         0.62  Task B        No
## 143 ID143  28         0.41 Placebo        No
## 144 ID144  32         0.65  Task A       Yes

We can see the transformed values above, but if you head() your data the change won’t appear:

head(data)
##    ID Age Memory_score    Task Saw_twice
## 1 ID1  21           54  Task A        No
## 2 ID2  27           23  Task A       Yes
## 3 ID3  25           32 Placebo       Yes
## 4 ID4  25           38  Task B        No
## 5 ID5  49           43  Task B       Yes
## 6 ID6  47           32 Placebo        No

That’s because we have not assigned the result to an object. To do so, we will need to use <- in the following way:

new_data <- data %>%
  mutate(Memory_score=Memory_score/100)

Check now:

head(new_data)
##    ID Age Memory_score    Task Saw_twice
## 1 ID1  21         0.54  Task A        No
## 2 ID2  27         0.23  Task A       Yes
## 3 ID3  25         0.32 Placebo       Yes
## 4 ID4  25         0.38  Task B        No
## 5 ID5  49         0.43  Task B       Yes
## 6 ID6  47         0.32 Placebo        No
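
Note that the mutate() call above overwrites Memory_score. If you would rather keep the original column and genuinely add an extra variable, give the new column its own name (Memory_prop below is just an illustrative name):

# Keep Memory_score and add a new proportion column alongside it
data %>%
  mutate(Memory_prop = Memory_score / 100) %>%
  head()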

9.7.7 Subset observations using filter()

Can you subset only Task A and Placebo from the data? We can use filter and then assign the filtered observations to an object too:

data %>%
  filter(Task == 'Task A' | Task== 'Placebo') 
##       ID Age Memory_score    Task Saw_twice
## 1    ID1  21           54  Task A        No
## 2    ID2  27           23  Task A       Yes
## 3    ID3  25           32 Placebo       Yes
## 4    ID6  47           32 Placebo        No
## 5    ID7  28           51  Task A        No
## 6    ID8  26           26 Placebo        No
## 7    ID9  28           56 Placebo        No
## 8   ID10  35           39  Task A        No
## 9   ID12  26           19 Placebo        No
## 10  ID15  21           28  Task A        No
## 11  ID16  44           48  Task A       Yes
## 12  ID18  40           76  Task A        No
## 13  ID19  23           62 Placebo        No
## 14  ID20  42           44 Placebo        No
## 15  ID22  34           30  Task A       Yes
## 16  ID23  48           24  Task A        No
## 17  ID25  44           29  Task A       Yes
## 18  ID27  25           55  Task A        No
## 19  ID28  48           56 Placebo        No
## 20  ID29  31           30 Placebo        No
## 21  ID31  32           66  Task A       Yes
## 22  ID32  22           12 Placebo        No
## 23  ID33  29          124 Placebo       Yes
## 24  ID34  33           59  Task A        No
## 25  ID35  38           34  Task A        No
## 26  ID36  31           81  Task A       Yes
## 27  ID40  28           62 Placebo        No
## 28  ID41  42           13  Task A       Yes
## 29  ID43  47           60 Placebo       Yes
## 30  ID46  33           72 Placebo        No
## 31  ID47  24           41  Task A       Yes
## 32  ID48  39           87 Placebo       Yes
## 33  ID49  44           56  Task A        No
## 34  ID50  49           22 Placebo        No
## 35  ID53  32           33  Task A        No
## 36  ID55  37           49 Placebo        No
## 37  ID56  51           19  Task A       Yes
## 38  ID59  37           31  Task A       Yes
## 39  ID66  19           77 Placebo        No
## 40  ID67  42           46 Placebo       Yes
## 41  ID69  42           89 Placebo        No
## 42  ID70  23           53 Placebo       Yes
## 43  ID72  35           55 Placebo       Yes
## 44  ID73  23           18 Placebo       Yes
## 45  ID74  25           31 Placebo        No
## 46  ID75  38           41 Placebo       Yes
## 47  ID76  33           42 Placebo       Yes
## 48  ID77  24           34 Placebo        No
## 49  ID78  47           80  Task A        No
## 50  ID79  37           20  Task A       Yes
## 51  ID82  39           42  Task A        No
## 52  ID87  29           75  Task A       Yes
## 53  ID89  42           24  Task A        No
## 54  ID90  31           31  Task A        No
## 55  ID91  22           32 Placebo        No
## 56  ID92  47           43 Placebo        No
## 57  ID93  24           28 Placebo       Yes
## 58  ID94  33           39  Task A       Yes
## 59  ID95  47           34  Task A        No
## 60  ID96  33           81  Task A        No
## 61  ID99  44           71  Task A        No
## 62 ID100  40           34  Task A       Yes
## 63 ID101  21           20  Task A        No
## 64 ID102  21           54  Task A        No
## 65 ID104  38           26 Placebo        No
## 66 ID105  28           57 Placebo        No
## 67 ID106  41           31 Placebo        No
## 68 ID108  50           65  Task A       Yes
## 69 ID109  38           54  Task A        No
## 70 ID110  40           34 Placebo        No
## 71 ID113  32           22 Placebo       Yes
## 72 ID114  46           82  Task A        No
## 73 ID117  39           63  Task A        No
## 74 ID118  29           55 Placebo        No
## 75 ID119  20           26 Placebo       Yes
## 76 ID120  26           31  Task A       Yes
## 77 ID122  50           37 Placebo        No
## 78 ID123  34           52 Placebo        No
## 79 ID125  45           33  Task A        No
## 80 ID126  22           33 Placebo       Yes
## 81 ID127  33           37 Placebo        No
## 82 ID128  43           27 Placebo        No
## 83 ID129  26           45 Placebo        No
## 84 ID132  42           53  Task A       Yes
## 85 ID133  44           25  Task A        No
## 86 ID134  42            3 Placebo        No
## 87 ID137  20           19 Placebo        No
## 88 ID138  46           29 Placebo        No
## 89 ID139  41           45  Task A       Yes
## 90 ID140  23           46 Placebo       Yes
## 91 ID141  45           61 Placebo        No
## 92 ID143  28           41 Placebo        No
## 93 ID144  32           65  Task A       Yes

Now, put it inside a new dataset called reduced_data; you can also specify which variables you want to keep.

reduced_data <- data %>%
   filter(Task == 'Task A' | Task== 'Placebo')  %>%
  select(ID, Age, Memory_score, Task, Saw_twice)
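
An equivalent way to write that filter uses %in%, which can be handy when there are several categories to keep:

# Same subset written with %in%
reduced_data <- data %>%
  filter(Task %in% c('Task A', 'Placebo')) %>%
  select(ID, Age, Memory_score, Task, Saw_twice)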

9.7.8 Sort via arrange()

We can check the lowest and highest memory scores by sorting within each group:

# Task A (lowest)
data %>% 
  filter(Task == "Task A") %>%
  arrange(Memory_score) 
##       ID Age Memory_score   Task Saw_twice
## 1   ID41  42           13 Task A       Yes
## 2   ID56  51           19 Task A       Yes
## 3   ID79  37           20 Task A       Yes
## 4  ID101  21           20 Task A        No
## 5    ID2  27           23 Task A       Yes
## 6   ID23  48           24 Task A        No
## 7   ID89  42           24 Task A        No
## 8  ID133  44           25 Task A        No
## 9   ID15  21           28 Task A        No
## 10  ID25  44           29 Task A       Yes
## 11  ID22  34           30 Task A       Yes
## 12  ID59  37           31 Task A       Yes
## 13  ID90  31           31 Task A        No
## 14 ID120  26           31 Task A       Yes
## 15  ID53  32           33 Task A        No
## 16 ID125  45           33 Task A        No
## 17  ID35  38           34 Task A        No
## 18  ID95  47           34 Task A        No
## 19 ID100  40           34 Task A       Yes
## 20  ID10  35           39 Task A        No
## 21  ID94  33           39 Task A       Yes
## 22  ID47  24           41 Task A       Yes
## 23  ID82  39           42 Task A        No
## 24 ID139  41           45 Task A       Yes
## 25  ID16  44           48 Task A       Yes
## 26   ID7  28           51 Task A        No
## 27 ID132  42           53 Task A       Yes
## 28   ID1  21           54 Task A        No
## 29 ID102  21           54 Task A        No
## 30 ID109  38           54 Task A        No
## 31  ID27  25           55 Task A        No
## 32  ID49  44           56 Task A        No
## 33  ID34  33           59 Task A        No
## 34 ID117  39           63 Task A        No
## 35 ID108  50           65 Task A       Yes
## 36 ID144  32           65 Task A       Yes
## 37  ID31  32           66 Task A       Yes
## 38  ID99  44           71 Task A        No
## 39  ID87  29           75 Task A       Yes
## 40  ID18  40           76 Task A        No
## 41  ID78  47           80 Task A        No
## 42  ID36  31           81 Task A       Yes
## 43  ID96  33           81 Task A        No
## 44 ID114  46           82 Task A        No

What about the highest in Task B?

# Task B (the highest)
data %>%
    filter(Task == "Task B") %>%
  arrange(desc(Memory_score)) 
##       ID Age Memory_score   Task Saw_twice
## 1   ID51  40           94 Task B        No
## 2   ID42  43           84 Task B        No
## 3   ID24  48           82 Task B        No
## 4   ID54  47           78 Task B       Yes
## 5   ID81  19           75 Task B       Yes
## 6  ID135  28           70 Task B       Yes
## 7   ID57  36           67 Task B        No
## 8  ID103  40           67 Task B        No
## 9   ID39  28           66 Task B        No
## 10 ID142  25           62 Task B        No
## 11 ID130  47           59 Task B        No
## 12 ID115  35           58 Task B       Yes
## 13 ID112  34           54 Task B        No
## 14  ID61  20           53 Task B       Yes
## 15  ID64  44           53 Task B        No
## 16  ID84  27           53 Task B        No
## 17  ID88  43           51 Task B        No
## 18 ID111  42           50 Task B       Yes
## 19  ID26  44           47 Task B        No
## 20  ID13  50           46 Task B        No
## 21  ID45  34           46 Task B       Yes
## 22 ID124  44           46 Task B        No
## 23  ID30  22           44 Task B       Yes
## 24   ID5  49           43 Task B       Yes
## 25 ID136  33           43 Task B        No
## 26  ID52  22           42 Task B       Yes
## 27  ID97  28           41 Task B       Yes
## 28   ID4  25           38 Task B        No
## 29  ID60  42           38 Task B       Yes
## 30  ID63  29           38 Task B       Yes
## 31  ID65  28           37 Task B        No
## 32  ID37  49           36 Task B       Yes
## 33  ID71  46           36 Task B       Yes
## 34  ID85  31           35 Task B        No
## 35  ID11  48           32 Task B       Yes
## 36  ID83  24           31 Task B       Yes
## 37 ID116  31           30 Task B        No
## 38  ID62  25           29 Task B        No
## 39  ID17  45           28 Task B       Yes
## 40  ID58  22           28 Task B        No
## 41  ID86  41           28 Task B        No
## 42  ID14  22           26 Task B       Yes
## 43  ID21  20           26 Task B       Yes
## 44  ID80  40           25 Task B       Yes
## 45 ID121  22           25 Task B       Yes
## 46 ID131  32           19 Task B        No
## 47  ID98  36           15 Task B        No
## 48  ID68  32           14 Task B        No
## 49 ID107  23           14 Task B        No
## 50  ID38  25            8 Task B        No
## 51  ID44  28            5 Task B        No
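
If you only need the single highest score within each task, rather than the full sorted table, one option is to combine group_by() with slice_max() (available from dplyr 1.0); a sketch:

# Highest memory score within each task
data %>%
  group_by(Task) %>%
  slice_max(Memory_score, n = 1) %>%
  ungroup()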

9.7.9 Let’s do some specific counts using count() and filter()

  • First check how many people we have in each Task group:
data %>%
  count(Task)
## # A tibble: 3 x 2
##   Task        n
##   <fct>   <int>
## 1 Placebo    49
## 2 Task A     44
## 3 Task B     51
  • Can you show how many people aged over 40 who saw the information on the task twice are in each Task group?
data %>%
  filter(Age >40) %>%
  filter(Saw_twice == 'Yes') %>%
  count(Task)
## # A tibble: 3 x 2
##   Task        n
##   <fct>   <int>
## 1 Placebo     2
## 2 Task A      7
## 3 Task B      8
  • For the last one, show how people with the highest memory scores are split by task. Use a memory score threshold of 50 out of 100:
data %>%
  filter(Memory_score > 50) %>%
  count(Task)
## # A tibble: 3 x 2
##   Task        n
##   <fct>   <int>
## 1 Placebo    16
## 2 Task A     19
## 3 Task B     17

Task A has the greatest number of high memory scores.

Nicely done! If you got to the end, you have now successfully practised all the key code and functions we have learnt in previous weeks. Play around more if you like, for extra practice.

9.8 Extra Probability Practice

Work with the distribution we created in the tutorial for guessing on homework quizzes. We want to analyse how likely you are to get specific numbers of questions right.

9.8.1 TRUE/FALSE questions

Work with the TRUE/FALSE example we saw in the tutorial. What if the multiple choice quiz had only two options for each answer (i.e. TRUE or FALSE questions)? The right answer would now have a probability of 0.5 if you were to guess at random. Create a tibble() to show this:

homework_guess_true_false <- tibble(right_guess = rbinom(n = 100, 
       size = 10, 
       prob = 0.5)) 

9.8.2 Count the occurrences

homework_guess_true_false %>%
  count(right_guess)
## # A tibble: 8 x 2
##   right_guess     n
##         <int> <int>
## 1           2     3
## 2           3    10
## 3           4    21
## 4           5    29
## 5           6    21
## 6           7    12
## 7           8     3
## 8           9     1

9.8.3 Plot

ggplot(data = homework_guess_true_false, aes(x = right_guess)) + 
  geom_bar(fill = 'lightgreen') + 
  xlim(0,10) + 
  labs(x = 'Right Guess', y = 'Count') + 
  theme_minimal()

Plot with y being a probability:

ggplot(data = homework_guess_true_false, aes(x = right_guess)) + 
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = 'lightgreen') + 
  xlim(0,10) + 
  labs(x = 'Right Guess', y = 'Probability') + 
  theme_minimal()

9.8.4 Use dbinom() to study the probability

Check what the probability is of getting exactly one question right when guessing at random:

dbinom(1, size = 10, prob = 0.5) 
## [1] 0.009765625

Or for five correct answers:

dbinom(5, size = 10, prob = 0.5) 
## [1] 0.2460938

Or eight:

dbinom(8, size = 10, prob = 0.5) 
## [1] 0.04394531

Or put it all together at once (make sure that the probability is 0.5):

all_probs <- round(dbinom(x = c(0,1,2,3,4,5,6,7,8,9,10), prob = 0.5, size = 10),2) # Note that we use round to see the values to two decimal places 
all_probs
##  [1] 0.00 0.01 0.04 0.12 0.21 0.25 0.21 0.12 0.04 0.01 0.00

9.8.5 Five or fewer or more than five?

We can also study what the chances are of getting five or fewer questions right versus more than five questions right in a TRUE/FALSE setting (check your notes online).

# Five or fewer
pbinom(q = 5, prob = 0.5, size = 10)
## [1] 0.6230469
# More than five
1 - pbinom(q = 5, prob = 0.5, size = 10)
## [1] 0.3769531

Better chances compared to when you are doing a quiz with four options! :)
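
One last observation: because the distribution with prob = 0.5 is symmetric around five, the probability of getting five or more right is the same as the probability of getting five or fewer. You can check this yourself:

# Five or more correct: subtract P(four or fewer) from the total
1 - pbinom(q = 4, prob = 0.5, size = 10)
## [1] 0.6230469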