4.1 Data
4.1.1 Import
Download the data here. As always, save the data to a directory (preferably one that is automatically backed up by file-sharing software) and begin your script by loading the tidyverse
and by setting the working directory to the directory where you have just saved your data:
library(tidyverse)
library(readxl) # we need this package because our data is stored in an Excel file
setwd("c:/dropbox/work/teaching/R/") # change to your own working directory
powercc <- read_excel("power_conspicuous_consumption.xlsx","data") # Import the Excel file. Note that the name of the Excel sheet is data
Don’t forget to save your script in the working directory.
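If you would rather not change the working directory, the import can also be written with an explicit path. The following is a minimal sketch, not part of the original script: `data_dir` and `data_path` are names introduced here for illustration, and the folder is only a placeholder. It uses `excel_sheets()` from readxl to confirm that a sheet called data exists before reading it.

```r
# Hypothetical folder: replace it with the directory where you saved the file
data_dir  <- "c:/dropbox/work/teaching/R"
data_path <- file.path(data_dir, "power_conspicuous_consumption.xlsx")

if (file.exists(data_path)) {
  # excel_sheets() lists the sheet names, so you can check that "data" is among them
  print(readxl::excel_sheets(data_path))
  powercc <- readxl::read_excel(data_path, sheet = "data")
} else {
  message("File not found: ", data_path)
}
```

Naming the `sheet` argument explicitly makes it clear that the second argument of `read_excel` is the sheet and not, for example, a cell range.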
4.1.2 Manipulate
powercc # print the data to inspect it
## # A tibble: 147 x 39
## subject start_date end_date duration finished power
## <dbl> <dttm> <dttm> <dbl> <dbl> <chr>
## 1 1 2012-04-19 09:32:56 2012-04-19 09:49:42 1006. 1 high
## 2 2 2012-04-19 09:31:26 2012-04-19 09:51:13 1187. 1 low
## 3 3 2012-04-19 09:29:50 2012-04-19 09:53:10 1400. 1 low
## 4 4 2012-04-19 09:26:25 2012-04-19 09:53:21 1616. 1 low
## 5 5 2012-04-19 09:20:55 2012-04-19 09:54:21 2006. 1 high
## 6 6 2012-04-19 09:28:02 2012-04-19 09:55:50 1668. 1 high
## 7 7 2012-04-19 09:17:54 2012-04-19 09:58:49 2455. 1 low
## 8 8 2012-04-19 09:22:26 2012-04-19 10:01:40 2354. 1 high
## 9 9 2012-04-19 10:13:12 2012-04-19 10:31:03 1071. 1 low
## 10 10 2012-04-19 10:12:55 2012-04-19 10:31:29 1114. 1 high
## # ... with 137 more rows, and 33 more variables: audience <chr>,
## # group_size <dbl>, gender <chr>, age <dbl>, dominance1 <dbl>,
## # dominance2 <dbl>, dominance3 <dbl>, dominance4 <dbl>, dominance5 <dbl>,
## # dominance6 <dbl>, dominance7 <dbl>, sa1 <dbl>, sa2 <dbl>, sa3 <dbl>,
## # sa4 <dbl>, sa5 <dbl>, sa6 <dbl>, sa7 <dbl>, sa8 <dbl>, sa9 <dbl>,
## # sa10 <dbl>, sa11 <dbl>, inconspicuous1 <dbl>, inconspicuous2 <dbl>,
## # inconspicuous3 <dbl>, inconspicuous4 <dbl>, inconspicuous5 <dbl>,
## # conspicuous1 <dbl>, conspicuous2 <dbl>, conspicuous3 <dbl>,
## # conspicuous4 <dbl>, conspicuous5 <dbl>, agree <dbl>
We have 39 columns or variables in our data:
subject identifies the participants.
start_date and end_date indicate the beginning and end of the experimental session.
duration indicates the duration of the experimental session, in seconds.
finished: did participants complete the whole experiment?
power (high vs. low) and audience (private vs. public) are the experimental conditions.
group_size: in groups of how many did participants come to the lab?
gender and age of the participant.
dominance1, dominance2, etc. are the questions that measure dominance. An example is “I think I would enjoy having authority over other people”. Participants responded with yes (1) or no (0).
sa1, sa2, etc. are the questions that measure status aspirations. An example is “I would like an important job where people looked up to me”. Participants responded with yes (1) or no (0).
inconspicuous1, inconspicuous2, etc. contain the willingness to pay (WTP) for the inconspicuous products. Scale from 1: I would buy very cheap items to 9: I would buy very expensive items.
conspicuous1, conspicuous2, etc. contain the WTP for the conspicuous products. Scale from 1: I would buy very cheap items to 9: I would buy very expensive items.
agree: an exploratory question measuring whether people agreed they were better suited for the role of worker or manager. Participants responded on a scale from 1: much better suited for worker (manager) role to 7: much better suited for manager (worker) role. Higher numbers indicate agreement with the role assignment in the experiment.
4.1.2.1 Factorize some variables
Upon inspection of the data, we see that the type of subject is double, which means that subject should be factorized so that its values do not get treated as numbers. We’ll also factorize both our experimental conditions:
powercc <- powercc %>% # we've already created the powercc object before
  mutate(subject = factor(subject),
         power = factor(power, levels = c("low","high")), # note the levels argument
         audience = factor(audience, levels = c("private","public"))) # note the levels argument
Note that we’ve provided a new argument levels when factorizing power and audience. This argument specifies the order of the levels of a factor. In the context of this experiment, it is more natural to talk about the effect of high vs. low power on consumption than to talk about the effect of low vs. high power on consumption. Therefore, we tell R that the low power level should be considered as the first level. Later on we’ll see that the output of analyses can then be interpreted as effects of high power (second level) vs. low power (first level). The same reasoning applies to the audience factor, although providing the levels for this factor was unnecessary because private comes before public alphabetically. Your choice of the first or reference level only influences interpretation, not the actual outcome of the analysis.
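To see what the levels argument changes, here is a small sketch on made-up data (the vectors `power_raw` and `spend` are invented for illustration and are not part of the experiment). It compares the default alphabetical ordering with an explicit ordering and fits the same simple regression with both.

```r
# Made-up data: six participants with a power condition and a spending score
power_raw <- c("high", "low", "low", "high", "low", "high")
spend     <- c(7, 4, 5, 8, 3, 6)

# Default: levels are sorted alphabetically, so "high" becomes the reference level
power_default <- factor(power_raw)
levels(power_default)   # "high" "low"

# Explicit order: "low" becomes the reference level
power_ordered <- factor(power_raw, levels = c("low", "high"))
levels(power_ordered)   # "low" "high"

# Same data, same model; only the reference level differs
coef(lm(spend ~ power_default))   # intercept 7 (mean of high), slope -3 (low vs. high)
coef(lm(spend ~ power_ordered))   # intercept 4 (mean of low), slope +3 (high vs. low)
```

The two slopes have the same size and opposite signs: the difference between the groups is identical, and only the direction in which it is reported changes.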
In one experimental session, the fire alarm went off and we had to leave the lab. Let’s remove participants who did not complete the experiment:
powercc <- powercc %>% # we've already created the powercc object before
  filter(finished == 1) # only retain observations where finished is equal to 1
Notice the double == when testing for equality. Check out the R for Data Science book for other logical operators (scroll down to get to Section 5.2.2).
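As a quick illustration of a few other operators inside `filter`, here is a sketch on a small made-up tibble (`toy` and its values are invented for this example):

```r
library(dplyr)

# Made-up data, loosely mimicking a few columns of powercc
toy <- tibble(
  subject  = 1:6,
  finished = c(1, 1, 0, 1, 0, 1),
  power    = c("high", "low", "low", "high", "high", "low"),
  age      = c(19, 23, 21, 30, 22, 20)
)

finished_only <- toy %>% filter(finished == 1)                    # equal to: subjects 1, 2, 4, 6
not_finished  <- toy %>% filter(finished != 1)                    # not equal to: subjects 3, 5
finished_high <- toy %>% filter(finished == 1 & power == "high")  # and: subjects 1, 4
young_or_low  <- toy %>% filter(age < 21 | power == "low")        # or: subjects 1, 2, 3, 6
```

A single = inside `filter` is a common slip; it is read as naming an argument rather than as a comparison, which is why the double == matters.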
4.1.2.2 Calculate the internal consistency & the average of questions that measure the same concept
We would like to average the questions that measure dominance to get a single number indicating whether the participant has a dominant or a non-dominant personality. Before we do this, we should get an idea of the internal consistency of the questions that measure dominance. This will tell us whether all these questions measure the same concept. One measure of internal consistency is Cronbach’s alpha. To calculate it, we need a package called psych:
library(psych) # load the psych package (run install.packages("psych") once if it is not installed yet)
Once this package is loaded, we can use the alpha function to calculate Cronbach’s alpha for a set of questions:
dominance.questions <- powercc %>%
  select(starts_with("dominance")) # take the powercc data frame and select all the variables with a name that starts with dominance
alpha(dominance.questions) # calculate the Cronbach's alpha of these variables
##
## Reliability analysis
## Call: alpha(x = dominance.questions)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.69 0.67 0.67 0.23 2.1 0.038 0.63 0.27 0.21
##
## lower alpha upper 95% confidence boundaries
## 0.61 0.69 0.76
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## dominance1 0.68 0.67 0.66 0.25 2.0 0.039 0.016 0.22
## dominance2 0.65 0.63 0.62 0.22 1.7 0.043 0.018 0.21
## dominance3 0.59 0.58 0.56 0.19 1.4 0.050 0.014 0.20
## dominance4 0.62 0.60 0.59 0.20 1.5 0.047 0.018 0.21
## dominance5 0.65 0.64 0.63 0.23 1.7 0.043 0.021 0.21
## dominance6 0.70 0.70 0.68 0.28 2.4 0.038 0.009 0.25
## dominance7 0.65 0.64 0.62 0.23 1.8 0.043 0.017 0.20
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## dominance1 143 0.52 0.49 0.33 0.29 0.53 0.50
## dominance2 143 0.59 0.61 0.52 0.42 0.80 0.40
## dominance3 143 0.75 0.74 0.71 0.59 0.52 0.50
## dominance4 143 0.68 0.68 0.61 0.50 0.58 0.50
## dominance5 143 0.61 0.59 0.48 0.40 0.55 0.50
## dominance6 143 0.29 0.38 0.19 0.14 0.91 0.29
## dominance7 143 0.62 0.59 0.48 0.41 0.53 0.50
##
## Non missing response frequency for each item
## 0 1 miss
## dominance1 0.47 0.53 0
## dominance2 0.20 0.80 0
## dominance3 0.48 0.52 0
## dominance4 0.42 0.58 0
## dominance5 0.45 0.55 0
## dominance6 0.09 0.91 0
## dominance7 0.47 0.53 0
# Note that we could have also written this as follows:
# powercc %>% select(starts_with("dominance")) %>% alpha()
This produces a lot of output. Under raw_alpha, we see that the alpha is 0.69, which is on the low side (0.70 is often considered the minimum required), but still acceptable. The table “Reliability if an item is dropped” tells us what the alpha would be if we dropped one question from our measure. Dropping dominance6 would increase the alpha to 0.70. Compared to the original alpha of 0.69, this increase is tiny and hence we don’t drop dominance6. If there were a question with a high ‘alpha if dropped’, then this would indicate that this question is measuring something different than the other questions. In that case, you could consider removing this question from your measure.
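To make the numbers in this output less of a black box, here is a sketch that computes the raw alpha by hand with the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the sum score), where k is the number of questions. The function `cronbach_alpha` and the data `toy_items` are made up for illustration; this is not how the psych package is implemented, and the sketch assumes there are no missing values.

```r
# Raw Cronbach's alpha from the textbook formula (assumes complete data)
cronbach_alpha <- function(items) {
  items     <- as.matrix(items)
  k         <- ncol(items)              # number of questions
  item_vars <- apply(items, 2, var)     # variance of each question
  total_var <- var(rowSums(items))      # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Made-up yes/no answers of 8 participants; q4 is built to be unrelated to q1-q3
toy_items <- data.frame(
  q1 = c(1, 0, 1, 1, 0, 1, 0, 1),
  q2 = c(1, 0, 1, 0, 0, 1, 0, 1),
  q3 = c(1, 0, 0, 1, 0, 1, 1, 1),
  q4 = c(0, 1, 1, 0, 1, 0, 0, 1)
)

cronbach_alpha(toy_items)   # alpha for all four questions

# "Alpha if an item is dropped": recompute alpha leaving out one question at a time
alpha_if_dropped <- sapply(names(toy_items), function(q) {
  cronbach_alpha(toy_items[, names(toy_items) != q, drop = FALSE])
})
alpha_if_dropped
```

In this toy example, dropping q4 raises the alpha considerably, which is the pattern you would look for when deciding whether a question measures something different from the others.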
We can proceed by averaging the responses on the dominance questions:
powercc <- powercc %>%
  mutate(dominance = (dominance1+dominance2+dominance3+dominance4+dominance5+dominance6+dominance7)/7,
         cc = (conspicuous1+conspicuous2+conspicuous3+conspicuous4+conspicuous5)/5,
         icc = (inconspicuous1+inconspicuous2+inconspicuous3+inconspicuous4+inconspicuous5)/5) %>%
  select(-starts_with("sa"))
I’ve also averaged the questions about conspicuous consumption and inconspicuous consumption, but not the questions about status aspirations because their Cronbach’s alpha was too low. I’ve deleted the questions about status aspirations from the dataset. I leave it as an exercise to check the Cronbach’s alpha of each of these concepts (do this before deleting the questions about status aspirations, of course).
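Typing out every question becomes tedious with longer scales. As an alternative sketch (on a made-up tibble `toy`, not on the real data), the averaging can be done with `rowMeans` on the selected columns. The regular expression passed to `matches` is a choice made here: it selects only names that consist of dominance followed by digits, so a column called dominance that already holds the average would not be included by accident.

```r
library(dplyr)

# Made-up data: three dominance questions and one status aspiration question
toy <- tibble(
  dominance1 = c(1, 0, 1),
  dominance2 = c(1, 0, 0),
  dominance3 = c(1, 1, 0),
  sa1        = c(0, 1, 1)
)

# matches() takes a regular expression: "dominance" followed only by digits
item_cols <- select(toy, matches("^dominance[0-9]+$"))

toy <- toy %>%
  mutate(dominance = rowMeans(item_cols)) %>%  # average per participant (per row)
  select(-starts_with("sa"))                   # drop the status aspiration question

toy$dominance   # 1, 1/3, 1/3
```

Both approaches give the same result when there are no missing answers. With missing answers, the manual sum returns NA for that participant, and so does `rowMeans` unless you set `na.rm = TRUE`.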
4.1.3 Recap: importing & manipulating
Here’s what we’ve done so far, in one orderly sequence of piped operations (download the data here):
library(tidyverse)
library(readxl)
setwd("c:/dropbox/work/teaching/R/") # change to your own working directory
powercc <- read_excel("power_conspicuous_consumption.xlsx","data") %>%
  filter(finished == 1) %>%
  mutate(subject = factor(subject),
         power = factor(power, levels = c("low","high")),
         audience = factor(audience, levels = c("private","public")),
         dominance = (dominance1+dominance2+dominance3+dominance4+dominance5+dominance6+dominance7)/7,
         cc = (conspicuous1+conspicuous2+conspicuous3+conspicuous4+conspicuous5)/5,
         icc = (inconspicuous1+inconspicuous2+inconspicuous3+inconspicuous4+inconspicuous5)/5) %>%
  select(-starts_with("sa"))