6 Tutorial 6: Preparing survey data

In Tutorial 6, you will learn:

  • How to transform variables
  • How to create indices


This time, we simply load a data frame data_combined which we created in Tutorial 5: Matching survey data & data donations.

You will find this R environment via Moodle under the folder “Data for R” (“tutorial6.RData”).

Use the load() command to load it into your working environment.
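
As a minimal sketch of this step: the helper load_tutorial_data() below is hypothetical (it is not part of the tutorial materials), and we assume that the file was saved as "tutorial6.RData" in your current working directory. The helper only wraps load() so that a missing file produces a readable error message.

```r
# Hypothetical helper (not part of the tutorial): load an .RData file and
# return the names of the objects it restored.
load_tutorial_data <- function(path = "tutorial6.RData", envir = globalenv()) {
  if (!file.exists(path)) {
    stop("File not found: ", path, " (working directory: ", getwd(), ")")
  }
  # load() restores the saved objects into `envir` and returns their names
  load(path, envir = envir)
}

# Assumption: the file sits in the current working directory.
if (file.exists("tutorial6.RData")) {
  restored <- load_tutorial_data()
  print(restored)  # should list data_combined if the file matches the tutorial
}
```

If the file is stored elsewhere, pass the full path as the first argument or change the working directory with setwd() first.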


Let’s remember what this data contains:

  • Our automated content analysis of YouTube search queries:
    • Anonymous ID for each participant: external_submission_ID
    • Share of news-related searches: share
  • Our survey data:
    • Anonymous ID for each participant: ID
    • Sociodemographic characteristics: Age, Gender, Education
    • Political Interest: PI1, PI2, PI3, PI4, PI5
    • Social Media Use: Use_FB, Use_TWI, Use_INST, Use_YOU, Use_TELE, Use_WHATS
    • Trust in News Media: Trust

6.1 Transforming variables

We may want to transform some values of specific variables before using them for further analysis.

For instance, we can see that the variable Education takes on three values:

  • A-levels (here equivalent to the “Abitur” in Germany)
  • Secondary Degree (here equivalent to the “Realschulabschluss” in Germany)
  • University Degree
##          A-levels  Secondary Degree University Degree 
##                27                11                45
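
A frequency table like the one above can be produced with base R's table() function. The sketch below uses a small toy data frame (toy is hypothetical and its counts do not match the real survey data); with the tutorial data, the equivalent call would be table(data_combined$Education).

```r
# Toy stand-in for data_combined (hypothetical values, not the real survey data)
toy <- data.frame(
  Education = c("A-levels", "University Degree", "Secondary Degree",
                "University Degree", "A-levels"),
  stringsAsFactors = FALSE
)

# Count how often each education level occurs
education_counts <- table(toy$Education)
print(education_counts)

# useNA = "ifany" additionally reports missing values, if there are any
print(table(toy$Education, useNA = "ifany"))
```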

Let’s assume that we want to use this variable to categorize participants’ highest degree of education in less detail. The goal is to transform the variable Education into a new variable University describing whether participants have a university degree (University Degree) or not (No University Degree).

We do this by…

  • First, creating a new variable called University which simply contains all values from Education.
  • Second, replacing values of University with No University Degree if participants stated that they do not have a university degree, i.e., gave A-levels or Secondary Degree as their highest degree of education.
#load dplyr for %>%, mutate(), and select()
library(dplyr)
data_combined <- data_combined %>%
  #copy variable Education to new variable University
  mutate(University = Education,
         #replace all values other than "University Degree" with "No University Degree"
         University = replace(University,
                              Education != "University Degree",
                              "No University Degree"))

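If a numeric 0/1 indicator is preferred over text labels (for instance as a predictor in a regression model), the same logic can be written as a comparison. This is a base R sketch on toy data; the data frame toy and the variable name University_dummy are assumptions made for illustration and do not appear in the tutorial data.

```r
# Toy stand-in for data_combined (hypothetical values)
toy <- data.frame(
  Education = c("A-levels", "Secondary Degree", "University Degree", NA),
  stringsAsFactors = FALSE
)

# 1 = university degree, 0 = any other degree; a missing Education stays NA
toy$University_dummy <- as.integer(toy$Education == "University Degree")
print(toy)
```

The comparison returns TRUE or FALSE for each participant, and as.integer() turns these into 1 and 0.
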
Let’s check the results:

data_combined %>%
  #select relevant variables
  select(University, Education) %>%
  #show first two rows
  head(2)
## # A tibble: 2 × 2
##   University           Education        
##   <chr>                <chr>            
## 1 No University Degree A-levels         
## 2 University Degree    University Degree

Looks good!
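
Looking at two rows only covers two of the three education levels. A cross-tabulation of the old and the new variable checks every combination at once. The sketch below uses base R and toy data (toy is hypothetical); with the tutorial data, the call would be table(data_combined$Education, data_combined$University).

```r
# Toy stand-in for data_combined (hypothetical values)
toy <- data.frame(
  Education = c("A-levels", "Secondary Degree", "University Degree",
                "University Degree", "A-levels"),
  stringsAsFactors = FALSE
)

# Same recoding logic as above, written in base R
toy$University <- ifelse(toy$Education == "University Degree",
                         "University Degree", "No University Degree")

# Rows: original values; columns: recoded values
check <- table(toy$Education, toy$University)
print(check)
```

If the recoding worked, A-levels and Secondary Degree only have counts in the "No University Degree" column, and University Degree only in the "University Degree" column.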

6.2 Creating indices

Most variables in our data only consist of a single item, e.g., Age, Gender, or Education.

However, especially in surveys, you will often encounter variables that are measured via multiple items.

Here, for instance, we measured Political Interest with five items:

  • PI1: “If I notice that I lack knowledge about a political topic, I get informed about it.”
  • PI2: “For me, politics is an exciting topic.”
  • PI3: “I often think deeply about a political controversy.”
  • PI4: “I follow political events with great curiosity.”
  • PI5: “In general, I am very interested in politics.”

These items are drawn from a scale developed by Otto & Bacherle (2011) (see the reference at the end of this tutorial).

For multivariate analysis, we may want to create a mean index as a single measure of political interest.

To do so, we install and load the package sjPlot, a package for facilitating data visualizations, and use its function tab_itemscale(). The function shows summary statistics as well as Item Difficulty, Item Discrimination, and Cronbach’s alpha or \(\alpha_{Cronbach}\), metrics that help us decide whether or not to combine items into a mean index:

library(sjPlot)
data_combined %>%
  select(starts_with("PI")) %>% 
  tab_itemscale()
Component 1

Row   Missings   Mean   SD     Skew    Item Difficulty   Item Discrimination   α if deleted
PI1   0.00 %     3.87   0.84   -0.25   0.77              0.62                  0.93
PI2   0.00 %     3.54   1.23   -0.58   0.71              0.83                  0.89
PI3   0.00 %     3.39   1.18   -0.52   0.68              0.77                  0.90
PI4   0.00 %     3.59   1.12   -0.53   0.72              0.88                  0.88
PI5   0.00 %     3.54   1.2    -0.6    0.71              0.87                  0.88

Mean inter-item correlation = 0.691 · Cronbach’s α = 0.919

What does this table show you?

  • You can see which items you included via Row
  • You can see how often an item is missing via Missings
  • You can see descriptive statistics across items via Mean, SD, and Skew
  • You can see how useful items are for measuring the more latent concept political interest via Item Difficulty, Item Discrimination, and \(\alpha_{Cronbach}\) if deleted

Here, we will focus on Item Discrimination and \(\alpha_{Cronbach}\) values:

  1. Check if Item Discrimination values are above .2. Item Discrimination indicates the degree to which an item helps us discriminate between participants who score low and participants who score high on political interest.
  2. Check if \(\alpha_{Cronbach}\) is higher than .7. The higher, the better. If \(\alpha_{Cronbach}\) sharply increases as soon as one of the items is not considered for analysis (i.e., “deleted” according to the table), this indicates that we may consider not using that item for the index.
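
To make these two checks more tangible, here is a base R sketch that computes the same kind of metrics by hand on toy data. It rests on two assumptions: that Item Discrimination corresponds to the corrected item-total correlation (each item correlated with the sum of the remaining items), and that the reported \(\alpha_{Cronbach}\) is the variance-based version. The data frame items and the function cronbach_alpha() are hypothetical and not part of sjPlot, so results may differ slightly from what tab_itemscale() reports.

```r
# Toy item responses (hypothetical; 6 participants x 3 items on a 1-5 scale)
items <- data.frame(
  item1 = c(1, 2, 3, 4, 5, 4),
  item2 = c(2, 2, 3, 5, 5, 3),
  item3 = c(1, 3, 3, 4, 4, 5)
)

# Cronbach's alpha from the item variances and the variance of the sum score
cronbach_alpha <- function(x) {
  k <- ncol(x)
  (k / (k - 1)) * (1 - sum(sapply(x, var)) / var(rowSums(x)))
}

alpha_all <- cronbach_alpha(items)

# Corrected item-total correlation: each item vs. the sum of the other items
item_discrimination <- sapply(names(items), function(i) {
  cor(items[[i]], rowSums(items[, names(items) != i, drop = FALSE]))
})

# Alpha recomputed with each item left out in turn
alpha_if_deleted <- sapply(names(items), function(i) {
  cronbach_alpha(items[, names(items) != i, drop = FALSE])
})

print(round(alpha_all, 3))
print(round(item_discrimination, 3))
print(round(alpha_if_deleted, 3))

# The two rules of thumb from the list above
print(all(item_discrimination > 0.2))
print(alpha_all > 0.7)
```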

Here, it seems that all items discriminate well - all values of Item Discrimination are sufficiently high - and that, together, they form an index with satisfactory reliability (internal consistency), seeing that \(\alpha_{Cronbach}\) for all five items is .919.
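
As a rough plausibility check, the standardized version of \(\alpha_{Cronbach}\) can be computed from the number of items \(k\) and the mean inter-item correlation \(\bar{r}\) reported in the table (\(k = 5\), \(\bar{r} = 0.691\)). Note that this standardized formula is an approximation we bring in for illustration; the table presumably reports the variance-based \(\alpha_{Cronbach}\), so the two values need not match exactly.

\[
\alpha_{standardized} = \frac{k \, \bar{r}}{1 + (k - 1) \, \bar{r}} = \frac{5 \times 0.691}{1 + 4 \times 0.691} = \frac{3.455}{3.764} \approx 0.918
\]

This is very close to the reported value of .919, which supports the reading that the five items are highly consistent with one another.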


We can use all five items for political interest to create a mean index called PoliticalInterest.

We create this index with the add_index() function from the tidycomm package, an R package created specifically for communication scientists and their needs in R.

Here, we create a mean index PoliticalInterest consisting of the five items PI1, PI2, PI3, PI4, and PI5:

library(tidycomm)
data_combined <- data_combined %>%
  add_index(PoliticalInterest,
            PI1, PI2, PI3, PI4, PI5,
            type = "mean",
            cast.numeric = TRUE)

Let’s check our result:

data_combined %>%
  #choose relevant variables
  select(starts_with("P")) %>%
  #show first two rows
  head(2)
## # A tibble: 2 × 6
##     PI1   PI2   PI3   PI4   PI5 PoliticalInterest
##   <int> <int> <int> <int> <int>             <dbl>
## 1     3     1     2     2     2               2  
## 2     5     4     4     4     4               4.2
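
The index values in this output can be verified by hand. The following base R sketch rebuilds the two rows shown above and averages the five items per participant with rowMeans(); the object names pi_items and manual_index are chosen for illustration only.

```r
# First two rows of the PI items, copied from the output above
pi_items <- data.frame(
  PI1 = c(3, 5), PI2 = c(1, 4), PI3 = c(2, 4), PI4 = c(2, 4), PI5 = c(2, 4)
)

# Mean index: average of the five items per participant (row).
# With the default na.rm = FALSE, a participant with any missing item gets NA.
manual_index <- rowMeans(pi_items[, c("PI1", "PI2", "PI3", "PI4", "PI5")])
print(manual_index)  # 2.0 and 4.2, matching PoliticalInterest above
```

For the first participant, (3 + 1 + 2 + 2 + 2) / 5 = 2, and for the second, (5 + 4 + 4 + 4 + 4) / 5 = 4.2.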

Looks good! Now, we are finally ready to run our first regression models.

6.3 Take Aways


  • Mean Index is a composite measure consisting of different items. Here, we aggregate several items (e.g., PI1, PI2, PI3, PI4, and PI5), which we assume represent the latent concept political interest, into a single measure. We do so by taking the mean value a participant scores across items PI1 to PI5 and saving it in the variable PoliticalInterest.


  • Transforming variables: mutate(), replace()
  • Creating mean indices: tab_itemscale(), add_index()

6.4 Test your knowledge

You’ve worked through all the material of Tutorial 6? Let’s see it - the following tasks will test your knowledge.

6.4.1 Task 6.1

Writing the corresponding R code, check whether you can/should combine all items describing social media use (all variables starting with Use, i.e., Use_FB, Use_TWI, etc.) into a single mean index or not. Explain your solution!

Let’s keep going with Tutorial 7: Data analysis.

  1. Otto, L., & Bacherle, P. (2011). Politisches Interesse Kurzskala (PIKS)–Entwicklung und Validierung. Politische Psychologie, 1(1), 19–35.