5 Tutorial 5: Matching survey data & data donations

In Tutorial 5, you will learn…

  • how to aggregate data donations to per-person observations
  • how to match the survey data with grouped data donations

5.1 Per-person aggregation of data donations

First off, let us load the data donations from YouTube again:

data <- read.csv2("sample_youtube.csv")

We again use the preprocessing and analysis pipeline we built over the last tutorials to classify all individual searches as news-related (1) or not (0).

Reminder: This is a very simplistic pipeline that is far from perfect, both in terms of preprocessing and analysis. Your job for the seminar paper will be to check…

  • which preprocessing steps to use (or not to use)
  • how to build and apply a better dictionary
  • how to validate the dictionary
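As one illustration of such a preprocessing choice: the percent-encoded characters in the search URLs (e.g. %C3%B6 for ö) do not have to be replaced with one gsub() call per character. The following is a minimal sketch, not the pipeline used in this tutorial, that relies on base R's utils::URLdecode(). The helper name decode_query() is hypothetical. Note that this decodes every percent-encoded sequence, so %0A, for example, becomes a real line break rather than a blank.

```r
# Hypothetical helper: decode a YouTube search URL into a plain query string
decode_query <- function(x) {
  # remove the URL prefix (fixed = TRUE: treat "?" literally, not as regex)
  x <- sub("https://www.youtube.com/results?search_query=", "", x, fixed = TRUE)
  # "+" encodes a blank; replace it BEFORE decoding so that "%2B" stays a literal "+"
  x <- gsub("+", " ", x, fixed = TRUE)
  # decode element by element, since older R versions only decode a single string
  out <- vapply(x, utils::URLdecode, character(1), USE.NAMES = FALSE)
  # assumption: the donated URLs are UTF-8 encoded
  Encoding(out) <- "UTF-8"
  out
}

decode_query("https://www.youtube.com/results?search_query=k%C3%B6ln+nachrichten%3F")
```

Whether such a blanket decoding is preferable to hand-picked replacements depends on the data and should be checked against a sample of raw queries.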

For now, let’s stick with our imperfect pipeline:

library("dplyr")
library("stringr")
library("quanteda")

## Preprocessing pipeline ##
data <- data %>% 
  
  #removing URL-related terms
  mutate(search_query = gsub("https://www.youtube.com/results?search_query=",
                             "",
                             search_query,
                             fixed = TRUE),
         #replace "+" (encoded blanks) with actual blanks
         search_query = gsub("+", " ", search_query, fixed = TRUE)) %>%
  
  #removing encoding issues
  mutate(
         #Correct encoding for German "Umlaute"
         search_query = gsub("%C3%B6", "ö", search_query),
         search_query = gsub("%C3%A4", "ä", search_query),
         search_query = gsub("%C3%BC", "ü", search_query),
         search_query = gsub("%C3%9C", "Ü", search_query),
         
         #Correct encoding for special signs
         search_query = gsub("%C3%9F", "ß", search_query),
         
         #Correct encoding for punctuation
         search_query = gsub("%0A", " ", search_query),
         search_query = gsub("%22", '"', search_query),
         search_query = gsub("%23", "#", search_query),
         search_query = gsub("%26", "&", search_query),
         search_query = gsub("%27|%E2%80%98|%E2%80%99|%E2%80%93|%C2%B4", "'", search_query),
         search_query = gsub("%2B", "+", search_query),
         search_query = gsub("%3D", "=", search_query),
         search_query = gsub("%3F", "?", search_query),
         search_query = gsub("%40", "@", search_query),

         #Correct encoding for letters from other languages
         search_query = gsub("%C3%A7", "ç", search_query),
         search_query = gsub("%C3%A9", "é", search_query),
         search_query = gsub("%C3%B1", "ñ", search_query),
         search_query = gsub("%C3%A5", "å", search_query),
         search_query = gsub("%C3%B8", "ø", search_query),
         search_query = gsub("%C3%BA", "ú", search_query),
         search_query = gsub("%C3%AE", "î", search_query)) %>%
  
  mutate(
         #transform queries to lower case
         search_query = char_tolower(search_query),
         
         #create unique ID per search query
         query_id = paste0("ID", row_number()))

## Analysis pipeline for dictionary analysis ##

#Create a document-feature-matrix
dfm <- data$search_query %>%
  tokens() %>%
  dfm()

#Do dictionary analysis
classification_news <- dfm %>% 
  
  #look up slightly adjusted dictionary with some more terms
  dfm_lookup(dictionary = dictionary(list(news = c("news",
                                      "nachrichten",
                                      "doku",
                                      "interview",
                                      "information",
                                      "tagesschau",
                                      "swr",
                                      "bild tv",
                                      "heute show",
                                      "magazin royal")))) %>%
  
  #convert to data frame
  convert(., to = "data.frame") %>%
  
  #add to data dataframe
  select(news) %>%
  cbind(., data) %>%
  
  mutate(#transform dictionary count to binary classification of news-related (1) or not (0)
    news = replace(news,
                   news > 0,
                   1)) %>%
 
  #reduce to necessary variables & change order of variables
  select(external_submission_id, search_query, news)

#check the first row of the resulting data frame
head(classification_news, 1)
##   external_submission_id               search_query news
## 1                   3861 barbara becker let's dance    0

Phew, what a pipeline! Where to go from here?

Next, we want to create an average score of news-related searches per person.

This means we no longer want one classification per search query (the current format, with many rows per person) but one row per person that contains the percentage of this person's search queries that are news-related (the new format, created via aggregation).

For instance, how many of the search queries by the person above (ID 3861) are news-related?

We can group results by external_submission_id to create the average share of news-related searches per person as a form of aggregation:

classification_news <- classification_news %>%
  
  #group by ID, here a single person
  group_by(external_submission_id) %>%
  
  #calculate share of news-related searches per person
  summarize(share = (mean(news, na.rm = TRUE))*100) %>%

  #ungroup
  ungroup()

#check first five rows of aggregated data
head(classification_news, 5)
## # A tibble: 5 × 2
##   external_submission_id share
##                    <int> <dbl>
## 1                   3861  0   
## 2                   4146  1.02
## 3                   4172  3.06
## 4                   4268  0   
## 5                   4411  1.12

Fantastic! For every person, we now know the share of her/his queries that were news-related (assuming our dictionary worked well).

For instance, we can see that none of the searches by the person with the ID 3861 were news-related, compared with 1.02% of those by the person with the ID 4146.
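To make the aggregation step more tangible, here is a minimal sketch on a made-up toy data set (the data frame toy and its values are hypothetical). In addition to the share, it keeps the number of queries per person, which is not part of our pipeline but can help to judge how much information a share is based on:

```r
library("dplyr")

# toy data: one row per search query, two hypothetical persons
toy <- data.frame(
  external_submission_id = c(1, 1, 1, 1, 2, 2),
  news = c(1, 0, 0, 0, 0, 0)
)

toy_agg <- toy %>%
  group_by(external_submission_id) %>%
  summarize(
    n_queries = n(),                        # number of queries per person
    n_news = sum(news, na.rm = TRUE),       # number of news-related queries
    share = mean(news, na.rm = TRUE) * 100  # share of news-related queries in %
  )

toy_agg
```

Person 1 has one news-related query out of four (25%), person 2 none out of two (0%). Because news is coded 0/1, the mean is the same as n_news divided by n_queries.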

5.2 Understanding survey data

Next, we read in the survey data:

survey <- read.csv2("survey.csv")

The survey data contains data from two surveys [1]:

  • A survey of a convenience sample of students at German universities (November 2022 – January 2023), recruited face-to-face
  • A survey of members of an online panel (November 2022 – January 2023), contacted via email

Let us look at the variables included in our data. Please note that operationalizations, e.g. of Political Interest, were partly derived from prior research [2].

Variable | Description | Question | Items
ID | Anonymous ID | - | Numeric string
Age | Age | How old are you? | Numeric string
Gender | Gender | What is your gender? | Male, Female, Other
Education | Education | What is your highest level of education? | Secondary Degree, A-Levels, University Degree
PI1 | Pol. Interest (1) | If I notice that I lack knowledge about a political topic, I get informed about it. | 1 = Strongly disagree, …, 5 = Strongly agree
PI2 | Pol. Interest (2) | For me, politics is an exciting topic. | 1 = Strongly disagree, …, 5 = Strongly agree
PI3 | Pol. Interest (3) | I often think deeply about a political controversy. | 1 = Strongly disagree, …, 5 = Strongly agree
PI4 | Pol. Interest (4) | I follow political events with great curiosity. | 1 = Strongly disagree, …, 5 = Strongly agree
PI5 | Pol. Interest (5) | In general, I am very interested in politics. | 1 = Strongly disagree, …, 5 = Strongly agree
Use_FB | Facebook Use | How often do you use [platform]? | 1 = Never, …, 5 = daily
Use_TWI | Twitter Use | How often do you use [platform]? | 1 = Never, …, 5 = daily
Use_INST | Instagram Use | How often do you use [platform]? | 1 = Never, …, 5 = daily
Use_YOU | YouTube Use | How often do you use [platform]? | 1 = Never, …, 5 = daily
Use_TELE | Telegram Use | How often do you use [platform]? | 1 = Never, …, 5 = daily
Use_WHATS | WhatsApp Use | How often do you use [platform]? | 1 = Never, …, 5 = daily
Trust | Trust in News Media | Generally speaking, how much do you trust information by the news media? | 1 = I do not trust them at all, …, 5 = I trust them fully

We can see that the survey data contains a lot of variables related to…

  • Identifying participants: ID
  • Sociodemographic characteristics: Age, Gender, Education
  • Political Interest: PI1, PI2, PI3, PI4, PI5
  • Social Media Use: Use_FB, Use_TWI, Use_INST, Use_YOU, Use_TELE, Use_WHATS
  • Trust in News Media: Trust
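A codebook like this can also be used for a quick sanity check of the data: all items measured on a five-point scale should only contain values from 1 to 5. The following is a minimal sketch on made-up values (the data frame toy_survey is hypothetical and contains only a few of the variables), counting out-of-range values per variable:

```r
library("dplyr")

# toy survey data with one deliberately invalid value (PI2 = 6)
toy_survey <- data.frame(
  ID = c(101, 102, 103),
  PI1 = c(5, 3, 4),
  PI2 = c(4, 1, 6),
  Use_YOU = c(5, 4, 3),
  Trust = c(3, 4, 2)
)

# count non-missing values outside the range 1-5 for all five-point items
out_of_range <- toy_survey %>%
  summarize(across(c(starts_with("PI"), starts_with("Use_"), Trust),
                   ~ sum(!is.na(.x) & !(.x %in% 1:5))))

out_of_range
```

Any count above zero points to a coding or data entry issue that should be resolved before the analysis.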

5.3 Matching survey data to data donations

To automatically match survey responses in survey (our survey data) to donated data in classification_news (the aggregated data frame with the % of news-related searches per person), we can use the anonymized ID of each participant:

  • variable external_submission_id in classification_news
  • variable ID in survey

#Recheck dataset on news classifications
head(classification_news, 2)
## # A tibble: 2 × 2
##   external_submission_id share
##                    <int> <dbl>
## 1                   3861  0   
## 2                   4146  1.02

#Recheck dataset on survey data
head(survey, 2)
##      ID Age Gender         Education PI1 PI2 PI3 PI4 PI5 Use_FB Use_TWI Use_INST Use_YOU Use_TELE Use_WHATS Trust
## 1 10494  66 Female University Degree   5   4   5   5   5      4       2        3       4        1         5     3
## 2  4411  18   Male          A-levels   5   5   5   5   5      1       1        5       5        3         5     4
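Before joining, it can be worth checking that the two ID variables are actually suitable as keys. The following is a minimal sketch with made-up toy data (the data frames toy_donations and toy_survey and all their values are hypothetical). It checks whether both keys have the same type, whether an ID occurs more than once in the survey, and which donors have no survey response:

```r
library("dplyr")

# toy data: three donors; the survey contains ID 2 twice and lacks ID 3
toy_donations <- data.frame(external_submission_id = c(1L, 2L, 3L),
                            share = c(0, 1.5, 3))
toy_survey <- data.frame(ID = c(1L, 2L, 2L, 9L),
                         Age = c(18L, 24L, 24L, 66L))

# 1) Do both key variables have the same type (e.g. both integer)?
same_type <- class(toy_donations$external_submission_id) == class(toy_survey$ID)

# 2) Which IDs occur more than once in the survey?
dup_ids <- toy_survey %>%
  count(ID) %>%
  filter(n > 1)

# 3) Which donors have no matching survey response?
no_survey <- toy_donations %>%
  anti_join(toy_survey, by = c("external_submission_id" = "ID"))

same_type
dup_ids
no_survey
```

Duplicated IDs would duplicate rows in the merged data set, so they should be resolved before merging.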

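To merge both data sets based on the anonymous IDs, we use the left_join() function, which is part of the dplyr package. Here, we have to tell R to match corresponding rows by the variable external_submission_id in classification_news and the variable ID in survey. left_join() keeps every row of classification_news and adds the survey variables of the participant with the same ID: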

#Combine both
data_combined <- classification_news %>%
  left_join(survey, by = c("external_submission_id" = "ID"))

#check result
head(data_combined, 2)
## # A tibble: 2 × 17
##   external_submission_id share   Age Gender Education        PI1   PI2   PI3   PI4   PI5 Use_FB Use_TWI Use_INST Use_YOU
##                    <int> <dbl> <int> <chr>  <chr>          <int> <int> <int> <int> <int>  <int>   <int>    <int>   <int>
## 1                   3861  0       18 Male   A-levels           3     1     2     2     2      1       2        4       3
## 2                   4146  1.02    24 Male   University De…     5     4     4     4     4      1       1        4       5
## # ℹ 3 more variables: Use_TELE <int>, Use_WHATS <int>, Trust <int>
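What happens to donors who did not fill in the survey? The following minimal sketch with made-up toy data (toy_donations and toy_survey are hypothetical) shows that left_join() keeps such persons and fills their survey variables with NA, whereas dplyr's inner_join() would drop them:

```r
library("dplyr")

# toy data: donor 3 has no survey response, survey ID 9 donated no data
toy_donations <- data.frame(external_submission_id = c(1L, 2L, 3L),
                            share = c(0, 1.5, 3))
toy_survey <- data.frame(ID = c(1L, 2L, 9L),
                         Age = c(18L, 24L, 66L))

# left join: all three donors are kept, Age is NA for donor 3
joined_left <- toy_donations %>%
  left_join(toy_survey, by = c("external_submission_id" = "ID"))

# inner join: only donors with a survey response remain
joined_inner <- toy_donations %>%
  inner_join(toy_survey, by = c("external_submission_id" = "ID"))

# number of donors without survey information after the left join
sum(is.na(joined_left$Age))
```

In both variants, the survey respondent with ID 9 does not appear, because the join starts from the data donations. Counting the NA values after the join is a simple way to see how many donors could not be matched.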

That’s it! Our aggregated data donations have now been merged with the survey data, and we can start preparing the survey variables for multivariate analysis.

5.4 Takeaways

Vocabulary:

  • Aggregating: Summarizing information in data sets at a broader level, for instance from several observations per individual to a single observation per individual that summarizes this information
  • Merging: Combining two different data sets, here data from our data donations and the survey

Commands:

  • Aggregating: group_by(), summarize(), ungroup()
  • Merging: left_join()

Let’s keep going with Tutorial 6: Preparing survey data


  1. For the sake of simplicity, we will not distinguish between the two data sources. But be aware that both samples were contacted differently and different patterns of self-selection may be at play.

  2. Scale based on: Otto, L., & Bacherle, P. (2011). Politisches Interesse Kurzskala (PIKS)–Entwicklung und Validierung. Politische Psychologie, 1(1), 19–35.