5 Tutorial 5: Matching survey data & data donations
In Tutorial 5, you will learn…
- how to aggregate data donations to per-person observations
- how to match the survey data with grouped data donations
5.1 Per-person aggregation of data donations
First off, let us load the data donations from YouTube again:
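The exact command depends on how you stored the donations; a minimal sketch, assuming they were exported to a local CSV file (the file name below is just a placeholder):
#load the donated YouTube data (file name is a placeholder)
data <- read.csv("youtube_data_donations.csv")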
We again use the preprocessing and analysis pipeline we built over the last tutorials to classify all individual searches as news-related (1) or not (0).
Reminder: This is a very simplistic, far-from-perfect pipeline, both in terms of preprocessing and analysis. Your job for the seminar paper will be to figure out…
- which preprocessing steps to use (or not to use)
- how to build and apply a better dictionary
- how to validate the dictionary
For now, let’s stick with our far-from-perfect pipeline:
library("dplyr")
library("stringr")
library("quanteda")
## Preprocessing pipeline ##
data <- data %>%
  #removing URL-related terms
  mutate(search_query = gsub("https://www.youtube.com/results?search_query=",
                             "",
                             search_query,
                             fixed = TRUE),
         search_query = gsub("+",
                             " ",
                             search_query,
                             fixed = TRUE)) %>%
  #fixing encoding issues
  mutate(
    #Correct encoding for German "Umlaute"
    search_query = gsub("%C3%B6", "ö", search_query),
    search_query = gsub("%C3%A4", "ä", search_query),
    search_query = gsub("%C3%BC", "ü", search_query),
    search_query = gsub("%C3%9C", "Ü", search_query),
    #Correct encoding for special signs
    search_query = gsub("%C3%9F", "ß", search_query),
    #Correct encoding for punctuation
    search_query = gsub("%0A", " ", search_query),
    search_query = gsub("%22", '"', search_query),
    search_query = gsub("%23", "#", search_query),
    search_query = gsub("%26", "&", search_query),
    search_query = gsub("%27|%E2%80%98|%E2%80%99|%E2%80%93|%C2%B4", "'", search_query),
    search_query = gsub("%2B", "+", search_query),
    search_query = gsub("%3D", "=", search_query),
    search_query = gsub("%3F", "?", search_query),
    search_query = gsub("%40", "@", search_query),
    #Correct encoding for letters from other languages
    search_query = gsub("%C3%A7", "ç", search_query),
    search_query = gsub("%C3%A9", "é", search_query),
    search_query = gsub("%C3%B1", "ñ", search_query),
    search_query = gsub("%C3%A5", "å", search_query),
    search_query = gsub("%C3%B8", "ø", search_query),
    search_query = gsub("%C3%BA", "ú", search_query),
    search_query = gsub("%C3%AE", "î", search_query)) %>%
  mutate(
    #transform queries to lower case
    search_query = char_tolower(search_query),
    #create unique ID per search query
    query_id = paste0("ID", row_number()))
## Analysis pipeline for dictionary analysis ##
#Create a document-feature-matrix
dfm <- data$search_query %>%
  tokens() %>%
  dfm()
#Do dictionary analysis
classification_news <- dfm %>%
  #look up slightly adjusted dictionary with some more terms
  dfm_lookup(dictionary = dictionary(list(news = c("news",
                                                   "nachrichten",
                                                   "doku",
                                                   "interview",
                                                   "information",
                                                   "tagesschau",
                                                   "swr",
                                                   "bild tv",
                                                   "heute show",
                                                   "magazin royal")))) %>%
  #convert to data frame
  convert(to = "data.frame") %>%
  #add to data dataframe
  select(news) %>%
  cbind(., data) %>%
  #transform dictionary count to binary classification of news-related (1) or not (0)
  mutate(news = replace(news,
                        news > 0,
                        1)) %>%
  #reduce to necessary variables & change order of variables
  select(external_submission_id, search_query, news)
#check the first row of the resulting data frame
head(classification_news, 1)
## external_submission_id search_query news
## 1 3861 barbara becker let's dance 0
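One caveat worth flagging before we move on: dfm_lookup() matches dictionary entries against individual features, so in our unigram document-feature matrix, multi-word entries such as "bild tv" or "heute show" can never match. A sketch of a possible alternative, keeping the same dictionary but applying it at the token level with tokens_lookup(), which does handle multi-word patterns:
#alternative: apply the dictionary at the token level, where
#multi-word entries such as "bild tv" can still be matched
classification_alt <- data$search_query %>%
  tokens() %>%
  tokens_lookup(dictionary = dictionary(list(news = c("news",
                                                      "nachrichten",
                                                      "doku",
                                                      "interview",
                                                      "information",
                                                      "tagesschau",
                                                      "swr",
                                                      "bild tv",
                                                      "heute show",
                                                      "magazin royal")))) %>%
  #count the matches per query and convert to a data frame
  dfm() %>%
  convert(to = "data.frame")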
Phew, what a pipeline! Where do we go from here?
Next, we want to create an average score of news-related searches per person.
This means we no longer want a classification of every single search query (current format) but the percentage of search queries per person that are news-related (new format, via aggregation).
For instance, how many of the search queries by the person above (ID 3861) are news-related?
We can group results by external_submission_id and, as a form of aggregation, calculate the share of news-related searches per person:
classification_news <- classification_news %>%
  #group by ID, here a single person
  group_by(external_submission_id) %>%
  #calculate share of news-related searches per person
  summarize(share = mean(news, na.rm = TRUE) * 100) %>%
  #ungroup
  ungroup()
#check first five rows of aggregated data
head(classification_news, 5)
## # A tibble: 5 × 2
## external_submission_id share
## <int> <dbl>
## 1 3861 0
## 2 4146 1.02
## 3 4172 3.06
## 4 4268 0
## 5 4411 1.12
Fantastic! For every person, we now know the share of their queries that were news-related (assuming our dictionary worked well).
For instance, we can see that 0% of all searches by the person with ID 3861 were news-related, but 1.02% of those by the person with ID 4146.
5.2 Understanding survey data
Next, we read in the survey data:
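As before, the exact command depends on your file; a minimal sketch, assuming a local CSV (the file name is a placeholder):
#load the survey data (file name is a placeholder)
survey <- read.csv("survey_data.csv")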
The survey data contains data from two surveys⁴:
- A survey of a convenience sample of students at German universities (November 2022 – January 2023), contacted face-to-face
- A survey of panel participants of an online panel (November 2022 – January 2023), contacted via email
Let us look at the variables included in our data. Please note that operationalizations, e.g. of Political Interest, were partly derived from prior research⁵.
| Variable | Description | Question | Items |
|---|---|---|---|
| ID | Anonymous ID | | Numeric string |
| Age | Age | How old are you? | Numeric string |
| Gender | Gender | What is your gender? | Male, Female, Other |
| Education | Education | What is your highest level of education? | Secondary Degree, A-Levels, University Degree |
| PI1 | Pol. Interest (1) | If I notice that I lack knowledge about a political topic, I get informed about it. | 1 = Strongly disagree, …, 5 = Strongly agree |
| PI2 | Pol. Interest (2) | For me, politics is an exciting topic. | 1 = Strongly disagree, …, 5 = Strongly agree |
| PI3 | Pol. Interest (3) | I often think deeply about a political controversy. | 1 = Strongly disagree, …, 5 = Strongly agree |
| PI4 | Pol. Interest (4) | I follow political events with great curiosity. | 1 = Strongly disagree, …, 5 = Strongly agree |
| PI5 | Pol. Interest (5) | In general, I am very interested in politics. | 1 = Strongly disagree, …, 5 = Strongly agree |
| Use_FB | Facebook Use | How often do you use [platform]? | 1 = Never, …, 5 = Daily |
| Use_TWI | Twitter Use | How often do you use [platform]? | 1 = Never, …, 5 = Daily |
| Use_INST | Instagram Use | How often do you use [platform]? | 1 = Never, …, 5 = Daily |
| Use_YOU | YouTube Use | How often do you use [platform]? | 1 = Never, …, 5 = Daily |
| Use_TELE | Telegram Use | How often do you use [platform]? | 1 = Never, …, 5 = Daily |
| Use_WHATS | WhatsApp Use | How often do you use [platform]? | 1 = Never, …, 5 = Daily |
| Trust | Trust in News Media | Generally speaking, how much do you trust information by the news media? | 1 = I do not trust them at all, …, 5 = I trust them fully |
We can see that the survey data contains a lot of variables related to…
- Identifying participants: ID
- Sociodemographic characteristics: Age, Gender, Education
- Political Interest: PI1, PI2, PI3, PI4, PI5
- Social Media Use: Use_FB, Use_TWI, Use_INST, Use_YOU, Use_TELE, Use_WHATS
- Trust in News Media: Trust
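If you want a compact overview of all these variables and their types directly in R, glimpse() from dplyr is a handy check:
#compact overview of all survey variables and their types
glimpse(survey)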
5.3 Matching survey data to data donations
To automatically match survey responses in survey (our survey data) to the donated data in classification_news (the aggregated data frame with the % of news-related searches per person), we can use the anonymized ID of each participant:
- variable external_submission_id in classification_news
- variable ID in survey
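Let us briefly peek at both data sets; head() is one way to produce the previews shown below:
#check the first two rows of the aggregated data donations
head(classification_news, 2)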
## # A tibble: 2 × 2
## external_submission_id share
## <int> <dbl>
## 1 3861 0
## 2 4146 1.02
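And the same for the survey data:
#check the first two rows of the survey data
head(survey, 2)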
## ID Age Gender Education PI1 PI2 PI3 PI4 PI5 Use_FB Use_TWI Use_INST Use_YOU Use_TELE Use_WHATS Trust
## 1 10494 66 Female University Degree 5 4 5 5 5 4 2 3 4 1 5 3
## 2 4411 18 Male A-levels 5 5 5 5 5 1 1 5 5 3 5 4
To merge both data sets based on the anonymous IDs, we use the left_join() command, which is part of the dplyr package. Here, we have to tell R to match corresponding rows by the variable external_submission_id in classification_news and ID in survey:
#Combine both
data_combined <- classification_news %>%
  left_join(survey, by = c("external_submission_id" = "ID"))
#check result
head(data_combined, 2)
## # A tibble: 2 × 17
## external_submission_id share Age Gender Education PI1 PI2 PI3 PI4 PI5 Use_FB Use_TWI Use_INST Use_YOU
## <int> <dbl> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 3861 0 18 Male A-levels 3 1 2 2 2 1 2 4 3
## 2 4146 1.02 24 Male University De… 5 4 4 4 4 1 1 4 5
## # ℹ 3 more variables: Use_TELE <int>, Use_WHATS <int>, Trust <int>
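Note that left_join() keeps every person in classification_news, whether or not a matching survey response exists. If you want to check for unmatched IDs on either side, a minimal sketch using dplyr's anti_join() (the object names are placeholders):
#donations without a matching survey response
donations_only <- classification_news %>%
  anti_join(survey, by = c("external_submission_id" = "ID"))
#survey responses without a matching donation
survey_only <- survey %>%
  anti_join(classification_news, by = c("ID" = "external_submission_id"))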
That’s it! Our classified and aggregated data donations have now been merged with the survey responses, and we can start preparing our survey data for multivariate analysis.
5.4 Take Aways
Vocabulary:
- Aggregating: Summarizing information in a data set at a broader level, for instance collapsing several observations per individual into a single observation per individual that contains all information
- Merging: Combining two different data sets, here data from our data donations and the survey
Commands:
- Merging: left_join()
Let’s keep going with Tutorial 6: Preparing survey data
4. For the sake of simplicity, we will not distinguish between the two data sources. But be aware that both samples were contacted differently and different patterns of self-selection may be at play.↩︎
5. Scale based on: Otto, L., & Bacherle, P. (2011). Politisches Interesse Kurzskala (PIKS) – Entwicklung und Validierung. Politische Psychologie, 1(1), 19–35.↩︎