Sponsors: Elias Dinas and Juho Härkönen, SPS Department


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.



Content and Organization of the workshop


  1. Introduction to R computing

    • taught by Shpend Kursani: Shpend.Kursani@eui.eu
    • April 9th, 9:00 - 13:00, Seminar Room 2 at Badia Fiesolana

  2. Data management I: Shaping and creating different datasets and variables

  3. Data management II: Relational Data and regular expressions



Content of Data management I and II


  1. Debate in data management:

  2. Procedure in data management:

    • The script: ethics and practice
    • The data: ethics and practice

  3. A grammar for data management:

    • Programming using dplyr and pipes
    • Separate and Unite
    • Spread and Gather

  4. Tips for a tidy dataset:

    • The time dimension of datasets and the date
    • Missing values: identification and possible correction
    • Unique IDs, their use, and anonymization
    • Duplicated observations

  5. A toolkit for working with strings:

    • Regular expressions and functions

  6. Relational data:

    • The logic of data joins in R
    • Inner and outer joins

  7. Anything else?


Goals in Data management 1



Recap from yesterday’s session


  • Quiz: the basic grammar of data management

    • filter(): ?
    • arrange(): ?
    • select(): ?
    • distinct(): ?
    • mutate(): Add new columns as functions of existing columns
    • summarise(): Collapses a data frame to a single row (e.g. aggregation)
    • group_by(): ?
    • na.omit(): ?
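As a warm-up, here is a minimal sketch (with a made-up data frame, not the workshop data) exercising these verbs:

```r
library(dplyr)

toy <- data.frame(country = c("A", "A", "B", "B"),
                  year    = c(1990, 1991, 1990, 1991),
                  gdp     = c(10, 12, 20, NA),
                  stringsAsFactors = FALSE)

filtered <- toy %>% filter(year == 1991)          # keep rows that satisfy a condition
ordered  <- toy %>% arrange(desc(gdp))            # reorder rows (NA goes last)
means    <- toy %>%
  group_by(country) %>%                           # break the rows into groups
  summarise(mean_gdp = mean(gdp, na.rm = TRUE))   # collapse each group to one row
complete_rows <- na.omit(toy)                     # drop rows containing NA
```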


Workflow example 1: getting tidy dataset


  1. Import the dataset ‘gdp.csv’ in R.
  2. Change the layout of the data so that the columns refer to country, year, and gdp
  3. Create a new column named ‘label’. This column should contain the country abbreviation for each observation.
  4. Move forward the first value of each country-group and delete the first row of each country-group
  5. Save the dataset
library(tidyverse)
library(stringr)

#0
gdp<- read.csv("./data/gdp/gdp.csv", stringsAsFactors =  FALSE)[,-1]


#1
gdp <- gdp %>% gather(2:170, key = country, value = GDP, na.rm= TRUE, convert =  TRUE)

#2 
gdp$label<- gdp$GDP
gdp$label <- str_replace_all(gdp$label, "[0-9]{1,10}", "") # creates implicit missing values! 
gdp$label[gdp$label == ""]<- NA

# 3

gdp <-fill(gdp, label) # to move forward values, and fill NA in this way 

gdp <- gdp %>% 
    group_by(country) %>% 
    slice(-1) # filter by row position OR

gdp <- gdp %>% 
    group_by(country) %>% 
    filter(cgdppc != "year") # filter by value


# extras: nice presentation
names(gdp)[1:3] <- c("year", "country", "gdp")

# 4 
write.csv(gdp, "gdpnew.csv")


Goals in Data management 2


install.packages(c("tidyverse", "foreign", "stringr", "haven", "openxlsx", "readxl", "eeptools", "tm", "quanteda", "pdftools", "rvest", "xml2"))


1. Debate about data management



2. Procedure in data management



2.1. The script: ethics and practice


  • Our motto: Will you still understand how the data were organized, processed and used for your research in one year?

  • Our practice:

    1. Structure your mind properly

      • …Figure out the logical structure of what to do (write it down on paper)
      • …Describe those tasks in terms of the functions of a package (dplyr in our case)
      • …Execute the code (first on a sample, if not sure about the outcome)

    2. Structure your code properly

      • Headlines
      • Indentation
      • Write robust syntax (e.g. relative paths)

    3. Comment your code

      • Data sources, data manipulations, steps of analysis
        • …commenting increases readability and reproducibility
      • but you can’t comment too much
      • RMarkdown is really helpful
      • RMarkdown really helpful


  • What is R Markdown? A notebook interface to weave together code and text

    • Useful? reproduction, automated reports in any format, control (analysis <-> report)
    • Free? please…click here
    • My experience? collecting data on elite defection, changing the layout of panel data for survival analysis, and visualization
    • … and sharing my scripts and results online: RStudio Connect

  • Today

    • …We learn the basic use.
    • …And how to upload to RStudio Connect; others use GitHub.

  • Another day

    • Encoding, GitHub, YAML, CSS…potential topics
# Get Rmarkdown installed
install.packages("rmarkdown")

# Generate a new file, check names and format of the file
    # White part = the space of text --> Information of the steps of your analysis
    # Grey part = R code as a separate paragraph--> analysis and specific comments about a command

#How does it work?
    # treat it as a normal script, at the end of your work, you can click the knit button and get the file you want. 

# get Rstudio Connect 
 # 1) go to the eye looking symbol right to the "run" button 
 # 2)click: "manage account" and add this to the server: https://bookdown.org
 # 3) follow the instructions to have access to Rs connect 
 # 4) log in your account


  • How can I learn more? Here are some links for LaTeX-Rmarkdown and html-Rmarkdown


2.2. Data: ethics and practice


  • Our motto:

    • Data should be consistent, complete and informative
    • There is beauty in simplicity

  • Our practice:

    • Simple code is always better
    • Give each vector a suitable class and labels
    • Separate data entry from coding as much as possible
    • Divide complex tasks into several small steps; and if possible, perform tasks automatically
    • Create vectors and routines that enable the detection of errors
    • Perform double-entry of the data to avoid unsolvable mistakes
    • all in all, today’s tricks to save time when preparing your data for analysis
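For instance, "give each vector a suitable class and labels" might look like this (a sketch with made-up regime codes, not the workshop's coding scheme):

```r
# raw numeric codes straight from data entry
regime_code <- c(1, 2, 1, 3)

# store them as a labelled factor, so the vector documents itself
regime <- factor(regime_code, levels = c(1, 2, 3),
                 labels = c("party", "military", "personal"))

table(regime)  # party: 2, military: 1, personal: 1
```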


Install packages for today’s session (if you don’t have them!)

install.packages(c("tidyverse", "foreign", "stringr", "haven", "openxlsx", "readxl", "eeptools", "tm", "quanteda", "pdftools", "naniar", "rvest", "xml2"))


Import the data of today’s session

  • The breakdown of authoritarian regimes

    • The number of countries named as authoritarian and their duration
    • Types of autocracies according to the organization to which the ruler belongs
    • Types of regime failure
    • see page 15 of the codebook for further information


# packages
library(tidyverse)
library(readxl)
library(openxlsx)
library(haven)
library(stringr)
 
# import dataset 

gwf <- read.csv("./data/regime type/autocracydata.csv" , stringsAsFactors = FALSE) # change working directory
View(gwf)

  • Question: how do we know that this dataset is informative and complete? any thoughts?


3. A grammar for data management



Select and filter rows by position


  • select(): selects columns, while filter() selects rows (e.g. countries, individuals, etc.)

  • The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame


# Task: select all the columns, excepting the first one that refers to the number of rows

shortgwf <- select(gwf, -c(X))  # conversely select(gwf, cowcode:gwf_monarch)

shortgwf <- gwf %>% select(-X) %>% filter(year == 1995)


# select(dataset, columns) in columns you can add the name of the vector or its position. Positive values mean the column will be kept, while negative values mean the column will be dropped

# sometimes, writing the column names does not work. To solve the problem, wrap the column name in backticks: `name`.

# Task: filter out the regimes that are monarchies

shortgwf <- filter(shortgwf, gwf_monarch != 1) # != not equal to

# Question: can I combine both functions? Yes, pipe them

shortgwf <- gwf %>% filter(gwf_monarch != 1) %>%  select(-X)

# how is the old-school way to do this? 


# shortgwf<- gwf[, -1 ]
# shortgwf<- gwf[gwf$gwf_monarch != 1, ]


Arrange


  • arrange(): reorder rows


# task: common practice...datasets in an alphabetical order 

colnames(shortgwf)
shortgwf <- gwf %>% arrange(gwf_country) # write the column names (or positions) within arrange() in the order you want

# what if I want the countries in reverse alphabetical order?
shortgwf <- gwf %>% arrange(desc(gwf_country))

# Q: is the following command equivalent to combining the two previous ones?

shortgwf <- gwf %>% arrange(gwf_country, desc(year))

# how is the old-school way to do this? 
#shortgwf <- shortgwf[order(shortgwf$gwf_country, desc(shortgwf$gwf_regimetype)), ]


Add new (transformed) columns


  • mutate(): transformation of columns executed line-by-line

  • mutate_all(): transformation of all the columns of a dataset. Any idea about its application?

  • mutate_at(): transformation of multiple columns selected with a character vector or vars()

  • mutate_if(): transformation of multiple columns if condition satisfied


# how can we enumerate rows?
shortgwf <- gwf %>% mutate( rows = row_number()) # kind of useful to number the rows; the numbering should be updated after deleting a row


# which was the classic approach? 
#shortgwf$rows <- 1:nrow(shortgwf)

# you can apply a function to each of several columns 

shortgwf <- shortgwf %>% mutate_at(c("rows", "year"), .funs = lag) # problem?

shortgwf <- shortgwf %>% mutate_at(c("rows", "year"), .funs = list(new = lag)) 

shortgwf <- shortgwf %>% mutate(new = lag(year), new2 = lag(rows))
# what is the difference with respect to the previous code?


# let's say that you want to change all the integer vectors to numeric

shortgwf <- shortgwf %>% mutate_if(is.integer, .funs = as.numeric)

# only specific columns, selected by name or by a matching pattern?

shortgwf <- shortgwf %>% mutate_at(c("rows", "rows_new"), .funs = as.integer) # OR

shortgwf <- shortgwf %>% mutate_at(vars(contains('new')), .funs = as.integer) 


Group comparison and aggregation


  • We can apply the above functions to subgroups within the dataset

  • The group_by() function describes how to break a dataset down into groups of rows


# total duration of every regime in a given country

shortgwf <- gwf %>% group_by(gwf_casename) %>% mutate( total = n()) 

# Question: what if we change ```n()``` for nrow()? what can these functions be useful for?

shortgwf <- shortgwf %>% group_by(gwf_casename) %>% mutate( total2 = row_number())

# OR shortgwf <- shortgwf %>% group_by(gwf_casename) %>% mutate( total = n(), total2 = row_number()) 


# also more than one variable can be generated 

shortgwf <- shortgwf %>% group_by(gwf_casename, year) %>% mutate( total_months = total*12)


Create summaries of the data frame


  • The summarise() function will create summary statistics for a given column in the data frame

  • think about the mean, sd, max and min values, lag values…

  • summarise_all(): applies the same summary statistic to all the columns

  • summarise_at(): applies the same summary statistic to more than one column


# create a list of regime-cases


list_gwf <- shortgwf %>% select(gwf_casename, total,total_months)  %>%  summarise(sum = sum(total))  

list_gwf <- select(list_gwf, -year) %>% distinct() # what's going on with Argentina 43-46?

# we want to calculate the average length of every regime-case in years and months

regime_mean <- shortgwf %>% select(gwf_regimetype, total, total_months) %>% group_by(gwf_regimetype) %>% summarise_all(funs(mean = mean, sd = sd))

regime_mean2 <- shortgwf %>% select(gwf_regimetype, total, total_months) %>% group_by(gwf_regimetype) %>% summarise_at(vars(total:total_months), funs(mean = mean, sd = sd))

# Q: compare the outcome of both functions. What are the advantages/disadvantages of using one function or the other?


# old_style <- shortgwf %>% select(gwf_casename, total, total_months)  %>% group_by(gwf_casename) %>% summarise(mean1 = mean(total), mean2 = mean(total_months) and the sd...)


  • How can I learn more? check this blog to find some ideas about how to use summarise() and group_by(). Here is another blog in case you need to remember the theory again


Exercise: basic grammar

  1. Using the dataset ‘gwf’…

  2. In one line of code, we need a dataset with the columns gwf_country through gwf_enddate plus gwf_regimetype, and… remove the observations that fall into the regime type ‘oligarchy’. Check you did it right

  3. Create a new vector that contains information about the duration of every authoritarian regime type in a given year (tip: you might need row_number()) in one line of code.

  4. distinct() is a function that returns unique values. a <- shortgwf %>% group_by(gwf_casename) %>% distinct() Why is this function not working? What shall we do to obtain a list of regime-cases with their starting and ending date?

Solution

# Exercise 1
exercise1 <- gwf %>% select(gwf_country:gwf_enddate, gwf_regimetype ) %>% filter(gwf_regimetype != "oligarchy")

# Exercise 2
exercise2 <- exercise1 %>% group_by(gwf_casename) %>% mutate(duration = row_number()) # duration 

# Exercise 3 
exercise3 <- exercise2 %>% select(-duration) %>% distinct()


Tidyr to reshape your data


  • The tidyr package is the most intuitive package to reshape your data

  • Basic logic

    • clearly identify your observations (rows) and your variables (columns)

  • Functions

    • separate(): separate one column into several
    • unite(): unite in one column multiple columns
    • spread(): spread rows into columns
    • gather(): gather columns into rows
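A minimal round trip through the four verbs, using a made-up dataset (not the workshop data):

```r
library(tidyr)

untidy <- data.frame(id  = c("FR_1990", "FR_1991", "DE_1990", "DE_1991"),
                     gdp = c(10, 12, 20, 22),
                     stringsAsFactors = FALSE)

tidy  <- separate(untidy, id, into = c("country", "year"), sep = "_")  # one column into several
again <- unite(tidy, id, country, year, sep = "_")                     # several columns into one

wide <- spread(tidy, key = year, value = gdp)             # rows spread into columns
long <- gather(wide, key = year, value = gdp, -country)   # columns gathered back into rows
```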


Separate and Unite


  • Sometimes… a column contains several values –> separate()

  • Sometimes… a single variable is spread across multiple columns –> unite()


# THE DATASET TO WORK NOW
tidy_data <- exercise3

# create three columns from the column 'gwf_startdate'

tidy_data1 <- tidy_data %>% separate(gwf_startdate , into = c("year", "month", "day"), sep = "-", remove = TRUE, convert = TRUE) 

# It works as follows: separate(the column to separate, into = the names of the new columns, sep = the symbol separating the values)


# additional useful argument --> 
# 'remove = FALSE' does not delete the original variable 
# 'convert = TRUE' gives the new columns a proper data type

# list <- list %>% separate(gwf_startdate , into = c("year", "month", "day"),  sep = "-", remove = FALSE, convert = TRUE ) 


# also, you do not need a symbol to separate on: a numeric 'sep' splits by position

tidy_data1 <- tidy_data1 %>% separate(year , into = c("century", "years"), sep = 2) 
# in sep, positive values count from the far left of the string, negative values from the far right


# Oh, god! that was not the right variable. How can I get my column back!!??

tidy_data1 <- tidy_data1 %>% unite(year, century, years) 

# how can I remove the '_'? by adding sep = "" when uniting

tidy_data1 <- tidy_data %>% separate(gwf_startdate , into = c("year", "month", "day"), sep = "-", remove = TRUE, convert = TRUE) 

tidy_data1 <- tidy_data1 %>% separate(year , into = c("century", "years"), sep = 2) 

tidy_data1 <- tidy_data1 %>% unite(year, century, years, sep= "") 

# '_' is the default separator when uniting values from different columns. We can specify a different "sep"


Spread and Gather


  • Sometimes we encounter the problem that observations are scattered across multiple rows –> spread()

    • time series example in the file ‘economic data.xlsx’

  • Sometimes…some of the column names are not names of variables, but values of a variable –> gather()


# create a dataset in which each regime type becomes a column and the cell value = year

spread_data <- tidy_data1 %>% select(gwf_casename, gwf_regimetype, year) %>%  spread(key = gwf_regimetype, value = year) # horizontally

# this can be useful to estimate some descriptive statistics for each regime type

spread_data <- spread_data %>% summarise_all(min , na.rm = TRUE ) # not possible to apply two functions :'(


############# spread_data is a very ugly dataset, let's put it in its original form. in gather we trust #############

# We need the set of columns that represent values, not variables. In this example, those are the regime-type columns.

# We need the key: the name of the variable whose values form the column names. Here it is gwf_regimetype.

# We need the value: the name of the variable whose values are spread over the cells. Here it is year.

united_data <- spread_data %>% gather(2:10, key = gwf_regimetype, value = year) %>% drop_na() # why so ugly?


  • spread() makes long tables shorter and wider

  • gather() makes wide tables narrower and longer


4. Tips for a tidy dataset


4.1. Time is of the essence: Introduction to time data-types

Time data vectors

  • Sometimes datasets contain information representing calendar dates and times

  • The problem: time data can be stored in different formats

    • as.character()
    • as.POSIXct()
      • Internally, R saves the number of seconds since the start of January 1, 1970. Negative values are seconds before this date.
      • Outcome: prints a single value; a useful format to store in data frames
    • as.POSIXlt()
      • saves a list with different types of entries (e.g. sec, min, hours, years since 1900)


# data for the section
tidy_data <- exercise3


# save time data with POSIXct
value <- as.POSIXct(unlist(tidy_data[2,3]))
print(value)
typeof(value) # double? an integer-like value stored as double-precision floating point, useful for mathematical purposes
cat(value,"\n") # how it is stored in R internally

# save time data with POSIXlt
list_date <- as.POSIXlt(Sys.time())
print(list_date)
typeof(list_date)
# cat(list_date, "\n") fails: a POSIXlt is internally a list, which cat() cannot handle; use unclass(list_date) to inspect it
names(list_date)
list_date[[1]] # seconds
list_date[[2]] # minutes
list_date[[3]] # hours
list_date[[4]] # day of month (1-31)
list_date[[5]] # month of the year (0-11)


  • Tip: save time data with the function as.Date():

    • easy manipulation, less risk it gets corrupted, keeps track of the number of days
    • Q: check help(as.Date) and observe the arguments: what else can you specify with this function?


# save time data with as.Date
value2 <- as.Date(unlist(tidy_data[1,3]))
print(value2)
typeof(value2)
cat(value2,"\n")

value2 <- as.Date.character(tidy_data[1,3], format = "%Y-%m-%d")
typeof(value2)


Exercise: obtaining time vectors


  1. Using the tidy_data… change the character vectors “gwf_startdate” and “gwf_enddate” into as.Date format. Tips: you need to write the code in one line and use mutate_at()

Solution
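A one-line solution sketch (shown on a mock data frame with the same column names; the real tidy_data comes from the steps above):

```r
library(dplyr)

# stand-in for tidy_data: two rows with the relevant character columns
tidy_data <- data.frame(gwf_startdate = c("1946-02-24", "1952-03-10"),
                        gwf_enddate   = c("1955-09-19", "1959-01-01"),
                        stringsAsFactors = FALSE)

tidy_data <- tidy_data %>% mutate_at(c("gwf_startdate", "gwf_enddate"), as.Date)
```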


Time data operators

  • Date is R’s internal format best suited to run a number of basic operations

    • Create a new vector that contains time differences
    • Expand dataset conditioned on time/date


library(eeptools)

# create time differences 
tidy_data$duration_month <- age_calc(tidy_data$gwf_startdate, enddate = tidy_data$gwf_enddate, units = "months") # default is in months

# sensitive to omitted values

tidy_data <- na.omit(tidy_data)

tidy_data$duration_year <- age_calc(tidy_data$gwf_startdate, tidy_data$gwf_enddate, units = "years")

 # old school
# Length in months
elapsed_months <- function(end_date, start_date) {
    ed <- as.POSIXlt(end_date)
    sd <- as.POSIXlt(start_date)
    12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

#######################################################################################################

######### ########## Expand dataset conditioned on time differences ######### ######### 

# daily basis
lst <- Map(`:`, tidy_data$gwf_startdate, tidy_data$gwf_enddate) # Map is a base R function; here it builds a list of day sequences

spread_vertical_data <- cbind(tidy_data[rep(seq_len(nrow(tidy_data)), lengths(lst)), ], date = unlist(lst)) # why negative values?

spread_vertical_data$first <- as.Date(spread_vertical_data$date, origin = "1970-01-01") # date format!

library(lubridate)

# monthly basis

spread_vertical_data$second <- format(as.Date( spread_vertical_data$first), "%Y-%m") 
# it shows only the month, but the data is still stored on a daily basis!

spread_vertical_data <- spread_vertical_data %>% select(-c(first,date)) %>% distinct()



4.2. Missing values

Theory

  • Two types of missing values:

    • Explicit, i.e. flagged with NA, 99, -99, etc.
    • Implicit, i.e. simply not present in the data.

  • We always have to make the NAs explicit! and be consistent with their coding

Practice

  • Identification: either numerical or visual

    • install.packages("naniar") (today)

  • Procedure to deal with missingness

    • Omit the missing values (today)
    • Imputation (another day)
      • useful reading: King’s paper and Graham’s paper
      • useful packages: Amelia and bucky install.packages("bucky", repos=c("http://r.tahk.us/",getOption("repos")))


1. Identification

  • We always have to make the NAs explicit! and be consistent with their coding

    • complete() turns implicit missing values into explicit NA
    • na_if() converts specified values (e.g. -99) into NA
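A small sketch of both steps on made-up panel data (complete() comes from tidyr, na_if() from dplyr):

```r
library(dplyr)
library(tidyr)

panel <- data.frame(country = c("A", "A", "B"),   # B-1991 is absent: an implicit missing
                    year    = c(1990, 1991, 1990),
                    gdp     = c(10, -99, 20),      # -99 flags an explicit missing
                    stringsAsFactors = FALSE)

panel <- panel %>%
  complete(country, year) %>%       # adds the B-1991 row with gdp = NA
  mutate(gdp = na_if(gdp, -99))     # recodes the -99 flag as NA
```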


library(naniar)
# generating a dataset
df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep= T)))

df <- mutate_all(df, funs(na_if(.,-99)))
 
# sometimes empty strings need to be converted into NA

df <- data.frame(replicate(6, sample(c(1:10, ""), 6, rep= T)))

mutate_all(df, funs(na_if(., "")))
 

####### Old school: a function to fix missings ####### 
fix_missing <- function(x){
  x[x == -99] <- NA
  x
}
# change the -99 in case your missing flag is a different value.

df[] <- lapply(df, fix_missing) # apply it to the whole dataset; of course, you can also use the function to fix one particular column

df$X1 <- fix_missing(df$X1)


  • Numerical summaries of missing data in variables and cases

    • miss_var_summary(): summary for each variable
    • miss_case_summary(): summary for each case
    • …or old school functions!

  • Visual summary of missing data in variables and cases, check this


# dataset of the section
gwf_2 <- gwf

library(naniar)
report1 <- miss_var_summary(gwf_2, order = TRUE, add_cumsum = TRUE)
report2 <- miss_case_summary(gwf_2, order = TRUE, add_cumsum = TRUE)

# old school, I prefer it

colSums(is.na(gwf_2))
apply(is.na(gwf_2), 2, which) # shows the rows with missing values and columns 

# subset 
missing_rows <- gwf_2 %>% filter(is.na(gwf_regimetype)) # rows where the regime type is missing, or

no_missing <- gwf_2 %>% drop_na() # drop every row that contains an NA

  • Q: Why did the data get missing? Implications? Solution?


2. Solution 1: Omit the missing values

  • dplyr solves it by using na.rm = TRUE


df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep= T)))

df <- mutate_all(df, funs(na_if(.,-99)))

# way to correct the previous mess
df2 <- df %>% gather(5:6, key = beer, value = cost, na.rm = TRUE) # exclude missing values; UPDATE positions based on the location of the missing values!

# several ways in the old.fashion... one of them list <- na.omit(list) (https://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame)

df %>% summarise(mean_cost = mean(X4, na.rm= TRUE), sd_cost = sd(X4, na.rm= TRUE))

# also na.rm when running regressions!


2. Solution 2: fill missing values

  • fill() function from the tidyr package

    • It takes a set of columns where you want missing values to be replaced by the most recent non-missing value
    • for example…
no_missing <- gwf_2 %>% filter(is.na(gwf_regimetype))

gwf_2 <- gwf_2 %>% group_by(gwf_casename) %>% fill(gwf_regimetype)  

# check missings 

no_missing2 <- gwf_2 %>% filter(is.na(gwf_regimetype)) 

gwf_2 <- gwf_2 %>% group_by(gwf_casename) %>% fill(gwf_regimetype, .direction = c("up"))  

# add manual entry for Iran and later apply fill with direction = down


2. Solution 3: Imputation

  • See the info at the beginning of this section


4.3. Duplications


  • Two identical observations or very similar observations are in the dataset

    • Caused by: manual data entry (e.g. misspellings and extra white space) or functions (e.g. expanding datasets…). Any thoughts?
    • Consequences?

  • distinct() is a dplyr function to deal with duplicates

gwf_2 <- gwf

data <- gwf_2 %>% select(cowcode, gwf_casename, gwf_regimetype) # all repeated

data <- gwf_2 %>% select(cowcode, gwf_casename, gwf_regimetype) %>% distinct() # tadaaaa


  • sometimes we need more in-depth work…

  • Process:

    1. Search or create unique obs.: unit of analysis + a col with a unique value per obs. (e.g. time = year)
    2. Count how many times each unique observation appears
    3. Evaluate the source of the error (e.g. misspell = idiosyncratic?)
    4. Deal with the error: delete or recode


# If the problems are misspells...

# the process of creating ids

  gwf_2$id_duplica <- paste0(gwf_2$cowcode, sep = "." , gwf_2$year)
  gwf_2$id_duplica <- tolower(gwf_2$id_duplica) #everything in lower case
  gwf_2$id_duplica <- gsub('\\s{1,4}','', gwf_2$id_duplica )# remove empty spaces in the cell
  gwf_2 <- arrange(gwf_2, id_duplica) # to put some alphabetic order here...
  gwf_2 <- transform(gwf_2, id_duplica = as.numeric(factor(id_duplica)))
  
 
  sum(table(gwf_2$id_duplica)-1) # 42 observations are duplicates, but where?

  gwf_2 <- gwf_2 %>% group_by(id_duplica) %>% mutate(n=n())
  
  a <- gwf_2 %>% filter(n > 1) # inspect the duplicated country-years

  • Q: In the case of ‘data’ (see chunk above), what shall we do to remove duplicates? is it possible?




5. A toolkit for working with strings



library(stringr)

# change working directory
gwf <- read.csv("./data/regime type/autocracydata.csv", stringsAsFactors = FALSE) # change working directory


names(gwf) <- names(gwf) %>% str_replace_all("gwf_", "")

# old school:  names(gwf)[1:15] <- c("write here 15 names")




5.1. Regular expressions


  • Regexps are generalizable text patterns for searching and manipulating data

  • We focus on exact matching and generalizable expressions

Exact matching

  • Identify a character vector with the exact same values


example.string <- "</P> Dip. Rodirgo Gromenawer, Perez  Elected: PRIS -Parlamentary group:PIRS 2010</P>"

# A string matching itself
str_extract(example.string, "PRI")

# multiple matches
str_extract_all(example.string, "PRI")

# case sensitivity
str_extract(example.string, "pri")
str_extract_all(example.string, regex("pri", ignore_case =TRUE))


Generalizable regexps

  • The power of these regexps lies in formulas that let you write more flexible and generalized queries

  • Some regular expressions

    • [:digit:]: Digits: 0 1 2 3 4 5 6 7 8 9
    • [:lower:]: Lower-case characters: a-z
    • [:upper:]: Upper-case characters: A-Z
    • [:alpha:]: Alphabetic characters: a-z and A-Z
    • [:alnum:]: Digits and alphabetic characters
    • [:punct:]: Punctuation characters: ‘.’, ‘,’, ‘;’, etc.
    • [:graph:]: Graphical characters [:alnum:] and [:punct:]
    • [:blank:]: Blank characters: Space and tab
    • [:space:]: Space characters: Space, tab, newline, and other space characters

  • Additional shortcuts

    • \d: matches digits = [0-9]
    • \D: matches non-digits = [^0-9]
    • \w: matches any word character = [a-zA-Z0-9_]
    • \s: matches any whitespace character
    • \S: matches any non-whitespace character

  • but also a set of operators

    • ^: match the beginning of a string
    • $: match the end of a string
    • |: match one of several alternatives
    • .: match any single character
    • [ ]: match a range of characters or numbers included within the brackets
example.string <- "</P> Dip. Rodirgo Gromenawer, Perez  Elected: PRIS -Parlamentary group:PIRS</P>"

# match the beginning of a string
str_extract_all(example.string, "^</P>")

# match the end of a string
str_extract_all(example.string, "PRI$")
str_extract_all(example.string, "</P>$")

# pipe operator
unlist(str_extract_all(example.string, "Dip|PRI"))

# wildcard
str_extract(example.string, "R.+go") 

# character class (range)
str_extract(example.string, "P[RSTUV][IJKLM]S")
str_extract(example.string, "Ro[:alpha:][:alpha:][:alpha:]go")

# quite annoying to repeat the same expression several times, right?


  • Quantifiers: to avoid repeating the same expression:

    • ?: The preceding item is optional and will be matched at most once
    • +: The preceding item will be matched one or more times
    • {n}: The preceding item is matched exactly n times
    • {n,}: The preceding item is matched n or more times
    • {n,m}: The preceding item is matched between n and m


# greedy quantification
str_extract(example.string, "R[:alpha:]{1,6}o")

# the preceding item is the wildcard
str_extract(example.string, "Dip.+Elected") 

# the preceding item is the wildcard
str_extract_all(example.string, ".+?PRI") 

# meta characters
unlist(str_extract_all(example.string, "\\.")) # we only want the dot. The dot has a special meaning in regular expressions, so it must be escaped

# assertions
unlist(str_extract_all(example.string, "(?<=</P> ).+")) # positive lookbehind: (?<=...)
unlist(str_extract_all(example.string, ".+(?=</P>)")) # positive lookahead (?=...)

# to create strings, both single and double quotes work
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

# to include a literal single or double quote in a string you can use \ to escape it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

# to extract info between brackets or quotations: two strategies

str_extract(string2, '(\").*?(\")')
str_extract(string2, "(?<=\")(.*?)(?=\")") # 

#  Also: negative lookbehind: (?<!...) and  negative lookahead (?!...), never used


  • Once again, it is a matter of trial and error. Here is a useful link to test whether or not your regexps work before running the code


Exercise: first contact with regexps

  1. Describe the types of strings that conform to the following regular expressions and construct an example that is matched by each regular expression.
    • "[[:digit:]]{4,}"
    • ".*?\\.txt$"
    • "<(.+?)>.+?</\\1>"
    • example: str_extract_all("Phone 150$, PC 690$", "[0-9]+\\$") # extracts every number that precedes $

Solution
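A possible solution sketch (the example strings below are invented):

```r
library(stringr)

# "[[:digit:]]{4,}": strings containing four or more consecutive digits (e.g. a year)
m1 <- str_detect("founded in 1984", "[[:digit:]]{4,}")

# ".*?\\.txt$": strings ending in ".txt" (e.g. a plain-text file name)
m2 <- str_detect("notes.txt", ".*?\\.txt$")

# "<(.+?)>.+?</\\1>": a matching pair of opening and closing tags (simple HTML/XML)
m3 <- str_detect("<b>bold text</b>", "<(.+?)>.+?</\\1>")

c(m1, m2, m3)  # all TRUE
```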


5.2. String manipulation


  • Some functions to use regexps

    • str_extract(): extracts the first string that matches the pattern (returns a character vector)
    • str_extract_all(): extracts all strings that match the pattern (returns a list of character vectors)
    • str_locate(): returns the position of the first pattern match (matrix of start/end positions)
    • str_locate_all(): returns the positions of all pattern matches (list of matrices)
    • str_replace(): replaces the first pattern match (character vector)
    • str_split(): splits a string at the pattern (list of character vectors)
    • str_split_fixed(): splits a string at the pattern into a fixed number of pieces (matrix of character vectors)


library(stringr)


# joining
str_c("text", "manipulation", sep = " ")
str_c("text", c("manipulation", "basics"), sep = " ")


# working example 
example.string <- "</P> Dip. Rodirgo Gromenawer, Perez  Elected: PRI -Parlamentary group:PRI</P>"

# locate
str_locate(example.string, "Rodirgo Gromenawer, Perez")

# substring extraction
str_sub(example.string, start = 11, end = 35)

# replacement
str_replace(example.string, pattern = "</P>", replacement = "")
str_replace_all(example.string, "</P>",  "")

# splitting
str_split(example.string, ":") %>% unlist


Exercise: string manipulation


  1. The following code hides a secret message. Crack it with R and regular expressions.
secret <- "clI.0pow1zLstc0d87wnkig7Ovd000dgdfgdfsdfeeeggvhryn92V55juwczi8hqrfpxs5j5dwpn0Enwo. Cwisdij76j8kpf03AT5dr3coc0bt7yczjataootj55t3j3ne6c4Sfek.W1w1IwwojigT d6vrf5656Hrbz2.2bkTnbhzgv4I9i05zEcropwSgnb.YqoO65fUa1otfb7wm24k6t3s3339zqe5 fy89n6?d5t9kc4fe905gmr"


Solution
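One possible way to crack it: the upper-case letters (plus '.' and '?') appear to carry the message, with '.' acting as a word separator. A sketch:

```r
library(stringr)

secret <- "clI.0pow1zLstc0d87wnkig7Ovd000dgdfgdfsdfeeeggvhryn92V55juwczi8hqrfpxs5j5dwpn0Enwo. Cwisdij76j8kpf03AT5dr3coc0bt7yczjataootj55t3j3ne6c4Sfek.W1w1IwwojigT d6vrf5656Hrbz2.2bkTnbhzgv4I9i05zEcropwSgnb.YqoO65fUa1otfb7wm24k6t3s3339zqe5 fy89n6?d5t9kc4fe905gmr"

# keep only capitals, periods and question marks; drop the noise
hidden <- paste(unlist(str_extract_all(secret, "[[:upper:].?]")), collapse = "")
msg <- str_replace_all(hidden, fixed("."), " ")
msg  # "I LOVE CATS WITH TIES YOU?"
```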


Workflow example 2: working with PDFs


  • Goal: we want to construct a dataset out of the PDF data (see legislatura 27.pdf)

    • cols: name of MPs, party affiliation, state/district where the MP was elected
    • we need to install the tm package.

  • Process

    1. Is there any pattern in the text that we can use in our favor? Planning!
    2. set the working directory
    3. set the corpus: an object that represents a collection of documents
    4. Perform the strategy to extract the information into a character vector
    5. Check for problems in the data extraction
    6. Clean the vectors
    7. Combine the vectors into a data.frame. Further noise that needs to be cleaned?


# import the PDF into R == setting up a corpus; we need the tm package

library(tm)
library(pdftools)
library(stringr)

# 1. my working directory:  D:\Biblioteca D\courses material\data managment in R\data

cname <- file.path("D:", "Biblioteca D", "courses material", "data managment in R", "data") # telling R where the PDF files are; each "\" in the path becomes a comma-separated component

# 2. Corpus

Rpdf <- readPDF(control = list(text = "-layout")) # layout control in order to keep the original format as much as possible 

# Rpdf <- readPDF(engine = "xpdf", control = list(text = "-layout")) # xpdf engine, similar to the default, but needs to be installed: http://www.xpdfreader.com/

docs <- Corpus(DirSource(cname), readerControl=list(reader=Rpdf)) # upload documents

writeLines(as.character(docs[[1]]))

## problems with encoding? The calls below change R's locale settings

# Spanish\LATIN: Sys.setlocale("LC_CTYPE", "") 
# Cyrillic: Sys.setlocale(category = "LC_ALL", locale = "Russian")

# Specify the doc we want
content_doc <- docs[[1]]$content 

#################################################################################
# In case the tm code above does not work for you, use pdftools instead:

doc <- pdf_text("./data/Legislatura_36.pdf") # reads a single document at a time

#################################################################################


# 3. extract the info that we need
# name
mps.regex <- "Dip.+[[:alpha:].,]{1,5}"
name_dip<- unlist(str_extract_all(content_doc, mps.regex)) #172?

# party
party <- unlist(str_extract_all(name_dip, ":(.+)"))
party <- unlist(str_replace_all(party, ":", "")) # 171?

# state
state.regex <- "Es[tad][tad].+[[:alpha:]:.+[0-9]]{1,8}"
state_dip<- unlist(str_extract_all(content_doc, state.regex)) # 171?

# 4 and 5. correct errors: there is an error in string 8; which fix shall we use?

#name_dip <- unlist(str_remove(name_dip, "Diputado Favio Altamirano Manlio fallecio."))
#name_dip <- name_dip[-8]

# 6. put everything together in a data.frame once we feel comfortable with the length of the vectors
data<- data.frame(names = name_dip, party.affiliation = party, state = state_dip, stringsAsFactors =  FALSE)


# 5. clean dataset!
data$names <- unlist(str_replace_all(data$names, "Partido:.+" , "")) # remove party in names col
data$names <- unlist(str_replace_all(data$names, "Dip." , "")) # remove dip in names col

data$names_ordered <- str_replace(data$names , "(.+), (.+)", "\\2 \\1") # reorder "surname, first name" into "first name surname". Does it work?

data$names_ordered <- tolower(data$names_ordered) # lower case 

## other tasks: create a id per 


  • How can I learn more? Text as data is a growing topic.

    • Short introductions to the techniques available with the tm package are easy to find online, as are tutorials for the quanteda package


Workflow example 3: working with websites


  • Goal: we want to construct a dataset out of the Wikipedia tables.

    • we need columns with info about candidates name, outcome of election (“winner”, “loser”), party affiliation and district

  • Process: the same as in the previous example!


# packages

library(xml2)
library(rvest)
library(tidyverse)
library(stringr)

#######  data collection stage of the research  #######  

# 1. specify URL
url <- "https://en.wikipedia.org/wiki/Results_of_the_Malaysian_general_election,_2018_by_parliamentary_constituency"
browseURL(url)

# 2. download static HTML behind the URL and parse it into an XML file

web<- read_html(url) %>% html_table( fill = TRUE)

candidates <- rbind(web[[3]], web[[4]], web[[5]], web[[6]], web[[7]], web[[8]], 
                    web[[9]], web[[10]], web[[11]], web[[12]], web[[13]], web[[14]], web[[15]], web[[16]], web[[17]], web[[18]])

#######  data management stage of the research  #######  


# 3. setting the dataset
# columns to delete (10:13) 
candidates <- candidates[,1:9]


# changing labels to facilitate identification
names(candidates)[1]<- "Constituency.code"
names(candidates)[2]<- "Constituency"
names(candidates)[3]<- "Winner"
names(candidates)[6]<- "Loser"
names(candidates)[4]<- "Votes.total.wn"
names(candidates)[5]<- "Majority"
names(candidates)[7]<- "Votes.total.ls"
names(candidates)[8]<- "Incumbent"
names(candidates)[9]<- "Incumbent.Majority"


# create the election-outcome and candidate-name columns

# gathering columns 
candidates <- candidates %>% gather("Winner", "Loser", key = Outcome, value = Candidate.name, na.rm= TRUE, convert =  TRUE) # candidates and outcome 
candidates <- candidates %>% gather("Votes.total.wn", "Votes.total.ls", key = b, value = Votes.number, na.rm= TRUE, convert =  TRUE) # fix N of votes

# remove columns and empty rows, and repeated observations. 
candidates <- candidates %>% select(-b) %>% distinct()
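The two gather() calls above reshape pairs of columns into key/value pairs. A minimal sketch on a toy data frame (hypothetical names):

```r
library(tidyr)

toy <- data.frame(Constituency = c("A", "B"),
                  Winner = c("Alice", "Carol"),
                  Loser  = c("Bob", "Dave"),
                  stringsAsFactors = FALSE)

gather(toy, "Winner", "Loser", key = Outcome, value = Candidate.name)
# one row per candidate; Outcome records whether the name came from the Winner or the Loser column
```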

# remove the header row of each table (rows without a constituency code)
candidates$flag <- candidates$Constituency.code # create a flag

candidates$flag <- str_replace_all(candidates$flag, "P[0-9]{1,3}", "1") 
candidates <- candidates %>% filter(flag == 1) %>% select(-flag)

candidates<- arrange(candidates, Constituency.code)


# correct problem: winner has most of the votes, loser has what is left 

# convert character vectors to numeric 
candidates$Votes.number <- str_replace_all(candidates$Votes.number, ",", "")
candidates$Votes.number <- as.numeric(as.character(candidates$Votes.number))

# listing winners: keep the first obs., which is the one with the largest vote count
list.winners <- candidates %>% group_by(Candidate.name) %>% filter(Votes.number == max(Votes.number) & Outcome == "Winner")


# listing losers: delete the first obs., which is the one with the largest vote count

list.losers <- candidates %>% group_by(Candidate.name) %>% filter(Votes.number != max(Votes.number) & Outcome == "Loser")

# Logic of the collection: the first row corresponds to the winner's values in the table, the second to the runner-up's, and the rest to the values of the other candidates

# combine both lists
candidates2 <- rbind(list.winners, list.losers)
candidates2<- arrange(candidates2, desc(Constituency.code))


# adding a party-affiliation column 
candidates2$Party.affiliation <- str_extract(candidates2$Candidate.name, "\\(.*\\)" ) # str_extract (not _all), so the result is a character vector rather than a list
candidates2$Party.affiliation <- str_replace_all(candidates2$Party.affiliation, "\\(", "")
candidates2$Party.affiliation <- str_replace_all(candidates2$Party.affiliation, "\\)", "")

candidates2$Candidate.name <- str_replace_all(candidates2$Candidate.name, "\\(.*\\)", "" )

# id = name without empty spaces
candidates2$id <- tolower(candidates2$Candidate.name)
candidates2$id <- gsub('\\s{1,4}','', candidates2$id)# remove empty spaces in the cell

# what shall I do with districts whose names have changed? 


### ### ### ### ### other advanced and interesting features of web scraping and regular expressions ### ### ### ### ### 

# adding a column for wikipedia links

# extract url
a <- read_html(url) %>% html_nodes("a") %>% html_attr("href")
a<- unlist(str_extract_all(a, ".wiki.[[:alpha:]]{1,10}\\_+[[:alpha:]]{1,10}\\_.+")) # inspect and select only those links with candidate information 
a<- a[40:304] # simplicity is beauty, remember 

# create full real links 
url.candidate2 <- paste0("https://en.wikipedia.org", a)

wiki.links2 <- data.frame(url.candidate = url.candidate2)
wiki.links2$Candidate.name <- str_replace_all(wiki.links2$url.candidate, ".+.wiki.", "")
wiki.links2$Candidate.name <- str_replace_all(wiki.links2$Candidate.name, "_", " ")
wiki.links2$Candidate.name <- str_replace_all(wiki.links2$Candidate.name, "\\(.*\\)", "" )


# id = name without empty spaces
wiki.links2$id <- tolower(wiki.links2$Candidate.name)
wiki.links2$id <- gsub('\\s{1,4}','', wiki.links2$id)# remove empty spaces in the cell

# remove repeated and non informative rows 
wiki.links2 <- distinct(wiki.links2)

# 6. merge both info
candidates2018 <- left_join(candidates2,wiki.links2, by = "id")
candidates2018<- candidates2018 %>% select(-14) %>% distinct()

### ### ### ### ### ### ### ### ### ###


# WRITE AND SAVE 
write.csv(candidates2018, "./candidateMSY2018.csv")
candidates2 <- read.csv("./candidateMSY2018.csv", na.strings="", stringsAsFactors = FALSE) # sexy stuff

# ;)


  • How can I learn more? The best book on the subject is by now a bit outdated. There are tutorials, but either you have to pay for them or they only cover part of the topic!


6. Relational data



library(foreign)
library(tidyverse)
library(readxl)


# data sets

# Authoritarian regimes dataset 

gwf <- read.csv("./data/regime type/autocracydata.csv", stringsAsFactors = TRUE)[,-1] # change working directory

# dataset about the failure of authoritarian regimes
failure_regime <- read_excel("data/data/failure_regime.xlsx")
View(failure_regime)



Left versus Right joins

left_data <- left_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country"))


right_data <- right_join(gwf, failure_regime, by= c("cowcode", "year" , "gwf_country", "gwf_casename")) # why one obs. more than failure?




# what I use: an updated version of a function posted on Stack Overflow: https://stackoverflow.com/questions/28992362/dplyr-join-define-na-values

left_join_NA <- function(x, y, ...) {
  left_join(x = x, y = y, by = ...) %>% 
    mutate(across(everything(), ~ replace(., which(is.na(.)), 0))) ## replace the implicit NAs with a pre-defined value, 0 in this case
} 

# you can swap in another type of join, or change the default replacement value (e.g. 99 instead of 0)

complete_data <- left_join_NA(gwf, failure_regime, by= c("cowcode", "year", "gwf_country"))
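What left_join_NA() buys us can be seen on a toy country-year example (hypothetical data, assuming the left_join_NA() helper defined above):

```r
library(dplyr)

a <- data.frame(cowcode = c(2, 20, 40), year = 2000)
b <- data.frame(cowcode = c(2, 40), year = 2000, fail = 1)

left_join(a, b, by = c("cowcode", "year"))    # the unmatched row (cowcode 20) gets fail = NA
left_join_NA(a, b, by = c("cowcode", "year")) # the same row gets fail = 0 instead
```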

## classical approaches 
# complete_data[is.na(complete_data)] <- 0 # to change all NA in 0 in a classic manner, better specify each column 

# complete_data$gwf_fail[is.na(complete_data$gwf_fail)] <- 0
# complete_data$gwf_fail_type[is.na(complete_data$gwf_fail_type)] <- 0


Full versus Inner joins

full_data<- full_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country")) # keep all observations.


inner_data <- inner_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country")) # keep only matched observations

# why does inner_data seem to be better than the rest of the datasets? 


  • Q: Why do inner_data and right_data each have one more observation than failure_regime?


Anti versus Semi joins

anti_data <- anti_join(gwf, failure_regime, by= c("cowcode", "year", "gwf_country", "gwf_casename"))

# Anti-joins are useful for diagnosing join mismatches (e.g. countries with slightly different names!), but also work as the opposite of semi_join 

the_data <- anti_join(right_data, anti_data, by= c("cowcode", "year", "gwf_country"))

write.csv(the_data, "the_data.csv")

semi_data <- semi_join(failure_regime, inner_data, by= c("cowcode", "year", "gwf_country")) # Why one obs. fewer?


# Semi-joins and anti-joins are useful for matching filtered summary tables back to the original rows.

top_autocra <-  the_data %>%  group_by(gwf_regimetype) %>% filter(gwf_spell == max(gwf_spell)) # what is this?

semi_data <- the_data %>% semi_join(top_autocra) 

anti_data <- the_data %>% anti_join(top_autocra) # dataset that excludes the longest-lasting authoritarian regimes in the world
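semi_join() and anti_join() filter rows rather than add columns, which is what the snippet above exploits. A minimal sketch on made-up data:

```r
library(dplyr)

regimes <- data.frame(country = c("A", "B", "C"), spell = c(10, 4, 25))
longest <- data.frame(country = c("A", "C"))

semi_join(regimes, longest, by = "country") # keeps the rows for A and C, adds no new columns
anti_join(regimes, longest, by = "country") # keeps only the row for B
```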


Exercise: merging datasets!

  1. Use the gdp data from the previous exercise to build a more complete ‘the_data’ dataset. Identify why merging is not possible right away and which changes need to be made to the datasets in order to merge them. (tip: same column names?)


Solution