1 Prerequisite

Welcome to Artificial Intelligence for Public Health (AI4PH) 2022.

This online tutorial will accompany two sessions:

- Tutorial on text analytics with R
- Data Challenge using the N2C2 NLP Research Datasets

1.1 Before the sessions

- Install R and RStudio
- Create a new R Project in RStudio (via “File -> New Project…”)
- Install the required R packages
- Access the Twitter data for the tutorial session (see the GitHub link)

1.1.1 Install Required R Packages

Run the code below to install and load the required packages.

# specify the packages to install or load in a vector
packages <- c("tidytext","tidyverse","tidymodels","scales",
              "quanteda","SnowballC","topicmodels","textrecipes",
              "vip","stopwords","themis","discrim","naivebayes",
              "LiblineaR","tidyr","XML","xml2","readr")

# Install any package that is not yet available
for (package in packages) {
  if (!requireNamespace(package, quietly = TRUE)) {
    install.packages(package,
                     dependencies = TRUE,
                     repos = "http://cran.us.r-project.org")
  }
}

# Load the required packages
for (package in packages) {
  library(package, character.only = TRUE)
}

# print the list of libraries that are loaded
(.packages())
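If an installation fails partway through (for example, because of a network problem), it can be useful to see which packages are still missing before the session. The following is a minimal sketch using only base R; the variable names `installed` and `missing_packages` are illustrative and not part of the tutorial code.

```r
# Same package vector as in the installation step above
packages <- c("tidytext", "tidyverse", "tidymodels", "scales",
              "quanteda", "SnowballC", "topicmodels", "textrecipes",
              "vip", "stopwords", "themis", "discrim", "naivebayes",
              "LiblineaR", "tidyr", "XML", "xml2", "readr")

# Names of all packages installed in the current library paths
installed <- rownames(installed.packages())

# Packages from the list that are not installed (character(0) if none)
missing_packages <- setdiff(packages, installed)

if (length(missing_packages) == 0) {
  message("All required packages are installed.")
} else {
  message("Still missing: ", paste(missing_packages, collapse = ", "))
}
```

Any package reported as missing can be installed individually with `install.packages()`.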

1.1.2 Twitter Dataset for the Tutorial

The datasets can be found in the GitHub repo.

Click the URLs below to obtain the Twitter datasets as CSV files from GitHub.
- Dataset for Classification: sampleTwitterDataForClassification.csv
- Dataset for Topic Modelling: sampleTwitterDataForTopicModelling.csv

If you want to save the data on your local machine (optional), right-click a link and choose “Save as”.

We will use the URLs to access the Twitter data directly.
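As a rough sketch of what reading the data looks like, base R's `read.csv()` accepts either a local file path or a complete URL. The helper `read_tweets()` and the temporary demo file below are hypothetical stand-ins, and the rows are made up for illustration; to read the real data, pass the CSV link copied from the GitHub repo instead of `demo_file`.

```r
# Hypothetical helper: read a tweet CSV from a local path or a URL
read_tweets <- function(source) {
  read.csv(source, stringsAsFactors = FALSE)
}

# Stand-in for the real file: a tiny CSV written to a temporary location.
# These rows are invented for illustration and are not tutorial data.
demo_file <- tempfile(fileext = ".csv")
writeLines(c('"","tweet","annotation"',
             '"1","made-up example tweet one",1',
             '"2","made-up example tweet two",0'),
           demo_file)

demo <- read_tweets(demo_file)

# An unnamed first column in the header is read as "X" by read.csv()
names(demo)
```

The unnamed first column in the demo file is an assumption about how the tutorial files were exported; it is included because it may explain why the ID column of the classification dataset is called X.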

1.1.2.1 Twitter Dataset for Classification

The dataset is a sample of annotated tweets, labelled with the goal of inferring recent plausible COVID-19 cases.

The data structure:

- Each column is a variable
- Each row is an observation (e.g. tweet, annotation)

Variables included:

- X: tweet_id
- tweet: the tweet contents
- annotation:
  - 1: Yes (plausible COVID-19 case)
  - 0: No or Unsure
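To make the structure concrete, the sketch below builds a toy data frame with the same three columns and checks the class balance of the annotation variable. The rows are invented placeholders, not tweets from the dataset, and the `label` column is an illustrative addition that does not exist in the original file.

```r
# Toy stand-in with the same three columns (made-up rows, not real data)
tweets <- data.frame(
  X = 1:4,
  tweet = c("made-up tweet a", "made-up tweet b",
            "made-up tweet c", "made-up tweet d"),
  annotation = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# One row per observation, one column per variable
str(tweets)

# Recode the 0/1 annotation as a labelled factor (illustrative column)
tweets$label <- factor(tweets$annotation,
                       levels = c(0, 1),
                       labels = c("No or Unsure", "Yes"))

# Count the tweets in each class
label_counts <- table(tweets$label)
label_counts
```

Checking the class counts early is useful because an imbalanced annotation column affects how a classifier should be trained and evaluated.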

1.1.2.2 Twitter Dataset for Topic Modelling

The dataset is a collection of tweets related to COVID-19. It has a single column, tweet.