1 Prerequisite
Welcome to the Artificial Intelligence for Public Health (AI4PH) 2022.
This online tutorial will accompany two sessions:
- Tutorial on text analytics with R
- Data Challenge using the N2C2 NLP Research Datasets
1.1 Before the sessions
- Install R and RStudio
- Create a new R Project in RStudio (via “File -> New Project…”)
- Install required R packages
- Access the Twitter data for the tutorial session (GitHub link)
1.1.1 Install Required R Packages
Run the code below to install and load the required packages.
# specify the packages to install or load in a vector
packages <- c("tidytext", "tidyverse", "tidymodels", "scales",
              "quanteda", "SnowballC", "topicmodels", "textrecipes",
              "vip", "stopwords", "themis", "discrim", "naivebayes",
              "LiblineaR", "tidyr", "XML", "xml2", "readr")
# Loop through each package
for (package in packages) {
  # if not installed, then install.
  if (!require(package, character.only = TRUE)) {
    install.packages(package,
                     dependencies = TRUE,
                     repos = 'http://cran.us.r-project.org')
  }
}
for (package in packages) {
  # load the required packages
  library(package, character.only = TRUE)
}
# print the list of libraries that are loaded
(.packages())
1.1.2 Twitter Dataset for the Tutorial
The datasets can be found in the GitHub repo.
Click the URLs to obtain the Twitter datasets as CSV files from GitHub.
- Dataset for Classification: sampleTwitterDataForClassification.csv
- Dataset for Topic Modelling: sampleTwitterDataForTopicModelling.csv
If you want to save the data on your local machine (optional), right-click the link, then choose “Save as”.
We will use the URLs to access the Twitter data directly.
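As a minimal sketch of reading the data this way, the snippet below uses base R's read.csv(), which accepts either a local path or a URL. The helper name read_tweets and the URL in the comment are placeholders for illustration, not the actual repository address, and the two demo rows are made up so that the example runs offline.

```r
# Hypothetical helper (not part of the tutorial code): read a tweet CSV
# from a local path or a URL. read.csv() accepts both.
read_tweets <- function(path_or_url) {
  read.csv(path_or_url, stringsAsFactors = FALSE)
}

# With the real data you would pass the raw GitHub URL of the file, e.g.
# tweets <- read_tweets("https://<raw-github-url>/sampleTwitterDataForClassification.csv")

# Offline demonstration: write two made-up rows to a temporary CSV file.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("X,tweet,annotation",
             "1,made-up tweet one,1",
             "2,made-up tweet two,0"),
           demo_file)

# Read the file back and show its structure.
tweets <- read_tweets(demo_file)
str(tweets)
```

Since readr is among the packages installed above, readr::read_csv() could likely be swapped in for read.csv() if a tibble is preferred.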
1.1.2.1 Twitter Dataset for Classification
The dataset is a sample of annotated Twitter data, with the goal of inferring recent plausible COVID-19 cases.
The data structure:
- Each column is a variable
- Each row is an observation (e.g. tweet, annotation)
Variables included:
- X: tweet_id
- tweet: the tweet contents
- annotation:
  - 1: Yes (plausible COVID-19 cases)
  - 0: No or Unsure
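To check that a loaded dataset matches this description, one can inspect its dimensions and the balance of the labels. The sketch below uses a small made-up data frame named tweets_demo (an assumption for illustration, not the real data) and recodes annotation as a factor with readable labels. The label column is likewise a hypothetical name.

```r
# Made-up rows that mimic the described structure; not real tweets.
tweets_demo <- data.frame(
  X = c(101, 102, 103),
  tweet = c("made-up tweet a", "made-up tweet b", "made-up tweet c"),
  annotation = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# One row per tweet, one column per variable.
dim(tweets_demo)

# Recode the 0/1 annotation as a factor with readable labels.
tweets_demo$label <- factor(tweets_demo$annotation,
                            levels = c(0, 1),
                            labels = c("No or Unsure", "Yes"))

# Count tweets per class to see how balanced the labels are.
table(tweets_demo$label)
```

A strongly uneven count here would suggest that class imbalance needs attention before fitting a classifier.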