1 Prerequisite
Welcome to the Artificial Intelligence for Public Health (AI4PH) 2022.
This online tutorial will accompany two sessions:
- Tutorial on text analytics with R
- Data Challenge using the N2C2 NLP Research Datasets
1.1 Before the sessions
- Install R and RStudio
- Create a new R Project in RStudio (via “File -> New Project…”)
- Install required R packages
- Access the Twitter data for the tutorial session (GitHub link)
1.1.1 Install Required R Packages
Run the code below to install and load the required packages.
# specify the packages to install or load in a vector
packages <- c("tidytext","tidyverse","tidymodels","scales",
              "quanteda","SnowballC","topicmodels","textrecipes",
              "vip","stopwords","themis","discrim","naivebayes",
              "LiblineaR","tidyr","XML","xml2","readr")
# Loop through each package
for (package in packages) {
  # if not installed, then install.
  if (!require(package, character.only = TRUE)) {
    install.packages(package,
                     dependencies = TRUE,
                     repos='http://cran.us.r-project.org')
  }
}
for (package in packages){
  # load the required packages
  library(package, character.only = TRUE)
}
# print the list of libraries that are loaded
(.packages())
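If an installation fails partway through, it can help to check which packages are still missing before the session. The snippet below is a minimal sketch using base R only; the three-package vector is a shortened stand-in for the full list above, and the variable name `missing_packages` is our own choice.

```r
# Shortened stand-in for the full package vector used in the tutorial
packages <- c("tidytext", "tidyverse", "quanteda")

# Names of all packages currently installed in the local R libraries
installed <- rownames(installed.packages())

# Packages from the list that are not installed yet
missing_packages <- setdiff(packages, installed)

# Report the result
if (length(missing_packages) == 0) {
  message("All listed packages are installed.")
} else {
  message("Still missing: ", paste(missing_packages, collapse = ", "))
}
```

Any package named in the message can then be installed again individually with `install.packages()`.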
1.1.2 Twitter Dataset for the Tutorial
The datasets can be found in the GitHub repo.
Click the links below to obtain the Twitter datasets as CSV files from GitHub.
- Dataset for Classification:
sampleTwitterDataForClassification.csv
- Dataset for Topic Modelling:
sampleTwitterDataForTopicModelling.csv
If you want to save the data on your local machine (optional), right-click the link and choose Save as.
We will use the URLs to access the Twitter data directly.
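As a rough illustration of reading a CSV without downloading it first, the sketch below uses base R's `read.csv()`, which accepts either a local path or a URL in the same argument. To keep the example runnable offline, a small temporary file stands in for the remote file; its two columns and rows are made up for illustration and are not the tutorial data.

```r
# Stand-in for the remote file: a tiny CSV written to a temporary path.
# The rows are invented for illustration only.
csv_source <- tempfile(fileext = ".csv")
writeLines(c("id,text",
             "1,first example row",
             "2,second example row"),
           csv_source)

# read.csv() takes a local path or an http(s):// URL in the same argument,
# so in the session csv_source would hold the dataset URL instead.
example_data <- read.csv(csv_source, stringsAsFactors = FALSE)

# Inspect the result: dimensions, then the first rows
dim(example_data)
head(example_data)
```

`readr::read_csv()` from the packages installed above also accepts a URL in place of a file path.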
1.1.2.1 Twitter Dataset for Classification
The dataset is a sample of annotated Twitter data, intended for inferring recent plausible COVID-19 cases.
The data structure:
- Each column is a variable
- Each row is an observation (e.g. tweet, annotation)
Variables included:
- X: tweet_id
- tweet: the tweet contents
- annotation:
  - 1: Yes (plausible COVID-19 cases)
  - 0: No or Unsure
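To make the coding of the annotation variable concrete, here is a minimal sketch on a toy data frame with the same three columns. The four rows are invented placeholders, not real tweets, and the `label` column and `label_counts` object are names we introduce for illustration.

```r
# Toy data frame mirroring the three documented columns; values are invented
tweets <- data.frame(
  X = c(101, 102, 103, 104),
  tweet = c("made-up tweet a", "made-up tweet b",
            "made-up tweet c", "made-up tweet d"),
  annotation = c(1, 0, 0, 1),
  stringsAsFactors = FALSE
)

# Recode the 0/1 annotation as a labelled factor
tweets$label <- factor(tweets$annotation,
                       levels = c(0, 1),
                       labels = c("No or Unsure", "Yes"))

# Count the tweets in each class
label_counts <- table(tweets$label)
label_counts
```

Checking the class counts this way is a quick way to see how balanced the two annotation classes are before any modelling.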