1 Prerequisites

Welcome to Artificial Intelligence for Public Health (AI4PH) 2022.

This online tutorial will accompany two sessions:

  1. Tutorial on text analytics with R
  2. Data Challenge using the N2C2 NLP Research Datasets

1.1 Before the sessions

  1. Install R and RStudio
  2. Create a new R Project in RStudio (via “File -> New Project…”)
  3. Install the required R packages
  4. Access the Twitter data for the tutorial session (GitHub link)

1.1.1 Install Required R Packages

Run the code below to install and load the required packages.

# specify the packages to install or load in a vector
packages <- c("tidytext","tidyverse","tidymodels","scales",
              "quanteda","SnowballC","topicmodels","textrecipes",
              "vip","stopwords","themis","discrim","naivebayes",
              "LiblineaR","tidyr","XML","xml2","readr")

# Loop through each package
for (package in packages) {
  # if not installed, then install
  if (!require(package, character.only = TRUE)) {
    install.packages(package,
                     dependencies = TRUE,
                     repos = "http://cran.us.r-project.org")
  }
}

# load the required packages
for (package in packages) {
  library(package, character.only = TRUE)
}

# print the list of libraries that are loaded
(.packages())
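
If an installation fails, the loading step stops with an error at the first missing package. To see every missing package at once, a check along the following lines may help. This is a minimal sketch and not part of the original tutorial: `missing_packages` is a hypothetical helper name, and the vectors passed to it are examples only.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  # installed.packages() returns a matrix whose row names are package names
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Example call with two packages that ship with every R installation;
# an empty result means nothing is missing.
still_missing <- missing_packages(c("stats", "utils"))

if (length(still_missing) == 0) {
  message("All requested packages are installed.")
} else {
  message("Still missing: ", paste(still_missing, collapse = ", "))
}
```

In practice the `packages` vector defined earlier would be passed instead of the two example names.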

1.1.2 Twitter Dataset for the Tutorial

The dataset can be found in the GitHub repo.

We will use the URLs to access the Twitter data directly.
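
Reading a CSV file from a URL works the same way as reading a local file, because `read.csv()` accepts either. The sketch below illustrates the idea under stated assumptions: `load_tweets` is a hypothetical helper name, the data is assumed to be in CSV format, and a temporary file with invented placeholder rows stands in for the real URL, which is listed in the GitHub repo and is not reproduced here.

```r
# Hypothetical helper: read a CSV from a local path or a URL.
load_tweets <- function(path_or_url) {
  read.csv(path_or_url, stringsAsFactors = FALSE)
}

# Stand-in for the real URL: a temporary CSV with invented placeholder rows.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("tweet", "placeholder tweet one", "placeholder tweet two"),
           demo_file)

tweets <- load_tweets(demo_file)

# Inspect the column names and the number of rows that were read
names(tweets)
nrow(tweets)
```

Replacing `demo_file` with the URL string from the repo should load the actual data in the same way.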

1.1.2.1 Twitter Dataset for Classification

The dataset is a sample of annotated Twitter data, labelled for the task of inferring recent plausible COVID-19 cases.

The data structure:

  • Each column is a variable
  • Each row is an observation (e.g. tweet, annotation)

Variables included:

  • X: tweet_id

  • tweet: the tweet contents

  • annotation:

    • 1: Yes (plausible COVID-19 cases)
    • 0: No or Unsure
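
To make the structure above concrete, the following sketch builds a toy data frame with the same three columns and counts the tweets in each annotation class. The rows are invented placeholders, not taken from the real dataset, and the names `toy`, `label`, and `class_counts` are illustrative assumptions.

```r
# Toy data with the documented columns; the rows are invented for illustration.
toy <- data.frame(
  X = 1:4,
  tweet = c("placeholder tweet a", "placeholder tweet b",
            "placeholder tweet c", "placeholder tweet d"),
  annotation = c(1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Recode the 0/1 annotation as a labelled factor for readability
toy$label <- factor(toy$annotation,
                    levels = c(0, 1),
                    labels = c("No or Unsure", "Yes"))

# Count the observations in each class and show the proportions
class_counts <- table(toy$label)
print(class_counts)
print(prop.table(class_counts))
```

Running the same two `table()` lines on the real data gives a quick view of how many tweets fall into each class.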

1.1.2.2 Twitter Dataset for Topic Modelling

The dataset is a collection of tweets related to COVID-19. It has a single column, tweet.