A Midterm practice exam

library(rtweet)
library(tidyverse)
library(ggthemes)
library(rvest)
library(lubridate)
library(DBI)
library(odbc)
library(RODBC)

A.1 Essay question

Q1: If you were given a large (billions of records) dataset, what would you do to understand the quality and limitations of the data?

A: This question is covered in the week 2 slides. Some of the things we can do to profile (big) datasets:

  • Understand what each column stores.
  • Figure out the correct data types.
  • Identify if there are missing observations by running aggregate functions (such as summary).
  • Use head() and tail() to view the first and last 6 rows of the dataset.
  • Get descriptive statistics of each column (min, max, mean, std, median, frequencies of categorical variables, etc.)
  • Group by variables of interest and compute descriptive statistics per group.
  • Select the variables and observations we care about (select, filter, subsample).
  • Identify what questions the data can potentially answer.

In the exam, any answer that included four or more of these points received full credit.
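
For illustration, here is a minimal sketch of a few of these steps, assuming a hypothetical tibble dat with made-up columns price and category:

glimpse(dat)    # column names, data types, and a preview of the values
summary(dat)    # min, max, mean, quartiles, and NA counts per column
head(dat)       # first 6 rows
tail(dat)       # last 6 rows
dat %>% group_by(category) %>% 
  summarize(avgPrice = mean(price, na.rm = TRUE), observations = n())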


A.2 Twitter API: rtweet

Q2: Use the rtweet function get_timeline to get the most recent 3000 tweets of the main ESPN Twitter account (username: espn). Then, use the function as_date from the package lubridate to convert the column created_at to a date. Finally, store into a new tibble named e1 the ESPN tweets posted in September. How many tweets did ESPN post in September?

key = "key"
secret = "secret"
access = "access"
access_secret = "access_secret"
app_name = "appname"

A: First, create our token (the keys are excluded from the knitted file with the chunk option {r, include=FALSE}):

myToken = create_token(app = app_name,
             consumer_key = key, 
             consumer_secret = secret,
             access_token = access,
             access_secret = access_secret)

One way to find the number of tweets in e1 is to call the function nrow():

e = get_timeline("espn", n = 3000, token = myToken)
e$approxDate = as_date(e$created_at)
e1 = e %>% filter(approxDate >= "2021-09-01" & approxDate <= "2021-09-30")
nrow(e1)
## [1] 516
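
Alternatively, here is a sketch of the same filter using lubridate's month() function (the name e1Alt is arbitrary, and this assumes all retrieved tweets come from the same year):

e1Alt = e %>% filter(month(approxDate) == 9)   # keep tweets whose month is September
nrow(e1Alt)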

Q3: Now load the subset of the ESPN tweets stored in the file tweetsPracticeExam.csv. Name this tibble p. Which is the most frequently used source that ESPN tweets from?

A:

p = read_csv("../data/tweetsPracticeExam.csv")
p %>%  group_by(source) %>% count() %>% arrange(desc(n))
## # A tibble: 8 × 2
## # Groups:   source [8]
##   source                               n
##   <chr>                            <int>
## 1 Sprinklr                           492
## 2 Twitter for iPhone                 146
## 3 Twitter Web App                     92
## 4 TweetDeck                           88
## 5 Twitter Media Studio                60
## 6 WSC Sports                          41
## 7 Twitter for Advertisers (legacy)    12
## 8 Twitter Media Studio - LiveCut      12
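
Note that count() also takes a sort argument, so a shorter equivalent is:

p %>% count(source, sort = TRUE)   # same counts, already sorted in descending order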

Q4: How many tweets in the p tibble are retweets (i.e., they have a name in the column retweet_name)?

A:

p %>% filter(!is.na(retweet_name)) %>%  nrow
## [1] 297
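
An equivalent way to get the same number is to count the non-missing values directly:

p %>% summarize(retweets = sum(!is.na(retweet_name)))   # number of rows with a retweet_name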

Q5: In the p tibble, match each of the following accounts with the number of times it is retweeted by ESPN:

A: Below I am using the operator %in% to keep only the rows where retweet_name is one of the accounts the question asks about. You did not have to do this to answer the question; you can answer it without the final filter.

p %>% filter(!is.na(retweet_name)) %>% group_by(retweet_name) %>% count() %>% 
  arrange(n) %>% 
  filter(retweet_name %in% c('DJ KHALED','Malika Andrews',
                             'ESPN College Football',
                             'NFL on ESPN',
                             'Adam Schefter',
                             'ESPN FC',
                             'ESPN Fantasy Sports'))
## # A tibble: 7 × 2
## # Groups:   retweet_name [7]
##   retweet_name              n
##   <chr>                 <int>
## 1 DJ KHALED                 1
## 2 ESPN Fantasy Sports       2
## 3 Malika Andrews            2
## 4 ESPN College Football     5
## 5 Adam Schefter            15
## 6 ESPN FC                  19
## 7 NFL on ESPN              24
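
For reference, %in% returns a logical vector that flags which elements of its left-hand side appear in its right-hand side; a toy example with made-up values:

c("Adam Schefter", "ESPN FC", "Some Other Account") %in% c("Adam Schefter", "ESPN FC")
# returns TRUE TRUE FALSE: the third value does not appear in the right-hand side vector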

A.3 Scraping

Q6: Scrape the URL https://en.wikipedia.org/wiki/Data_analysis. Use the html selector "p a" to extract all the links on this page. Then use the regular expression "[a-zA-Z ]+" to keep only links that are not references. Save the result in a single-column tibble and name the column "link". Remove any missing values. How many links did you end up with?

A:

r = read_html("https://en.wikipedia.org/wiki/Data_analysis")
r1 = r %>% html_nodes("p a") %>% 
  html_text()
r2 = r1 %>% str_extract("[a-zA-Z ]+") %>% as_tibble_col("link") %>% 
  filter(!is.na(link))
nrow(r2)
## [1] 84
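
To see why this regular expression drops the reference links: the references extracted from the page look like "[12]", which contain no letters or spaces, so str_extract() returns NA for them and filter(!is.na(link)) removes those rows. A toy example:

str_extract(c("machine learning", "[12]"), "[a-zA-Z ]+")
# the first element matches ("machine learning"); the second has no match, so NA is returned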

Q7: Create a new custom function that:

  • Takes as input a wiki concept.
  • Visits its wikipedia page.
  • Extracts all the links.
  • Removes any missing values.
  • Returns a tibble with two columns: the first column stores the link, and the second column stores the scraped wiki concept.

By using a function from the map() family, call your new custom function with the following wiki concepts as input: c('Data_mining','Data_analysis','Data_science'). Store the result in a single tibble. How many rows does this final tibble have?

A:

getWikis = function(wikiConcept){
  r = read_html(paste("https://en.wikipedia.org/wiki/",wikiConcept,sep=""))
  r1 = r %>% html_nodes("p a") %>% 
    html_text()
  r2 = r1 %>% str_extract("[a-zA-Z ]+") %>% as_tibble_col("link") %>% 
    filter(!is.na(link)) %>% mutate(concept = wikiConcept)
  return(r2)
}

d = c('Data_mining','Data_analysis','Data_science') %>% map_dfr(getWikis)
nrow(d)
## [1] 248
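
Equivalently, map_dfr() is shorthand for mapping over the inputs and then row-binding the resulting tibbles; a sketch of the longer form (the name dAlt is arbitrary):

dAlt = c('Data_mining','Data_analysis','Data_science') %>% map(getWikis) %>% bind_rows()
nrow(dAlt)   # same number of rows as d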

A.4 Databases

Q8: Load the following dataset into R: wiki.csv
Based on this dataset, create an ERD that identifies the following two entities and their relationship:

  • concept
  • link

The entity concept can have additional fields such as:

  • description
  • wikipedia_link

The entity link can have additional fields such as:

  • description
  • date_scraped

Take a screenshot of your ERD and upload it here.

A:

An ERD that works for this example connects the entities concept and link through a many-to-many relationship: each concept page contains many links, and the same link can appear under many concepts. In the database, this relationship is implemented with the associative table concept_has_link (see the CREATE TABLE statements in Q10).

How could you have come up with the relationship cardinalities? Run the following, whose output shows that the same link can appear under multiple concepts (e.g., "machine learning" appears under both Data_mining and Data_science), and even more than once within a single concept page:

d %>% count(concept,link) %>% filter(n > 1)
## # A tibble: 20 × 3
##    concept       link                                   n
##    <chr>         <chr>                              <int>
##  1 Data_analysis "business intelligence"                2
##  2 Data_analysis "descriptive statistics"               2
##  3 Data_analysis "exploratory data analysis"            2
##  4 Data_analysis "hypotheses"                           2
##  5 Data_analysis "type "                                2
##  6 Data_mining   "cluster analysis"                     2
##  7 Data_mining   "data analysis"                        2
##  8 Data_mining   "data management"                      2
##  9 Data_mining   "data sets"                            2
## 10 Data_mining   "KDD"                                  2
## 11 Data_mining   "machine learning"                     3
## 12 Data_mining   "overfitting"                          2
## 13 Data_mining   "predictive analytics"                 2
## 14 Data_mining   "SIGKDD"                               2
## 15 Data_mining   "statistics"                           2
## 16 Data_science  "American Statistical Association"     2
## 17 Data_science  "big data"                             2
## 18 Data_science  "C"                                    2
## 19 Data_science  "machine learning"                     2
## 20 Data_science  "statistics"                           2
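
To see directly which links appear under more than one concept (which is what justifies the many-to-many relationship), a quick sketch:

d %>% distinct(concept, link) %>%   # one row per concept-link pair
  count(link) %>%                   # in how many concepts does each link appear?
  filter(n > 1)                     # keep the links that appear in more than one concept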

Q9: For each one of the tables in your ERD, explain whether they are in 1NF, 2NF, or 3NF.

A:

All tables are in 3NF. There are no derived columns (i.e., columns that we could generate by combining other columns in the table), and no table stores more than one entity.

Q10: Create all tables in your ERD by connecting to your your_username_db schema. Then load all tables with the appropriate content. Assume missing values when necessary.

How many rows did your largest table end up with?

(I will grade your code)

A:

con = DBI::dbConnect(odbc::odbc(), 
                    Driver= "MySQL", 
                    Server='mysql-isys3350.bc.edu', 
                    UID='username', 
                    PWD= 'password',
                    Port= 3306, 
                    Database = "username_db")
Then, we create the three tables (these statements run as SQL chunks against the con connection):

CREATE TABLE kokkodis_student_db.link (
    link_name VARCHAR(100),
    description VARCHAR(150),
    date_scraped DATE,
    PRIMARY KEY (link_name)
);
CREATE TABLE kokkodis_student_db.concept (
    concept_name VARCHAR(100),
    description VARCHAR(150),
    wikipedia_link VARCHAR(150),
    PRIMARY KEY (concept_name)
);
CREATE TABLE kokkodis_student_db.concept_has_link (
    link_name VARCHAR(100),
    concept_name VARCHAR(100),
    PRIMARY KEY (concept_name,link_name),
    CONSTRAINT fk_concept
    FOREIGN KEY (concept_name)
    REFERENCES kokkodis_student_db.concept (concept_name),
    CONSTRAINT fk_link
    FOREIGN KEY (link_name)
    REFERENCES kokkodis_student_db.link (link_name)
);
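
As a quick sanity check, you can list the tables in the schema from R; the three new tables should appear:

dbListTables(con)   # should include "concept", "link", and "concept_has_link"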

Next, we load the data:

d %>% select(concept) %>% distinct %>% 
  mutate(wikipedia_link = paste("https://en.wikipedia.org/wiki/",concept,sep=""), description = NA) %>%  
  rename(concept_name = concept) -> concept
dbWriteTable(con,"concept", concept, append = T)

For verification it is always a good idea to run a select query:

select * from concept;
Table A.1: 3 records

concept_name    description   wikipedia_link
Data_analysis   NA            https://en.wikipedia.org/wiki/Data_analysis
Data_mining     NA            https://en.wikipedia.org/wiki/Data_mining
Data_science    NA            https://en.wikipedia.org/wiki/Data_science

Similarly, we prepare and load the link table, and prepare the concept_has_link table:

d %>% select(link) %>% distinct %>% mutate(date_scraped = today(), description = NA) %>% 
  rename(link_name = link) -> link
dbWriteTable(con,"link", link, append = T)
d %>% select(link,concept) %>% distinct %>% 
  rename(link_name = link, concept_name = concept) -> concept_has_link

Again, we verify with select queries:

select * from link;
Table A.2: Displaying records 1 - 10

link_name                             description   date_scraped
AAAI                                  NA            2021-11-06
ADVISE                                NA            2021-11-06
AI                                    NA            2021-11-06
American Statistical Association      NA            2021-11-06
Andrew Gelman                         NA            2021-11-06
Association for Computing Machinery   NA            2021-11-06
Bayes                                 NA            2021-11-06
Ben Fry                               NA            2021-11-06
Bonferroni correction                 NA            2021-11-06
Boundary element                      NA            2021-11-06

dbWriteTable(con,"concept_has_link", concept_has_link, append = T)
select * from concept_has_link;
Table A.3: Displaying records 1 - 10

link_name                             concept_name
AAAI                                  Data_mining
ADVISE                                Data_mining
AI                                    Data_mining
American Statistical Association      Data_science
Andrew Gelman                         Data_science
Association for Computing Machinery   Data_mining
Bayes                                 Data_mining
Ben Fry                               Data_science
Bonferroni correction                 Data_analysis
Boundary element                      Data_analysis

Q11:

Using the table team_in_game, retrieve all rows that do not store information about "Miami Heat" or "Boston Celtics" and store them in a dataframe (tibble). How many rows does the tibble have?

A:

query = 'SELECT points_scored, team_name FROM kokkodis_student_db.team_in_game 
  WHERE team_name not in ("Miami Heat", "Boston Celtics");'
t = dbGetQuery(con,query) %>% as_tibble
t 
## # A tibble: 448 × 2
##    points_scored team_name           
##            <int> <chr>               
##  1           116 Los Angeles Clippers
##  2           109 Los Angeles Lakers  
##  3           121 Milwaukee Bucks     
##  4           122 Denver Nuggets      
##  5           124 Sacramento Kings    
##  6           113 Orlando Magic       
##  7           123 Brooklyn Nets       
##  8           108 Denver Nuggets      
##  9           121 Los Angeles Clippers
## 10           115 Dallas Mavericks    
## # … with 438 more rows

Q12:

Using the tibble from the previous question, what is the average points_scored of “Los Angeles Lakers”?

A:

t %>% filter(team_name=='Los Angeles Lakers') %>% summarize(mean(points_scored))
## # A tibble: 1 × 1
##   `mean(points_scored)`
##                   <dbl>
## 1                  109.
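
As an optional cross-check (not required by the question), you could compute the same average directly in SQL:

query = 'SELECT AVG(points_scored) AS avg_points FROM kokkodis_student_db.team_in_game 
  WHERE team_name = "Los Angeles Lakers";'
dbGetQuery(con, query)   # should match the value above (about 109)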