CS5702: Modern Data Book
Preface
Why read this book?
Book structure
Book conventions
About the Author
Licensing
Acknowledgements
Getting Ready
0.1 Being organised
0.2 Course Preparation
i) Introducing the CS5702 Modern Data module
ii) Install and test R and RStudio
iii) Install Zotero
iv) Some introductory data and data science reading
v) Welcome to the module questionnaire
vi) Learn how to get help
0.3 Recap
1 Introduction
1.1 Week 1 goals
1.2 What is R and why do data scientists use it?
1.2.1 R and the R Ecosystem
1.2.2 Demand for R programmers
1.2.3 History of R
1.3 R basics
1.3.1 Variables, assignment and type
1.3.2 Functions, packages and libraries
1.4 Using RStudio effectively
1.4.1 Files and working directories
1.4.2 Using projects in RStudio
1.4.3 What if you can’t run RStudio?
1.5 Summary
2 The Richness of Data
2.1 Goals
2.2 The idea of rich data
2.3 Sources of data
2.4 More complex data structures in R
2.4.1 Creating and using vectors
2.4.2 Data Frames
2.5 Using files in R
2.5.1 Reading files
2.5.2 Writing files
2.6 A quick peek at databases
2.7 Very large data sets (a.k.a. “Big Data”)
2.7.1 So how suited is R for big data analysis?
2.8 Summary
3 Engineering or Hacking?
3.1 Why does it [engineering] matter?
3.1.1 The NHS covid-19 ‘loss’ of 16000 cases
3.1.2 Audit of machine learning experiments
3.1.3 Reproducible code
3.1.4 Tips for more reproducible data analysis
3.2 FAIR data principles
3.2.1 A bad example of data sharing
3.3 Literate programming and RMarkdown
3.3.1 Structuring your code
3.3.2 User-defined functions
3.4 Engineering tools and tips
3.4.1 Using R notebooks and rmarkdown
3.4.2 Version control and backup
3.5 Extension Exercise
3.6 Summary
4 Exploratory Data Analysis and Visualisation
4.1 Introducing exploratory data analysis
4.1.1 Data questions
4.1.2 Spotting data quality problems
4.2 Patterns and variation
4.2.1 How do the data vary?
4.2.2 Outliers
4.2.3 Co-variation
4.3 Introducing data visualisation in R
4.3.1 Visualisation using ggplot2
4.3.2 Visualising model fit
4.4 Summary
4.4.1 Further Reading
5 Data Quality, Cleaning and Imputation
5.1 Week 5 goals
5.2 The central role of data quality
5.2.1 A business perspective on data quality
5.3 Checking for quality
5.3.1 Types of data quality issue
5.3.2 Variance, a misunderstood and overlooked characteristic
5.3.3 Integrity checking
5.4 Data cleaning and imputation
5.4.1 Data Imputation
5.5 Summary
6 If data should answer questions then who is asking?
6.1 Week 6 goals
6.2 Introducing data, questions and users
6.3 If data should answer questions then who is asking?
6.4 Building a simple shiny app
6.5 Further Reading
7 Text Analysis
7.1 An introduction to text processing
7.2 Summary
7.3 Further Reading
8 NLP and sentiment analysis with R
9 Data Science for Good
References
Chapter 5
Data Quality, Cleaning and Imputation