CS5702: Modern Data Book
Preface
Why read this book?
Book structure
Book conventions
About the Author
Licensing
Acknowledgements
Getting Ready
0.1
Being organised
0.2
Course Preparation
i) Introducing the CS5702 Modern Data module
ii) Install and test R and RStudio
iii) Install Zotero
iv) Some introductory data and data science reading
v) Welcome to the module questionnaire
vi) Learn how to get help
0.3
Recap
1
Introduction
1.1
Week 1 goals
1.2
What is R and why do data scientists use it?
1.2.1
R and the R Ecosystem
1.2.2
Demand for R programmers
1.2.3
History of R
1.3
R basics
1.3.1
Variables, assignment and type
1.3.2
Functions, packages and libraries
1.4
Using RStudio effectively
1.4.1
Files and working directories
1.4.2
Using projects in RStudio
1.4.3
What if you can’t run RStudio?
1.5
Week 1 Seminar
1.6
Summary
2
The Richness of Data
2.1
Goals
2.2
The idea of rich data
2.3
Sources of data
2.4
More complex data structures in R
2.4.1
Creating and using vectors
2.4.2
Data Frames
2.5
Using files in R
2.5.1
Reading files
2.5.2
Writing files
2.6
A quick peek at databases
2.7
Very large data sets (a.k.a. “Big Data”)
2.7.1
So how suited is R for big data analysis?
2.8
Week 2 Seminar
2.9
Summary
3
Engineering or Hacking?
3.1
Why does it [engineering] matter?
3.1.1
Reproducible code
3.1.2
Tips for more reproducible data analysis
3.2
Literate programming and RMarkdown
3.3
Structuring your code
3.3.1
User-defined functions
3.3.2
Projects, files and working directories
3.3.3
Using R Notebooks and RMarkdown
3.4
FAIR data principles
3.4.1
A bad example of data sharing
3.5
Extension Exercise
3.6
Week 3 Seminar
3.7
Summary
4
Exploratory Data Analysis and Visualisation
4.0.1
To help spot problems
4.0.2
Data questions
5
Data Cleaning
5.1
Week 5 goals
5.2
Data quality
5.2.1
Missingness
5.2.2
Integrity checking
5.2.3
Outlier analysis
5.2.4
Data cleaning and imputation
5.3
Week 5 Seminar
5.4
Summary
6
Interactive Data
7
Text Analysis
8
Machine Learning
9
Data Science for Good
References
3.7
Summary