The following sections provide some ideas on how to structure a data science project.
Any project should begin by answering the question: What motivates this project, and why should we care? A well-motivated data science project links questions with data. Thus, there are two principal ways to motivate a data science project:
Starting from questions: Ask some research question(s) that you hope to answer with data. Which data could address or answer the question?
Starting from data: Why use this data? Which interesting questions can it inform or answer?
Note that the questions should be expressed without referring to the data (i.e., rather than “What is the correlation between variable x and variable y?”, try a more general statement like “What is the relation between someone’s activity level and mood?”).
C.1.2 The data
The data used in a project should meet the following criteria:
Availability: The data should be publicly accessible and included in the submitted project (in
Amount: There are no specific limits or restrictions regarding data size, but the dataset should not be trivial (i.e., larger datasets are better than smaller ones, data spread over several files is better than data in a single file, etc.).
Variety: The data or the analysis should typically include multiple types of variables (e.g., text, numeric, categorical, logical).
Variables: Include a codebook (an overview of all variables, their types, and ranges).
References: Include citations to and references of all sources (e.g., of data, corresponding articles, and online sources).
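To illustrate the codebook criterion, the following is a minimal sketch in Python (standard library only) that derives one codebook entry per variable: its type, the number of missing values, and its range (for numeric variables) or its levels (for all others). The toy data, the variable names, and the helper make_codebook() are hypothetical and serve only as an illustration. A real project would read the project's own data files instead.

```python
# Hypothetical toy data: each dict is one case (None marks a missing value).
rows = [
    {"id": 1, "group": "A", "activity": 3.5, "mood": 6, "smoker": False},
    {"id": 2, "group": "B", "activity": 1.0, "mood": 4, "smoker": True},
    {"id": 3, "group": "A", "activity": None, "mood": 7, "smoker": False},
]


def make_codebook(rows):
    """Return one entry per variable: type, missing count, range or levels."""
    codebook = {}
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        n_missing = len(rows) - len(values)
        types = sorted({type(v).__name__ for v in values})
        # bool is a subclass of int in Python, so exclude it explicitly.
        numeric = bool(values) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if numeric:
            summary = (min(values), max(values))     # range of a numeric variable
        else:
            summary = sorted(set(values), key=str)   # levels of other variables
        codebook[name] = {
            "type": "/".join(types),
            "missing": n_missing,
            "range_or_levels": summary,
        }
    return codebook


codebook = make_codebook(rows)
for name, entry in codebook.items():
    print(name, entry)
```

A generated overview like this is only a starting point: a useful codebook also describes in words what each variable means and how it was measured.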
C.1.3 Data exploration (EDA)
Clean up data, etc.
Check data for missing values, outliers, etc.
Provide an overview of distributions of key variables, relationships between variables, etc.
Join and/or re-format tables, select cases or variables, etc.
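As a minimal sketch of the checks listed above, the following Python code counts the missing values of a single variable and flags candidate outliers. The data is hypothetical, and the 1.5 × IQR rule used here is only one common convention for flagging outliers, not a requirement of the project.

```python
import statistics

# Hypothetical measurements of one variable; None marks a missing value.
mood = [4, 5, None, 6, 5, 7, 30, 6, None, 5]

# Separate observed from missing values.
observed = [v for v in mood if v is not None]
n_missing = len(mood) - len(observed)

# Quartiles of the observed values (statistics.quantiles needs Python 3.8+).
q1, _median, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for closer inspection.
outliers = [v for v in observed if v < lower or v > upper]

print(f"missing: {n_missing} of {len(mood)}")
print(f"candidate outliers: {outliers}")
```

Flagged values are not removed automatically. Whether an extreme value is an error or a genuine observation is a judgment that should be documented in the project.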
C.1.4 Data transformation and visualization
Select, add, or change variables.
Compute descriptive summary tables and corresponding visualizations of various types (that make sense of the data).
Can you create a tidy version of the data? If so, how? If not, why not?
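To make the notion of a tidy version more concrete, the following sketch reshapes a small table from a wide format (one column per measurement occasion) into a long format in which each row holds exactly one observation. The data, the column names, and the helper to_long() are hypothetical. Data-frame libraries offer equivalent reshaping functions, so this plain-Python version only illustrates the idea.

```python
# Hypothetical wide table: one row per person, one column per occasion.
wide = [
    {"id": 1, "mood_t1": 6, "mood_t2": 7},
    {"id": 2, "mood_t1": 4, "mood_t2": 5},
]


def to_long(rows, id_col, value_cols, names_to, values_to):
    """Turn each (row, value column) pair into a separate row."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append(
                {id_col: row[id_col], names_to: col, values_to: row[col]}
            )
    return long_rows


tidy = to_long(wide, "id", ["mood_t1", "mood_t2"], "time", "mood")
for row in tidy:
    print(row)
```

In the long version, every variable (id, time, mood) has its own column and every observation its own row, which makes grouping, summarizing, and plotting by occasion straightforward.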
Try to descriptively answer your initial research questions (e.g., by creating corresponding summary tables or visualizations).
Results: Which answer(s) does your analysis suggest?
Limitations: What limits your possible conclusions? What could overcome these limitations?
Implications: What follows from your analysis? What should be done next?
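A descriptive answer to a research question often takes the form of a small summary table. The following sketch assumes the example question about activity level and mood and computes the number of cases and the mean mood for each activity level. The data and the variable names are hypothetical, and the resulting table describes only this sample, without any inferential claims.

```python
import statistics
from collections import defaultdict

# Hypothetical observations: (activity level, mood rating).
observations = [
    ("low", 4), ("low", 5),
    ("medium", 5),
    ("high", 7), ("high", 6), ("high", 8),
]

# Collect the mood ratings of each activity level.
by_level = defaultdict(list)
for level, mood in observations:
    by_level[level].append(mood)

# Summary table: number of cases and mean mood per activity level.
summary = {
    level: {"n": len(moods), "mean": statistics.mean(moods)}
    for level, moods in by_level.items()
}

for level, stats in summary.items():
    print(level, stats)
```

A table like this supports the Results part, whereas the small and unequal group sizes it reveals are an example of what belongs under Limitations.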