A Data science projects
This appendix sketches the types, criteria, and some ideas and resources for interesting and engaging data science (DS) projects.
A.1 Two types of projects
We distinguish between two types of data science (DS) projects: Whereas basic DS projects are suitable for an introductory course (e.g., i2ds 1), advanced DS projects are suitable for a follow-up course (e.g., i2ds 2).
A.1.1 Basic DS projects
Basic DS projects should find and explore an interesting real-world dataset (as in Chapter 15: Exploring data). Suitable datasets are non-trivial (i.e., contain at least a few dozen, and ideally hundreds of, observations and a mix of character/factor and numeric variables) and should raise interesting research questions. Ideally, the project should explain and document all steps and answer the question(s) that motivated it.
Key steps
More specifically, proceed as follows to conduct a basic DS project:
Find some non-trivial data that may be shared and raises interesting questions (see below for links and suggestions).
Conduct an exploratory data analysis (EDA, as in Chapter 15: Exploring data) in a reproducible Rmd-file:
- Formulate some questions to be answered by the data
- Read in the data (e.g., from a CSV-file)
- Describe the data (its dimensions, observations, variable types, etc.)
- Tidy and transform the data
- Visualize key aspects of the data
- Answer your questions (or explain why they remain unanswered)
Document and explain all steps (in text and commented code) and note which R packages and functions are being used to solve which tasks.
Include links and references to data sources, R packages, and all other sources used.
In short, a basic DS project should locate some real-world data that is non-trivial, may be shared, and raises interesting questions, and then explore the data to answer these questions in a well-documented and reproducible fashion.
See Appendix B.3 of the ds4psy textbook (Neth, 2023a) for links to potential data sources.
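The key steps of an exploratory data analysis can be sketched in a few lines of R. The following is only a minimal illustration, not a template for a full project: it uses base R functions, and the built-in `iris` data (written to a temporary CSV-file) stands in for a real-world dataset. The research question, the file name, and the derived variable `Petal.Ratio` are assumptions made for this example.

```r
# Stand-in data (assumption): write the built-in iris data to a temporary CSV-file.
# In a real project, the CSV-file would come from your own data source.
csv_file <- file.path(tempdir(), "iris.csv")
write.csv(iris, csv_file, row.names = FALSE)

# Question (hypothetical): Do the three species differ in their petal length?

# Read in the data:
df <- read.csv(csv_file, stringsAsFactors = TRUE)

# Describe the data (dimensions, variable types, basic summaries):
dim(df)
str(df)
summary(df)

# Tidy and transform (check for missing values, add a derived variable):
colSums(is.na(df))
df$Petal.Ratio <- df$Petal.Length / df$Petal.Width

# Summarize by group:
means <- aggregate(Petal.Length ~ Species, data = df, FUN = mean)
means

# Visualize a key aspect of the data:
boxplot(Petal.Length ~ Species, data = df,
        main = "Petal length by species",
        xlab = "Species", ylab = "Petal length (cm)")
```

In an Rmd-file, each of these steps would typically go into its own code chunk, surrounded by text that explains what is being done and why. The same steps could also be implemented with tidyverse packages instead of base R.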
A.1.2 Advanced DS projects
Advanced DS projects should integrate various chapters and topics and go beyond existing examples. Ideally, an advanced DS project should have or imply real-world applications. Such projects could create new models or simulations, contribute to existing or create new R packages, or use existing R functions in interactive applications (e.g., using Shiny).
Ideas for advanced DS projects
The following ideas for advanced DS projects are based on chapters of Part 6: Applications:
- Comparing strategies in games (e.g., heuristic vs. learning agents)
- Performing a social network analysis
- Creating a mate search simulation
- Creating a foraging model (e.g., comparing heuristic or RL approaches in single vs. multi-agent simulations)
- Predicting the stock market (and evaluating portfolio performance)
- Plotting text (see Section 24.3)
- Conducting a sentiment analysis
- Creating artistic visualizations (see Section 24.4.2)
A.2 Desiderata and requirements
The success of DS projects will be evaluated on their content and formal characteristics:
A.2.1 Content
Key ingredients of a successful DS project include:
- Ask an interesting question that can be answered within a course project
- Sketch the analysis, method or model that is suited to answer the question
- Find or generate suitable data
- Implement the analysis, method, or model
- Consider including summaries and visualizations
- Interpret your results to answer your questions
- Conclude by mentioning limitations and/or next steps
A.2.2 Form
Formal requirements include:
- Use a reproducible Rmd-file that presents everything in a transparent fashion
- Begin by loading all required R packages and data files
- Document your methodology, intermediate steps, and conclusions (in both code and text)
- Note which R functions and packages you have been using for solving which tasks
- Include links and references to data sources, R packages, and all other sources
- Generate a self-sufficient output file (in HTML or PDF format)
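As a rough illustration of the first formal requirements, the initial code chunk of an Rmd-file could load all packages in one place and fail early if one of them is not installed. This is only a sketch: the vector `pkgs` is a placeholder that lists packages shipped with R, and should be replaced by the packages your project actually uses.

```r
# Placeholder list of required packages (assumption): replace with your own.
pkgs <- c("stats", "utils", "graphics")

# Check that all required packages are installed before loading them:
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
if (length(not_installed) > 0) {
  stop("Please install: ", paste(not_installed, collapse = ", "))
}

# Load all packages at the beginning of the file:
invisible(lapply(pkgs, library, character.only = TRUE))

# Document the R version and package versions used:
sessionInfo()

# Obtain a reference for citing R (use citation("pkg") for a package):
citation()
```

Collecting all `library()` calls at the top makes it easy to see which packages a project depends on, and the output of `sessionInfo()` helps others to reproduce the results.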
A.2.3 Submission
To submit your project, organize it as an R project and collect all files in a single ZIP-archive whose name indicates its contents and creator. This implies the following steps:
Include all text and code in an RStudio project and a single RMarkdown file:
- Store your input file, output file, and all other required files (e.g., data or images) in a single project directory.
- Ensure that your Rmd-input file successfully reads in your data and compiles into an output file (using only R packages loaded in it).
- Compile your Rmd-input file into a single and self-contained HTML-output file.
Create a ZIP-archive that contains your Rmd-input file, the HTML-output file, and all other required files (e.g., data or image files).
Name your ZIP-archive so that it indicates your name, course, and the current date (e.g., LastName_FirstName_i2ds_1or2_yymmdd.zip).
Email your ZIP-archive to the course instructor (with the subject line indicating your course) before the submission deadline.
The deadline for submitting your archived project (in the winter semester 2024/2025) is Friday, February 28, 2025 (at 23:59).
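The submission steps can also be scripted. The following sketch assumes that the working directory is the project directory and that the rmarkdown package and a system `zip` program are available. The function names `archive_name()` and `build_submission()`, as well as the file names in the usage example, are hypothetical helpers for illustration, not part of any package.

```r
# Create an archive name of the form LastName_FirstName_course_yymmdd.zip:
archive_name <- function(last, first, course, date = Sys.Date()) {
  paste0(last, "_", first, "_", course, "_", format(date, "%y%m%d"), ".zip")
}

archive_name("LastName", "FirstName", "i2ds_1", as.Date("2025-02-28"))
# "LastName_FirstName_i2ds_1_250228.zip"

# Compile the Rmd-file into a self-contained HTML-file and zip all files:
build_submission <- function(rmd_file, other_files, zipfile) {
  stopifnot(file.exists(rmd_file), all(file.exists(other_files)))
  # Requires the rmarkdown package (and pandoc, which ships with RStudio):
  rmarkdown::render(
    rmd_file,
    output_format = rmarkdown::html_document(self_contained = TRUE)
  )
  html_file <- sub("\\.Rmd$", ".html", rmd_file)
  # utils::zip() relies on an external zip program being installed:
  utils::zip(zipfile = zipfile, files = c(rmd_file, html_file, other_files))
}

# Usage (hypothetical file names; not run here):
# build_submission("project.Rmd", "data/my_data.csv",
#                  archive_name("LastName", "FirstName", "i2ds_1"))
```

Compiling the Rmd-file in a fresh R session before archiving is a simple way to verify that it only depends on the packages and files that it loads itself.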
A.3 Advice
The best advice for a successful data science project is to find and do something that you are really interested in. Beyond that, start early, explain and document what you are doing, and — most importantly — have fun!
A.4 Resources
Appendices of the Data Science for Psychologists (ds4psy) textbook (Neth, 2023a):
Appendix C: Data science project provides general advice for successful data science projects
Appendix B.3.3 links to potential data sources. Please make sure that your data may be shared and include appropriate references to credit its creators or providers.
Inspirations for models and simulations:
Page (2018) contains dozens of models that could be implemented in simulations
The Learning Machines blog provides many inspirations that can be developed into projects
Tools for creating R packages and interactive applications:
Using devtools for creating R packages (Wickham, 2015; Wickham et al., 2022; Wickham & Bryan, in progress); see Appendix B
Using R Shiny for creating interactive dashboards (Chang et al., 2024; Wickham, 2021)