A Data science projects

This appendix describes two types of data science (DS) projects, the criteria for evaluating them, and some ideas and resources for making them interesting and engaging.

A.1 Two types of projects

We distinguish between two types of DS projects: basic DS projects are suitable for an introductory course (e.g., i2ds 1), whereas advanced DS projects are suitable for a follow-up course (e.g., i2ds 2).

A.1.1 Basic DS projects

Basic DS projects should find and explore an interesting real-world dataset (as in Chapter 15: Exploring data). Suitable datasets are non-trivial (i.e., contain at least a few dozen, or better hundreds, of observations and a mix of character/factor and numeric variables) and should raise interesting research questions. Ideally, the project should explain and document all steps and answer the question(s) that motivated it.

Key steps

More specifically, proceed as follows to conduct a basic DS project:

  1. Find some non-trivial data that may be shared and raises interesting questions (see below for links and suggestions).

  2. Conduct an exploratory data analysis (EDA, as in Chapter 15: Exploring data) in a reproducible Rmd-file:

    • Formulate some questions to be answered by the data
    • Read in the data (e.g., from a CSV-file)
    • Describe the data (its dimensions, observations, variable types, etc.)
    • Tidy and transform the data
    • Visualize key aspects of the data
    • Answer your questions (or explain why they remain unanswered)
  3. Document and explain all steps (in text and commented code) and note which R packages and functions are being used to solve which tasks.

  4. Include links and references to data sources, R packages, and all other sources used.

In short, a basic DS project should locate some real-world data that is non-trivial, may be shared, and raises interesting questions. Then explore the data to answer the questions in a well-documented and reproducible fashion.
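
The key steps of an exploratory data analysis can be sketched in a few lines of R. The following is only a minimal illustration under stated assumptions: it uses the built-in iris data as a stand-in for a real-world dataset, writes it to a temporary CSV-file so that the reading step can be shown, and relies on base R functions only (tidyverse equivalents would work just as well). The research question and the derived variable Petal.Ratio are invented for this example.

```r
# Stand-in data: write the built-in iris data to a temporary CSV-file.
# (A real project would read in its own data file instead.)
csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)

# 1. Formulate a question (hypothetical example):
#    Do the three species differ in their average petal length?

# 2. Read in the data:
df <- read.csv(csv_file, stringsAsFactors = TRUE)

# 3. Describe the data:
dim(df)      # number of observations and variables
str(df)      # variable names and types
summary(df)  # basic descriptive statistics

# 4. Tidy and transform the data (here: add a derived variable):
df$Petal.Ratio <- df$Petal.Length / df$Petal.Width

# 5. Visualize a key aspect of the data:
boxplot(Petal.Length ~ Species, data = df,
        main = "Petal length by species", ylab = "Petal length (cm)")

# 6. Answer the question (here: by comparing group means):
mean_len <- aggregate(Petal.Length ~ Species, data = df, FUN = mean)
mean_len
```

In an actual Rmd-file, each of these steps would sit in its own code chunk and be accompanied by explanatory text.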

See Appendix B.3 of the ds4psy textbook (Neth, 2023a) for links to potential data sources.

A.1.2 Advanced DS projects

Advanced DS projects should integrate various chapters and topics and go beyond existing examples. Ideally, an advanced DS project should have or imply real-world applications. Such projects could create new models or simulations, contribute to existing or create new R packages, or use existing R functions in interactive applications (e.g., using Shiny).

Ideas for advanced DS projects

The following ideas for advanced DS projects are based on chapters of Part 6: Applications:

  • Comparing strategies in games (e.g., heuristic vs. learning agents)
  • Performing a social network analysis
  • Creating a mate search simulation
  • Creating a foraging model (e.g., comparing heuristic or RL approaches in single vs. multi-agent simulations)
  • Predicting the stock market (and evaluating portfolio performance)
  • Plotting text (see Section 24.3)
  • Conducting a sentiment analysis
  • Creating artistic visualizations (see Section 24.4.2)

Projects relating to specific R packages

Projects can also collect or provide new data and revise or extend functions from existing R packages. Related ideas for advanced DS projects include:

  • Collecting new data (and providing them as an R package)

  • Creating new data processing or visualization functions (e.g., see the R packages ds4psy, unikn, unicol, FFTrees, or riskyr)

  • Building an interactive application for an existing function (e.g., using R Shiny)

A.2 Desiderata and requirements

The success of DS projects will be evaluated on their content and formal characteristics:

A.2.1 Content

Key ingredients of a successful DS project include:

  • Ask an interesting question that can be answered within a course project
  • Sketch the analysis, method, or model that is suited to answering the question
  • Find or generate suitable data
  • Implement the analysis, method, or model
  • Consider including summaries and visualizations
  • Interpret your results to answer your questions
  • Conclude by mentioning limitations and/or next steps

A.2.2 Form

Formal requirements include:

  • Use a reproducible Rmd-file that presents everything in a transparent fashion
  • Begin by loading all required R packages and data (from files or data packages)
  • Document your methodology, intermediate steps, and conclusions (in both code and text)
  • Note which R functions and packages you have been using for solving which tasks
  • Include links and references to data sources, R packages, and all other sources
  • Generate a self-sufficient output file (in HTML or PDF format)
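
Two of these requirements (loading all packages at the beginning and crediting the software used) can be met with a few lines of base R. The following sketch assumes that it is placed in the first and last code chunks of an Rmd-file; the packages listed are mere placeholders for those that a project actually uses.

```r
# At the beginning of the Rmd-file: load all required packages.
# (Base packages serve as placeholders here; list your own instead.)
library(stats)
library(graphics)
library(utils)

# At the end of the Rmd-file: obtain references for R and its packages.
r_ref   <- citation()         # how to cite R itself
pkg_ref <- citation("utils")  # how to cite a specific package
print(r_ref, style = "text")

# Record the R version and the packages attached in this session:
session <- sessionInfo()
session$R.version$version.string
```

Printing the result of sessionInfo() at the end of a document also helps others to reproduce the analysis with the same package versions.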

A.2.3 Submission

To submit your project, organize it as an R project and store all of its files in a single ZIP-archive. More specifically,

  • Include all text and code in a single R Markdown file.
  • Ensure that your Rmd-input file loads or reads in your data and compiles into an output file (using only R packages loaded in it).
  • Compile your Rmd-input file into a self-contained HTML-output file.
  • Store your input file, output file, and all other required files (e.g., data or images) in a single project directory.
  • Create a ZIP-archive that contains your Rmd-input file, the HTML-output file, and all other required files.
  • Name your ZIP-archive so that it indicates your name and course (e.g., Your_Name_i2ds_1.zip or Last_First_i2ds_2_yymmdd.zip).
  • Email your ZIP-archive to the course instructor (with the subject line indicating your course) before the deadline of your course.
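
The compiling and archiving steps above can also be scripted in R. The following sketch rests on several assumptions: the project directory, the file name project.Rmd, and the name parts Last_First_i2ds_1 are hypothetical placeholders; compiling requires the rmarkdown package and pandoc; and utils::zip() relies on a zip program being installed on the system. A tiny dummy Rmd-file is created in a temporary directory so that the example is self-contained.

```r
# Hypothetical project directory with a minimal Rmd-input file:
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)
rmd_file <- file.path(project_dir, "project.Rmd")
writeLines(c("---",
             "title: \"My DS project\"",
             "output: html_document",
             "---",
             "",
             "Some text."),
           rmd_file)

# Compile the Rmd-input file into an HTML-output file
# (only if rmarkdown and pandoc are available):
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, output_format = "html_document", quiet = TRUE)
}

# Name the archive so that it indicates name, course, and date (yymmdd):
zip_name <- paste0("Last_First_i2ds_1_", format(Sys.Date(), "%y%m%d"), ".zip")
zip_path <- file.path(tempdir(), zip_name)

# Create a ZIP-archive of all files in the project directory:
old_wd <- setwd(project_dir)
files  <- list.files(recursive = TRUE)
utils::zip(zipfile = zip_path, files = files)
setwd(old_wd)
```

Before sending the archive, it is worth unpacking it in a fresh location and checking that the Rmd-file still compiles there.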

The current deadline for submitting your archived project (in the spring/summer semester 2024) is Wednesday, July 31, 2024 (at 23:59).

A.3 Advice

The best advice for a successful data science project is to find and do something that you are really interested in. Beyond that, start early, explain and document what you are doing, and, most importantly, have fun!

A.4 Resources

Appendices of the Data Science for Psychologists (ds4psy) textbook (Neth, 2023a):

  • Appendix C: Data science project provides general advice for successful data science projects

  • Appendix B.3.3 links to potential data sources. Please make sure that your data may be shared and include appropriate references to credit its creators or providers.

Inspirations for models and simulations:

  • Page (2018) contains dozens of models that could be implemented in simulations

  • The Learning Machines blog provides many inspirations that can be developed into projects

Tools for creating R packages and interactive applications: