C.1 Essential elements of DS-projects

C.1.1 Structure

The following notes provide some ideas on how to structure a data science project.

Motivation

Any project should begin by answering the question: What motivates this project, and why should we care? A well-motivated data science project links questions with data. Thus, there are two principal ways to motivate a data science project:

  1. Starting from questions: Ask some research question(s) that you hope to answer with data. Which data could address or answer the question?

  2. Starting from data: Why use this data? Which interesting questions can it inform or answer?

Note that the questions should be expressed without referring to the data (i.e., rather than “What is the correlation between variable x and variable y?” try a more general statement, like “What is the relation between someone’s activity level and mood?”).

Hint: The biggest possible boost for any project is your intrinsic motivation. Choosing a topic that really interests you will not only make your project more successful, but may even make it enjoyable to work on.

Data

The data used in a project should meet the following criteria:

  • Availability: The data should be publicly accessible and must be included in the submitted project (in csv or some other platform-independent format).[76]

  • Amount: There are no specific limits or restrictions regarding data size, but the dataset should not be trivial (i.e., larger datasets are better than smaller ones, data from several files better than just one file, etc.).

  • Variety: The data or the analysis should typically include multiple types of variables (e.g., categorical, logical, numeric, text, dates or times).

  • Variables: Include a codebook (i.e., an overview of all variables, their types, and ranges).

  • References: Include citations and references for all sources (e.g., data, corresponding articles, and online sources).
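A codebook can be generated from the data itself. The following is a minimal sketch in base R, assuming a small made-up data frame `dat`. The helper `make_codebook()` and all variable names are hypothetical and only serve as an illustration:

```r
# Hypothetical example data (made up for illustration):
dat <- data.frame(
  id     = 1:5,
  group  = factor(c("a", "b", "a", "b", "a")),
  mood   = c(3.5, 4.0, NA, 2.5, 5.0),
  active = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  day    = as.Date("2022-08-01") + 0:4
)

# make_codebook() is a hypothetical helper: one row per variable,
# listing its type, number of missing values, and range (or levels).
make_codebook <- function(df) {
  describe <- function(x) {
    if (is.factor(x))  return(paste(levels(x), collapse = ", "))
    if (is.logical(x)) return("TRUE / FALSE")
    if (all(is.na(x))) return(NA_character_)
    r <- range(x, na.rm = TRUE)
    paste(format(r[1]), "to", format(r[2]))
  }
  data.frame(
    variable  = names(df),
    type      = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    values    = vapply(df, describe, character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

codebook <- make_codebook(dat)
print(codebook)
```

A table like this is only a starting point: a useful codebook also explains in words what each variable means and how it was measured.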

Data exploration (EDA)

  • Clean up data, etc.

  • Check data for missing values, outliers, etc.

  • Provide an overview of distributions of key variables, relationships between variables, etc.

  • Join and/or re-format tables, select cases or variables, etc.
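As an illustration of the screening steps above, here is a minimal sketch in base R. The data are made up, the helper `is_outlier()` is hypothetical, and the 1.5 * IQR rule is just one common convention for flagging unusual values:

```r
# Hypothetical example data (made up for illustration):
dat <- data.frame(
  steps = c(4200, 5100, NA, 4800, 25000, 5300, 4600),
  mood  = c(3, 4, 4, NA, 5, 3, 4)
)

# Count missing values per variable:
n_missing <- colSums(is.na(dat))
print(n_missing)

# Flag potential outliers with the 1.5 * IQR rule
# (an assumed convention, not a course requirement):
is_outlier <- function(x, k = 1.5) {
  q   <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  !is.na(x) & (x < q[1] - k * iqr | x > q[2] + k * iqr)
}

outlier_steps <- is_outlier(dat$steps)
print(dat[outlier_steps, ])

# Overview of the distribution of each variable:
summary(dat)
```

Whether a flagged value is an error or a genuine observation is a judgment that should be explained in the report rather than settled silently in code.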

Data transformation and visualization

  • Select, add, or change variables.

  • Compute descriptive summary tables and corresponding visualizations of various types (that make sense of the data).

  • Can you create a tidy version of the data? How so or why not?

  • Try to descriptively answer your initial research questions (e.g., by creating corresponding summary tables or visualizations).
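The steps of creating a tidy version of the data and a descriptive summary table could be sketched as follows. This is a base R sketch on made-up data, and all object and variable names are hypothetical. The same steps could equally be written with tidyr and dplyr:

```r
# Hypothetical wide data (made up): one row per person, one column per time point.
wide <- data.frame(
  id      = 1:3,
  mood_t1 = c(2, 3, 4),
  mood_t2 = c(4, 5, 6)
)

# Tidy (long) version: one row per observation (person x time point).
long <- reshape(wide, direction = "long",
                varying = c("mood_t1", "mood_t2"),
                v.names = "mood", timevar = "time",
                times = c("t1", "t2"), idvar = "id")
rownames(long) <- NULL

# Descriptive summary table: n, mean, and SD of mood per time point.
stats_by_time <- aggregate(mood ~ time, data = long,
                           FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
summary_tab <- do.call(data.frame, stats_by_time)  # flatten the matrix column
print(summary_tab)
```

A call such as `boxplot(mood ~ time, data = long)` would then visualize the same comparison that the summary table describes.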

Hint: Note that you are not required to provide a statistical analysis. If you want to add a statistical analysis, you are welcome to include it, of course. But ensure that your key variables and results are presented in a transparent fashion without statistical testing.

Conclusion

  • Results: Which answer(s) does your analysis suggest (to the questions raised in the introduction)?

  • Limitations: What limits your possible conclusions? What could overcome these limitations?

  • Implications: What follows from your analysis? What should be done next?

Hint: If your final report contains these headings and says something meaningful about each section, it is almost impossible to assign a bad grade to it.

C.1.2 Requirements and grading

Due to the large flexibility and wide variance of possible DS-projects, it is challenging to describe and evaluate them in a coherent fashion. Students always ask for a list of boxes they can check off, but including a very detailed list here would inevitably limit the range of possible projects. For instance, whereas most projects will probably use some combination of numeric and character data, other projects may focus primarily on text- or time-based data. Although providing descriptive statistics on the latter may make sense, these will play a very different role in the project.

Similarly, it makes no sense to list a required set of commands or packages for an excellent data science project. Most projects submitted in the context of this course tend to rely heavily on the tidyverse (Wickham et al., 2019) packages dplyr, ggplot2, tibble, and tidyr, but it would be wrong to censure anyone for providing a sound analysis in base R or using other packages.

Most projects are based on, inspired by, and dependent upon a particular dataset. Alternatively, many very innovative projects may involve creating simulations or programming functions that facilitate some analysis or visualization. If someone is dissatisfied with existing functions for quantifying text (e.g., the ds4psy functions count_chars() and count_words()) or for computing with dates and times (e.g., the functions diff_dates() and diff_times()), this can be the start of a fantastic project. However, such projects would not begin with a particular dataset, but only involve data at a later stage (for testing and demonstrating the use of new functions).

To satisfy the understandable need for transparency without limiting the scope of possible projects, here is a description of the project’s minimal requirements (for passing the course) and some ways of scoring bonus points (for earning a good or excellent grade).

Minimal requirements

To pass this project requirement, here is what you need to do:

  1. Ask at least one non-trivial question that can be addressed by data.

  2. Use data and R commands to answer your question(s).

    • Include some (descriptive) quantification of the data.

    • Include some visualization of your results.

  3. Document all elements and steps of your project: the data, your answer(s), and the solution process:

    • Explicitly load and cite your data and all R packages used.

    • Ensure that the data can be read and all R code can be executed.

    • Explain the purpose of each step and your interpretation and conclusions.

    • Cite all data sources and the R packages (in APA format).
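Loading data explicitly and retrieving citation information could look like the following sketch. The file name is hypothetical and the data are made up (written to a temporary folder only to keep the example self-contained). Note that the output of `citation()` is a starting point that still needs to be adapted to APA format:

```r
# Write a small made-up dataset to a temporary csv file, so that the
# example is self-contained (a real project would ship its own file):
csv_file <- file.path(tempdir(), "activity_mood.csv")  # hypothetical file name
write.csv(data.frame(activity = c("low", "high"), mood = c(3, 5)),
          file = csv_file, row.names = FALSE)

# Explicitly load the data from the csv file:
dat <- read.csv(csv_file, stringsAsFactors = FALSE)
str(dat)

# Obtain citation information for R and for a package that is used
# (here "stats", which ships with R; replace it by the packages you load):
cite_r     <- citation()
cite_stats <- citation("stats")
print(cite_r, style = "text")

# Record the R version and loaded packages for reproducibility:
sessionInfo()
```

Using a relative path to a file inside the project folder (rather than an absolute path on your own machine) helps to ensure that the code can be executed by someone else.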

Scoring bonus points

Here are some suggestions for what you can do to distinguish your project from the minimum requirements:

  • Ask multiple questions and questions that allow for nuanced answers.
  • Use multiple data types (e.g., categorical and continuous numbers, logical variables, text, time, …).
  • Use multiple datasets and combine them.
  • Make sure that your data is saved in and loaded from csv format (if possible).
  • Screen all key variables of your data prior to using them.
  • Conduct some analyses in several ways (e.g., to verify your results).
  • Explain the purpose of new variables and data transformations.
  • Use summary tables for showing descriptive statistics.
  • Use a variety of visualizations that support your verbal interpretations.
  • Use informative table and figure captions.
  • Create your own functions for analyzing or visualizing data.
  • Justify your choices, interpretations, and conclusions.
  • Critically reflect on the implications and limitations of your data and/or analysis.

None of these points is strictly necessary, but some of them must be present to earn a “good” or “excellent” grade.
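Three of these suggestions (combining datasets, creating your own functions, and conducting an analysis in several ways) are illustrated in the following base R sketch. The data are made up, and the function name `describe_num()` is hypothetical:

```r
# Two hypothetical datasets (made up) that share the key variable `id`:
persons <- data.frame(id = 1:4, group = c("a", "a", "b", "b"))
ratings <- data.frame(id = c(1L, 2L, 3L, 4L, 4L), mood = c(2, 4, 3, 5, 7))

# Combine both datasets by their common key:
dat <- merge(persons, ratings, by = "id")

# A custom function (hypothetical name) returning key descriptives:
describe_num <- function(x) {
  c(n = sum(!is.na(x)), mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
}

# Compute the group means in two different ways:
means_1 <- tapply(dat$mood, dat$group, mean)
means_2 <- vapply(split(dat$mood, dat$group),
                  function(x) describe_num(x)[["mean"]], numeric(1))

# Verify that both ways agree:
stopifnot(isTRUE(all.equal(as.vector(means_1), unname(means_2))))
print(means_2)
```

Checking a result by a second route is cheap and often reveals problems such as duplicated rows after a join or silently dropped missing values.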

C.1.3 Submitting a project

A completed DS-project contains at least three elements:

  1. One or more data files (in some platform-independent format)

  2. A source file (in R or Rmd format, containing code chunks, comments, and text)

  3. An output file (in html or pdf format, providing a report of your project)

All projects should contain a header that clearly indicates a project title, your name, and the current date.

Hint: Please choose descriptive and sensible names for your project, files, archive, and all communications related to them. When naming things, always aim to adopt the recipient’s or user’s perspective. Given that there are many students in a course, names like “data.csv” or “my DS-project” are not very helpful. By contrast, including your own name and some descriptive project title in the name of your archive is helpful.[77]

What to include

Please submit a zipped archive of a file folder or an R project, including all data and scripts:[78]

  1. Data: Include your data as a .csv file, or some other platform-independent format (and load this file in your code).

  2. Code: An executable .R or .Rmd script containing all code.

  3. Output: One document suited to understand and evaluate the project (either an html-file that combines all text and code OR a pdf file that contains all text and figures).

If you are hosting your project online (e.g., on GitHub or some other cloud server), you can submit a link to it.
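Before submitting, it can help to check the contents of your project folder programmatically. The following sketch builds a tiny stand-in project in a temporary folder, so all folder and file names are hypothetical. For a real project, `proj_dir` would point to your own folder:

```r
# Build a tiny stand-in project in a temporary folder, so that the example
# is self-contained (all file and folder names are hypothetical):
proj_dir <- file.path(tempdir(), "Lastname_mood_activity")
dir.create(proj_dir, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(proj_dir, "mood_activity.csv"),
          row.names = FALSE)
writeLines("# analysis code", file.path(proj_dir, "mood_activity.Rmd"))
writeLines("<html></html>",   file.path(proj_dir, "mood_activity.html"))

# Check that data, source, and output files are all present:
files <- list.files(proj_dir)
ext   <- tolower(tools::file_ext(files))
has_data   <- any(ext %in% c("csv"))
has_source <- any(ext %in% c("r", "rmd"))
has_output <- any(ext %in% c("html", "pdf"))

# Total size in MB, to decide whether a cloud link is preferable:
size_mb <- sum(file.size(file.path(proj_dir, files))) / 1024^2

ready <- has_data && has_source && has_output
cat("All elements present:", ready, "\n")
cat("Total size (MB):", round(size_mb, 3), "\n")
```

The check only covers file types and size. Whether the code actually runs from a fresh R session still has to be tested separately.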

Submission deadline

An email with your project (as an attachment or link) should be sent to your instructor no later than Wednesday, August 31, 2022.

Alternatively, submit your project to the corresponding folder on the Ilias web platform.

References

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

  [76] Contact the instructor before starting your project in case it involves data that cannot be distributed and shared (e.g., due to legal or other restrictions).

  [77] Pointing out these things feels a bit silly. But if I do not do so, about half of the submitted projects are named some variation of “my DS-project.”

  [78] Use a cloud-based solution if the size of your archive exceeds 5 MB.