C.1 Essential elements of DS-projects

C.1.1 Structure

The following notes provide some ideas on how to structure a data science project.

Motivation

Any project should begin by answering the question: What motivates this project, and why should we care? A well-motivated data science project links questions with data. Thus, there are two principal ways to motivate a data science project:

  1. Starting from questions: Ask some research question(s) that you hope to answer with data. Which data could address or answer the question?

  2. Starting from data: Why use this data? Which interesting questions can it inform or answer?

Note that the questions should be expressed without referring to the data (e.g., rather than “What is the correlation between variable x and variable y?”, try a more general statement like “What is the relation between someone’s activity level and mood?”).

Data

The data used in a project should meet the following criteria:

  • Availability: The data should be publicly accessible and included in the submitted project (in csv format).[56]

  • Amount: There are no specific limits or restrictions regarding data size, but the dataset should not be trivial (i.e., larger datasets are better than smaller ones, data spread over several files is better than data in a single file, etc.).

  • Variety: The data or the analysis should typically include multiple types of variables (e.g., categorical, logical, numeric, text, dates or times).

  • Variables: Include a codebook (i.e., an overview of all variables, their types, and ranges).

  • References: Include citations of and references for all sources (e.g., of data, corresponding articles, and online sources).
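As an illustration of the codebook criterion, the following base R sketch builds a simple overview of all variables, their types, and their ranges. The data frame `dat` and the helper `describe_var()` are hypothetical and only serve as an example; a real codebook would typically also explain what each variable means.

```r
# Hypothetical example data (not taken from any real project)
dat <- data.frame(
  id     = 1:5,
  group  = factor(c("a", "b", "a", "b", "a")),
  score  = c(2.5, 3.0, NA, 4.5, 1.0),
  passed = c(TRUE, TRUE, NA, TRUE, FALSE),
  date   = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                     "2020-01-04", "2020-01-05"))
)

# Hypothetical helper: describe one variable by its type,
# its number of missing values, and its range (or its distinct values)
describe_var <- function(x) {
  if (is.numeric(x) || inherits(x, "Date")) {
    values <- paste(format(range(x, na.rm = TRUE)), collapse = " to ")
  } else {
    values <- paste(sort(unique(as.character(x[!is.na(x)]))), collapse = ", ")
  }
  data.frame(type = class(x)[1], n_missing = sum(is.na(x)), values = values,
             stringsAsFactors = FALSE)
}

# Apply the helper to every column and combine the results into one table
codebook <- do.call(rbind, lapply(dat, describe_var))
codebook <- cbind(variable = names(dat), codebook)
rownames(codebook) <- NULL
print(codebook)
```

The resulting table could be saved with `write.csv()` and submitted alongside the data.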

Data exploration (EDA)

  • Clean up data, etc.

  • Check data for missing values, outliers, etc.

  • Provide an overview of distributions of key variables, relationships between variables, etc.

  • Join and/or re-format tables, select cases or variables, etc.
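The exploration steps above can be sketched in base R as follows. The data frames `people` and `groups` are invented for illustration, and the 1.5 × IQR rule is only one common heuristic for flagging potential outliers, not a prescribed method.

```r
# Hypothetical data: activity and mood ratings of six people
people <- data.frame(
  id       = 1:6,
  activity = c(3.2, 4.1, NA, 2.8, 3.9, 25.0),  # 25.0 looks implausible
  mood     = c(5, 6, 4, NA, 6, 7)
)
groups <- data.frame(id = 1:6, group = c("a", "a", "b", "b", "a", "b"))

# 1. Count missing values per variable
n_missing <- colSums(is.na(people))
print(n_missing)

# 2. Flag potential outliers with the 1.5 * IQR rule (a heuristic)
q <- quantile(people$activity, probs = c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(people$activity) &
  (people$activity < q[1] - fence | people$activity > q[2] + fence)
print(people[is_outlier, ])

# 3. Overview of distributions and of a relationship between variables
print(summary(people[, c("activity", "mood")]))
r <- cor(people$activity, people$mood, use = "complete.obs")

# 4. Join tables by a shared key variable
joined <- merge(people, groups, by = "id")
```

Whether a flagged value is an error or a genuine observation cannot be decided by code alone, so any exclusion should be documented and justified in the report.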

Data transformation and visualization

  • Select, add, or change variables.

  • Compute descriptive summary tables and corresponding visualizations of various types (that make sense of the data).

  • Can you create a tidy version of the data? If so, how? If not, why not?

  • Try to descriptively answer your initial research questions (e.g., by creating corresponding summary tables or visualizations).
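As a minimal sketch of these steps, the following base R code turns a small wide table into a tidy (long) one and computes a descriptive summary table. The data frame `wide` and its variable names are made up for illustration.

```r
# Hypothetical wide data: one row per person, one column per time point
wide <- data.frame(
  id      = 1:4,
  group   = c("a", "a", "b", "b"),
  mood_t1 = c(4, 5, 3, 6),
  mood_t2 = c(5, 7, 3, 8)
)

# Tidy (long) version: one row per person and time point,
# one column per variable
long <- data.frame(
  id    = rep(wide$id, times = 2),
  group = rep(wide$group, times = 2),
  time  = rep(c("t1", "t2"), each = nrow(wide)),
  mood  = c(wide$mood_t1, wide$mood_t2)
)

# Descriptive summary table: mean mood by group and time
mood_means <- aggregate(mood ~ group + time, data = long, FUN = mean)
print(mood_means)
```

In the tidyverse, `tidyr::pivot_longer()` and `dplyr::summarise()` would achieve the same result. A table like `mood_means` can then be passed to a plotting function to create the corresponding visualization.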

Conclusion

  • Results: Which answer(s) does your analysis suggest (to the questions raised in the introduction)?

  • Limitations: What limits your possible conclusions? What could overcome these limitations?

  • Implications: What follows from your analysis? What should be done next?

Hint: If your final report contains these headings and says something meaningful about each section, it is pretty hard to assign a bad grade to it.

C.1.2 Requirements and grading

Due to the great flexibility and wide variety of possible DS-projects, it is challenging to describe and evaluate them in a coherent fashion. Students always ask for a list of boxes they can check off, but including a very detailed list here would inevitably limit the range of possible projects. For instance, whereas most projects will probably use some combination of numeric and character data, other projects may focus primarily on text- or time-based data. Although providing descriptive statistics on the latter may make sense, these will play a very different role in the project.

Similarly, it makes absolutely no sense to list a required set of commands or packages for an excellent data science project. Most projects submitted in the context of this course tend to rely heavily on the tidyverse (Wickham, Averick, et al., 2019) packages dplyr, ggplot2, tibble, and tidyr, but it would be wrong to censure anyone for providing a sound analysis in base R or with other packages.

Most projects are based on, inspired by, and dependent upon a particular dataset. Alternatively, some of the most innovative projects may involve creating simulations or programming functions that facilitate some analysis or visualization. If someone is dissatisfied with existing functions for quantifying text (e.g., the ds4psy functions count_chars() and count_words()) or for computing with dates and times (e.g., the ds4psy functions diff_dates() and diff_times()), this could be the beginning of a fantastic project. Such projects would not begin with a particular dataset, but would still involve data later (for testing and demonstrating the use of the new functions).
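To give an impression of what a function-based project might start from, here is a minimal base R sketch of a word-counting helper. The function `count_words_simple()` is hypothetical and is not the ds4psy implementation of count_words(); it simply treats any run of whitespace as a word boundary.

```r
# Hypothetical helper (NOT the ds4psy implementation):
# count the words in each element of a character vector
count_words_simple <- function(x) {
  x <- trimws(as.character(x))               # drop leading/trailing spaces
  n <- lengths(strsplit(x, "[[:space:]]+"))  # split on runs of whitespace
  n[is.na(x)] <- NA_integer_                 # keep missing values missing
  n[!is.na(x) & x == ""] <- 0L               # empty strings contain no words
  n
}

# Testing the function on a few edge cases
# (extra spaces, an empty string, and a missing value)
word_counts <- count_words_simple(c("Data science is fun",
                                    "  two   words ", "", NA))
print(word_counts)
```

A project of this kind would then go on to discuss the design choices involved (e.g., how to treat punctuation or hyphenated words) and to demonstrate the function on real text data.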

To satisfy the understandable need for transparency without limiting the scope of possible projects, here is a description of the project’s minimal requirements (for passing the course) and some ways of scoring bonus points (for earning a good or excellent grade).

Minimal requirements

To meet the project requirement, you need to do the following:

  1. Ask at least one non-trivial question that can be addressed by data.

  2. Use data and R commands (in an .Rmd document) to answer your question(s).

    • Include some (descriptive) quantification of the data.

    • Include some visualization of your results.

  3. Document the data, your answer(s), and the solution process (in an .html or .pdf document).

Bonus points

Here are some suggestions for what you can do to distinguish your project and exceed the minimum requirements:

  • Ask multiple questions that allow for more nuanced answers.
  • Explicitly document all steps, the R packages used, and your code.
  • Use multiple datasets and combine them.
  • Make sure that your data is encoded in csv format (if possible) and can easily be exchanged between platforms.
  • Use multiple data types (e.g., categorical and continuous numbers, logical variables, text, time, …).
  • Screen all key variables of your data prior to using them.
  • Make sure that your code can be executed and explain the purpose of new variables and data transformations.
  • Use informative summary tables for showing descriptive statistics.
  • Use different visualizations that support your verbal interpretations.
  • Use informative table and figure captions and APA format for all references.
  • Create your own functions for analyzing or visualizing data.
  • Justify your interpretations and conclusions.
  • Critically reflect on the implications and limitations of your data and/or analysis.

None of these points is strictly necessary, but some of them must be present to earn a “good” or “excellent” grade.

C.1.3 Submitting a project

All projects should contain a header that clearly indicates a project title, your name, and the current date.

What to include

Please submit a zipped archive of a file folder or an R-project, including all scripts and data (in csv format)[57]:

  • Code: An executable .R or .Rmd script containing all code.

  • Output: One document suitable for understanding and evaluating the project (either an html file that combines all text and code or a pdf file that contains all text and figures).

If you are hosting your project online (e.g., on GitHub or some other site), you can also submit a link to it.

Submission deadline

An email with your attached project should be sent to your instructor no later than Tuesday, September 01, 2020.

Alternatively, submit your project to the corresponding folder on the Ilias web platform.

References

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686


  [56] Contact the instructor in case you want to use data that cannot be included (e.g., due to legal or other restrictions).

  [57] Contact the instructor early enough if the size of your archive exceeds 5 MB.