4.7 Data underlying graphs

  • Data could be “raw”[1] or processed
    • Processed data could be…
      • …aggregated data
      • …data summarized through a statistical model
  • Distinction is important
    • Software (like ggplot2) has processing abilities
    • Sometimes we might want to do the processing ourselves
      • e.g., calculate averages ourselves instead of using software/ggplot2 to do it
    • Sometimes we might want to store all the relevant information in a dataframe
  • ggplot2 has the argument stat = "identity"[2]
    • Q: Does anyone know what that does?
  • Rules:
    • Try to automate everything (reproducibility!)
    • Reduce distance between data and graph (e.g., variable names = labels)
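
A minimal sketch of the "do the processing ourselves" route, using base R only. The data frame `raw_data` and its columns `group` and `income` are hypothetical and the values are made up for illustration; the point is only that the averages end up stored in their own dataframe.

```r
# Hypothetical raw data: one row per observation (values are made up)
raw_data <- data.frame(
  group  = c("A", "A", "B", "B", "B"),
  income = c(10, 20, 30, 40, 50)
)

# Processed data: one row per group, with the averages calculated by us
# rather than by the plotting software
avg_data <- aggregate(income ~ group, data = raw_data, FUN = mean)
avg_data
```

Keeping `avg_data` as a separate object means the summarized values can be inspected, saved, or reused before anything is plotted.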

  1. Here “raw” means that we didn’t summarize the data, e.g., we didn’t aggregate it or summarize it through a statistical model.

  2. Tells ggplot2 to visualize the data “as is”. For example, with geom_bar() we can either feed it raw data, which geom_bar() then summarizes/aggregates itself, or feed it data that we summarized beforehand and tell geom_bar() not to summarize it again.
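
The two options for geom_bar() can be sketched as follows. This is an illustration under assumptions: the data frame `raw_data`, its column `party`, and the object names are hypothetical, and the values are made up. It assumes the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical raw data: one row per respondent (values are made up)
raw_data <- data.frame(party = c("A", "A", "B", "B", "B", "C"))

# Option 1: feed raw data; geom_bar() counts the rows per party itself
# (its default is stat = "count")
p_raw <- ggplot(raw_data, aes(x = party)) +
  geom_bar()

# Option 2: count beforehand and store the result in a dataframe ...
counts <- as.data.frame(table(party = raw_data$party), responseName = "n")

# ... then tell geom_bar() to plot the stored values "as is"
p_processed <- ggplot(counts, aes(x = party, y = n)) +
  geom_bar(stat = "identity")
```

Both plots should show the same bar heights; the difference is only where the counting happens. ggplot2 also provides geom_col(), which plots values as is without needing the stat argument.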