2.10 Exercise: Ten common characteristics of big data (Salganik 2017)

  • Q: What is a concrete example of each characteristic described below? Please pick one characteristic each and discuss your example in groups.
  1. Big: Large datasets are a means to an end; they are not an end in themselves.
  2. Always-on: Always-on big data enables the study of unexpected events and real-time measurement.
  3. Non-reactive: Measurement in big data sources is much less likely to change behavior.
  4. Incomplete: No matter how big your big data, it probably doesn’t have the information you want.
  5. Inaccessible: Data held by companies and governments are difficult for researchers to access.
  6. Nonrepresentative: Nonrepresentative data are bad for out-of-sample generalizations, but can be quite useful for within-sample comparisons.
  7. Drifting: Population drift, usage drift, and system drift make it hard to use big data sources to study long-term trends.
  8. Algorithmically confounded: Behavior in big data systems is not natural; it is driven by the engineering goals of the systems.
  9. Dirty: Big data sources can be loaded with junk and spam.
  10. Sensitive: Some of the information that companies and governments have is sensitive.
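  A minimal sketch of characteristic 6 (Nonrepresentative), assuming a made-up population with two equally sized age groups and a platform that over-represents young users. All names and numbers (`population_share`, `sample_share`, `baseline`, `effect`) are hypothetical and chosen only for illustration. The sketch shows why a sample mean generalizes poorly to the population, while a comparison made within the sample can still be informative.

```python
# Hypothetical illustration: out-of-sample generalization vs. within-sample comparison.
# All numbers are invented for this sketch.

population_share = {"young": 0.5, "old": 0.5}   # true shares in the population
sample_share = {"young": 0.9, "old": 0.1}       # platform over-represents young users
baseline = {"young": 10.0, "old": 2.0}          # e.g. posts per week without a feature
effect = 1.5                                    # assumed identical in both groups


def weighted_mean(values, shares):
    """Mean of group-level values, weighted by group shares."""
    return sum(values[group] * shares[group] for group in values)


# Out-of-sample generalization: the sample mean misses the population mean.
pop_mean = weighted_mean(baseline, population_share)
sample_mean = weighted_mean(baseline, sample_share)

# Within-sample comparison: difference between with-feature and without-feature outcomes.
with_feature = {group: baseline[group] + effect for group in baseline}
sample_effect = weighted_mean(with_feature, sample_share) - sample_mean
pop_effect = weighted_mean(with_feature, population_share) - pop_mean

print("population mean:", round(pop_mean, 2))
print("sample mean:", round(sample_mean, 2))
print("effect in sample:", round(sample_effect, 2))
print("effect in population:", round(pop_effect, 2))
```

  Under these assumptions the sample mean (9.2) is far from the population mean (6.0), yet the within-sample comparison recovers the effect of 1.5. Note that the sample effect matches the population effect here only because the effect is assumed to be the same in both groups; if it differed by group, the within-sample comparison would still describe the sample but would not automatically generalize.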
  • BUT: As we discussed in Section 1.20, Gary King suggests that "Big Data is not about the Data!", i.e., the revolution lies in the development of analytical and statistical tools rather than in the availability of large amounts of data. Below we’ll take a step back and review some arguments suggesting that Big Data really is about the data.