Exercise: Ten common characteristics of big data (Salganik 2017)
- Q: What are concrete examples of the characteristics described below? Please pick one each (and discuss in groups).
- Big: Large datasets are a means to an end; they are not an end in themselves.
- Always-on: Always-on big data enables the study of unexpected events and real-time measurement.
- Non-reactive: Measurement in big data sources is much less likely to change behavior.
- Incomplete: No matter how big your big data, it probably doesn’t have the information you want.
- Inaccessible: Data held by companies and governments are difficult for researchers to access.
- Nonrepresentative: Nonrepresentative data are bad for out-of-sample generalizations, but can be quite useful for within-sample comparisons.
- Drifting: Population drift, usage drift, and system drift make it hard to use big data sources to study long-term trends.
- Algorithmically confounded: Behavior in big data systems is not natural; it is driven by the engineering goals of the systems.
- Dirty: Big data sources can be loaded with junk and spam.
- Sensitive: Some of the information that companies and governments have is sensitive.
- BUT: As we discussed in Section 1.20, Gary King argues that "Big Data is not about the Data!", i.e., the revolution lies in the development of analytical/statistical tools rather than in large amounts of data. Below we'll take a step back and review some arguments suggesting that Big Data really is about the data.