2.10 Exercise: Ten common characteristics of big data (Salganik 2017)

Q: What are concrete examples of the characteristics described below? Please pick one each (and discuss in groups).

Big: Large datasets are a means to an end; they are not an end in themselves.
Always-on: Always-on big data enables the study of unexpected events and real-time measurement.
Non-reactive: Measurement in big data sources is much less likely to change behavior.
Incomplete: No matter how big your big data, it probably doesn’t have the information you want.
Inaccessible: Data held by companies and governments are difficult for researchers to access.
Nonrepresentative: Nonrepresentative data are bad for out-of-sample generalizations, but can be quite useful for within-sample comparisons.
Drifting: Population drift, usage drift, and system drift make it hard to use big data sources to study long-term trends.
Algorithmically confounded: Behavior in big data systems is not natural; it is driven by the engineering goals of the systems.
Dirty: Big data sources can be loaded with junk and spam.
Sensitive: Some of the information that companies and governments have is sensitive.
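
The "Nonrepresentative" point above can be made concrete with a small simulation. The following is a minimal sketch and not taken from Salganik (2017): the two groups, their baselines, the inclusion probabilities, and the effect size are all hypothetical values chosen for illustration. It assumes that exposure is unrelated to group membership and that the effect is the same in both groups. Under those assumptions, a sample that heavily over-represents one group gives a misleading estimate of the population mean (out-of-sample generalization), yet the exposed-versus-unexposed comparison inside the sample stays close to the assumed effect (within-sample comparison).

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: two groups with different baseline outcomes.
BASELINE = {"young": 2.0, "old": 6.0}
EFFECT = 1.5  # assumed effect of exposure, identical in both groups

population = []
for _ in range(50_000):
    group = rng.choice(["young", "old"])
    exposed = rng.random() < 0.5  # exposure is independent of group
    outcome = BASELINE[group] + EFFECT * int(exposed) + rng.gauss(0, 1)
    population.append((group, exposed, outcome))

# Nonrepresentative sample: young people are far more likely to be included.
INCLUSION = {"young": 0.20, "old": 0.02}
sample = [row for row in population if rng.random() < INCLUSION[row[0]]]


def overall_mean(rows):
    """Average outcome across all rows."""
    return mean(y for _, _, y in rows)


def exposure_gap(rows):
    """Mean outcome of exposed rows minus mean outcome of unexposed rows."""
    exposed = [y for _, e, y in rows if e]
    unexposed = [y for _, e, y in rows if not e]
    return mean(exposed) - mean(unexposed)


print("population mean:", round(overall_mean(population), 2))
print("sample mean:    ", round(overall_mean(sample), 2))
print("population gap: ", round(exposure_gap(population), 2))
print("sample gap:     ", round(exposure_gap(sample), 2))
```

If the effect differed between groups, or if inclusion in the sample depended on exposure, the within-sample comparison would no longer transfer so cleanly, which is one reason the claim in the list is hedged with "can be".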

BUT: As we discussed in Section 1.20, Gary King suggests that "Big Data is not about the Data!", i.e., the revolution lies in the development of analytical and statistical tools rather than in the large amounts of data themselves. Below we'll take a step back and review some arguments suggesting that Big Data really is about the data.