2.2 The idea of rich data

What do we mean by data? It’s a term we use regularly, yet it is hard to pin down an exact definition. Data are individual items that do not carry any specific meaning on their own. Data come in many forms: numbers, words, symbols, etc. They describe transactions, events, and entities. However, data on their own are not necessarily very informative. By contrast, information is processed, organised data presented in a given context and in a way that is (hopefully) useful to humans.

Data Science provides us with a toolkit for undertaking meaningful data analysis and so turning data into information. When our analysis is driven by relevant questions, the information we generate is most likely to be of value to the stakeholders.

An interesting question is why the explosion in the volume and range of data is so recent. The web has certainly opened up both the possibilities and the availability of data. Cukier and Mayer-Schoenberger (2013) provide an insightful analysis based on the idea of ‘datafication’.
> Big data is also characterized by the ability to render into data many aspects of the world that have never been quantified before; call it “datafication.” … Google’s augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts. LinkedIn datafies professional networks. – Cukier and Mayer-Schoenberger

This process of datafication is ubiquitous and often happens without our (full) realisation, despite the recent introduction of the General Data Protection Regulation (GDPR)1 requirements of the EU and EEA. This growth in data also has huge business ramifications; see, for example, the 2016 McKinsey report (McKinsey Global Institute 2016) or an interesting article in the Harvard Business Review (McAfee et al. 2012).

Moving on, what do we mean by describing data as rich? ‘Big’ and ‘complex’ are possible answers. However, we might also think in terms of the range and diversity of data and data sources. In terms of diversity we have:

  • different scales e.g., continuous, absolute (e.g., counts), categorical (e.g., gender) — see the sketch after this list
  • quantitative vs text
  • tweets vs novels
  • time series vs static data
  • images, sound and video
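
To make the idea of different measurement scales concrete, here is a minimal sketch in Python using pandas (an assumption on my part; the column names and values are invented purely for illustration):

```python
import pandas as pd

# A hypothetical table mixing three measurement scales in one dataset.
patients = pd.DataFrame({
    "weight_kg": [72.5, 68.0, 81.3],           # continuous scale
    "visits":    [1, 4, 2],                    # absolute scale (counts)
    "gender":    ["female", "male", "female"], # categorical scale
})

# Declaring the categorical column explicitly tells pandas to treat it as
# a set of discrete levels rather than arbitrary strings.
patients["gender"] = patients["gender"].astype("category")

print(patients.dtypes)
```

The point is simply that the scale of a variable constrains the analysis: a mean makes sense for `weight_kg` or `visits`, but not for `gender`.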

In terms of data complexity there are multiple dimensions:

  • structure (contrasted in the sketch after this list)
    1. structured data e.g., databases and table-like data such as comma-separated values (CSV) files
    2. semi-structured data where the structure is inferred perhaps by metadata or semantic tagging e.g., XML and JSON
    3. unstructured data e.g., audio streams or natural language text
  • single vs multiple sources
  • data quality e.g., noise, bias or missingness
  • complex interactions between variables (multi-collinearity) or complex causal paths with unobserved or even unknown variables
  • volume e.g., is distributed or parallel processing necessary?
  • static data vs changing in real-time
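
To illustrate the first of these dimensions, the sketch below contrasts structured and semi-structured data and adds a simple data-quality check for missingness. It is a minimal Python illustration, assuming pandas is available; the tiny CSV and JSON snippets are invented for the example:

```python
import io
import json
import pandas as pd

# Structured data: table-like CSV parses directly into rows and columns.
csv_text = "id,score\n1,0.8\n2,\n3,0.4\n"      # note the missing score for id 2
table = pd.read_csv(io.StringIO(csv_text))

# A simple data-quality check: how many values are missing per column?
print(table.isna().sum())

# Semi-structured data: JSON carries its structure in its keys (metadata),
# and records need not share the same fields.
json_text = '[{"id": 1, "score": 0.8, "tags": ["a", "b"]}, {"id": 2}]'
records = json.loads(json_text)
print(records[1].get("score"))                 # None: the field is simply absent
```

Unstructured data, such as free text or an audio stream, has no such ready-made mapping to rows, columns or keys, which is a large part of what makes it harder to analyse.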

So we can see that data amounting to ‘only’ a few megabytes might still be extremely challenging to analyse, while simply structured data comprising many gigabytes or more might be relatively straightforward.

References

Cukier, Kenneth, and Viktor Mayer-Schoenberger. 2013. “The Rise of Big Data: How It’s Changing the Way We Think about the World.” Foreign Affairs 92: 28–40.
McAfee, Andrew, Erik Brynjolfsson, Thomas H Davenport, DJ Patil, and Dominic Barton. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90 (10): 60–68.
McKinsey Global Institute. 2016. “The Age of Analytics: Competing in a Data-Driven World.” San Francisco: McKinsey & Company.

  1. See, for instance, the Information Commissioner’s Office guide on how this applies in the UK. ↩︎