2.3 Sources of data

The range of data sources is immense. Although much of it is proprietary or confidential, a surprising amount of data is openly provided. As a non-exhaustive list, consider the following:

  1. Government, e.g., data.gov.uk
  2. Health, e.g., the UK health data set from Public Health England
  3. Financial and economic data, e.g., DBnomics: the world’s economic database
  4. Marketing and social media, e.g., Microsoft Research Open Data, which has a corpus of 38 million tweets
  5. Journalism and media, e.g., the New York Times searchable corpus; Google Books and Project Gutenberg, both of which are searchable; and Europeana, a digital library with millions of books, paintings, films and museum objects
  6. Education and social policy, e.g., the UK Data Service, which is funded by the ESRC
  7. Business and corporate data, e.g., Quandl, recently rebranded as Nasdaq Data Link
  8. Retail, e.g., multiple open data sets from Data World
  9. Manufacturing, e.g., multiple open data sets from Data World
  10. Location-based services, e.g., data from GPS-enabled smartphones
  11. Science, e.g., all the ’omics disciplines, plus climate and environmental data (Blei and Smyth 2017)
  12. On a slightly more lighthearted note, the MakeoverMonday data sets

In addition, the Journey of Analytics blog maintains a list of 50+ open data sets, including ones from the UK government and the UN. In other words, there is a wealth of freely available data, not to mention the data sets that are behind paywalls, proprietary or even top secret!

References

Blei, David M., and Padhraic Smyth. 2017. “Science and Data Science.” Proceedings of the National Academy of Sciences 114 (33): 8689–92.