11.2 How is data stored?

  • On a hard disk… (see here for a little history on computer storage)
  • But while we analyze data with R, it is normally held in the computer’s RAM (working memory)
    • Q: How much RAM does your computer have?
  • Simple ‘big data’ definition: data that is larger than your RAM
  • But with big data we may reach various limits
    • …we can’t load it all into R because of the RAM limit
    • …our PC (personal computer) may not have enough hard disk storage
    • …we may need our PC to run all the time to collect the data (e.g. Twitter) but it heats up our 12 sqm student room…
    • …our PC may be too slow to do tasks on a big data set.
  • R (as software) is designed to analyze data, not to store (big) data → use a database
  • Strategy (see video): Develop the model on a subset of the data, then scale it up to the full data (or not, e.g. keep working with a sample)
  • Categorization of big data problems:
    1. Extract data
    2. Compute on the parts
    3. Compute on the whole
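
The three categories above can be illustrated on a toy data set. This is only a minimal sketch in base R: the data frame `big` and its columns `group` and `value` are hypothetical stand-ins for a data set that would, in practice, be too large for RAM.

```r
# Hypothetical toy data standing in for a much larger data set
big <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  value = 1:12
)

# 1. Extract data: keep only the rows and columns we actually need
extracted <- big[big$group == "a", c("group", "value")]

# 2. Compute on the parts: split the data into pieces and
#    compute the same summary separately for each piece
part_means <- sapply(split(big$value, big$group), mean)

# 3. Compute on the whole: fit one model that uses all rows at once
whole_fit <- lm(value ~ group, data = big)

print(extracted)
print(part_means)
print(coef(whole_fit))
```

Problems of type 1 and 2 can usually be handled piece by piece, so each piece may fit into RAM even when the whole data set does not. Problems of type 3 are the hard case, because the computation needs all the data together.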
  • When we reach these limits, we have to resort to other tools: SQL and Google BigQuery
    • Various tutorials on working with big data/databases, e.g. here.
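
To get a feeling for the RAM limit and the subset strategy mentioned in this section, here is a rough sketch in base R. The table dimensions (100 million rows, 10 columns) and the simulated data frame `full` are assumptions chosen for illustration only, and the size estimate ignores any extra copies R may create while working.

```r
# A numeric (double) value takes 8 bytes in R, so one million
# values need roughly 8 MB of RAM
x <- numeric(1e6)
size_mb <- as.numeric(object.size(x)) / 1e6
print(size_mb)

# Back-of-the-envelope estimate for a hypothetical table with
# 100 million rows and 10 numeric columns
n_rows <- 1e8
n_cols <- 10
estimated_gb <- n_rows * n_cols * 8 / 1e9
print(estimated_gb)

# Strategy: develop the model on a random subset first
set.seed(42)
full <- data.frame(x = rnorm(1e5))       # simulated "full" data
full$y <- 2 * full$x + rnorm(1e5)
subset_rows <- sample(nrow(full), size = 1000)
fit_small <- lm(y ~ x, data = full[subset_rows, ])
print(coef(fit_small))
```

Comparing `estimated_gb` with the RAM of your own computer gives a quick indication of whether a data set counts as ‘big data’ in the simple sense used here. If it does, a model developed on a sample can later be scaled up with a database or a similar tool.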