11.3 Introduction to SQL

11.3.1 Databases

  • Database systems: computerized mechanisms to store and retrieve data.
  • Relational databases: data is represented as tables linked based on common keys (to avoid redundancy)
    • “In the relational model, records are”linked" using virtual keys not stored in the database but defined as needed between the data contained in the records" (Wikipedia)
optional caption text

optional caption text

Source: (Wikipedia)

11.3.2 SQL

  • SQL (pronounced S-Q-L or SEQUEL) is a language designed to query relational databases
  • Used by most financial and commercial companies
  • The result of an SQL query is always a table
  • It’s a nonprocedural language: define inputs and outputs; how the statement is executed is left to the optimizer
  • How long SQL queries depends on optimization that is opaque to user (which is great!)
  • SQL is a language that works with many commercial products:
    • Oracle Database, SQL Server (MS), MySQL, PostgreSQL, SQLite (all three open-source), Google BigQuery, Amazon Redshift…
    • Performance will vary, but generally faster than standard data frame manipulation in R (and much more scalable)

11.3.3 Components of a SQL query

  • SELECT columns
  • FROM a table in a database
  • WHERE rows meet a condition
  • GROUP BY values of a column
  • ORDER BY values of a column when displaying results
  • LIMIT to only X number of rows in resulting table
  • Always required:SELECTandFROM`. Rest are optional.
  • SELECT can be combined with operators such as SUM, COUNT, AVG
  • To merge multiple tables, you can use JOIN

11.3.4 SQL at scale: Google BigQuery

  • One of many commercial SQL databases available (Amazon RedShift, Microsoft Azure, Oracle Live SQL…)
  • Used by many financial and commercial companies
  • Advantages:
    • Integration with other Google data storage solutions (Google Drive, Google Cloud Storage)
    • Scalable: same SQL syntax for datasets of any size
    • Easy to collaborate and export results
    • Affordable pricing and cost control
    • API access allows integration with R or python
    • Excellent documentation