11.3 Introduction to SQL
11.3.1 Databases
- Database systems: computerized mechanisms to store and retrieve data.
- Relational databases: data is represented as tables linked based on common keys (to avoid redundancy)
- “In the relational model, records are”linked" using virtual keys not stored in the database but defined as needed between the data contained in the records" (Wikipedia)
Source: (Wikipedia)
11.3.2 SQL
SQL
(pronounced S-Q-L or SEQUEL) is a language designed to query relational databases- Used by most financial and commercial companies
- The result of an
SQL
query is always a table - It’s a nonprocedural language: define inputs and outputs; how the statement is executed is left to the optimizer
- How long SQL queries depends on optimization that is opaque to user (which is great!)
SQL
is a language that works with many commercial products:- Oracle Database, SQL Server (MS), MySQL, PostgreSQL, SQLite (all three open-source), Google BigQuery, Amazon Redshift…
- Performance will vary, but generally faster than standard data frame manipulation in R (and much more scalable)
11.3.3 Components of a SQL query
SELECT
columnsFROM
a table in a databaseWHERE
rows meet a conditionGROUP BY
values of a columnORDER BY
values of a column when displaying resultsLIMIT
to only X number of rows in resulting tableAlways required:
SELECTand
FROM`. Rest are optional.SELECT
can be combined with operators such asSUM
,COUNT
,AVG
…- To merge multiple tables, you can use
JOIN
11.3.4 SQL at scale: Google BigQuery
- One of many commercial SQL databases available (Amazon RedShift, Microsoft Azure, Oracle Live SQL…)
- Used by many financial and commercial companies
- Advantages:
- Integration with other Google data storage solutions (Google Drive, Google Cloud Storage)
- Scalable: same SQL syntax for datasets of any size
- Easy to collaborate and export results
- Affordable pricing and cost control
- API access allows integration with R or python
- Excellent documentation