Chapter 2 Structured Query Language
Structured Query Language (SQL) is a computer language commonly used to access, manipulate and query large relational databases. Querying a database often constitutes the first step of an extract, transform and load (ETL) process.
This chapter aims to provide a brief overview of Database Management Systems and of using SQL to retrieve and aggregate data. In addition to the material provided in this chapter, numerous online resources offer helpful and interactive material for learning SQL, such as tutorialspoint and w3schools.
The content of this chapter is closely linked to concepts in
SQL for Data Science - Badia (2020)
MySQL Cookbook - DuBois (2014)
If you want to set up a SQL database yourself, have a look at MySQL Workbench. Some resources, such as Databasestar, allow for hands-on practicing of SQL skills using an already available database.
2.1 Relational Databases
A Database Management System refers to the technology for creating, storing and organizing data in a database. The main objective of a Database Management System is to supply a way to store and retrieve database information in a convenient and efficient manner.
For an in-depth review of database management, consult the w3schools chapters concerning database management. The contents are outside the scope of this course.
Within the Database Management System different instances comprise single databases. Each database has its own schema. A database schema is a blueprint or plan that outlines the structure and organization of a database. It defines how the data is organized, stored, and accessed within the database. A schema includes tables, columns, relationships, constraints, and other database objects that define the data model.
A database can contain a single data table or a multitude of different data tables. A table consists of at least a single row and column; the individual entries per row and column are stored in separate fields of the table. A column is also referred to as an attribute.
2.1.1 Data Quality
When dealing with large amounts of data, a certain degree of data quality has to be ensured. To do so, constraints are conventionally imposed on the rows, the columns, or combinations of rows and columns of a table in the database. The concept of data quality is rooted in the following principles.
Completeness | Refers to the degree to which the dataset is complete. A dataset might be incomplete if individual values, a whole attribute, or a reference value are missing. |
Uniqueness | Within the data table, a row or an entity should be recorded only once. Duplicates in the data might introduce skew and noise into descriptive measures. Uniqueness is the most critical dimension for ensuring no duplication or overlaps. It can be measured against all records within a single data table or across multiple data tables. |
Consistency | Consistency relates to the degree to which the data is consistent within a single data table and amongst a multitude of data tables. |
Validity | This dimension signifies that the attribute values align with the specific domain or requirement. It can be checked by imposing constraints on the data accordingly. |
Accuracy | Data accuracy is the degree to which data represents the real-world scenario and conforms with a verifiable source. |
Integrity | Data journeys and transformations across systems can affect attribute relationships. Integrity indicates that these relationships are maintained correctly, even as data gets stored and used in diverse systems. Data integrity ensures that all enterprise data can be traced and connected. |
A suitable degree of data quality can be ensured by using normalization procedures on the database and its tables accordingly.
Primary Keys
Amongst all attributes that make up the schema of a table, it is common that a subset of attributes is sufficient to identify single rows and ensure their uniqueness.
Such attribute sets are called primary keys.
The primary key can either comprise a single column, for example an ID column, or span multiple columns, making the combination of certain attributes unique.
Primary keys are a useful tool to tackle the issue of redundancy and facilitate consistency within a table as they do not allow for multiple copies of the same data.
When creating a table, we aim at declaring a set of keys which is as small as possible but as large as necessary to distinguish amongst rows and provide a unique row identifier.
Oftentimes, an artificial primary key will be created in order to capture the uniqueness of a row and allow for referencing.
As the complexity of the database increases, keys will allow for the construction of relationships between a set of different tables.
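As a minimal sketch, a table with an artificial single-column primary key could be declared as follows; the table and column names (customers, customer_id, ...) are placeholders chosen purely for this example.
CREATE TABLE customers (
    customer_id INT NOT NULL,    -- artificial primary key
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    PRIMARY KEY (customer_id)    -- ensures each row is uniquely identifiable
);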
Foreign Keys
Foreign keys are used to capture relationships between tables within the database and to ensure referential integrity.
The way such relationships are expressed in a database is by using the primary key of an entity as its surrogate: primary keys are copied to another table to represent the entity and express the relationships.
For example, a customer ID serves as the primary key within one table, indicating a unique customer number, and can be linked via a foreign key to another table which contains more detailed information for each customer ID. Whenever we use the primary key of one table on another table, we establish a foreign key on the new table. A foreign key constrains the entries in the child table and only allows for entries which can be looked up in the parent table.
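Building on the hypothetical customers table sketched above, a child table of orders could reference the parent table via a foreign key; all names remain placeholders.
CREATE TABLE orders (
    order_id    INT NOT NULL,
    customer_id INT NOT NULL,    -- primary key of the parent table, copied as surrogate
    order_date  DATE,
    PRIMARY KEY (order_id),
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);
Any row inserted into orders must now reference a customer_id that already exists in customers.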
Other constraints
Other constraints comprise non-null constraints for specific fields in the database or restrictions on the datatypes and values which are allowed for certain attributes.
This relates to the concept of data validity.
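A brief sketch of such constraints on the hypothetical orders table from above, assuming a MySQL 8.0 dialect (which enforces CHECK constraints):
ALTER TABLE orders
    MODIFY order_date DATE NOT NULL,                     -- non-null constraint on a specific field
    ADD CONSTRAINT chk_order_id CHECK (order_id > 0);    -- validity constraint on the attribute values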
2.1.2 Data Types
Within the scope of this course, we distinguish between numeric, string and date like data types. The following table provides a brief overview.
Strings | Strings represent sequences of characters. They are used to display discrete values of data on a nominal or categorical scale and might also be used for ordinal types. |
Numbers | Numbers represent numerical values on a discrete or continuous scale. Different types of numerical values allow for increasing levels of precision. Most Database Management Systems support integers for whole numbers and floats or decimals for real numbers. |
Temporal Information | Temporal information represents time-related attributes of an entry. By convention, dates, times and timestamps are supported. A date is a combination of year, month, and day of the month that denotes a certain calendar day. A time is composed of an hour, minute, and second part as hh:mm:ss[.fraction], with precision up to microseconds (\(10^{-6}\) seconds). A timestamp combines date and time as YYYY-MM-DD hh:mm:ss[.fraction]. |
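To make the three type families concrete, the following sketch declares one column of each kind; the table and column names are again chosen purely for illustration.
CREATE TABLE example_types (
    label      VARCHAR(20),      -- string on a nominal or categorical scale
    quantity   INT,              -- whole number
    price      DECIMAL(10, 2),   -- real number with fixed precision
    trade_date DATE,             -- calendar day (YYYY-MM-DD)
    trade_time TIME(6),          -- hh:mm:ss with microsecond fraction
    trade_ts   TIMESTAMP(6)      -- combination of date and time
);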
2.2 SQL Syntax
This section elaborates on extracting data from a database using SQL’s SELECT statement.
SQL is case insensitive when it comes to built-in functions; table names, however, need to be specified precisely. By convention, built-in function names are capitalized, and line breaks are used deliberately to ensure the readability of code, as query statements can become rather extensive.
The basic command to query databases is SELECT, which specifies a (sub)set of columns to be extracted, followed by a FROM clause which specifies the database and data table. Oftentimes the wildcard * is used to select all columns of the data table.
The most basic query can be constructed as follows and simply selects all rows and all columns from the data table.
SELECT *
FROM {database}.{datatable}
2.3 Horizontal Filtering
Since most databases are inherently long and the extraction of data might therefore be time consuming, we might want to apply filter criteria prior to extracting the data. Naming columns explicitly enables you to select only the subset of columns which are relevant for your analysis.
To indicate which column(s) you want to extract, name them in the output column list separated by commas in an arbitrary chosen order.
SELECT
{column1},
{column2},
{column3}
FROM {database}.{datatable}
The column output can also be manipulated within the query itself by applying functions to the different columns within the table.
SELECT
LOG({column1}),
LAG({column2}, l, 0) OVER (ORDER BY {column1}),
LEFT({column3}, n)
FROM {database}.{datatable}
The function LOG() takes the natural logarithm of a column, while LAG() shifts the values of column2 by \(l\) table entries, with a default value of \(0\) for missing entries. OVER (ORDER BY ...) indicates the ordering of the data when performing the lagging. The function LEFT() outputs the first \(n\) characters from the left of each entry of column3. The input values and columns are adjustable.
When using functions on table columns, subsequently referencing the columns with an alias is convenient to keep the column names concise. This is especially relevant when it comes to working with multiple data tables. An alias can be introduced by using AS after the specification of a table column. This extends the query above to
SELECT
LOG({column1}) AS log_column1,
LAG({column2}, l, 0) OVER (ORDER BY {column1}) AS lag_column2,
LEFT({column3}, n) AS left_column3
FROM {database}.{datatable}
Notice that newly computed columns can only be accessed in a new query, which requires the above query to be executed as a subquery.
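A minimal sketch of this pattern, reusing the placeholder names from above: the alias is computed in an inner query and can then be referenced in the outer query (the filter value 0 and the alias derived are arbitrary choices for illustration).
SELECT
log_column1
FROM (
SELECT
LOG({column1}) AS log_column1
FROM {database}.{datatable}
) AS derived
WHERE log_column1 > 0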
2.4 Vertical Filtering
Unless you restrict a query in some way, no vertical filter will be applied to the rows of the table. To restrict the output, provide a WHERE clause that specifies one or multiple conditions that rows must satisfy in order to be returned.
Conditions can test for equality (column1 = X) or inequality (column1 != X). For some types of data, such as strings, you can use pattern matching. Comparison and logical operators help us to express a suitable relation for our filtering goal.
The query can be further refined by adding user-specified conditions which will be consecutively applied as a filter to each row of the table.
SELECT
{column1},
{column2},
{column3}
FROM {database}.{datatable}
WHERE {condition1}
AND {condition2}
OR {condition3}
2.4.1 Operators
In order to apply more complex filters and comparisons to the extracted data, we can make use of arithmetic, comparison or logical operators. These operators can be used to increase the complexity of queries and to preprocess the data during retrieval, for example by taking the conjunction or disjunction of two conditions, or the negation of a condition.
Arithmetic Operators
Arithmetic operators are used for mathematical operations on numerical data, such as adding or subtracting values from different columns; a short sketch follows the table below.
+ | Addition |
- | Subtraction |
* | Multiplication |
/ | Division |
% | Modulo (remainder of a division) |
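For illustration, arithmetic operators can be applied directly in the output column list; the column names below are placeholders.
SELECT
{column1} + {column2} AS total,
{column1} - {column2} AS difference,
{column1} * 100 AS scaled,
{column2} % 2 AS remainder
FROM {database}.{datatable}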
Comparison Operators
A comparison operator is used to compare two values and test how they relate to each other.
= | Equality |
>,< | greater, smaller |
>=,<= | greater than or equal, smaller than or equal |
<>, != | Inequality |
Logical Operators
Logical operators return boolean values of TRUE or FALSE and can be used to link a set of conditions.
AND | TRUE if both Boolean expressions are TRUE |
OR | TRUE if either Boolean expression is TRUE |
XOR | TRUE if exactly one of the two Boolean expressions is TRUE |
IN | TRUE if the operand is equal to one of the values in a list |
LIKE | TRUE if the operand matches a pattern |
NOT | Reverses the value of any other Boolean operator |
EXISTS | TRUE if a subquery contains any rows |
This can translate to
SELECT
{column1},
{column2},
{column3}
FROM {database}.{datatable}
WHERE {column1} = x
AND {column2} IN (listofvalues)
OR {column3} % y = 0
The filter conditions are evaluated on each row of the table, and rows for which the combined condition evaluates to FALSE are discarded. Notice that this query will not alter the underlying data within the table itself, as the filter is only applied to the data queried.
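The LIKE and NOT operators from the table above can be sketched analogously; % acts as a wildcard for any sequence of characters, and the names are again placeholders.
SELECT *
FROM {database}.{datatable}
WHERE {column1} LIKE 'A%'              -- entries of column1 starting with the letter A
AND {column2} NOT IN (listofvalues)    -- entries of column2 outside the list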
2.4.2 Aggregating Data
So far, we have retrieved and filtered rows from a single table. However, it is very common in data analysis to be interested in summarizing statistics (raw counts, means, sums etc.) for different categories within the same dataset. SQL has a set of tools specifically designed for this purpose. The GROUP BY clause partitions the table into user-specified groups and subsequently aggregates the rows in the query for each of the groups. When grouping data, we additionally need to provide the mode of aggregation for each group. Each group is composed of all rows that share the same value for the prespecified attribute, while the number of groups, and the number of rows in each group, depend on the particular dataset. Once this partition is done, the attributes are aggregated per group. The table returned as the final result contains one row per group, holding whatever aggregates have been requested in the SELECT statement.
If present, the GROUP BY clause follows the WHERE clause.
If we assume column2 and column3 to contain numerical values, while column1 indicates a categorical variable, we can group our data table with
SELECT
{column1},
SUM({column2} + {column3}) AS sum_col2_col3
FROM {database}.{datatable}
WHERE {column1} IN (listofvalues)
GROUP BY {column1}
We can combine GROUP BY with a CASE expression in the definition of the groups, for example to map the data into broader categories.
SELECT *,
CASE
WHEN {column1} > 0 THEN 'buy'
WHEN {column1} < 0 THEN 'sell'
ELSE 'none'
END AS group_indicator
FROM {database}.{datatable};
Note that we need to execute the CASE statement before running the query containing the GROUP BY clause, as the relevant variable has to be generated prior to grouping. Essentially, combining both statements results in the following query, where COUNT(*) counts the number of rows within each group.
SELECT
group_ind,
COUNT(*) AS n_rows
FROM (
SELECT *,
CASE
WHEN {column2} > 0 THEN 'buy'
WHEN {column2} < 0 THEN 'sell'
ELSE 'none'
END AS group_ind
FROM {database}.{datatable}
) AS main_table
GROUP BY group_ind
Sometimes we additionally want the results to be ordered. For these cases, SQL has another clause, ORDER BY, which allows us to impose an order on the result of a query. The ORDER BY clause, if used, is added after any WHERE or GROUP BY clauses.
SELECT
{column1},
AVG({column2}) AS avg_column2,
AVG({column3}) AS avg_column3
FROM {database}.{datatable}
WHERE {column1} IN (listofvalues)
GROUP BY {column1}
ORDER BY avg_column2 {ASC|DESC}
The field on which records are sorted is called the sort key, and you can sort records by more than one column when necessary. When sorting by more than one field, the first sort field is called the major or primary sort key and the second sort field is called the minor or secondary sort key.
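A short sketch of sorting by a major and a minor sort key, using the placeholder names from the previous queries:
SELECT
{column1},
{column2}
FROM {database}.{datatable}
ORDER BY {column1} ASC,    -- major (primary) sort key
{column2} DESC             -- minor (secondary) sort key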
2.4.3 Truncation
When investigating large datasets, we are oftentimes interested only in a portion of the result, for example the tail of a distribution which is observed with relatively low probability. We can truncate the query output to the first \(n\) rows using LIMIT. Analogously, we can obtain the last \(n\) rows by inverting the ordering.
SELECT
{column1},
AVG({column2}) AS avg_column2,
AVG({column3}) AS avg_column3
FROM {database}.{datatable}
WHERE {column1} IN (listofvalues)
GROUP BY {column1}
ORDER BY avg_column2 {ASC|DESC}
LIMIT 100
2.4.4 Joining Multiple Tables
So far, the queries used in the examples have displayed records from a single table. In many cases, you will need to create queries to select data from more than one table. To combine two tables, we can use either the UNION or the JOIN clause. When combining several tables, the FROM clause of the query references multiple tables.
If both tables are simply listed within the FROM clause, the output will be the Cartesian product of the listed tables.
SELECT *
FROM {database}.{datatable1}, {database}.{datatable2}
Join
The other (far more common) way of combining tables is to JOIN them on a (set of) column(s).
The join operation is the core operation of relational algebra because it is the command that allows you to extract data from more than one table.
To join two tables, it is necessary to identify at least a single common attribute of both tables.
In the most common form of the join (inner join), two tables are combined based on the values in matching columns, creating a new table containing the columns in both tables.
Rows in this new table are the concatenation (combination) of a row from the first table and a row from the second table that match on the common column.
Since we have dealt with the concept of primary and foreign keys previously, the child and parent table can easily be joined by using their primary key as joining attribute.
In order for the comparison to be possible, both attributes need to have the same data type.
SELECT *
FROM {database}.{datatable1} AS table_1
JOIN {database}.{datatable2} AS table_2
ON table_1.primary_key = table_2.foreign_key;
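Besides the inner join shown above, outer joins keep rows without a match. As a sketch, a LEFT JOIN retains every row of the first (left) table and fills the columns of the second table with NULL where no match exists; this variant is needed, for instance, in exercise (23) below.
SELECT *
FROM {database}.{datatable1} AS table_1
LEFT JOIN {database}.{datatable2} AS table_2
ON table_1.primary_key = table_2.foreign_key;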
Union
There is a restriction on set operations: two tables are UNION compatible only when they have the same number of columns and when their corresponding columns represent the same type of data.
The union of the two tables is a single table containing all rows that are in either table A or table B or in both tables A and B.
SELECT *
FROM {database}.{datatable1}
UNION
SELECT *
FROM {database}.{datatable2}
2.4.5 Subqueries
A subquery is a more advanced SQL concept and is essentially a query nested within another query. Subqueries are enclosed in parentheses.
SELECT *
FROM {database}.{datatable1}
WHERE {column2} IN (
SELECT {column2}
FROM {database}.{datatable2}
WHERE {column1} > 1000)
We have already encountered subqueries in the section on grouping tables. When there is a subquery in the FROM clause of the main query, the system first evaluates the subquery and uses its result as one of the data sources for evaluating the main query. Subqueries in the FROM clause are given a name with AS so that the main query can refer to their result by this name. Since the main query acts upon the results of its subquery, subqueries are a very common and useful tactic for computing complex results by breaking the process down into logical steps: (1) the data is prepared, processed and filtered with a first query, and (2) the final result is obtained with the main query.
2.5 Preprocessing
Preprocessing refers to the steps taken to prepare raw data for analysis or modeling. It involves a series of techniques used to transform, clean, and normalize data so that it can be used effectively in machine learning or data analysis applications. Some preprocessing techniques are listed below; a short SQL sketch follows the list.
Sampling | Identifying and choosing a subset of the data to work with in order to reduce computational runtime |
Discretization | Reducing memory usage by discretizing continuous values into discrete values |
Binarization | A variation of discretization that takes a numerical continuous attribute or a categorical one and transforms it into a binary attribute (or several binary attributes) |
Normalization | Bringing values onto a comparable scale; this also involves the identification, analysis and potential removal of outliers |
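Several of these steps can already be carried out during retrieval. The following sketch samples rows at random and binarizes a hypothetical numerical column {price} with CASE; the threshold of 100 and the sample size are arbitrary choices for illustration.
SELECT
{price},
CASE WHEN {price} > 100 THEN 1 ELSE 0 END AS price_above_100    -- binarization
FROM {database}.{datatable}
ORDER BY RAND()                                                 -- random sampling of rows (MySQL)
LIMIT 1000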
2.6 Exercise
We will be using MySQL Workbench to connect to a database and query data. To access the database ensure that you are connected to the LMU network via VPN.
Afterwards, in the application, establish a connection to the database by using the following credentials.
Username | default_user |
Password | q7$?h!4Mx |
Host | 10.153.30.58 |
Port | 3306 |
MySQL Workbench then establishes a connection with the database server at the given host as the specified user with the credentials provided to read data.
We can start by listing all schemas available on the database server; the data for this course is stored within the schema taqmsec. The available schemas can be investigated by executing
SELECT schema_name
FROM information_schema.schemata
ORDER BY schema_name
Within this schema, the available tables can then be listed.
SELECT table_schema, table_name, table_rows, avg_row_length, create_time, update_time
FROM information_schema.tables
WHERE table_schema = 'taqmsec'
ORDER BY table_name
## TABLE_SCHEMA TABLE_NAME TABLE_ROWS AVG_ROW_LENGTH
## 0 taqmsec aapl_wrds_cqm 22168707 101
## 1 taqmsec aapl_wrds_ctm 1352538 104
## 2 taqmsec amzn_wrds_cqm 4870158 110
## 3 taqmsec amzn_wrds_ctm 1293406 96
## 4 taqmsec avgo_wrds_cqm 1454922 107
## 5 taqmsec avgo_wrds_ctm 210803 107
## 6 taqmsec pep_wrds_cqm 2771288 100
## 7 taqmsec pep_wrds_ctm 248946 99
## 8 taqmsec tsla_wrds_cqm 3377296 104
## 9 taqmsec tsla_wrds_ctm 1438737 103
The tables are named by the convention ticker_wrds_cqm or ticker_wrds_ctm. While the _cqm tables contain updates of the order book (quotes), the _ctm tables contain executed trades.
Within the scope of this course, we will look into the Trades and Quotes (TAQ) dataset provided by the New York Stock Exchange. The TAQ database contains all trades and all quote (bid and ask) updates for US equities on a nanosecond (\(10^{-9}\) second) basis. The database contains the stocks
AAPL | Apple |
AMZN | Amazon |
AVGO | Broadcom |
PEP | Pepsi |
TSLA | Tesla |
All stocks are primarily listed on Nasdaq, which corresponds to the exchange code Q.
Task
Write a structured SQL script by incrementally extending your query from the basic SELECT * to allow for the following filters on the trade table.
For each preprocessing step, provide the number of relevant trades (the length of the output table).
Trades
The trade data looks as follows:
SELECT *
FROM taqmsec.aapl_wrds_ctm
LIMIT 1000
## date time_m time_m_nano ex sym_root sym_suffix \
## 0 2022-02-01 0 days 04:00:02.950514 474 Q AAPL None
## 1 2022-02-01 0 days 04:00:04.008821 30 Q AAPL None
## 2 2022-02-01 0 days 04:00:04.009932 213 Q AAPL None
## 3 2022-02-01 0 days 04:00:04.015919 664 Q AAPL None
## 4 2022-02-01 0 days 04:00:04.015922 766 Q AAPL None
## .. ... ... ... .. ... ...
## 995 2022-02-01 0 days 04:57:44.358336 376 Q AAPL None
## 996 2022-02-01 0 days 04:57:44.358347 283 Q AAPL None
## 997 2022-02-01 0 days 04:57:44.358378 725 Q AAPL None
## 998 2022-02-01 0 days 04:57:48.165434 891 Q AAPL None
## 999 2022-02-01 0 days 04:57:56.101551 814 Q AAPL None
##
## tr_scond size price tr_stop_ind tr_corr tr_seqnum tr_id tr_source tr_rf \
## 0 @ TI 14 174.78 None 00 2114 1 N None
## 1 @FTI 1 174.96 None 00 2123 2 N None
## 2 @FTI 1 174.96 None 00 2124 3 N None
## 3 @FTI 1 175 None 00 2125 4 N None
## 4 @FTI 4 175 None 00 2126 5 N None
## .. ... ... ... ... ... ... ... ... ...
## 995 @FTI 2 174.34 None 00 11495 996 N None
## 996 @FTI 10 174.38 None 00 11496 997 N None
## 997 @FTI 20 174.4 None 00 11497 998 N None
## 998 @ TI 2 174.35 None 00 11511 999 N None
## 999 @ TI 1 174.35 None 00 11516 1000 N None
##
## part_time part_time_nano trf_time trf_time_nano tte_ind
## 0 0 days 04:00:02.950498 703 None 0 0
## 1 0 days 04:00:04.008804 25 None 0 1
## 2 0 days 04:00:04.009917 665 None 0 1
## 3 0 days 04:00:04.015903 623 None 0 1
## 4 0 days 04:00:04.015903 623 None 0 1
## .. ... ... ... ... ...
## 995 0 days 04:57:44.358315 635 None 0 1
## 996 0 days 04:57:44.358331 85 None 0 1
## 997 0 days 04:57:44.358360 115 None 0 1
## 998 0 days 04:57:48.165418 124 None 0 0
## 999 0 days 04:57:56.101535 795 None 0 0
##
## [1000 rows x 20 columns]
Start with a basic query SELECT * FROM TABLE for the ticker AAPL and successively extend the query to allow for advanced filtering of data.
(1) | Filter for exchange equal to Q and date equal to 2022-02-01 |
(2) | The trade condition should be in ('', '@', 'E', '(E?)', 'F', 'FI', '@F', '@FI', 'I', '@I'). Investigate the data closely - do you need to perform additional steps on the column to ensure comparability with the list elements? |
(3) | Divide the trade size by 100 |
(4) | Concatenate date and part_time and output the values in a new column |
(5) | Compute the logarithmic price return as \(r_t = \log(P_{t}) - \log(P_{t-1})\) and the daily volatility as \(\sigma_r = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n}(r_i - \bar{r})^2}\) |
(6) | Add a new column with a unique index (primary key) for each row separately |
Additionally, write a query that provides answers to the following questions for stock AAPL
(7) | How big was the largest trade in terms of traded quantity? |
(8) | What is the minimum and maximum date available within the database? |
(9) | How much volume (sum of traded quantities) has been traded on average on each of the single trading days? |
(10) | What is the average trading price on each of the single trading days? |
(11) | Combine the queries (9) and (10) into a single query |
(12) | How often does each of the valid trade conditions occur? |
(13) | Can you aggregate the data on a lower frequency time scale? |
(14) | Aggregate duplicate trades based on their participant timestamp and nanosecond by computations of the average price and cumulated traded quantity. Count the number of all distinct trades within a group |
Quotes
Write a structured SQL query by incrementally extending your query to allow for the following filters on the quote table.
SELECT *
FROM taqmsec.aapl_wrds_cqm
LIMIT 100
The query above gives a first look at the raw quote data.
(15) | Compute the spread as difference between ask and bid \(spread_t = P_{ask,t} - P_{bid,t}\) with \(P\) indicating the price |
(16) | Compute the midpoint as the arithmetic average of bid and ask, \(P_{mid,t} = (P_{ask,t} + P_{bid,t}) / 2\), and use it to compute the quoted spread as \(quoted\_spread_{t} = (P_{ask,t} - P_{bid,t}) / P_{mid,t}\) |
(17) | Filter for exchange equal to Q, opening times between 09:30 and 15:30 plus a positive bid-ask spread |
(18) | Create a \(t-1\) lagged variable of the bid size, the bid price, the ask size and the ask price |
(19) | Compute the change in bid size as \(\Delta bid = bid_t - bid_{t-1}\), bid price, ask size and ask price analogously |
(20) | Concatenate date and part_time column to a single column |
(21) | Add a new column with an incrementally ascending index (unique identifier) to each row separately |
(22) | Aggregate duplicate quotes based on their participant timestamp by computations of the means |
(23) | Left join the quote data onto the trade data based on identical participant timestamps |