Chapter 2 Structured Query Language

Structured Query Language (SQL) is a computer language commonly used to access, manipulate and query large relational databases. It often constitutes the first step of an extract, transform and load (ETL) process.

This chapter aims to provide a brief overview on Database Management Systems and using SQL to retrieve and aggregate data. In addition to the material provided in this chapter, numerous online resources provide helpful and interactive material for learning SQL, such as tutorialspoint and w3schools.

The content of this chapter is closely linked to concepts in

SQL for Data Science - Badia (2020)
MySQL Cookbook - DuBois (2014)

If you want to set up a SQL database yourself, have a look at MySQL Workbench. Some resources, such as Databasestar, allow for hands-on practicing of SQL skills using an already available database.

2.1 Relational Databases

A Database Management System refers to the technology for creating, storing and organizing data in a database. The main objective of a Database Management System is to supply a way to store and retrieve database information in a convenient and efficient manner.

For an in-depth review of Database Management, consult the w3schools chapters concerning database management. The contents are outside the scope of this course.

Within the Database Management System different instances comprise single databases. Each database has its own schema. A database schema is a blueprint or plan that outlines the structure and organization of a database. It defines how the data is organized, stored, and accessed within the database. A schema includes tables, columns, relationships, constraints, and other database objects that define the data model.

A database can contain a single data table or a multitude of different data tables. A table consists of at least one row and one column; the single entries per row and column are stored within separate fields of the table. A column can also be referred to as an attribute.

2.1.1 Data Quality

When dealing with large amounts of data, a certain degree of data quality has to be ensured. In order to do so, conventionally constraints are imposed on either the rows, columns and/or the combination of rows and columns of a table in the database. The concept of data quality is rooted within the principles of

Completeness Refers to the degree to which all required data is present. A dataset might be incomplete if values are missing, an entire attribute is missing, or a reference value is missing.
Uniqueness Within the data table a row or an entity should be recorded only once. Duplicates in the data might induce skew and noise to descriptive measures. Uniqueness is the most critical dimension for ensuring no duplication or overlaps. It can be measured either against all records within a single data table or across multiple data tables.
Consistency Consistency relates to the degree to which the data is consistent within a single and amongst a multitude of data tables.
Validity This dimension signifies that the attribute values conform to the specific domain or requirement. It can be checked by imposing constraints on the data accordingly.
Accuracy Data accuracy is the degree to which data represents the real-world scenario and conforms with a verifiable source.
Integrity Data journey and transformation across systems can affect its attribute relationships. Integrity indicates that the attributes are maintained correctly, even as data gets stored and used in diverse systems. Data integrity ensures that all enterprise data can be traced and connected.

A suitable degree of data quality can be ensured by using normalization procedures on the database and its tables accordingly.

Primary Keys
Amongst all attributes that make up the schema of a table, it is common that a subset of the attributes is sufficient to identify single rows and ensure their uniqueness. Such attribute sets are called primary keys. The primary key can either comprise a single column, for example an ID column, or span multiple columns, making the combination of certain attributes unique. Primary keys are a useful tool to tackle the issue of redundancy and facilitate consistency within a table, as they do not allow for multiple copies of the same data. When creating a table, we aim at declaring a set of keys which is as small as possible but as large as necessary to distinguish amongst rows and provide a unique row identifier. Oftentimes, an artificial primary key will be created in order to capture the uniqueness of a row and allow for referencing. As the complexity of the database increases, keys allow for the construction of relationships between a set of different tables.
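A primary key is typically declared when the table is created. The following is a minimal sketch, assuming a hypothetical customers table; table name, column names and data types are purely illustrative.

CREATE TABLE customers (
    customer_id INT NOT NULL,          -- artificial, unique identifier per customer
    first_name  VARCHAR(50),
    last_name   VARCHAR(50),
    PRIMARY KEY (customer_id)          -- declares customer_id as the primary key
);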

Foreign Keys
Foreign keys are used to capture relationships between a set of tables within the database and increase referential integrity. The way such relationships are expressed in a database is by using the primary key of an entity as its surrogate: primary keys are copied to another table to represent the entity and express the relationships.

For example, a customer ID serves as primary key within one table and might indicate a unique customer number, which can be linked via foreign keys to another table containing more detailed information for each customer ID. Whenever we use a primary key of one table in another table, we establish a foreign key on the new table. A foreign key constrains the entries in the child table and only allows for entries which can be looked up in the parent table.
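Continuing the hypothetical customers sketch from above, a child table can reference the parent table's primary key via a foreign key constraint; table and column names are again illustrative.

CREATE TABLE orders (
    order_id    INT NOT NULL,
    customer_id INT NOT NULL,          -- copy of the parent table's primary key
    order_date  DATE,
    PRIMARY KEY (order_id),
    FOREIGN KEY (customer_id) REFERENCES customers (customer_id)
);

Inserting an order with a customer_id that cannot be looked up in customers would now be rejected by the Database Management System.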

Other constraints
The set of other constraints comprises non-null constraints for specific fields in the database or relates to valid datatypes which are allowed for certain attributes. This relates to the concept of data validity.

2.1.2 Data Types

Within the scope of this course, we distinguish between numeric, string and date like data types. The following table provides a brief overview.

Strings Strings represent sequences of characters. They are used to display discrete values of data on a nominal or categorical scale and might also be used for ordinal types.
Numbers Numbers represent numerical values on a discrete or continuous scale. Different types of numerical values allow for increasing levels of precision. Most Database Management Systems support integers for whole numbers and floats or decimals for real numbers.
Temporal Information Temporal information represents time-related attributes of an entry. By convention, dates, times and timestamps are supported. A date is a combination of year, month, and day of the month that denotes a certain calendar day. A time is composed of an hour, minute, and second part as hh:mm:ss[.fraction], with precision up to microseconds (\(10^{-6}\) seconds). A timestamp combines date and time as YYYY-MM-DD hh:mm:ss[.fraction], extending the time with a component for capturing year, month and day.
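The following sketch combines the three families of data types in one hypothetical table definition; names, lengths and precisions are illustrative, and the CHECK constraint assumes a sufficiently recent Database Management System (for example MySQL 8.0.16 or later).

CREATE TABLE trades_example (
    trade_id   INT NOT NULL,                      -- whole numbers
    ticker     VARCHAR(10) NOT NULL,              -- string on a nominal scale
    price      DECIMAL(10, 4) CHECK (price >= 0), -- exact decimal number
    trade_date DATE,                              -- YYYY-MM-DD
    trade_time TIME(6),                           -- hh:mm:ss[.fraction], microseconds
    trade_ts   TIMESTAMP(6),                      -- date and time combined
    PRIMARY KEY (trade_id)
);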

2.2 SQL Syntax

This section elaborates on extracting data from a database using SQL’s SELECT statement.

SQL is case insensitive when it comes to built-in functions; table names, however, need to be specified precisely. By convention, built-in function names are capitalized and line breaks are used deliberately to ensure readability of code, as query statements can be rather extensive.

The basic command to query databases is SELECT, which specifies a (sub)set of columns to be extracted from the database, followed by a FROM statement which specifies the database and data table. Oftentimes the wildcard * is used in order to select all columns of the data table.

The most basic query can be constructed as follows and simply selects all rows and all columns from the data table.

SELECT *
FROM {database}.{datatable}

2.3 Horizontal Filtering

Since most databases are inherently long and extraction of data might therefore be time consuming, we might want to apply filter criteria prior to extracting the data. Naming columns explicitly enables you to select only the subset of columns which are relevant for your analysis.

To indicate which column(s) you want to extract, name them in the output column list separated by commas in an arbitrarily chosen order.

SELECT 
{column1}, 
{column2},
{column3}
FROM {database}.{datatable}

The column output can also be manipulated within the query itself by applying functions to the different columns within the table.

SELECT 
LOG({column1}), 
LAG({column2}, l, 0) OVER (ORDER BY {column1}),
LEFT({column3}, n)
FROM {database}.{datatable}

The function LOG() takes the natural logarithm of a column, LAG() shifts the values of column2 by \(l\) table entries, with missing values defaulting to \(0\). The OVER (ORDER BY ...) clause indicates the ordering of the data when performing the lagging. The function LEFT() outputs the first \(n\) characters from the left of each entry of column3. The input values and columns are adjustable.

When using functions on table columns, subsequently referring to the derived columns by an alias is convenient to keep the column names concise. This is especially relevant when it comes to working with multiple data tables. An alias can be introduced by using AS after the specification of a table column. This extends the query above to

SELECT 
LOG({column1}) AS log_column1, 
LAG({column2}, l, 0) OVER (ORDER BY {column1}) AS lag_column2,
LEFT({column3}, n) AS left_column3
FROM {database}.{datatable}

Notice that newly computed columns can only be accessed by their alias in a new query, which requires the above query to be executed as a subquery.
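A minimal sketch of such a subquery, assuming we want to filter on the derived column log_column1; the filter threshold is illustrative.

SELECT 
log_column1,
lag_column2
FROM
  (SELECT 
  LOG({column1}) AS log_column1, 
  LAG({column2}, l, 0) OVER (ORDER BY {column1}) AS lag_column2
  FROM {database}.{datatable}) AS derived_table
WHERE log_column1 > 0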

2.4 Vertical Filtering

Unless you specify or restrict a query in some way, no vertical filter will be applied to the rows of the table. To restrict the rows returned, provide a WHERE clause that specifies one or multiple conditions that rows must satisfy in order to be outputted.

Conditions can test for equality (column1 = X) or inequality (column1 != X). For some types of data, such as strings, you can use pattern matches, as sketched below. Binary operators help us to identify a suitable relation for our filtering goal.
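A short sketch of a pattern match on a string column; the pattern is illustrative. In SQL patterns, % matches any sequence of characters and _ matches exactly one character.

SELECT 
{column1}
FROM {database}.{datatable}
WHERE {column1} LIKE 'A%'    -- all entries starting with the letter A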

The query can be further refined by adding user-specified conditions which will be consecutively applied as a filter to each row of the table.

SELECT 
{column1}, 
{column2}, 
{column3}
FROM {database}.{datatable}
WHERE {condition1} 
AND {condition2}
OR {condition3}

2.4.1 Operators

In order to apply more complex filters and comparisons on the extracted data, we can make use of arithmetic, comparison or logical operators. The operators can be used to increase complexity of the queries and preprocess the data during process of retrieval by taking the conjunction or disjunction of two conditions, or the negation of a condition.

Arithmetic Operators

Arithmetic operators are used for mathematical operations on numerical data, such as adding or subtracting to aggregate values from different columns.

+ Addition
- Subtraction
* Multiplication
/ Division
% Modulo (remainder of a division)

Comparison Operators

A comparison operator is used to compare two values and test how they relate to each other.

= Equality
>,< greater, smaller
>=,<= greater than or equal, smaller than or equal
<>, != Inequality

Logical Operators

Logical operators return boolean values of TRUE or FALSE and can be used to link a set of conditions.

AND TRUE if both Boolean expressions are TRUE
OR TRUE if either Boolean expression is TRUE
XOR TRUE if exactly one of the two Boolean expressions is TRUE
IN TRUE if the operand is equal to one of the values in a list
LIKE TRUE if the operand matches a pattern
NOT Reverses the value of any other Boolean operator
EXISTS TRUE if a subquery contains any rows

This can translate to

SELECT 
{column1}, 
{column2}, 
{column3}
FROM {database}.{datatable}
WHERE {column1} = x
AND {column2} IN (listofvalues) 
OR {column3} % y = 0

The condition for filtering is evaluated on each row of the data, and rows for which the condition evaluates to FALSE are discarded. Notice that this query will not alter the underlying data within the table itself, as the filter is only applied to the data queried.
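Note that AND binds more tightly than OR, so mixing both without parentheses can yield unexpected results. The sketch below rewrites the query above with explicit parentheses, assuming the intention is that the first condition must always hold; the grouping is illustrative.

SELECT 
{column1}, 
{column2}, 
{column3}
FROM {database}.{datatable}
WHERE {column1} = x
AND ({column2} IN (listofvalues) OR {column3} % y = 0)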

2.4.2 Aggregating Data

We have seen how to calculate values over the whole table or subsets of the whole table. However, it is very common in data analysis to be interested in summarizing statistics (raw counts, means, sums etc.) for different categories within the same dataset. SQL has a set of tools specifically designed for this purpose. The GROUP BY clause initializes a partition of the table into user-specified groups and subsequently aggregates the rows in the query for each of the groups. When grouping data we additionally need to provide information on the mode of aggregation for each group. Each group is composed of all the rows that share the same value for the prespecified attribute, while the number of groups, and the number of rows in each group, depend on the particular dataset. Once this partition is done, the attributes are aggregated per group. The table returned as final result contains one row per group, holding whatever aggregate has been requested in the SELECT statement for each column.

If necessary, the GROUP BY clause follows the WHERE clause.

If we assume column2 and column3 to contain numerical content, while column1 indicates a certain categorical variable we can group our datatable with

SELECT 
{column1},
SUM({column2}) AS sum_col2,
SUM({column3}) AS sum_col3
FROM {database}.{datatable}
WHERE {column1} IN (listofvalues)
GROUP BY {column1}

We can combine GROUP BY with the use of CASE in the definition of the groups, to generalize the data available.

SELECT *,
(CASE 
WHEN {column2} > 0 THEN "buy"
WHEN {column2} < 0 THEN "sell"
ELSE 0 
END) AS group_indicator
FROM {database}.{datatable};

Note that the CASE statement needs to be executed before running the query containing the GROUP BY clause, as the relevant variable has to be generated prior to grouping. Essentially, combining both statements results in the following query, where COUNT(*) counts the number of rows within each group.

SELECT 
group_ind,
COUNT(*) AS n_rows
FROM
  (SELECT 
  *,
  (CASE 
  WHEN column2 > 0 THEN "buy"
  WHEN column2 < 0 THEN "sell"
  ELSE 0 
  END) AS group_ind
  FROM {database}.{datatable}) AS main_table
GROUP BY group_ind

In particular, sometimes we want results that are ordered. For these cases, SQL has an additional clause, ORDER BY, which allows us to impose an order on the result of a query. The ORDER BY, if used, is added after any WHERE or GROUP BY clauses as a final statement.

SELECT 
{column1}, 
SUM({column2}) AS sum_col2,
AVG({column3}) AS avg_col3
FROM {database}.{datatable}
WHERE {column1} IN (listofvalues)
GROUP BY {column1}
ORDER BY sum_col2 {ASC|DESC}

The field on which records are sorted is called the sort key; you can sort records using more than one column when necessary, as sketched below. When sorting records by more than one field, the first sort field is called the major sort key or primary sort key and the second sort field is called the minor or secondary sort key.
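A minimal sketch of sorting by a major and a minor sort key; the chosen columns and sort directions are illustrative.

SELECT 
{column1}, 
{column2}, 
{column3}
FROM {database}.{datatable}
ORDER BY {column1} ASC,      -- major (primary) sort key
{column2} DESC               -- minor (secondary) sort key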

2.4.3 Truncation

When investigating large datasets, we are oftentimes interested in the tail behavior of the data, i.e., observations with a relatively low probability of occurrence. We can truncate the query output to the first \(n\) rows using LIMIT. Analogously, we can obtain the last \(n\) rows by using inverse ordering.

SELECT 
{column1}, 
SUM({column2}) AS sum_col2,
AVG({column3}) AS avg_col3
FROM {database}.{datatable}
WHERE {column1} IN (listofvalues)
GROUP BY {column1}
ORDER BY sum_col2 {ASC|DESC}
LIMIT 100

2.4.4 Joining Multiple Tables

So far, the queries used in the examples have displayed records from a single table. In many cases, you will need to create queries to select data from more than one table. To combine two tables we can use either the UNION or the JOIN clause. When joining a multitude of tables, the FROM clause will reference multiple tables.

If both tables are listed within the FROM clause, the output will be the cartesian product of the mentioned tables.

SELECT *
FROM {database}.{datatable1}, {database}.{datatable2}

Join
The other (far more common) way of combining tables is to JOIN them on a (set of) column(s). The join operation is the core operation of relational algebra because it is the command that allows you to extract data from more than one table. To join two tables, it is necessary to identify at least a single common attribute of both tables. In the most common form of the join (inner join), two tables are combined based on the values in matching columns, creating a new table containing the columns in both tables. Rows in this new table are the concatenation (combination) of a row from the first table and a row from the second table that match on the common column. Since we have dealt with the concept of primary and foreign keys previously, the child and parent table can easily be joined by using their primary key as joining attribute. In order for the comparison to be possible, both attributes need to have the same data type.

SELECT *
FROM {database}.{datatable1} AS table_1
JOIN {database}.{datatable2} AS table_2
ON table_1.primary_key = table_2.foreign_key;
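The query above performs an inner join and only keeps rows with a match in both tables. The exercise below also uses a left join, which keeps all rows of the left (first) table and fills the columns of the right table with NULL where no match exists; a minimal sketch:

SELECT *
FROM {database}.{datatable1} AS table_1
LEFT JOIN {database}.{datatable2} AS table_2
ON table_1.primary_key = table_2.foreign_key;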

Union
There is a restriction on set operations. Two tables are UNION compatible when they have the same number of columns and when their corresponding columns represent the same type of data.

The union of the two tables is a single table containing all rows that are in either table A or table B or in both tables A and B.

SELECT *
FROM {database}.{datatable1}
UNION
SELECT *
FROM {database}.{datatable2}

2.4.5 Subqueries

A subquery is a more advanced SQL coding concept and is essentially a query nested within another query. Subqueries are enclosed in parentheses.

SELECT * 
FROM {database}.{datatable1}
WHERE {column2} IN (
  SELECT {column2}
  FROM {database}.{datatable2}
  WHERE {column1} > 1000)

We have already encountered subqueries in the section on grouping tables. When there is a subquery in the FROM clause of the main query, the system first evaluates the subquery and uses the result of this evaluation as one of the data sources for evaluating the main query. The reason that subqueries are given a name with AS is that this name is then used in the main query to refer to their result. Since the main query acts upon the results of its subquery, using subqueries is a very common and useful tactic for computing complex results by breaking the process down into logical steps: (1) the data is prepared, processed and filtered with a first query and (2) the final result is obtained with the main query.

2.5 Preprocessing

Preprocessing refers to the steps taken to prepare raw data for analysis or modeling. It involves a series of techniques used to transform, clean, and normalize data so that it can be used effectively in machine learning or data analysis applications. Some preprocessing techniques can be

Sampling Identifying and choosing a subset of data to work with in order to reduce computational runtime (see the sketch after this list)
Discretization Reducing memory usage by discretizing continuous values into discrete values
Binarization A variation of discretization, in that it takes a numerical continuous attribute or a categorical one and transforms it into a binary attribute (or several binary attributes)
Normalization Rescaling values to a common range; this also supports the identification and removal of outliers.
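Some of these steps can already be performed during retrieval. The sketch below illustrates sampling and binarization in SQL; ORDER BY RAND() is MySQL-specific, and the sample size as well as the threshold are illustrative assumptions.

-- sampling: draw a random subset of 1000 rows
SELECT *
FROM {database}.{datatable}
ORDER BY RAND()
LIMIT 1000;

-- binarization: map a continuous column onto a binary attribute
SELECT *,
(CASE WHEN {column1} > 0 THEN 1 ELSE 0 END) AS column1_binary
FROM {database}.{datatable};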

2.6 Exercise

We will be using MySQL Workbench to connect to a database and query data. To access the database ensure that you are connected to the LMU network via VPN.

Afterwards, in the application, establish a connection to the database by using the following credentials.

Username default_user
Password q7$?h!4Mx
Host 10.153.30.58
Port 3306

MySQL Workbench establishes a connection with the database server at the given host as the given user with the credentials provided to read data.

We can start by listing all schemas available on the database server. Our data resides within the schema taqmsec, whose data tables we investigate by executing the queries below.

SELECT schema_name 
FROM information_schema.schemata 
ORDER BY schema_name

Within the schema taqmsec, the available tables and their metadata can be extracted.

SELECT table_schema, table_name, table_rows, avg_row_length, create_time, update_time
FROM information_schema.tables 
WHERE table_schema = 'taqmsec'
ORDER BY table_name
##   TABLE_SCHEMA     TABLE_NAME  TABLE_ROWS  AVG_ROW_LENGTH
## 0      taqmsec  aapl_wrds_cqm    22168707             101
## 1      taqmsec  aapl_wrds_ctm     1352538             104
## 2      taqmsec  amzn_wrds_cqm     4870158             110
## 3      taqmsec  amzn_wrds_ctm     1293406              96
## 4      taqmsec  avgo_wrds_cqm     1454922             107
## 5      taqmsec  avgo_wrds_ctm      210803             107
## 6      taqmsec   pep_wrds_cqm     2771288             100
## 7      taqmsec   pep_wrds_ctm      248946              99
## 8      taqmsec  tsla_wrds_cqm     3377296             104
## 9      taqmsec  tsla_wrds_ctm     1438737             103

The tables are named by the convention ticker_wrds_cqm or ticker_wrds_ctm. While _cqm contains the updates of the order book, _ctm contains the execution of trades.

Within the scope of this course, we will look into the Trades and Quotes (TAQ) dataset provided by the New York Stock Exchange. The TAQ database contains all trade and quote (bid and ask) updates for US equities on a nanosecond (\(10^{-9}\) second) basis. The database contains the stocks

AAPL Apple
AMZN Amazon
AVGO Broadcom
PEP Pepsi
TSLA Tesla

All stocks are primarily listed on Nasdaq, which corresponds to the exchange code Q.

Task
Write a structured SQL script by incrementally extending your query from the basic SELECT * to allow for the following filters on the trade table. For each preprocessing step, provide the number of relevant trades (the length of the output table).

Trades

The trades look as follows:

SELECT * 
FROM taqmsec.aapl_wrds_ctm
LIMIT 100
##           date                 time_m time_m_nano ex sym_root sym_suffix  \
## 0   2022-02-01 0 days 04:00:02.950514         474  Q     AAPL       None   
## 1   2022-02-01 0 days 04:00:04.008821          30  Q     AAPL       None   
## 2   2022-02-01 0 days 04:00:04.009932         213  Q     AAPL       None   
## 3   2022-02-01 0 days 04:00:04.015919         664  Q     AAPL       None   
## 4   2022-02-01 0 days 04:00:04.015922         766  Q     AAPL       None   
## ..         ...                    ...         ... ..      ...        ...   
## 995 2022-02-01 0 days 04:57:44.358336         376  Q     AAPL       None   
## 996 2022-02-01 0 days 04:57:44.358347         283  Q     AAPL       None   
## 997 2022-02-01 0 days 04:57:44.358378         725  Q     AAPL       None   
## 998 2022-02-01 0 days 04:57:48.165434         891  Q     AAPL       None   
## 999 2022-02-01 0 days 04:57:56.101551         814  Q     AAPL       None   
## 
##     tr_scond size   price tr_stop_ind tr_corr tr_seqnum tr_id tr_source tr_rf  \
## 0       @ TI   14  174.78        None      00      2114     1         N  None   
## 1       @FTI    1  174.96        None      00      2123     2         N  None   
## 2       @FTI    1  174.96        None      00      2124     3         N  None   
## 3       @FTI    1     175        None      00      2125     4         N  None   
## 4       @FTI    4     175        None      00      2126     5         N  None   
## ..       ...  ...     ...         ...     ...       ...   ...       ...   ...   
## 995     @FTI    2  174.34        None      00     11495   996         N  None   
## 996     @FTI   10  174.38        None      00     11496   997         N  None   
## 997     @FTI   20   174.4        None      00     11497   998         N  None   
## 998     @ TI    2  174.35        None      00     11511   999         N  None   
## 999     @ TI    1  174.35        None      00     11516  1000         N  None   
## 
##                  part_time part_time_nano trf_time trf_time_nano tte_ind  
## 0   0 days 04:00:02.950498            703     None             0       0  
## 1   0 days 04:00:04.008804             25     None             0       1  
## 2   0 days 04:00:04.009917            665     None             0       1  
## 3   0 days 04:00:04.015903            623     None             0       1  
## 4   0 days 04:00:04.015903            623     None             0       1  
## ..                     ...            ...      ...           ...     ...  
## 995 0 days 04:57:44.358315            635     None             0       1  
## 996 0 days 04:57:44.358331             85     None             0       1  
## 997 0 days 04:57:44.358360            115     None             0       1  
## 998 0 days 04:57:48.165418            124     None             0       0  
## 999 0 days 04:57:56.101535            795     None             0       0  
## 
## [1000 rows x 20 columns]

Start with a basic query SELECT * FROM TABLE for the ticker AAPL and successively extend the query to allow for advanced filtering of data.

(1) Filter for exchange equal to Q and date equal to 2022-02-01
(2) The trade condition should be in ('', '@', 'E', '(E?)', 'F', 'FI', '@F', '@FI', 'I', '@I'). Investigate the data closely - do you need to perform additional steps on the column to ensure comparability with the list elements?
(3) Divide the trade size by 100
(4) Concatenate date and part_time and output the values in a new column
(5) Compute the logarithmic price return as \(r_t = \log(P_{t}) - \log(P_{t-1})\) and the daily volatility as \(\sigma_r = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n}(r_i - \bar{r})^2}\)
(6) Add a new column with a unique index (primary key) for each row separately

Additionally, write a query that provides answers to the following questions for stock AAPL

(7) How big was the largest trade in terms of traded quantity?
(8) What is the minimum and maximum date available within the database?
(9) How much volume (sum of traded quantities) has been traded on average on each of the single trading days?
(10) What is the average trading price on each of the single trading days?
(11) Combine the queries (9) and (10) into a single query
(12) How often does each of the valid trade conditions occur?
(13) Can you aggregate the data on a lower frequency time scale?
(14) Aggregate duplicate trades based on their participant timestamp and nanosecond by computations of the average price and cumulated traded quantity. Count the number of all distinct trades within a group

Quotes

Write a structured SQL query by incrementally extending your query to allow for the following filters on the quote table. The raw quote data can be inspected with:

SELECT * 
FROM taqmsec.aapl_wrds_cqm
LIMIT 100

(15) Compute the spread as difference between ask and bid \(spread_t = P_{ask,t} - P_{bid,t}\) with \(P\) indicating the price
(16) Compute the midpoint as arithmetic average of bid and ask \(P_{mid,t} = (P_{ask,t} + P_{bid,t}) / 2\) to compute the quoted spread as \(quoted\_spread_{t} = (P_{ask,t} - P_{bid,t}) / mid_t\)
(17) Filter for exchange equal to Q, opening times between 09:30 and 15:30 plus a positive bid-ask spread
(18) Create a \(t-1\) lagged variable of the bid size, the bid price, the ask size and the ask price
(19) Compute the change in bid size as \(\Delta bid = bid_t - bid_{t-1}\), bid price, ask size and ask price analogously
(20) Concatenate date and part_time column to a single column
(21) Add a new column with an incrementally ascending index (unique identifier) to each row separately
(22) Aggregate duplicate quotes based on their participant timestamp by computing the means

Combining trades and quotes

(23) Left join the quote data onto the trade data based on identical participant timestamps