Chapter 3 Python

Python is a general-purpose programming language and finds application in a broad range of domains, including web development, artificial intelligence and data science. Since it is characterized as a high-level programming language, it is considered relatively easy to learn.

The contents within this section can be found within

Python Cookbook - Beazley & Jones (2013)

There is a wide range of online resources available for self-studying purposes, such as W3 Schools.

You can download Python from here Python. To allow easier editing of code, chose an editor of your choice, such as Visual Studio Code with a Tutorial on how to use Python with Visual Studio Code. Please make sure you have a running Python instance as we cannot offer full and individual support with setting up the work environment.

In order to write structured and readable Python code a set of coding standards, PEP 8 have been defined. Throughout the course we require you to ensure compliance with the coding standards. If you are using Visual Studio Code, you can use the extension Flake8 for ensuring compatibility with the coding standards. Compliance with those standards will be part of how the grade is determined.

3.1 Data Structures

When using Python we differentiate between the following data types

Strings str Contains a set of literals surrounded by either quotation marks. Strings are arrays of bytes representing unicode characters
Numeric Types int, float An Integer is a whole number without decimals of unlimited length. A Float is a number containing one or more decimals
Sequence list, tuple, range, set Sequence types are used to store multiple values in a single variable. They might be ordered or unordered, items might be changeable or unchangeable
Mapping dictionary A mapping maps a set of keys to their corresponding values, duplicate values in underlying keys are not allowed
Boolean TRUE, FALSE Can be used to identify logical statements

To identify the data type of a variable use type().

# This is a comment
ex_int = 5
ex_float = 2/3
ex_string = "Hello World!"
ex_dictionary = {"key1": "value1", "key2":"value2"}
ex_boolean = True
ex_list = [1,2,3,4]
ex_tuple = (1,2)
ex_range = range(1,10)
ex_set = set(ex_list)

We can infer about the type of a variable using

type(ex_list)
## <class 'list'>
type(ex_int)
## <class 'int'>

3.2 Python Syntax

You can put a comment into your code using a prefixed # in front of your comment. Commenting your code is especially useful if you want to use it at another point of time and make it understandable for other programmers.

Comments over multiple lines are used when describing functions and called Docstrings. A brief documentation about the usage of docstrings can be found here.

Different from SQL, Python is case sensitive. Where in other programming languages the indentation in code is for readability only, the indentation in Python is important. Python uses indentation to indicate blocks of code, therefore not considering indentation rules will cause the occurrence of errors.

a_list = [1,2,3,4,5] 

for element in a_list:
    print(element)  
## 1
## 2
## 3
## 4
## 5

Using the logical operators similiar to those used within SQL scripts, we can define conditional statements.

a = 5
b = 3

if a > b:
  print("a is greater than b")
elif a < b:
  print("a is smaller than b")
else:
  print("a and b are equal")
## a is greater than b

We can also loop through a list of elements.

a_list = [5, 8, 8, 9, 10]

for element in a_list:
  print(element)
## 5
## 8
## 8
## 9
## 10

We can also loop through a list until a condition is (not) fullfilled (anymore).

iterator = 1

while iterator <= 5:
  print(iterator)
  iterator += 1
## 1
## 2
## 3
## 4
## 5

3.3 Operators

Logical Operators

Logical operators return boolean values of true or false and can be used to link a set of conditions.

&, and TRUE if both Boolean expressions are TRUE
|, or TRUE if either Boolean expression is TRUE
^, xor TRUE if either Boolean expression is TRUE
in TRUE if the operand is equal to one of a list of expressions
~, not Reverses the value of any other Boolean operator

When combining multiple operators, we need to use parentheses to facilitate their correct evaluation. Parentheses have the highest precedence and cause the expressions inside parentheses to be evaluated first. If two operators have the same precedence, the expression is evaluated from left to right.

3.4 Functions

Since we follow the principle of avoiding redundant code, we want to write functions whenever possible. As a rough rule, a function is helpful once we are copy-pasting code 3 times or more.

This is a very basic function, that prints a string.

# A basic function
def hello():
    print(f"Hello World!")

hello()
## Hello World!

Information can be passed into functions as arguments. Arguments are specified after the function name, inside the parentheses.

# A function with a single positional argument
def my_first_function(name):
    print(f"Welcome {name}!")

my_first_function("Lisa")
## Welcome Lisa!

You can add an arbitrary amount of arguments, separated by commas. A function can also take a default argument, which will be processed throughout the function if no argument is provided within the function call.

We differentiate between positional arguments and keyword arguments. A positional argument is passed to the function and evaluated based on its position in the function’s call my_first_function(“Lisa”) while a keyword argument refers to a specific keyword within the function call my_first_function(name=“Lisa”) If no argument is passed to the function during its call, default arguments can be defined within the function itself which will be processed in such a case.

# A function with a single optional argument
def my_first_function(name="Ryan"):
    print(f"Welcome {name}!")

my_first_function()
## Welcome Ryan!
# A function with a positional and optional keyword argument
def my_first_function(name_1, name_2="Ryan"):
    print(f"Welcome {name_1} and {name_2}!")

my_first_function("Lisa")
## Welcome Lisa and Ryan!
my_first_function("Lisa", "Florian")
## Welcome Lisa and Florian!

The number of information that is passed into a function can also be handled in a flexible way. This implies the function takes as many arguments as the user specified and processes them accordingly.

# A function with a variable number of input names
def my_first_function(*names):
    print(f"Welcome {names}!")

my_first_function("Lisa", "Ryan", "Florian")
## Welcome ('Lisa', 'Ryan', 'Florian')!
# A function with a variable number of input names as list
def my_first_function(**names):
    print(f"""
    Welcome {names}!
    The first name is {names["name_1"]}
    The second name is {names["name_2"]}
    """)

my_first_function(name_1="Lisa", name_2="Ryan", name_3="Florian")
## 
##     Welcome {'name_1': 'Lisa', 'name_2': 'Ryan', 'name_3': 'Florian'}!
##     The first name is Lisa
##     The second name is Ryan
## 

You can play around with writing functions in order to understand how they work, what is possible and what is not.

3.5 Dataframes

When dealing with structured data, the data is usually stored in a dataframe. In Python the library pandas is most commonly used to investigate spreadsheet-style data.

# Import modules
import pandas as pd

A Dataframe is an at leas two dimensional table of potentially heterogenous data, containing labelled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

data = {'name': ["Lisa", "Florian", "Moritz"], 'grade': [2.3, 1, 1.7]}
dataframe = pd.DataFrame(data)
##       name  grade
## 0     Lisa    2.3
## 1  Florian    1.0
## 2   Moritz    1.7

We can extract the data types of the frame

dataframe.dtypes
## name      object
## grade    float64
## dtype: object

3.6 Horizontal Filtering

We can apply a filter to the columns of the dataframe.

# Extract all available columns
dataframe.columns
## Index(['name', 'grade'], dtype='object')

Only show the column name

# Extract all available columns
dataframe["name"]
## 0       Lisa
## 1    Florian
## 2     Moritz
## Name: name, dtype: object
dataframe.iloc[:,0]
## 0       Lisa
## 1    Florian
## 2     Moritz
## Name: name, dtype: object

3.7 Vertical Filtering

And we can apply a set of different filtering methods onto the dataframe

dataframe.query("name == 'Lisa'")
##    name  grade
## 0  Lisa    2.3
dataframe[dataframe["name"] == "Lisa"]
##    name  grade
## 0  Lisa    2.3

We can make the filtering procedure more dynamic by employing a variable instead of a static name.

filter_name = "Lisa"

dataframe.query(f"name == '{filter_name}'")
##    name  grade
## 0  Lisa    2.3
dataframe[dataframe["name"] == filter_name]
##    name  grade
## 0  Lisa    2.3

We can extend our filter to now contain a multitude of conditions. Instead of filtering for a single scalar value, we filter for values within an underlying list of values.

filter_name = ["Lisa", "Florian"]

dataframe.query(f"name in {filter_name}")
##       name  grade
## 0     Lisa    2.3
## 1  Florian    1.0
dataframe[dataframe["name"].isin(filter_name)]
##       name  grade
## 0     Lisa    2.3
## 1  Florian    1.0

Using boolean operators, we can also make use of multiple conditions. When using multiple conditions for filtering, it is necessary to cluster the single conditions with parentheses based on their logical structure.

filter_name = ["Lisa", "Florian"]

dataframe.query(f"(name in {filter_name} | name == 'Moritz') & grade < 2")
##       name  grade
## 1  Florian    1.0
## 2   Moritz    1.7
dataframe[((dataframe["name"].isin(filter_name)) | (dataframe["name"] == "Moritz")) & (dataframe["grade"] < 2)]
##       name  grade
## 1  Florian    1.0
## 2   Moritz    1.7

The dataframe also allows for arithmetic operations to be performed on each column of the dataframe.

dataframe.sum()
## name     LisaFlorianMoritz
## grade                  5.0
## dtype: object
dataframe.mean()
## grade    1.666667
## dtype: float64
## 
## <string>:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
dataframe.min()
## name     Florian
## grade        1.0
## dtype: object

3.8 Exercise

Throughout this course we will be using Python to access a SQL database. Using the python package sqlalchemy we are able to establish a first connection to the database. In addition we need to install the package MySQLClient.

We set up the connection to the database with using sqlalchemy connecting to the database using the same credentials as previously mysql://default_user::3306.

# Import modules
import sqlalchemy as sa
import pandas as pd

# Create Connection to Database
engine = sa.create_engine('mysql://default_user:q7$?h!4Mx@10.153.30.58:3306')
connection = engine.connect()

We can create our queries as input strings to be executed at the database.

# Write initial query for trades
trade_query = f"""SELECT *
                  FROM taqmsec.aapl_wrds_ctm
                  """
# Extract trade data
trades = pd.read_sql(sa.text(trade_query), connection)

# Write initial query for quotes
quote_query = f"""SELECT *
                  FROM taqmsec.aapl_wrds_cqm
                  """
# Extract quote data
quotes = pd.read_sql(sa.text(quote_query), connection)

The extracted tables are now stored in pandas dataframes with the names trades and quotes. Can use provide answers to the following questions using Python (identical questions in.

Trades

Start with a basic query SELECT * FROM TABLE and successively extend the query to allow for advanced filtering of data.

(1) Filter for exchange equal to Q and date equal to 2022-02-01
(2) The trade condition should be in (’‘,’@‘, ’E’, ‘(E?)’, ‘F’, ‘FI’, ‘(F?)’, ‘@FI’, ‘I’, ‘@I’). Investigate the data closely - do you need to perform additional steps on the column to ensure comparability with the list elements?
(3) Divide the trade size by 100
(4) Concatenate date and part_time and output the values in a new column
(5) Compute the logarithmic price return as \(r_t = log(P_{t} - P_{t-1})\) and daily volatility \(\sigma_D\)
(6) Add a new column with a unique index (primary key) for each row separately

Additionally, write a query that provides answers to the following questions for stock AAPL

(7) How big was the largest trade in terms of traded quantity?
(8) What is the minimum and maximum date available within the database?
(9) How much volume (sum of traded quantities) has been traded on average on each of the single trading days?
(10) What’s the average trading price on each of the single trading days?
(11) Combine the queries (9) and (10) into a single query
(12) How often does each of the valid trade conditions occur?
(13) Can you aggregate the data on a lower frequency time scale?
(14) Aggregate duplicate trades based on their participant timestamp and nanosecond by computations of the mean price and count all distinct trades within a group

Quotes

Write a structured SQL query by incrementally extending your query to allow for the following filters on the quote table.

(15) Compute the spread as difference between ask and bid \(spread_t = P_{ask,t} - P_{bid,t}\)
(16) Compute the midpoint as arithmetic average of bid and ask \(mid_t = (P_{ask,t} + P_{bid,t}) / 2\) to compute the quoted spread as \(quoted\_spread_{t} = (P_{ask,t} - P_{bid,t}) / mid_t\)
(17) Filter for exchange equal to Q, opening times between 09:30 and 15:30 plus a positive bid-ask spread
(18) Create a \(t-1\) lagged variable of the bid size, the bid price, the ask size and the ask price
(19) Compute the change in bid size as \(\Delta bid = bid_t - bid_{t-1}\), bid price, ask size and ask price analogously
(20) Concatenate date and part_time column to a single column
(21) Add a new column with an incrementally ascending index (unique identifier) to each row separately
(22) Aggregate duplicate quotes based on their participant timestamp by computations of the means
Combining trades and quotes
(23) Left join the quote data onto the trade data based on identical participant timestamps
## Python 3.11.1 (C:/Program Files/python.exe)
## Reticulate 1.26 REPL -- A Python interpreter in R.
## Enter 'exit' or 'quit' to exit the REPL and return to R.
## exit