Chapter 2 Python
Python is a general-purpose programming language and finds application in a broad range of domains, including web development, artificial intelligence and data science. Since it is characterized as a high-level programming language, it is considered relatively easy to learn.
You can download Python from the official Python website. To make editing code easier, we recommend choosing an editor, such as Visual Studio Code, for which a tutorial on how to use Python with Visual Studio Code is available. Please make sure you have a running Python installation, as we cannot offer full and individual support with setting up the work environment.
To write structured and readable Python code, a set of coding standards, PEP 8, has been defined. Throughout the course we require you to comply with these coding standards. If you are using Visual Studio Code, you can use the Flake8 extension to check your code for compliance.
The contents of this section can be found in the Python Cookbook, authored by Beazley & Jones (2013). Additionally, there is a wide range of online resources available for self-study, such as W3Schools.
2.1 Data Structures
Before diving into the actual coding, it is vital to understand the different data types employed by Python. The following section provides a first overview.
Strings | str | A string is a sequence of Unicode characters enclosed in single, double or triple quotation marks, and is the fundamental data type for representing textual data. Strings are immutable: once defined, their contents cannot be changed, and any operation that appears to modify a string returns a new one. This immutability allows for efficient handling of string objects in Python. |
Numeric Types | int, float | Python supports several numeric data types. An integer (int) is a whole number without decimals that can be positive, negative, or zero; integers have unlimited precision, so they can be of any size the system’s memory allows. A float is a number with a decimal part, stored with finite precision. Python provides arithmetic operations for numeric types, including addition, subtraction, multiplication, division, exponentiation, modulus, and floor division. |
Sequences | list, tuple, range, set | A sequence is an ordered collection of items that lets you store and manipulate multiple values in a single variable. The main sequence types in Python are lists, tuples, ranges, and strings. Collections differ in whether they are ordered and whether their items are changeable: lists are ordered and mutable, tuples and ranges are ordered and immutable, and a set is an unordered, mutable collection of unique items (strictly speaking a collection rather than a sequence). A list can contain elements of different data types, including integers, floats, strings, and even other lists. Lists are defined using square brackets [ ], with elements separated by commas. |
# Example Sequence
ex_list = [1, 2, 3, 4]
ex_tuple = (1, 2)
ex_range = range(1, 10)
ex_set = set(ex_list)
Mappings | dict | A mapping is a collection of key-value pairs in which each key is associated with a value. Python implements mappings through dictionaries, which are mutable and, since Python 3.7, preserve the order in which items were inserted. In other programming languages, dictionaries are also known as associative arrays, hash tables, or simply maps. |
Booleans | bool (True, False) | Booleans are the results of logical expressions and control the flow of program execution based on conditions. They are used extensively for branching logic, loop control, and decision-making constructs. |
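The properties listed in the table can be tried out directly. The following is a minimal sketch rather than part of the original material; all variable names and values are hypothetical and chosen only for illustration.

```python
# Strings are immutable: assigning to a single character raises a TypeError.
course = "python"
try:
    course[0] = "P"
except TypeError:
    print("strings cannot be changed in place")
capitalized = course.capitalize()  # returns a new string, course is unchanged

# Integers and floats with some of the arithmetic operators.
quotient = 7 / 2         # true division -> 3.5
floor_quotient = 7 // 2  # floor division -> 3
remainder = 7 % 2        # modulus -> 1
power = 2 ** 10          # exponentiation -> 1024

# A dictionary maps keys to values; a comparison yields a Boolean.
grades = {"Lisa": 2.3, "Florian": 1.0}  # hypothetical example data
is_lower = grades["Florian"] < grades["Lisa"]  # True
```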
To identify the data type of a variable use type().
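The calls behind the output below are not shown in the text. A minimal sketch of how type() might be used, assuming the list ex_list from the sequence example:

```python
ex_list = [1, 2, 3, 4]

print(type(ex_list))     # type of the list itself
print(type(ex_list[0]))  # type of its first element
```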
<class 'list'>
<class 'int'>
2.2 Operators
Operators in Python are symbols that perform operations on variables and values. Python supports various types of operators, including arithmetic operators, comparison operators, logical operators, assignment operators, and more. These operators are used to manipulate data, make decisions, and perform calculations within Python programs.
Comparison Operators
A comparison operator compares two values and returns a Boolean result, for example whether they are equal or which of them is larger.
== | Equality |
>, < | greater than, smaller than |
>=, <= | greater than or equal to, smaller than or equal to |
!= | Inequality |
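The expressions that produced the two results below are not part of the text. A minimal sketch, assuming the hypothetical values a = 5 and b = 3:

```python
a = 5
b = 3

print(a == b)  # False: 5 and 3 are not equal
print(a < b)   # False: 5 is not smaller than 3

# The remaining operators from the table, stored instead of printed.
is_different = a != b     # True
is_at_least = a >= b      # True
```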
False
False
Logical Operators
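Logical operators combine conditional statements and return a Boolean result based on the logical relationship between them.
&, and | True if both Boolean expressions are True |
|, or | True if at least one of the Boolean expressions is True |
^ | True if exactly one of the two Boolean expressions is True (exclusive or; Python has no xor keyword) |
in | True if the operand is contained in a sequence or collection |
~, not | Reverses a Boolean value |
The keywords and, or and not operate on single Boolean values. The symbols &, |, ^ and ~ are bitwise operators and are used for element-wise logic, for example on the Boolean masks of a pandas DataFrame. Applied to a plain True or False, ~ yields an integer, so use not in that case.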
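The expressions behind the two results below are not shown in the text. A minimal sketch of the logical operators, again assuming the hypothetical values a = 5 and b = 3:

```python
a = 5
b = 3

print(a > b and b > 0)  # True: both conditions hold
print(a < b or a == 5)  # True: the second condition holds

is_negated = not (a > b)         # False: the condition itself is True
is_member = a in [1, 3, 5]       # True: 5 is contained in the list
exactly_one = (a > b) ^ (b > a)  # True: only the first condition holds
```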
True
True
When combining multiple operators, use parentheses to make the intended order of evaluation explicit. Parentheses have the highest precedence, so the expressions inside them are evaluated first. Without parentheses, not binds more tightly than and, which in turn binds more tightly than or. If two operators have the same precedence, the expression is evaluated from left to right (exponentiation being the exception).
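A short sketch of how parentheses change a result; the variable names are hypothetical:

```python
# and is evaluated before or: True or (False and False) -> True
without_parentheses = True or False and False

# The parentheses are evaluated first: (True or False) and False -> False
with_parentheses = (True or False) and False

print(without_parentheses, with_parentheses)
```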
2.3 Syntax
Comments
Comments are utilized to clarify the code’s purpose, improve its readability, and assist both other developers and your future self in understanding it better. Since comments are ignored by the Python interpreter when running the program, they do not alter its functionality.
You add a comment to your code by prefixing it with #. Commenting your code is especially useful if you want to reuse it at a later point in time or make it understandable for other programmers. Comments are also useful for temporarily disabling lines of code without deleting them, which can be helpful for debugging or testing different sections of code. Write clear and concise comments that explain the intention of the code. Avoid redundant or unnecessary comments that simply restate what the code is doing.
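Descriptions that span multiple lines are used to document functions and are called docstrings. A docstring is a string literal enclosed in triple quotes and placed directly below the function definition. Brief documentation on the usage of docstrings is available online.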
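The different kinds of comments might look as follows. This is an illustrative sketch; the function gross_price and the value of tax_rate are hypothetical.

```python
# A single-line comment explains the intention of the next statement.
tax_rate = 0.19  # an inline comment follows the code on the same line


def gross_price(net_price):
    """Return the gross price for a given net price.

    The docstring sits directly below the def line and can be
    displayed with help(gross_price).
    """
    return net_price * (1 + tax_rate)


# print(gross_price(100))  # a disabled line, kept for later debugging
```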
Case Sensitivity & Indentation
Python is case-sensitive, meaning it distinguishes between uppercase and lowercase letters. This applies to variable names, function names, keywords, and any other identifiers in Python code.
Indentation plays a crucial role in Python’s syntax for defining the beginning and the end of code blocks. Consistent indentation is required to maintain the structure of the code and determine which statements belong to which block. PEP 8 recommends four spaces per indentation level, and tabs and spaces must not be mixed.
Consequently, ignoring indentation or the capitalization of existing names results in errors.
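Both kinds of error can be provoked deliberately. The sketch below catches them so that the script keeps running; the variable names are hypothetical, and compile() is used only to check a piece of badly indented source code without executing it.

```python
value = 10
case_error = False
indentation_error = False

# Case sensitivity: Value and value are two different names.
try:
    print(Value)
except NameError:
    case_error = True
    print("NameError: the name Value is not defined")

# Indentation: the body of the if statement is not indented.
bad_source = "if True:\nprint('not indented')"
try:
    compile(bad_source, "<example>", "exec")
except IndentationError:
    indentation_error = True
    print("IndentationError: the if block needs an indented body")
```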
2.4 Loops
Loops are used to execute a block of code repeatedly as long as a certain condition is true. Python supports two main types of loops: for loops and while loops. These loops allow you to automate repetitive tasks and iterate over collections or sequences of data.
For loops are used to iterate over a sequence (such as a list, tuple, string, or range) and execute a block of code for each element in the sequence. The loop variable takes on each value in the sequence one by one.
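The loop behind the output below is not shown in the text. A minimal sketch that prints the same numbers, assuming the loop runs over range(1, 6):

```python
for number in range(1, 6):  # range(1, 6) yields 1, 2, 3, 4, 5
    print(number)
```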
1
2
3
4
5
While loops repeatedly execute a block of code as long as a specified condition is True. The loop ends as soon as the condition becomes False.
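The loop behind the output below is not shown in the text. A minimal sketch that prints the same numbers, assuming a hypothetical counter variable:

```python
counter = 1
while counter <= 5:
    print(counter)
    counter += 1  # without this update the condition would stay True forever
```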
1
2
3
4
5
Python provides loop control statements such as break, continue, and pass to modify the behavior of loops. break terminates the loop prematurely, continue skips the current iteration and moves to the next iteration and pass acts as a placeholder and does nothing.
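The three statements can be combined in one loop to see their effect. This is an illustrative sketch with hypothetical names:

```python
visited = []
for number in range(1, 10):
    if number == 2:
        continue  # skip 2 and move on to the next iteration
    if number == 5:
        break     # leave the loop entirely once 5 is reached
    if number == 3:
        pass      # placeholder: nothing happens, execution continues below
    visited.append(number)

print(visited)  # [1, 3, 4]
```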
2.5 Conditionals
Python supports the if, elif (short for “else if”), and else statements for implementing conditional logic.
a = 5
b = 3
if a > b:
    print("a is greater than b")
elif a < b:
    print("a is smaller than b")
else:
    print("a and b are equal")
a is greater than b
2.6 Functions
Functions are blocks of reusable code that perform a specific task. Functions allow you to break down your program into smaller, manageable parts, making your code more organized, readable, and modular. You can define your own functions or use built-in functions provided by Python or external libraries.
Since we follow the principle of avoiding redundant code, we want to write functions whenever possible. As a rough rule, a function is helpful once we are copy-pasting code 3 times or more.
You define a function using the def keyword, followed by the function name and parentheses (). Any parameters (inputs) to the function are listed within the parentheses. The function body, containing the code to be executed when the function is called, is indented. To execute a function, you “call” it by using its name followed by parentheses ().
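The function that produced the output below is not shown in the text. A minimal sketch of a function without arguments, assuming it is called hello like the later examples:

```python
# A function without arguments
def hello():
    print("Welcome!")


hello()
```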
Welcome!
Information can be passed into functions as arguments. If the function requires any arguments, you pass them within the parentheses. Arguments are specified after the function name, inside the parentheses.
# A function with a single positional argument
def hello(name):
    print(f"Welcome {name}!")


hello("Lisa")
Welcome Lisa!
You can add an arbitrary number of arguments, separated by commas. A function can also define a default value for an argument within its definition; the default is used whenever no value for that argument is passed in the function call.
# A function with a single optional argument
def hello(name="Somebody"):
    print(f"Welcome {name}!")


hello()
Welcome Somebody!
We differentiate between positional arguments and keyword arguments. A positional argument is matched to a parameter based on its position in the function call, as in hello("Agustina"), while a keyword argument names the parameter it belongs to, as in hello(name="Somebody").
# A function with a positional and optional keyword argument
def hello(name_1, name_2="Somebody"):
    print(f"Welcome {name_1} and {name_2}!")


hello("Lisa")
hello("Lisa", name_2="Florian")
Welcome Lisa and Somebody!
Welcome Lisa and Florian!
The amount of information passed into a function can also be handled flexibly: the function takes as many arguments as the caller supplies and processes them accordingly. For this purpose, we prefix a parameter with an asterisk, conventionally written *args. *args is useful when you want to create flexible functions that can accept a varying number of positional arguments. It is commonly used when working with functions that delegate to other functions or when building APIs that need to handle arbitrary inputs.
# A function with a variable number of input names
def hello(*names):
    print(f"Welcome {names}!")


hello("Lisa", "Ryan", "Florian")
Welcome ('Lisa', 'Ryan', 'Florian')!
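As the output shows, the names arrive inside the function as a single tuple. To greet every person individually, the tuple can be looped over. A minimal sketch; the function hello_each and its return value are hypothetical additions for illustration:

```python
def hello_each(*names):
    # names is a tuple and can be looped over like any other sequence
    for name in names:
        print(f"Welcome {name}!")
    return len(names)  # number of people greeted


greeted = hello_each("Lisa", "Ryan", "Florian")
```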
**kwargs is useful when you want to create flexible functions that can accept a varying number of keyword arguments. The arguments are collected in a dictionary that maps each keyword to its value.
# A function with a variable number of keyword arguments (a dictionary)
def hello(**names):
    for key, value in names.items():
        print(key, ":", value)
        print(f"Welcome {value}!")


hello(name_1="Lisa", name_2="Ryan", name_3="Florian")
name_1 : Lisa
Welcome Lisa!
name_2 : Ryan
Welcome Ryan!
name_3 : Florian
Welcome Florian!
You can play around with writing functions in order to understand how they work, what is possible and what is not.
2.7 Dataframes
When working with large structured data sets, data is commonly stored in a pandas DataFrame. pandas is an open-source Python library. It offers powerful and flexible data structures, particularly Series (1-dimensional) and DataFrame (2-dimensional), that allow you to work with structured data easily and efficiently.
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet, where each column can represent a different feature, and each row represents an individual record or observation.
The axes of a DataFrame (rows and columns) are labelled, and arithmetic operations align on both row and column labels. A DataFrame can be thought of as a dict-like container for Series objects.
Since we are utilizing pandas Dataframes, we import the library first.
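The import statement itself is not shown in the text. By convention pandas is imported under the alias pd, which the code in this section relies on:

```python
import pandas as pd  # pd is the conventional alias for pandas
```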
A DataFrame can easily be created from a dictionary, with the keys representing the column names and each key’s value holding the list of entries for that column.
data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)
name grade profession
0 Lisa 2.3 PhD
1 Florian 1.0 PhD
2 Moritz 1.7 Student
Alternatively, we can construct a DataFrame from external files, such as .csv files.
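A minimal sketch of reading .csv data with pd.read_csv. To keep the example self-contained, an in-memory text buffer stands in for a file on disk; the contents and the file name mentioned in the comment are hypothetical.

```python
import io

import pandas as pd

# Stand-in for a file on disk; the contents are hypothetical example data.
csv_text = (
    "name,grade,profession\n"
    "Lisa,2.3,PhD\n"
    "Florian,1,PhD\n"
    "Moritz,1.7,Student\n"
)
csv_file = io.StringIO(csv_text)

# With a real file you would pass its path instead,
# e.g. pd.read_csv("grades.csv").
dataframe_from_csv = pd.read_csv(csv_file)
print(dataframe_from_csv)
```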
We can inspect the data type of each column of the DataFrame.
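The call behind the output below is not shown in the text. A minimal sketch using the dtypes attribute, with the example DataFrame rebuilt so that the block runs on its own:

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

column_types = dataframe.dtypes  # a Series with one data type per column
print(column_types)
```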
name object
grade float64
profession object
dtype: object
2.7.1 Horizontal Filtering
Horizontal filtering of a DataFrame typically involves selecting specific columns of the Dataframe. When working with large amounts of data, you may also have a large number of features (columns) in your dataset, while not all of them are relevant for your ongoing analysis. Horizontal filtering of the Dataframe allows you to select only the columns that are necessary for your analysis, making your dataset more manageable and improving computational efficiency.
We first extract all columns of the Dataframe at hand.
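The call behind the output below is not shown in the text. A minimal sketch using the columns attribute, with the example DataFrame rebuilt so that the block runs on its own:

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

column_labels = dataframe.columns  # an Index holding the column names
print(column_labels)
```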
Index(['name', 'grade', 'profession'], dtype='object')
We then apply a filter to the columns of the Dataframe to only show the columns name and grade.
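A minimal sketch of such a column filter. The name mask for the list of wanted columns is an assumption; both selection styles shown here return the same columns.

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

mask = ["name", "grade"]  # assumed name for the list of wanted columns
filtered_columns = dataframe[mask]       # selection with square brackets
same_selection = dataframe.loc[:, mask]  # all rows, selected columns
print(filtered_columns)
```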
name grade
0 Lisa 2.3
1 Florian 1.0
2 Moritz 1.7
# The same selection with .loc, assuming mask holds the column names
dataframe.loc[:, mask]
2.7.2 Vertical Filtering
Vertical filtering in a DataFrame refers to selecting specific rows based on a defined set of conditions. It is especially relevant when working with only a subset of the data which meets specific conditions. Vertical filtering allows you to extract rows that satisfy these conditions, enabling focused analysis on relevant portions of your dataset. Correspondingly, vertical filtering is also used for cleaning the data. Single rows can be removed if they contain missing values, outliers or errors to ensure the quality and integrity of your dataset.
We do so by creating a “mask” of the original DataFrame that indicates whether each row meets our defined condition. The mask is Boolean: it is either True or False for each row. The mask is then used to filter the complete DataFrame. All rows with the value True (that is, all rows that fulfill the condition) remain in the filtered DataFrame; all rows with the value False (that is, all rows that do not fulfill the condition) are removed.
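The filter behind the output below is not shown in the text. A minimal sketch, assuming the condition is that the name equals "Lisa":

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

mask = dataframe["name"] == "Lisa"  # Boolean Series: True, False, False
print(mask)
print(dataframe[mask])              # keeps only the rows where mask is True
```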
name grade profession
0 Lisa 2.3 PhD
We can make the filtering procedure more dynamic by employing a variable instead of a static name.
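A minimal sketch of the same filter driven by a variable; the name filter_name is an assumption, and the result is stored in a new variable so that it can be reused:

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

filter_name = "Lisa"  # change this value to filter for another name
filtered = dataframe[dataframe["name"] == filter_name]
print(filtered)
```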
name grade profession
0 Lisa 2.3 PhD
We can extend our filter to now contain a multitude of conditions. Instead of filtering for a single scalar value, we filter for values within a list of values.
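A minimal sketch of filtering against a list of values with isin(). The list is hypothetical and chosen so that only one of its names occurs in the data:

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

filter_names = ["Lisa", "Anna"]  # hypothetical list; "Anna" is not in the data
filtered = dataframe[dataframe["name"].isin(filter_names)]
print(filtered)
```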
name grade profession
0 Lisa 2.3 PhD
Using Boolean operators, we can also combine multiple conditions. Each single condition must be enclosed in parentheses, because & and | bind more tightly than comparison operators such as == and <. In addition, the conditions must be grouped with parentheses according to their logical structure.
filter_name = ["Lisa", "Florian"]
dataframe[((dataframe["name"].isin(filter_name)) | (dataframe["name"] == "Moritz")) & (dataframe["grade"] < 2)]
name grade profession
1 Florian 1.0 PhD
2 Moritz 1.7 Student
Recall that you need to store the filtered DataFrame in a new variable if you want to continue working with the filtered data.
2.8 Aggregating
Aggregating information from a DataFrame in Pandas involves summarizing or calculating statistics across rows and / or columns. You can apply built-in aggregation functions, such as sum, mean, median directly to the columns of a DataFrame.
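The calls behind the two outputs below are not shown in the text. The outputs are consistent with describe() and with a count per group, so a minimal sketch under that assumption is:

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

# Summary statistics of the numeric columns.
summary = dataframe.describe()
print(summary)

# Number of non-missing entries per profession.
counts = dataframe.groupby("profession").count()
print(counts)

# A single built-in aggregation applied to one column.
mean_grade = dataframe["grade"].mean()
```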
grade
count 3.000000
mean 1.666667
std 0.650641
min 1.000000
25% 1.350000
50% 1.700000
75% 2.000000
max 2.300000
name grade
profession
PhD 2 2
Student 1 1
More complex ways of aggregation are possible as well; however, they require you to define explicitly how the data is supposed to be aggregated.
Be aware that the aggregation you have chosen might only work for numeric values. You can therefore either explicitly define the (numeric) subset of columns you want to aggregate, or specify within the function call that only columns of a certain type should be aggregated.
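A minimal sketch of an explicitly defined aggregation and of the two ways to restrict an aggregation to numeric columns. The choice of statistics and the variable names are hypothetical.

```python
import pandas as pd

data = {"name": ["Lisa", "Florian", "Moritz"],
        "grade": [2.3, 1, 1.7],
        "profession": ["PhD", "PhD", "Student"]}
dataframe = pd.DataFrame(data)

# Explicitly name the aggregations to apply to each column.
grade_stats = dataframe.groupby("profession").agg({"grade": ["mean", "max"]})
print(grade_stats)

# Option 1: select the numeric subset of columns before aggregating.
numeric_means = dataframe.select_dtypes(include="number").mean()

# Option 2: tell the aggregation to skip non-numeric columns.
group_means = dataframe.groupby("profession").mean(numeric_only=True)
print(group_means)
```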