3 Relational Database Schemas: Tables, Rows, Columns, and Keys
Using the example query from Figure 2.1 in Basic Syntax (Single-table Query): Clauses and Language Elements we established a basic understanding of SQL syntax (the first “s”) using a single-table query to illustrate the SELECT statement, the binding order of its clauses, and common SQL language elements. The natural next step for understanding SQL syntax is to introduce the ON/JOIN subclauses for performing multi-table SELECT statements — one of, if not the most critical DBI concepts. Before delving into ON/JOIN syntax, though, we’ll cover some foundational knowledge about the basic structure (or schema) of relational database management systems (RDBMS) — the second “s.” This chapter will focus on this concept of schema (and how to navigate it) before syntactically evaluating a multi-table example query in the next chapter, Basic Syntax (Multi-table Query): ON/JOIN Subclauses.
Relational Database Schemas: The Second “s”
Relational database management systems (RDBMS’s) range considerably in their complexity and size, but consist of many themed tables (formally called relations) with schematic uniqueness and relatedness through the assignment of primary keys and foreign keys, respectively. Within tables there are rows (formally called tuples) and columns (formally called attributes). The importance of these concepts lies in understanding table relationships, which includes the mapping between tables, the participation (required or optional), and the multiplicity (e.g., one-to-one, one-to-many, many-to-many). Internalizing those concepts allows the DBI to proceed with more certainty about their expected query results, the structure of their output, and whether it matches the desired level of analysis (more about this later).
3.1 Tables (relations)
Tables are the functional units within an RDBMS schema and the primary objects of interest when performing JOIN operations. Understanding how an RDBMS system organizes its data into entities, and potentially further into types or hierarchies, is the first step in choosing tables for an analysis. As discussed in the Introduction the Epic Clarity Data Dictionary can be searched in various ways to help identify potentially relevant tables and their contents and relationships.
3.2 Table Rows (tuples)
Each row from an RDBMS table is unique and represents an atomic unit of relevance. For example, in HFHS Epic that could be patient encounters, medication administration record actions, laboratory component results, flowsheet row documentation, and billing claims. Put more simply, if you perform a single table query that results in 1000 rows, the relevant question should be, A thousand what? Furthermore, incorporating table joins into queries can directly affect the atomic unit (or granularity) of each row by changing the cardinality which in turn can drastically affect interpretation of results. As we’ll see in the next section, the uniqueness of a table’s rows is schematically determined by its columns.
3.3 Table Columns (attributes)
Columns serve multiple functions within a table, the most important of which for DBIs are the enforcement of each row’s uniqueness, linkage to other tables, indexing, and — the most practical aspect — information about each row or its raw values/measurements.
3.3.1 Practical Information
Ensuring the uniqueness of table rows is part of columns’ schematic function. But at the end of the day users are more interested in the practical information that comes from table columns, regardless of their schematic role — the columns help tell the story. This chapter is dedicated more towards syntax and schema, but later chapters will apply these principles towards real use cases, which will draw upon more practical knowledge about the Clarity schema.
3.3.2 Uniqueness: Candidate Key or Minimal Superkey
Uniqueness is a fundamental constraint for rows in a relational database table. The reason for this (besides sound database design and function) is practical and goes back to the key question users should ask themselves whenever they envision their ultimate query results — what is the meaning of each row and how do we work with that analytically. Below are some important operational definitions for ascertaining a table’s schematic uniqueness and thus understanding the ultimate meaning of rows from source tables and query results:
Candidate Key or Minimal Superkey: The minimal set of columns (≥ 1) that is necessary to uniquely identify the rows within a table. The value or combination of values within this minimal set of columns appear only once in the table.
- Single-column candidate keys are sometimes referred to as simple keys or generically as an id (e.g.,
order_med.order_med_id
,pat_enc_hsp.pat_enc_csn_id
). Multi-column candidate keys are sometimes referred to as composite (or compound) keys (e.g.,order_med_id
andline
are the composite key formar_admin_info
). - Columns belonging to this minimal set are sometimes referred to as prime attributes. Otherwise, the columns are considered non-prime attributes.
- Technically, a table may contain more than one set of candidate keys; hence, the term candidate. However, practically speaking the user shouldn’t rely on there being another candidate key (relational databases are designed to be normalized and thus minimize this sort of redundancy) — the exception being unless the user is selecting all table columns (which is sometimes called a trivial superkey).
Primary Key: The candidate key selected as the official unique identifier for rows in a given table. Candidate keys not selected as the primary key are sometimes referred to as alternate keys.
- The choice of a primary key is driven by many factors outside the scope of this manual. As mentioned under the definition of candidate key, the user is unlikely to encounter what are considered to be true alternate keys (i.e., candidate keys not selected to be a primary key). Decisions about primary keys are made before any data is ever populated in the database. There may be practical superkeys that the user identifies for their queries that happens to confer uniqueness to each row, but the key isn’t necessarily robust to diverse query conditions and its uniqueness isn’t a guaranteed feature by the relational database constraints. When speaking about RDBMS tables, unless the DBI is dealing with a limited scope data set that won’t change (i.e., they know its expected features) it is best not to determine the query structure based on user-defined keys.
- As discussed in Epic Clarity Data Dictionary, users may identify the primary key for a Clarity table by consulting with the “Primary Key” section of the Epic Clarity Data Dictionary.
- Data warehouses are dimensional databases structured differently than relational databases and are outside the scope of this manual, although some DBIs may already use such a database (e.g, Warehouse). As a quick pointer, data warehouses idiomatically use primary keys that are meaningless integers, which often have a
_key
suffix appended to their column name. These are examples of surrogate keys or synthetic keys, as opposed to natural keys or operational keys — e.g.,order_med.order_med_id
— which are directly observable within Epic.
3.3.3 Linking to Destination Tables
So far the only keys we have discussed are candidate keys and their associated entities, which conferred table rows their schematic uniqueness. However, candidates keys represent a fraction of the table columns that could be used for performing join operations. Table joins are simply the process of combining one (i.e., self-joining) or more tables based on reference columns from each table and a set of conditions — any column can participate in a join operation (including columns that should never be joined together — e.g., two tables with “comments” columns whose contents will seldom match up). The syntax will be covered in the next chapter, Basic Syntax (Multi-table Query): ON/JOIN Subclauses. Additionally, table joins can be reductive, neutral, or additive and thus have the potential to change the granularity, or atomic unit of analysis, for a result set — this is a critical schematic concept called join cardinality and will be discussed in greater detail in Schema and Multi-table Queries.
The concept of join columns is distinct from foreign keys; which determines a table’s schematic relatedness, or its referential integrity within the schema. Foreign keys are simply source table columns that refer to the primary key of another table. As mentioned earlier, any column can participate in a table join, regardless if the source column is referencing a primary key or non-primary key in the destination table. Nonetheless, DBIs can identify potentially useful join prospects by consulting with the “Foreign Key Information” section of the Clarity Data Dictionary — using column names (regardless if they’re foreign keys or not) as search terms in the data dictionary can also help accomplish the same thing. Cross-referencing column names to their join prospects is another essential workflow for learning a database schema and deriving practical knowledge for query design.
3.3.4 Table Indexing
Indexing refers to the physical storage of table data; which implicates the structure and speed with which queries access that data. Database engineering is outside the scope of this manual (and its author’s knowledge) but table indexes are useful to DBIs for authoring more performant SQL code. In the Clarity Data Dictionary near the top after the “Primary Key” section is the “Index Information” section. There users can determine if a table column influences the storage structure of that table’s data.
For the sake of simple discussion, the most common query performance benefit of indexed columns is when they are part of predicates in the ON/JOIN subclauses (as we’ll discuss in the next chapter, Basic Syntax (Multi-table Query): ON/JOIN Subclauses or within the join or search condition of an ON subclause (which is a predicate) or WHERE clause. Indexed columns as part of expressions in the SELECT list may also help optimize query performance. Ultimately, of all the Table Columns (attributes) subsections this one is by far the least important, but could be of use for those DBIs who have progressed to the point where query execution time becomes important. Obviously, intended results that take longer are more desirable than unintended results that execute quickly.