2.6 A quick peak at databases

Sometimes data sets are stored as databases, e.g., relational database such as SQLite, MySQL or Oracle. These are collections of tables that need to be organised, usually into a single rectangular data frame. Although this brings additional complexity (and this section will only scratch the surface) it offers advantages over above necessity. First, it can help manage very large volumes of data and it also offers the possibility of only accessing data as and when it is needed. By contrast, with the text files that were discussed in the previous section we need to read the entire data set into memory even if only small portions of the data set are actually processed. This can impose data size limitations.

Example Employee Relational Data Base The diagram above shows the data model and highlights the tables (referred to as relations) and the relationships between them. The notation allows us to express different types of relationship e.g., one-to-one and one-to-many. In the above example a Dept Manager may have many Employees. Strictly speaking the formal notation states that a Dept Manager must have one or more Employees and an Employee must have exactly one Dept Manager. The notation details are given below.

Data Base Model Notation Definitions
Data Base Model Notation Definitions

In order to get the data from the Employees database into R and into a convenient rectangular format such as a data frame then we will need to manipulate the database and to do so we need to use a special purpose database language, in this case Structured Query Language better known as SQL. It is a small but powerful language since it has strong mathematical foundations being defined in terms of set theory6. Using SQL we can join tables by matching on foreign keys. So for the Employees example we might wish to match Employees to Dept Manager using the manager’s employee number emp_no to determine her or his last_name or hire_date.

To manipulate databases, it’s therefore necessary to embed SQL within R and naturally there are packages to do this. There are many choices and to some extent it depends on which specific database you wish to access and second, on whether you plan to update the database or merely fetch data from it. See the RStudio tutorial for specific information.

A more specific example, based on the {RSQLite} package embeds SQLite commands to let you query and manipulate SQLite databases from R. See https://rsqlite.r-dbi.org/ for more details.

If you are interested in database design, a good (and lengthy) text book is (Connolly and Begg 2015). The W3Schools Tutorial on SQL is beginner friendly! Also see Chapter 13 of (Grolemund and Wickham 2018) for more details of working with relational databases such as MySQL and sqlite in R.

References

Connolly, Thomas M, and Carolyn E Begg. 2015. Database Systems: A Practical Approach to Design, Implementation, and Management. 6th ed. Pearson Education.
Grolemund, Garrett, and Hadley Wickham. 2018. “R for Data Science.”

  1. Relations are a special kind of set that relates a set of inputs to outputs. Where the mapping is unique i.e., there is exactly one output for an input these are called functions.↩︎