3.1 Pitfalls of Abstraction

Virtue is the golden mean between two vices, the one of excess and the other of deficiency.

-Aristotle

What Aristotle says about virtue also applies to abstraction. If we had no abstraction life would be more challenging. I don’t need to know the internals of power steering to use a car. The wheel becomes an interface to the complexities. At the other extreme, a lack of fundamental knowledge causes problems. The fact that combustion engines require gasoline is an implementation detail. All drivers must understand this level of detail or they will quickly become hitchhikers.

The DBI package provides a layer of abstraction between the R user and the underlying database system. Many R users have no need or desire to understand the internals of an RDBMS. However, if they know the basics of databases, then they can deliver a more robust solution.

The R function dbCreateTable() from DBI allows the user to create a SQL table without defining column data types. It’s tedious to look at each column and figure out the most appropriate data type, but failure to do so can lead to future performance problems. For example, suppose we use dbCreateTable() with the Yellow Cab data file.

url <- "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2018-01.csv"
download.file(url,"D:/Cab_Data/yellow_tripdata_2018-01.csv")
trips_fs<- read_csv('D:/Cab_Data/yellow_tripdata_2018-01.csv')

dbCreateTable(con,'yellow_trip_summary_model',trips_fs)

First, the dbCreateTable function generates a CREATE TABLE SQL statement. Then the statement is executed against the database specified in connection argument. A single line of R code replaces several lines of SQL code.

CREATE TABLE yellow_trip_summary(
    [VendorID] [float] NULL,
    [tpep_pickup_datetime] [datetime] NULL,
    [tpep_dropoff_datetime] [datetime] NULL,
    [passenger_count] [float] NULL,
    [trip_distance] [float] NULL,
    [RatecodeID] [float] NULL,
    [store_and_fwd_flag] [varchar](255) NULL,
    [PULocationID] [float] NULL,
    [DOLocationID] [float] NULL,
    [payment_type] [float] NULL,
    [fare_amount] [float] NULL,
    [extra] [float] NULL,
    [mta_tax] [float] NULL,
    [tip_amount] [float] NULL,
    [tolls_amount] [float] NULL,
    [improvement_surcharge] [float] NULL,
    [total_amount] [float] NULL,
    [Hour_Range] [varchar](10) NULL,
    [Day] [char](3) NULL
)

The generated code might meet the requirements, but it has some flaws. First, is the FLOAT data type appropriate for every numeric data type? Float expresses approximate numeric values with a default storage size of eight bytes. Eight bytes are wasteful for several of the columns. For example, the payment_type column could be defined as TINYINT because only five classes [1,2,3,4,5] of payment types exist.

SELECT payment_type,count(1)
  FROM [Cab_Demo].[dbo].[yellow_trip_summary_partitioned]
  group by payment_type

Table 1.1: 5 records
payment_type
4	153273
3	531621
5	3
1	71367409
2	30751943

As long as the number of payment type classes stays below 256, TINYINT will work. Using TINYINT, with a storage size of one byte would save space and could help prevent decimal codes from entering the database. Should users be able to enter a payment code of 4.2? They technical could with a float data type.

When performance or accuracy matters then developers should specify the data types. I created another table with the smallest possible data type for each column. At 100 million rows, this table design requires 44% less disk space compared to the first table.

CREATE TABLE [dbo].[yellow_trip_summary](
    [VendorID] [tinyint] NULL,
    [tpep_pickup_datetime] [datetime] NULL,
    [tpep_dropoff_datetime] [datetime] NULL,
    [passenger_count] [tinyint] NULL,
    [trip_distance] [smallint] NULL,
    [RatecodeID] [tinyint] NULL,
    [store_and_fwd_flag] [char](1) NULL,
    [PULocationID] [smallint] NULL,
    [DOLocationID] [smallint] NULL,
    [payment_type] [tinyint] NULL,
    [fare_amount] [decimal](8, 2) NULL,
    [extra] [decimal](8, 2) NULL,
    [mta_tax] [decimal](8, 2) NULL,
    [tip_amount] [decimal](8, 2) NULL,
    [tolls_amount] [decimal](8, 2) NULL,
    [improvement_surcharge] [decimal](8, 2) NULL,
    [total_amount] [decimal](8, 2) NULL,
    [Hour_Range] [varchar](10) NULL,
    [Day] [char](3) NULL
)

The size on disk for the first table is 16.1 GB and the second table is 8.9 GB. A difference of 7.2 GB might not seem like a big deal. After all, disk space is relatively cheap. But disk shouldn’t be the only consideration for design. The 7.2 GB now takes up room in memory.

Memory
Network
Backups
Indexes
Replication