Chapter 11 Seeds

Seeds are Comma Separated Values (csv) files stored inside your seeds directory which can be loaded into your data warehouse using the dbt seed command.

Seeds can also be referenced by your SQL models using the ref() function. Seeds in dbt are version controlled. That is, you can revert them to a previous state.

Seeds are best used for data that changes infrequently. A good example is country codes, email accounts and station names. However, seeds should not be used to store sensitive information such as passwords.

To demonstrate about seeds in dbt, we will try to upload a New York City (NYC) bike history data for 2014.

Extract the zip folder and copy the 3e7acf34-19ba-4bf4-8dd2-cf349623dc6b.csv inside the dbt_book/seeds directory. To reduce the verbosity of its name, rename it to 2014-tripdata.csv.

11.1 Uploading a seed into your data warehouse

Believe you me we had a better csv table to upload, one much more related to the NYC bikes dataset. However, because it was ~320MB and we want to be economical in the upload time and bandwith, we settled for this historical data. Nevertheless, keeping on with this chapter, to upload a seed into a data warehouse, we use this command:

dbt seed

After that, it’s a test of patience. Depending on your upload speeds, it shouldn’t take long to upload a 36MB file to BigQuery.

-- snip --
16:31:51  Concurrency: 1 threads (target='dev')
16:31:51  
16:31:51  1 of 1 START seed file nyc_bikes.2014-tripdata ................................. [RUN]
16:33:17  1 of 1 OK loaded seed file nyc_bikes.2014-tripdata ............................. [INSERT 224736 in 85.39s]
16:33:17  
16:33:17  Finished running 1 seed in 0 hours 1 minutes and 28.54 seconds (88.54s).
16:33:17  
16:33:17  Completed successfully
16:33:17  
16:33:17  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

If you go to BigQuery and refresh the contents of your nyc_bikes dataset, you should see the 2014-tripdata table present.

Seeds
Seeds

11.2 Referencing seeds in models

Just like you would reference a model in another model, we can also reference seeds in another model. All you need to reference a seed is to place the name of the csv file, excluding the .csv extension inside the ref() function.

For example, we want to create a view that contains those start station names from our 2014 table that are existent in the citi_trips_long model.

Within the models/my_models directory, create the citi_stations_2014.sql model with the following query:

{{ config(materialized='view') }}

WITH citi_stations_2014 AS (
    SELECT * FROM {{ref ('2014-tripdata') }}
    WHERE `start station name` IN (
        SELECT start_station_name FROM {{ ref('citi_trips_long') }}
    )
)

SELECT * FROM citi_stations_2014

Thereafter, run the model using dbt run --select citi_stations_2014

That will create a view that contains only those stations within the 2014-tripdata.csv also within the citi_trips_long model. Much to our surprise, all the stations within our 2014 table are also found in the citi_trips_long model!

Here are the SQL queries we used to perform a count of each of the two tables in BigQuery.

SELECT COUNT(*) FROM `dbt-project-437116`.`nyc_bikes`.`2014-tripdata`;

SELECT COUNT(*) FROM dbt-project-437116.nyc_bikes.citi_stations_2014;

Both returned a value of 224736.

Here is the view of the citi_stations_2014 model in BigQuery.

View from seed
View from seed

11.3 Seed Configurations at project level

Though it may sound like there is a lot to do here, there actually isn’t. Suffice to only say that seeds are configurable as much as our normal models are. There are two ways to configure seeds in dbt: either in the dbt_project YAML file or at the individual seed’s YAML properties.

For the purposes of this exercise, at the project level we will set a dictionary that looks as follows:

seeds:
  dbt_book:
    2014-tripdata:
      schema: nyc2014_data

For any custom schema that we set, the result will be in the following format: {{ target.schema }}_{{ schema }}. That means the expected schema for our seed will be nyc_bikes_nyc2014_data –quite a mouthful of a name.

Thereafter run dbt seed.

-- snip --
19:24:43  Concurrency: 1 threads (target='dev')
19:24:43  
19:24:43  1 of 1 START seed file nyc_bikes_nyc2014_data.2014-tripdata .................... [RUN]
19:26:18  1 of 1 OK loaded seed file nyc_bikes_nyc2014_data.2014-tripdata ................ [INSERT 224736 in 95.10s]
19:26:18  
19:26:18  Finished running 1 seed in 0 hours 1 minutes and 46.50 seconds (106.50s).
19:26:18  
19:26:18  Completed successfully
19:26:18  
19:26:18  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Think of a schema as a folder or container for storing your data (read tables). Therefore, if you were in a company, there would be a schema (read it as folder or container) for sales, customers, products and clients. Inside the schema, (read folder or container) for sales, there would be tables for january_sales, february_sales and so on.

Just a nota bene, don’t use hyphens (-) for your schema names, otherwise it will result in an error.

Here is our seed data appearing under the nyc_bikes_nyc2014_data dataset.

Schema for seeds
Schema for seeds

11.4 Seed properties and configurations at properties level

Seeds can also be configured at the properties level. In fact, the configurations at the properties level will override those set at the project level, that is at the dbt_project file.

To demonstrate setting seed configurations at the properties level, create nyc_bikes2014 YAML file. Copy paste the following contents into the file.

version: 2

seeds:
  - name: 2014-tripdata
    description: "Seed for NYC 2014 bike data"
    docs:
      show: true 
      node_color: purple # Use name (such as node_color: purple) or hex code with quotes (such as node_color: "#cd7f32")
    config:
      schema: nyc_bikes2014

Not much different from the model properties’ files we create, isn’t it? In this case, the name of the model is not a SQL file but the csv we pushed to the data warehouse. The docs key is not so much important as the config key which we use to set the schema of our seed in the data warehouse.

What’s good for the goose is good for the gander. Much akin to the model properties files where we can insert tests and documentation, the same goes for seed properties’ files. In the contents of the below properties file of nyc_bikes2014.yml we have inserted documentation at both the table and column levels. We have also inserted tests at both levels as well.

version: 2

seeds:
  - name: 2014-tripdata
    description: '{{ doc("seed_2014_tripdata") }}'
    docs:
      show: true 
      node_color: purple # Use name (such as node_color: purple) or hex code with quotes (such as node_color: "#cd7f32")
    config:
      schema: nyc_bikes2014
    tests:
      - dbt_expectations.expect_table_column_count_to_be_between:
          min_value: 1 # (Optional)
    columns:
      - name: _id
        description: 'Unique identifier'
        tests:
          - dbt_expectations.expect_column_values_to_be_unique

      - name: tripduration
        description: '{{ doc("tripduration") }}'

If you have additional seeds, simply add them to the properties files much like what we have in the my_models.yml which consists of the three models citi_trips_minutes, citi_trips_round and citi_trips_long.

11.5 Performing tests on seeds

Tests on seeds are performed in much the same way as other models. The only trick is to insert the name of the csv file. For example, to run a test of our 2014-tripdata.csv which is our seed model, we execute dbt test --select 2014-tripdata.

-- snip --
18:19:53  Concurrency: 1 threads (target='dev')
18:19:53  
18:19:53  1 of 2 START test dbt_expectations_expect_column_values_to_be_unique_2014-tripdata__id  [RUN]
18:19:57  1 of 2 PASS dbt_expectations_expect_column_values_to_be_unique_2014-tripdata__id  [PASS in 4.67s]
18:19:57  2 of 2 START test dbt_expectations_expect_table_column_count_to_be_between_2014-tripdata_1  [RUN]
18:20:02  2 of 2 PASS dbt_expectations_expect_table_column_count_to_be_between_2014-tripdata_1  [PASS in 4.58s]
18:20:02  
18:20:02  Finished running 2 data tests in 0 hours 0 minutes and 12.62 seconds (12.62s).
18:20:02  
18:20:02  Completed successfully
-- snip --

As you can see, the two tests of dbt_expectations.expect_table_column_count_to_be_between and dbt_expectations.expect_column_values_to_be_unique passed.

11.6 Viewing documentation for dbt seeds

Even much less different than running tests is the generation of documentation regarding your dbt seeds. The process is exactly the same. First, run dbt docs generate. Assuming that the manifest files have successfully been created in the catalog, run dbt docs serve. If no errors appear at this final stage, open the port link that appears, such as localhost:8080/, on the terminal in your preferred browser.

You should see the documentation you created for your seed. The lineage graph should also work well for the seeds too.

Documentation on seeds
Documentation on seeds

Every seed at some point is left to grow on its own. We suppose this chapter has provided the necessary nutrition to see you bud to life working with dbt seeds!