Chapter 10 dbt Expectations package

What is dbt-expectations? dbt-expectations is an extension package for dbt which works much akin to the Great Expectations package for Python. It was intentionally designed to provide Great Expectations like features in dbt, but now from dbt itself rather than integrating Great Expectations (GE).

Unless you’ve used GE, you may be wondering what this is in the first place, and its okay to feel lost. GE is much like tests in the previous chapter, it conducts quality tests on your data, thus flagging those that deviate from the set assertions.

I would put dbt-expectations and GE on the same plane and use an allegory to drive the point home: that of a car. When buying a car, there are some common checklist items, and others bespoke depending on your car model. For example, an ordinary car must have the following features:

have four wheels
have a driver’s seat
have a gear (whether manual or automatic)
have headlights
have a windshield

The above list can go on and on depending on your knowledge of cars. But your checklist can also contain some unique items, but which are a must-have depending on your car make. For example, here is a checklist of the Volvo XC60 T6:

0.9l/100km fuel consumption
Allowed emissions 22g/km (the less the better)
Hybrid fuel type

So if you go to a showrooms and the beautiful or handsome sales agent takes you to the Volvo XC60, you will be perusing it as you cross your checklist. dbt-expectations and GE work in the same way.

10.1 `dbt-expectations` installation

According to the documentation dbt-expectations will work for dbt versions 1.7x and higher. Let’s first pass this little test.

dbt --version

If you get your dbt-core version is above 1.7x, then you can proceed. If not, you need to update your dbt. You can do so using python -m pip install --upgrade dbt-core or if you want to be more specific, this will do: python -m pip install --upgrade dbt-core==0.19.0.

Ours, at the moment of writing this book, was version 1.8.7. Therefore we have a clean bill of health to proceed.

dbt-expectations isn’t installed in the same type and enter kind of means like we did for dbt-core and dbt big-query. Nevertheless, some code is written in some YAML files and from henceforth dbt recognises it.

First create a packages.yml file in the same level as your dbt_project.yml file. You can do so by running this command:

touch packages.yml

On the packages.yml file, insert the following:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]

Apart from that, the dbt-date dependency must also be installed. This is because dbt-expectations references it. However, this will be installed in the dbt_project.yml file rather than the packages.yml file. So inside the dbt_project.yml paste the following just before the materializations dictionary.

vars:
  'dbt_date:time_zone': 'Africa/Nairobi'

You may insert any valid timezone apart from the one specified above, but we highly suggest that you use your timezone.

Now run dbt deps to seal the deal by installing the dbt-expectations package.

dbt deps

Here is the output showing the successful installation of the package in our environment.

19:31:10  Running with dbt=1.8.7
19:31:12  Updating lock file in file path: /home/sammigachuhi/dbt_book2/dbt_book/package-lock.yml
19:31:13  Installing calogica/dbt_expectations
19:31:44  Installed from version 0.10.4
19:31:44  Up to date!
19:31:44  Installing calogica/dbt_date
19:31:45  Installed from version 0.10.1
19:31:45  Up to date!

10.2 Types of `dbt-expectations` tests

dbt-expectations comes with a plethora of tests’ functions which can be classified into the following categories.

Table shape
Missing values, unique values, and types
Sets and ranges
String matching
Aggregate functions
Multi-column
Distributional functions

We will perform one test in each category just to exemplify the potential of dbt-expectations.

10.2.1 Table shape

10.2.1.1 `expect_table_row_count_to_equal_other_table`

Description: Expect the number of rows in a model match another model.

We will expect the citi_trips_round and the citi_trips_minutes tables to have the same number of rows since their respective models used the same citi_bike_trips table. Therefore, the two tables should pass this test, or will they?

Since we only want to concentrate on the models within the my_models directory, just run: dbt test --select models/my_models

Here is the output.

-- snip --
07:57:38  2 of 8 FAIL 1 dbt_expectations_expect_table_row_count_to_equal_other_table_citi_trips_minutes_ref_citi_trips_round_  [FAIL 1 in 4.83s]
07:57:38  3 of 8 START test dbt_expectations_expect_table_row_count_to_equal_other_table_citi_trips_round_ref_citi_trips_minutes_  [RUN]
07:57:42  3 of 8 FAIL 1 dbt_expectations_expect_table_row_count_to_equal_other_table_citi_trips_round_ref_citi_trips_minutes_  [FAIL 1 in 4.53s]
-- snip --

This leaves one puzzled, why?

A close look at the model for citi_trips_round reveals the answer. This model was designed to only work on non-null rows, unlike the citi_trips_minutes which worked on all rows, null or not. Therefore the citi_trips_round had less rows and thus the generated the error. In case your tests results seem incongruent, it is always good to recheck the models to refresh your memory, as we did here.

In fact, dbt did a good job of generating an SQL to show us the error:

07:58:03  Failure in test dbt_expectations_expect_table_row_count_to_equal_other_table_citi_trips_minutes_ref_citi_trips_round_ (models/my_models/my_models.yml)
07:58:03    Got 1 result, configured to fail if != 0
07:58:03  
07:58:03    compiled code at target/compiled/dbt_book/models/my_models/my_models.yml/dbt_expectations_expect_table__c00100dada30a31f15f90b9c1ba0b295.sql

If you click on the destination of the SQL statement and copy the contents to the SQL query tab of BigQuery, you will see the difference in row count for the two tables.

Row count difference

10.2.2 Missing values, unique values, and types

10.2.2.1 `expect_column_values_to_not_be_null`

Description: Expect column values to not be null.

This is a no-brainer kind of test. Can you guess which columns in any of our tables in the nyc_bikes dataset should never be null? Here is a clue: station_id and station names, unless the biker teleports to or from somewhere!

So on the my_models YAML file, insert the below test on the start_station_id, start_station_name, end_station_id and end_station_name fields.

tests:
  - dbt_expectations.expect_column_values_to_not_be_null

You will get some interesting results. some of the tests fail for the citi_trips_minutes model because of the many null rows in the table. However, none of this particular test fail for the citi_trips_round table; it has zero null rows.

A caveat when using the dbt_expectations.expect_column_values_to_not_be_null, only add a colon : when specifying more optional parameters such as row_condition: "id is not null" # (Optional). Otherwise, leave it out.

10.2.3 Sets and Ranges

10.2.3.1 `expect_column_values_to_be_in_set`

Description: Expect each column value to be in a given set.

This test works best for where you are sure that a certain column will only accept certain values. A good example is the gender column. There can only be three results: male, female and other. Here we insert the test in our citi_trips_minutes and citi_trips_round models.

- name: gender
  description: "Gender (unknown, male, female)"
  tests:
    - dbt_expectations.expect_column_values_to_be_in_set:
        value_set: ['unknown','male','female']

If you run the above test for both the citi_trips_minutes and citi_trips_round models, the test will fail for the former. Why, because of the pesky null rows. However, to take the null rows into consideration and take them as accepted values in the citi_trips_minutes model only, we simply add an empty quotation marks, like so (''). Here is our modified test: value_set: ['', 'unknown','male','female']. The test will then pass for our citi_trips_minutes table.

10.2.4 String matching

10.2.4.1 `expect_column_value_lengths_to_be_between`

Description: Expect column entries to be strings with length between a min_value value and a max_value value (inclusive).

Because our citi_trips_minutes table has several null rows, we will put the minimum expected value to be 0. Since we also want to catch those station names with overly long names, we will put the max value as 70. So for both the start_station_name and the end_station_name, we inserted the following test:

tests:
  - dbt_expectations.expect_column_values_to_not_be_null
  - dbt_expectations.expect_column_value_lengths_to_be_between:
      min_value: 1 # (Optional)
      max_value: 70 # (Optional)

We are glad to know both models passed this simple test.

10.2.5 Aggregate

10.2.5.1 `expect_column_max_to_be_between`

Description: Expect the column max to be between a min and max value

You may wonder what the purpose of this test is. But don’t dismiss it yet, it can come quite in handy when searching for outlier values. We will demonstrate it in catching overly long bike trips. However, this test needs some background knowledge of your data.

Applying the below queries on BigQuery will help.

SELECT AVG(trip_duration_min) FROM dbt-project-437116.nyc_bikes.citi_trips_minutes; -- 16

SELECT MAX(trip_duration_min) FROM dbt-project-437116.nyc_bikes.citi_trips_minutes; -- 325167.48

SELECT * FROM dbt-project-437116.nyc_bikes.citi_trips_minutes
WHERE trip_duration_min > 200000;

Now lets place the limits of our maximum trip duration values to be between the average of 16 and some intermediate value such as 100,000 minutes (1,666 hours)!

tests:
  - dbt_expectations.expect_column_max_to_be_between:
      min_value: 16 # (Optional)
      max_value: 100000 # (Optional)

That will flag off some errors, but if you change the max_value parameter to 360000, the tests will pass. However, in the trip_min_round field of the citi_trips_round model, we set the max_value as 100000 to demonstrate an error of this test.

10.2.6 Multi-column

10.2.6.1 `expect_column_pair_values_A_to_be_greater_than_B`

Description: Expect values in column A to be greater than column B.

This kind of test comes in handy when you want to ensure that one of your columnar values is greater than, or less than that of a different column. A good example is a comparison of trip duration in seconds in the tripduration column versus the trip duration in minutes from trip_min_round column in our citi_trips_round table. Definitely time in seconds will always have a greater value in terms of length than the more concise minutes values!

In our citi_trips_round model, insert the test as follows:

- name: citi_trips_round
    description: '{{ doc("citi_trips_round") }}'
    tests:
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: ref("citi_trips_minutes")
      - dbt_expectations.expect_column_pair_values_A_to_be_greater_than_B:
          column_A: tripduration
          column_B: trip_min_round

It surely does pass the test.

-- snip --
09:40:46  5 of 26 PASS dbt_expectations_expect_column_pair_values_A_to_be_greater_than_B_citi_trips_round_tripduration__trip_min_round  [PASS in 2.20s]

10.2.7 Distributional functions

This is another category of shipped-in tests of dbt-expectations. However, they require some statistical homework to be conducted on your data prior to applying the tests. The tests under this category include: expect_column_values_to_be_within_n_moving_stdevs, expect_column_values_to_be_within_n_stdevs and expect_row_values_to_have_data_for_every_n_datepart.

Chapter 10 dbt Expectations package

10.1 dbt-expectations installation

10.2 Types of dbt-expectations tests