Tech and Ethos

Building a Custom Diff Test for DBT in CI: Finding the Subtle Errors

Oct 15, 2024

Introduction

Our data warehouse powers a variety of critical workloads, including several automated decision-making processes, so data correctness is a critical requirement for our data marts. At the same time, we follow a continuous deployment approach, pushing changes to production several times each day. This makes it crucial to have a robust testing process in place before every deployment.

While the built-in CI tooling available in DBT helped us catch many issues, like duplicate rows or unexpected values, we noticed subtle errors still slipping through. Specifically, we wanted to make sure that the code changes in a PR wouldn't introduce unintended data changes that are hard to catch with DBT tests alone. This called for a way to compare the tables generated by the code change in the PR against their production counterparts at a row level.

We looked around for a solution that would fit our use case, exploring multiple tools and services. However, everything we found was either too complex, bundled with many features we did not need (and priced accordingly), or required cumbersome infrastructure setup. None of these options fit our needs exactly, so we decided to build our own lightweight custom solution: a CLI tool.

Leveraging DBT's Extensibility

One of the things that makes DBT especially powerful is its extensibility: even when something isn't natively supported, you can often find a mechanism to customize behavior. In our case, the solution lay in DBT's meta field, which lets users attach additional metadata to any resource (learn more in the DBT documentation). In our implementation, we only needed to run these additional diff tests on certain public-facing presentation tables. For tables where we wanted to apply the diff checks, we configured primary keys in the DBT meta field like this:

meta:
  dbt_diff:
    primary_keys:
      - PAYMENT_ID
    enabled: true

This configuration lets us mark which tables should be included in the diff testing while specifying the primary keys that uniquely identify their rows.

Building the CI Diff Test Workflow

Once a PR is raised, our CI process starts by using DBT's defer feature, which identifies the changed models and builds them in the CI environment (learn more about defer in the DBT documentation). After this step, the CLI tool takes over for the diff test. It reads the generated run_results.json file to determine which models were successfully built during CI, then cross-references manifest.json to keep only the models that have the diff configuration set in their metadata. For each table that qualifies, we call a Snowflake stored procedure that performs a row-level comparison. The procedure takes the fully qualified names (FQNs) of the production and CI tables, along with the specified primary keys, and compares their rows.
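To make the selection step concrete, here is a simplified Python sketch of how such a CLI might pick the tables to diff. The artifact paths, the models_to_diff helper, and the way the production FQN is derived are illustrative assumptions rather than the exact implementation:

import json

# Paths to the dbt artifacts produced by the CI run (illustrative; the
# actual locations depend on how the CI job is set up).
RUN_RESULTS_PATH = "target/run_results.json"
MANIFEST_PATH = "target/manifest.json"


def load_json(path):
    with open(path) as f:
        return json.load(f)


def models_to_diff(prod_database, prod_schema):
    """Yield (prod_fqn, ci_fqn, primary_keys) for models that were built
    successfully in CI and have the dbt_diff meta config enabled."""
    run_results = load_json(RUN_RESULTS_PATH)
    manifest = load_json(MANIFEST_PATH)

    # unique_ids of models that dbt built successfully during the CI run
    succeeded = {
        result["unique_id"]
        for result in run_results["results"]
        if result["status"] == "success"
    }

    for unique_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model" or unique_id not in succeeded:
            continue

        # The meta block can appear on the node itself or under its config,
        # depending on where it was defined and the dbt version in use.
        meta = node.get("meta") or node.get("config", {}).get("meta") or {}
        diff_config = meta.get("dbt_diff", {})
        if not diff_config.get("enabled"):
            continue

        relation = node.get("alias") or node["name"]
        # The CI manifest reflects the CI target, so these coordinates point
        # at the table built for this PR.
        ci_fqn = f'{node["database"]}.{node["schema"]}.{relation}'
        # Simplification: assume the production counterpart keeps the same
        # relation name and lives in a known production database and schema.
        prod_fqn = f"{prod_database}.{prod_schema}.{relation}"
        yield prod_fqn, ci_fqn, diff_config["primary_keys"]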

Snowflake Stored Procedure for Table Comparison

Here's a quick look at the signature of the stored procedure, table_diff_analysis, that we use to compare the tables:

table_diff_analysis(
    reference_database STRING,
    reference_schema STRING,
    reference_table STRING,
    target_database STRING,
    target_schema STRING,
    target_table STRING,
    primary_key_columns_csv STRING
)
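For each qualifying table pair, the CLI then issues the call, for example through the Snowflake Python connector. The sketch below is illustrative: the connection parameters, the environment variable names, and the ANALYTICS/PRESENTATION production coordinates are assumptions, and it reuses the hypothetical models_to_diff helper from the earlier sketch.

import os

import snowflake.connector


def run_table_diff(conn, prod_fqn, ci_fqn, primary_keys):
    """Call the table_diff_analysis procedure for one production/CI table pair."""
    ref_db, ref_schema, ref_table = prod_fqn.split(".")
    tgt_db, tgt_schema, tgt_table = ci_fqn.split(".")

    with conn.cursor(snowflake.connector.DictCursor) as cur:
        cur.execute(
            "CALL table_diff_analysis(%s, %s, %s, %s, %s, %s, %s)",
            (
                ref_db, ref_schema, ref_table,
                tgt_db, tgt_schema, tgt_table,
                ",".join(primary_keys),  # primary_key_columns_csv
            ),
        )
        return cur.fetchall()


# Usage sketch: credentials come from CI secrets; the database and schema
# names passed to models_to_diff are hypothetical.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
)
for prod_fqn, ci_fqn, primary_keys in models_to_diff("ANALYTICS", "PRESENTATION"):
    diff_rows = run_table_diff(conn, prod_fqn, ci_fqn, primary_keys)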

The result of the stored procedure includes several key insights, such as:

  • Column Additions/Removals: Differences in schema between the production and CI tables.
  • Column Data Type Changes: Columns whose data types have changed.
  • Row Count Differences: Counts of rows unique to the production and CI versions.
  • Mismatch Summary: For each column, the count of rows (matched by primary key) whose values differ.

The result for each table is added to the pull request as a comment, along with a summary of the most significant differences (see the sketch after the example below). Here is an example of a comment added to the PR:

PR Comment Example
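The post doesn't go into how the comment is attached to the PR, so here is a minimal sketch under the assumption that the repository is hosted on GitHub and the CI job exposes a token and the PR number; the environment variable names and the Markdown body are illustrative.

import os

import requests


def post_pr_comment(summary_markdown):
    """Post the diff summary to the pull request as a comment (assumes GitHub)."""
    repo = os.environ["GITHUB_REPOSITORY"]  # e.g. "org/repo", set by GitHub Actions
    pr_number = os.environ["PR_NUMBER"]     # assumed to be exported by the CI job
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"

    response = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": summary_markdown},
        timeout=30,
    )
    response.raise_for_status()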

Conclusion

DBT's extensibility allowed us to implement a lightweight yet powerful solution for an important data quality check. This additional check has increased confidence in our changes, given reviewers and authors an overview of the changes in a PR, and reduced the possibility of unintended changes landing in production. If you're considering implementing a similar solution or have run into issues with ensuring data consistency, we'd love to hear about your experience!

About the Author

Ram joined Ethos in 2022 as an Analytics Engineer. He works with Data Platforms and has contributed to bullet-proofing the Ethos data warehouse. When Ram isn't coding, he enjoys working out, traveling, reading books, and exploring new technologies.

Interested in joining our team? Learn more about our career opportunities here.