Advanced DBT Selector Methods

This post is intended for DBT Core. I'm unsure of DBT Cloud's release process.

Dec 02, 2023

In the dynamic environment of DBT (Data Build Tool) projects, it's crucial for development teams to implement an automated, efficient testing strategy. This approach ensures that changes are thoroughly vetted before deployment, minimizing production failure risk and running the least code possible while maintaining full coverage. In this post, I'll guide you through an innovative method I've devised, explaining its importance and mechanics in detail.

Note: This workflow is executed via a Makefile command in Docker.

The --select argument is a key feature of the DBT run and build commands. It lets you filter models on various criteria, such as their state (e.g., new or modified) or configuration settings (e.g., incremental materialization). Though there are numerous selector methods (see the DBT docs), I'll focus on leveraging the state and configuration selectors.

By the end of this discussion, you'll have a deeper understanding of how to effectively use these selectors to enhance the reliability and efficiency of your DBT project workflows. Let’s get into it…

As mentioned above, I’m diving into the state and configuration methods within the --select argument. This is the meat and potatoes of this script, which allows you to filter out unneeded models and only run the things you care about, so we might as well start there. The first command we care about is:

dbt build --select state:new+ state:modified+ --defer --state $(<some_file_path>) --fail-fast

If you are unfamiliar with the DBT build command, check out the docs here.

So what’s happening here? To start, I’ve passed in the --select argument, selecting models with a new OR modified state plus every model downstream of them (the trailing +), as seen in the DAG. Secondly, I pass in the --defer argument. With defer, DBT resolves references to upstream models against the environment described by the --state manifest instead of requiring them to be built in the current one. Whether a reference is deferred depends on two criteria, as mentioned in the docs:

Is the referenced node included in the model selection criteria of the current run?
Does the referenced node exist as a database object in the current environment?

When the answer to both questions is no, DBT points the reference at the object recorded in the state manifest. Without getting too deep in the weeds, I’m reducing runtime by skipping work that has already been completed. For more information on how the defer argument relies on the manifest.json file and how these files can be adjusted for a higher level of control, dive into the documentation!

The final argument is --fail-fast, which makes DBT stop at the first failure and skip the remaining nodes in the current invocation, reducing execution time.

One thing to pay attention to in the DBT build command above is why DBT considers the two arguments within the select command to be OR and not AND. Can you see why?

Wait a minute, you missed the --state $(<some_file_path>) argument. Great catch, reader sitting somewhere in this world. This is another CYA moment: when executing DBT build, DBT runs, amongst other things, DBT compile. DBT compile overwrites the files in the target directory, and DBT identifies new or modified models by comparing the current project against the manifest found at the --state path (essentially a diff checker). Keeping the comparison artifacts in a separate path, populated by the additional DBT commands in the Makefile (introduced later in this post), means DBT always has a clean source of truth for these diff checks.
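If you want to see which nodes the state selectors would pick up before building anything, a small wrapper around dbt ls can help. This is only a sketch: it assumes your DBT version's ls command accepts the same --select and --state arguments as build, and the function name preview_changed is made up for illustration.

```shell
# Hypothetical helper (not part of DBT): list, without running, the nodes
# that "state:new+ state:modified+" would select.
# $1 is the directory holding the comparison manifest (the --state path).
preview_changed() {
    state_dir="$1"
    dbt ls --select state:new+ state:modified+ --state "$state_dir"
}
```

Calling preview_changed with the same path you pass to --state should print one selected node per line, which makes it easy to sanity check the selection before a long build.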

The second and final command I consider the meat and potatoes is…drum roll…

dbt run --select config.materialized:incremental,state:new config.materialized:incremental,state:modified --defer --state $(<some_file_path>) --fail-fast

This command is identical to the first DBT build, except for the models I’m selecting and the use of run instead of build. Why do we need this command? On a model’s first run the table does not exist yet, so is_incremental() evaluates to false and DBT skips the logic inside it, meaning you could have bad code nested in Jinja logic that may not be executed until it hits production. Because the preceding build has already created the table, this second run takes the is_incremental path and exercises that logic before it reaches production.

Test time: Above, I asked you to pay attention to the OR. Can you see how this logic includes AND?

If you see that a comma equals AND and a space equals OR, then you have a good eye. One thing to note is that if the AND expression has spaces on either side of the comma, DBT will fail.

To summarize the --select arguments, I’m saying, ‘select all models that are configured as incremental AND have a state of new, OR all models that are configured as incremental AND have a state of modified.’
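If the set logic is easier to see with concrete names, here is a toy sketch that mimics it with plain shell tools. It does not call DBT at all; the model names and the intersect and union helpers are invented for illustration, and each list is assumed to contain no duplicates.

```shell
# Toy model lists; every name here is hypothetical.
incremental="orders_inc
events_inc
sessions_inc"
new="events_inc
dim_customers"
modified="orders_inc
stg_payments"

# Comma = AND: keep only the names that appear in both lists.
intersect() {
    printf '%s\n%s\n' "$1" "$2" | sort | uniq -d
}

# Space = OR: keep every name that appears in either list.
union() {
    printf '%s\n%s\n' "$1" "$2" | sort -u
}

# (incremental AND new) OR (incremental AND modified)
selected=$(union "$(intersect "$incremental" "$new")" \
                 "$(intersect "$incremental" "$modified")")
echo "$selected"
```

Only events_inc and orders_inc survive: dim_customers and stg_payments changed but are not incremental, and sessions_inc is incremental but unchanged.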

Ok, so far, we have these two commands for testing our code before merging into main or production:

dbt build --select +state:new+ +state:modified+ --defer --state $(<some_file_path>) --fail-fast; \
dbt run --select config.materialized:incremental,state:new config.materialized:incremental,state:modified --defer --state $(<some_file_path>) --fail-fast;

The next step is to populate <some_file_path> with the compiled objects from the branch we’d like to compare. This looks something like:

git checkout $(SOME_BRANCH); \
git pull; \
docker exec -w "<file_path_to_dbt_project>" docker_container_1 dbt compile --target-path $(<some_file_path>)

The important thing here is that you run dbt compile in the same container in which the two dbt commands above will run. The --target-path (where dbt writes its compiled artifacts) is the same path that is passed to --state in every command shown so far.
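Since everything downstream depends on that compile step having worked, a quick guard before the build can save a confusing failure. The sketch below assumes the compile wrote a manifest.json into the target path; the function name state_is_ready is hypothetical.

```shell
# Hypothetical guard: confirm the comparison manifest exists before it is
# handed to --state. $1 is the path that was given to --target-path.
state_is_ready() {
    state_dir="$1"
    if [ -f "$state_dir/manifest.json" ]; then
        return 0
    fi
    echo "No manifest.json in $state_dir; run dbt compile --target-path $state_dir first." >&2
    return 1
}
```

Something like state_is_ready "<some_file_path>" || exit 1 placed ahead of the build would stop the script early when the comparison artifacts are missing.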

That’s pretty much it; throwing this together, the Makefile recipe will look something like this:

# Make sure all changes on the current branch have been committed 
@if [ -n "$$(git status --porcelain)" ]; then \
    echo "Uncommitted changes detected. Please commit them before running this script."; \
    exit 1; \
fi

# Everything below runs as one shell command (note the trailing backslashes).
# Make runs each recipe line in its own shell, so CURRENT_BRANCH would otherwise
# be lost before the final checkout. Steps:
#   1. Store the current branch name for future use
#   2. Checkout the branch for comparison (usually main/prod) and update it
#   3. Compile DBT objects in <some_file_path> for comparison
#   4. Checkout the branch with changes
#   5. Execute the commands explained above
@CURRENT_BRANCH=$$(git rev-parse --abbrev-ref HEAD); \
git checkout $(SOME_BRANCH); \
git pull; \
docker exec -w "<file_path_to_dbt_project>" docker_container_1 dbt compile --target-path $(<some_file_path>); \
git checkout $$CURRENT_BRANCH; \
docker exec -w "<file_path_to_dbt_project>" docker_container_1 sh -c ' \
    dbt build --select +state:new+ +state:modified+ --defer --state $(<some_file_path>) --fail-fast; \
    dbt run --select config.materialized:incremental,state:new config.materialized:incremental,state:modified --defer --state $(<some_file_path>) --fail-fast'

Minus a few echo commands and an env_var to allow for full refreshes on incremental tables, that script will have you covered in deployments.
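For the full-refresh switch, one possible shape is a tiny helper that turns an environment variable into DBT's --full-refresh flag. This is a sketch, not the exact code from my Makefile: FULL_REFRESH is a made-up variable name that DBT does not read on its own, and refresh_flag is a hypothetical helper.

```shell
# FULL_REFRESH is a hypothetical environment variable; DBT itself ignores it.
# Prints --full-refresh when FULL_REFRESH=true, and nothing otherwise.
refresh_flag() {
    if [ "${FULL_REFRESH:-false}" = "true" ]; then
        echo "--full-refresh"
    fi
}
```

The output can then be spliced into a command, for example dbt run --select config.materialized:incremental $(refresh_flag), so the flag only appears when the variable is set to true.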

I hope this Saturday morning blog post was worth your time reading! As always, feel free to reach out and connect on LinkedIn.

Arctic Insights, LLC
