From b9049a8f128cdab4ee08fd85f6d0dc6c84071ab4 Mon Sep 17 00:00:00 2001 From: Jeremy Cohen Date: Thu, 20 Apr 2023 17:08:00 +0200 Subject: [PATCH 1/6] Draft some revisions to model versions --- .../docs/collaborate/govern/model-versions.md | 144 ++++++++++++++++-- 1 file changed, 130 insertions(+), 14 deletions(-) diff --git a/website/docs/docs/collaborate/govern/model-versions.md b/website/docs/docs/collaborate/govern/model-versions.md index 8f61f5fb432..40f10c21a73 100644 --- a/website/docs/docs/collaborate/govern/model-versions.md +++ b/website/docs/docs/collaborate/govern/model-versions.md @@ -9,22 +9,35 @@ description: "Version models to help with lifecycle management" This functionality is new in v1.5. ::: -API versioning is a _complex_ problem in software engineering. It's also essential. Our goal is to _overcome obstacles to transform a complex problem into a reality_. +Versioning APIs is a challenging problem in software engineering. The goal of model versions is not to make the problem go away, or pretend it's easier than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it. ## Related documentation - [`versions`](resource-properties/versions) - [`latest_version`](resource-properties/latest-version) - [`include` & `exclude`](resource-properties/include-exclude) +- [`ref` with `version` argument](ref#versioned-ref) ## Why version a model? -If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's contract in a way that "breaks" the previous set of parameters. +If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's structure in a way that "breaks" the previous set of guarantees. 
-One approach is to force every model consumer to immediately handle the breaking change when it's deployed to production. While this may work at smaller organizations or while iterating on an immature set of data models, it doesn’t scale well beyond that. +One approach is to force every model consumer to immediately handle the breaking change as soon as it's deployed to production. This is actually the appropriate answer at many smaller organizations, or while rapidly iterating on a not-yet-mature set of data models. But it doesn’t scale well beyond that. -Instead, the model owner can create a **new version**, during which consumers can migrate from the old version to the new. +Instead, for mature models at larger organizations, the model owner can create a **new version**, during which consumers can migrate from the old version to the new. -In the meantime, anywhere that model is used downstream, it can be referenced at a specific version. +In the meantime, anywhere that model is used downstream, it can continue to be referenced at a specific version. + +In the future, we intend to also add support for **deprecating models**. Taken together, model versions and deprecation offer a pathway for _sunsetting_ and _migrating_. In the short term, avoid breaking everyone's queries. Over the longer term, older & unmaintained versions go away—they do **not** stick around forever. + +## When should you version a model? + +Many changes to a model are not breaking, and do not require a new version! Examples include adding a new column, or fixing a bug in modeling logic. + +By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. + +It's also possible to change the model in more subtle ways — by recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics—but would substantially change the results seen by downstream queriers. 
+ +The process of sunsetting and migrating model versions requires real work, and may require significant coordination across teams. If, instead of using model versions, you opt for non-breaking changes wherever possible—that's a completely legitimate approach. Even so, after a while, you'll find yourself with lots of unused or deprecated columns. Many teams will want to consider a predictable cadence (once or twice a year) for bumping the version of their mature models, and taking the opportunity to remove no-longer-used columns. ## How is this different from "version control"? @@ -32,9 +45,40 @@ In the meantime, anywhere that model is used downstream, it can be referenced at Model versions are different. Multiple versions of a model will live in the same code repository at the same time and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted. -dbt's model `versions` makes it possible to define multiple versions: -- That share the same "reference" name -- While reusing the same top-level properties, highlighting just their differences +## How is this different from just creating a new model? + +Honestly, it's only a little bit different! There isn't much magic here, and that's by design. + +You've always been able to create a new model, and name it `dim_customers_v2`. Why should you opt for a "real" versioned model instead? + +First, the versioned model preserves its _reference name_. Versioned models are `ref`'d by their _model name_, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword. 
+ + + +```sql +{{ ref('dim_customers') }} -- resolves to latest +{{ ref('dim_customers', version=2) }} -- resolves to v2 +``` + + + +Second, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you an opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live. + +Third, dbt supports `version`-based selection. For example, you could define a [default yaml selector](node-selection/yaml-selectors#default), to avoid running any old model versions in development—even as you continue to run them in production through a sunset and migration period: + +```yml +selectors: + - name: exclude_old_versions + default: "{{ target.name == 'dev' }}" + definition: + method: fqn + value: "*" + exclude: + - method: version + value: old +``` + +Finally, we intend to add support for **deprecating models** in dbt Core v1.6. When you slate a versioned model for deprecation, dbt will be able to provide more helpful warnings to downstream consumers of that model. Rather than just, "This model is going away," it's - "This older version of the model is going away, and there's a new version coming soon." ## How to create a new version of a model @@ -67,7 +111,7 @@ If you wanted to make a breaking change to the model - for example, removing a c ```yaml models: - name: dim_customers - latest_version: 2 + latest_version: 1 config: materialized: table contract: @@ -92,19 +136,62 @@ models: The above configuration will create two models (one for each version), and produce database relations with aliases `dim_customers_v1` and `dim_customers_v2`. -By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. 
It is possible to override this by setting a `defined_in` property. +By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!) + +The `latest_version` would be `2` (numerically greatest) if not specified explicitly. In this case, `v1` is specified to still be the latest; `v2` is just a prerelease in early development. When ready to roll out `v2` to everyone by default, bump the `latest_version` to `2` (or remove it from the specification). -You can reconfigure each version independently. For example, if you wanted `dim_customers.v1` to continue populating the database table named `dim_customers` (its original name), you could use the `defined_in` configuration: +### Configuring versioned models + +You can reconfigure each version independently. For example, you could materialize `v2` as a table and `v1` as a view: + + + +```yml +versions: + - v: 2 + config: + materialized: table + - v: 1 + config: + materialized: view +``` + + + +Like with all config inheritance, any configs set _within_ the versioned model's definition (`.sql` or `.py` file) will take precedence over the configs set in yaml. + +### Configuring database location with `alias` + +Following the example, let's say you wanted `dim_customers.v1` to continue populating the database table named `dim_customers`. That's what the table was named previously, and you may have several other dashboards or tools expecting to read its data from `..dim_customers`. + +You could use the `alias` configuration: + + ```yml - v: 1 - defined_in: dim_customers # keep original relation name + config: + alias: dim_customers # keep v1 in its original database location +``` + + + +Or, you could define a separate view that always points to the latest version of the model. 
We recommend this pattern because it's the most transparent and easiest to follow. + + + +```sql +{{ config(alias = 'dim_customers') }} + +select * from {{ ref('dim_customers') }} ``` + + :::info -Projects which have historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro will need to update their custom implementations to account for model versions. +If your project has historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro, and you'd like to start using model versions, you should update your custom implementation to account for model versions. Specifically, we'd encourage you to add [a condition like this one](https://github.com/dbt-labs/dbt-core/blob/ada8860e48b32ac712d92e8b0977b2c3c9749981/core/dbt/include/global_project/macros/get_custom_name/get_custom_alias.sql#L26-L30). -Otherwise, they'll see something like this as soon as they start using versions: +Your existing implementation of `generate_alias_name` should not encounter any errors upon first upgrading to v1.5. It's only when you create your first versioned model, that you may see an error like: ```sh dbt.exceptions.AmbiguousAliasError: Compilation Error @@ -114,3 +201,32 @@ dbt.exceptions.AmbiguousAliasError: Compilation Error - model.project_name.model_name.v1 (models/.../model_name.sql) - model.project_name.model_name.v2 (models/.../model_name_v2.sql) ``` + +We opted to use `generate_alias_name` for this functionality so that the logic remains accessible to end users, and could be reimplemented with custom logic. + +### Optimizing model versions + +How you define each model version is completely up to you. While it's easy to start by copy-pasting from one model's SQL definition into another, you should think about _what actually is changing_ from one version to another. 
+
+For example, if your new model version is only renaming or removing certain columns, you could define one version as a view on top of the other one:
+
+
+
+```sql
+{{ config(materialized = 'view') }}
+
+{% set dim_customers_v1 = ref('dim_customers', v=1) %}
+
+select
+{{ dbt_utils.star(from=dim_customers_v1, except=["country_name"]) }}
+from {{ dim_customers_v1 }}
+```
+
+
+
+Of course, if one model version makes meaningful and substantive changes to logic in another, it may not be possible to optimize it in this way. At that point, the value of human intuition and legibility outweighs the cost of recomputing similar transformations.
+
+We expect to develop more opinionated recommendations as teams start adopting model versions in practice. One recommended pattern we can envision: Prioritize the definition of the `latest_version`, and define other versions (old and prerelease) based on their diffs from the latest. How?
+- Define the properties and configuration for the latest version in the top-level model yaml, and the diffs for other versions below (via `include`/`exclude`)
+- Where possible, define other versions as `select` transformations, which take the latest version as their starting point
+- When bumping the `latest_version`, migrate the SQL and yaml accordingly. In this case, we would see if it's possible to redefine `v1` with respect to `v2`.
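+
+A minimal sketch of the first recommendation above, assuming the two-column `dim_customers` example and assuming `v2` has since been promoted to latest. The top level describes `v2`, and `v1` re-adds only the column that `v2` dropped. The exact `include` usage here is an assumption and should be checked against the [`include` & `exclude`](resource-properties/include-exclude) reference:
+
+```yaml
+models:
+  - name: dim_customers
+    latest_version: 2
+    config:
+      materialized: table
+      contract:
+        enforced: true
+    # top-level columns describe the latest version (v2)
+    columns:
+      - name: customer_id
+        description: This is the primary key
+        data_type: int
+
+    versions:
+      - v: 2
+        # no overrides: v2 matches the top-level properties
+      - v: 1
+        columns:
+          - include: all
+          # v1 still exposes the column that v2 removed
+          - name: country_name
+            description: Where this customer lives
+            data_type: varchar
+```
+
+With this layout, the yaml a reader sees first is the version most consumers should be using, and each older version reads as a short list of differences from it.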
From cb26fd1dc6a9b26c62ac044a0df2822e5e40d05a Mon Sep 17 00:00:00 2001 From: Jeremy Cohen Date: Thu, 20 Apr 2023 22:19:29 +0200 Subject: [PATCH 2/6] Side-by-side example --- .../docs/collaborate/govern/model-versions.md | 58 +++++++++++++++++-- 1 file changed, 52 insertions(+), 6 deletions(-) diff --git a/website/docs/docs/collaborate/govern/model-versions.md b/website/docs/docs/collaborate/govern/model-versions.md index 40f10c21a73..8e7a5aab8a5 100644 --- a/website/docs/docs/collaborate/govern/model-versions.md +++ b/website/docs/docs/collaborate/govern/model-versions.md @@ -104,7 +104,10 @@ models: -If you wanted to make a breaking change to the model - for example, removing a column - you'd create a new model file (SQL or Python) encompassing those breaking changes. The default convention is naming the new file with a `_v` suffix. The new version can then be configured in relation to the original model: +If you wanted to make a breaking change to the model - for example, removing a column - you'd create a new model file (SQL or Python) encompassing those breaking changes. The default convention is naming the new file with a `_v` suffix. The new version can then be configured in relation to the original model, in a way that highlights the diffs between them. Or, you can choose to define each model version with full specifications, and repeat the values they have in common. + + + @@ -114,8 +117,7 @@ models: latest_version: 1 config: materialized: table - contract: - enforced: true + contract: {enforced: true} columns: - name: customer_id description: This is the primary key @@ -123,17 +125,61 @@ models: - name: country_name description: Where this customer lives data_type: varchar + + # declare the versions, and just highlight the diffs + versions: + - v: 2 + columns: + - include: all + exclude: [country_name] # this is the breaking change! 
+ - v: 1 + # no need to redefine anything -- matches the properties defined above +``` + + + + + + + + + +```yaml +models: + - name: dim_customers + latest_version: 1 + + # declare the versions, and fully specify them versions: - v: 2 + config: + materialized: table + contract: {enforced: true} columns: - - include: "*" - exclude: - - country_name # this is the breaking change! + - name: customer_id + description: This is the primary key + data_type: int + # no country_name column + - v: 1 + config: + materialized: table + contract: {enforced: true} + columns: + - name: customer_id + description: This is the primary key + data_type: int + - name: country_name + description: Where this customer lives + data_type: varchar ``` + + + + The above configuration will create two models (one for each version), and produce database relations with aliases `dim_customers_v1` and `dim_customers_v2`. By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!) 
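+
+As a hedged illustration of that override, the sketch below keeps `v1` in a file that does not follow the naming convention. The file name `legacy_dim_customers` is hypothetical, and the value is assumed to be written without the `.sql` extension:
+
+```yaml
+models:
+  - name: dim_customers
+    latest_version: 1
+    versions:
+      - v: 2
+        # defined in dim_customers_v2.sql, following the convention
+      - v: 1
+        # hypothetical file name, overriding the dim_customers_v1.sql default
+        defined_in: legacy_dim_customers
+```
+
+Following the convention avoids this extra property entirely, which is part of why the override is discouraged.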
From e263bed8826226fdb4034189d01c8daefcf2f1d9 Mon Sep 17 00:00:00 2001 From: Jeremy Cohen Date: Mon, 24 Apr 2023 04:02:32 +0200 Subject: [PATCH 3/6] PR feedback --- _redirects | 1 + .../docs/collaborate/govern/model-access.md | 2 +- .../collaborate/govern/model-contracts.md | 2 +- .../docs/collaborate/govern/model-versions.md | 124 ++++++++++++------ website/docs/reference/model-properties.md | 2 +- .../{latest-version.md => latest_version.md} | 0 .../reference/resource-properties/versions.md | 23 ++-- website/sidebars.js | 2 +- 8 files changed, 106 insertions(+), 50 deletions(-) rename website/docs/reference/resource-properties/{latest-version.md => latest_version.md} (100%) diff --git a/_redirects b/_redirects index 6506cbeae7a..1b71bc255cd 100644 --- a/_redirects +++ b/_redirects @@ -278,6 +278,7 @@ docs/dbt-cloud/using-dbt-cloud/cloud-model-timing-tab /docs/deploy/dbt-cloud-job /docs/artifacts /docs/dbt-cloud/using-dbt-cloud/artifacts 301 /docs/bigquery-configs /reference/resource-configs/bigquery-configs 301 /reference/resource-properties/docs /reference/resource-configs/docs 301 +/reference/resource-properties/latest-version /reference/resource-configs/latest_version 301 /docs/building-a-dbt-project/building-models/bigquery-configs /reference/resource-configs/bigquery-configs 301 /docs/building-a-dbt-project/building-models/configuring-models /reference/model-configs /docs/building-a-dbt-project/building-models/enable-and-disable-models /reference/resource-configs/enabled 301 diff --git a/website/docs/docs/collaborate/govern/model-access.md b/website/docs/docs/collaborate/govern/model-access.md index 6dd5471c2a7..4099fde774d 100644 --- a/website/docs/docs/collaborate/govern/model-access.md +++ b/website/docs/docs/collaborate/govern/model-access.md @@ -6,7 +6,7 @@ description: "Define model access with group capabilities" --- :::info New functionality -This functionality is new in v1.5. 
+This functionality is new in v1.5. If you have thoughts, weigh in on the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6730)! ::: ## Related documentation diff --git a/website/docs/docs/collaborate/govern/model-contracts.md b/website/docs/docs/collaborate/govern/model-contracts.md index 21bfb980f95..3e65c4a593f 100644 --- a/website/docs/docs/collaborate/govern/model-contracts.md +++ b/website/docs/docs/collaborate/govern/model-contracts.md @@ -6,7 +6,7 @@ description: "Model contracts define a set of parameters validated during transf --- :::info New functionality -This functionality is new in v1.5. +This functionality is new in v1.5. If you have thoughts, weigh in on the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6726)! ::: ## Related documentation diff --git a/website/docs/docs/collaborate/govern/model-versions.md b/website/docs/docs/collaborate/govern/model-versions.md index 8e7a5aab8a5..5003d4c0f43 100644 --- a/website/docs/docs/collaborate/govern/model-versions.md +++ b/website/docs/docs/collaborate/govern/model-versions.md @@ -6,44 +6,59 @@ description: "Version models to help with lifecycle management" --- :::info New functionality -This functionality is new in v1.5. +This functionality is new in v1.5. If you have thoughts, weigh in on the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6736)! ::: -Versioning APIs is a challenging problem in software engineering. The goal of model versions is not to make the problem go away, or pretend it's easier than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it. +Versioning APIs is a hard problem in software engineering. At the root of the challenge is the fact that the producers and consumers of an API have competing incentives: +- Producers of an API need the ability to make changes to its logic. 
There is a real cost associated with maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier. +- Consumers of an API need to trust in its stability—their queries will keep working, and won't break without warning. There is a real cost associated with migrating to a newer API version, but unplanned migration is far costlier. + +The goal of model versions is not to make the problem go away, nor to pretend it's somehow easier or simpler than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it. ## Related documentation - [`versions`](resource-properties/versions) -- [`latest_version`](resource-properties/latest-version) +- [`latest_version`](resource-properties/latest_version) - [`include` & `exclude`](resource-properties/include-exclude) - [`ref` with `version` argument](ref#versioned-ref) ## Why version a model? -If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's structure in a way that "breaks" the previous set of guarantees. +If a model defines a ["contract"](model-contracts) (a set of guarantees for its structure), it's also possible to change that model's structure in a way that breaks the previous set of guarantees. This could be as obvious as removing or renaming a column, or more subtle, like changing its data type or nullability. One approach is to force every model consumer to immediately handle the breaking change as soon as it's deployed to production. This is actually the appropriate answer at many smaller organizations, or while rapidly iterating on a not-yet-mature set of data models. But it doesn’t scale well beyond that. -Instead, for mature models at larger organizations, the model owner can create a **new version**, during which consumers can migrate from the old version to the new. 
+Instead, for mature models at larger organizations, powering queries inside & outside dbt, the model owner can use **model versions** to: +- Test "prerelease" changes (in production, in downstream systems) +- Bump the latest version, to be used as the canonical source of truth +- Offer a migration window off the "old" version + +During that migration window, anywhere that model is being used downstream, it can continue to be referenced at a specific version. -In the meantime, anywhere that model is used downstream, it can continue to be referenced at a specific version. +In the future, dbt will also offer first-class support for **deprecating models** ([dbt-core#7433](https://github.com/dbt-labs/dbt-core/issues/7433)). Taken together, model versions and deprecation offer a pathway for model producers to _sunset_ old models, and consumers the time to _migrate_ across breaking changes. It's a way of managing change across an organization: develop a new version, bump the latest, slate the old version for deprecation, update downstream references, and then remove the old version. -In the future, we intend to also add support for **deprecating models**. Taken together, model versions and deprecation offer a pathway for _sunsetting_ and _migrating_. In the short term, avoid breaking everyone's queries. Over the longer term, older & unmaintained versions go away—they do **not** stick around forever. +There is a real trade-off that exists here—the cost to frequently migrate downstream code, and the cost (and clutter) of materializing multiple versions of a model in the data warehouse. Model versions do not make that problem go away, but by setting a deprecation date, and communicating a clear window for consumers to gracefully migrate off old versions, they put a known boundary on the cost of that migration. ## When should you version a model? -Many changes to a model are not breaking, and do not require a new version! 
Examples include adding a new column, or fixing a bug in modeling logic. +By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. These changes, when made intentionally, would require a new model version. But many changes are not breaking, and don't require a new version—such as adding a new column, or fixing a bug in an existing column's calculation. -By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. +Of course, it's possible to change a model's definition in other ways—recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics—but would substantially change the results seen by downstream queriers. -It's also possible to change the model in more subtle ways — by recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics—but would substantially change the results seen by downstream queriers. +This is always a judgment call. As the maintainer of a widely-used model, you know best what's a bug fix and what's an unexpected behavior change. -The process of sunsetting and migrating model versions requires real work, and may require significant coordination across teams. If, instead of using model versions, you opt for non-breaking changes wherever possible—that's a completely legitimate approach. Even so, after a while, you'll find yourself with lots of unused or deprecated columns. Many teams will want to consider a predictable cadence (once or twice a year) for bumping the version of their mature models, and taking the opportunity to remove no-longer-used columns. +The process of sunsetting and migrating model versions requires real work, and likely significant coordination across teams. You should opt for non-breaking changes whenever possible. 
Inevitably, however, these non-breaking additions will leave your most important models with lots of unused or deprecated columns. + +Rather than constantly adding a new version for each small change, you should opt for a predictable cadence (once or twice a year, communicated well in advance) where you bump the "latest" version of your model, removing columns that are no longer being used. ## How is this different from "version control"? -[Version control](git-version-control) allows your team to collaborate simultaneously on a single code repository, manage conflicts between changes, and review changes before deploying into production. In that sense, version control is an essential tool for versioning the deployment of an entire dbt project—always the latest state of the `main` branch, with the ability to "rollback" changes by reverting a commit or pull request. In general, only one version of your project code is deployed into an environment at a time. +[Version control](git-version-control) allows your team to collaborate simultaneously on a single code repository, manage conflicts between changes, and review changes before deploying into production. In that sense, version control is an essential tool for versioning the deployment of an entire dbt project—always the latest state of the `main` branch, with the ability to roll back changes by reverting a commit or pull request. In general, only one version of your project code is deployed into an environment at a time. + +When you make updates to a model's source code—its logical definition, in SQL or Python, or related configuration—dbt can [compare your project to previous state](project-state), enabling you to rebuild only models that have changed, and models downstream of a change. In this way, it's possible to develop changes to a model, quickly test in CI, and efficiently deploy into production—all coordinated via your version control system. -Model versions are different. 
Multiple versions of a model will live in the same code repository at the same time and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted. +**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. You need to do your part by offering a migration path, with clear diffs and deprecation dates. + +Multiple versions of a model will live in the same code repository at the same time, and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted. ## How is this different from just creating a new model? @@ -51,20 +66,30 @@ Honestly, it's only a little bit different! There isn't much magic here, and tha You've always been able to create a new model, and name it `dim_customers_v2`. Why should you opt for a "real" versioned model instead? -First, the versioned model preserves its _reference name_. Versioned models are `ref`'d by their _model name_, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword. +As the **producer** of a versioned model: +1. You keep track of all live versions in one place +2. You can reuse most configuration, and highlight just the diffs +3. You can select models to build (or not) based on their version - +As the **consumer** of a versioned model: +1. You use a consistent `ref`, with the option of pinning to a specific live version +2. 
You will be notified throughout the life cycle of a versioned model -```sql -{{ ref('dim_customers') }} -- resolves to latest -{{ ref('dim_customers', version=2) }} -- resolves to v2 -``` +All versions of a model preserve the model's original name. They are `ref`'d by that name, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword. - +Let's say that `dim_customers` has three versions defined: `v2` is the "latest", `v3` is "prerelease," and `v1` is an old version that's still within its deprecation window. Because `v2` is the latest version, it gets some special treatment: it can be defined in a file without a suffix, and `ref('dim_customers')` will resolve to `v2` if a version pin is not specified. The table below breaks down the standard conventions: + +| v | `ref` syntax | File name | Database relation | +|---|-------------------------------------------------------|-------------------------------------------------|--------------------------------------------------------------------------| +| 3 | `ref('dim_customers', v=3)` | `dim_customers_v3.sql` | `analytics.dim_customers_v3` | +| 2 | `ref('dim_customers')` or `ref('dim_customers', v=2)` | `dim_customers_v2.sql` (or `dim_customers.sql`) | `analytics.dim_customers_v2` and `analytics.dim_customers` (recommended) | +| 1 | `ref('dim_customers', v=1)` | `dim_customers_v1.sql` | `analytics.dim_customers_v1` | -Second, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you an opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live. 
+As you'll see in the implementation section below, a versioned model can reuse the majority of its yaml properties and configuration. Each version only needs to say how it _differs_ from the shared set of attributes. This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which are otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live. -Third, dbt supports `version`-based selection. For example, you could define a [default yaml selector](node-selection/yaml-selectors#default), to avoid running any old model versions in development—even as you continue to run them in production through a sunset and migration period: +dbt also supports [`version`-based selection](node-selection/methods#the-version-method). For example, you could define a [default yaml selector](node-selection/yaml-selectors#default), to avoid running any old model versions in development—even as you continue to run them in production through a sunset and migration period. (Of course, you could accomplish something similar by applying `tags` to these models, and cycling through them over time.) + + ```yml selectors: @@ -78,11 +103,23 @@ selectors: value: old ``` -Finally, we intend to add support for **deprecating models** in dbt Core v1.6. When you slate a versioned model for deprecation, dbt will be able to provide more helpful warnings to downstream consumers of that model. Rather than just, "This model is going away," it's - "This older version of the model is going away, and there's a new version coming soon." + + +Because dbt knows that these models are _actually the same model_, it can notify downstream consumers as new versions become available, and (in the future) as older versions are slated for deprecation. + +```bash +Found an unpinned reference to versioned model 'dim_customers'. 
+Resolving to latest version: my_model.v2 +A prerelease version 3 is available. It has not yet been marked 'latest' by its maintainer. +When that happens, this reference will resolve to my_model.v3 instead. + + Try out v3: {{ ref(my_dbt_project, my_model, v=3) }} + Pin to v2: {{ ref(my_dbt_project, my_model, v=2) }} +``` ## How to create a new version of a model -Let's say you have a model (not yet versioned) with the following contract: +Most often, you'll start with a model that is not yet versioned. Let's go back in time to when `dim_customers` was a simple standalone model, with an enforced contract. For simplicity, we'll pretend it had only two columns—`customer_id` and `country_name`—though most mature models will obviously have many more. @@ -104,10 +141,10 @@ models: -If you wanted to make a breaking change to the model - for example, removing a column - you'd create a new model file (SQL or Python) encompassing those breaking changes. The default convention is naming the new file with a `_v` suffix. The new version can then be configured in relation to the original model, in a way that highlights the diffs between them. Or, you can choose to define each model version with full specifications, and repeat the values they have in common. +If you wanted to make a breaking change to the model-for example, removing a column-you'd create a new model file (SQL or Python) encompassing those breaking changes. The default convention is naming the new file with a `_v` suffix. The new version can then be configured in relation to the original model, in a way that highlights the diffs between them. Or, you can choose to define each model version with full specifications, and repeat the values they have in common. 
- + @@ -126,14 +163,14 @@ models: description: Where this customer lives data_type: varchar - # declare the versions, and just highlight the diffs + # Declare the versions, highlighting just the diffs versions: - v: 2 columns: - include: all exclude: [country_name] # this is the breaking change! - v: 1 - # no need to redefine anything -- matches the properties defined above + # No need to redefine anything -- matches the properties defined above ``` @@ -180,11 +217,17 @@ models: -The above configuration will create two models (one for each version), and produce database relations with aliases `dim_customers_v1` and `dim_customers_v2`. -By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!) +The above configuration will create two models, `dim_customers.v1` and `dim_customers.v2`. + +**Where are they defined?** + + +**Where will they be materialized?** By convention, these will create database relations with aliases `dim_customers_v1` and `dim_customers_v2`. We recommend that you also create a view, named `dim_customers`, pointing to the latest version. Check out guidance on an easy & repeatable way to do that. + +By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. It will also accept `dim_customers.sql` (no suffix) as the definition of the latest version. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!) -The `latest_version` would be `2` (numerically greatest) if not specified explicitly. In this case, `v1` is specified to still be the latest; `v2` is just a prerelease in early development. 
When ready to roll out `v2` to everyone by default, bump the `latest_version` to `2` (or remove it from the specification). +If not specified explicitly, the `latest_version` would be `2` (numerically greatest). In this case, `v1` is specified to still be the latest; `v2` is a prerelease in early development. When we're ready to roll out `v2` to everyone by default, we would bump the `latest_version` to `2`, or remove it from the specification. ### Configuring versioned models @@ -222,18 +265,22 @@ You could use the `alias` configuration: -Or, you could define a separate view that always points to the latest version of the model. We recommend this pattern because it's the most transparent and easiest to follow. +Or, you could do one better: In a model or project hook, create a view named `dim_customers` that always points to the latest version of the `dim_customers` model. You can find logic for just such a hook in [this gist](https://gist.github.com/jtcohen6/68220cd76b0bde088d3439664ccfb013/edit). Then, for all the versioned models in your project, it's as simple as: - + -```sql -{{ config(alias = 'dim_customers') }} + -select * from {{ ref('dim_customers') }} +```yml +# dbt_project.yml +on-run-end: + - "{{ create_latest_version_views() }}" ``` +**This is the pattern we recommend,** and we may build it into `dbt-core` as out-of-the-box functionality. This has the effect of providing the same flexibility that users get from `ref`, even if they're querying outside of dbt. Want a specific version? Pin to version X by adding the `_vX` suffix. Want the latest version? No suffix. + :::info If your project has historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro, and you'd like to start using model versions, you should update your custom implementation to account for model versions. 
Specifically, we'd encourage you to add [a condition like this one](https://github.com/dbt-labs/dbt-core/blob/ada8860e48b32ac712d92e8b0977b2c3c9749981/core/dbt/include/global_project/macros/get_custom_name/get_custom_alias.sql#L26-L30). @@ -249,6 +296,7 @@ dbt.exceptions.AmbiguousAliasError: Compilation Error ``` We opted to use `generate_alias_name` for this functionality so that the logic remains accessible to end users, and could be reimplemented with custom logic. +::: ### Optimizing model versions @@ -275,4 +323,6 @@ Of course, if one model version makes meaningful and substantive changes to logi We expect to develop more opinionated recommendations as teams start adopting model versions in practice. One recommended pattern we can envision: Prioritize the definition of the `latest_version`, and define other versions (old and prerelease) based on their diffs from the latest. How? - Define the properties and configuration for the latest version in the top-level model yaml, and the diffs for other versions below (via `include`/`exclude`) - Where possible, define other versions as `select` transformations, which take the latest version as their starting point -- When bumping the `latest_version`, migrate the SQL and yaml accordingly. In this case, we would see if it's possible to redefine `v1` with respect to `v2`. +- When bumping the `latest_version`, migrate the SQL and yaml accordingly. + +In the example above, the third point might be tricky. It's easier to _exclude_ `country_name`, than it is to add it back in. Instead, we might need to keep around the full original logic for `dim_customers.v1`—but materialize it as a `view`, to minimize the data warehouse cost of building it. If downstream queriers see slightly degraded performance, it's still significantly better than broken queries, and all the more reason to migrate to the new "latest" version. 
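The last point above (keeping the full logic for `dim_customers.v1` but building it as a view) could be expressed in the model's yaml roughly as follows. This is a minimal sketch, not part of the patch: it assumes that a `config` block is accepted at the version level, and the `latest_version: 2` value is chosen only for illustration.

```yml
models:
  - name: dim_customers
    latest_version: 2   # illustrative: v2 has been promoted to "latest"
    versions:
      - v: 2
        # inherits the top-level properties and configuration
      - v: 1
        # Assumption: version-level config. The old version stays queryable,
        # but is built as a view to keep its warehouse cost low.
        config:
          materialized: view
```

With a layout like this, downstream queries pinned to `v1` keep working through the migration window, at the price of somewhat slower reads.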
diff --git a/website/docs/reference/model-properties.md b/website/docs/reference/model-properties.md index 8550e0358a9..3ea9787207a 100644 --- a/website/docs/reference/model-properties.md +++ b/website/docs/reference/model-properties.md @@ -16,7 +16,7 @@ models: [description](description): [docs](/reference/resource-configs/docs): show: true | false - [latest_version](resource-properties/latest-version): + [latest_version](resource-properties/latest_version): [access](resource-properties/access): private | protected | public [config](resource-properties/config): [](model-configs): diff --git a/website/docs/reference/resource-properties/latest-version.md b/website/docs/reference/resource-properties/latest_version.md similarity index 100% rename from website/docs/reference/resource-properties/latest-version.md rename to website/docs/reference/resource-properties/latest_version.md diff --git a/website/docs/reference/resource-properties/versions.md b/website/docs/reference/resource-properties/versions.md index 7efae5dfb89..a32cbbead73 100644 --- a/website/docs/reference/resource-properties/versions.md +++ b/website/docs/reference/resource-properties/versions.md @@ -16,7 +16,7 @@ models: - v: # required defined_in: # optional -- default is _v columns: - # include/exclude columns from the top-level model properties + # specify all columns, or include/exclude columns from the top-level model yaml definition - [include](resource-properties/include-exclude): [exclude](resource-properties/include-exclude): # specify additional columns @@ -24,7 +24,7 @@ models: - v: ... # optional - [latest_version](resource-properties/latest-version): + [latest_version](resource-properties/latest_version): ``` @@ -35,7 +35,7 @@ The standard convention for naming model versions is `_v`. This h The version identifier for a version of a model. This value can be numeric (integer or float), or any string. 
-The value of the version identifier is used to order versions of a model relative to one another. If a versioned model does _not_ explicitly configure a [`latest_version`](resource-properties/latest-version), the highest version number is used as the latest version to resolve `ref` calls to the model without a `version` argument. +The value of the version identifier is used to order versions of a model relative to one another. If a versioned model does _not_ explicitly configure a [`latest_version`](resource-properties/latest_version), the highest version number is used as the latest version to resolve `ref` calls to the model without a `version` argument. In general, we recommend that you use a simple "major versioning" scheme for your models: `v1`, `v2`, `v3`, etc, where each version represents a breaking change from previous versions. However, you are welcome to use other versioning schemes. @@ -43,16 +43,21 @@ In general, we recommend that you use a simple "major versioning" scheme for you The name of the model file (excluding the file extension, e.g. `.sql` or `.py`) where the model version is defined. -If `defined_in` is not specified, dbt searches for the definition of a versioned model in a model file named `<model_name>_v<v>`. Model file names must be globally unique, even when defining versioned implementations of a model with a different name. +If `defined_in` is not specified, dbt searches for the definition of a versioned model in a model file named `<model_name>_v<v>`. The **latest** version of a model may also be defined in a file named `<model_name>`, without the version suffix. Model file names must be globally unique, even when defining versioned implementations of a model with a different name. -### Alias +### `alias` -The default `alias` for a versioned model is `<model_name>_v<v>`. +The default resolved `alias` for a versioned model is `<model_name>_v<v>`. The logic for this is encoded in the `generate_alias_name` macro. 
This default can be overwritten in two ways: -- Configuring a custom `alias` within the version yaml -- Overwriting dbt's `generate_alias_name` macro, to use different behavior when `node.version` +- Configuring a custom `alias` within the version yaml, or the versioned model's definition +- Overwriting dbt's `generate_alias_name` macro, to use different behavior based on `node.version` See ["Custom aliases"](https://docs.getdbt.com/docs/build/custom-aliases) for more details. -Setting a different value of `defined_in` does **not** automatically change the `alias` of the model to match. The two are determined independently. +Note that the value of `defined_in` and the `alias` configuration of a model are not coordinated, except by convention. The two are declared and determined independently. + +### Our recommendations +- Follow a consistent naming convention for model versions and aliases. +- Use `defined_in` and `alias` only if you have good reason. +- Create a view that always points to the latest version of your model. You can automate this for all versioned models in your project with an `on-run-end` hook. 
For more details, read the full docs on ["Model versions"](model-versions#configuring-database-location-with-alias) diff --git a/website/sidebars.js b/website/sidebars.js index e9f4aad921a..165847f9ac9 100644 --- a/website/sidebars.js +++ b/website/sidebars.js @@ -459,7 +459,7 @@ const sidebarSettings = { "reference/resource-properties/config", "reference/resource-properties/constraints", "reference/resource-properties/description", - "reference/resource-properties/latest-version", + "reference/resource-properties/latest_version", "reference/resource-properties/include-exclude", "reference/resource-properties/quote", "reference/resource-properties/tests", From 8d303dfdd58785cfddb23b07fd632bb2d4fd1a62 Mon Sep 17 00:00:00 2001 From: Jeremy Cohen Date: Mon, 24 Apr 2023 12:54:53 +0200 Subject: [PATCH 4/6] More feedback --- .../docs/collaborate/govern/model-versions.md | 58 ++++++++++++++----- .../reference/resource-properties/versions.md | 2 +- 2 files changed, 44 insertions(+), 16 deletions(-) diff --git a/website/docs/docs/collaborate/govern/model-versions.md b/website/docs/docs/collaborate/govern/model-versions.md index 5003d4c0f43..5e1c7573a82 100644 --- a/website/docs/docs/collaborate/govern/model-versions.md +++ b/website/docs/docs/collaborate/govern/model-versions.md @@ -52,13 +52,13 @@ Rather than constantly adding a new version for each small change, you should op ## How is this different from "version control"? -[Version control](git-version-control) allows your team to collaborate simultaneously on a single code repository, manage conflicts between changes, and review changes before deploying into production. In that sense, version control is an essential tool for versioning the deployment of an entire dbt project—always the latest state of the `main` branch, with the ability to roll back changes by reverting a commit or pull request. In general, only one version of your project code is deployed into an environment at a time. 
+[Version control](git-version-control) allows your team to collaborate simultaneously on a single code repository, manage conflicts between changes, and review changes before deploying into production. In that sense, version control is an essential tool for versioning the deployment of an entire dbt project—always the latest state of the `main` branch. In general, only one version of your project code is deployed into an environment at a time. If something goes wrong, you have the ability to roll back changes by reverting a commit or pull request, or by leveraging data platform capabilities around "time travel." When you make updates to a model's source code—its logical definition, in SQL or Python, or related configuration—dbt can [compare your project to previous state](project-state), enabling you to rebuild only models that have changed, and models downstream of a change. In this way, it's possible to develop changes to a model, quickly test in CI, and efficiently deploy into production—all coordinated via your version control system. -**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. I need to do my part by offering a migration path, with clear diffs and deprecation dates. +**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. You need to do your part by offering a migration path, with clear diffs and deprecation dates. -Multiple versions of a model will live in the same code repository at the same time, and be deployed into the same data environment simultaneously. 
This is similar to how web APIs are versioned—multiple versions are live simultaneously; older versions are often eventually sunsetted. +Multiple versions of a model will live in the same code repository at the same time, and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned: Multiple versions are live simultaneously (two or three, and not more). Over time, newer versions come online, and older versions are sunsetted. ## How is this different from just creating a new model? @@ -220,14 +220,11 @@ models: The above configuration will create two models, `dim_customers.v1` and `dim_customers.v2`. -**Where are they defined?** +**Where are they defined?** By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. It will also accept `dim_customers.sql` (no suffix) as the definition of the latest version. (It is possible to override this by setting [`defined_in: any_file_name_you_want`](resource-properties/versions#defined_in), but only if you have a good reason. We strongly encourage you to follow the convention.) +**Where will they be materialized?** By convention, these will create database relations with aliases `dim_customers_v1` and `dim_customers_v2`. In the future, dbt will also create a view or clone, named `dim_customers`, pointing to the latest version. See [the section below](#configuring-database-location-with-alias) for a way to implement this now. -**Where will they be materialized?** By convention, these will create database relations with aliases `dim_customers_v1` and `dim_customers_v2`. We recommend that you also create a view, named `dim_customers`, pointing to the latest version. Check out guidance on an easy & repeatable way to do that. -By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. 
It will also accept `dim_customers.sql` (no suffix) as the definition of the latest version. (It is possible to override this by setting `defined_in: any_file_name_you_want`, but we strongly encourage you to follow the convention!) - -If not specified explicitly, the `latest_version` would be `2` (numerically greatest). In this case, `v1` is specified to still be the latest; `v2` is a prerelease in early development. When we're ready to roll out `v2` to everyone by default, we would bump the `latest_version` to `2`, or remove it from the specification. +**Which version is "latest"?** If not specified explicitly, the `latest_version` would be `2` (numerically greatest). In this case, `v1` is specified to still be the latest; `v2` is a prerelease in early development. When we're ready to roll out `v2` to everyone by default, we would bump the `latest_version` to `2`, or remove it from the specification. ### Configuring versioned models @@ -265,21 +262,52 @@ You could use the `alias` configuration: -Or, you could do one better: In a model or project hook, create a view named `dim_customers` that always points to the latest version of the `dim_customers` model. You can find logic for just such a hook in [this gist](https://gist.github.com/jtcohen6/68220cd76b0bde088d3439664ccfb013/edit). Then, for all the versioned models in your project, it's as simple as: +Or, you could do one better: Define a post-hook to create a view named `dim_customers`, which always points to the latest version of the `dim_customers` model. You can find logic for just such a hook in [this gist](https://gist.github.com/jtcohen6/68220cd76b0bde088d3439664ccfb013/edit). 
Then, you can implement this for all versioned models in your project: - + - +```sql +{% macro create_latest_version_view() %} + + {% if model.get('version') and model.get('version') == model.get('latest_version') %} + + {% set new_relation = api.Relation.create( + database = this.database, + schema = this.schema, + identifier = model['name'] + ) %} + + {% set existing_relation = load_relation(new_relation) %} + {{ drop_relation_if_exists(existing_relation) }} + + {% set create_view_sql = create_view_as(new_relation, "select * from " ~ this) -%} + + {% do log("Creating view " ~ new_relation ~ " pointing to " ~ this, info = true) if execute %} + + {{ return(create_view_sql) }} + + {% endif %} + +{% endmacro %} +``` + + + + + ```yml # dbt_project.yml -on-run-end: - - "{{ create_latest_version_views() }}" +models: + post-hook: + - "{{ create_latest_version_view() }}" ``` -**This is the pattern we recommend,** and we may build it into `dbt-core` as out-of-the-box functionality. This has the effect of providing the same flexibility that users get from `ref`, even if they're querying outside of dbt. Want a specific version? Pin to version X by adding the `_vX` suffix. Want the latest version? No suffix. +**This is the pattern we recommend,** and we intend to build it into to `dbt-core` as out-of-the-box functionality: [dbt-core#7442](https://github.com/dbt-labs/dbt-core/issues/7442). + +By following this pattern, you can offer the same flexibility as `ref`, even if someone is querying outside of dbt. Want a specific version? Pin to version X by adding the `_vX` suffix. Want the latest version? No suffix. :::info If your project has historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro, and you'd like to start using model versions, you should update your custom implementation to account for model versions. 
Specifically, we'd encourage you to add [a condition like this one](https://github.com/dbt-labs/dbt-core/blob/ada8860e48b32ac712d92e8b0977b2c3c9749981/core/dbt/include/global_project/macros/get_custom_name/get_custom_alias.sql#L26-L30). diff --git a/website/docs/reference/resource-properties/versions.md b/website/docs/reference/resource-properties/versions.md index a32cbbead73..031431caec1 100644 --- a/website/docs/reference/resource-properties/versions.md +++ b/website/docs/reference/resource-properties/versions.md @@ -37,7 +37,7 @@ The version identifier for a version of a model. This value can be numeric (inte The value of the version identifier is used to order versions of a model relative to one another. If a versioned model does _not_ explicitly configure a [`latest_version`](resource-properties/latest_version), the highest version number is used as the latest version to resolve `ref` calls to the model without a `version` argument. -In general, we recommend that you use a simple "major versioning" scheme for your models: `v1`, `v2`, `v3`, etc, where each version represents a breaking change from previous versions. However, you are welcome to use other versioning schemes. +In general, we recommend that you use a simple "major versioning" scheme for your models: `1`, `2`, `3`, and so on, where each version reflects a breaking change from previous versions. You are able to use other versioning schemes. dbt will sort your version identifiers alphabetically if the values are not all numeric. You should **not** include the letter `v` in the version identifier, as dbt will do that for you. 
### `defined_in` From 1ebed2c1c578f5a90eeb2fd5869b5c0da9742291 Mon Sep 17 00:00:00 2001 From: Jeremy Cohen Date: Tue, 25 Apr 2023 12:00:32 +0200 Subject: [PATCH 5/6] PR feedback, self-review --- _redirects | 2 +- .../docs/collaborate/govern/model-access.md | 2 +- .../collaborate/govern/model-contracts.md | 2 +- .../docs/collaborate/govern/model-versions.md | 132 ++++++++++++------ 4 files changed, 94 insertions(+), 44 deletions(-) diff --git a/_redirects b/_redirects index 1b71bc255cd..0ad04d46989 100644 --- a/_redirects +++ b/_redirects @@ -278,7 +278,7 @@ docs/dbt-cloud/using-dbt-cloud/cloud-model-timing-tab /docs/deploy/dbt-cloud-job /docs/artifacts /docs/dbt-cloud/using-dbt-cloud/artifacts 301 /docs/bigquery-configs /reference/resource-configs/bigquery-configs 301 /reference/resource-properties/docs /reference/resource-configs/docs 301 -/reference/resource-properties/latest-version /reference/resource-configs/latest_version 301 +/reference/resource-properties/latest-version /reference/resource-properties/latest_version 301 /docs/building-a-dbt-project/building-models/bigquery-configs /reference/resource-configs/bigquery-configs 301 /docs/building-a-dbt-project/building-models/configuring-models /reference/model-configs /docs/building-a-dbt-project/building-models/enable-and-disable-models /reference/resource-configs/enabled 301 diff --git a/website/docs/docs/collaborate/govern/model-access.md b/website/docs/docs/collaborate/govern/model-access.md index 4099fde774d..577f2647301 100644 --- a/website/docs/docs/collaborate/govern/model-access.md +++ b/website/docs/docs/collaborate/govern/model-access.md @@ -6,7 +6,7 @@ description: "Define model access with group capabilities" --- :::info New functionality -This functionality is new in v1.5 — if you have thoughts, weigh into the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6730)! 
+This functionality is new in v1.5 — if you have thoughts, participate in [the discussion on GitHub](https://github.com/dbt-labs/dbt-core/discussions/6730)! ::: ## Related documentation diff --git a/website/docs/docs/collaborate/govern/model-contracts.md b/website/docs/docs/collaborate/govern/model-contracts.md index 3e65c4a593f..10d86b3c8aa 100644 --- a/website/docs/docs/collaborate/govern/model-contracts.md +++ b/website/docs/docs/collaborate/govern/model-contracts.md @@ -6,7 +6,7 @@ description: "Model contracts define a set of parameters validated during transf --- :::info New functionality -This functionality is new in v1.5 — if you have thoughts, weigh into the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6726)! +This functionality is new in v1.5 — if you have thoughts, participate in [the discussion on GitHub](https://github.com/dbt-labs/dbt-core/discussions/6726)! ::: ## Related documentation diff --git a/website/docs/docs/collaborate/govern/model-versions.md b/website/docs/docs/collaborate/govern/model-versions.md index 5e1c7573a82..6c09eb594b9 100644 --- a/website/docs/docs/collaborate/govern/model-versions.md +++ b/website/docs/docs/collaborate/govern/model-versions.md @@ -6,10 +6,10 @@ description: "Version models to help with lifecycle management" --- :::info New functionality -This functionality is new in v1.5 — if you have thoughts, weigh into the [GitHub discussion](https://github.com/dbt-labs/dbt-core/discussions/6736)! +This functionality is new in v1.5 — if you have thoughts, participate in [the discussion on GitHub](https://github.com/dbt-labs/dbt-core/discussions/6736)! ::: -Versioning APIs is a hard problem in software engineering. At the root of the challenge is the fact that the producers and consumers of an API have competing incentives: +Versioning APIs is a hard problem in software engineering. 
The root of the challenge is that the producers and consumers of an API have competing incentives: - Producers of an API need the ability to make changes to its logic. There is a real cost associated with maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier. - Consumers of an API need to trust in its stability—their queries will keep working, and won't break without warning. There is a real cost associated with migrating to a newer API version, but unplanned migration is far costlier. @@ -64,12 +64,13 @@ Multiple versions of a model will live in the same code repository at the same t Honestly, it's only a little bit different! There isn't much magic here, and that's by design. -You've always been able to create a new model, and name it `dim_customers_v2`. Why should you opt for a "real" versioned model instead? +You've always been able to copy-paste, create a new model file, and name it `dim_customers_v2.sql`. Why should you opt for a "real" versioned model instead? As the **producer** of a versioned model: -1. You keep track of all live versions in one place -2. You can reuse most configuration, and highlight just the diffs -3. You can select models to build (or not) based on their version +1. You keep track of all live versions in one place, rather than scattering them throughout the codebase +2. You can reuse the model's configuration, and highlight just the diffs between versions +3. You can select models to build (or not) based on whether they're a `latest`, `prerelease`, or `old` version +4. dbt will notify consumers of your versioned model when new versions become available, or (in the future) when they are slated for deprecation As the **consumer** of a versioned model: 1. You use a consistent `ref`, with the option of pinning to a specific live version @@ -79,15 +80,15 @@ All versions of a model preserve the model's original name. 
They are `ref`'d by Let's say that `dim_customers` has three versions defined: `v2` is the "latest", `v3` is "prerelease," and `v1` is an old version that's still within its deprecation window. Because `v2` is the latest version, it gets some special treatment: it can be defined in a file without a suffix, and `ref('dim_customers')` will resolve to `v2` if a version pin is not specified. The table below breaks down the standard conventions: -| v | `ref` syntax | File name | Database relation | -|---|-------------------------------------------------------|-------------------------------------------------|--------------------------------------------------------------------------| -| 3 | `ref('dim_customers', v=3)` | `dim_customers_v3.sql` | `analytics.dim_customers_v3` | -| 2 | `ref('dim_customers')` or `ref('dim_customers', v=2)` | `dim_customers_v2.sql` (or `dim_customers.sql`) | `analytics.dim_customers_v2` and `analytics.dim_customers` (recommended) | -| 1 | `ref('dim_customers', v=1)` | `dim_customers_v1.sql` | `analytics.dim_customers_v1` | +| v | version | `ref` syntax | File name | Database relation | +|---|------------|-------------------------------------------------------|-------------------------------------------------|--------------------------------------------------------------------------| +| 3 | "prerelease" | `ref('dim_customers', v=3)` | `dim_customers_v3.sql` | `analytics.dim_customers_v3` | +| 2 | "latest" | `ref('dim_customers', v=2)` **and** `ref('dim_customers')` | `dim_customers_v2.sql` **or** `dim_customers.sql` | `analytics.dim_customers_v2` **and** `analytics.dim_customers` (recommended) | +| 1 | "old" | `ref('dim_customers', v=1)` | `dim_customers_v1.sql` | `analytics.dim_customers_v1` | -As you'll see in the implemenatation section below, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. 
This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live. +As you'll see in the implementation section below, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live. -dbt also supports [`version`-based selection](node-selection/methods#the-version-method). For example, you could define a [default yaml selector](node-selection/yaml-selectors#default), to avoid running any old model versions in development—even as you continue to run them in production through a sunset and migration period. (Of course, you could accomplish something similar by applying `tags` to these models, and cycling through them over time.) +dbt also supports [`version`-based selection](node-selection/methods#the-version-method). For example, you could define a [default yaml selector](node-selection/yaml-selectors#default) that avoids running any old model versions in development, even while you continue to run them in production through a sunset and migration period. (You could accomplish something similar by applying `tags` to these models, and cycling through those tags over time.) @@ -113,13 +114,32 @@ Resolving to latest version: my_model.v2 A prerelease version 3 is available. It has not yet been marked 'latest' by its maintainer. When that happens, this reference will resolve to my_model.v3 instead. 
- Try out v3: {{ ref(my_dbt_project, my_model, v=3) }} - Pin to v2: {{ ref(my_dbt_project, my_model, v=2) }} + Try out v3: {{ ref('my_dbt_project', 'my_model', v='3') }} + Pin to v2: {{ ref('my_dbt_project', 'my_model', v='2') }} ``` ## How to create a new version of a model -Most often, you'll start with a model that is not yet versioned. Let's go back in time to when `dim_customers` was a simple standalone model, with an enforced contract. For simplicity, we'll pretend it had only two columns—`customer_id` and `country_name`—though most mature models will obviously have many more. +Most often, you'll start with a model that is not yet versioned. Let's go back in time to when `dim_customers` was a simple standalone model, with an enforced contract. For simplicity, let's pretend it has only two columns, `customer_id` and `country_name`, though most mature models will have many more. + + + +```sql +-- lots of sql + +final as ( + + select + customer_id, + country_name + from ... + +) + +select * from final +``` + + + @@ -141,7 +161,31 @@ models: -If you wanted to make a breaking change to the model-for example, removing a column-you'd create a new model file (SQL or Python) encompassing those breaking changes. The default convention is naming the new file with a `_v` suffix. The new version can then be configured in relation to the original model, in a way that highlights the diffs between them. Or, you can choose to define each model version with full specifications, and repeat the values they have in common. +Let's say you need to make a breaking change to the model: Removing the `country_name` column, which is no longer reliable. First, create a new model file (SQL or Python) encompassing those breaking changes. + + +The default convention is naming the new file with a `_v` suffix. Let's make a new file, named `dim_customers_v2.sql`. (We don't need to rename the existing model file just yet, while it's still the "latest" version.)
+ + + +```sql +-- lots of sql + +final as ( + + select + customer_id + -- country_name has been removed! + from ... + +) + +select * from final +``` + + + +Now, you could define properties and configuration for `dim_customers_v2` as a new standalone model, with no actual relation to `dim_customers` save a striking resemblance. Instead, we're going to declare that these are versions of the same model, both named `dim_customers`. We can define their properties in common, and then **just** highlight the diffs between them. (Or, you can choose to define each model version with full specifications, and repeat the values they have in common.) @@ -163,14 +207,19 @@ models: description: Where this customer lives data_type: varchar - # Declare the versions, highlighting just the diffs + # Declare the versions, and highlight the diffs versions: + + - v: 1 + # Matches what's above -- nothing more needed + - v: 2 + # Removed a column -- this is the breaking change! columns: + # This means: use the 'columns' list from above, but exclude country_name - include: all - exclude: [country_name] # this is the breaking change! - - v: 1 - # No need to redefine anything -- matches the properties defined above + exclude: [country_name] + ``` @@ -217,14 +266,13 @@ models: +The configuration above says: Instead of two unrelated models, I have two versioned definitions of the same model: `dim_customers.v1` and `dim_customers.v2`. -The above configuration will create two models, `dim_customers.v1` and `dim_customers.v2`. +**Where are they defined?** dbt expects each model version to be defined in a file named `_v`. In this case: `dim_customers_v1.sql` and `dim_customers_v2.sql`. It's also possible to define the "latest" version in `dim_customers.sql` (no suffix), without additional configuration. 
Finally, you can override this convention by setting [`defined_in: any_file_name_you_want`](resource-properties/versions#defined_in)—but we strongly encourage you to follow the convention, unless you have a very good reason. -**Where are they defined?** By convention, dbt will expect those two models to be defined in files named `dim_customers_v1.sql` and `dim_customers_v2.sql`. It will also accept `dim_customers.sql` (no suffix) as the definition of the latest version. (It is possible to override this by setting [`defined_in: any_file_name_you_want`](resource-properties/versions#defined_in), but only if you have a good reason. We strongly encourage you to follow the convention.) +**Where will they be materialized?** Each model version will create a database relation with alias `_v`. In this case: `dim_customers_v1` and `dim_customers_v2`. See [the section below](#configuring-database-location-with-alias) for more details on configuring aliases. -**Where will they be materialized?** By convention, these will create database relations with aliases `dim_customers_v1` and `dim_customers_v2`. In the future, dbt will also create a view or clone, named `dim_customers`, pointing to the latest version. See [the section below](#configuring-database-location-with-alias) for a way to implement this now. - -**Which version is "latest"?** If not specified explicitly, the `latest_version` would be `2` (numerically greatest). In this case, `v1` is specified to still be the latest; `v2` is a prerelease in early development. When we're ready to roll out `v2` to everyone by default, we would bump the `latest_version` to `2`, or remove it from the specification. +**Which version is "latest"?** If not specified explicitly, the `latest_version` would be `2`, because it's numerically greatest. In this case, we've explicitly specified that `latest_version: 1`. That means `v2` is a "prerelease," in early development and testing. 
When we're ready to roll out `v2` to everyone by default, we would bump `latest_version: 2`, or remove `latest_version` from the specification. ### Configuring versioned models @@ -262,30 +310,36 @@ You could use the `alias` configuration: -Or, you could do one better: Define a post-hook to create a view named `dim_customers`, which always points to the latest version of the `dim_customers` model. You can find logic for just such a hook in [this gist](https://gist.github.com/jtcohen6/68220cd76b0bde088d3439664ccfb013/edit). Then, you can implement this for all versioned models in your project: +**The pattern we recommend:** Create a view or table clone with the model's canonical name that always points to the latest version. By following this pattern, you can offer the same flexibility as `ref`, even if someone is querying outside of dbt. Want a specific version? Pin to version X by adding the `_vX` suffix. Want the latest version? No suffix, and the view will redirect you. - +We intend to build this into `dbt-core` as out-of-the-box functionality. (Upvote or comment on [dbt-core#7442](https://github.com/dbt-labs/dbt-core/issues/7442).) 
In the meantime, you can implement this pattern yourself with a custom macro and post-hook: + + ```sql {% macro create_latest_version_view() %} + -- this hook will run only if the model is versioned, and only if it's the latest version + -- otherwise, it's a no-op {% if model.get('version') and model.get('version') == model.get('latest_version') %} - {% set new_relation = api.Relation.create( - database = this.database, - schema = this.schema, - identifier = model['name'] - ) %} - - {% set existing_relation = load_relation(new_relation) %} - {{ drop_relation_if_exists(existing_relation) }} + {% set new_relation = this.incorporate(path={"identifier": model['name']}) %} - {% set create_view_sql = create_view_as(new_relation, "select * from " ~ this) -%} + {% set create_view_sql -%} + -- this syntax may vary by data platform + create or replace view {{ new_relation }} + as select * from {{ this }} + {%- endset %} {% do log("Creating view " ~ new_relation ~ " pointing to " ~ this, info = true) if execute %} {{ return(create_view_sql) }} + {% else %} + + -- no-op + select 1 as id + {% endif %} {% endmacro %} @@ -305,10 +359,6 @@ models: -**This is the pattern we recommend,** and we intend to build it into to `dbt-core` as out-of-the-box functionality: [dbt-core#7442](https://github.com/dbt-labs/dbt-core/issues/7442). - -By following this pattern, you can offer the same flexibility as `ref`, even if someone is querying outside of dbt. Want a specific version? Pin to version X by adding the `_vX` suffix. Want the latest version? No suffix. - :::info If your project has historically implemented [custom aliases](/docs/build/custom-aliases) by reimplementing the `generate_alias_name` macro, and you'd like to start using model versions, you should update your custom implementation to account for model versions. 
Specifically, we'd encourage you to add [a condition like this one](https://github.com/dbt-labs/dbt-core/blob/ada8860e48b32ac712d92e8b0977b2c3c9749981/core/dbt/include/global_project/macros/get_custom_name/get_custom_alias.sql#L26-L30). From 1a007574559aaad7ff0e7c505125d54f8b693b93 Mon Sep 17 00:00:00 2001 From: Jeremy Cohen Date: Wed, 26 Apr 2023 12:44:59 +0200 Subject: [PATCH 6/6] Final feedbacack --- .../docs/collaborate/govern/model-versions.md | 34 ++++++++++--------- 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/website/docs/docs/collaborate/govern/model-versions.md b/website/docs/docs/collaborate/govern/model-versions.md index 6c09eb594b9..3e1c79b00dc 100644 --- a/website/docs/docs/collaborate/govern/model-versions.md +++ b/website/docs/docs/collaborate/govern/model-versions.md @@ -10,10 +10,12 @@ This functionality is new in v1.5 — if you have thoughts, participate in [the ::: Versioning APIs is a hard problem in software engineering. The root of the challenge is that the producers and consumers of an API have competing incentives: -- Producers of an API need the ability to make changes to its logic. There is a real cost associated with maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier. -- Consumers of an API need to trust in its stability—their queries will keep working, and won't break without warning. There is a real cost associated with migrating to a newer API version, but unplanned migration is far costlier. +- Producers of an API need the ability to modify its logic and structure. There is a real cost to maintaining legacy endpoints forever, but losing the trust of downstream users is far costlier. +- Consumers of an API need to trust in its stability: their queries will keep working, and won't break without warning. Although migrating to a newer API version incurs an expense, an unplanned migration is far costlier. 
-The goal of model versions is not to make the problem go away, nor to pretend it's somehow easier or simpler than it is. Rather, we want dbt to provide tools that make it possible to tackle this problem, thoughtfully and head-on, and to develop standard patterns for solving it. +When sharing a final dbt model with other teams or systems, that model is operating like an API. When the producer of that model needs to make significant changes, how can they avoid breaking the queries of its users downstream? + +Model versioning is a tool to tackle this problem, thoughtfully and head-on. The goal is not to make the problem go away entirely, nor to pretend it's easier or simpler than it is. ## Related documentation - [`versions`](resource-properties/versions) @@ -40,7 +42,7 @@ There is a real trade-off that exists here—the cost to frequently migrate down ## When should you version a model? -By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. These changes, when made intentionally, would require a new model version. But many changes are not breaking, and don't require a new version—such as adding a new column, or fixing a bug in an existing column's calculation. +By enforcing a model's contract, dbt can help you catch unintended changes to column names and data types that could cause a big headache for downstream queriers. If you're making these changes intentionally, you should create a new model version. If you're making a non-breaking change, such as adding a new column or fixing a bug in an existing column's calculation, you don't need a new version. Of course, it's possible to change a model's definition in other ways—recalculating a column in a way that doesn't change its name, data type, or enforceable characteristics—but would substantially change the results seen by downstream queriers.
@@ -56,7 +58,7 @@ Rather than constantly adding a new version for each small change, you should op When you make updates to a model's source code—its logical definition, in SQL or Python, or related configuration—dbt can [compare your project to previous state](project-state), enabling you to rebuild only models that have changed, and models downstream of a change. In this way, it's possible to develop changes to a model, quickly test in CI, and efficiently deploy into production—all coordinated via your version control system. -**Versioned models are different.** Defining model `versions` is appropriate when there are people, systems, and processes beyond your team's control, inside or outside of dbt. You can neither simply go migrate them all, nor break their queries on a whim. You need to do my part by offering a migration path, with clear diffs and deprecation dates. +**Versioned models are different.** Defining model `versions` is appropriate when people, systems, and processes beyond your team's control, inside or outside of dbt, depend on your models. You can neither simply go migrate them all, nor break their queries on a whim. You need to offer a migration path, with clear diffs and deprecation dates. Multiple versions of a model will live in the same code repository at the same time, and be deployed into the same data environment simultaneously. This is similar to how web APIs are versioned: Multiple versions are live simultaneously, two or three, and not more). Over time, newer versions come online, and older versions are sunsetted . @@ -67,14 +69,14 @@ Honestly, it's only a little bit different! There isn't much magic here, and tha You've always been able to copy-paste, create a new model file, and name it `dim_customers_v2.sql`. Why should you opt for a "real" versioned model instead? As the **producer** of a versioned model: -1. You keep track of all live versions in one place, rather than scattering them throughout the codebase -2. 
You can reuse the model's configuration, and highlight just the diffs between versions -3. You can select models to build (or not) based on whether they're a `latest`, `prerelease`, or `old` version -4. dbt will notify consumers of your versioned model when new versions become available, or (in the future) when they are slated for deprecation +- You keep track of all live versions in one place, rather than scattering them throughout the codebase +- You can reuse the model's configuration, and highlight just the diffs between versions +- You can select models to build (or not) based on whether they're a `latest`, `prerelease`, or `old` version +- dbt will notify consumers of your versioned model when new versions become available, or (in the future) when they are slated for deprecation As the **consumer** of a versioned model: -1. You use a consistent `ref`, with the option of pinning to a specific live version -2. You will be notified throughout the life cycle of a versioned model +- You use a consistent `ref`, with the option of pinning to a specific live version +- You will be notified throughout the life cycle of a versioned model All versions of a model preserve the model's original name. They are `ref`'d by that name, rather than the name of the file that they're defined in. By default, the `ref` resolves to the latest version (as declared by that model's maintainer), but you can also `ref` a specific version of the model, with a `version` keyword. 
@@ -86,9 +88,9 @@ Let's say that `dim_customers` has three versions defined: `v2` is the "latest", | 2 | "latest" | `ref('dim_customers', v=2)` **and** `ref('dim_customers')` | `dim_customers_v2.sql` **or** `dim_customers.sql` | `analytics.dim_customers_v2` **and** `analytics.dim_customers` (recommended) | | 1 | "old" | `ref('dim_customers', v=1)` | `dim_customers_v1.sql` | `analytics.dim_customers_v1` | -As you'll see in the implementation section below, a versioned model can reuse the majority of its yaml properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live. +As you'll see in the implementation section below, a versioned model can reuse the majority of its YAML properties and configuration. Each version needs to only say how it _differs_ from the shared set of attributes. This gives you, as the producer of a versioned model, the opportunity to highlight the differences across versions—which is otherwise difficult to detect in models with dozens or hundreds of columns—and to clearly track, in one place, all versions of the model which are currently live. -dbt also supports [`version`-based selection](node-selection/methods#the-version-method). For example, you could define a [default yaml selector](node-selection/yaml-selectors#default) that avoids running any old model versions in development, even while you continue to run them in production through a sunset and migration period. (You could accomplish something similar by applying `tags` to these models, and cycling through those tags over time.) +dbt also supports [`version`-based selection](node-selection/methods#the-version-method). 
For example, you could define a [default YAML selector](node-selection/yaml-selectors#default) that avoids running any old model versions in development, even while you continue to run them in production through a sunset and migration period. (You could accomplish something similar by applying `tags` to these models, and cycling through those tags over time.) @@ -292,7 +294,7 @@ versions: -Like with all config inheritance, any configs set _within_ the versioned model's definition (`.sql` or `.py` file) will take precedence over the configs set in yaml. +Like with all config inheritance, any configs set _within_ the versioned model's definition (`.sql` or `.py` file) will take precedence over the configs set in YAML. ### Configuring database location with `alias` @@ -399,8 +401,8 @@ from {{ dim_customers_v1 }} Of course, if one model version makes meaningful and substantive changes to logic in another, it may not be possible to optimize it in this way. At that point, the cost of human intuition and legibility is more important than the cost of recomputing similar transformations. We expect to develop more opinionated recommendations as teams start adopting model versions in practice. One recommended pattern we can envision: Prioritize the definition of the `latest_version`, and define other versions (old and prerelease) based on their diffs from the latest. How? -- Define the properties and configuration for the latest version in the top-level model yaml, and the diffs for other versions below (via `include`/`exclude`) +- Define the properties and configuration for the latest version in the top-level model YAML, and the diffs for other versions below (via `include`/`exclude`) - Where possible, define other versions as `select` transformations, which take the latest version as their starting point -- When bumping the `latest_version`, migrate the SQL and yaml accordingly. +- When bumping the `latest_version`, migrate the SQL and YAML accordingly. 
In the example above, the third point might be tricky. It's easier to _exclude_ `country_name`, than it is to add it back in. Instead, we might need to keep around the full original logic for `dim_customers.v1`—but materialize it as a `view`, to minimize the data warehouse cost of building it. If downstream queriers see slightly degraded performance, it's still significantly better than broken queries, and all the more reason to migrate to the new "latest" version.
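
The trade-off in the last paragraph could be expressed in the model's YAML roughly as follows. This is a minimal sketch and not part of the patch: it assumes `latest_version` has already been bumped to `2`, that `dim_customers_v1.sql` still holds the full original logic, and that a per-version `config` block is used to override the materialization. Column definitions and the `include`/`exclude` diffs are omitted for brevity.

```yaml
models:
  - name: dim_customers
    latest_version: 2
    config:
      materialized: table   # assumed default for the latest version
    versions:
      - v: 2
        # inherits the top-level configuration
      - v: 1
        # old version: keep it queryable through the migration window,
        # but avoid paying to rebuild a second table on every run
        config:
          materialized: view
```

With a setup like this, queries pinned to `dim_customers_v1` keep working while the warehouse only stores the data for `v2`.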