
Lift + shift for cross-db macros #5298

Merged (19 commits) Jun 16, 2022
Conversation

Contributor

@dbeatty10 dbeatty10 commented May 25, 2022

A step on the way toward resolving #4813. Continues #5265.

Description

  • Move dispatched + default__ macros into global project, macros/utils
  • Move Postgres-specific implementations into dbt-postgres
  • Move test cases into dbt-tests-adapter, with options to override the macro_namespace. This allows us to test for backwards compatibility when people call the dbt_utils-namespaced versions.
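The backwards-compatibility goal above means a call through the dbt_utils namespace should resolve to the same implementation as a call through the dbt namespace. A hedged sketch (model and column names are hypothetical):

```sql
-- Hypothetical model: once the lift-and-shift lands and dbt-utils shims back
-- to core, both of these calls should compile to the same SQL
select
    {{ dbt.concat(['first_name', "' '", 'last_name']) }} as full_name_core,
    {{ dbt_utils.concat(['first_name', "' '", 'last_name']) }} as full_name_utils
from {{ ref('customers') }}
```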

Conflict for current_timestamp()

Context

  • Both dbt-utils and dbt-core contained definitions for current_timestamp()
  • The text of the definitions is not identical (see here)
  • However the SQL result of the definitions appears to be equal (see here)
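For context, both packages use the standard dispatch pattern for this macro; a simplified sketch (the macro bodies here are illustrative, not the exact competing definitions):

```sql
{% macro current_timestamp() -%}
    {{ adapter.dispatch('current_timestamp', 'dbt')() }}
{%- endmacro %}

{% macro default__current_timestamp() -%}
    {# default: rely on the SQL-standard keyword #}
    current_timestamp
{%- endmacro %}

{% macro postgres__current_timestamp() -%}
    {# adapter-specific override lives in dbt-postgres #}
    now()
{%- endmacro %}
```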

Options

We have some options for how to handle this:

  1. Keep the existing definition within each database adapter and discard the dbt-utils definition
  2. Overwrite the existing definition within each database adapter in favor of the dbt-utils definition
  3. Keep the existing definition within each database adapter and re-institute the dbt-utils definition back into the macros/cross_db_utils folder.

Initial decision

I went with the following decision since dbt-core is more foundational:

  • Option 1: Keep the original definitions from dbt-core and discard the definitions from dbt-utils

But all else being equal, I probably would have preferred Option 2. Risk aversion tilted the scales to Option 1.

Further options (per adapter)

If desired, any or all of the following could be done without any known issues:

  • Move definition of current_timestamp() from dbt/include/global_project/macros/adapters/freshness.sql to dbt/include/global_project/macros/utils/current_timestamp.sql (or equivalent location per adapter)
  • Replace the original definition in dbt-core with the original definition from dbt-utils

To do before merge

  • Review and affirm the default definition of current_timestamp() here
  • Review initial decision for handling current_timestamp() and change tack if desired
  • Determine how to handle tests decorated with @pytest.mark.skip within tests/adapter/dbt/tests/adapter/utils/

@cla-bot cla-bot bot added the cla:yes label May 25, 2022
@dbeatty10
Contributor Author

@jtcohen6 could you take a look at the "to do before merge" items above?

Everything has been moved into dbt-core, and all tests are passing. I also affirmed that all the adapters work when using this branch.

I made an initial decision on how to handle the conflict with current_timestamp(), but we still have full optionality to change that decision. The tests in dbt-utils didn't see any functional differences between the competing definitions. My main worry is that there is some edge case our tests aren't exposing (like some particular database setting for handling timestamps).

Do you have a preference between options 1, 2, and 3 for handling the current_timestamp() conflict?

@dbeatty10 dbeatty10 marked this pull request as ready for review May 31, 2022 15:33
@dbeatty10 dbeatty10 requested a review from a team May 31, 2022 15:33
@dbeatty10 dbeatty10 requested review from a team as code owners May 31, 2022 15:33
Contributor

@jtcohen6 jtcohen6 left a comment


@dbeatty10 Great work here!!

As discussed the other day, let's kick out the current_timestamp + type_* macros/tests for now. We have other issues to track that work: https://github.com/dbt-labs/dbt-utils/issues/599, dbt-labs/dbt-utils#598

The rest of this looks good to move forward. This is the first PR we'll need to merge, and then we can move forward with the rest.

.changes/unreleased/Features-20220518-114604.yaml (outdated)
@jtcohen6 jtcohen6 added the Team:Adapters Issues designated for the adapter area of the code label Jun 2, 2022
Contributor

@ChenyuLInx ChenyuLInx left a comment


There seem to be a lot of tests that didn't get implemented; do we have anything tracking those?

Otherwise this is mainly moving things from dbt-utils to core, right?

@jtcohen6
Contributor

jtcohen6 commented Jun 3, 2022

There seem to be a lot of tests that didn't get implemented; do we have anything tracking those?

Yes! Let's kick those type_* macros + tests out of this PR. They'll be added in dbt-labs/dbt-utils#586 instead.

Same for current_timestamp macros + tests. We're tracking the work there in #5521.

Otherwise this is mainly moving things from dbt-utils to core, right?

Contributor

@jtcohen6 jtcohen6 left a comment


Let's ship it! And then, let's ship the rest of it!

Comment on lines +1 to +7
{% macro cast_bool_to_text(field) %}
{{ adapter.dispatch('cast_bool_to_text', 'dbt') (field) }}
{% endmacro %}

{% macro default__cast_bool_to_text(field) %}
cast({{ field }} as {{ api.Column.translate_type('string') }})
{% endmacro %}
Contributor

@jtcohen6 jtcohen6 Jun 13, 2022


Clever work in 1001e42! That brought this macro to my attention, along with a few of the others.

This one feels pretty close to the type_* macro behavior, and like it really just exists to handle a wonky behavior around cast() on Redshift. But I don't think we need to split hairs, beyond the extent to which I already have :)

Contributor Author


Yep, I think I just took the type_* macro behavior based on api.Column.translate_type() from your dbt-utils branch.

If/when we push a version of the type_* macro into core, we have the option of replacing these references with that macro instead.

@@ -0,0 +1,30 @@
{% macro listagg(measure, delimiter_text="','", order_by_clause=none, limit_num=none) -%}
Contributor

@jtcohen6 jtcohen6 Jun 15, 2022


In my work on dbt-labs/dbt-spark#359, I've realized that order_by_clause is really not supported on the SparkSQL analogue for listagg. I don't think that's a reason to exclude it from the standard / default implementation, since so many other SQL dialects support it so handily, but:

  • We might be able to support it, via some array sorting, if it were expressed as order_by=<column_name>, instead of expecting a SQL order by column_name expression
  • I'm curious what you think about the approach I took here — basically, raise a warning if the argument is passed, then ignore it

Contributor Author


The approach you took with ignoring the order_by_clause parameter and raising a warning is completely reasonable.

Definitely worth a look at whether we can replace a user-supplied expression like order_by_clause="order by column_name" with order_by=<column_name> instead. I'll take a peek and report back.

Contributor Author


Overall, I think we should defer this work to a separate PR (if we do in fact take it on).

Before taking it on, here are the main things for us to answer:

  1. Do all databases operate correctly with an expression like order_by_clause="order by column_name"?
  2. Would it be a breaking change? If so, is there any path forward that is non-breaking?

Findings for question 1

Summary: Yes, it looks like they all support order_by_clause="order by column_name"

Details:

  • Here are current implementations across databases.
    • Looks like each of our supported adapters needs a different implementation, possibly because listagg was a relative latecomer, first appearing in SQL:2016. (Fascinating discussion on listagg here)
  • The pytests here include expressions like "order by order_col" (exactly as you expected)
  • The pytests use positional arguments exclusively and don't currently cover named keyword arguments.
    • 🟢 Covering the positional arguments is most crucial, since messing with the order of those can affect consumers relying on a particular order.
    • 🟡 We might want to add keyword testing for good measure though.
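For reference, the two calling styles look like this side by side (model and column names are hypothetical):

```sql
select
    -- positional arguments, as the pytests currently exercise them
    {{ listagg('name_col', "', '", "order by order_col", 3) }} as top_three,
    -- keyword arguments, currently untested
    {{ listagg(measure='name_col', order_by_clause='order by order_col') }} as all_names
from {{ ref('my_model') }}
```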

Findings for question 2

It would be a breaking change if we drop order_by_clause="order by column_name" and replace it with something like order_by="column_name".

Options:

  1. Accept the breaking change
  2. Reject the breaking change
  3. Introduce the new parameter and deprecate the old parameter in a non-breaking fashion.

The last option would look something like this:

  • New signature: {% macro listagg(measure, delimiter_text="','", order_by_clause=none, limit_num=none, order_by=none) -%}
    • Doing it this way prevents unexpected errors for any consumers relying upon positional arguments
  • Raise an error (or warn) if both order_by_clause and order_by are provided
  • When only order_by_clause is provided:
    • This is the "base case", and all the current code is relevant
    • Provide a deprecation warning that this parameter will be removed in a future release.
  • When only order_by is provided:
    • For non-spark adapters, convert order_by into the relevant order_by_clause value. Proceed to the code in the base case above.
    • For spark, use order_by according to your plans 😈

When doing the final deprecation in the future, we'd probably want to leave order_by_clause=none as a non-functional vestigial parameter to preserve the positions of each keyword parameter.
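As a sketch of option 3 (hedged: names and the exact warning mechanism are placeholders), the dispatching macro could normalize the two parameters before handing off to the existing per-adapter implementations:

```sql
{% macro listagg(measure, delimiter_text="','", order_by_clause=none, limit_num=none, order_by=none) -%}
    {% if order_by_clause is not none and order_by is not none %}
        {{ exceptions.raise_compiler_error("listagg: pass either order_by_clause or order_by, not both") }}
    {% elif order_by is not none %}
        {# new-style parameter: convert to the clause the existing code expects #}
        {% set order_by_clause = 'order by ' ~ order_by %}
    {% elif order_by_clause is not none %}
        {# old-style parameter: emit a deprecation warning (mechanism is a placeholder) #}
        {% do exceptions.warn("listagg: order_by_clause is deprecated; use order_by instead") %}
    {% endif %}
    {{ return(adapter.dispatch('listagg', 'dbt')(measure, delimiter_text, order_by_clause, limit_num)) }}
{%- endmacro %}
```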

Contributor


@dbeatty10 Thanks for going deep on this—much deeper than I did!

I did some more poking, and the amount of struct + array_sort nonsense that would be required even just to get order_by=column_name working is nontrivial. More effort than it's worth until we have a proven need. It's an optional argument, and feels like a nice-to-have.

I'm quite happy to leave this in place, and document / warn that order_by_clause is not supported on SparkSQL. It does seem to be a (recent) standard, and most other databases support it in exactly this form (order_by_clause). Perhaps we could convince the fine folks developing Apache Spark to add support for it, someday? :)

@jtcohen6 jtcohen6 merged commit a02db03 into main Jun 16, 2022
@jtcohen6 jtcohen6 deleted the dbeatty/utils-lift-shift branch June 16, 2022 18:07
@jtcohen6
Contributor

🎉 🎉 🎉
