Skip to content
This repository has been archived by the owner on Dec 18, 2023. It is now read-only.

Updates catalog queries to use strict equality #40

Merged

Conversation

clausherther
Copy link
Contributor

@clausherther clausherther commented May 4, 2021

Fixes #34 (hopefully)

This PR updates the following macros in adapter.sql to use strict equality instead of ilike and make filter parameters lower case on right-hand side of filter:

  • get_columns_in_relation
  • list_relations_without_caching
  • check_schema_exists

Also, updates get_catalog in catalog.sql to:

  • push the table_schema predicate up for performance reasons
  • make filter parameters lower case on right-hand side of filter

Presto catalogs seem to store values lower case (at least on hive). Adding a function like upper to the left side of a catalog filter causes some catalogs to not be able to push filter predicates down to the catalog provider, causing resource problems.
To avoid that, we assume values are stored in lower case and force incoming schema parameters to lower case as well.

@cla-bot cla-bot bot added the cla:yes label May 4, 2021
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @clausherther!!

I'd request two changes:

  • make the comparisons case-insensitive
  • remove the presto_ilike macro, to cut down surface area we're no longer using

{% if relation.schema %}
and {{ presto_ilike('table_schema', relation.schema) }}
and table_schema = '{{ relation.schema }}'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll still want these comparisons to be case-insensitive:

Suggested change
and table_schema = '{{ relation.schema }}'
and lower(table_schema) = lower('{{ relation.schema }}')

Lower or upper, whichever you prefer, and the same for the ones below. I did find an interesting note in the docs:

The lower() and upper() functions do not perform locale-sensitive, context-sensitive, or one-to-many mappings required for some languages. Specifically, this will return incorrect results for Lithuanian, Turkish and Azeri.

I don't think that's something we should have to worry about right here & right now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, funny about that case insensitivity: on our (rather large) cluster, wrapping table_schema in upper also fails. I tried submitting a fix for catalog.sql as well to make dbt docs generate work and ran into that. I have an internal support thread going but don't think there's a good fix there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, doing something like

where upper(table_schema) = '{{ relation.schema | upper }}'

which compiles to

where upper(table_schema) = 'CLAUSH_DEV'

fails with

PrestoExternalError(type=EXTERNAL, name=HIVE_METASTORE_ERROR, message="Internal error processing get_all_tables", query_id=20210504_212530_12029_bz8fk)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What! Oy... I'm guessing that's why we initially went the regex route. This still feels like a net improvement, we just may want to document a recommendation against mixed-case catalog/schema names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, at least on our cluster it seems that table_schema is stored in lower-case letters even if you created the schema upper case.
For example:

create schema "CLAUSH_TEST";
create table "CLAUSH_TEST".test as (select 1 as col1)


select *
from "iceberg".INFORMATION_SCHEMA.tables
where
    table_schema != 'information_schema'
    and
    table_schema = 'claush_test'

returns 1 row:

table_catalog	table_schema	table_name	table_type
iceberg	claush_test	test	BASE TABLE

So, how do you feel something like:

where table_schema = '{{ relation.schema | lower }}'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use lower case on right side of comparison and updated README accordingly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if, in fact, Presto identifiers are never case-sensitive... (prestodb/presto#2863, trinodb/trino#17)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! Although the comparison operator certainly is case-sensitive. So,

select * from iceberg.INFORMATION_SCHEMA.tables where table_schema = 'claush'

works, while

select * from iceberg.INFORMATION_SCHEMA.tables where table_schema = 'CLAUSH'

does not.

@clausherther clausherther requested a review from jtcohen6 May 4, 2021 22:29
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clausherther Sorry I lost track of this one! Thanks for the contribution, this LGTM.

FYI @iknox-fa I'm gong to merge into master now, so that we can include in 0.20.0rc1

@jtcohen6 jtcohen6 merged commit 5fb69a0 into dbt-labs:master Jun 7, 2021
@clausherther
Copy link
Contributor Author

👏 yay!

@clausherther clausherther deleted the issue-24/update-catalog-queries branch June 7, 2021 21:58
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove regex from metadata queries
2 participants