Skip to content
This repository has been archived by the owner on Dec 18, 2023. It is now read-only.

Remove regex from metadata queries #34

Closed
jtcohen6 opened this issue Nov 24, 2020 · 3 comments · Fixed by #40
Closed

Remove regex from metadata queries #34

jtcohen6 opened this issue Nov 24, 2020 · 3 comments · Fixed by #40
Labels
enhancement New feature or request

Comments

@jtcohen6
Copy link
Contributor

@WangSleep commented on Thu Nov 19 2020

When i excute "dbt docs generate" on CLI .it dosen't work .

it's error message :
PrestoQueryError(type=INTERNAL_ERROR, name=OPTIMIZER_TIMEOUT, message="The optimizer exhausted the time limit of 180000 ms", query_id=20201110_082631_11221_7vjyv)

i can excute "dbt deug" , "dbt test" ,"dbt run" on CLI . but "dbt docs generate" alone can't

Can you solve my problem?

thanks a lot ~~~😊


@WangSleep commented on Thu Nov 19 2020

I found out when I performed "dbt docs generate" , it will excute this sql

select table_catalog as database, table_name as name, table_schema as schema, case when table_type = 'BASE TABLE' then 'table' when table_type = 'VIEW' then 'view' else table_type end as table_type from "hive".INFORMATION_SCHEMA.tables where regexp_like(table_schema, '(?i)\ADatebase\Z')

Too many resources are required, and execution time is too long to cause a timeout 。

Could you tell me what can be solved? thanks a lot


@jtcohen6 commented on Thu Nov 19 2020

Hey @WangSleep, which database / adapter plugin are you using? It sounds like the catalog query—which dbt uses to grab metadata, populate catalog.json, and thereby feed information into the docs site—may be inefficient as currently implemented on this adapter.


@WangSleep commented on Thu Nov 19 2020

Hey @WangSleep, which database / adapter plugin are you using? It sounds like the catalog query—which dbt uses to grab metadata, populate catalog.json, and thereby feed information into the docs site—may be inefficient as currently implemented on this adapter.

yeh .

We query by configuring presto-python-client, which I know is for docs, but python_presto query "hive". INFORMATION_SCHEMA is slow. This causes a timeout. Later I found out that I used lore-zeppelin, which was based on jdbc, and I could query. It takes about 5 minutes.

thanks a lot ~


@jtcohen6 commented on Fri Nov 20 2020

I'm still a bit confused. Are you using dbt-presto? What's the output when you run dbt --version?


@WangSleep commented on Sun Nov 22 2020

I'm still a bit confused. Are you using dbt-presto? What's the output when you run dbt --version?

hi , bro !
yes .
I'm using it . dbt --version : 0.18.1


@jtcohen6 commented on Mon Nov 23 2020

We haven't released dbt-presto v0.18.1 yet (which is on us), and presto__get_catalog as implemented in that plugin looks different from the query snippet you pasted above. (In fact, your snippet doesn't look like any of the get_catalog queries familiar to me.)

So... I'm still confused as to which dbt plugin you're using :)


@WangSleep commented on Mon Nov 23 2020

We haven't released dbt-presto v0.18.1 yet (which is on us), and presto__get_catalog as implemented in that plugin looks different from the query snippet you pasted above. (In fact, your snippet doesn't look like any of the get_catalog queries familiar to me.)

So... I'm still confused as to which dbt plugin you're using :)

hi . @jtcohen6

FYI

This is the version I installed :)

image


@jtcohen6 commented on Tue Nov 24 2020

Ok, this makes much more sense now. Appreciate the clarification!

The snippet you included comes from presto__list_relations_without_caching. I'm not sure why it only times out on dbt docs generate, since that's a query that should run at the start of all invocations.

I'm going to transfer this issue to the dbt-presto repo, where we can discuss if there is a more performant way to write that metadata introspection. Out of curiosity, how many objects are in your "hive" catalog?

@jtcohen6 jtcohen6 changed the title i have a problem . Hopefully it can be solved 。 list_relations_without_caching resulting in timeout Nov 24, 2020
@WangSleep
Copy link

thanks ~

about 4000 - 5000 metadata table in hive and about 100 database information when i need to query :)

@jtcohen6
Copy link
Contributor Author

I'm pretty sure the culprit here is presto_ilike, which performs regex in the filter and likely requires scanning all tables in the metastore.

Several months ago, we made the decision to switch dbt-snowflake from performing ilike queries of information_schema.tables to instead use show terse objects in schema (dbt-labs/dbt-core#2322).

Presto supports both information_schema queries and show for schemas, tables, and columns. It would be worth doing some performance benchmarking to check if one is faster than the other. The big change, that we should make regardless, is to remove the regex filter and replace it with an equality match, both in list_relations_without_caching and get_columns_in_relation.

As in dbt-snowflake, it will require a bit of extra handling to first figure out the right quoting and casing of available schemas.

@jtcohen6 jtcohen6 changed the title list_relations_without_caching resulting in timeout Remove regex from metadata queries Nov 25, 2020
@jtcohen6 jtcohen6 added the enhancement New feature or request label Nov 25, 2020
@WangSleep
Copy link

WangSleep commented Nov 26, 2020

I'm pretty sure the culprit here is presto_ilike, which performs regex in the filter and likely requires scanning all tables in the metastore.

Several months ago, we made the decision to switch dbt-snowflake from performing ilike queries of information_schema.tables to instead use show terse objects in schema (fishtown-analytics/dbt#2322).

Presto supports both information_schema queries and show for schemas, tables, and columns. It would be worth doing some performance benchmarking to check if one is faster than the other. The big change, that we should make regardless, is to remove the regex filter and replace it with an equality match, both in list_relations_without_caching and get_columns_in_relation.

As in dbt-snowflake, it will require a bit of extra handling to first figure out the right quoting and casing of available schemas.

I modified the adapters.sql and catalog.sql files。

adapters.sql : use '=' replace 'ilike'
catalog.sql : add two 'where' Conditions.

It now takes about 2 minutes to execute。so , thansk a lot .

For now, it's settled 。

thank you vary much .and finally ,snapshot materialization not implemented for presto。It's such a pity @@/(ㄒoㄒ)/~~

@jtcohen6

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants