Remove limitations to partial parsing, turn on by default #3102

jtcohen6 · 2021-02-15T14:25:19Z

Parsing files, especially .yml files, is one of the most time-consuming operations at the start of an invocation. Partial parsing is a powerful feature that lets dbt avoid re-parsing unchanged files in subsequent runs. While it's no substitute for improving overall parse time, the better it works, the better developers' quality of life will be, especially in very large projects.

Today, there are some significant limitations. From the docs:

If environment variables control the parsed representation of your project, then the logic executed by dbt may differ from the logic specified in your project. Partial parsing should only be used when all of the logic in your dbt project is encoded in the files inside of that project.

What would we need in order to return reliable, up-to-date results? I imagine dbt would need to:

capture which files/macros/configs/etc depend on the values of environment variables
record which environment variables matter, and what their values are
compare the env var values in subsequent runs to determine if the files need to be re-parsed
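To make the three steps above concrete, here is a minimal sketch of the record-and-compare part. It is not how dbt implements anything: `fingerprint`, `record_env_vars`, `files_to_reparse`, and the `file_env_deps` mapping (file path to the env var names it depends on) are all hypothetical names, and the sketch assumes the dependency mapping has already been captured during a full parse. Values are hashed before being saved so that secrets are not written to disk in plain text.

```python
import hashlib
from typing import Dict, Iterable, Mapping, Optional, Set


def fingerprint(value: Optional[str]) -> Optional[str]:
    """Hash an env var value; None means the variable was unset."""
    if value is None:
        return None
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def record_env_vars(
    names: Iterable[str], environ: Mapping[str, str]
) -> Dict[str, Optional[str]]:
    """Snapshot the env vars that matter, to be saved next to the parse cache."""
    return {name: fingerprint(environ.get(name)) for name in sorted(names)}


def files_to_reparse(
    file_env_deps: Mapping[str, Set[str]],
    saved: Mapping[str, Optional[str]],
    environ: Mapping[str, str],
) -> Set[str]:
    """Return files whose env var dependencies changed since the snapshot.

    A dependency that was never recorded is treated as changed.
    """
    stale: Set[str] = set()
    for path, deps in file_env_deps.items():
        for name in deps:
            if name not in saved or saved[name] != fingerprint(environ.get(name)):
                stale.add(path)
                break
    return stale
```

In a real run, `environ` would be `os.environ`; it is passed in explicitly here so the comparison is easy to test. Files with no env var dependencies are never marked stale by this check.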

If partial parsing is enabled and --vars change between runs, dbt will always re-parse.

What would we need to avoid re-parsing all files when --vars change? I imagine it's similar to the above! In both cases, if dbt could statically analyze which files directly or indirectly depend on the {{ env_var() }} and {{ var() }} macros, and compare the values in subsequent runs, it could know whether to re-parse certain files.
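As a rough illustration of the static-analysis idea, the sketch below scans raw file contents for `env_var(...)` and `var(...)` calls and selects only the files that reference a var whose value changed. It is a simplification under stated assumptions: the function names are hypothetical, it uses a regular expression rather than a real Jinja parse, it only catches calls whose first argument is a string literal, and it finds direct references only (it does not follow a model into the macros it calls, which the indirect case would require).

```python
import re
from typing import Mapping, Set, Tuple

# Matches env_var('NAME') or var("name", ...) with a literal first argument.
_CALL = re.compile(r"""\b(env_var|var)\(\s*(['"])([^'"]+)\2""")

_MISSING = object()  # sentinel for "var not supplied in this run"


def find_references(source: str) -> Tuple[Set[str], Set[str]]:
    """Return (env var names, var names) referenced directly in the text."""
    env_vars: Set[str] = set()
    cli_vars: Set[str] = set()
    for match in _CALL.finditer(source):
        kind, name = match.group(1), match.group(3)
        (env_vars if kind == "env_var" else cli_vars).add(name)
    return env_vars, cli_vars


def files_affected_by_vars(
    sources: Mapping[str, str],
    old_vars: Mapping[str, object],
    new_vars: Mapping[str, object],
) -> Set[str]:
    """Return files that reference a var whose value differs between runs."""
    changed = {
        key
        for key in set(old_vars) | set(new_vars)
        if old_vars.get(key, _MISSING) != new_vars.get(key, _MISSING)
    }
    return {
        path
        for path, text in sources.items()
        if find_references(text)[1] & changed
    }
```

With something like this, a change to one --vars value would invalidate only the files that mention that var, rather than forcing a full re-parse. A production version would need to resolve references through macros and handle non-literal arguments, most likely by falling back to a full re-parse whenever it cannot be sure.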

By default, partial_parse is set to false.

What would need to change to turn this on for all projects by default? I imagine it requires good answers to the env var questions above. (Currently, if --vars change, dbt returns correct parsed results but without any speed improvement; if env vars change, there's a chance it returns incorrect results.) There may be other things I'm missing as well.

Alternatively, we could distinguish between the two and say that env vars are a non-dbt construct, whereas --vars are a dbt construct for which dbt must take responsibility. It would then be up to the developer / deployment orchestrator to ensure that, if partial parsing is being used, all env vars are consistent across the pickled and current runtimes.

jtcohen6 · 2021-02-24T09:34:32Z

See also: #2265, #2363

jtcohen6 · 2021-03-03T15:43:52Z

Also:

Why don't we pickle the "final" version of the manifest? Today, patch_sources_elapsed and process_manifest_elapsed have to be performed in full, whether or not partial parsing is enabled. (We opened this as a separate issue: Refactor partial parsing to cover more startup costs #3163)
Could we use msgpack (via Mashumaro) or cbor instead of a pickle file?

jtcohen6 · 2021-06-05T21:44:07Z

A number of these improvements are included in 0.20.0rc1. An additional slew of test cases and edge cases are detailed in #3371. Closing in favor of that issue.

jtcohen6 added the enhancement, performance, and vars labels Feb 15, 2021

jtcohen6 mentioned this issue Mar 23, 2021: [Q2C1] dbt Core must scale #3188 (closed)

jtcohen6 added the partial_parsing label Jun 5, 2021

jtcohen6 closed this as completed Jun 5, 2021
