Remove limitations to partial parsing, turn on by default #3102

Closed
jtcohen6 opened this issue Feb 15, 2021 · 3 comments

Comments

@jtcohen6
Contributor

Parsing files, especially .yml files, is one of the most time-consuming operations at the start of an invocation. Partial parsing is a powerful feature that enables dbt to avoid re-parsing unchanged files in subsequent runs. While it's no substitute for improving overall parse time, the better it can be, the better developers' quality of life will be, especially when working with very big projects.

Today, there are some significant limitations. From docs:

If environment variables control the parsed representation of your project, then the logic executed by dbt may differ from the logic specified in your project. Partial parsing should only be used when all of the logic in your dbt project is encoded in the files inside of that project.

What would we need in order to return reliable, up-to-date results? I imagine dbt would need to:

  • capture which files/macros/configs/etc depend on the values of environment variables
  • record which environment variables matter, and what their values are
  • compare the env var values in subsequent runs to determine if the files need to be re-parsed
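The three steps above can be sketched roughly as follows. This is only an illustration, not how dbt implements it: the `EnvSnapshot` structure and both function names are hypothetical, and it assumes the parser can report each `env_var()` lookup along with the file that triggered it.

```python
import os
from typing import Dict, Mapping, Optional, Set

# Hypothetical structure: file path -> {env var name -> value seen at parse time}.
# A value of None records that the variable was unset when the file was parsed.
EnvSnapshot = Dict[str, Dict[str, Optional[str]]]


def record_env_usage(
    snapshot: EnvSnapshot,
    path: str,
    name: str,
    environ: Mapping[str, str] = os.environ,
) -> None:
    """Note that `path` read env var `name`, and remember the value it saw."""
    snapshot.setdefault(path, {})[name] = environ.get(name)


def files_to_reparse(
    snapshot: EnvSnapshot,
    environ: Mapping[str, str] = os.environ,
) -> Set[str]:
    """Return the files whose recorded env var values differ from the current ones."""
    stale = set()
    for path, seen in snapshot.items():
        if any(environ.get(name) != value for name, value in seen.items()):
            stale.add(path)
    return stale
```

The snapshot would be saved alongside the partial parse results, so a later invocation only re-parses the files that `files_to_reparse` returns.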

If partial parsing is enabled and --vars change between runs, dbt will always re-parse.

What would we need to avoid re-parsing all files when --vars change? I imagine it's similar to the above! In both cases, if dbt could statically analyze which files directly or indirectly depend on the {{ env_var() }} and {{ var() }} macros, and compare their values across runs, it could know whether to re-parse certain files.
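As a rough sketch of that kind of static analysis: the regex below only catches calls with a literal first argument, and the `calls` mapping (which file invokes which macro file) is assumed to be supplied by the parser. All names here are hypothetical and none of this is dbt's actual API.

```python
import re
from typing import Dict, Set, Tuple

Ref = Tuple[str, str]  # (macro name, literal argument), e.g. ("var", "row_limit")

# Matches env_var('NAME') or var("name") when the first argument is a string literal.
_CALL = re.compile(r"""\b(env_var|var)\(\s*['"]([^'"]+)['"]""")


def direct_refs(source: str) -> Set[Ref]:
    """Find env_var()/var() calls that appear directly in one file's text."""
    return {(m.group(1), m.group(2)) for m in _CALL.finditer(source)}


def all_refs(sources: Dict[str, str], calls: Dict[str, Set[str]]) -> Dict[str, Set[Ref]]:
    """Combine each file's direct refs with those of every file it reaches via `calls`."""
    direct = {name: direct_refs(text) for name, text in sources.items()}

    def visit(name: str, seen: Set[str]) -> Set[Ref]:
        if name in seen:  # guard against cycles between macros
            return set()
        seen.add(name)
        refs = set(direct.get(name, set()))
        for callee in calls.get(name, ()):
            refs |= visit(callee, seen)
        return refs

    return {name: visit(name, set()) for name in sources}


def needs_reparse(refs: Dict[str, Set[Ref]], changed: Set[Ref]) -> Set[str]:
    """Files that depend, directly or indirectly, on any changed variable."""
    return {name for name, used in refs.items() if used & changed}
```

A dynamic argument such as `var(some_name)` would escape this regex, so a real implementation would probably need to fall back to a full re-parse for any file where it cannot resolve the argument statically.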

By default, partial_parse is set to false

What would need to change to turn this on for all projects by default? I imagine it's having good answers to env vars above. (Currently, dbt will return correct parsed results without speed improvements if --vars change, but there's a chance it returns incorrect results if env vars change.) There may be other things I'm missing as well.

Alternatively, we could distinguish between the two and say that env vars are a non-dbt construct, whereas --vars are a dbt construct for which it must take responsibility. It's up to the developer / deployment orchestrator to ensure that, if partial parse is being used, all env vars are consistent across the pickled and current runtimes.

@jtcohen6 added the enhancement, performance, and vars labels Feb 15, 2021
@jtcohen6
Contributor Author

See also: #2265, #2363

@jtcohen6
Contributor Author

jtcohen6 commented Mar 3, 2021

Also:

@jtcohen6
Contributor Author

jtcohen6 commented Jun 5, 2021

A number of these improvements are included in 0.20.0rc1. An additional slew of test cases and edge cases are detailed in #3371. Closing in favor of that issue.

@jtcohen6 closed this as completed Jun 5, 2021