Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Manifest instead of ParseResult [#3163] #3219

Merged
merged 1 commit into from
Apr 6, 2021
Merged

Conversation

gshank
Copy link
Contributor

@gshank gshank commented Apr 1, 2021

resolves #3163

Description

As part of the performance initiative we are working on saving more of the parsing state to be reused when partial parsing is turned on. This ticket switches to use a Manifest object instead of a ParseResult object to store parsing state. The eventual goal is to be able to reload a previously fully parsed Manifest.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@gshank gshank force-pushed the partial_parsing branch from 39024b1 to 39c7dbd Compare April 1, 2021 16:42
@cla-bot cla-bot bot added the cla:yes label Apr 1, 2021
@gshank gshank requested review from jtcohen6 and kwigley April 1, 2021 16:42
@gshank gshank changed the title Use Manifest instead of ParseResults [#3163] Use Manifest instead of ParseResult [#3163] Apr 1, 2021
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So cool! I left a few initial notes as I was trying to read through + understand the changes. Thank you for all the inline comments along the way :)

Comment on lines +382 to +384
# finally, we should hash the actual profile used, not just root project +
# profiles.yml + relevant args. While sufficient, it is definitely overkill.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it's not clear to me why we need the full profile information here. On the one hand, it is common for users to include special Jinja variables like {{ target.name }} in node configs, so we'd need to re-render those if the target changes. Ideally, we'd be able to take an approach like the one you outlined above for vars—so that changing target doesn't invalidate everything, just the nodes that depend on it.

Aside from those Jinja variables, I don't believe the parsed representation / internal manifest should depend on any part of the specific connection, except for the adapter/credentials type being used. Each adapter plugin contains macros, and it can add or override certain node configs, so parsing the same project with dbt-postgres vs. dbt-redshift would produce a genuinely different manifest. But, absent any target-aware logic, I think parsing the same project for one Redshift database vs. a different Redshift database should return the exact same manifest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a case where we're being overzealous, and fully re-parsing the project when we could make partial parsing smarter. Therefore, this is in the "slow + correct" bucket, rather than the "fast + incorrect" bucket. We shouldn't lose track of this, but for now, we should be prioritizing edge cases in the latter category.

Comment on lines +378 to +382
# TODO: this should be calculated per-file based on the vars() calls made in
# parsing, so changing one var doesn't invalidate everything. also there should
# be something like that for env_var - currently changing env_vars in way that
# impact graph selection or configs will result in weird test failures.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1000. Thanks for calling it out here!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was already, I just moved it :) But it's a good comment.

Comment on lines +339 to +342
process_sources(self.manifest, project_name)
process_refs(self.manifest, project_name)
process_docs(self.manifest, self.root_project)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it right, our next goal (in #3217) is to move these methods to before the partial-parsing save point. Today, dbt fully re-processes every single source, ref, and description, every single time. Ideally, partial-parsing should only be re-processing nodes that have changed—whether the change occurred in a .sql file, .yml file, or .md file associated with that node.

Do I have that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct.

@gshank gshank force-pushed the partial_parsing branch 2 times, most recently from 211f593 to e5edc57 Compare April 2, 2021 14:15
Copy link
Contributor

@kwigley kwigley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

def sanitized_update(
self,
source_file: SourceFile,
old_manifest: Any,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note, you can do something like

Suggested change
old_manifest: Any,
old_manifest: Optional["Manifest"],

here to refer to itself as a type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've seen code using the string class name but didn't think of it for here.

@kwigley
Copy link
Contributor

kwigley commented Apr 5, 2021

I'm not 100% on what is causing the Windows tests to fail, can we re-run it to see if it is just a transient error?

can we re-run it

I'm actually not 100% how to do this haha

@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 5, 2021

I'm not 100% on what is causing the Windows tests to fail, can we re-run it to see if it is just a transient error?

can we re-run it

I'm actually not 100% how to do this haha

I've re-run it twice, and it's failing reliably (though inscrutably) on this line:

test/unit/test_config.py::TestProject::test_cycle Windows fatal exception: stack overflow

@gshank gshank force-pushed the partial_parsing branch 3 times, most recently from 67c98ed to a132540 Compare April 6, 2021 16:54
@gshank gshank force-pushed the partial_parsing branch from a132540 to 307d47e Compare April 6, 2021 17:51
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good in local testing!

@gshank gshank merged commit 749f873 into develop Apr 6, 2021
@gshank gshank deleted the partial_parsing branch April 6, 2021 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor partial parsing to cover more startup costs
3 participants