docs/schema.yml parsing can be very slow at large scales #2480

Closed
beckjake opened this issue May 21, 2020 · 1 comment
Labels
bug Something isn't working

Comments

@beckjake
Contributor

Describe the bug

This is similar to #2474 (and comes out of the same issue). Docs parsing can be really slow if you have lots of docs and large numbers of schema.yml entries that reference them. There are a few things about it that are "slow":

  • we render a lot of fields, and rendering jinja is slow
    • This is hard, so I won't talk about it
  • we search for docs using doc() a lot, and searching for docs scales like garbage
    • I'll discuss this below
  • loading yaml can be slow
    • This isn't fixable, but we can minimize it for some users (see below)

Steps To Reproduce

  1. Make a project with lots of docs blocks. 50,000 is a nice number
  2. Make a corresponding number of models/sources and call doc() in their description fields, pointing at those docs
  3. Run dbt compile
  4. Wait a pretty long time
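
For anyone who wants to reproduce this without hand-writing 50,000 entries, a throwaway generator along these lines might help. It is only a sketch: `write_synthetic_project`, the `doc_N`/`model_N` names, and the single `docs.md`/`schema.yml` layout are all assumptions made for illustration, and it does not create the `dbt_project.yml` or profile that dbt also needs.

```python
from pathlib import Path


def write_synthetic_project(root: Path, count: int) -> None:
    """Write `count` docs blocks, trivial models, and schema.yml entries.

    Hypothetical helper: file names and layout are illustrative only.
    """
    models_dir = root / "models"
    models_dir.mkdir(parents=True, exist_ok=True)

    docs_blocks = []
    schema_lines = ["version: 2", "", "models:"]
    for i in range(count):
        # One docs block per model, using the {% docs %} ... {% enddocs %} syntax.
        docs_blocks.append(
            f"{{% docs doc_{i} %}}\nDescription for model {i}.\n{{% enddocs %}}\n"
        )
        # A trivial model so the schema.yml entry has something to describe.
        (models_dir / f"model_{i}.sql").write_text("select 1 as id\n")
        # The description field calls doc(), which is what triggers the search.
        schema_lines.append(f"  - name: model_{i}")
        schema_lines.append(f"    description: '{{{{ doc(\"doc_{i}\") }}}}'")

    (models_dir / "docs.md").write_text("\n".join(docs_blocks))
    (models_dir / "schema.yml").write_text("\n".join(schema_lines) + "\n")
```

Calling `write_synthetic_project(Path("my_project"), 50_000)` inside an existing dbt project directory and then running dbt compile should exercise the slow path described here.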

#2474 fixed a significant regression in 0.17.0, but I believe this behavior has existed for a long time (and this behavior still exists with that PR applied).

Expected behavior

I expect ~O(n) runtime scaling! I guess O(n*log(n)) I'd accept if we have to sort things, but I don't think we do.

System information

I believe this problem is cross-platform, cross-db, and cross-python. Newer python versions will probably be faster, but doing fewer things is better everywhere.

The output of dbt --version:

0.17.0rc1

Additional context

Scaling doc()

I haven't made any charts about how it scales, but eyeballing profile results it looks like each search makes O(n^2) calls (I expected O(n) here 😕 ). Since doc() gets called roughly once per doc, if you have n docs your total time spent will go as O(n^3). That's not great!

The search algorithm is bad because it's a very simple representation of the goal:

  • if a package is supplied, search for a package + doc name pair
  • if no package is supplied, search for the caller's package + doc name, then the active project package + doc name, then "anywhere" + doc name
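
To make the resolution order concrete, here is a minimal sketch of the two rules above written as a plain list scan. This is not dbt's actual code: `Doc`, `find_doc`, and `resolve_doc` are hypothetical names, and this version only does a linear pass per candidate, so it illustrates the lookup order rather than the O(n^2) behavior seen in the profile.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for a parsed docs block."""
    package: str
    name: str
    body: str


def find_doc(docs: List[Doc], name: str, package: Optional[str]) -> Optional[Doc]:
    """Linear scan; a package of None means "match any package"."""
    for doc in docs:
        if doc.name == name and (package is None or doc.package == package):
            return doc
    return None


def resolve_doc(
    docs: List[Doc],
    name: str,
    package: Optional[str],
    caller_package: str,
    root_package: str,
) -> Optional[Doc]:
    """Apply the search order described above."""
    if package is not None:
        # Explicit package: only that package + doc name pair is acceptable.
        candidates: List[Optional[str]] = [package]
    else:
        # No package: caller's package, then the active project, then anywhere.
        candidates = [caller_package, root_package, None]
    for candidate in candidates:
        found = find_doc(docs, name, candidate)
        if found is not None:
            return found
    return None
```

Every call walks the full list up to three times, so even this simple form costs O(n) per doc() call.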

We should make search constant-time if we can, and that will help a lot. I don't know how/if I'd hope to avoid calling doc() n times, so that's probably an O(n) minimum!

The easy thing to do to fix this is to create a Dict[DocName, Dict[ProjectName, ParsedDocumentation]] once, and then lookups can do their package-specific searches using a two-level lookup (name first, then package). I have some code in a branch that does this, but I want to think about it more.
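
A rough sketch of that two-level structure follows. It is an illustration under assumptions, not the code in the branch: the `ParsedDocumentation` dataclass here is a simplified stand-in with made-up fields, and `build_doc_index` and `lookup_doc` are hypothetical names. The "anywhere" fallback simply returns whichever package registered the name first.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ParsedDocumentation:
    """Simplified stand-in; field names are assumptions for this sketch."""
    package_name: str
    name: str
    block_contents: str


# Dict[DocName, Dict[ProjectName, ParsedDocumentation]]
DocIndex = Dict[str, Dict[str, ParsedDocumentation]]


def build_doc_index(docs: Iterable[ParsedDocumentation]) -> DocIndex:
    """Build the index once, in a single O(n) pass over all docs."""
    index: DocIndex = defaultdict(dict)
    for doc in docs:
        index[doc.name][doc.package_name] = doc
    return dict(index)


def lookup_doc(
    index: DocIndex,
    name: str,
    package: Optional[str],
    caller_package: str,
    root_package: str,
) -> Optional[ParsedDocumentation]:
    """Two-level lookup: doc name first, then package."""
    by_package = index.get(name)
    if not by_package:
        return None
    if package is not None:
        # Explicit package: exact name + package pair or nothing.
        return by_package.get(package)
    # No package: caller's package, then the active project.
    for candidate in (caller_package, root_package):
        if candidate in by_package:
            return by_package[candidate]
    # "Anywhere": any package defining the name (first one registered).
    return next(iter(by_package.values()))
```

Each lookup is a handful of dict accesses, so n calls to doc() would cost O(n) overall on average, plus the one-time O(n) cost of building the index.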

ref and source resolution behave pretty similarly to docs and probably have the same problems, but I've spent zero time looking at them.

Faster yaml loading

If you have a lot of sources, dbt is going to run yaml.safe_load a lot, and that can be slow. By default, PyYAML uses its pure-Python parser, but it also ships optional C bindings to libyaml. If libyaml is installed where PyYAML's build can find it (on Linux I think you only have to apt-get install libyaml-dev before you pip install), then yaml.CSafeLoader is available, and you can change yaml.safe_load(whatever) to yaml.load(whatever, Loader=yaml.CSafeLoader). The performance gains are substantial at scale.
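
Since not every install has the C bindings, one way to do this is to fall back to the pure-Python loader when yaml.CSafeLoader is missing. The snippet below is a sketch of that pattern; `load_yaml_text` is a hypothetical wrapper name, not an existing dbt function.

```python
import yaml

try:
    # Only importable when PyYAML was built against libyaml.
    from yaml import CSafeLoader as SafeLoader
except ImportError:
    # Pure-Python fallback: same safe-loading behavior, just slower.
    from yaml import SafeLoader


def load_yaml_text(contents: str):
    """Safely parse a YAML string with the fastest available safe loader."""
    return yaml.load(contents, Loader=SafeLoader)
```

Users without libyaml see no behavior change, and users with it get the faster loader without any configuration.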

@beckjake beckjake added the bug Something isn't working label May 21, 2020
@drewbanin drewbanin added this to the Octavius Catto milestone May 21, 2020
beckjake added a commit that referenced this issue May 22, 2020
@drewbanin
Contributor

closed by #2481
