docs/schema.yml parsing can be very slow at large scales #2480

beckjake · 2020-05-21T16:59:49Z

Describe the bug

This is similar to #2474 (and comes out of the same issue). Docs parsing can be really slow if you have lots of docs and large numbers of schema.yml entries that reference them. There are a few things about it that are "slow":

- we render a lot of fields, and rendering jinja is slow
  - This is hard, so I won't talk about it
- we search for docs using doc() a lot, and searching for docs scales like garbage
  - I'll discuss this below
- loading yaml can be slow
  - This isn't fixable, but we can minimize it for some users (see below)

Steps To Reproduce

1. Make a project with lots of docs blocks. 50,000 is a nice number
2. Make a corresponding number of models/sources and call doc() in their description fields, pointing at those docs
3. Run dbt compile
4. Wait a pretty long time
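
For anyone who wants to reproduce this, here is a rough sketch of a generator script. It is not part of dbt; the function name `generate_project`, the file names, and the `doc_N` / `model_N` naming are all made up for illustration. It assumes you already have a dbt project and only fills its `models/` directory with one docs block, one trivial model, and one schema.yml entry per item.

```python
from pathlib import Path


def generate_project(root: Path, count: int) -> None:
    """Write `count` docs blocks, models, and schema.yml entries under root/models.

    Hypothetical helper for reproducing the slowdown; names are illustrative.
    """
    models_dir = root / "models"
    models_dir.mkdir(parents=True, exist_ok=True)

    docs_blocks = []
    schema_lines = ["version: 2", "", "models:"]
    for i in range(count):
        doc_name = f"doc_{i}"
        model_name = f"model_{i}"
        # One docs block per model, in dbt's {% docs %} ... {% enddocs %} syntax.
        docs_blocks.append(
            f"{{% docs {doc_name} %}}\nDescription for {model_name}.\n{{% enddocs %}}\n"
        )
        # The description field calls doc() so parsing has to resolve the block.
        schema_lines.append(f"  - name: {model_name}")
        schema_lines.append(f"    description: '{{{{ doc(\"{doc_name}\") }}}}'")
        # A trivial model so the schema.yml entry has something to describe.
        (models_dir / f"{model_name}.sql").write_text("select 1 as id\n")

    (models_dir / "docs.md").write_text("\n".join(docs_blocks))
    (models_dir / "schema.yml").write_text("\n".join(schema_lines) + "\n")
```

Calling `generate_project(Path("path/to/project"), 50_000)` and then running dbt compile in that project should exercise the slow path; start with a smaller count to sanity-check the output first.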

#2474 fixed a significant regression in 0.17.0, but I believe this behavior has existed for a long time (and this behavior still exists with that PR applied).

Expected behavior

I expect ~O(n) runtime scaling! I guess O(n*log(n)) I'd accept if we have to sort things, but I don't think we do.

System information

I believe this problem is cross-platform, cross-db, and cross-python. Newer python versions will probably be faster, but doing fewer things is better everywhere.

The output of dbt --version:

0.17.0rc1

Additional context

Scaling `doc()`

I haven't made any charts about how it scales, but eyeballing profile results it looks like each search is making O(n^2) calls (I expected O(n) here 😕), so if you have n docs, and roughly n doc() calls to resolve them, your time spent will go as O(n^3). That's not great!

The search algorithm is bad because it's a very simple representation of the goal:

- if a package is supplied, search for a package + doc name pair
- if no package is supplied, search for the caller's package + doc name, then the active project package + doc name, then "anywhere" + doc name

We should make search constant-time if we can, and that will help a lot. I don't know how/if I'd hope to avoid calling doc() n times, so that's probably an O(n) minimum!

The easy thing to do to fix this is to create a Dict[DocName, Dict[ProjectName, ParsedDocumentation]] once, and then lookups can do their package-specific searches using a two-level lookup (name first, then package). I have some code in a branch that does this, but I want to think about it more.
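
To make the idea concrete, here is a minimal sketch of the two-level lookup, following the search order described above. It is not the code from my branch: `Doc` is a hypothetical stand-in for ParsedDocumentation, and `DocIndex`, `find`, and the parameter names are invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Doc:
    """Hypothetical stand-in for ParsedDocumentation."""
    package_name: str
    name: str
    block_contents: str


class DocIndex:
    """Index docs once as name -> package -> doc, so each lookup is O(1)."""

    def __init__(self, docs: Iterable[Doc]) -> None:
        self._by_name: Dict[str, Dict[str, Doc]] = {}
        for doc in docs:
            self._by_name.setdefault(doc.name, {})[doc.package_name] = doc

    def find(
        self,
        name: str,
        package: Optional[str],
        current_project: str,
        node_package: str,
    ) -> Optional[Doc]:
        candidates = self._by_name.get(name)
        if not candidates:
            return None
        if package is not None:
            # Explicit package: only an exact package + name pair matches.
            return candidates.get(package)
        # No package: caller's package first, then the active project.
        for pkg in (node_package, current_project):
            if pkg in candidates:
                return candidates[pkg]
        # "Anywhere": fall back to the first package that defines the name.
        return next(iter(candidates.values()))


docs = [
    Doc("root", "orders", "root orders"),
    Doc("pkg_a", "orders", "pkg_a orders"),
    Doc("pkg_b", "customers", "pkg_b customers"),
]
index = DocIndex(docs)
```

Building the index is a single O(n) pass, and each `find` call does a fixed number of dict lookups regardless of how many docs exist, so n calls to doc() cost O(n) overall.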

ref and source resolution behave pretty similarly to docs and probably have the same problems, but I've spent zero time looking at them.

Faster yaml loading

If you have a lot of sources, dbt is going to run yaml.safe_load a lot, and that can be slow. By default, PyYAML runs as pure python, but it has built-in support for a C extension module. If you have libyaml installed and PyYAML was built against it (on linux I think you only have to apt-get install libyaml-dev before you pip install), then yaml.CSafeLoader is available, and you can change yaml.safe_load(whatever) to yaml.load(whatever, Loader=yaml.CSafeLoader). The performance gains are substantial at scale.
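
Since yaml.CSafeLoader only exists when PyYAML was built with libyaml, any change along these lines needs a fallback. A minimal sketch, assuming PyYAML is installed; the wrapper name `load_yaml_text` is made up here and this is not necessarily how dbt does it:

```python
import yaml

try:
    # Only importable when PyYAML was built against libyaml.
    from yaml import CSafeLoader as SafeLoader
except ImportError:
    # Pure-Python fallback: same safe behavior, just slower.
    from yaml import SafeLoader


def load_yaml_text(contents: str):
    """Safely parse a YAML string, using the C loader when it is available."""
    return yaml.load(contents, Loader=SafeLoader)
```

Users without libyaml get exactly the behavior they have today, and users with it get the faster loader without any configuration.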


drewbanin · 2020-06-10T13:10:51Z

closed by #2481

beckjake added the bug Something isn't working label May 21, 2020

drewbanin added this to the Octavius Catto milestone May 21, 2020

beckjake mentioned this issue May 21, 2020

Improve docs performance (#2480) #2481

Merged

4 tasks

beckjake added a commit that referenced this issue May 22, 2020

Merge pull request #2481 from fishtown-analytics/fix/docs-perf

6dac4c7

Improve docs performance (#2480)

drewbanin closed this as completed Jun 10, 2020
