Describe the bug
This is similar to #2474 (and comes out of the same issue). Docs parsing can be really slow if you have lots of docs and large numbers of schema.yml entries that reference them. There are a few things about it that are "slow":
- we render a lot of fields, and rendering jinja is slow (this is hard, so I won't talk about it here)
- we search for docs using doc() a lot, and searching for docs scales like garbage (discussed below)
- loading yaml can be slow (this isn't fixable, but we can minimize it for some users; see below)
Steps To Reproduce
1. Make a project with lots of docs blocks; 50,000 is a nice number.
2. Make a corresponding number of models/sources whose description fields call doc() pointing at those docs (see the generator sketch below).
3. Run dbt compile.
4. Wait a pretty long time.
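For concreteness, here is a minimal throwaway generator, entirely hypothetical (the file names, counts, and layout are mine, not from the issue), that writes the docs blocks, a matching schema.yml, and trivial model files into a dbt project:

```python
# Throwaway repro generator (hypothetical, not from the issue): writes N docs
# blocks, N trivial models, and a schema.yml wiring every description through
# doc(). Run it from the root of an otherwise-empty dbt project.
from pathlib import Path

N = 50_000
models = Path("models")
models.mkdir(exist_ok=True)

# docs.md: one {% docs name %} ... {% enddocs %} block per doc
docs_md = "\n".join(
    f"{{% docs doc_{i} %}}\ndescription number {i}\n{{% enddocs %}}"
    for i in range(N)
)
(models / "docs.md").write_text(docs_md)

# schema.yml: one model entry per doc, each description resolved via doc()
lines = ["version: 2", "models:"]
for i in range(N):
    lines.append(f"  - name: model_{i}")
    lines.append(f"    description: '{{{{ doc(\"doc_{i}\") }}}}'")
(models / "schema.yml").write_text("\n".join(lines) + "\n")

# trivial model files so the schema.yml entries have something to patch
for i in range(N):
    (models / f"model_{i}.sql").write_text("select 1 as id\n")
```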
#2474 fixed a significant regression in 0.17.0, but I believe this behavior has existed for a long time (and it still exists with that PR applied).
Expected behavior
I expect roughly O(n) runtime scaling! I'd accept O(n log n) if we have to sort things, but I don't think we do.
System information
I believe this problem is cross-platform, cross-db, and cross-python. Newer python versions will probably be faster, but doing fewer things is better everywhere.
The output of dbt --version:
0.17.0rc1
Additional context
Scaling doc()
I haven't made any charts of how it scales, but eyeballing profile results, it looks like the search algorithm is making O(n^2) calls (I expected O(n) here 😕), so if you have n docs your time spent will scale as O(n^3). That's not great!
The search algorithm is bad because it's a very simple representation of the goal:
- if a package is supplied, search for that package + doc name pair
- if no package is supplied, search for the caller's package + doc name, then the active project's package + doc name, then "anywhere" + doc name
We should make the search constant-time if we can, and that will help a lot. I don't see how we'd avoid calling doc() n times, so O(n) is probably the floor!
The easy fix is to build a Dict[DocName, Dict[ProjectName, ParsedDocumentation]] once, and then lookups can do their package-specific searches as a two-level dict lookup (name first, then package). I have some code in a branch that does this, but I want to think about it more.
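A minimal sketch of that shape (illustrative names only, not dbt's actual internals): build the nested dict once, then each doc() resolution is a couple of constant-time dict probes that preserve the package-precedence rules above.

```python
# Illustrative sketch only (names are mine, not dbt's internals): index the
# parsed docs once, then resolve each doc() call with O(1) dict lookups.
from typing import Dict, Iterable, Optional


class ParsedDocumentation:
    """Stand-in for dbt's parsed documentation node."""
    def __init__(self, package_name: str, name: str, contents: str):
        self.package_name = package_name
        self.name = name
        self.contents = contents


# doc name -> package name -> doc
DocIndex = Dict[str, Dict[str, ParsedDocumentation]]


def build_index(docs: Iterable[ParsedDocumentation]) -> DocIndex:
    index: DocIndex = {}
    for doc in docs:
        index.setdefault(doc.name, {})[doc.package_name] = doc
    return index


def find_doc(
    index: DocIndex,
    name: str,
    calling_package: str,
    root_package: str,
    package: Optional[str] = None,
) -> Optional[ParsedDocumentation]:
    by_package = index.get(name)
    if by_package is None:
        return None
    if package is not None:
        # explicit package: only that package + name pair can match
        return by_package.get(package)
    # no package: caller's package, then the active project, then "anywhere"
    for candidate in (calling_package, root_package):
        if candidate in by_package:
            return by_package[candidate]
    return next(iter(by_package.values()), None)
```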
ref and source resolution behave pretty similarly to docs and probably have the same problem, but I've spent zero time looking at them.
Faster yaml loading
If you have a lot of sources, dbt is going to run yaml.safe_load a lot, and that can be slow. By default, PyYAML is pure python, but it ships an optional C extension backed by libyaml. If libyaml is available when PyYAML is built (on linux I think you only have to apt-get install libyaml-dev before you pip install), then yaml.CSafeLoader exists, and you can change yaml.safe_load(whatever) to yaml.load(whatever, Loader=yaml.CSafeLoader). The performance gains are substantial at scale.
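Concretely, the swap is small; a safe pattern (my suggestion, not necessarily how dbt would wire it up) is to fall back to the pure-python loader when the C extension isn't available:

```python
import yaml

# Prefer the libyaml-backed C loader when PyYAML was built against it,
# falling back to the pure-python SafeLoader otherwise.
try:
    from yaml import CSafeLoader as SafeLoader
except ImportError:
    from yaml import SafeLoader


def load_yaml(contents: str):
    return yaml.load(contents, Loader=SafeLoader)
```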