Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project #6073

Closed
2 tasks done
china-cse opened this issue Oct 16, 2022 · 25 comments
Closed
2 tasks done
Labels
bug Something isn't working enhancement New feature or request performance stale Issues that have gone stale

Comments

@china-cse
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

In case project has more than 2K-3K nodes(includes models, tests etc), dbt build taking time to start its first model. Just to build a single model itself takes 15 mins. Have disabled analytics tracking event thought that could have caused but still no luck. However, dbt run is faster but I can't go with just run as tests required immediately before proceed to next model in DAG.

Expected Behavior

Models could be more depends project size but building single model not supposed to take 15 mins and it does not look realistic when we have multiple dbt commands to run.

Steps To Reproduce

version: dbt 1.0 till latest release 1.3

total nodes count: 3K+ including models, tests, snapshots, seeds

build single model:
$dbt build -s test_model

Relevant log output

No response

Environment

- OS: Windows/DBT Cloud
- Python: 3.7.4
- dbt: 1.2

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@china-cse china-cse added bug Something isn't working triage labels Oct 16, 2022
@github-actions github-actions bot changed the title [Bug] Performance issues with DBT build when nodes more than 2K in a project [CT-1353] [Bug] Performance issues with DBT build when nodes more than 2K in a project Oct 16, 2022
@jtcohen6
Copy link
Contributor

Thanks for opening @china-cse!

This is an issue we saw after introducing the dbt build command last year. The fact is that build requires constructing a more complex DAG, with tests as additional upstream dependencies, versus the simpler DAG used for other commands. After the v0.21 release, we decided the right move was to create genuinely different DAGs for build and non-build commands.

Some ideas we could explore to improve performance here:

  • A more-performant networkx algorithm (in the vein of Perf improvement to subgraph selection #4155)
  • Caching the networkx DAG between invocations — with cache invalidation for project code / config changes (like partial parsing), or when switching between build and non-build commands
  • Where possible, perform certain types of node selection before adding test edges to the graph. This either wouldn't work with graph operators, or it would represent a functional change, if we resolve graph operators before adding test edges. (Imagine a case with model_a --> some_test and model_a --> model_b. What should dbt build --select some_test+ resolve to? Today, it includes model_b; if we reordered the logic so graph selection happens before the some_test --> model_b edge is added, it no longer would.) But it could make a big difference in the case of dbt build --select single_model.

These startup costs are no fun anywhere, but they're especially painful in development, when you're trying to build a single model. As a larger topic, we are thinking a lot about the experience for large dbt deployments, which might want to look less like 2-3k nodes in a single project, and more like several projects with several hundred each. There's some lively and ongoing discussion over in #5244.

@jtcohen6 jtcohen6 added enhancement New feature or request performance Team:Execution and removed bug Something isn't working triage labels Oct 16, 2022
@china-cse
Copy link
Author

thank you @jtcohen6 for the response and references to ongoing discussions.

Have gone through code snippet where its taking time to build graph.. I'm intrigued to see below block implementation where its removing all nodes/edges instead of just taking subgraph using networkx module. Is there any specific reason its implemented in such way using loops and going through each nodes check and remove.

image

As per my understanding of the same logic, it can be achieved through just using subgraph like below.. It just faster and simpler code. have tested this, and it works faster and achieves same result. Would this be amended with logic or you see any issues?

image

Thank you!!!

@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 17, 2022

@china-cse This is great! Someone from the Core-Execution team can take a look at the proposed logic, when they have some free bandwidth. If it's strictly better/faster at achieving the same result, I see no reason not to go with it.

@jtcohen6 jtcohen6 added the support_rotation A good task to pick up during support rotation label Oct 17, 2022
@china-cse
Copy link
Author

Great @jtcohen6 , thank you!!

@china-cse
Copy link
Author

@jtcohen6, Just to remind.. I think this fix should be expedited.. in case organizations reaching the point beyond expected avg nodes count ~2K, you receive lot of performance issue requests..

@will-sargent-dbtlabs
Copy link

will-sargent-dbtlabs commented Oct 27, 2022

Agree! This is definitely affecting customers with very large projects.

@leahwicz
Copy link
Contributor

@iknox-fa after the Hackathon concludes and you are on Support Rotation Monday, could you take a look at the above proposed solution and see if that would work for us please?

@iknox-fa
Copy link
Contributor

Hi @china-cse, thanks for the bugreport and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn't do exactly what we need it to. It removes all unconnected nodes after selection has occurred, whereas what we need to do is re-construct the original graph creating the edges between nodes that were already there.

As an example, if we applied the logic proposed here like so:

>>> import networkx as nx
>>> G = nx.path_graph(8)
>>> list(G.edges)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
>>> H = G.subgraph([1,2,3,5,7])
>>> list(H.edges)
[(1, 2), (2, 3)]

As you can see we've removed nodes 5 and 7 from our graph even though they were selected!

Here's what we were expecting to happen:

>>> I = Graph(nx.DiGraph(G))
<dbt.graph.graph.Graph object at 0x10f3a9940>
>>> J = I.get_subset_graph([1,2,3,5,7])
>>> list(J.graph.edges)
[(1, 2), (2, 3), (3, 5), (5, 7)]

Now interestingly enough that's not what we get today-- instead we get:

[(1, 2), (2, 1), (2, 3), (3, 2), (3, 5), (5, 3), (5, 7), (7, 5)]

As you can see, we have an extra set of edges being generated pointed in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.

Also, as noted in the last time I worked on this code-- we might get a better result if we leveraged some DAG specific algo work.

@iknox-fa
Copy link
Contributor

iknox-fa commented Nov 1, 2022

@jtcohen6 I'm adding the bug label to this one since there is an unexpected result in the graph edge creation step (although at worse it only effects performance, not correctness).

@iknox-fa iknox-fa added the bug Something isn't working label Nov 1, 2022
@jeremyyeo
Copy link
Contributor

To add to Will's point - this one is a blocker for development on "large projects" (we do have one such example that we've brought up internally) - you tweak a model and you do dbt build -s +my_changed_model and it takes 10-15 minutes just to get to "Acquiring connection" - basically in 1 hour development cycle - you can only run/test your dbt project ~4 times (because you have to wait for so long).

@china-cse
Copy link
Author

china-cse commented Nov 13, 2022

Hi @china-cse, thanks for the bugreport and the effort put into researching a solution! Unfortunately, in this case graph.subgraph doesn't do exactly what we need it to. It removes all unconnected nodes after selection has occurred, whereas what we need to do is re-construct the original graph creating the edges between nodes that were already there.

As an example, if we applied the logic proposed here like so:

>>> import networkx as nx
>>> G = nx.path_graph(8)
>>> list(G.edges)
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
>>> H = G.subgraph([1,2,3,5,7])
>>> list(H.edges)
[(1, 2), (2, 3)]

As you can see we've removed nodes 5 and 7 from our graph even though they were selected!

Here's what we were expecting to happen:

>>> I = Graph(nx.DiGraph(G))
<dbt.graph.graph.Graph object at 0x10f3a9940>
>>> J = I.get_subset_graph([1,2,3,5,7])
>>> list(J.graph.edges)
[(1, 2), (2, 3), (3, 5), (5, 7)]

Now interestingly enough that's not what we get today-- instead we get:

[(1, 2), (2, 1), (2, 3), (3, 2), (3, 5), (5, 3), (5, 7), (7, 5)]

As you can see, we have an extra set of edges being generated pointed in the opposite direction. This definitely represents a bug that I can try to take a closer look at tomorrow.

Also, as noted in the last time I worked on this code-- we might get a better result if we leveraged some DAG specific algo work.

Hi @iknox-fa

thanks for pointing the issue, my bad I was checking only single node when tested.

I had looked into further on the approach. Looping through whole graph to find all selected nodes & removing one by one takes cost & time. It would be good if we trim of all unnecessary nodes from Graph before we start building new Graph based on the selection.

I worked on below approach, this takes less time in finding nodes.. and tested with 10K nodes.. seems its giving faster results.

this solution requires additional function (lets say "trim_graph()") for scalability purpose which main purpose is to remove unnecessary nodes/paths except direct paths

approach is as follow..

for example, graph as below

nodes: [0,1,2,3,4,5,6,7,8]

edges: [(0,1),(1,2),(1,3),(2,4),(2,6),(3,4),(4,5),(5,6),(6,7),(5,8)]

image

new trim-graph function::

scenario-1:

lets say, selection is [4].. its pretty straightforward in finding parents & children and remove all nodes except selected..

trim-graph outcome:
nodes: [4]
edges: []

cost: O(f(n)) --> its a best short in finding single nodes as its very cost effective.. cost is O(1)

scenario-2:

lets say, selection is [2,5].. will loop through selected nodes and remove unnecessary parents and children falls outside of direct paths

trim-graph outcome:
nodes: [2,4,5]
edges: [(2,4),(4,5)]

cost: O(3*f(n)) --> its thrice the cost of above scenario, it may increase as selection nodes increases.. so average cost would be O(cf(n)) if c specifies number of nodes.

scenario-3:

worst case scenario, selection is [0,7]... here not much to trim except node 8 as this falls outside direct path from 0 -> 7

trim-graph outcome:
nodes: [1,2,3,4,5,6,7]
edges: [(0,1),(1,2),(1,3),(2,4),(2,6),(3,4),(4,5),(5,6),(6,7)]

cost: O(7*f(n)) --> its almost like selecting whole graph

Once above function completes, will continue with existing logic to remove not required nodes based on selection.

this logic fixes bug mentioned by you as well...

please let me know if I can work on this change..

thank you!!

@ChenyuLInx
Copy link
Contributor

@china-cse trimming the graph sounds like a good idea to me!
I think You should continue with it and open a PR if you are interested!
I am wondering how do you plan to do the trimming part? Doing a round of BFS from nodes on two sides of the DAG, remove node that isn't selected and stop at selected node?

@misteliy
Copy link

Dear all, thanks for looking into this because for us this started to become a bottleneck as well. In our company we currently face 18k nodes and thus this becomes unbearable. @china-cse what is the current speed up with the new solution? @jtcohen6 is this something that will find its way to be released soon? Is there something we can help with or should start to engage? Thank you

@china-cse
Copy link
Author

Hi @misteliy,

Awaiting dbt-core team review on pull request. It’s been a while I don’t see any update. Let me try update and ask for the status.

thanks

@simonloach
Copy link

@china-cse I had created a PR for some changes that I wanted to suggest: china-cse#1. I had verified and the algorithm works as intended in the case of linear DAG. Will look into writing a test suite with more complex DiGraphs and test performance.

@china-cse
Copy link
Author

thanks @simonloach for the update.. let me know if any issues... thank you!!

@jtcohen6 jtcohen6 removed the support_rotation A good task to pick up during support rotation label Jan 15, 2023
@boxysean
Copy link
Contributor

boxysean commented Jan 20, 2023

I have a client who is interested in this PR and are struggling with dbt build run durations. (I'm not yet sure if this is impacting their dev, or prod, and what kind of business impact it has for them.) They reported to me that they have about 10K-15K nodes in their project (models+tests) and reported a startup time of about 15 minutes before the first model is built (similar to @china-cse's report). I also understand that dbt Core v1.4 will be released soon, so I would like to assist in profiling the performance of @china-cse's work.

Summary

I did not observe any speed gain from this PR. I tried dbt-core version 1.3.2 with and without the patch from this PR, and observed the same result on a synthetic DAG I created.

That does not mean that this change doesn't improve performance, it just means that I wasn't able to observe the performance gains claimed. I would like to run this test again on a DAG in the same shape (not necessarily the same code) as the DAGs that claim to have a 15 minute startup time.

Benchmark tests

Methodology

I set up a test environment on my laptop with dbt-core version 1.3.2 (editable repo) and dbt-duckdb version 0.6.1. I used these versions because I could relatively easily benchmark dbt build end-to-end with DuckDB within my own system, and dbt-duckdb currently requires dbt-core version <=1.3.2. Then, on a Test DAG (described below), I ran the command dbt build using my local dbt core version 1.3.2 with and without the patch to core/dbt/graph/graph.py (patch file) in this PR.

Test DAG: I built a synthetic DAG in the shape of a binary tree and 1023 models, each model with 5 "no-op" generic tests. The models had SELECT 1 table materializations with SQL-commented refs to their children models, and "no-op" tests were SELECT 1 WHERE 1=0. See this project generation code.

I used hyperfine for command line benchmarking.

Test 1: Vanilla dbt-core v1.3.2

Steps to recreate:

# within a pyenv
git clone [email protected]:dbt-labs/dbt-core.git
cd dbt-core
git checkout v1.3.2 -b v1.3.2
pip install -e ./core && pip install -e ./plugins/postgres && pip install -e ./tests/adapter
pip install duckdb
cd ../my_project
hyperfine --runs 5 --export-json vanilla_v1.3.2.json 'dbt build'  # this vanilla v1.3.2

Results:

➜  my_project $ hyperfine --runs 5 --export-json vanilla_v1.3.2.json 'dbt build'  # this vanilla v1.3.2
Benchmark 1: dbt build
  Time (mean ± σ):     297.611 s ± 16.962 s    [User: 273.805 s, System: 27.789 s]
  Range (min … max):   281.686 s … 326.369 s    5 runs

Test 2: dbt-core v1.3.2, with core/dbt/graph/graph.py patch applied

Steps make repo (within same):

# within same pyenv
cd dbt-core
git remote add china-cse [email protected]:china-cse/dbt-core-local.git
git fetch -a
git checkout dbt-build-performance-fix
git diff v1.3.2 -- core/dbt/graph/graph.py > ../dbt-build-performance-fix.patch
git checkout v1.3.2 -b v1.3.2-with-dbt-build-performance-fix
git apply ../dbt-build-performance-fix.patch 
cd ../my_project
hyperfine --warmup 1 --runs 5 'dbt build' # this is with performance fix

Results:

➜  my_project $ hyperfine --warmup 1 --runs 5 'dbt build' # this is with performance fix
Benchmark 1: dbt build
  Time (mean ± σ):     321.686 s ± 21.727 s    [User: 293.544 s, System: 29.333 s]
  Range (min … max):   292.003 s … 347.179 s    5 runs

@boxysean
Copy link
Contributor

boxysean commented Jan 22, 2023

Hi all, I propose a small improvement in #6694.

For anyone curious, here are some visualizations of where dbt-core is spending its time during a dbt compile on a very large DAG of 8468 models, 17103 tests. Hopefully this gives everyone on this thread ideas where to spend optimization efforts.

Main thread:

main-thread

Other threads:

other_threads

I'm personally curious why the graph building code is multithreaded, does anyone know if there are performance gains from that? We see the main thread locked and waiting for the subthread due to the GIL; I don't believe there are I/O wait times in the subthread which would benefit runtime.

@ttusing
Copy link
Contributor

ttusing commented Mar 20, 2023

Hello - I have a PR to increase dbt build performance. I added a trimming strategy similar to what was described here by @china-cse

The strategy is a bit different than described, I first remove any nodes with degree 0 or 1 (iteratively), then sort the remaining nodes by degree, starting with the lowest, when determining whether to remove them, such that a minimal number of edges are created when removing nodes.

This decreased my build time when selecting a single model from 2 minutes 18 seconds to 10 seconds.

I think this can be slightly improved further by removing any nodes with 0 incoming or outgoing edges, which is a bit broader a definition than degree 1 and will probably make a big difference on large and interconnected DAGs. I'm not familiar enough with networkx to know how to do that easily/efficiently.

Issue: #7195
PR: #7194

@kostek-pl
Copy link
Contributor

kostek-pl commented Apr 5, 2023

Hi Tobie
I've checked your version against our dbt project(11251 models and 34993 tests) and indeed I see a significant improvement.
Time between running a single model build and when this single model is being processed decreased from 7 mins to 3 mins

@ttusing
Copy link
Contributor

ttusing commented Apr 5, 2023

Hi Tobie I've checked your version against our dbt project(11251 models and 34993 tests) and indeed I see a significant improvement. Time between running a single model build and when this single model is being processed decreased from 7 mins to 3 mins

awesome!

Could you post your findings in #7195?

@ttusing
Copy link
Contributor

ttusing commented Apr 8, 2023

Hi Tobie I've checked your version against our dbt project(11251 models and 34993 tests) and indeed I see a significant improvement. Time between running a single model build and when this single model is being processed decreased from 7 mins to 3 mins

The PR is merged to main and is in likely for 1.5-rc1 which, peeking around, looks like will be released on Monday :).

I have some ideas on how to optimize get_subset_graph even further (see https://github.com/ttusing/dbt-core/pull/3/files). If everyone can share their runtimes on 1.5 and where their runs are hanging, we can iterate further. I think it would be cool to build a community understanding of how many nodes DBT can handle in <1 minute on common hardware, would help me inform how to design my DBT projectors.

@ttusing
Copy link
Contributor

ttusing commented May 18, 2023

@china-cse @kostek-pl @boxysean do you consider this issue to still be open? Can you share your project build times in dbt 1.5?

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Feb 18, 2024
@peterallenwebb
Copy link
Contributor

I'm closing this for now, since the problematic performance issues identified are in older versions of dbt, and we believe they are probably resolved. We are always ready and willing to look into this further, if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request performance stale Issues that have gone stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.