-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize presto SQL Lab query performance. #5132
Optimize presto SQL Lab query performance. #5132
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5132 +/- ##
==========================================
- Coverage 77.51% 77.48% -0.03%
==========================================
Files 44 44
Lines 8734 8737 +3
==========================================
Hits 6770 6770
- Misses 1964 1967 +3
Continue to review full report at Codecov.
|
Oh wow. Thanks for digging in and improving this. @timifasubaa @john-bodley |
superset/db_engine_specs.py
Outdated
@@ -744,6 +744,12 @@ def handle_cursor(cls, cursor, query, session): | |||
break | |||
|
|||
if stats: | |||
state = stats.get('state', None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: None
is implicit in dict.get
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, how stupid was I.
Thanks @xiaohanyu for looking into this. For your example above to do know how much time is spent writing/reading the data to/from S3 after the query has completed. |
Thanks for working on this. I'm curious what the baseline comparison for the 2-5x speedup was. How complex was the query that saw a 2-5x speedup. |
By stop polling when presto query already finished. When user make queries to Presto via SQL Lab, presto will run the query and then it can return all data back to superset in one shot. However, the default implementation of superset has enabled a default polling for presto to: - Get the fancy progress bar - Get the data back when the query finished. However, the polling implementation of superset is not right. I've done a profiling with a table of 1 billion rows, here're some data: - Total number of rows: 1.02 Billion - SQL Lab query limit: 1 million - Output Data: 1.5 GB - Superset memory consumed: about 10-20 GB - Time: 7 minutes to finish in Presto, takes additional 15 minutes for superset to get and store data. The problems with default issue is, even if presto has finished the query (7 minutes with above profiling), superset still do lots of wasted polling, in above profiling, superset sent about 540 polling in total, and at half of the polling is not necessary. Part of the simplied polling response: ``` { "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u", "id": "20180525_042715_03742_nza9u", "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11", "stats": { "state": "FINISHED", "queuedSplits": 21701, "progressPercentage": 35.98264191882267, "elapsedTimeMillis": 1029, "nodes": 116, "completedSplits": 15257, "scheduled": true, "wallTimeMillis": 2571904, "peakMemoryBytes": 0, "processedBytes": 40825519532, "processedRows": 47734066, "queuedTimeMillis": 0, "queued": false, "cpuTimeMillis": 849228, "rootStage": { "state": "FINISHED", "queuedSplits": 0, "nodes": 1, "totalSplits": 17, "processedBytes": 16829644, "processedRows": 11495, "completedSplits": 17, "stageId": "0", "done": true, "cpuTimeMillis": 69, "subStages": [ { "state": "CANCELED", "queuedSplits": 21701, "nodes": 116, "totalSplits": 42384, "processedBytes": 40825519532, "processedRows": 47734066, "completedSplits": 15240, "stageId": "1", "done": true, "cpuTimeMillis": 849159, "subStages": [], "wallTimeMillis": 2570374, "userTimeMillis": 730020, "runningSplits": 5443 } ], "wallTimeMillis": 1530, "userTimeMillis": 50, "runningSplits": 0 }, "totalSplits": 42401, "userTimeMillis": 730070, "runningSplits": 5443 } } } ``` Superset will terminate the polling when it finds that `nextUri` becomes none, but actually, when `["stats"]["state"] == "FINISHED"`, it means that presto has already finished the query and superset can stop polling and get the data back. After this simple optimization, we get a 2-5x performance boost for Presto SQL Lab queries.
a0f5802
to
e7040b0
Compare
Hi, @john-bodley , @timifasubaa , In general, the technical process of a SQL Lab presto query is as follows:
Step C, D, E, F occupies 90% time of the whole cycle, and Step C, E, F is not feabible for easy optimizations. My optimization focus on step D, and get rid of 80% of unnecessary pollings, which, as you can see, lead to a 2-5x improvement for end-to-end performance. I've added some debug info to the default presto polling process, and here's a sample debugging output for a query with 1 million output rows: In our environment, this query takes Presto 7 minutes to finish, and takes additional 15 minutes for superset to get the final data. In Step D, it send about 540 queries just to polling the progress, and the polling will end when So my patch just utilize this kind of private internal presto API, and removes the unnecessary polling queries. Here's a benchmark report from our own production server (please ignore that lumos is our internal branch name): I've also test the consistency of data before and after patch and it works to be the same data result. One potential risk is, I'm not an expert of Presto, so the internal json response maybe changed in Presto, and this may be break. |
Thanks @xiaohanyu for the detailed explanation. The reason I asked in @timifasubaa was looking into improvements to phases F and G per #4834. |
By stop polling when presto query already finished. When user make queries to Presto via SQL Lab, presto will run the query and then it can return all data back to superset in one shot. However, the default implementation of superset has enabled a default polling for presto to: - Get the fancy progress bar - Get the data back when the query finished. However, the polling implementation of superset is not right. I've done a profiling with a table of 1 billion rows, here're some data: - Total number of rows: 1.02 Billion - SQL Lab query limit: 1 million - Output Data: 1.5 GB - Superset memory consumed: about 10-20 GB - Time: 7 minutes to finish in Presto, takes additional 15 minutes for superset to get and store data. The problems with default issue is, even if presto has finished the query (7 minutes with above profiling), superset still do lots of wasted polling, in above profiling, superset sent about 540 polling in total, and at half of the polling is not necessary. Part of the simplied polling response: ``` { "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u", "id": "20180525_042715_03742_nza9u", "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11", "stats": { "state": "FINISHED", "queuedSplits": 21701, "progressPercentage": 35.98264191882267, "elapsedTimeMillis": 1029, "nodes": 116, "completedSplits": 15257, "scheduled": true, "wallTimeMillis": 2571904, "peakMemoryBytes": 0, "processedBytes": 40825519532, "processedRows": 47734066, "queuedTimeMillis": 0, "queued": false, "cpuTimeMillis": 849228, "rootStage": { "state": "FINISHED", "queuedSplits": 0, "nodes": 1, "totalSplits": 17, "processedBytes": 16829644, "processedRows": 11495, "completedSplits": 17, "stageId": "0", "done": true, "cpuTimeMillis": 69, "subStages": [ { "state": "CANCELED", "queuedSplits": 21701, "nodes": 116, "totalSplits": 42384, "processedBytes": 40825519532, "processedRows": 47734066, "completedSplits": 15240, "stageId": "1", "done": true, "cpuTimeMillis": 849159, "subStages": [], "wallTimeMillis": 2570374, "userTimeMillis": 730020, "runningSplits": 5443 } ], "wallTimeMillis": 1530, "userTimeMillis": 50, "runningSplits": 0 }, "totalSplits": 42401, "userTimeMillis": 730070, "runningSplits": 5443 } } } ``` Superset will terminate the polling when it finds that `nextUri` becomes none, but actually, when `["stats"]["state"] == "FINISHED"`, it means that presto has already finished the query and superset can stop polling and get the data back. After this simple optimization, we get a 2-5x performance boost for Presto SQL Lab queries. (cherry picked from commit b71f551)
By stop polling when presto query already finished. When user make queries to Presto via SQL Lab, presto will run the query and then it can return all data back to superset in one shot. However, the default implementation of superset has enabled a default polling for presto to: - Get the fancy progress bar - Get the data back when the query finished. However, the polling implementation of superset is not right. I've done a profiling with a table of 1 billion rows, here're some data: - Total number of rows: 1.02 Billion - SQL Lab query limit: 1 million - Output Data: 1.5 GB - Superset memory consumed: about 10-20 GB - Time: 7 minutes to finish in Presto, takes additional 15 minutes for superset to get and store data. The problems with default issue is, even if presto has finished the query (7 minutes with above profiling), superset still do lots of wasted polling, in above profiling, superset sent about 540 polling in total, and at half of the polling is not necessary. Part of the simplied polling response: ``` { "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u", "id": "20180525_042715_03742_nza9u", "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11", "stats": { "state": "FINISHED", "queuedSplits": 21701, "progressPercentage": 35.98264191882267, "elapsedTimeMillis": 1029, "nodes": 116, "completedSplits": 15257, "scheduled": true, "wallTimeMillis": 2571904, "peakMemoryBytes": 0, "processedBytes": 40825519532, "processedRows": 47734066, "queuedTimeMillis": 0, "queued": false, "cpuTimeMillis": 849228, "rootStage": { "state": "FINISHED", "queuedSplits": 0, "nodes": 1, "totalSplits": 17, "processedBytes": 16829644, "processedRows": 11495, "completedSplits": 17, "stageId": "0", "done": true, "cpuTimeMillis": 69, "subStages": [ { "state": "CANCELED", "queuedSplits": 21701, "nodes": 116, "totalSplits": 42384, "processedBytes": 40825519532, "processedRows": 47734066, "completedSplits": 15240, "stageId": "1", "done": true, "cpuTimeMillis": 849159, "subStages": [], "wallTimeMillis": 2570374, "userTimeMillis": 730020, "runningSplits": 5443 } ], "wallTimeMillis": 1530, "userTimeMillis": 50, "runningSplits": 0 }, "totalSplits": 42401, "userTimeMillis": 730070, "runningSplits": 5443 } } } ``` Superset will terminate the polling when it finds that `nextUri` becomes none, but actually, when `["stats"]["state"] == "FINISHED"`, it means that presto has already finished the query and superset can stop polling and get the data back. After this simple optimization, we get a 2-5x performance boost for Presto SQL Lab queries. (cherry picked from commit b71f551)
By stop polling when presto query already finished. When user make queries to Presto via SQL Lab, presto will run the query and then it can return all data back to superset in one shot. However, the default implementation of superset has enabled a default polling for presto to: - Get the fancy progress bar - Get the data back when the query finished. However, the polling implementation of superset is not right. I've done a profiling with a table of 1 billion rows, here're some data: - Total number of rows: 1.02 Billion - SQL Lab query limit: 1 million - Output Data: 1.5 GB - Superset memory consumed: about 10-20 GB - Time: 7 minutes to finish in Presto, takes additional 15 minutes for superset to get and store data. The problems with default issue is, even if presto has finished the query (7 minutes with above profiling), superset still do lots of wasted polling, in above profiling, superset sent about 540 polling in total, and at half of the polling is not necessary. Part of the simplied polling response: ``` { "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u", "id": "20180525_042715_03742_nza9u", "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11", "stats": { "state": "FINISHED", "queuedSplits": 21701, "progressPercentage": 35.98264191882267, "elapsedTimeMillis": 1029, "nodes": 116, "completedSplits": 15257, "scheduled": true, "wallTimeMillis": 2571904, "peakMemoryBytes": 0, "processedBytes": 40825519532, "processedRows": 47734066, "queuedTimeMillis": 0, "queued": false, "cpuTimeMillis": 849228, "rootStage": { "state": "FINISHED", "queuedSplits": 0, "nodes": 1, "totalSplits": 17, "processedBytes": 16829644, "processedRows": 11495, "completedSplits": 17, "stageId": "0", "done": true, "cpuTimeMillis": 69, "subStages": [ { "state": "CANCELED", "queuedSplits": 21701, "nodes": 116, "totalSplits": 42384, "processedBytes": 40825519532, "processedRows": 47734066, "completedSplits": 15240, "stageId": "1", "done": true, "cpuTimeMillis": 849159, "subStages": [], "wallTimeMillis": 2570374, "userTimeMillis": 730020, "runningSplits": 5443 } ], "wallTimeMillis": 1530, "userTimeMillis": 50, "runningSplits": 0 }, "totalSplits": 42401, "userTimeMillis": 730070, "runningSplits": 5443 } } } ``` Superset will terminate the polling when it finds that `nextUri` becomes none, but actually, when `["stats"]["state"] == "FINISHED"`, it means that presto has already finished the query and superset can stop polling and get the data back. After this simple optimization, we get a 2-5x performance boost for Presto SQL Lab queries.
By stop polling when presto query already finished. When user make queries to Presto via SQL Lab, presto will run the query and then it can return all data back to superset in one shot. However, the default implementation of superset has enabled a default polling for presto to: - Get the fancy progress bar - Get the data back when the query finished. However, the polling implementation of superset is not right. I've done a profiling with a table of 1 billion rows, here're some data: - Total number of rows: 1.02 Billion - SQL Lab query limit: 1 million - Output Data: 1.5 GB - Superset memory consumed: about 10-20 GB - Time: 7 minutes to finish in Presto, takes additional 15 minutes for superset to get and store data. The problems with default issue is, even if presto has finished the query (7 minutes with above profiling), superset still do lots of wasted polling, in above profiling, superset sent about 540 polling in total, and at half of the polling is not necessary. Part of the simplied polling response: ``` { "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u", "id": "20180525_042715_03742_nza9u", "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11", "stats": { "state": "FINISHED", "queuedSplits": 21701, "progressPercentage": 35.98264191882267, "elapsedTimeMillis": 1029, "nodes": 116, "completedSplits": 15257, "scheduled": true, "wallTimeMillis": 2571904, "peakMemoryBytes": 0, "processedBytes": 40825519532, "processedRows": 47734066, "queuedTimeMillis": 0, "queued": false, "cpuTimeMillis": 849228, "rootStage": { "state": "FINISHED", "queuedSplits": 0, "nodes": 1, "totalSplits": 17, "processedBytes": 16829644, "processedRows": 11495, "completedSplits": 17, "stageId": "0", "done": true, "cpuTimeMillis": 69, "subStages": [ { "state": "CANCELED", "queuedSplits": 21701, "nodes": 116, "totalSplits": 42384, "processedBytes": 40825519532, "processedRows": 47734066, "completedSplits": 15240, "stageId": "1", "done": true, "cpuTimeMillis": 849159, "subStages": [], "wallTimeMillis": 2570374, "userTimeMillis": 730020, "runningSplits": 5443 } ], "wallTimeMillis": 1530, "userTimeMillis": 50, "runningSplits": 0 }, "totalSplits": 42401, "userTimeMillis": 730070, "runningSplits": 5443 } } } ``` Superset will terminate the polling when it finds that `nextUri` becomes none, but actually, when `["stats"]["state"] == "FINISHED"`, it means that presto has already finished the query and superset can stop polling and get the data back. After this simple optimization, we get a 2-5x performance boost for Presto SQL Lab queries.
By stop polling when presto query already finished.
When user make queries to Presto via SQL Lab, presto will run the query
and then it can return all data back to superset in one shot.
However, the default implementation of superset has enabled a default
polling for presto to:
However, the polling implementation of superset is not right.
I've done a profiling with a table of 1 billion rows, here're some data:
superset to get and store data.
The problems with default issue is, even if presto has finished the
query (7 minutes with above profiling), superset still do lots of wasted
polling, in above profiling, superset sent about 540 polling in total,
and at half of the polling is not necessary.
Part of the simplied polling response:
Superset will terminate the polling when it finds that
nextUri
becomes none, but actually, when
["stats"]["state"] == "FINISHED"
,it means that presto has already finished the query and superset can stop
polling and get the data back.
After this simple optimization, we get a 2-5x performance boost for
Presto SQL Lab queries.