Query enter Finishing Stage after data returned #16397
Just saw the same behavior: a query "completed" all data transactions (1 GB level, on a cluster with TB-scale memory) but stayed stuck in "Finishing" status until it hit the time limit and the query failed. The same query gets through in seconds on a much larger cluster (e.g. 3x worker nodes). We recently upgraded from 365 to 405.
We're seeing this in 395 (upgrading from 370). It's affecting some of our large processes. Does anybody know how frequent this is, what triggers it, or how to avoid it? Also, is it fixed in 409, or was that just the latest version at the time of writing?
That was just the latest version I could test at the time; I can still hit this issue in 409.
We traced it down to some specific client code on our end. The code works against 370 but hangs against 395. So far, we've only seen it with Python clients using the fetch_one() function; there may be some less-than-ideal connection/cursor management that triggers it. Switching to fetch_all() seems to mitigate it, but we have ~1000 users, so we're still hesitant to release. We'd like to trace down the origin of the behavior change and revert it for our newer versions.
Here's the commit we believe is relevant: b3d38c3. The title is "Finish query only after the results are fully consumed", which is pretty on-point, and it came in with 391. Given there have been 20+ releases since then, and most of our huge user base has not complained after months of this being in our non-prod, we think it's probably a rare edge case with fetch_one() in Python being used awkwardly. If you can mention what code triggered it in your case, it would be nice to see whether it is the same one or a different one.
We are able to reproduce the problem consistently by doing the following in the Python driver (presto or trino driver); the code is something like:

```python
with closing(get_conn()) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute(sql)
        return cur.fetchone()
```

The query will remain in FINISHING state until it becomes an abandoned query. That said, it seems like fetchone was never intended to be used as a way to grab only the first row (and ignore the rest).
Yeah, we found the same. Another person at our work found that MySQL connection code around the net uses this pattern a lot, so I do think it is common. Here's some code I have to replicate/mitigate (very simplistic):

```python
# Works w/ any SQL returning a single record, e.g. count or limit 1.
def fetch_one(self, sql):
    conn = self.__get_connection()
    cur = conn.cursor()
    cur.execute(sql)
    item = cur.fetchone()
    cur.fetchone()  # take this out and the query stays in FINISHING forever
    cur.close()
    conn.close()
    return item
```
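The workaround above can be generalized so it also works for multi-row results: drain the cursor after grabbing the first row, so the driver fetches every batch and the server can move the query out of FINISHING. A minimal sketch, demonstrated here with sqlite3 purely because it is a self-contained DB-API 2.0 driver; the same shape should apply to the trino/presto clients, whose connection setup is omitted:

```python
import sqlite3
from contextlib import closing

def fetch_one(conn, sql):
    """Return the first row of `sql`, draining the rest of the result set."""
    with closing(conn.cursor()) as cur:
        cur.execute(sql)
        item = cur.fetchone()
        cur.fetchall()  # drain remaining rows so the result set is fully consumed
        return item

# Demo against an in-memory database standing in for a real connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
print(fetch_one(conn, "SELECT x FROM t ORDER BY x"))  # (1,)
```

Note the `fetchall()` call discards the remaining rows, so this is only sensible when the extra rows are small or the query is already limited; otherwise a `LIMIT 1` in the SQL is the cheaper fix.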
Have a Slack thread going on this, referencing this issue; will update here if anyone replies. https://trinodb.slack.com/archives/CGB0QHWSW/p1681738401506199?thread_ts=1681490101.786149&cid=CGB0QHWSW
Issue also happens with …
From the commit I called out earlier, it says the following. So my bet is that changes made to help with "spooling" for fault tolerance had some possibly unintended effects on normal execution/delivery.
@choyrim found this in Slack:
So, assuming you're using an older Trino client or Presto client (like we were), the TL;DR here is: (1) Trino behavior on the server side changed; (2) it exposed a likely long-standing Trino/Presto client bug; (3) so the updated Trino client is needed to avoid this, or you have to code around it on the Python side. I'd probably recommend closing this issue given this info.
We can see that a similar fix is not available in presto-python-client on master (as of 2023/04/17): https://github.com/prestodb/presto-python-client/blob/master/prestodb/dbapi.py#L302-L310
Yeah, that client is for the Facebook fork (PrestoDB), so they're *very* diverged now. I definitely wouldn't expect it to implement a fix for a change in the Trino engine. The old Trino was the "presto-sql" fork, and even that was very diverged.
We're also facing this issue for large enough responses with the Trino Python client. In our case, when the first batch of results is returned, the client stops consuming data from the server, the query gets stuck in the "FINISHING" state, and the memory is not released. After taking a look, we noticed that the issue might come from this line: it looks like if at least one row is returned, the query execution is considered finished. We removed the condition …
If the client code on your side is not consuming results via … The client blocks … Closing, since this is not a bug.
(emphasis mine)
Hi, I find that from version 391 through version 409, some queries enter a Finishing stage after the data are returned. Since the data is returned, our application has already stopped requesting from Trino; therefore, after a few minutes, the query times out because nothing is polling it anymore. I don't have this issue in version 390, but once I upgrade to 391 or beyond, it occurs. Does anyone have a comment on this situation? Thank you.