Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query enter Finishing Stage after data returned #16397

Closed
Theo-Lo opened this issue Mar 7, 2023 · 15 comments
Closed

Query enter Finishing Stage after data returned #16397

Theo-Lo opened this issue Mar 7, 2023 · 15 comments

Comments

@Theo-Lo
Copy link

Theo-Lo commented Mar 7, 2023

Hi, I find that since version 391 till version 409, there are some queries that will enter a finishing stage somehow after the data are returned. Since data is returned, our application already stopped requesting from Trino. Therefore, after few minutes, the query will timeout as no one is visiting for that query anymore. I don't have this issue in version 390 but once I upgrade to 391 or beyond, this issue will occur. May I know if anyone has comment on this situation? Thank you.
image

@charlessong-lyft
Copy link

just saw the same behavior where query "completed" all data transactions (1GB level with TB memory cluster) but the query stuck at "Finishing" status until it reached time limit and failed the query. The query can get through in a much larger cluster (e.g. 3x worker nodes) in seconds

we recently upgraded from 365 to 405.

@johnwhumphreys
Copy link
Contributor

We're seeing this in 395 (upgrading from 370). It's affecting some of our large processes.

Anybody know how frequent this is, what triggers it, or how to avoid it? Also, is it fixed in 409, or was that just the latest version at the time of writing?

@Theo-Lo
Copy link
Author

Theo-Lo commented Apr 17, 2023

We're seeing this in 395 (upgrading from 370). It's affecting some of our large processes.

Anybody know how frequent this is, what triggers it, or how to avoid it? Also, is it fixed in 409, or was that just the latest version at the time of writing?

That was just the latest version I can test at that time, so I can still meet this issue in 409

@johnwhumphreys
Copy link
Contributor

We traced it down to some specific client code on our end. But the code works Vs 370 and hangs vs 395.

Thus far, we’ve only seen it with python clients using fetch_one() function. There may be some less-than-ideal connection / cursor management that triggers it.

switching to fetch_all() seems to mitigate. But we have ~1000 users, so still hesitant to release. Would like to trace down origin of behavior change and revert it for our newer versions.

@johnwhumphreys
Copy link
Contributor

johnwhumphreys commented Apr 17, 2023

Here's the relevant commit we believe - b3d38c3. Title is "Finish query only after the results are fully consumed", which is pretty on-point and it comes in on 391.

Given there has been 20+ releases since then and most of our huge user base has not complained after months of this being in our non-prod, we think it's probably a rare side-case w/ fetch_one() in python being used awkwardly. If you can mention what code triggered it in your case, would be nice to see if it is the same one or a different one.

@choyrim
Copy link

choyrim commented Apr 17, 2023

We are able to reproduce the problem consistently by doing the following in the python driver (presto or trino driver):

  • open connection
  • open cursor
  • execute query on a cursor
  • fetchone() on the cursor
  • close cursor
  • close connection

code is something like:

        with closing(get_conn()) as conn:
            with closing(conn.cursor()) as cur:
                cur.execute(sql)
                return cur.fetchone()

the query will remain in FINISHING state until it becomes an abandoned query.

That said, it seems like fetchone was never intended to be used as a way to grab only the first row (and ignore the rest).

@johnwhumphreys
Copy link
Contributor

johnwhumphreys commented Apr 17, 2023

Yeah we found the same. Calling cur.fetchone() a second time makes it finish (assuming the result set is one record). It seems to want to run until the iterator returns None.

But another person at our work found that MySQL connection code uses this pattern around the net a lot, so I do think it is common.

Here's some code I have to replicate/mitigate (very simplistic).

# Works w/ any SQL with a single record, e.g. count or limit 1.
def fetch_one(self, sql):
    conn = self.__get_connection()
    cur = conn.cursor()
    cur.execute(sql)
    item = cur.fetchone()
    cur.fetchone() # take out this and it stays in finishing forever.
    cur.close()
    conn.close()
    return item

@johnwhumphreys
Copy link
Contributor

Have a slack thread going on this referencing this issue, will update here if anyone replies. https://trinodb.slack.com/archives/CGB0QHWSW/p1681738401506199?thread_ts=1681490101.786149&cid=CGB0QHWSW

@johnwhumphreys
Copy link
Contributor

Issue also happens with fetchmany(1) on single record result sets.

@johnwhumphreys
Copy link
Contributor

From the commit I called out earlier, it says the following.

So, my bet is that changes made to help with "spooling" for fault tolerance had some possibly unintended changes on normal execution/delivery.

This commit makes it consistent with QueryState#FINISHED documentation.

This change is needed to allow spooling exchange to be used for query results delivery. Spooling exchanges in fault tolerant execution are closed on query completion. It is important not to close an exchange before the results are fully consumed.

@johnwhumphreys
Copy link
Contributor

johnwhumphreys commented Apr 18, 2023

@choyrim found this in slack -

The trino driver (read: python client) already addressed this after 0.315.0 - trinodb/trino-python-client@ed2203f

So, assuming you're using an older Trino client or presto client (like we were), the TL;DR here is (1) Trino behavior on server side changed, (2) it exposed a likely old trino/presto client bug, (3) so the updated trino client is needed to avoid this, or you have to code around it on the python side with fetchall or whatever.

I'd probably recommend closing this issue given this info.

@choyrim
Copy link

choyrim commented Apr 18, 2023

we can see that a similar fix is not available in presto-python-client on master (as of 2023/04/17)

https://github.com/prestodb/presto-python-client/blob/master/prestodb/dbapi.py#L302-L310

@johnwhumphreys
Copy link
Contributor

Yeah, that client is for the Facebook fork (PrestoDB), so they’re *very diverged now. Definitely wouldn’t expect it to implement a feature for a change on the Trino engine. The old trino was “presto-sql” fork, and even that was very diverged.

@ttq186
Copy link

ttq186 commented Oct 23, 2023

We're also facing this issue for large enough responses with the Trino Python client. In our case, when the first part of results are returned, the client stop consuming data from server then the queries are stuck at the "FINISHING" state and the memory is not released. After taking a look, we noticed that this issue might come from this line. It looks like if at least one row is returned, the query execution is considered to be finished. We removed the condition len(self._result.rows) == 0 and things work normally for us. We're not sure if this might cause any side effect.

@hashhar
Copy link
Member

hashhar commented Oct 23, 2023

In our case, when the first part of results are returned, the client stop consuming data from server then the queries are stuck at the "FINISHING" state and the memory is not released.

If the client code on your side is not consuming results via fetchall/fetchmany/fetchone then that's expected.

The client blocks the execute call until at-least one rows is available or query is finished. After that initial row being fetched it's still user's responsibility to keep fetching the results or cancel the query explicitly.

Closing since this is not a bug.

From https://github.com/trinodb/trino-python-client/blob/e4249f48402d1b9dfc585f19bd854d21ae12e6dd/CHANGES.md#breaking-changes-1

Block the execute method of the cursor until at least one row is received. Users no longer need to call fetchone or fetchall to ensure query starts executing on the Trino server. Note that results still need to be consumed by calling fetchone or fetchall to ensure that a query isn't considered idle and terminated on the server. (trinodb/trino-python-client#232)

(emphasis mine)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

6 participants