Query enter Finishing Stage after data returned #16397

Theo-Lo · 2023-03-07T01:58:19Z

Hi, I find that since version 391 till version 409, there are some queries that will enter a finishing stage somehow after the data are returned. Since data is returned, our application already stopped requesting from Trino. Therefore, after few minutes, the query will timeout as no one is visiting for that query anymore. I don't have this issue in version 390 but once I upgrade to 391 or beyond, this issue will occur. May I know if anyone has comment on this situation? Thank you.

charlessong-lyft · 2023-03-27T21:30:09Z

just saw the same behavior where query "completed" all data transactions (1GB level with TB memory cluster) but the query stuck at "Finishing" status until it reached time limit and failed the query. The query can get through in a much larger cluster (e.g. 3x worker nodes) in seconds

we recently upgraded from 365 to 405.

johnwhumphreys · 2023-04-14T16:31:42Z

We're seeing this in 395 (upgrading from 370). It's affecting some of our large processes.

Anybody know how frequent this is, what triggers it, or how to avoid it? Also, is it fixed in 409, or was that just the latest version at the time of writing?

Theo-Lo · 2023-04-17T02:01:28Z

We're seeing this in 395 (upgrading from 370). It's affecting some of our large processes.

Anybody know how frequent this is, what triggers it, or how to avoid it? Also, is it fixed in 409, or was that just the latest version at the time of writing?

That was just the latest version I can test at that time, so I can still meet this issue in 409

johnwhumphreys · 2023-04-17T10:48:40Z

We traced it down to some specific client code on our end. But the code works Vs 370 and hangs vs 395.

Thus far, we’ve only seen it with python clients using fetch_one() function. There may be some less-than-ideal connection / cursor management that triggers it.

switching to fetch_all() seems to mitigate. But we have ~1000 users, so still hesitant to release. Would like to trace down origin of behavior change and revert it for our newer versions.

johnwhumphreys · 2023-04-17T13:08:56Z

Here's the relevant commit we believe - b3d38c3. Title is "Finish query only after the results are fully consumed", which is pretty on-point and it comes in on 391.

Given there has been 20+ releases since then and most of our huge user base has not complained after months of this being in our non-prod, we think it's probably a rare side-case w/ fetch_one() in python being used awkwardly. If you can mention what code triggered it in your case, would be nice to see if it is the same one or a different one.

choyrim · 2023-04-17T13:17:39Z

We are able to reproduce the problem consistently by doing the following in the python driver (presto or trino driver):

open connection
open cursor
execute query on a cursor
fetchone() on the cursor
close cursor
close connection

code is something like:

        with closing(get_conn()) as conn:
            with closing(conn.cursor()) as cur:
                cur.execute(sql)
                return cur.fetchone()

the query will remain in FINISHING state until it becomes an abandoned query.

That said, it seems like fetchone was never intended to be used as a way to grab only the first row (and ignore the rest).

johnwhumphreys · 2023-04-17T13:30:09Z

Yeah we found the same. Calling cur.fetchone() a second time makes it finish (assuming the result set is one record). It seems to want to run until the iterator returns None.

But another person at our work found that MySQL connection code uses this pattern around the net a lot, so I do think it is common.

Here's some code I have to replicate/mitigate (very simplistic).

# Works w/ any SQL with a single record, e.g. count or limit 1.
def fetch_one(self, sql):
    conn = self.__get_connection()
    cur = conn.cursor()
    cur.execute(sql)
    item = cur.fetchone()
    cur.fetchone() # take out this and it stays in finishing forever.
    cur.close()
    conn.close()
    return item

johnwhumphreys · 2023-04-17T13:33:55Z

Have a slack thread going on this referencing this issue, will update here if anyone replies. https://trinodb.slack.com/archives/CGB0QHWSW/p1681738401506199?thread_ts=1681490101.786149&cid=CGB0QHWSW

johnwhumphreys · 2023-04-17T13:51:25Z

Issue also happens with fetchmany(1) on single record result sets.

johnwhumphreys · 2023-04-17T14:02:05Z

From the commit I called out earlier, it says the following.

So, my bet is that changes made to help with "spooling" for fault tolerance had some possibly unintended changes on normal execution/delivery.

This commit makes it consistent with QueryState#FINISHED documentation.

This change is needed to allow spooling exchange to be used for query results delivery. Spooling exchanges in fault tolerant execution are closed on query completion. It is important not to close an exchange before the results are fully consumed.

johnwhumphreys · 2023-04-18T00:47:01Z

@choyrim found this in slack -

The trino driver (read: python client) already addressed this after 0.315.0 - trinodb/trino-python-client@ed2203f

So, assuming you're using an older Trino client or presto client (like we were), the TL;DR here is (1) Trino behavior on server side changed, (2) it exposed a likely old trino/presto client bug, (3) so the updated trino client is needed to avoid this, or you have to code around it on the python side with fetchall or whatever.

I'd probably recommend closing this issue given this info.

choyrim · 2023-04-18T01:23:47Z

we can see that a similar fix is not available in presto-python-client on master (as of 2023/04/17)

https://github.com/prestodb/presto-python-client/blob/master/prestodb/dbapi.py#L302-L310

johnwhumphreys · 2023-04-19T10:53:12Z

Yeah, that client is for the Facebook fork (PrestoDB), so they’re *very diverged now. Definitely wouldn’t expect it to implement a feature for a change on the Trino engine. The old trino was “presto-sql” fork, and even that was very diverged.

ttq186 · 2023-10-23T13:43:29Z

We're also facing this issue for large enough responses with the Trino Python client. In our case, when the first part of results are returned, the client stop consuming data from server then the queries are stuck at the "FINISHING" state and the memory is not released. After taking a look, we noticed that this issue might come from this line. It looks like if at least one row is returned, the query execution is considered to be finished. We removed the condition len(self._result.rows) == 0 and things work normally for us. We're not sure if this might cause any side effect.

hashhar · 2023-10-23T15:47:29Z

In our case, when the first part of results are returned, the client stop consuming data from server then the queries are stuck at the "FINISHING" state and the memory is not released.

If the client code on your side is not consuming results via fetchall/fetchmany/fetchone then that's expected.

The client blocks the execute call until at-least one rows is available or query is finished. After that initial row being fetched it's still user's responsibility to keep fetching the results or cancel the query explicitly.

Closing since this is not a bug.

From https://github.com/trinodb/trino-python-client/blob/e4249f48402d1b9dfc585f19bd854d21ae12e6dd/CHANGES.md#breaking-changes-1

Block the execute method of the cursor until at least one row is received. Users no longer need to call fetchone or fetchall to ensure query starts executing on the Trino server. Note that results still need to be consumed by calling fetchone or fetchall to ensure that a query isn't considered idle and terminated on the server. (trinodb/trino-python-client#232)

(emphasis mine)

hashhar closed this as completed Oct 23, 2023

dwolfeu mentioned this issue Nov 21, 2024

Curious Trino retry behaviour #22989

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query enter Finishing Stage after data returned #16397

Query enter Finishing Stage after data returned #16397

Theo-Lo commented Mar 7, 2023

charlessong-lyft commented Mar 27, 2023

johnwhumphreys commented Apr 14, 2023

Theo-Lo commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023 •

edited

Loading

choyrim commented Apr 17, 2023 •

edited

Loading

johnwhumphreys commented Apr 17, 2023 •

edited

Loading

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 18, 2023 •

edited

Loading

choyrim commented Apr 18, 2023

johnwhumphreys commented Apr 19, 2023

ttq186 commented Oct 23, 2023 •

edited

Loading

hashhar commented Oct 23, 2023

Query enter Finishing Stage after data returned #16397

Query enter Finishing Stage after data returned #16397

Comments

Theo-Lo commented Mar 7, 2023

charlessong-lyft commented Mar 27, 2023

johnwhumphreys commented Apr 14, 2023

Theo-Lo commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023 • edited Loading

choyrim commented Apr 17, 2023 • edited Loading

johnwhumphreys commented Apr 17, 2023 • edited Loading

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 17, 2023

johnwhumphreys commented Apr 18, 2023 • edited Loading

choyrim commented Apr 18, 2023

johnwhumphreys commented Apr 19, 2023

ttq186 commented Oct 23, 2023 • edited Loading

hashhar commented Oct 23, 2023

johnwhumphreys commented Apr 17, 2023 •

edited

Loading

choyrim commented Apr 17, 2023 •

edited

Loading

johnwhumphreys commented Apr 17, 2023 •

edited

Loading

johnwhumphreys commented Apr 18, 2023 •

edited

Loading

ttq186 commented Oct 23, 2023 •

edited

Loading