
libpq: another command is already in progress #177

Open
Chratho opened this issue Feb 18, 2016 · 22 comments

Comments

@Chratho

Chratho commented Feb 18, 2016

Every once in a while I am noting "another command is already in progress" errors from libpq. The respective application manages database connections in a resource pool.

I am now wondering whether the application sometimes puts broken connections (with PGASYNC_BUSY still set) back into the pool, letting a subsequent access to the connection fail.

The implementation of exec contains the following line:

_ <- PQ.consumeInput h -- FIXME?

Is ignoring the result really okay here? When calling PQgetResult afterwards, you might not be able to tell whether the result means fetching is simply done or that an error occurred. If at least something was already received, an application might continue without even noticing.

I am not sure whether this really is an issue, but currently it's the only cause I can think of for the above exception.

@Chratho changed the title from "Possible issue with execute_" to "libpq: another command is already in progress" on Feb 18, 2016
@lpsmith
Owner

lpsmith commented Feb 18, 2016

Well, your hypothesis seems quite plausible to me. I don't have any opinion on whether or not this could be caused by ignoring the result of PQconsumeInput. I don't recall what that result is supposed to mean, off the top of my head.

Out of curiosity, does your application use Database.PostgreSQL.Simple.Internal at all? Also, does it ever use Database.PostgreSQL.Simple.Copy?

@Chratho
Author

Chratho commented Feb 19, 2016

The application does not use Copy at all. It also doesn't import Database.PostgreSQL.Simple.Internal directly, but of course we use some of the functions re-exported from that module: there are some rare uses of execute_, and of course connectPostgreSQL (which uses execute_ itself) is called by the pool frequently.

The exception occurs in situations of high load, with lots of concurrent connections.

But to be honest, a bug in postgresql-simple is only one possible explanation. I might very well be missing code fragments where exceptions are caught before they ever reach the pool's withResource handlers - leaving no chance to throw the affected resources away. Though I haven't found any yet...
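
For reference, this is roughly the discard-on-exception handling that resource-pool's withResource already performs internally; a minimal sketch with a made-up name (withConnDiscardOnError), using Data.Pool's lower-level takeResource / putResource / destroyResource. The worry above is precisely that if some handler swallows the exception inside the action, this layer never sees it and the broken connection goes back into the pool:

import Control.Exception (mask, onException)
import Data.Pool (Pool, takeResource, putResource, destroyResource)
import Database.PostgreSQL.Simple (Connection)

-- Take a connection, destroy it if the action throws, otherwise return it to the pool.
withConnDiscardOnError :: Pool Connection -> (Connection -> IO a) -> IO a
withConnDiscardOnError pool act = mask $ \restore -> do
  (conn, localPool) <- takeResource pool
  r <- restore (act conn) `onException` destroyResource pool localPool conn
  putResource localPool conn
  return r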

@lpsmith
Owner

lpsmith commented Feb 19, 2016

Right, I don't care if you are using things from Internal that are exported elsewhere; I just wanted to get some feeling for the likelihood that this issue was created via use of Internal, especially the withConnection operator and the Connection data constructor. (Though there are a number of things exported there that aren't exported elsewhere and that should be perfectly harmless, so importing Internal isn't necessarily a red flag.)

This might be the error you would get if you started a Copy operation and then attempted to execute something else before the copy operation is complete... but you aren't using copy.

Your second hypothesis also seems quite plausible. Assuming we are interpreting the error message correctly, I'd be at a loss to come up with a third hypothesis: possibly excepting the Copy module, any successfully completed command should leave the connection in the ready state, although as you suggest, some wonky exception handling on your part could result in an attempt to reuse a connection that should have been discarded. (Though, if that is the case, there might be a strong argument that postgresql-simple should be doing something slightly different when it raises the exception.)

libpq is a bit opaque regarding the correct use of its async interface, so it's quite possibly a bug in postgresql-simple. (Not to mention that several of the postgresql core developers seem to think that async is old technology that's no longer relevant in the face of threads, argh.)

So, I agree with your reasoning so far. Keep me updated if you learn something more.

As for some experiments, you might try changing this line to #if 1 if you aren't overly dependent on the async functionality. (Some of my projects need async for timeouts and whatnot. Though, I am curious about the relative performance of async on highly loaded servers, which I haven't really tested as I don't need that right now.)

Another experiment, whose results could be much more interesting, is to modify exec to not ignore the result of PQconsumeInput. The documentation suggests that the most appropriate thing to do when it returns false would probably be to throw an exception carrying the result of PQerrorMessage. It may also make sense to throw away the connection. You may want to create your own exception type to make the exception easier to find in your logs.
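
For what it's worth, a minimal sketch of that experiment, assuming the postgresql-libpq binding (consumeInput, errorMessage) and a made-up exception type so it is easy to spot in logs; this is not the library's actual exec code:

import Control.Exception (Exception, throwIO)
import qualified Data.ByteString.Char8 as B8
import qualified Database.PostgreSQL.LibPQ as PQ

newtype ConsumeInputFailed = ConsumeInputFailed String
  deriving Show

instance Exception ConsumeInputFailed

-- Throw instead of silently discarding a failed PQconsumeInput.
checkedConsumeInput :: PQ.Connection -> IO ()
checkedConsumeInput h = do
  ok <- PQ.consumeInput h
  if ok
    then return ()
    else do
      msg <- maybe "" B8.unpack <$> PQ.errorMessage h
      throwIO (ConsumeInputFailed msg)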

@hesselink

hesselink commented Jun 28, 2016

I'm running into this message. It could be that I'm doing something wrong, so let me first describe what I'm doing:

I'm using postgresql-simple (through opaleye) to run a query, essentially coming down to withTransaction conn (queryWith_ rp conn q). The connection comes from a connection pool (using resource-pool). This query is potentially long-running, and it is wrapped in timeout. If this timeout fires, I get the error.

I've added some logging to postgresql-simple, and the error occurs when trying to roll back the transaction (in withTransactionMode). I've tried both your experiments above. Switching off the async functionality fixes the issue. The call to consumeInput returns True, so ignoring it there isn't the issue, it seems.

Am I doing something wrong in my implementation? If not, how can I dig deeper to find and fix this issue?

Here's a minimal example reproducing the problem for me with the latest git master:

{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
import Database.PostgreSQL.Simple
import System.Timeout

main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname='foo' user='bar' password='xxxx'"
  res <- timeout 1000000 $ withTransaction conn (query_ conn "select pg_sleep(5)")
  case res of
    Just [Only x] -> putStrLn "Done." >> return x
    Just _ -> error "Impossible."
    Nothing -> putStrLn "Timed out."

For me (on OS X 10.10.5, running postgresql 9.3.4.1, GHC 7.10.3 and postgresql-simple commit 74b2f4a) this prints:

*** Exception: libpq: failed (another command is already in progress
)

I'd expect it to print "Timed out." instead.

@lpsmith
Owner

lpsmith commented Jun 28, 2016

Yeah, I'd expect it to print "Timed out." as well. It certainly seems like a bug to me.

@hesselink

OK, that's good to know. I've worked around it for now by using postgres' statement_timeout instead and catching the resulting exception in Haskell. Let me know if I can help find and/or fix the bug.

@lpsmith
Owner

lpsmith commented Jun 28, 2016

OK, so the issue is kinda obvious once I think about it: the async exception from timeout causes the execution of query_ to be abandoned, and then the onException handler of withTransaction attempts to roll back the transaction, essentially by issuing another query_ while the connection object has been left in an unusable state.

Probably the most straightforward "solution" would be to modify withTransaction to ignore exceptions generated from the rollback attempt. Though, that does raise the question of whether there are additional cases where the onException rollback attempt could lead to deadlock, or to unresponsiveness for several seconds or longer.

(So, a variant of withTransaction that closes the connection instead of attempting to rollback might well be useful here. And, even with the current semantics, perhaps we could be more intelligent about not attempting rollback and possibly even closing the connection when that's the sensible thing to do.)
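
For concreteness, here is a sketch of the shape of that straightforward "solution" (the function name is made up, and this is not necessarily the change that later landed): the rollback attempt is wrapped so that any exception it raises is discarded and the caller sees the original exception instead.

import qualified Control.Exception as E
import Database.PostgreSQL.Simple (Connection)
import Database.PostgreSQL.Simple.Transaction (TransactionMode, beginMode, commit, rollback)

-- Like withTransactionMode, but if the rollback attempt itself throws,
-- that exception is swallowed so the caller sees the original one.
withTransactionIgnoringRollbackErrors :: TransactionMode -> Connection -> IO a -> IO a
withTransactionIgnoringRollbackErrors mode conn act =
  E.mask $ \restore -> do
    beginMode mode conn
    r <- restore act `E.onException` quietRollback
    commit conn
    return r
  where
    quietRollback = do
      _ <- E.try (rollback conn) :: IO (Either E.SomeException ())
      return ()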

Async exceptions are a massive PITA. On the other hand, they are at times indispensable. Even so, in my own code I'm increasingly trying to get away from using timeout in favor of registerDelay (from STM). postgresql-simple could offer some STM integration as well, at least on GHC 7.8 and later.

Incidentally, is Silk using a modified version of the postgresql-libpq binding? I'm highly surprised that timeout is working when you switch off the async functionality; unless of course it isn't timing out until after the 5 second postgresql sleep statement finishes. (And indeed, not being able to time out blocking C calls to libpq's exec is precisely the reason I implemented the async functionality in the first place... but my particular use cases don't use withTransaction.)

However, there is another potential route to making timeout work in a timely fashion: modify the FFI binding for PQexec to be an interruptible foreign call. Of course, this requires GHC 7.8, and Windows Vista or later (or unix). It might be interesting to see what this would do in this particular example.
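
To illustrate, the change would amount to something like the following sketch; the real postgresql-libpq binding does not declare PQexec this way, and the type names here are only opaque stand-ins:

{-# LANGUAGE InterruptibleFFI #-}
{-# LANGUAGE EmptyDataDecls #-}
import Foreign.C.String (CString)
import Foreign.Ptr (Ptr)

-- Opaque stand-ins for libpq's structs; the real binding has its own types.
data PGconn
data PGresult

-- An interruptible variant of the PQexec import.
foreign import ccall interruptible "libpq-fe.h PQexec"
    c_PQexec :: Ptr PGconn -> CString -> IO (Ptr PGresult)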

@hesselink

Ah, that sounds very plausible. Ignoring exceptions from the rollback sounds a bit dangerous, like you say. Is there a way to detect if the connection is in an unusable state beforehand and not do the rollback in those cases?

I didn't know about registerDelay, that sounds very useful, although in this case you'd need cooperation from postgresql-simple, as you say.

We're not using a modified version, just the latest master. You're right about the synchronous version, by the way: it didn't throw, and returned "Timed out.", but only after 5 seconds.

As for your "interruptible FFI" solution, doesn't that have the same problem as the current code: query_ would be interrupted, withTransaction would catch the exception and try to rollback, causing a new exception?

@lpsmith
Owner

lpsmith commented Jun 28, 2016

Well, I don't think ignoring exceptions from the rollback attempt is that dangerous; after all, the only reason postgresql-simple is trying to roll back is that there's already an exception. So there will always be an exception, it's just a question of which exception it will be.

My musings about deadlock / temporary unresponsiveness were more along the lines of: let's say the network starts misbehaving, that somehow leads to an exception inside a withTransaction block, and then the rollback attempt doesn't complete in a timely fashion because, hey, the network isn't cooperating and doesn't even have the courtesy to inform you that it's not working right now.

Admittedly, this scenario isn't fully formed in my mind, so there are many details I can't really plausibly fill in in this sketch of a scenario.

As for how an interruptible exec would behave in this scenario, I'm not sure. You are right in that it should (indeed, must) behave similarly, but there might be some differences. I just don't know how libpq's exec would respond to being interrupted, and what kind of state(s) that could leave the connection object in.

But even in the "best case" scenario (for some (possibly weird) definition of "best"), where an interrupted exec leaves the connection in a usable state, then the backend would still be executing the sleep statement. And with the current state of postgres, a query cannot be interrupted by the client on the same connection: you would have to interrupt that over another connection. (Indeed, PQcancel creates a temporary connection to issue the cancel.)

So the rollback attempt would have to block until after the sleep statement finishes, unless it can error out before that. (In which case, it would be quite difficult to recover the connection, so should we close the connection object in that case, perhaps?)

@lpsmith
Owner

lpsmith commented Jun 28, 2016

As an aside, my rule of thumb is that most resources that could possibly have been touched by a block of code interrupted by System.Timeout.timeout should be treated as junk. Unfortunately, even something like timeout (readChan chan) can result in lost messages, because the async exception can kill the readChan after it has removed the element from the channel but before the element has been returned to the caller.
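
To illustrate why the registerDelay approach mentioned above sidesteps this particular hazard, here is a small self-contained sketch (nothing to do with postgresql-simple itself; the name is illustrative). Because the successful read and the timeout check commit as one STM transaction, a timed-out call cannot have already dequeued a value:

import Control.Concurrent.STM

-- Wait for a value, giving up after the given number of microseconds.
readWithTimeout :: Int -> TBQueue a -> IO (Maybe a)
readWithTimeout micros q = do
  timedOut <- registerDelay micros
  atomically $
        (Just <$> readTBQueue q)
    `orElse`
        (readTVar timedOut >>= check >> return Nothing)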

So yeah, I've long been slightly uncomfortable with the rollback behavior of withTransaction, but at this point a lot of people are depending on it. And there's probably a good debate to be had over the use and misuse of exceptions and what's appropriate (and not) to have in an exception handler. But I digress.

lpsmith added a commit that referenced this issue Jun 28, 2016
This should fix the timeout/withTransaction problem reported in #177
lpsmith added a commit that referenced this issue Jun 29, 2016
Regression test for commit 71b4080

See issue #177

Thanks to Erik Hesselink and Silk
@lpsmith
Owner

lpsmith commented Jun 29, 2016

@Chratho You don't happen to be using withTransaction in the presence of async exceptions? I am curious if this fixes your problem.

@reiddraper

We are hitting this issue, too. Adding some information about our situation:

  • We always use withTransaction (we use postgresql-transactional)
  • A cursory investigation into our exception-tracking database shows no other exceptions that are occurring within a similar time-frame
  • We do use System.timeout in two transactions. I'll try to narrow the exceptions down and see if they correspond to this timeout being hit.

@ninegua

ninegua commented Jun 1, 2017

I'm hitting this bug in a multithreaded program without using transactions, and without using async exceptions. Well, I was once using async exceptions, but I've removed all uses of them from my code (I can't say for sure about all the libraries we use). The bug still happens.

While it is still a mystery why this keeps happening, I am hoping there is at least a way to catch the error and restart my application so that it doesn't hang dead. Where is this error thrown, anyway? I guess it is from libpq, but is there a way to catch and handle it from Haskell?

@nh2

nh2 commented Jul 5, 2017

Hi, good discussion.

-- If the action throws /any/ kind of exception (not just a
-- PostgreSQL-related exception), the transaction will be rolled back using
-- 'rollback', then the exception will be rethrown.

I don't think this is the case. The code uses mask, which does not prevent IO actions from being interrupted:

withTransactionMode :: TransactionMode -> Connection -> IO a -> IO a
withTransactionMode mode conn act =
  mask $ \restore -> do
    beginMode mode conn
    r <- restore act `E.onException` rollback_ conn
    commit conn
    return r

As detailed in "Interruptible operations", beginMode, rollback_ and commit all perform network IO and as such can be interrupted even when inside mask.

As a result, this function does not guarantee that a rollback was issued. So if you continue using the conn somewhere else after an async exception was thrown, you will unknowingly still be in an open transaction. (Though not entirely unknowingly, because the exception will have bubbled up to you, since, as I mentioned, mask doesn't really prevent it. But if you handled it somewhere, e.g. in some catch-all exception handler of a web server, and put the connection whose rollback or commit was interrupted back into a pool, then the next user of that connection will unknowingly be in the middle of a transaction.)
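
(As a caller-side mitigation, it is at least possible to detect the "still inside a transaction" state before putting a connection back into a pool. A sketch, using PQtransactionStatus via the Internal module's withConnection; the function name is illustrative:)

import qualified Database.PostgreSQL.LibPQ as PQ
import Database.PostgreSQL.Simple (Connection, close)
import Database.PostgreSQL.Simple.Internal (withConnection)

-- Return True only if libpq reports the connection as idle and outside any
-- transaction; otherwise close it so it can't be handed out again.
safeToReuse :: Connection -> IO Bool
safeToReuse conn = do
  status <- withConnection conn PQ.transactionStatus
  case status of
    PQ.TransIdle -> return True
    _            -> close conn >> return False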

You would have to use uninterruptibleMask to guarantee they can't receive async exceptions. But that would be a bad idea: if you do networked IO under uninterruptibleMask, exceptions really are masked and the code might block indefinitely (for example, if the network hangs or is slow, you can't kill the thread, you can't time out, etc.). And the docs of uninterruptibleMask say explicitly "... and you can guarantee that the interruptible operation will only block for a short period of time" (https://hackage.haskell.org/package/base-4.9.1.0/docs/Control-Exception.html#v:uninterruptibleMask), which is not the case for sending commands to postgres. So I recommend not doing that.

Thus, if the current code is kept, at least the haddocks of the related functions should indicate that rolling back on exception is only a best-effort approach, and that if an async exception arrives at an unfortunate time, the connection may be left in an undefined state.

I believe that the only reasonable thing postgresql-simple can do in withTransaction is to not try to roll back on async exceptions. Such a rollback must be handled by the caller. (I will argue why in the next section; you can skip it if you're already convinced.)


The reason why only the caller (the user of postgresql-simple) can handle async exceptions with a rollback is this:

postgresql-simple has no idea what the specific async exception is. Thus it should never catch one without rethrowing it. Not rethrowing would be wrong, because then the user wouldn't learn that the exception occurred. Now, if postgresql-simple also wants to do anything itself (like rollback) when an async exception happens, it has two choices, both of which are bad:

  • Either it can use onException (inside uninterruptibleMask), which is bad, as explained above, because it can introduce arbitrary hangs.
  • Or it could use try and run rollback in the unmasked state. But after try it would have to rethrow the exception so that the caller learns about it, and because we're running unmasked, rollback itself can receive an async exception. So then we potentially have two exceptions to rethrow, which is impossible. So try is bad, too.

As a result, postgresql-simple has no sensible way of handling async exceptions with rollbacks.

The user can do this, because they have different leverage: they might know precisely which async exception to expect and whether or not it should be rethrown. In such cases the user is allowed to "swallow" the async exception, where postgresql-simple isn't. For example, as the user I can throw a MyKnownTimeoutExceptionType async exception and catch it like this:

withInterruptibleTransaction :: ... -> IO (Either String a)
withInterruptibleTransaction action ... = do
  eitherResult <- try (withTransaction action ...)
  case eitherResult of
    Right result -> return (Right result)
    Left (ex :: MyKnownTimeoutExceptionType) ->
      rollback >> return (Left ("got timeout from " ++ show ex))

Here, the user doesn't have to rethrow ex because it's "their own" exception, and if rollback throws, it can do so without delay, and withInterruptibleTransaction will throw, which is good: if rollback fails, the DB connection is broken and should be reset.

End of the argument why the user can handle async exceptions but postgresql-simple cannot.


Probably the most straightforward "solution" would be to modify withTransaction to ignore exceptions generated from the rollback attempt.

@lpsmith I believe this is not a correct solution. If anything, withTransaction should check immediately whether the exception received is an IOError, and if so, rethrow it immediately without trying to issue a rollback (because after an IOError the connection is broken and rollback can't work).

But this is conditional on postgresql-simple handling async exceptions at all, and as argued above, my opinion is that it shouldn't.

Another topic: you cannot currently rely on timeout working to interrupt socket operations, due to the GHC bug "hWaitForInput cannot be interrupted by async exceptions on unix", which I will hopefully soon get back to fixing.

@lpsmith
Owner

lpsmith commented Jul 9, 2017

@nh2, I find your argument convincing. There are a few minor things you aren't appreciating, but I doubt they change the overall conclusion of your argument.

First, postgresql-simple largely bypasses GHC's IO subsystem in favor of FFI calls. On Unix we do use non-blocking calls to libpq and GHC's IO manager, so your analysis is likely correct anyway. (Though bugs in hWaitForInput are likely irrelevant; at best, the root cause there might also be causing problems here.) Windows still uses blocking FFI calls, so they aren't interruptible.

Second, when I wrote the code I had no way of distinguishing asynchronous exceptions, and I didn't fully appreciate many of their subtleties. This is one area where the whole ecosystem has improved, and postgresql-simple needs updating. And I wasn't comfortable with my kludge from last year, but it fixed the immediate issue. I agree that we could probably do a better job.

Lastly, it isn't quite so simple to immediately rethrow IOErrors. I don't doubt that we should be leaving async exceptions alone, but dealing with synchronous exceptions better is more work than that.

@3noch

3noch commented Feb 8, 2020

I'm not up to date on this discussion, but I'm fairly certain I'm hitting this issue when using timeout to keep a transaction from taking too long, and it's being killed while streaming a large result set from the database. This seems to keep the transaction open until the streaming is completely done, even though the timeout has already fired.

@codygman

codygman commented Mar 2, 2021

Our team is also affected by this issue; it looks like there was never bandwidth to work towards a solution?

@kaldonir

kaldonir commented Mar 5, 2021

In our team, we're regularly running into the same issue. Is there a workaround besides avoiding async exceptions on threads which execute transactions?

@codygman

codygman commented Mar 5, 2021

No idea @kaldonir. But I moved the issue to the new maintainer's repo (per Hackage).

@codygman

To further complicate this matter, I wasn't even able to verify that resource-pool is thread-safe here:

https://gist.github.com/codygman/85431fc57c2a729b1527410f661f0605

I think there is a high chance I made a mistake there... because surely more noise would be made about such an issue, right?

The other pool issue that complicates this, depending on your implementation, is:

bos/pool#32

@ondrap

ondrap commented Jun 7, 2023

Just another instance of this problem, this time with beam-postgresql and conduit streaming. It appeared deterministically when I changed the type of a column from text to citext. citext is not a built-in type, so the FromField instance has to check the OID. It caches the results, but if the OID is not yet cached, it issues a SELECT on the same connection, which conflicts with the running streaming query.

@supki
Contributor

supki commented Oct 31, 2024

Just to expand on @ondrap's comment, adding

queryWith_ (field @(CI Text)) conn [sql| SELECT ''::CITEXT |]

right before the streaming query seems to help (in my case, it was scotty + streaming/conduit streaming).
