-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libpq: another command is already in progress #177
Comments
Well, your hypothesis seems quite plausible to me. I don't have any opinion on whether or not this could be caused by ignoring the result of Out of curiosity, does your application use the |
The application does not use The exception occurs in situations of high load, with lots of concurrent connections. But to be honest, a bug in postgresql-simple is only one possible explanation. I might very well miss code fragments where exceptions are caught before even reaching the pools |
Right, I don't care if you are using things from This might be the error you would get if you started a Your second hypothesis also seems quite plausible. And, assuming we are interpreting the error message correctly, I'd be at a loss to come up with a third hypothesis, because possibly other than the copy module, any successfully completed command should leave the connection in the ready state, although as you suggest, some wonky exception handling on your part could result in the attempt to reuse a connection that should have been discarded. (Though, if this is the case, there might be a strong argument that postgresql-simple should be doing something slightly different when it raises the exception.)
So, I agree with your reasoning so far. Keep me updated if you learn something more. As for some experiments, you might try changing this line to Another experiment whose results could be much more interesting, is to modify |
I'm running into this message. It could be that I'm doing something wrong, so let me first describe what I'm doing: I'm using postgresql-simple (through opaleye) to run a query, essentially coming down to I've added some logging to postgresql-simple, and the error occurs when trying to roll back the transaction (in Am I doing something wrong in my implementation? If not, how can I dig deeper to find and fix this issue? Here's a minimal example reproducing the problem for me with the latest git master: {-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
import Database.PostgreSQL.Simple
import System.Timeout
main :: IO ()
main = do
conn <- connectPostgreSQL "dbname='foo' user='bar' password='xxxx'"
res <- timeout 1000000 $ withTransaction conn (query_ conn "select pg_sleep(5)")
case res of
Just [Only x] -> putStrLn "Done." >> return x
Just _ -> error "Impossible."
Nothing -> putStrLn "Timed out." For me (on OS X 10.10.5, running postgresql 9.3.4.1, GHC 7.10.3 and postgresql-simple commit 74b2f4a) this prints:
I'd expect it to print "Timed out." instead. |
Yeah, I'd expect it to print "Timed out." as well. It certainly seems like a bug to me. |
OK, that's good to know. I've worked around it for now by using postgres' |
Ok, so the issue is kinda obvious once I think about it: the async exception from Probably the most straightforward "solution" would be to modify (So, a variant of Async exceptions are a massive PITA. On the other hand, they are at times indispensible. Even so, in my own code I'm increasingly trying to get away from using Incidentally, is Silk using a modified version of the However, there is another potential solution to making |
Ah, that sounds very plausible. Ignoring exceptions from the rollback sounds a bit dangerous, like you say. Is there a way to detect if the connection is in an unusable state beforehand and not do the rollback in those cases? I didn't know about We're not using a modified version, just the latest master. You're right about the synchronous version, by the way: it didn't throw, and returned "Timed out.", but only after 5 seconds. As for your "interruptible FFI" solution, doesn't that have the same problem as the current code: |
Well, I don't think ignoring exceptions from the rollback attempt is that dangerous, after all, the only reason why postgresql-simple is trying to rollback is because there's an exception. So there will be an exception, always, it's just a question of which exception it will be. My musings about deadlock/ temporary unresponsiveness was more along the lines of, let's say the network starts misbehaving, and that somehow leads to an exception inside a Admittedly, this scenario isn't fully formed in my mind, so there are many details I can't really plausibly fill in in this sketch of a scenario. As for how an interruptible exec would behave in this scenario, I'm not sure. You are right in that it should (indeed, must) behave similarly, but there might be some differences. I just don't know how libpq's exec would respond to being interrupted, and what kind of state(s) that could leave the connection object in. But even in the "best case" scenario (for some (possibly weird) definition of "best"), where an interrupted So the rollback attempt would have to block until after the sleep statement finishes, unless it can error out before that. (In which case, it would be quite difficult to recover the connection, so should we close the connection object in that case, perhaps?) |
As an aside, my rule of thumb is that most resources that could have possibly been touched by a block of code that has been So yeah, I've long been slightly uncomfortable with the rollback behavior of |
This should fix the timeout/withTransaction problem reported in #177
@Chratho You don't happen to be using |
We are hitting this issue, too. Adding some information about our situation:
|
I'm hitting this bug in a multithreaded program without using transaction, and without using async exceptions. Well, I was once using async exceptions but I've removed all use of them in my code (I can't say for sure for all the libraries used). The bug still happens. While it is still a mystery why this keeps happening, I am hoping there is at least a way to catch it and restart my application so that it doesn't hang dead. Where is this error thrown any way? I guess it is from libpq, but is there a way to catch and handle it from Haskell? |
Hi, good discussion. -- If the action throws /any/ kind of exception (not just a
-- PostgreSQL-related exception), the transaction will be rolled back using
-- 'rollback', then the exception will be rethrown. I don't think this is the case. The code uses withTransactionMode :: TransactionMode -> Connection -> IO a -> IO a
withTransactionMode mode conn act =
mask $ \restore -> do
beginMode mode conn
r <- restore act `E.onException` rollback_ conn
commit conn
return r As detailed in Interruptible operations, As a result, this function does not guarantee that a rollback was issued. So if you continue using the You would have to use Thus if the current code is kept, at least the haddocks of the related functions should indicate that rolling back on exception is only a "best effort" approach, and that if an async exception appears at an unfortunate time, the connection might be in undefined state. I believe that the only reasonable thing that The reason why only the caller (user of
As a result, The user can do this, because they have different leverage: They might know precisely which async exception can be expected and whether or not it should be rethrown. In such cases, the user is allowed to "swallow" the async exception, when withInteruptibleTransaction :: ... -> Either String a
withInteruptibleTransaction action ... = do
eitherResult <- try (withTransaction action ...)
case eitherResult of
Right result -> return (Right result)
Left (ex :: MyKnownTimeoutExceptionType) -> rollback >> return (Left $ "got timeout from " ++ show ex) Here, the user doesn't have to rethrow End of argument why the user can handle async exception but
@lpsmith I believe this is not a correct solution. If anything, But this is conditional on Another topic: You cannot currently rely on |
@nh2, I find your argument convincing. There are a few minor things you aren't appreciating, but I doubt they change the overall conclusion of your argument. First, postgresql-simple largely bypasses GHC IO subsystem, in favor of FFI calls. On Unix we do use non-blocking calls to libpq and GHC's IO manager, so your analysis is likely correct anyway. (Though bugs in Second, when I wrote the code I had no way of distinguishing asynchronous exceptions, and I didn't fully appreciate many of their subtleties. This is one area where the whole ecosystem has improved, and postgresql-simple needs updating. And I wasn't comfortable with my kludge from last year, but it fixed the immediate issue. I agree that we could probably do a better job. Lastly, it isn't quite so simple to immediately rethrow IOErrors. I don't doubt that we should be leaving async exceptions alone, but dealing with synchronous exceptions better is more work than that. |
I'm not up-to-date on this discussion but I am fairly certain I'm hitting this issue when using |
Our team is also affected by this issue, it looks like there was never bandwidth to work towards a solution? |
In our team, we're regularly running into the same issue. Is there a workaround besides avoiding async exceptions on threads which execute transactions? |
No idea @kaldonir. But I moved the issue to the new maintainers (per hackage) repo. |
To further complicate this matter, I wasn't even able to verify that resource-pool is thread-safe here: https://gist.github.com/codygman/85431fc57c2a729b1527410f661f0605 I think there is a high chance I made a mistake there... Because surely more noise would be made about such an issue right? The other pool issue that complicates this depending upon your implementation is: |
Just another instance of this problem - this time with beam-postgresql with conduit streaming. It appeared deterministically when I changed a type of a column from |
Just to expand on @ondrap's comment, adding queryWith_ (field @(CI Text)) conn [sql| SELECT ''::CITEXT |] right before the streaming query seems to help (in my case, it was scotty + streaming/conduit streaming). |
Every once in a while I am noting "another command is already in progress" errors from libpq. The respective application manages database connections in a resource pool.
I am now wondering whether the application sometimes puts broken connections (with PGASYNC_BUSY still set) back into the pool, letting a subsequent access to the connection fail.
The implementation of
exec
contains the following line:_ <- PQ.consumeInput h -- FIXME?
Is ignoring the result really okay this way? Calling PQgetResult afterwards, you might not be able to decide whether the result indicates that fetching is simply done or an error occurred. If at least something was already received, an application might continue without even noting.
I am not sure whether this really is an issue, but currently its the only reason I could think of causing the above exception.
The text was updated successfully, but these errors were encountered: