42.2.8: Infinite loops with message "Ignoring request to process copy results, already processing" #1592
Comments
… another thread closes the connection
@hyunkshinft can you test the PR above?
@davecramer thanks for the quick fix. I will try this as soon as I can and let you know the result.
@davecramer unfortunately this did not fix the problem. FYI the following is the piece of code that closed the stream.
Hmmm, strange. I would think that should cause an exception to be thrown.
I thought so too. Your fix seems very obvious to me, but somehow it got past the connection close check.
Something else must be going on. Any chance you can debug or provide a standalone test case?
It does not always happen and might not reproduce under a debugger, but I will try. I am sorry, but I don't know the exact scenario yet.
@davecramer one thing I want to remark about this particular connection: the WAL records are usually as big as 48 MB, and the end-to-end connection runs through a very long pipe with at least two SSH tunnels.
@hyunkshinft OK, I have some cycles to look at this. Any update before I do?
@davecramer Now I don't close the stream, so I no longer hit the "Ignoring request ..." message. I even added an exception in processCopyResults to fail immediately, as below, in order to avoid excessive logging in case of this error.
I know you implemented this thread-protection code to address the "Unexpected packet type during copy" error. But I am still getting the "Unexpected packet type during copy" error without hitting the exception code above.
My code has a retry algorithm (close the existing stream and connection, then reopen) in case of PSQLException; it works for a while after a retry, then hits the same error again and again.
This work-and-fail cycle goes on until the process hits an OOM: when the "unexpected packet type" happens to be 'E' or 'N', probably coincidentally, and a huge elen or nlen is read, the process dies with an OOM as below.
I added checks in receiveErrorResponse() and receiveNoticeResponse() to return an error when elen or nlen is invalid, as below.
And
And now it fails in a different place.
I have never seen the message "Ignoring request to process copy results, already processing" when I don't stop the stream, yet I still get the above errors. Which means the "Unexpected ..." error might not be related to a thread race condition in the first place?
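A minimal sketch of the kind of length sanity check being described (this is not pgjdbc's actual receiveErrorResponse()/receiveNoticeResponse() code; the method name and the 1 MB cap are illustrative): reject a 4-byte message length that cannot be valid before allocating a buffer for it, so a misaligned stream produces an error instead of an OOM.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class LengthCheckSketch {
  // Reject lengths that cannot be a valid message payload before allocating.
  static int readCheckedLength(DataInputStream in, int maxLen) throws IOException {
    int len = in.readInt(); // 4-byte big-endian length, as on the wire
    if (len < 4 || len > maxLen) {
      throw new IOException("implausible message length: " + len);
    }
    return len;
  }

  public static void main(String[] args) throws IOException {
    byte[] sane = {0, 0, 0, 16};
    byte[] garbage = {0x7f, -1, -1, -1}; // ~2 GB: allocating this is the OOM path
    System.out.println(readCheckedLength(
        new DataInputStream(new ByteArrayInputStream(sane)), 1 << 20)); // 16
    try {
      readCheckedLength(new DataInputStream(new ByteArrayInputStream(garbage)), 1 << 20);
    } catch (IOException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With a stream positioned at a random byte, a bogus length is caught here rather than surfacing later as an OutOfMemoryError.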
@davecramer I will describe our code design a bit to help you understand what is going on. We have two threads: a producer and a consumer, with flow control via a concurrent queue between them. I assume the stream API allocates a new ByteBuffer every time stream.read() or stream.readPending() is called, so the consumer completely owns the buffer once stream.read() returns it.
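A minimal, self-contained sketch of that two-thread design, under the stated assumption that each read hands back a freshly allocated buffer (class name, queue size, and message count are illustrative, not the actual application code). The bounded queue is what provides the flow control: the producer blocks when the consumer falls behind.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumerSketch {
  static int runPipeline() throws InterruptedException {
    // Bounded queue = flow control: put() blocks when the queue is full.
    BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(16);

    Thread producer = new Thread(() -> {
      try {
        for (int i = 0; i < 100; i++) {
          // Stands in for stream.read()/stream.readPending(): each call is
          // assumed to return a freshly allocated buffer.
          queue.put(ByteBuffer.allocate(4).putInt(0, i));
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    producer.start();

    int sum = 0;
    for (int i = 0; i < 100; i++) {
      sum += queue.take().getInt(0); // the consumer owns the buffer after take()
    }
    producer.join();
    return sum;
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println(runPipeline()); // 0 + 1 + ... + 99 = 4950
  }
}
```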
@hyunkshinft your code must have more than one reader. How would it be possible to get around the lock that was put in to avoid the unexpected-character message? This might also explain the first symptom.
@davecramer I am pretty sure we have only one reader thread. That is why I have never seen the log message or an exception caused by this code.
OK, let me try to write a test case for this.
@davecramer I think I found a cause of the "Unexpected packet type ..." problem as well as the OOM, but I am not sure whether it has anything to do with the original infinite-loop issue reported here. As long as I don't close the stream, I haven't encountered the infinite loop. What is so bad about this infinite loop is that it produces so many logs, especially in production, that it increases cloud logging costs dramatically without being noticed, since it is just an INFO message.
@hyunkshinft I'd love to know the cause. Are you able to share?
Here is a proposed fix for "class java.io.IOException: Unexpected packet type during copy:" and the OOM. The root cause of these problems is that when you hit a SocketTimeoutException in the middle of CopyData, you should not fail but just wait, as long as the connection is still connected. Otherwise, the next processCopyResults starts at a random position in the stream and ends up with "Unexpected ..." if the random byte is an unknown type; if the random byte happens to be a valid message type, the next 4 bytes are taken as a message length that is completely random, so you end up with unpredictable errors thereafter, including OOM. A socket timeout can happen for various reasons: a slow server, network latency, etc.
@davecramer let me know what you think.
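A sketch of the idea behind the proposed fix (illustrative names, not the actual pgjdbc patch): a read that hits a socket timeout mid-message keeps waiting instead of failing, so the stream never ends up positioned at a random byte. The flaky stream below simulates a single timeout during a transfer.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

public class MidMessageRetrySketch {
  // Read exactly n bytes, retrying on socket timeouts instead of failing:
  // giving up mid-message would leave the stream at a random position.
  static byte[] readFully(InputStream in, int n) throws IOException {
    byte[] buf = new byte[n];
    int off = 0;
    while (off < n) {
      int r;
      try {
        r = in.read(buf, off, n - off);
      } catch (SocketTimeoutException e) {
        continue; // still connected: keep waiting for the rest of the message
      }
      if (r < 0) throw new IOException("stream closed mid-message");
      off += r;
    }
    return buf;
  }

  // A stream that times out once before delivering data, like a slow network.
  static InputStream flaky(byte[] data) {
    return new FilterInputStream(new ByteArrayInputStream(data)) {
      boolean thrown = false;
      @Override public int read(byte[] b, int off, int len) throws IOException {
        if (!thrown) {
          thrown = true;
          throw new SocketTimeoutException("read timed out");
        }
        return super.read(b, off, len);
      }
    };
  }

  public static void main(String[] args) throws IOException {
    byte[] msg = "CopyData-payload".getBytes();
    byte[] got = readFully(flaky(msg), msg.length);
    System.out.println(new String(got)); // full payload despite the timeout
  }
}
```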
Were you able to confirm that this fixes things for you? I'd like to remove the code I put in and try this with the test case that does a very large copy. It seems like this might be the underlying reason.
Yes, I have been running this code without problems for about a day now. It works for both 42.2.6 and 42.2.8, but as I said, 42.2.8 has other problems, so yes, you may consider reverting your change.
If you can briefly pause the communication during CopyData then you can duplicate this problem easily. |
thx,
I'll look into it.
Dave Cramer
…On Fri, 8 Nov 2019 at 13:45, hyunkshinft wrote: "If you can briefly pause the communication during CopyData then you can duplicate this problem easily."
OK, I removed the lock and put your code in, and I can still get the error with the unexpected character pretty easily.
Are you using multiple readers of the stream? My change fixes the input-stream processing in the single-threaded scenario. Your change may still be needed to cover the multi-threaded scenario, but the danger of that fix is an infinite loop in certain scenarios.
No, not using threaded readers. As I said, JDBC is not thread-safe.
Which means you can still duplicate the problem with the new code, while the reporter of #1466 cannot?
That is the short version, yes. But I am using their code to duplicate it, so at the moment I don't have an answer.
It could be that something other than SocketTimeoutException causes processCopyResults to fail prematurely. If you can duplicate it easily, I suggest you catch all Throwable instead of just SocketTimeoutException and see what else is being captured.
I caught everything. Still failed.
@davecramer I noticed that my proposed fix didn't make it into the official release, so I modified the test created for #1466 to duplicate the error quickly and reliably. The test reproduces the "Unexpected packet type" error using version 42.2.9. I also included a test with my proposed fix. Please find the modified test case here: https://github.com/hyunkshinft/streamProblems#proposed-fix. Hopefully you can incorporate my proposed fix in the next official release.
@hyunkshinft I did try this and was able to reproduce the error more quickly, I think. I may have to revisit this when I come back in the new year. Thanks for reminding me.
@davecramer please note that the proposed fix has been slightly changed.
OK, I've removed the "hack" and applied your patch to the latest code, and this seems to work. I'd like a few others to test it. I will push a fix so everyone can test it.
Thanks, see #1665.
So it appears that catching and ignoring the socket timeout exception causes other issues with getting async notifies. We use the socket timeout, and since it is caught and ignored, we never time out.
@hyunkshinft ya, while this fixes one problem it causes others. It needs a rethink. Honestly, I think we really need to look at reading data from the backend in a thread.
@davecramer Sorry for the late response; I have been on vacation. I agree that it could cause an issue with non-blocking reads. I will take a look next week and propose a quick solution. I think we should handle a socket timeout differently when we are expecting a fresh new message versus when we have already read partial data.
@hyunkshinft no worries. As I said, I think we need to have reads running in a background thread.
@davecramer is there an easy way to set up a regression test environment locally so that I can test my fix?
I was able to just use the tests that were in the PR that you had. The problem is a decent-sized disk. I am travelling this week so I couldn't test it, but I can look at it to see if we can find anything beforehand.
Dave Cramer
…On Mon, 13 Jan 2020 at 18:22, hyunkshinft asked about setting up a regression test environment locally.
@davecramer I have added an additional fix that addresses the problem with non-blocking IO. Hopefully this fixes the test failure you saw. The idea of this fix is that the socket timeout applies only at message boundaries: the timeout is honoured until the first byte of a message is seen in processNotifies and processCopyResults. Once the first byte of a new message has arrived, any socket timeout thereafter is ignored until the full message is received, in order to preserve the message boundary. The fix has been implemented on top of 42.2.6. Please take a look when you have a chance. https://github.com/hyunkshinft/streamProblems#proposed-fix
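A sketch of that boundary rule using a toy framing (one type byte, a 1-byte length, then the body — deliberately simpler than the real PostgreSQL protocol; all names are illustrative): a timeout while waiting for the type byte is reported to the caller, but once the type byte has arrived, timeouts are swallowed until the message is complete.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

public class BoundaryTimeoutSketch {
  // Read one toy message: a timeout propagates only at the message boundary.
  static String readMessage(InputStream in) throws IOException {
    int type;
    try {
      type = in.read();            // at the boundary: a timeout is reported
    } catch (SocketTimeoutException e) {
      return null;                 // no message pending; caller may poll again
    }
    if (type < 0) throw new EOFException();
    int len = readByteIgnoringTimeouts(in);
    byte[] body = new byte[len];
    for (int i = 0; i < len; i++) {
      body[i] = (byte) readByteIgnoringTimeouts(in); // mid-message: never give up
    }
    return (char) type + ":" + new String(body);
  }

  static int readByteIgnoringTimeouts(InputStream in) throws IOException {
    while (true) {
      try {
        int b = in.read();
        if (b < 0) throw new EOFException();
        return b;
      } catch (SocketTimeoutException e) {
        // inside a message: keep waiting to preserve the message boundary
      }
    }
  }

  // A stream that throws one timeout after `after` successful reads.
  static InputStream flaky(byte[] data, int after) {
    return new FilterInputStream(new ByteArrayInputStream(data)) {
      int reads = 0;
      boolean thrown = false;
      @Override public int read() throws IOException {
        if (!thrown && reads == after) {
          thrown = true;
          throw new SocketTimeoutException();
        }
        reads++;
        return super.read();
      }
    };
  }

  public static void main(String[] args) throws IOException {
    byte[] msg = {'d', 2, 'h', 'i'}; // type 'd', length 2, body "hi"
    System.out.println(readMessage(flaky(msg, 0))); // timeout at boundary -> null
    System.out.println(readMessage(flaky(msg, 2))); // timeout mid-message -> full message
  }
}
```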
@hyunkshinft OK, better, but now it is failing the setNetworkTimeout test; see #1679
@davecramer thanks for checking this out. I am going to address the test failure here: #1681
@davecramer I have fixed all the unit test failures in #1681. Would you please review that PR, or review and incorporate the latest changes into yours for inclusion in the next release?
* origin/master: (427 commits)
  - refactor: make PSQLState enum consts for integrity constraint violations (pgjdbc#1699)
  - [maven-release-plugin] prepare for next development iteration
  - [maven-release-plugin] prepare release REL42.2.10
  - pass gpg key through arguments
  - add passphrase to release mvn task
  - chore: update signing key
  - Metadata queries improvment (pgjdbc#1694)
  - WIP release notes for 42.2.10 (pgjdbc#1688)
  - chore(deps): bump checkstyle from 8.28 to 8.29 in /pgjdbc (pgjdbc#1691)
  - Cleanup PGProperty, sort values, and add some missing to docs (pgjdbc#1686)
  - fix: Fixes issue pgjdbc#1592 where one thread is reading the copy and another thread closes the connection (pgjdbc#1594)
  - Fixing LocalTime rounding (losing precision) (pgjdbc#1570)
  - sync error message value with tested value (pgjdbc#1664)
  - add DatabaseMetaDataCacheTest to test suite to run it (pgjdbc#1685)
  - Fix Network Performance of PgDatabaseMetaData.getTypeInfo() method (pgjdbc#1668)
  - fix: Issue pgjdbc#1680 updating a boolean field requires special handling to set it to t or f instead of true or false (pgjdbc#1682)
  - fix testSetNetworkTimeoutEnforcement test failure (pgjdbc#1681)
  - minor: fix checkstyle violation of unused import (pgjdbc#1683)
  - fix: pgjdbc#1677 NumberFormatException when fetching PGInterval with small value (pgjdbc#1678)
  - fix: actually use milliseconds instead of microseconds for timeouts (pgjdbc#1653)
  - ...
@hyunkshinft @davecramer I wonder if the change in https://github.com/pgjdbc/pgjdbc/pull/1681/files#diff-1c9bbe3aa0c1f2d0d7a6ecef10d981c2be064a526e8f91e2b47b04823fd50dedR142 is intentional, i.e. immediately returning false instead of setting available to false. The problem with that change is that the nextStreamAvailableCheckTime field is not updated, which can lead to repeated calls to hasMessagePending each blocking for 1 millisecond.
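A sketch of the caching pattern being discussed (field and method names are illustrative, not pgjdbc's actual internals): if the deadline is always updated after a fruitless availability check, repeated calls within the interval return immediately instead of each blocking for 1 ms. The clock is passed in explicitly to keep the sketch deterministic.

```java
public class AvailabilityCache {
  private static final long INTERVAL_NANOS = 1_000_000L; // re-check at most every 1 ms

  private long nextCheckNanos = 0;
  private int blockingChecks = 0; // counts the (potentially 1 ms) socket polls

  // nowNanos is injected so behaviour is deterministic and testable.
  boolean hasMessagePending(long nowNanos, boolean streamHasData) {
    if (nowNanos < nextCheckNanos) {
      return false; // checked recently and found nothing: skip the blocking poll
    }
    blockingChecks++; // stands in for the blocking availability check
    if (!streamHasData) {
      nextCheckNanos = nowNanos + INTERVAL_NANOS; // always update the deadline
      return false;
    }
    return true;
  }

  int blockingChecks() {
    return blockingChecks;
  }

  public static void main(String[] args) {
    AvailabilityCache c = new AvailabilityCache();
    // Simulate a caller polling every microsecond for 10 ms with no data pending.
    for (long t = 0; t < 10_000_000L; t += 1_000) {
      c.hasMessagePending(t, false);
    }
    System.out.println(c.blockingChecks()); // one blocking poll per millisecond: 10
  }
}
```

Without the deadline update, every one of the 10,000 simulated calls would pay the blocking poll.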
Have you tested your solution? Can you provide a PR with the change so we can review it?
Dave Cramer
…On Tue, 31 May 2022 at 19:05, Fyodor Kupolov asked whether immediately returning false instead of setting available to false was intentional.
Thank you for replying.
Yes, we ran into this issue due to some odd usage of the driver on our side, where it manifested as essentially 1 millisecond being wasted per call, which was unexpected for us. The fix has been in place and tested for a couple of months. I will prepare a PR soon. (It may take a week or more, because I need to run it by legal, etc.)
I'm submitting a ...
Describe the issue
The JDBC driver falls into an infinite loop during streaming. One thread attempts to abort streaming by closing the stream while another thread is still reading it. It occasionally loops indefinitely, printing "Ignoring request to process copy results, already processing".
This looks related to PR #1467.
Driver Version?
42.2.8
Java Version?
java version "1.8.0_131"
OS Version?
Linux ip-10-47-10-161 4.9.51-10.52.amzn1.x86_64 #1 SMP Fri Sep 29 01:16:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
PostgreSQL Version?
10.7
To Reproduce
Steps to reproduce the behaviour:
thread 1:
thread 2:
Expected behaviour
Thread 1 throws an exception.
And what actually happens
An infinite loop printing the message "Ignoring request to process copy results, already processing".
Logs
If possible, include PostgreSQL logs surrounding the occurrence of the issue.
Additionally, logs from the driver can be obtained by adding
to the connection string