-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
java.sql.SQLRecoverableException: No more data to read from socket #20
Comments
@jgebal experienced this issue at any time since 3.0.4 version? |
We're using the most recent version 3.1.2 in a custom docker container (alpine based). About one in 25 calls (in a loop, each iteration using a freshly initialized container) get's blocked and only after waiting one full hour get's terminated eventually. The utplsql command generates no output at all in these (rare) cases. Furthermore, as already mentioned the exception/error exit code, due to the timeout seems to get lost on the way. Our call does get exit-code zero even after such a timeout. Therefore only the missing output (and the delay) can be used a indirect means of error handling... Any progress avoiding this blockade and/or at least a proper error escalation in case it can't be avoided would be greatly appreciated. |
Hi @pesse
|
I'll soon have a new version of cli ready with logging possibilities - I would then aks you to try it with this new version and provide the logs, if possible |
Absolutely! I look forward to verify the new version, thanks! |
That's a good point - want to make an issue for it? |
OK, done! -> configurable timeout #124 |
Hey @c0dw , can you try to reproduce it with this version of cli? use |
Hi @pesse After that, I tried to reproduce the error again with the official release Version 3.1.2 of utplsqlcli. It also took significantly more iterations until the error occured: More than 70 iterations in a first pass and 56 iterations in the second attempt. As you can see iteration 56 took one hour (3603 seconds) to timeout and it produced no log (zero lines in srqoauth2_doc.log). In the shebang of the script I set the '-e' flag to cause immediate termination in case of an error, which obviously never made it up the callstack...
I managed to enter the running Docker container and took a threaddump with jstack, if this is any help understanding the problem: I will try again to reproduce the failure and collect debugging info using your working version 201902142127. Next time I will be more patient and wait for more than 100 iterations just to be sure, the error does indeed not occur anymore. |
thanks a lot for your thorough analysis, @c0dw |
That's the least I can do! As a matter of fact, it happend already in Iteration 17, this time I can provide the full debug log and for comparison the one before (iteration16), where all was well: As you can see, the process is stuck while 'Initializing reporters'. I get one line of output every 30 seconds stating... ...and here is the according jstack threaddump of the utplsqlcli again: |
Hi @pesse Please let me know, if I can contribute anything else to resolve this blocking issue. Apparently it persists in the development snapshot 201902142127 you provided, as outlined in my last posting. I'd also like to point out, we still face the additional problem, that the timeout does not trigger a proper error escalation. I still don't get an exitcode after a timeout. My java skills are slightly rusty - so correct me if I'm wrong, but from your code (lines 203-209) I understand you're using java.util.concurrent.ExecutorService.shutdown() which causes an orderly shutdown and after that in line 209 you return 'returnCode[0]' which I suppose is initialized to literally zero as well. Therefore an invoking process does not get notified, that an irregular termination (due to a timeout) occurred, which makes error handling cumbersome. IMHO a timeout should push an error/exception up the call-stack.
According to the API doc awaitTermination() returns true if this executor terminated and false if the timeout elapsed before termination. |
Hi @c0dw Can you open a seperate issue for the timeout not causing an error? I agree that a timeout should throw an exception and be visible from the outside. As for the other issue: The most problematic thing is that we were not able to reproduce the issue yet. What you could help with is to investigate (maybe together with your DBA) which SQL statement is executed (or was executed last) in the case of a block. I will probably add some more log messages, too and come back with a new version. I'd also like to ask you for patience because utPLSQL is a 100% leisure time project, nobody gets anything paid for it and we do it in our free time on evenings besides our normal day jobs. Therefore complicated issues like this can be very challenging. |
Hi @pesse Thank's again for your fast response! Don't get me wrong, patience is not not a problem, we've got a workaround in place which reruns the affected tests. I certainly did not intend to push you, on the contrary I wish support for expensive commercial products was always as timely and substantial as your own, it's much appreciated!!! I followed up on my remarks, because I currently can reproduce the issue in my loop and hope this helps fixing it. If I can find out which SQL statement hangs, I'll let you know here. Regarding error escalation in the timeout scenario, I created a new issue as you suggested. |
Thank you for your kind and encouraging words :)
This might help to narrow down the problem. |
Could you also check to see if the same issue occurs with latest utPLSQL framework build from develop branch? One more thing to mote is that the problem could somehow relate to oracle fixed table stats. It would be great to get a trace enabled for the session. But it would need to be done by cli/api to enable capturing of the whole flow from the first statement. |
No, we don't. If it was the case we would also not see Reporter initialization messages.
Sounds reasonable, I just have to admit that this is something I have absolutely no experience with, so chances are high I need some time. Also not sure what privileges are necessary etc. |
A short update on a preliminary result: This issue seems to be related to #98 indeed. During two blocked sessions I was able to query the oracle database and came up with this:
Only SID 18 and 753 are relevant. These sessions were the two blocked ones in a loop of more than 300 iterations. So, when it happend in SID 18, the loop continued after the timeout until it happend again in SID 753. Meanwhile not only the currently blocked session 753, but also the former one SID 18 remained 'ACTIVE' in the DB. In Issue #98, you mentioned the exact same query-text
|
Currently I'm in the next blocked iteration:
All the session waits
History of wait events in a specific session
|
Hi @c0dw ,
and and provide it both together with the cli logfile? |
Hi @pesse Unfortunatly now I'm also having a hard time to reproduce the error condition again. In the past few days it did not happen once. Currently I have a loop running more than 1000 iterations and the test performed flawlessly each time. It is not running against the same Docker container anymore, I re-created it several times in past days, but I also tried using the very same docker image of the oracle db container we used before. As a matter of course I'm not aware of any changes to the Database which would explain this. Regardless, this seems to confirm our suspicion, that the root cause might be related to the state of the oracle DB, but until we face the next situation it is virtually impossible to deduce specifics. BTW, I heard , that by coincidence one of the utPLSQL committers, @PhilippSalvisberg is currently on assignment in our company. I'll try to schedule a consulting session with him. In the meantime I'll keep running my test loop and let you know as soon as the problem arises again. |
Hi @pesse
The query on ut_output_buffer_tmp returned empty, no rows!
The second query (on ut_output_buffer_info_tmp) returned 6868 rows! The spool is attached here, but I doubt it is very enlightening, since it contains only hexadecimal output IDs: |
@c0dw Thanks! @PhilippSalvisberg Thanks a lot! |
The one thing that surprises me is that there is no main session hanging while reporters are waiting for the output.
|
@pesse, please find below result of a run that was stuck with debug enabled and the result of above queries:
Query 1;
Returns no row Query 2:
returns:
|
Using flashback for query 1, I found these records:
Same at TO_TIMESTAMP('2019-02-26 11:12:10' |
@PhilippSalvisberg found the cause for the problem and we discussed approaches.
This also explains why we have the reporter IDs in |
@pesse, fyi I can reproduce the error on my local machine with this test case (running a single utPLSQL test):
Against a 12.2 I need to run the test 3 to 4 times until test run hangs. However, when I run it against 18.5 or 19.2 then issue happens always during the first run. So, if you are going to test it locally, it will help to use a newer version of the Oracle Database to reproduce the error (and check if the workaround solves the problem). The thread dumps I've taken are consistent with the one mentioned above. |
Awesome! Thank you very much, @PhilippSalvisberg! |
@c0dw @SebGt Could you please both try the following develop-version of cli? |
@pesse, thanks. I will test it today and I give my feedback asap. |
@pesse, All my tests (around 30 different set of tests) were all passing this morning. I will see the next days if it is confirmed but I have no doubt |
Great! Thanks for the feedback. |
No but I did not check all outputs. I will if it can help you. |
If you could check it would be great because it means the (suspected) Oracle bug still occurs but is caught reliably (which is a far better message than the Oracle bug we tried to catch didn't occur for whatever reason during x runs) |
@pesse, none of my executions (~20 Jenkins jobs) displayed a warning.
I will check next week. |
Hi @pesse, most of my tests ran without any issues today. Only one had the warning message twice time. Between the 2 runs with the warning I had other runs of the test without issues First one:
Second one:
Don't hesitate if you need more details. |
Thank you very much @SebGt ! |
You're welcome. |
@c0dw and @PhilippSalvisberg did some extensive testing and analysis yesterday (on Oracle 12.2) which they shared with me afterward. The findings:
The I/O exception caused by aborting the connection might be a bit irritating, so I'll try to only show it in debug-mode. |
I get the following error when running tests using
utplsql-cli
:The trouble is, the utplsql-cli does not stop/throw.
It seems to be stuck, doing nothing.
This leads to two potential problems:
Single test can run for quite some time.
The text was updated successfully, but these errors were encountered: