Some KG2 query subprocesses never terminate #2114
Perhaps somewhere lurking in the RTXteam/RTX code-base, we have a blocking I/O action (e.g., a remote query or something) that doesn't have an explicit timeout?
Hmm, this line of code seems to have a built-in timeout: RTX/code/ARAX/ARAXQuery/Expand/kg2_querier.py, line 186 (at commit 2e7c407)
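For context, a minimal sketch of what a requests call with an explicit timeout looks like; the URL and payload here are placeholders, not the actual KG2/Plover endpoint. One thing worth remembering is that requests' `timeout` bounds the connection attempt and each wait for new data, not the total wall-clock time of the request, so it is not an absolute guarantee against a long-lived call.

```python
import requests

# Hypothetical endpoint and payload, for illustration only.
url = "https://example.org/plover/query"
payload = {"nodes": {}, "edges": {}}

# timeout=(connect_timeout, read_timeout): the connection must be established
# within 10 s, and each wait for new bytes must complete within 60 s.
# This is NOT a cap on the total duration of the request.
response = requests.post(url, json=payload, timeout=(10, 60))
response.raise_for_status()
answer = response.json()
```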
As far as I can tell from the log files for these 3 queries, the code never progresses beyond this line. It should write what happened to the log:
But there's nothing. Crickets. I don't see a stack trace either, although it's a very complicated log.
To be clear, this is a rare event. 99+% of the time it's fine, but at a rate of roughly 1 in 2500 queries we seem to execute line 186 and never proceed to line 187, and the process just sits there forever. I could implement a process killer: after a child process has been running for 15 minutes or so, I could just SIGKILL it...
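For what that might look like, here is a minimal sketch of such a watchdog, assuming the parent keeps track of its child processes' start times; the 15-minute budget and the bookkeeping structure are illustrative, not existing RTX code.

```python
import os
import signal
import time

MAX_CHILD_SECONDS = 15 * 60  # illustrative 15-minute budget

# Hypothetical bookkeeping: the parent records each child PID and its start time.
child_start_times = {}  # pid -> epoch seconds when the child was spawned

def reap_stuck_children():
    """SIGKILL any tracked child process that has exceeded the time budget."""
    now = time.time()
    for pid, started in list(child_start_times.items()):
        if now - started > MAX_CHILD_SECONDS:
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the process already exited on its own
            finally:
                child_start_times.pop(pid, None)
```

The parent would call `reap_stuck_children()` periodically (e.g., from a timer or its main loop).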
Another possibility is that there's an upstream try/except that intercepts a failure from line 186 and is totally quiet about the failure (which would be odd), although I don't see one a couple levels up. I wonder if it might be a good idea to wrap line 186 in a try/except and be verbose to the logs about a failure? Or I wonder if switching to aiohttp might be more robust?
Maybe line 186 should be catching a …
...or just *any* error...
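As a concrete illustration of that suggestion, a sketch of wrapping the call with a broad catch that is loud in the logs; the function name and logger are placeholders, not the actual kg2_querier code.

```python
import logging
import traceback

import requests

log = logging.getLogger("kg2_querier")  # placeholder logger

def query_plover(url: str, payload: dict) -> dict:
    """Send the query, but never fail silently: log any error verbosely."""
    try:
        response = requests.post(url, json=payload, timeout=(10, 60))
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as err:
        # Covers timeouts, connection errors, and bad HTTP statuses.
        log.error(f"Query to {url} failed: {err}\n{traceback.format_exc()}")
        raise
    except Exception as err:
        # Catch-all so nothing disappears without a trace in the logs.
        log.error(f"Unexpected error querying {url}: {err}\n{traceback.format_exc()}")
        raise
```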
So I'm looking for places where … and noting that it is called from this line in RTX/code/ARAX/ARAXQuery/ARAX_expander.py, line 822 (at commit 2f76900), with a …
This line of code, RTX/code/ARAX/ARAXQuery/ARAX_expander.py, line 783 (at commit 2f76900), is also in …
Hmm, this line of code in RTX/code/ARAX/ARAXQuery/ARAX_expander.py, line 936 (at commit 2f76900)...
I wonder if this is leading to an uncaught error that ultimately leaves a process stuck waiting.
I don't know. I just tried reading the docs to understand what `await` means and my brain started to smoke...
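For reference, a minimal sketch of the relevant behavior, using made-up coroutine names rather than the ARAX code: an `await` simply suspends the caller until the awaited task finishes, so if that task never finishes, the caller waits forever. `asyncio.wait_for` is one way to put a bound on that wait.

```python
import asyncio

async def query_kp():
    """Stand-in for a KP query that, in the failure mode, never returns."""
    await asyncio.sleep(3600)  # simulates a call that hangs

async def expand_one_edge():
    try:
        # Without wait_for, `await query_kp()` would block this task indefinitely.
        result = await asyncio.wait_for(query_kp(), timeout=2)
    except asyncio.TimeoutError:
        print("KP query timed out; giving up on this edge")
        result = None
    return result

asyncio.run(expand_one_edge())
```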
OK, well, perhaps another data point against L783 being the culprit is that that is where the TRAPI querier is called. But PloverDB doesn't speak TRAPI (as far as I know), so I would think TRAPIQuerier doesn't call PloverDB. Maybe it calls RTX-KG2, however.
It does seem that requests can throw an exception. So it seems very sensible to wrap line 186 in a try/except and do something sensible if it fails. Maybe wait a couple of seconds and retry, and then fail with a noisy log message if it fails a second time?
Yes, requests can throw a … For debugging purposes, I would think that …
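A sketch of the retry-then-fail-noisily idea; the endpoint, delay, and logger are illustrative choices, not the actual kg2_querier values.

```python
import logging
import time

import requests

log = logging.getLogger("kg2_querier")  # placeholder logger

def query_with_one_retry(url: str, payload: dict, retry_delay_s: float = 2.0) -> dict:
    """Try the query once; on failure wait briefly, retry once, then fail loudly."""
    for attempt in (1, 2):
        try:
            response = requests.post(url, json=payload, timeout=(10, 60))
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as err:
            log.warning(f"Attempt {attempt} to query {url} failed: {err}")
            if attempt == 1:
                time.sleep(retry_delay_s)
    log.error(f"Both attempts to query {url} failed; giving up on this KP call")
    raise RuntimeError(f"KP query to {url} failed after retry")
```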
The problem is so rare that I suspect we just need to roll out some new code that might possibly fix it and watch. Failure rate seems to be 1 in 2500.
Do you want me to put in a patch to …
(in a branch, that is)
That would be great!
on it
Do we know if this is an ARAX problem or an RTX-KG2 problem?
I observed it to be a KG2 problem. ARAX does not query Plover directly. That said, we also have an ARAX problem. But I think it is a different issue.
Thank you. To be clear, my plan is to initially work this issue in a branch (I don't want to commit potentially breaking changes to master, and it is convenient to commit to git in order to deploy into arax.ncats.io), but I will merge to master before broadly deploying. Sound good?
Not certain I fully understand, but feel free to work in a branch. Please restore test endpoints back to master when you're done. When I'm deploying around, I don't usually check to make sure they're still on master, because I usually leave them that way.
What I meant to say is that I prefer to edit the code on my MBP and then deploy the updated module to the endpoint via … Hmm, on second thought, that is a lot of steps. I suppose it would involve fewer deployment steps to just edit source files directly inside the container. But then I would have to set up an SSH key to enable me to commit back to GitHub from the container. Anyhow, …
Thanks to @saramsey, new code is now on kg2beta. We'll see what happens. We should be getting good testing. Max Wang (I assume) is using Aragorn to beat up kg2 and kg2beta on arax.ncats.io:
OK, @amykglen can you please do the merge? Thanks.
Normally after a merge I would delete a branch, but in this case (issue 2114), it's a bit of a special case because the root cause appears to still be unresolved. Everything in the merge is basically additional improvements that came under consideration during the troubleshooting of the deadlock issue. So, in this case, I'd recommend keeping the branch around post-merge. We can use it for future work on the issue (I haven't given up on finding the root cause, though admittedly it has defied my best efforts so far). But, after the merge, @amykglen, I would recommend switching the code repos for …
OK, sounds good. I'll merge and I'll look into switching …
I can go ahead and do this now. I'm fussing with something else in this area at the moment anyway.
I have switched all the usual places to …
Thanks, all! And yes, @amykglen, switching an endpoint …
check that the only modifications to tracked files are in …
@amykglen: I am not great with git, so the above may be inefficient. But that's pretty much what I do. :-)
Thanks @edeutsch and @saramsey! I added Steve's recipe (with slight additions/modifications) to our wiki under operations/deployment SOPs:
It looks like our GitHub test builds have been failing since we merged … It looks like the error is:
Here's more detailed output:
Amy, do you mean the Test Build GitHub Action that is run on …
I wonder if we should explicitly generate the kp_info_cache pickle file(s) before running the pytest suite. I bet the Test Build script could be set up to do this as a preliminary step before running the suite.
Yeah, good idea - we could add another step prior to actually running the pytest suite, somewhere in here: RTX/.github/workflows/pytest.yml, lines 35 to 86 (at commit 0f430ea)
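To make the idea concrete, a minimal sketch of a pre-test step that builds and pickles a KP-info cache before pytest runs; the script name, builder function, and cache location are hypothetical placeholders, not the actual RTX cache code.

```python
# pregenerate_kp_cache.py -- hypothetical helper invoked as a workflow step
# just before the step that runs pytest, so tests never build the cache themselves.
import pathlib
import pickle

CACHE_PATH = pathlib.Path("kp_info_cache.pkl")  # placeholder location

def build_kp_info_cache() -> dict:
    """Stand-in for whatever call actually assembles the KP metadata."""
    return {"kps": {}, "meta_map": {}}

if __name__ == "__main__":
    CACHE_PATH.write_bytes(pickle.dumps(build_kp_info_cache()))
    print(f"Wrote {CACHE_PATH} ({CACHE_PATH.stat().st_size} bytes)")
```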
It looks like …
This just adds another piece to the pile of previous evidence that the RTX-KG2 …
Isn't it already in 20.04?
In …
And in fact, the fact that the same behavior is present in Ubuntu 16 and Ubuntu 20 suggests that Python or a package is more likely the culprit than the OS?
Well, hmm, I did some reading and I guess Docker uses the host OS kernel, which in the case of …
I'm not sure what is running on …, which is (understandably) not public. I'm going to ask ITRB what Linux kernel version they are running in CI:
OK, Pouyan says that on ITRB, we are running the following Linux kernel version in the host OS:
So that is two years newer than 5.4.0. This would seem to cast doubt on the "kernel bug" theory, for sure.
Great, @sundareswarpullela can you please take care of making that change to the pytest Test Build workflow?
Hmm, the plot thickens. Maybe the problem is heavy load plus the use of Python …
Note, in the commits to fix issue #2170, I included a bunch of changes to eliminate use of Python …
It looks like ARAGORN is no longer querying …
I'm going to close this issue; we haven't had …
This is mostly an ITRB phenomenon, but I've now seen it on arax.ncats.io too, very rarely. Here's part of the output from active queries:
[image: screenshot of the active-queries output]
This is Aragorn querying kg2beta
The PIDs are 23235 and 16198.
They're still running, but apparently not using any CPU.
What are they doing? Why haven't they terminated?
Since all the logs are merged, it's hard to know what these are.
Ah! Maybe we can add printing of the PID to the error logging. That might help figure out what the last message is.
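A minimal sketch of one way to do that with the standard logging module, tagging every record with the emitting process's ID via the format string; the logger name here is a placeholder.

```python
import logging

# %(process)d injects the emitting process's PID into every log line, so
# interleaved records in merged logs can be attributed to a specific query process.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [pid %(process)d] %(levelname)s %(name)s: %(message)s",
)

log = logging.getLogger("ARAX_query")  # placeholder logger name
log.info("Expand step starting")
# -> e.g. "2023-08-25 ... [pid 23235] INFO ARAX_query: Expand step starting"
```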