
Some KG2 query subprocesses never terminate #2114

Closed
edeutsch opened this issue Aug 25, 2023 · 139 comments
Comments

@edeutsch
Collaborator

This is mostly an ITRB phenomenon, but I've now seen it on arax.ncats.io too, very rarely. Here's part of the output from active queries:
[screenshot of the active-queries output]

This is Aragorn querying kg2beta.
The PIDs are 23235 and 16198.

$ ps axuww | egrep '23235|16198'
rt       16198  0.0  2.4 3212000 1559948 ?     S    02:12   0:00 python3 -u -m openapi_server
rt       23235  0.0  2.4 3222756 1559756 ?     S    12:05   0:00 python3 -u -m openapi_server

They're still running, but apparently not using any CPU.
What are they doing? Why haven't they terminated?
Since all the logs are merged, it's hard to know what these are.

Ah! Maybe we can add printing of the PID to the error logging. That might help figure out what each process's last message was.
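For illustration, something like this is what I have in mind (a minimal sketch; the elog helper name is hypothetical, and the real change would go wherever the existing log formatting lives):

import os
import sys
from datetime import datetime

def elog(level: str, message: str) -> None:
    # Hypothetical helper: prefix each line with a timestamp, the level, and the
    # current process ID, so lines from different forked workers can be told
    # apart in the merged log file.
    print(f"{datetime.now().isoformat()} {level}: ({os.getpid()}) [] {message}",
          file=sys.stderr, flush=True)

elog("DEBUG", "Sending query to https://kg2cploverdb.ci.transltr.io")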

@saramsey
Member

Perhaps somewhere lurking in the RTXteam/RTX code-base, we have a blocking I/O action (e.g., a remote query or something) that doesn't have an explicit timeout?

@edeutsch
Collaborator Author

So I added the PIDs to the logging, which I should have thought of much sooner, but it's there now, and it seems to reveal that very occasionally queries to Plover hang forever.

Currently, the active-queries list shows:
[screenshot of the active-queries list]

If you search the kg2beta logs, their last message was always a query to Plover:

$ grep '(4050)' /tmp/RTX_OpenAPI_kg2beta.elog
...
2023-08-28T16:02:16.797306 DEBUG: (4050) [] Sending batch 1 to Plover (has 9 input curies)
2023-08-28T16:02:16.800008 DEBUG: (4050) [] Sending query to https://kg2cploverdb.ci.transltr.io


$ grep '(1486)' /tmp/RTX_OpenAPI_kg2beta.elog
...
2023-08-27T17:29:03.023708 DEBUG: (1486) [] Sending batch 1 to Plover (has 1 input curies)
2023-08-27T17:29:03.026345 DEBUG: (1486) [] Sending query to https://kg2cploverdb.ci.transltr.io


$ grep '(4727)' /tmp/RTX_OpenAPI_kg2beta.elog
...
2023-08-27T03:39:50.712505 DEBUG: (4727) [] Sending batch 1 to Plover (has 1 input curies)
2023-08-27T03:39:50.715471 DEBUG: (4727) [] Sending query to https://kg2cploverdb.ci.transltr.io

Can someone dive into the KG2 code that queries Plover to see how we could get into a state that just waits forever?

@saramsey
Member

Hmm, this line of code seems to have a built-in timeout:

response = requests.post(f"{rtxc.plover_url}/query", json=dict_qg, timeout=60,

@edeutsch
Collaborator Author

As far as I can tell from the log files for these 3 queries, the code never progresses beyond this line. It should write what happened to the log:

        if response.status_code == 200:
            log.debug(f"Got response back from Plover")
            return response.json(), response.status_code
        else:
            log.warning(f"Plover returned a status code of {response.status_code}. Response was: {response.text}")
            return dict(), response.status_code

But there's nothing. Crickets. I don't see a stack trace either, although it's a very complicated log.

@edeutsch
Collaborator Author

To be clear, this is a rare event. 99+% of the time it's fine, but at a rate of approximately 1 in 2,500 queries we seem to execute line 186 but never proceed to line 187. And the process just sits there forever.

I could implement a process killer. After a child process has been going for 15 minutes or so, I could just SIGKILL it...
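A rough sketch of what that reaper could look like (assuming psutil is available, which it is elsewhere in the code base, and that any worker older than the cutoff is safe to kill; the 15-minute threshold is just the one suggested above):

import os
import signal
import time

import psutil  # assumption: psutil is available (it is already used by kp_info_cacher)

MAX_CHILD_AGE_SEC = 15 * 60  # kill anything older than ~15 minutes

def kill_stuck_children() -> None:
    # Walk the children of this (parent) process and SIGKILL any that have been
    # alive longer than the cutoff, on the assumption that they are stuck.
    now = time.time()
    for child in psutil.Process(os.getpid()).children(recursive=True):
        try:
            age_sec = now - child.create_time()
            if age_sec > MAX_CHILD_AGE_SEC:
                print(f"Killing stuck child PID {child.pid} (age {age_sec:.0f}s)")
                child.send_signal(signal.SIGKILL)
        except psutil.NoSuchProcess:
            pass  # the child exited between listing and inspection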

@edeutsch
Collaborator Author

Another possibility is that there's an upstream try/except that intercepts a failure from line 186 and is totally quiet about the failure (which would be odd)? Although I don't see one a couple of levels up. I wonder if it might be a good idea to wrap line 186 in a try/except and be verbose to the logs about a failure?

Or I wonder if switching to aiohttp might be more robust?
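For what it's worth, a minimal sketch of what the Plover call might look like with aiohttp (not the actual ARAX code; the plover_url and dict_qg names just follow the snippet quoted above):

import aiohttp

async def query_plover_async(plover_url: str, dict_qg: dict) -> tuple:
    # ClientTimeout(total=60) bounds the whole request (connect + read), so a
    # stalled socket raises a timeout error instead of hanging the worker.
    timeout = aiohttp.ClientTimeout(total=60)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(f"{plover_url}/query", json=dict_qg) as response:
            if response.status == 200:
                return await response.json(), response.status
            return dict(), response.status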

@saramsey
Member

Maybe line 186 should be catching a requests.Timeout error and maybe also a requests.ConnectionError and handling either condition.

@edeutsch
Collaborator Author

or just *any* error...

@saramsey
Member

So I'm looking for places where kg2_querier.KG2Querier.answer_one_hop_query is called:

def answer_one_hop_query(self, query_graph: QueryGraph) -> QGOrganizedKnowledgeGraph:

and noting that it is called from this line in ARAX_expander.py:

answer_kg = kg2_querier.answer_one_hop_query(one_hop_qg)

with a try block around it. Now, it is possible that there are other places in the code-base where kg2_querier.KG2Querier.answer_one_hop_query is called; this is Python, after all, so it is not so easy to check by static analysis. In particular, I'm looking at lines 766 and 878 of ARAX_expander.py. On the other hand, lines 766 and 878 are both in try blocks.

@saramsey
Member

saramsey commented Aug 29, 2023

This line of code

answer_kg = await kp_querier.answer_one_hop_query_async(edge_qg)

is also in a try block, so it probably isn't the culprit.

@saramsey
Member

Hmm, this line of code in ARAX_expander.py calls kg2_querier.KG2Querier.answer_single_node_query without a try block:

answer_kg = kp_querier.answer_single_node_query(single_node_qg)

I wonder if this leads to an uncaught error that ultimately leaves the process stuck waiting.
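If it is, a guard along these lines (a hypothetical wrapper; the names follow the quoted line, and the real fix might simply be to extend the existing try block) would at least make any failure visible in the log:

import traceback

def answer_single_node_query_guarded(kp_querier, single_node_qg, log):
    # Hypothetical wrapper: catch whatever the unguarded call throws, write the
    # full stack trace to the log, and re-raise rather than failing silently.
    try:
        return kp_querier.answer_single_node_query(single_node_qg)
    except Exception:
        log.error("answer_single_node_query failed:\n" + traceback.format_exc())
        raise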

@edeutsch
Collaborator Author

I don't know. I just tried reading the docs to understand what await means and my brain started to smoke...

@saramsey
Member

saramsey commented Aug 29, 2023

OK, well, perhaps another data point against L783 being the culprit is that that line is where the TRAPI querier is called. But PloverDB doesn't speak TRAPI (as far as I know), so I would think TRAPIQuerier doesn't call PloverDB. Maybe it calls RTX-KG2, however.

@edeutsch
Collaborator Author

It does seem that requests can throw an exception:
https://requests.readthedocs.io/en/latest/user/quickstart/#errors-and-exceptions

So it seems very sensible to wrap line 186 in a try/except and do something sensible if it fails. Maybe wait a couple of seconds and retry, and then fail with a noisy log message if it fails a second time?

@saramsey
Member

saramsey commented Aug 29, 2023

Yes, requests can throw a requests.Timeout or requests.ConnectionError.

For debugging purposes, I would think that kg2_querier.py should be catching timeout and connection errors and printing a full stack trace to the log (after that, I don't care much what it does from a debugging standpoint). Perhaps we could stand up such a modified kg2_querier.py in /test or something and watch for issues. This is tough without a reliable way to trigger the problem.
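A sketch of what that patch might look like, combining the retry idea above with explicit handling of the two requests failure modes (a hypothetical stand-alone function, not the actual kg2_querier.py code; the 598 status code is just a made-up sentinel):

import time
import traceback

import requests

def query_plover(plover_url: str, dict_qg: dict, log, retries: int = 1) -> tuple:
    # Hypothetical replacement for the bare requests.post on line 186: catch
    # timeouts/connection errors, log the full stack trace, retry once after a
    # short pause, and fail loudly instead of silently if the retry also fails.
    for attempt in range(retries + 1):
        try:
            response = requests.post(f"{plover_url}/query", json=dict_qg, timeout=60)
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            log.warning(f"Plover request attempt {attempt + 1} failed:\n{traceback.format_exc()}")
            time.sleep(2)
            continue
        if response.status_code == 200:
            log.debug("Got response back from Plover")
            return response.json(), response.status_code
        log.warning(f"Plover returned a status code of {response.status_code}. Response was: {response.text}")
        return dict(), response.status_code
    log.error("Giving up on Plover after repeated failures")
    return dict(), 598  # made-up sentinel meaning "no response from Plover"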

@edeutsch
Collaborator Author

The problem is so rare that I suspect we just need to roll out some new code that might possibly fix it and watch. The failure rate seems to be about 1 in 2,500.

@saramsey
Member

Do you want me to put in a patch to kg2_querier.py?

@saramsey
Member

(in a branch, that is)

@edeutsch
Collaborator Author

That would be great!
But I would go straight for master.
A branch is more difficult for me.

@saramsey
Member

on it

@saramsey
Member

Do we know if this is an ARAX problem or an RTX-KG2 problem?

@edeutsch
Collaborator Author

I observed it to be a KG2 problem. ARAX does not query Plover directly.

That said, we also have an ARAX problem. But I think it is a different issue.

@saramsey
Member

saramsey commented Aug 29, 2023

Thank you.

To be clear, my plan is to initially work this issue in a branch (I don't want to commit potentially breaking changes to master and it is convenient to commit to git in order to deploy into arax.ncats.io) but I will merge to master before broadly deploying. Sound good?

@edeutsch
Collaborator Author

Not certain I fully understand, but feel free to work in a branch. Please restore the test endpoints back to master when you're done. When I'm deploying around, I don't usually check to make sure they're still on master, because I usually leave them that way.

saramsey added a commit that referenced this issue Aug 29, 2023
saramsey mentioned this issue Aug 29, 2023
@saramsey
Member

saramsey commented Aug 29, 2023

What I meant to say is that I prefer to edit the code on my MBP and then deploy the updated module to the endpoint via git pull inside the container. But until I've tested the code, I don't want to commit it to the master branch, not least because I don't want to break kg2.ci.transltr.io. So I'll make an issue-specific branch off of master, commit to that branch, switch the endpoint temporarily to that branch, deploy, test, merge to master, and then switch the endpoint back to master.

Hmm, on second thought, that is a lot of steps.

I suppose it would involve fewer deployment steps to just edit source files directly inside the container. But then I have to set up an ssh key to enable me to commit back to GitHub from the container.

Anyhow, /kg2test is back on the master branch now!

@edeutsch
Collaborator Author

Thanks to @saramsey, new code is now on kg2beta. We'll see what happens. We should be getting good testing. Max Wang (I assume) is using Aragorn to beat up kg2 and kg2beta on arax.ncats.io:

[screenshots of Aragorn query activity against kg2 and kg2beta]

@saramsey
Member

OK, @amykglen can you please do the merge? Thanks.

@saramsey
Member

saramsey commented Sep 26, 2023

Normally after a merge I would delete the branch, but this one (issue 2114) is a bit of a special case because the root cause appears to still be unresolved. Everything in the merge is basically additional improvements that came under consideration during the troubleshooting of the deadlock issue. So, in this case, I'd recommend keeping the branch around post-merge. We can use it for future work on the issue (I haven't given up on finding the root cause, though admittedly it has defied my best efforts so far). But, after the merge, @amykglen I would recommend switching the code repos for /kg2beta and /beta to the master branch so they continue to get code improvements that have been committed or merged to master.

@amykglen
Member

ok, sounds good. I'll merge issue-2114 into master now, but leave the issue-2114 branch around.

and I'll look into switching /kg2beta and /beta back to the master branch this afternoon (haven't had to do a branch switch - or a deployment, for that matter - on arax.ncats.io before... is it basically a matter of checking out master and then restarting the service?)

@edeutsch
Collaborator Author

and I'll look into switching /kg2beta and /beta back to the master branch this afternoon (haven't had to do a branch switch - or a deployment, for that matter - on arax.ncats.io before... is it basically a matter of checking out master and then restarting the service?)

I can go ahead and do this now. I'm fussing with something else in this area at the moment anyway.

@edeutsch
Collaborator Author

I have switched all the usual places to master and deployed everything.

@saramsey
Member

saramsey commented Sep 26, 2023

Thanks, all!

And yes, @amykglen, switching an endpoint /foo over to a branch xyz (the same recipe works for switching back to master) would involve (approximately) the following steps:

ssh arax.ncats.io
sudo docker exec -it rtx1 bash
su - rt
cd /mnt/data/orangeboard/foo/RTX
git status 

check that the only modifications to tracked files are in openapi.yaml and then do:

git pull origin
git checkout xyz
git pull origin xyz
exit
service RTX_OpenAPI_foo restart
tail -f /tmp/RTX_OpenAPI_foo.elog

@saramsey
Member

saramsey commented Sep 26, 2023

@amykglen: I am not great with git, so the above may be inefficient. But that's pretty much what I do. :-)

@amykglen
Member

thanks @edeutsch and @saramsey!

I added Steve's recipe (with slight additions/modifications) to our wiki under operations/deployment SOPs:
https://github.com/RTXteam/RTX/wiki/Operations-&-deployment-info#how-to-deploy-code-changes-on-araxncatsio

@amykglen
Member

amykglen commented Oct 4, 2023

it looks like our GitHub test builds have been failing since we merged issue-2114 into master - here's an example:
https://github.com/RTXteam/RTX/actions/runs/6409684715/job/17401385185

it looks like the error is:

MAX_TOTAL_WAIT_FOR_CACHE_SEC value was too small: 180.0

here's more detailed output:

=================================== FAILURES ===================================
_____________________ test_connect_acetaminophen_to_ptgs1 ______________________

    def test_connect_acetaminophen_to_ptgs1():
        query = {"operations": {"actions": [
                "create_message",
                "add_qnode(ids=CHEMBL.COMPOUND:CHEMBL112, key=n00)",
                "add_qnode(ids=UniProtKB:P23219, key=n01)",
                "connect(action=connect_nodes)",
                "resultify(ignore_edge_direction=true, debug=true)",
                "return(message=true, store=false)"
            ]}}
>       [response, message] = _do_arax_query(query)

test_ARAX_connect.py:91: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_ARAX_connect.py:28: in _do_arax_query
    response = araxq.query(query)
../ARAXQuery/ARAX_query.py:396: in query
    result = self.execute_processing_plan(query, mode=mode)
../ARAXQuery/ARAX_query.py:668: in execute_processing_plan
    from ARAX_expander import ARAXExpander
../ARAXQuery/ARAX_expander.py:33: in <module>
    from Expand.trapi_querier import TRAPIQuerier
../ARAXQuery/Expand/trapi_querier.py:35: in <module>
    class TRAPIQuerier:
../ARAXQuery/Expand/trapi_querier.py:38: in TRAPIQuerier
    kp_selector: KPSelector = KPSelector(), force_local: bool = False):
../ARAXQuery/Expand/kp_selector.py:27: in __init__
    self.meta_map, self.kp_urls, self.kps_excluded_by_version, self.kps_excluded_by_maturity = self._load_cached_kp_info()
../ARAXQuery/Expand/kp_selector.py:38: in _load_cached_kp_info
    smart_api_info, meta_map = kp_cacher.load_kp_info_caches(self.log)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <kp_info_cacher.KPInfoCacher object at 0x7f568c1d7760>
log = <ARAX_response.ARAXResponse object at 0x7f568c1d76d0>

    def load_kp_info_caches(self, log: ARAXResponse):
        """
        This method is meant to be used anywhere the meta map or smart API
        caches need to be used (i.e., by KPSelector).  Other modules should
        NEVER try to load the caches directly! They should only load them via
        this method.  It ensures that caches are up to date and that they don't
        become corrupted while refreshing.
        """
    
        refresher_pid_exists = os.path.exists(self.cache_refresh_pid_path)
    
        if refresher_pid_exists:
            # KP info cache is probably being regenerated
            log.info("the KP info cache is being refreshed; waiting for it")
            # if either pickled cache file is missing, then check if they are
            # being generated (on the other hand, if both exist, just move on
            # since we will use the cache files); see RTX issue 2072
            # Check if the refresher PID file exists
            try:
                with open(self.cache_refresh_pid_path, "r") as f:
                    refresher_pid = int(f.read())
                    assert refresher_pid != os.getpid()
                    # Get the PID of the process that is currently refreshing
                    # the caches
                    # Check if the process is still running
                    iter_ctr = 0
                    while True:
                        # if the caches are being actively refreshed, wait for
                        # it to finish
                        time.sleep(WAIT_LOOP_SEC)
                        iter_ctr += 1
    
                        refresher_is_running = psutil.pid_exists(refresher_pid)
                        cache_files_present = self.both_caches_are_present()
                        caches_being_refreshed = refresher_is_running and \
                            (refresher_pid_exists or (not cache_files_present))
    
                        if not caches_being_refreshed:
                            if not refresher_is_running and \
                               refresher_pid_exists:
                                log.warning("Removing KP info cache refresher "
                                            "PID file")
                                os.unlink(self.cache_refresh_pid_path)
    
                            if cache_files_present:
                                # the cache files have been updated
                                break
                            else:
                                raise Exception("kp_info_cacher file(s) are missing")
    
                        if WAIT_LOOP_SEC * iter_ctr > \
                           MAX_TOTAL_WAIT_FOR_CACHE_SEC:
>                           raise Exception("Timed out waiting for SmartAPI " +
                                            "cache creation; perhaps " +
                                            "MAX_TOTAL_WAIT_FOR_CACHE_SEC " +
                                            "value was too small: " +
                                            f"{MAX_TOTAL_WAIT_FOR_CACHE_SEC}")
E                                           Exception: Timed out waiting for SmartAPI cache creation; perhaps MAX_TOTAL_WAIT_FOR_CACHE_SEC value was too small: 180.0

../ARAXQuery/Expand/kp_info_cacher.py:183: Exception
----------------------------- Captured stderr call -----------------------------
2023-10-04T18:00:57.815646 INFO: (214) [] ARAX Query launching on incoming Query
2023-10-04T18:00:57.815728 DEBUG: (214) [] RTXConfiguration says maturity=development, current_branch=None, is_itrb_instance=False, arax_version=1.4.3, trapi_version=1.4.0
2023-10-04T18:00:57.815762 INFO: (214) [] Creating an empty template TRAPI Response
2023-10-04T18:00:57.815952: DEBUG: In BiolinkHelper init
2023-10-04T18:00:57.818538: DEBUG: Loading BL lookup map...
2023-10-04T18:00:57.818625: DEBUG: lookup map not here! /mnt/data/orangeboard/production/RTX/code/ARAX/BiolinkHelper/biolink_lookup_map_3.1.2_v4.pickle
2023-10-04T18:00:57.818713: INFO: Building local Biolink 3.1.2 ancestor/descendant lookup map because one doesn't yet exist
2023-10-04T18:01:00.792291: DEBUG: Done loading BL lookup map
2023-10-04T18:01:00.795463 INFO: (214) [] Examine input Query for needed information for dispatch
2023-10-04T18:01:00.795542 INFO: (214) [] Found input processing plan. Sending to the ProcessingPlanExecutor
2023-10-04T18:01:00.795574 DEBUG: (214) [] Entering execute_processing_plan
2023-10-04T18:01:00.795721 DEBUG: (214) [] Connecting to ResponseCache
2023-10-04T18:01:01.866298 DEBUG: (214) [] Creating ARAXMessenger instance
2023-10-04T18:01:01.866389 DEBUG: (214) [] No starting messages were referenced. Will start with a blank template Message
2023-10-04T18:01:01.866414 INFO: (214) [] Creating an empty template TRAPI Response
2023-10-04T18:01:01.866498: DEBUG: In BiolinkHelper init
2023-10-04T18:01:01.868307: DEBUG: Loading BL lookup map...
2023-10-04T18:01:01.868354: DEBUG: Loading pickle file: /mnt/data/orangeboard/production/RTX/code/ARAX/BiolinkHelper/biolink_lookup_map_3.1.2_v4.pickle
2023-10-04T18:01:01.869675: DEBUG: Done loading BL lookup map
2023-10-04T18:01:01.871617 DEBUG: (214) [] Found actions
2023-10-04T18:01:01.871670 INFO: (214) [] Parsing input actions list
2023-10-04T18:01:01.871692 DEBUG: (214) [] Parsing action: create_message
2023-10-04T18:01:01.872039 DEBUG: (214) [] Parsing action: add_qnode(ids=CHEMBL.COMPOUND:CHEMBL112, key=n00)
2023-10-04T18:01:01.872414 DEBUG: (214) [] Parsing action: add_qnode(ids=UniProtKB:P23219, key=n01)
2023-10-04T18:01:01.872460 DEBUG: (214) [] Parsing action: connect(action=connect_nodes)
2023-10-04T18:01:01.872492 DEBUG: (214) [] Parsing action: resultify(ignore_edge_direction=true, debug=true)
2023-10-04T18:01:01.872527 DEBUG: (214) [] Parsing action: return(message=true, store=false)
2023-10-04T18:01:01.929239 INFO: (214) [] the KP info cache is being refreshed; waiting for it
=========================== short test summary info ============================
FAILED test_ARAX_connect.py::test_connect_acetaminophen_to_ptgs1 - Exception:...
==== 1 failed, 134 passed, 129 skipped, 68429 warnings in 974.07s (0:16:14) ====
Error: Process completed with exit code 1.

@saramsey
Member

saramsey commented Oct 4, 2023

Amy, do you mean the Test Build GitHub Action that is run on cicd.rtx.ai?

@saramsey
Member

saramsey commented Oct 4, 2023

I wonder if we should explicitly generate the kp_info_cache pickle file(s) before running the pytest suite. I bet the Test Build script could be set up to do this as a preliminary step before running the suite.

@amykglen
Member

amykglen commented Oct 4, 2023

yeah, good idea - we could add another step prior to actually running the pytest suite, somewhere in here (a sketch of such a step follows the excerpt below):

steps:
  - name: Checkout repository
    uses: actions/checkout@v2
  - name: Set up config files
    run: |
      scp [email protected]:config_secrets.json /mnt/config/config_secrets.json
      echo "http://localhost:5008/api/rtxkg2/v1.4" > /mnt/config/kg2_url_override.txt
  # - name: Append code dir to GITHUB_PATH  # Causing RTXConfiguration import error
  #   run: echo $GITHUB_PATH
  # - name: Download databases as needed
  #   run: python3 $HOME/RTX/code/ARAX/ARAXQuery/ARAX_database_manager.py --mnt --skip-if-exists --remove_unused
  - name: Stop any running containers
    continue-on-error: true
    run: sudo docker stop $(sudo docker ps -aq)
  - name: Clean up any left over containers
    continue-on-error: true
    run: sudo docker rm $(sudo docker ps -aq)
  - name: Clean up any left over images
    continue-on-error: true
    run: sudo docker rmi $(sudo docker images -q)
  - name: Build docker image
    run: sudo docker build --no-cache=true --rm -t rtx:test DockerBuild/ -f DockerBuild/Merged-Dockerfile
  - name: Run docker container
    run: sudo docker run -d -it --name rtx_test -v /mnt/data/orangeboard/databases:/mnt/data/orangeboard/databases -v /mnt/config/kg2_url_override.txt:/mnt/data/orangeboard/production/RTX/code/kg2_url_override.txt -v /mnt/config/config_secrets.json:/mnt/data/orangeboard/production/RTX/code/config_secrets.json -v /mnt/config/config_secrets.json:/mnt/data/orangeboard/kg2/RTX/code/config_secrets.json rtx:test
  # - name: Add config.domain
  #   run: |
  #     sudo docker exec rtx_test bash -c "sudo -u rt bash -c 'cd /mnt/data/orangeboard/production/RTX/code && echo \'RTXTeam/RTX Github actions test suite\' > config.domain'"
  #     sudo docker exec rtx_test bash -c "sudo -u rt bash -c 'cd /mnt/data/orangeboard/kg2/RTX/code && echo \'RTXTeam/RTX Github actions test suite\' > config.domain'"
  - name: Create symlinks
    run: |
      sudo docker exec rtx_test bash -c "sudo -u rt bash -c 'cd /mnt/data/orangeboard/production/RTX && python3 code/ARAX/ARAXQuery/ARAX_database_manager.py'"
      sudo docker exec rtx_test bash -c "sudo -u rt bash -c 'cd /mnt/data/orangeboard/kg2/RTX && python3 code/ARAX/ARAXQuery/ARAX_database_manager.py'"
  - name: Start services
    run: |
      sudo docker exec rtx_test service apache2 start
      sudo docker exec rtx_test service RTX_OpenAPI_kg2 start
      sudo docker exec rtx_test service RTX_OpenAPI_production start
      sudo docker exec rtx_test service RTX_Complete start
  - name: Run tests with pytest
    run: sudo docker exec rtx_test bash -c "cd /mnt/data/orangeboard/production/RTX/code/ARAX/test && pytest -v --disable-pytest-warnings"
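For the record, one possible shape for that extra step (a sketch only: the assumption that kp_info_cacher.py can be run as a script to refresh the caches is a guess; the intent is just to build the SmartAPI/meta-map caches before pytest starts):

  # Hypothetical extra step, to be placed just before the "Run tests with pytest" step.
  # The exact command for refreshing the KP info caches is an assumption.
  - name: Pre-build KP info caches
    run: sudo docker exec rtx_test bash -c "sudo -u rt bash -c 'cd /mnt/data/orangeboard/production/RTX/code/ARAX/ARAXQuery/Expand && python3 kp_info_cacher.py'"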

@edeutsch
Collaborator Author

edeutsch commented Oct 5, 2023

For the record, two more stuck processes in CI:
[screenshot of two stuck KG2 processes in the CI active-query list]

@saramsey
Member

saramsey commented Oct 5, 2023

It looks like kg2.ci.transltr.io has been under heavy load recently from ARAGORN, so that checks out.

@edeutsch
Collaborator Author

edeutsch commented Oct 5, 2023

Blistering!
They seem pretty determined to pound away on all our instances.
I wonder how many years it would take to download all KG2 edges this way...

[screenshot of recent ARAGORN query volume]

@saramsey
Member

saramsey commented Oct 5, 2023

This just adds another piece to the pile of previous evidence that the RTX-KG2 futex hangs are load-dependent. I'm looking forward to standing up RTX-KG2 on Ubuntu 20.04 to see if the issue goes away (maybe it's a thread bug in a shared library like libssl.so or something, or a kernel bug like this futex bug, that is fixed in newer versions of Ubuntu).

@edeutsch
Collaborator Author

edeutsch commented Oct 5, 2023

Isn't it already in 20.04?

@saramsey
Member

saramsey commented Oct 5, 2023

In kg2.ci.transltr.io, yes. Not in arax.ncats.io/kg2beta (right?). But I see your point-- the fact that we see it in ITRB CI would seem to rule out that it's due to an old distro. Sigh.

@edeutsch
Collaborator Author

edeutsch commented Oct 5, 2023

And in fact, since the same behavior is present in Ubuntu 16 and Ubuntu 20, doesn't that suggest Python or a package is more likely the culprit than the OS?

@saramsey
Member

saramsey commented Oct 5, 2023

Well, hmm, I did some reading and I guess Docker uses the host OS kernel, which in the case of arax.ncats.io is:

stephenr@ip-172-31-53-16:~$ uname -r
5.4.0-1056-aws

I'm not sure what is running on kg2.ci.transltr.io. It is based on this image:
https://github.com/RTXteam/RTX/blob/8372db4d9ccf4c0fab828723b4d2fa995676dca6/deploy/kg2/Jenkinsfile#L18C56-L18C56

which is (understandably) not public. I'm going to ask ITRB what Linux kernel version they are running in CI:
https://ncatstranslator.slack.com/archives/C023J4P0Y1M/p1696545384988789

@saramsey
Member

saramsey commented Oct 6, 2023

OK, Pouyan says that on ITRB, we are running the following Linux kernel version in the host OS:

root@ip-10-11-7-83:~# uname -r
5.15.0-1026-aws

so that is two years newer than 5.4.0. This would seem to cast doubt on the "kernel bug" theory, for sure.

@saramsey
Member

saramsey commented Oct 6, 2023

> yeah, good idea - we could add another step prior to actually running the pytest suite, somewhere in here: [workflow steps quoted from @amykglen's comment above]

Great, @sundareswarpullela can you please take care of making that change to the pytest Test Build workflow?

@saramsey
Member

saramsey commented Oct 19, 2023

Hmm, the plot thickens. Maybe the problem is heavy load plus the use of Python logging plus the use of os.fork?

python/cpython#50970
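A minimal sketch of that failure mode (illustrative only, not ARAX code): if os.fork() happens while another thread holds logging's internal handler lock, the child inherits the held lock and its first log call can block forever.

import logging
import os
import threading
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("fork-demo")

def chatty() -> None:
    while True:
        log.info("background thread logging under load")

threading.Thread(target=chatty, daemon=True).start()
time.sleep(0.1)  # let the background thread start hammering the handler lock

pid = os.fork()  # the child gets only the calling thread, plus the lock's current state
if pid == 0:
    log.info("child logging")  # can deadlock if the lock was held at fork time
    os._exit(0)
os.waitpid(pid, 0)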

@saramsey
Member

saramsey commented Oct 19, 2023

Note: in the commits to fix issue #2170, I included a bunch of changes to eliminate use of the Python logging package, which is apparently really not multiprocessing-friendly. There is some faint hope that this might help with the deadlock situation, maybe?

@saramsey
Member

saramsey commented Oct 21, 2023

It looks like ARAGORN is no longer querying /kg2 frequently. I wonder if we need to set up our own load-testing framework so we can continue to work on issue #2114? My guess is that if we don't have a heavy load, the frequency of occurrences of the futex issue will drop to zero, which of course will make it hard to test potential fixes, etc.
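If we do go that route, even something crude would probably do - e.g., a small script along these lines (the endpoint URL, query graph, and concurrency numbers are all placeholders to be tuned):

import concurrent.futures

import requests

KG2_URL = "https://arax.ncats.io/api/rtxkg2/v1.4/query"  # placeholder endpoint
QUERY = {"message": {"query_graph": {
    "nodes": {"n00": {"ids": ["CHEMBL.COMPOUND:CHEMBL112"]},
              "n01": {"categories": ["biolink:Protein"]}},
    "edges": {"e00": {"subject": "n00", "object": "n01",
                      "predicates": ["biolink:related_to"]}}}}}

def one_query(i: int) -> int:
    # Each worker fires the same small one-hop TRAPI query and reports the HTTP status.
    return requests.post(KG2_URL, json=QUERY, timeout=300).status_code

# 20 concurrent workers x 200 queries; tune these to approximate ARAGORN's load.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    statuses = list(pool.map(one_query, range(200)))
print({code: statuses.count(code) for code in sorted(set(statuses))})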

@saramsey
Member

I'm going to close this issue; we haven't had futex-stuck KG2 processes in a while. If it happens again, we can re-open this issue.
