KP info cache not refreshing hourly - BackgroundTasker issue? #2170
Comments
Thanks @amykglen for bringing this issue up. So, @isbluis can you provide some context for the issue yesterday with picking up the new meta-KG for automat-ctd? Which ARAX endpoint was this? How was it eventually resolved? Was the endpoint restarted? Was the elog file saved before the Flask app was restarted? |
I think I added more logging on what the BackgroundTasker was doing, so looking at the logs (when we determine which instance we're talking about) would be useful. Based on minimal information, it appeared that the BackgroundTasker just stopped working with little information on what happened. The logs might show what its last communications were. I'm thinking this problem appears to be new since we switched from a thread to a child process. |
Hmm, this is pure conjecture, but my gut feeling is "unhandled signal" in the background tasker. Could be that we just need to install a signal handler on the Background Tasker process (or change the default action to be "log and ignore"). |
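A minimal sketch of that idea, assuming a plain Python signal handler in the BackgroundTasker child; the handler and logger names are illustrative, not taken from the actual RTX code:

```python
import logging
import signal

logger = logging.getLogger("BackgroundTasker")

def log_and_ignore(signum, frame):
    # Log which signal arrived instead of letting the default action kill the process
    logger.warning("BackgroundTasker received signal %s; ignoring it", signal.Signals(signum).name)

def install_signal_handlers():
    # SIGKILL and SIGSTOP cannot be caught; cover the common catchable ones
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        signal.signal(sig, log_and_ignore)
```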
Hi @edeutsch can you remind me, do we need to run the Background Tasker on RTX-KG2 services? We don't use the KP info cache on those services, so maybe we don't need it? |
I'm doing this work in branch |
Yes, I think we should run it. There is also code that assesses the currently running child processes according to the system activity table and removes any that have died (i.e., if a child process is found to have gone away, it is marked as dead and removed from the activity table). It also does some other things, I think. |
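A hedged sketch of that housekeeping step (the real activity table lives in the RTX database; the dict-based table and function names below are hypothetical):

```python
import os

def prune_dead_processes(activity_table):
    """activity_table: dict mapping child PID -> description of that process."""
    for pid in list(activity_table):
        try:
            os.kill(pid, 0)            # signal 0 checks existence without sending anything
        except ProcessLookupError:     # the child has gone away
            activity_table.pop(pid)    # mark it as dead / remove it from the table
        except PermissionError:
            pass                       # a process with this PID exists but belongs to another user
```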
Understood. I will fix commit 86b6c2 so BackgroundTasker is run in the RTX-KG2 service. Expect another commit for that tonight, to the |
In this branch, I'm also removing all use of |
OK, commit 7d622a8 should run the BackgroundTasker in the KG2 main.py now. |
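For context, a rough sketch of what launching the tasker as a child process from a service's main.py can look like; the function and argument names here are illustrative and may differ from the actual commit:

```python
import multiprocessing

def start_background_tasker(tasker_main):
    # Run the BackgroundTasker loop in a forked child rather than a thread,
    # matching the switch described earlier in this thread
    child = multiprocessing.Process(target=tasker_main, name="BackgroundTasker")
    child.start()
    return child
```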
Testing this on |
I think we should change the effective process name of the child process (see https://stackoverflow.com/questions/564695/is-there-a-way-to-change-effective-process-name-in-python). I will try to do that tonight |
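The linked answer suggests the setproctitle package; assuming that is the route taken (the thread does not spell out which dependency was actually added), the change is roughly:

```python
import setproctitle

def run_background_tasker():
    # Rename the forked child so it is easy to tell apart from the Flask parent
    # in ps output inside the container
    setproctitle.setproctitle("ARAX BackgroundTasker")
    # ... main tasker loop goes here ...
```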
ah, intriguing! |
OK, I think the code is looking pretty stable. I'm going to merge to |
very fancy! |
Hopefully this will help figure out what is going on inside the container, especially if a child process gets orphaned and then subsequently appears like a parent process. |
Not sure if any of those commits will stop the issue with Background Tasker dying, but I've made some efforts to try to prevent it from terminating unnecessarily (or to log if it has an issue). |
So, where things stand-- the new code is at the moment only running on |
Please SMS me if these edits cause havoc tonight. |
Hope the new package dependency |
it installs fine for me.
|
OK, so @edeutsch was right, this issue is more frequent than I had thought. Out of four background tasker processes that I left running on
Let's see which service has PID 25194 inside the container. Inside the container, as user
So it's TCP port 5003. Looking at
we see that this is the |
I've copied the logfile
As of 10:04 PM PDT last night, the background tasker was running:
so somewhere in the last 9.5 hours, it died. |
OK, here is the relevant excerpt of the
Note that the validator runs and then the background tasker dies right away:
|
that suggests that the problem should be repeatable. Launch an instance and point a browser to: |
On it |
Restarting OK, in the restarted
|
OK, the problem has reoccurred. The response cache validator seemed to kill the background tasker with immediate effect:
|
I'm going to turn off multiprocessing in |
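A hypothetical sketch of what turning off multiprocessing in a validator can look like: run the same work serially behind a flag. The names below are illustrative only, not the actual RTX code:

```python
import multiprocessing

USE_MULTIPROCESSING = False  # disabled while debugging the BackgroundTasker deaths

def validate_all(response_ids, validate_one):
    if USE_MULTIPROCESSING:
        with multiprocessing.Pool(processes=4) as pool:
            return pool.map(validate_one, response_ids)
    # Serial fallback: same results, no extra child processes
    return [validate_one(response_id) for response_id in response_ids]
```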
And.... everything is still fine with the background tasker:
|
This might help: or at least be amusing to read |
Fixed. Merged to master. Deployed to |
So, here is my explanation of what happened. When we switched to having the Flask application fork and run the background tasker in the child process, we needed a mechanism to ensure that the background tasker doesn't persist as an orphan process if the Flask application is shut down via SIGTERM (which is how the System V init script will attempt to stop it, if you do ).

Of course, there appears to be a cleaner solution (from the pythonspeed article Eric kindly linked above):
but this approach might be a bit slower, so I opted for the presumably faster (but surely less elegant) solution described above. Anyhow, it's fixed. |
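For reference, one common way to get the behavior described above is for the forked child to notice that it has been reparented (which on Linux means its original parent exited) and shut itself down. This is a sketch only; neither the actual RTX mechanism nor the quoted pythonspeed alternative is shown in this thread:

```python
import os
import time

def run_background_tasker():
    original_parent_pid = os.getppid()
    while True:
        if os.getppid() != original_parent_pid:
            # The Flask parent is gone; exit instead of lingering as an orphan
            break
        # ... hourly KP info cache refresh and other housekeeping would go here ...
        time.sleep(60)
```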
Since it has been merged to master, I have deleted the |
I have rolled out |
great, thank you for sleuthing this out and fixing it! |
this was first reported in #2106 (here), when Expand didn't pick up on a KP's updated meta KG as quickly as we expected (eventually the backup mechanism that refreshes the cache if it hasn't been touched for more than 24 hours kicked in).
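That backup mechanism amounts to a staleness check along these lines (the cache path and rebuild hook here are hypothetical, not the actual Expand code):

```python
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # refresh if the cache hasn't been touched for more than 24 hours

def refresh_cache_if_stale(cache_path, rebuild_cache):
    if not os.path.exists(cache_path) or time.time() - os.path.getmtime(cache_path) > MAX_AGE_SECONDS:
        rebuild_cache()
```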
I think to resolve that we stopped and restarted /beta? (@saramsey and @edeutsch were involved in that discussion over in #2106.) I think we were hoping that maybe that would be a one-off problem where the BackgroundTasker randomly died.
but @isbluis reported yesterday that automat-ctd's recently-fixed meta KG (in dev) had not been picked up by Expand (after more than an hour had passed, I believe).
so it seems maybe something is up with the BackgroundTasker?