Investigate data100 503 errors #2677
A bunch of different issues were reported! One of them I haven't seen before, and several users reported it. This is perhaps also a race condition between the multiple user schedulers we had, and could be fixed by #1817. Let's wait and watch to see if there are more reports of this.
Message from the teaching team - sorry to tag you all (I know we decided on the Piazza route), but yet again, it's the due date for a major assignment and DataHub is down with a Service Unavailable screen (503?). Any ideas? Links to Piazza 1 and Piazza 2 where students raised the issue. The request is to provide more resources to avoid such scenarios in the future!
@yuvipanda while I cannot prove this conclusively, there appears to be a V shape in the currently-running-users graph here: with a correlation to this particular PR: Can you think of any reason why that PR might result in lots of pods being knocked off? Is it possible that any change to data100 causes a mass drop-off, which could possibly explain the odd behavior above as well? I also see this in the hub log for data100-prod:
Along with a bunch of messages like this:
I also saw this:
The service showed 503 errors from 4:30 to 5:15 PM! Based on the GSI's report, it was back to normal after that. @yuvipanda could this issue be due to the last merge that happened? If not, what would be the potential way forward to avoid such scenarios? Should we increase the Data 100 hub's resource requirements?
This is the node on which the data100 hub and proxy pods were located. The '100%' is 'full saturation of all CPUs', and you can see that we were almost there. I don't know if this is a symptom or a cause, but we can give data100 pods more CPU guarantees to eliminate that as a cause - #2772. This is actual CPU usage. It was never saturated (hit 100% of 1 core, which is the max a Python process can do), but perhaps that was because CPU wasn't available. There are also a lot of user entries on the hub - about 8k. Sometimes the CPU usage of the hub process scales linearly with the total number of users (rather than active users). Upstream is working on fixing it, but in the meantime I've just deleted users who aren't currently active. This should help reduce CPU usage as well.
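For context on the total-vs-active-users distinction above, here is a minimal sketch of how one might compare the two via the JupyterHub REST API. The hub URL, the 24-hour window, and reading an admin token from `JUPYTERHUB_API_TOKEN` are illustrative assumptions, not necessarily how this deployment checks it.

```python
# Minimal sketch: compare total users in the hub database vs. recently
# active users via the JupyterHub REST API. Hub address, token source,
# and the 24h window are illustrative assumptions.
import os
from datetime import datetime, timedelta, timezone

import requests

HUB_API = "http://127.0.0.1:8081/hub/api"  # assumed in-cluster hub address
headers = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}

resp = requests.get(f"{HUB_API}/users", headers=headers)
resp.raise_for_status()
users = resp.json()

cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
active = [
    u for u in users
    if u.get("last_activity")
    and datetime.fromisoformat(u["last_activity"].replace("Z", "+00:00")) > cutoff
]

# Hub CPU cost can track the total user count rather than the active one,
# which is why pruning stale entries helps.
print(f"users in hub database: {len(users)}")
print(f"active in the last 24h: {len(active)}")
```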
@felder I looked at an alternative source for the count of running users (from the hub), and you can see there's a gap in the data: When #2768 was merged, it restarted the data100-prod hub pod, and it took a long time to come back up! This isn't common - restarts usually take much less time. I don't see any other gaps in the last 7 days in this data: So I looked at how long hub startup has taken: And it took many, many minutes. I think that is the cause of the 503s in this case. jupyterhub/jupyterhub#2928 is probably super related. I think the script I ran to remove unused users from the database (which should have no practical effect) should help with this.
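As an illustration of measuring that slow startup, here is a rough sketch that polls the hub's `/hub/health` endpoint until it responds; the address is a placeholder assumption, and the actual numbers above came from monitoring dashboards rather than this kind of probe.

```python
# Rough sketch: time how long the hub takes to become healthy after a
# restart by polling /hub/health. The address is an assumed placeholder.
import time

import requests

HUB_HEALTH = "http://127.0.0.1:8081/hub/health"  # assumed hub address

start = time.monotonic()
while True:
    try:
        if requests.get(HUB_HEALTH, timeout=5).status_code == 200:
            break
    except requests.RequestException:
        pass  # hub not accepting connections yet
    time.sleep(5)

print(f"hub became healthy after {time.monotonic() - start:.0f} seconds")
```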
jupyterhub/jupyterhub-idle-culler#30 might help too.
I've run the script for the R hub (which cleaned up some 3235 users), DataHub (some 14k users), and the data100 hub (some 7k users).
One remaining mystery is why some user pods were missing / deleted during this time - I still need to investigate that. But I think that's a symptom rather than a cause.
#2774 is the script for deleting unused users. |
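For readers who can't open #2774, here is a simplified sketch of the general approach such a cleanup might take (not the actual script): delete hub database entries for users with no running server and no recent activity, via the JupyterHub REST API. The hub URL, token source, and 30-day threshold are assumptions for illustration.

```python
# Simplified sketch (not the actual #2774 script): remove hub database
# entries for users with no running server and no recent activity.
import os
from datetime import datetime, timedelta, timezone

import requests

HUB_API = "http://127.0.0.1:8081/hub/api"  # assumed hub address
HEADERS = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}
CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)  # assumed threshold


def is_stale(user):
    """True if the user has no running server and no activity since CUTOFF."""
    if user.get("server") or user.get("servers"):
        return False
    last = user.get("last_activity")
    if last is None:
        return True
    return datetime.fromisoformat(last.replace("Z", "+00:00")) < CUTOFF


users = requests.get(f"{HUB_API}/users", headers=HEADERS).json()
for user in users:
    if is_stale(user):
        # DELETE /users/{name} removes the user from the hub database only;
        # it does not touch their home directory or other storage.
        requests.delete(f"{HUB_API}/users/{user['name']}", headers=HEADERS)
        print(f"deleted {user['name']}")
```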
@yuvipanda would it make sense to run #2774, or perhaps just clear the db entirely, during semester breaks?
@felder yes! We should make a proper structured list |
From Andrew Lenz on Slack:
There were many different issues reported, some of which showed 503s while others exhibited themselves in other ways. We should track these and fix them however we can.
I don't think most of them are due to a lack of resources though, except for some kernel deaths.