Investigate data100 503 errors #2677

Closed · yuvipanda opened this issue Sep 2, 2021 · 13 comments

@yuvipanda (Contributor) commented Sep 2, 2021

From Andrew Lenz on Slack

Data 100 is wondering if we can get more resources/computational power for DataHub. We constantly get overloaded for resources and get 503s on due dates, which is quite unfortunate.

Many different issues were reported; some showed 503s, while others manifested in other ways. We should track these and fix them however we can.

I don't think most of them are due to lack of resources though, except for some kernel deaths.

balajialg added the bug label on Sep 2, 2021
@yuvipanda (Contributor, author) commented:

[screenshot]

Looking at server start times, I don't see anything too bad. I also don't see a lot of pending or waiting users:

[screenshot]

[screenshot]

@yuvipanda (Contributor, author) commented:

A bunch of different issues were reported!

[screenshot]

is one of them. I haven't seen this one before, and several users reported it. This is perhaps also a race condition between the multiple user schedulers we had, which could be fixed by #1817. Let's wait and watch to see if there are more reports of this.
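
For reference, a minimal sketch of what ruling out the scheduler race could look like, assuming these hubs use the standard zero-to-jupyterhub chart keys (an illustration only, not necessarily what #1817 does):

# Hedged sketch: assumes hub values are nested under the chart's usual "jupyterhub" key.
jupyterhub:
  scheduling:
    userScheduler:
      enabled: true
      # The chart defaults to 2 replicas; a single replica removes the
      # possibility of two schedulers racing to place the same user pod.
      replicas: 1

Running one replica trades a little scheduler availability for ruling out the race entirely.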

@balajialg (Contributor) commented Sep 16, 2021

Message from the teaching team: "Sorry to tag you all (I know we decided on the Piazza route), but yet again, it's the due date for a major assignment and DataHub is down with a Service Unavailable screen (503?). Any ideas?"

Links to Piazza 1 and Piazza 2, where students raised the issue.

The request is to provide more resources to avoid such scenarios in the future!

@felder (Contributor) commented Sep 17, 2021

@yuvipanda while I cannot prove this conclusively, there appears to be a V-shaped dip in the currently running users graph here:
https://grafana.datahub.berkeley.edu/goto/g_YrOXSnz

With a correlation to this particular PR:
#2768

Can you think of any reason why that PR might result in lots of pods being knocked off? Is it possible that any change to data100 causes a mass drop-off, which could possibly explain the odd behavior above as well?

I also see this in the hub log for data100-prod:

[I 2021-09-16 23:28:00.379 JupyterHub app:2459] Running JupyterHub version 1.4.2
[I 2021-09-16 23:28:00.379 JupyterHub app:2489] Using Authenticator: canvasauthenticator.CanvasOAuthenticator
[I 2021-09-16 23:28:00.379 JupyterHub app:2489] Using Spawner: builtins.CustomAttrSpawner
[I 2021-09-16 23:28:00.380 JupyterHub app:2489] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.4.2
[I 2021-09-16 23:28:00.564 JupyterHub app:1838] Not using allowed_users. Any authenticated user will be allowed.
[I 2021-09-16 23:28:01.955 JupyterHub reflector:275] watching for pods with label selector='component=singleuser-server' in namespace data100-prod
[I 2021-09-16 23:28:02.655 JupyterHub reflector:275] watching for events with field selector='involvedObject.kind=Pod' in namespace data100-prod
[W 2021-09-16 23:28:06.954 JupyterHub app:2194] REMOVED appears to have stopped while the Hub was down
[W 2021-09-16 23:28:06.956 JupyterHub app:2194] REMOVED appears to have stopped while the Hub was down
[W 2021-09-16 23:28:06.958 JupyterHub app:2194] REMOVED appears to have stopped while the Hub was down
[W 2021-09-16 23:28:06.962 JupyterHub app:2194] REMOVED appears to have stopped while the Hub was down
[W 2021-09-16 23:28:06.969 JupyterHub app:2194] REMOVED appears to have stopped while the Hub was down
[W 2021-09-16 23:28:06.970 JupyterHub app:2194] REMOVED appears to have stopped while the Hub was down
[W 2021-09-16 23:28:06.974 JupyterHub app:2194] REMOVED appears to have stopped while the Hub was down

Along with a bunch of messages like this:

[W 2021-09-16 23:29:07.424 JupyterHub user:767] REMOVED server never showed up at http://10.20.14.206:8888/user/REMOVED/ after 30 seconds. Giving up
[E 2021-09-16 23:29:07.517 JupyterHub app:2178] REMOVED does not appear to be running at http://10.20.0.40:8888/user/REMOVED/, shutting it down.
[I 2021-09-16 23:29:07.518 JupyterHub spawner:2620] Deleting pod data100-prod/REMOVED

I also saw this:

[I 2021-09-16 23:29:07.659 JupyterHub service:121] Spawning python3 -m jupyterhub_idle_culler --url=http://localhost:8081/hub/api --timeout=3600 --cull-every=600 --concurrency=10
[E 210916 23:29:27 ioloop:761] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x7f1c8794f4f0>>, <Future finished exception=HTTP 599: Operation timed out after 20000 milliseconds with 0 bytes received>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 741, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.8/dist-packages/tornado/ioloop.py", line 765, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 769, in run
        yielded = self.gen.throw(*exc_info)  # type: ignore
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub_idle_culler/__init__.py", line 120, in cull_idle
        resp = yield fetch(req)
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 775, in run
        yielded = self.gen.send(value)
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub_idle_culler/__init__.py", line 113, in fetch
        return (yield client.fetch(req))
      File "/usr/local/lib/python3.8/dist-packages/tornado/gen.py", line 762, in run
        value = future.result()
    tornado.curl_httpclient.CurlError: HTTP 599: Operation timed out after 20000 milliseconds with 0 bytes received
[I 2021-09-16 23:30:08.202 JupyterHub log:189] 200 GET /hub/api/users ([email protected]) 60289.54ms
[E 2021-09-16 23:30:08.512 JupyterHub app:2178] REMOVED does not appear to be running at http://10.20.17.12:8888/user/REMOVED/, shutting it down.
[E 2021-09-16 23:30:08.628 JupyterHub app:2178] REMOVED does not appear to be running at http://10.20.0.4:8888/user/REMOVED/, shutting it down.
[E 2021-09-16 23:30:08.701 JupyterHub app:2178] REMOVED does not appear to be running at http://10.20.14.64:8888/user/REMOVED/, shutting it down.
[E 2021-09-16 23:30:08.787 JupyterHub app:2178] REMOVED does not appear to be running at http://10.20.45.236:8888/user/REMOVED/, shutting it down.
[E 2021-09-16 23:30:08.852 JupyterHub app:2178] REMOVED does not appear to be running at http://10.20.14.65:8888/user/REMOVED/, shutting it down.
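
For context, the 60-second GET /hub/api/users response in the log above appears to be the culler's own request, and it exceeds the culler client's apparent 20-second timeout, producing the HTTP 599 traceback. A hedged diagnostic sketch (not part of this issue) for reproducing the slow call from inside the hub pod follows; the use of requests and the token source are assumptions:

# Hedged diagnostic sketch: time the same GET /hub/api/users call the idle
# culler makes, to see whether the 60s+ responses reproduce.
# Assumptions: `requests` is installed and JUPYTERHUB_API_TOKEN holds a token
# permitted to list users.
import os
import time

import requests

API_URL = "http://localhost:8081/hub/api"  # hub API, as used by the culler service above
TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]

start = time.monotonic()
resp = requests.get(
    f"{API_URL}/users",
    headers={"Authorization": f"token {TOKEN}"},
    timeout=120,  # well above the culler's apparent 20s client timeout
)
resp.raise_for_status()
print(f"{len(resp.json())} users returned in {time.monotonic() - start:.1f}s")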

@balajialg (Contributor) commented Sep 17, 2021

The service showed 503 errors from 4:30 to 5:15 PM! Based on the GSI's report, it was back to normal after that.

@yuvipanda Could this issue be due to the last merge that happened?

If not, what would be the potential way forward to avoid such scenarios? Should we increase the Data 100 hub's resource requirements?

@yuvipanda (Contributor, author) commented:

[screenshot]

This is the node on which the data100 hub and proxy pods were located. The '100%' means full saturation of all CPUs, and you can see that we were almost there. I don't know if this is a symptom or a cause, but we can give the data100 pods more CPU guarantees to eliminate that as a cause (#2772).
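
For illustration, a hedged sketch of what such a guarantee could look like, assuming the standard zero-to-jupyterhub chart layout; the actual change is in #2772 and the numbers here are placeholders:

# Hedged sketch, not the actual diff in #2772. The nesting assumes the
# standard zero-to-jupyterhub chart; the request values are placeholders.
jupyterhub:
  hub:
    resources:
      requests:
        cpu: 1           # guarantee a full core to the hub process
        memory: 2Gi
  proxy:
    chp:
      resources:
        requests:
          cpu: "500m"
          memory: 512Mi

Requests (unlike limits) reserve the CPU on the node, so the hub and proxy no longer compete with whatever else is scheduled there.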

image

This is actual CPU usage. It never saturated (hit 100% of one core, which is the most a Python process can use), but perhaps that was because CPU wasn't available.

There are also a lot of user entries on the hub, about 8k. Sometimes the CPU usage of the hub process scales linearly with the total number of users (rather than active users). Upstream is working on fixing this, but in the meantime I've just deleted users who aren't currently active. This should help reduce CPU usage as well.
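
A minimal sketch of the general approach to that cleanup, assuming the JupyterHub REST API is used; this is not the actual script (it is linked later as #2774), and the token source and 30-day cutoff are placeholders:

# Hedged sketch: delete hub users with no running server and no recent activity.
# Not the actual script (#2774); the cutoff and token source are assumptions.
import os
from datetime import datetime, timedelta, timezone

import requests

API_URL = "http://localhost:8081/hub/api"
TOKEN = os.environ["JUPYTERHUB_API_TOKEN"]                 # assumed admin-scoped token
CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)   # placeholder inactivity window
HEADERS = {"Authorization": f"token {TOKEN}"}

users = requests.get(f"{API_URL}/users", headers=HEADERS, timeout=120).json()

for user in users:
    last = user.get("last_activity")
    stale = last is None or datetime.fromisoformat(last.replace("Z", "+00:00")) < CUTOFF
    if user.get("server") is None and stale:
        # Deleting a user only removes their row from the hub database;
        # it does not touch their home directory.
        requests.delete(f"{API_URL}/users/{user['name']}", headers=HEADERS, timeout=60)
        print("deleted", user["name"])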

@yuvipanda (Contributor, author) commented:

@felder I looked at an alternative source for the count of running users (from the hub), and you can see there's a gap in the data:

[screenshot]

When #2768 was merged, it restarted the data100-prod hub pod, and it took a long time to come back up!

[screenshot]

This isn't common; restarts usually take much less time. I don't see any other gaps in the last 7 days in this data:

[screenshot]

So I looked at how long hub startup has taken:

[screenshot]

And it took many, many minutes. I think that is the cause of the 503s in this case.

jupyterhub/jupyterhub#2928 is probably super related.

I think the script I ran to remove unused users from the database (which should have no practical effects) should help with this.

@yuvipanda (Contributor, author) commented:

jupyterhub/jupyterhub-idle-culler#30 might help too.

@yuvipanda (Contributor, author) commented:

I've run the script for the R hub (which cleaned up some 3,235 users), DataHub (some 14k users), and the data100 hub (some 7k users).

@yuvipanda (Contributor, author) commented:

One remaining mystery is why some user pods went missing or were deleted during this time; I still need to investigate that. But I think that's a symptom rather than a cause.

@yuvipanda (Contributor, author) commented:

#2774 is the script for deleting unused users.

@felder (Contributor) commented Sep 17, 2021

@yuvipanda would it make sense to run #2774, or perhaps just clear the db entirely, during semester breaks?

@yuvipanda (Contributor, author) commented:

@felder yes! We should make a proper, structured list.
