Investigate data100 503 errors #2677
A bunch of different issues were reported! One of them I haven't seen before, and several users reported it. This is perhaps also a race condition between the multiple user schedulers we had, and could be fixed by #1817. Let's wait and watch to see if there are more reports of this.
Message from the teaching team - sorry to tag you all (I know we decided on the Piazza route), but yet again, it's the due date for a major assignment and DataHub is down with a Service Unavailable screen (503?). Any ideas? Links to Piazza 1 and Piazza 2 where students raised the issue. The request is to provide more resources to avoid such scenarios in the future!
@yuvipanda while I cannot prove this conclusively, there appears to be a V shape in the currently-running-users graph here: with a correlation to this particular PR: Can you think of any reason why that PR might result in lots of pods being knocked off? Is it possible that any change to data100 causes a mass drop-off, which could possibly explain the odd behavior above as well? I also see this in the hub log for data100-prod:
Along with a bunch of messages like this:
I also saw this:
The service showed 503 errors from 4:30 to 5:15 PM! Based on the GSI's report, it was back to normal after that. @yuvipanda could this issue be due to the last merge that happened? If not, what would be the potential way forward to avoid such scenarios? Should we increase the Data 100 hub's resource requirements?
This is the node on which the data100 hub and proxy pods were located. The '100%' is 'full saturation of all CPUs', and you can see that we were almost there. I don't know if this is a symptom or a cause, but we can give data100 pods more CPU guarantees to eliminate that as a cause - #2772. This is actual CPU usage. It was never saturated (hit 100% of 1 core, which is the max a Python process can do), but perhaps that was because CPU wasn't available. There are also a lot of user entries on the hub - about 8k. Sometimes the CPU usage of the hub process scales linearly with the total number of users (rather than active users). Upstream is working on fixing it, but in the meantime I've just deleted users who aren't currently active. This should help reduce CPU usage as well.
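For context on the total-vs-active-users distinction above, here is a minimal sketch of how one might compare the two via the JupyterHub REST API. The hub URL, the 24-hour window, and reading an admin token from `JUPYTERHUB_API_TOKEN` are illustrative assumptions, not necessarily how this deployment checks it.

```python
# Minimal sketch: compare total users in the hub database vs. recently
# active users via the JupyterHub REST API. Hub address, token source,
# and the 24h window are illustrative assumptions.
import os
from datetime import datetime, timedelta, timezone

import requests

HUB_API = "http://127.0.0.1:8081/hub/api"  # assumed in-cluster hub address
headers = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}

resp = requests.get(f"{HUB_API}/users", headers=headers)
resp.raise_for_status()
users = resp.json()

cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
active = [
    u for u in users
    if u.get("last_activity")
    and datetime.fromisoformat(u["last_activity"].replace("Z", "+00:00")) > cutoff
]

# Hub CPU cost can track the total user count rather than the active one,
# which is why pruning stale entries helps.
print(f"users in hub database: {len(users)}")
print(f"active in the last 24h: {len(active)}")
```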
@felder I looked at an alternative source for the count of running users (from the hub), and you can see there's a gap in the data: When #2768 was merged, it restarted the data100-prod hub pod, and it took a long time to come back up! This isn't common - restarts usually take much less time. I don't see any other gaps in the last 7 days in this data: So I looked at how long hub startup has taken: And it took many, many minutes. I think that is the cause of the 503s in this case. jupyterhub/jupyterhub#2928 is probably super related. I think the script I ran to remove unused users from the database (which should have no practical effect) should help with this.
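As an illustration of measuring that slow startup, here is a rough sketch that polls the hub's `/hub/health` endpoint until it responds; the address is a placeholder assumption, and the actual numbers above came from monitoring dashboards rather than this kind of probe.

```python
# Rough sketch: time how long the hub takes to become healthy after a
# restart by polling /hub/health. The address is an assumed placeholder.
import time

import requests

HUB_HEALTH = "http://127.0.0.1:8081/hub/health"  # assumed hub address

start = time.monotonic()
while True:
    try:
        if requests.get(HUB_HEALTH, timeout=5).status_code == 200:
            break
    except requests.RequestException:
        pass  # hub not accepting connections yet
    time.sleep(5)

print(f"hub became healthy after {time.monotonic() - start:.0f} seconds")
```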
jupyterhub/jupyterhub-idle-culler#30 might help too.
I've run the script for the R hub (which cleaned up some 3235 users), DataHub (some 14k users), and the data100 hub (some 7k users).
One remaining mystery is why some user pods were missing / deleted during this time - I still need to investigate that. But I think that's a symptom rather than a cause.
#2774 is the script for deleting unused users. |
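For readers who can't open #2774, here is a simplified sketch of the general approach such a cleanup might take (not the actual script): delete hub database entries for users with no running server and no recent activity, via the JupyterHub REST API. The hub URL, token source, and 30-day threshold are assumptions for illustration.

```python
# Simplified sketch (not the actual #2774 script): remove hub database
# entries for users with no running server and no recent activity.
import os
from datetime import datetime, timedelta, timezone

import requests

HUB_API = "http://127.0.0.1:8081/hub/api"  # assumed hub address
HEADERS = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}
CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)  # assumed threshold


def is_stale(user):
    """True if the user has no running server and no activity since CUTOFF."""
    if user.get("server") or user.get("servers"):
        return False
    last = user.get("last_activity")
    if last is None:
        return True
    return datetime.fromisoformat(last.replace("Z", "+00:00")) < CUTOFF


users = requests.get(f"{HUB_API}/users", headers=HEADERS).json()
for user in users:
    if is_stale(user):
        # DELETE /users/{name} removes the user from the hub database only;
        # it does not touch their home directory or other storage.
        requests.delete(f"{HUB_API}/users/{user['name']}", headers=HEADERS)
        print(f"deleted {user['name']}")
```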
@yuvipanda would it make sense to run #2774, or perhaps just clear the db entirely, during semester breaks?
@felder yes! We should make a proper structured list |
From Andrew Lenz on Slack:
There were many different issues reported, some of which showed 503s while others exhibited themselves in other ways. We should track these and fix them however we can.
I don't think most of them are due to a lack of resources though, except for some kernel deaths.