Status endpoint should confirm database connection is working #1214

Open
robertknight opened this issue Oct 13, 2023 · 3 comments

@robertknight
Member

We had an outage of the YouTube integration today because an infrastructure change stopped an instance of Via from connecting to its DB. We weren't alerted to the issue immediately because Via's status endpoint doesn't check that it can connect to the DB, unlike the corresponding endpoints in h and lms, which do.

@seanh
Contributor

seanh commented Oct 17, 2023

Did the incident stop only one instance of Via (i.e. one autoscaling VM) from connecting? Or all of them?

If Via couldn't connect to its DB, then requests that do DB queries (such as those related to the YouTube feature) would have been erroring (with 500s, I imagine). Why didn't that trigger an alarm?

Anyway, the status endpoint can certainly do a trivial DB query to test the DB connection; see h's and LMS's status endpoints for examples, and "How do our /_status endpoints work?" for design notes. Note that I don't think we need the "optional checks" (query params) feature for checking a DB connection: that can just be a non-optional check. But it could return `{"status": "down", "down": ["db"]}` if it wants to be informative about why the status is down, although tools will not read this.
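A minimal sketch of the non-optional check described above, in case it helps. This is not Via's actual code: `check_db`, `status_response`, the `{"status": "okay"}` body and the 503 status code are all assumptions made for illustration, and `sqlite3` stands in for the real database so the example is self-contained. Only the `{"status": "down", "down": ["db"]}` shape comes from this thread.

```python
import sqlite3
from typing import Callable


def check_db(run_query: Callable[[], object]) -> bool:
    """Return True if a trivial query succeeds, False if it raises."""
    try:
        run_query()
    except Exception:
        return False
    return True


def status_response(run_query: Callable[[], object]):
    """Build a (status_code, body) pair for a hypothetical /_status view.

    The DB check always runs; it is not gated behind a query param.
    """
    down = []
    if not check_db(run_query):
        down.append("db")

    if down:
        # 503 is an illustrative choice, not something the thread specifies.
        return 503, {"status": "down", "down": down}
    return 200, {"status": "okay"}


if __name__ == "__main__":
    # sqlite3 is only a stand-in for the real DB session.
    conn = sqlite3.connect(":memory:")
    print(status_response(lambda: conn.execute("SELECT 1").fetchone()))
```

In a real view, `run_query` would presumably be something like executing `SELECT 1` on the request's DB session, as the h and LMS endpoints are said to do.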

@robertknight
Member Author

> If Via couldn't connect to its DB and requests that do DB queries (such as those related to the YouTube feature) would have been erroring (with 500s I imagine), why didn't that trigger an alarm?

I don't know the full answer to that, but the way Via was failing was that instead of failing immediately with a 500, it would hang for 30s and then time out. So I suspect this means that the Python app's attempt to connect to the DB was just hanging, rather than being rejected. Or maybe something was attempting a retry?

The infrastructure issue was that the Via instance got removed from the security group that allowed it to access the DB.

@seanh
Contributor

seanh commented Oct 17, 2023

> > If Via couldn't connect to its DB and requests that do DB queries (such as those related to the YouTube feature) would have been erroring (with 500s I imagine), why didn't that trigger an alarm?
>
> I don't know the full answer to that, but the way Via was failing was that instead of failing immediately with a 500, it would hang for 30s and then time out. So I suspect this means that the Python app's attempt to connect to the DB was just hanging, rather than being rejected.

Hmm, that seems like undesirable behaviour when the DB is down. We might want to reduce that timeout. Although ultimately the important thing is that the DB should be reliably up.
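To illustrate what a shorter timeout buys: when a connection attempt hangs with no reply (as it apparently did here once the security group rule was gone), a bounded connect timeout turns the hang into a quick failure. The sketch below uses only the standard library's `socket.create_connection`; `can_reach` and the 2-second default are hypothetical, and this is a plain TCP reachability probe, not the DB driver's own setting. If the driver is libpq-based, its `connect_timeout` connection parameter would be the analogous knob, but whether Via uses it is an assumption.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Try to open a TCP connection, giving up after timeout_s seconds.

    Returns False on refusal, timeout, or any other OS-level error,
    so a hanging connection fails fast instead of blocking for 30s.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # includes timeouts and connection refused
        return False


if __name__ == "__main__":
    # Hypothetical host and port; substitute the real DB address.
    print(can_reach("127.0.0.1", 5432, timeout_s=1.0))
```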

> Or maybe something was attempting a retry?

Hmm, I don't think so.
