ui: race condition during initialization with login enabled #25771
We might be opening up the web endpoint too early in …
This used to be easy to repro, but strangely now I can't, with …
@couchand did a bisect and thinks this was fixed by ee6d448 in #26329, but we don't understand why. We thought that this crash was happening because the web authentication code was running SQL before the SQL system was initialized. It's not evident to me how that PR would have fixed this, but we didn't completely understand the bug in the first place.
This bug was introduced in bedcbed, when @vilterp added … (bedcbed#diff-09856fe9becddf0199651f451409356aR1187). Node initialization winds up setting the version cluster setting, so only accesses after that initialization are legal. When a request is received after the root route is available but before the version cluster setting has been initialized, the process panics.

The bug was fixed by @nvanbenschoten in ee6d448. That change removes the method … (ee6d448#diff-ed00a68165bb66ba6524401660df29aaL419).

It's great that this specific panic is no longer an issue, but it is a bit troubling that it was so easy to get into this situation and then so difficult to debug and identify the root cause. Is there something that can be done to prevent this sort of thing from happening again? @tschottdorf @nvanbenschoten @vilterp
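To make the failure mode concrete, here is a minimal, hypothetical sketch (not CockroachDB's actual code) of a settings container whose `Version()` panics when read before initialization -- the same shape as the crash described above, where a request arrives after routes are registered but before node initialization has set the cluster version:

```go
package main

import (
	"fmt"
	"sync"
)

// Settings is an illustrative stand-in for a cluster settings container.
type Settings struct {
	mu          sync.RWMutex
	initialized bool
	version     string
}

// Version panics when called before Initialize, mirroring the
// "Version() was called before having been initialized" crash.
func (s *Settings) Version() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if !s.initialized {
		panic("Version() was called before having been initialized")
	}
	return s.version
}

// Initialize is what node startup would eventually call.
func (s *Settings) Initialize(v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.version, s.initialized = v, true
}

func main() {
	s := &Settings{}
	// Simulates a request handler touching the settings after the HTTP routes
	// are registered but before node initialization has run Initialize.
	defer func() { fmt.Println("recovered:", recover()) }()
	_ = s.Version() // panics
}
```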
I just ran into this locally on a 1-node cluster on top of a recent master (61325cd):
I had a couple of logged-in tabs of the admin UI already open from before restarting the process. I also had a tab trying to load the …
Attempting to repro and bisect. May have to do with #29230.
Nope, much older than that. Bisect seems to show that the first commit with this bug is bedcbed.
That's unfortunate. Perhaps we should just live with a round trip for an API call as a short-term fix?
Well, that's exactly what I mean -- how can we make explicit which components are safe to touch and when? I don't think the answer can just be … Currently, in a single 500-line function, we create a bunch of shared mutable state (…
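One hedged sketch of what "explicit" could look like, using invented names (`SQLExecutor`, `AuthServer`, `fakeExecutor`) rather than the real types in `pkg/server`: construct each component only from dependencies that already exist, and register its routes only after construction, so a half-initialized handler can never be reachable:

```go
package main

import (
	"fmt"
	"net/http"
)

// SQLExecutor and ClusterSettings stand in for the internal executor and the
// cluster settings; all names here are illustrative, not CockroachDB's.
type SQLExecutor interface {
	Query(q string) error
}

type ClusterSettings struct{ Version string }

// AuthServer can only be constructed from already-initialized dependencies,
// so there is no window in which it is registered but half-initialized.
type AuthServer struct {
	exec     SQLExecutor
	settings *ClusterSettings
}

func NewAuthServer(exec SQLExecutor, settings *ClusterSettings) *AuthServer {
	return &AuthServer{exec: exec, settings: settings}
}

func (a *AuthServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Safe by construction: settings were initialized before this handler existed.
	fmt.Fprintf(w, "version %s\n", a.settings.Version)
	_ = a.exec.Query("SELECT 1")
}

type fakeExecutor struct{}

func (fakeExecutor) Query(string) error { return nil }

func main() {
	// Initialization happens first; routes are registered only afterwards.
	settings := &ClusterSettings{Version: "2.1"}
	auth := NewAuthServer(fakeExecutor{}, settings)

	mux := http.NewServeMux()
	mux.Handle("/", auth)
	_ = http.ListenAndServe("127.0.0.1:8080", mux)
}
```

The trade-off in this shape is that route registration has to wait for initialization, which delays the point at which the server answers any HTTP request at all.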
Oh, I somehow thought you meant something about the …
Occurred again today on …
@vilterp what's the status here? Have you made any progress on this?
Not yet, besides identifying the commit that seems to have introduced it. Hoping to look at it today.
Friendly ping. I see this in crashes every now and then, and presumably this was already released in betas? It's relatively urgent to fix.
Hey Tobi, @couchand and I were debugging this yesterday and will be again today. We have a solid repro. The basic problem is that the authentication mux is running SQL with the internal executor before the Version cluster setting is initialized. The initialization of all the HTTP stuff is pretty spread out in …
Sounds good, thanks! FWIW, it'd be fine by me if you just added a way to check if the thing is initialized before you use it (or even just wait for it). That's a wart, but as long as we're calling it that, I think it's defensible at this point in the cycle. Plus, the version thing is basically impossible to use correctly. Maybe all callers should implicitly wait until there's a sane way to write code that doesn't accidentally trigger this.
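A rough sketch of the "just wait for it" option, assuming an invented `versionHolder` type rather than the settings package's real API: callers block on a channel that is closed once initialization completes, instead of panicking on an early read:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// versionHolder is a hypothetical wrapper: callers block on WaitForVersion
// until Initialize has run, instead of panicking on an early read.
type versionHolder struct {
	mu      sync.Mutex
	ready   chan struct{}
	version string
}

func newVersionHolder() *versionHolder {
	return &versionHolder{ready: make(chan struct{})}
}

// Initialize records the version and unblocks all waiters exactly once.
func (v *versionHolder) Initialize(version string) {
	v.mu.Lock()
	v.version = version
	v.mu.Unlock()
	close(v.ready)
}

// WaitForVersion blocks until Initialize has been called.
func (v *versionHolder) WaitForVersion() string {
	<-v.ready
	v.mu.Lock()
	defer v.mu.Unlock()
	return v.version
}

func main() {
	v := newVersionHolder()
	go func() {
		time.Sleep(100 * time.Millisecond) // simulate slow node initialization
		v.Initialize("2.1")
	}()
	// An early request handler would block here rather than crash the process.
	fmt.Println("cluster version:", v.WaitForVersion())
}
```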
Hoping not to have to do something like that. Btw, noticed this method: cockroach/pkg/settings/cluster/settings.go, lines 141 to 145 in fcc4eb4 …
This fixes cockroachdb#25771, in which the auth mux was using the internal executor to run SQL before the cluster version, which the internal executor is dependent upon, was initialized.

A preferable solution would be to not register any handlers using the auth mux until the cluster setting is initialized, but it's not immediately apparent what reordering of things in Server.Start would achieve this without breaking anything else.

Release note: None
OK, minimal fix in #30640. It would be better to reorder things such that that check is not necessary, but I'm a little afraid of rejiggering a bunch of things in Server.Start at this point in the cycle. Open to better fixes if we can think of them.
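For illustration, a hedged approximation of the shape such a minimal check could take (the names and wiring here are invented, not the actual diff in #30640): the auth mux answers 503 until initialization has completed, so early requests fail retryably instead of crashing the process:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// authenticationMux guards its inner handler: until initialized is set it
// answers 503 instead of touching the executor or the cluster settings.
type authenticationMux struct {
	initialized atomic.Bool
	inner       http.Handler
}

func (am *authenticationMux) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if !am.initialized.Load() {
		// Still starting up: fail the request cheaply and retryably rather
		// than running SQL against an uninitialized cluster version.
		http.Error(w, "node is still initializing", http.StatusServiceUnavailable)
		return
	}
	am.inner.ServeHTTP(w, r)
}

func main() {
	am := &authenticationMux{inner: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})}
	// Flip the flag once node initialization (version setting, executor) is done.
	go func() { /* ... after startup finishes init ... */ am.initialized.Store(true) }()
	_ = http.ListenAndServe("127.0.0.1:8080", am)
}
```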
Discovered, while working on a unit test to repro this, that the crash only happens when the request contains a cookie; otherwise there is no session to look up and the auth mux never runs any SQL. So the user has to log in, then restart a node (or start a new node at the same address) and send it a request containing that login cookie to trigger this. That doesn't really change the severity, since it's quite likely that someone will be logged in and looking at the admin UI while restarting or adding nodes.
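A small sketch of that repro from the client side, with a placeholder endpoint path and cookie name/value: send a request carrying a login cookie, as a previously logged-in admin UI tab would, to a node that has not finished initializing:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Endpoint and cookie are illustrative; adjust for your cluster.
	req, err := http.NewRequest("GET", "http://localhost:8080/_admin/v1/health", nil)
	if err != nil {
		panic(err)
	}
	// Without a cookie there is no session to look up and the auth mux never
	// runs SQL; with one, the lookup races against cluster-version init.
	req.AddCookie(&http.Cookie{Name: "session", Value: "stale-session-from-before-restart"})

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```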
Teach the cluster-init test to reproduce the crash during bootup described in issue cockroachdb#25771. The trick is to send requests with a session cookie, as the admin UI would, while the cluster is waiting for init.

Note that the crash was caused by the authenticationMux, which is used to protect secure clusters, but this test uses an insecure cluster. This relies on a wart: an authenticationMux is installed for UI assets whether or not the cluster is secure. This could change in the future. A better test would use a secure cluster; however, roachprod/roachtest do not easily support that at this time, nor is it easy to write a unit test that exercises the crash more robustly.

Release note: None
30746: server/ui: fix crash caused by race in server initialization r=vilterp a=vilterp

Fixes #25771. See commits for details.

Co-authored-by: Pete Vilter <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
When login is enabled, there seems to be a race condition during initialization, manifesting as the error "settings/cluster/settings.go:164 Version() was called before having been initialized". It seems to occur when an existing admin UI browser tab is open during node startup, issuing requests such as health checks. It was previously theorized that #24945 fixed this issue, but it looks like it has come up since.
Stack trace for my repro just now: