tls: simplify implementation and fix one class of crashing bug #12833
Conversation
At the expense of some startup complexity, this removes the bookkeeper deletion model for slots and blocks slot destruction until callbacks have been flushed on the workers. This removes an entire class of bugs in which a user captures something in a lambda that gets destroyed before the TLS operations run on the workers. Signed-off-by: Matt Klein <[email protected]>
@jmarantz @htuch @ggreenway @stevenzzzz @lambdai this is seriously lacking tests, but it fixes #12364 and I think it would prevent a bunch of other issues we have hit and done one-off fixes for (some of those fixes can likely be reverted). Anyway, before I invest a huge amount of time I wanted to get some initial feedback here. Let me know what you think about this direction. I tried to add a bunch of comments in the code, but let me know if I need to explain more. After the community meeting convo it should be pretty clear what is going on.
A couple of concerns:
- Are we creating different, but still hard-to-diagnose, issues by possibly not running the callback at all (if the slot was destroyed)?
- What's the perf impact? An additional atomic operation for each callback? If it's on all workers, it will probably be contended (in hardware). Plus a wrapper function around the passed-in function.
- What's the worst-case wait time? (But it's only the main thread that waits, right?)
The default implementation does run the callback; it's not run in the immediate-deletion case during startup. IMO it's fine to not run the callback if the slot has been deleted, so I think this one is OK.
Yes, there are some additional atomic increments for each callback. There is no blocking on worker updates, but there could be contention around the atomics. With that said, I think contention on stats would be vastly worse, so I don't view this as a large issue.
Worst-case wait time on destruction is the event loop delay across all workers, which could be hundreds of milliseconds. This is certainly the part that is not great, and I'm happy to abandon this approach if we don't want this. I think blocking is the only way to fix the underlying problem, unless we take a different path and do something with code analysis like this: https://github.com/mesos/clang-tools-extra/blob/mesos_90/clang-tidy/mesos/ThisCaptureCheck.cpp. The code analysis approach wouldn't be perfect, but it might be good enough.
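To make the trade-off concrete, here is a minimal, self-contained sketch of the blocking idea being discussed: post a callback to every worker and have the main thread wait until each worker has drained it. The `WorkQueue` type, the `runOnAllWorkersAndWait()` helper, and the plain `std::thread` workers are illustrative stand-ins, not Envoy's Dispatcher/ThreadLocal API.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy stand-in for a worker's event loop: callbacks are posted from other
// threads and drained in order on the worker thread.
class WorkQueue {
public:
  void post(std::function<void()> cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push(std::move(cb));
    cv_.notify_one();
  }

  void run() {
    while (true) {
      std::function<void()> cb;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        cb = std::move(queue_.front());
        queue_.pop();
      }
      if (!cb) {
        return; // an empty callback is the shutdown sentinel
      }
      cb();
    }
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> queue_;
};

// The crux of the blocking approach: run `cb` on every worker and block the
// calling (main) thread until all of them have executed it, so nothing the
// callbacks reference can be destroyed while they are still in flight.
void runOnAllWorkersAndWait(const std::vector<WorkQueue*>& workers,
                            const std::function<void()>& cb) {
  std::mutex done_mutex;
  std::condition_variable done_cv;
  size_t pending = workers.size();
  for (WorkQueue* worker : workers) {
    worker->post([&cb, &done_mutex, &done_cv, &pending] {
      cb(); // runs on the worker thread
      std::lock_guard<std::mutex> lock(done_mutex);
      if (--pending == 0) {
        done_cv.notify_one();
      }
    });
  }
  // Worst case, this waits for the slowest worker's event loop delay.
  std::unique_lock<std::mutex> lock(done_mutex);
  done_cv.wait(lock, [&pending] { return pending == 0; });
}

int main() {
  WorkQueue w1, w2;
  std::thread t1([&w1] { w1.run(); });
  std::thread t2([&w2] { w2.run(); });

  // "Slot destruction": flush a final callback through every worker before
  // freeing anything the earlier callbacks might have referenced.
  runOnAllWorkersAndWait({&w1, &w2}, [] { /* clear per-thread data here */ });

  w1.post(nullptr);
  w2.post(nullptr);
  t1.join();
  t2.join();
  return 0;
}
```

The wait on the main thread is exactly the cost being debated above: it is bounded by the slowest worker's event loop delay.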
One other design alternative: could we remove the need to lock a weak_ptr at every callback if, in startGlobalThreading, we go through all the callbacks and delete the ones that refer to a deleted slot? That would make a clear transition from "we're doing crazy stuff to handle startup-only problems" to "normal running state".
Yes, this might be possible. Let me think about this more.
@mattklein123 IIRC the HTTP route config is published to the workers via TLS. Routes probably carry a fair amount of updates, which may or may not be acceptable. If EDS cluster updates also go through TLS, then this approach would have a much larger impact.
I could be wrong.
It's likely a problem in the SRDS case, which updates its scopes relatively frequently: with ~100 workers, deleting 10 scopes (routeConfigProviders) could mean the main thread blocks for up to (100 worker event loop delays) x 10.
Yes, the blocking is the biggest risk with this approach. It's hard to say whether this is a real issue in practice or not. I'm skeptical that it is, but it's hard to say.
Q: do you think the still_alive_guard pattern will be used in other places? If so, the pattern could be extracted to an abstract class.
@ahedberg @lambdai @lizan @jmarantz @stevenzzzz @ggreenway I want to summarize some of the open lines of thought here so we can have a discussion on how to move forward. I don't want to spend more time until we decide how we want to handle this:
I will admit I'm not thrilled with the blocking proposed in this PR, but it doesn't feel completely terrible to me, especially as it removes the bookkeeper logic and prevents more potential issues. If we don't want the blocking and want to try for a more systemic fix beyond a clang-tidy plugin, one option that I have been noodling on is to figure out how to defer-delete the entire listener, cluster, etc. once all nested TLS callbacks have flushed. This would effectively extend the bookkeeper method on a larger scale, preventing main thread objects from dying until all nested callbacks have flushed. I think with the way we pass around TLS slot factories, this may be less horrible to accomplish than it seems. So to summarize:
Thoughts?
Basically, the Bookkeeper stops a SlotImpl from being torn down while there are outstanding lambdas in the workers (for #7902). It also doesn't block when the destructor is called on the main thread. I think we can extend the bookkeeper with the update we have in this PR; that will give us a non-blocking and safer slot impl. On the other hand, TLS is used to do cross-thread communication, so we should probably state clearly in the docs that only stateless data should be captured and passed around; the clang tool will help here as well.
How? Can you expand? I don't see any way of making it safer on its own without blocking.
If we went the clang-tidy route, would that be sufficient to also remove the BookKeeper code?
No, I don't think so. The bookkeeper code is preventing the slot from getting recycled while callbacks are still in flight. Fundamentally it's the same root issue: we have callbacks in flight that refer to something on the main thread that is in the process of getting destroyed. This is why I'm still personally voting for the blocking solution: it doesn't thrill me, but my intuition is that the blocking won't cause any issues in practice, especially with a combination of timeouts on the wait and the deadlock detection code we already have.
I'm thinking we add the still_alive_guard_ to the bookkeeper, and in the wrapped callback capture the weak_ptr and check that still_alive_guard_ is still alive before calling the real callback. I'd say this is already a huge improvement to the bookkeeper.
Yes, I agree this can't hurt, and I can add it, but it doesn't actually fix the underlying race condition. I suppose it's better than nothing.
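For reference, here is a minimal sketch of the still_alive_guard_ idea described above, with illustrative names rather than the actual SlotImpl members: the slot owns a shared_ptr, wrapped callbacks capture only a weak_ptr to it, and a callback that runs after the slot is gone becomes a no-op instead of touching freed memory.

```cpp
#include <functional>
#include <iostream>
#include <memory>

class Slot {
public:
  Slot() : still_alive_guard_(std::make_shared<bool>(true)) {}

  // Wrap a callback so it only runs if the slot still exists when invoked.
  std::function<void()> wrapCallback(std::function<void()> cb) {
    std::weak_ptr<bool> guard = still_alive_guard_;
    return [guard, cb = std::move(cb)] {
      if (guard.lock() != nullptr) { // is the slot still alive?
        cb();
      }
    };
  }

private:
  std::shared_ptr<bool> still_alive_guard_;
};

int main() {
  std::function<void()> wrapped;
  {
    Slot slot;
    wrapped = slot.wrapCallback([] { std::cout << "ran on worker\n"; });
    wrapped(); // slot alive: the callback runs
  }            // slot destroyed; the guard expires
  wrapped();   // now a no-op instead of a use-after-free
  return 0;
}
```

As noted above, this protects against a deleted slot but not against other state captured by the callback being destroyed on the main thread while the callback is in flight.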
Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Matt Klein <[email protected]>
Status update: I'm going to work on cleaning up the existing thread_local_impl code (I think it can be simplified) and also on adding the still-alive guard per discussion with @stevenzzzz. This should fix another class of bugs (in particular, LDS listener failure during initial startup). After that I will take a look at a clang-tidy plugin. This seems like the least contentious path forward right now.
Thanks!
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Matt Klein <[email protected]>
I didn't look in detail at the prior state but I think I understand your use of weak_ptr now to avoid a race.
Are you going to use the thread synchronizer to make a test?
Assuming there are no objections to this approach, I'm going to add a substantial number of new tests to cover this behavior.
Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Matt Klein <[email protected]>
@ggreenway @jmarantz @stevenzzzz updated per comments.
Signed-off-by: Matt Klein <[email protected]>
@ggreenway @jmarantz @stevenzzzz updated again. I fixed a bunch of broken captures and simplified the API. I will take another pass on broken captures when I work on the clang-tidy plugin.
I think this looks good. The code ended up simpler, with no super-complicated relationships between threads/objects.
great cleanup!
This is great, and a shout-out as well to @stevenzzzz for finding this race, as well as to Matt for fixing it in a very elegant way. One small suggestion; could be a follow-up.
```cpp
/**
 * Set thread local data on all threads previously registered via registerThread().
 * @param initializeCb supplies the functor that will be called *on each thread*. The functor
 *        returns the thread local object which is then stored. The storage is via
 *        a shared_ptr. Thus, this is a flexible mechanism that can be used to share
 *        the same data across all threads or to share different data on each thread.
 *
 * NOTE: The initialize callback is not supposed to capture the Slot, or its owner. As the owner
```
s/is not supposed to/must not/ ?
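To illustrate the NOTE (using toy stand-in types rather than Envoy's real headers): the callback handed to set() should capture only the data the workers need, by value, and never the Slot or its owner, because the owner can be destroyed on the main thread before the callback runs on a worker.

```cpp
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Stand-in for Envoy's ThreadLocal::ThreadLocalObject.
struct ThreadLocalObject {
  virtual ~ThreadLocalObject() = default;
};
using ThreadLocalObjectSharedPtr = std::shared_ptr<ThreadLocalObject>;

// Toy Slot: the real set() posts initializeCb to every worker dispatcher.
struct Slot {
  void set(std::function<ThreadLocalObjectSharedPtr()> initialize_cb) {
    stored_ = initialize_cb(); // single-threaded stand-in for "run on each worker"
  }
  ThreadLocalObjectSharedPtr stored_;
};

struct PerThreadConfig : public ThreadLocalObject {
  explicit PerThreadConfig(std::string route) : route_(std::move(route)) {}
  std::string route_;
};

class RouteProvider {
public:
  void publish(Slot& slot, const std::string& new_route) {
    // GOOD: capture only the immutable data the workers need, by value.
    slot.set([new_route]() -> ThreadLocalObjectSharedPtr {
      return std::make_shared<PerThreadConfig>(new_route);
    });

    // BAD (what the NOTE forbids): capturing `this` or the slot means the
    // callback races with destruction of this provider on the main thread.
    // slot.set([this]() -> ThreadLocalObjectSharedPtr {
    //   return std::make_shared<PerThreadConfig>(current_route_);
    // });
  }

private:
  std::string current_route_;
};

int main() {
  Slot slot;
  RouteProvider provider;
  provider.publish(slot, "/v2");
  return 0;
}
```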
Signed-off-by: Matt Klein <[email protected]>
This change does 2 things:
1) Removes the bookkeeper code. The main insight is that the slot itself does not need to be captured in any callbacks, just the index. By doing this, all bookkeeper complexity can be removed and all callbacks work directly with the slot index, which can be immediately recycled when the slot is deleted (sketched below).
2) Adds a still-alive guard to all callbacks. This does not completely prevent broken captures, but it does fix the common case of a slot being immediately deleted before any callbacks run on the workers (which can be common during initial-startup listener creation failure, for example).
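A minimal sketch of the index-capture idea from point 1, using toy types rather than the real ThreadLocalInstanceImpl: the callback posted to a worker carries only a uint32_t index into that worker's local vector, so the Slot on the main thread can be freed (and its index recycled) without any in-flight callback holding a pointer to it.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

// Stand-in for Envoy's ThreadLocal::ThreadLocalObject.
struct ThreadLocalObject {
  virtual ~ThreadLocalObject() = default;
};

// Each worker thread owns one of these.
struct WorkerLocalData {
  std::vector<std::shared_ptr<ThreadLocalObject>> slots_;
};

// Build the update callback that would be posted to every worker. It captures
// only the slot index and the new value -- never the Slot object itself -- so
// the Slot can be deleted on the main thread without leaving a dangling
// pointer inside any in-flight callback.
std::function<void(WorkerLocalData&)>
makeUpdateCallback(uint32_t index, std::shared_ptr<ThreadLocalObject> value) {
  return [index, value](WorkerLocalData& data) {
    if (data.slots_.size() <= index) {
      data.slots_.resize(index + 1);
    }
    data.slots_[index] = value;
  };
}

int main() {
  WorkerLocalData worker;
  auto cb = makeUpdateCallback(3, std::make_shared<ThreadLocalObject>());
  // By the time this runs on a worker, the originating Slot may already be
  // gone; the callback still only writes through the captured index.
  cb(worker);
  return 0;
}
```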
In a follow-up change I will look into adding a clang-tidy check to prevent capturing "this" in any TLS callback.
Fixes #12364
Risk Level: Medium
Testing: TBD
Docs Changes: N/A
Release Notes: N/A