
tls: simplify implementation and fix one class of crashing bug #12833

Merged (16 commits) on Sep 30, 2020

Conversation

mattklein123
Member

@mattklein123 mattklein123 commented Aug 27, 2020

This change does 2 things:

  1. It greatly simplifies the TLS implementation by removing the
    bookkeeper code. The main insight is that the slot itself does not
    need to be captured in any callbacks, just the index. By doing this,
    all bookkeeper complexity can be removed and all callbacks work
    directly with the slot index which can be immediately recycled when
    the slot is deleted.
  2. Adds a "still alive" shared_ptr which is captured by weak_ptr in
    all callbacks. This does not completely prevent broken captures,
    but it does fix the common case of a slot being deleted immediately,
    before any callbacks run on the workers (which can happen, for
    example, when listener creation fails during initial startup).
    A minimal sketch of both points follows this list.
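
Below is a minimal illustrative sketch of both points, under the assumption of a simplified API: the class name SlotImpl, the still_alive_guard_ member, and the deferred callback queue are stand-ins for this sketch, not the exact Envoy interfaces.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

// Stand-in for posting work to the worker dispatchers: callbacks are queued and drained later.
static std::vector<std::function<void()>> pending_worker_callbacks;

class SlotImpl {
public:
  explicit SlotImpl(uint32_t index)
      : index_(index), still_alive_guard_(std::make_shared<bool>(true)) {}

  void runOnAllThreads(std::function<void(uint32_t)> cb) {
    // Point 1: capture only the slot index, never the slot or its owner.
    // Point 2: capture a weak_ptr to the "still alive" guard.
    std::weak_ptr<bool> still_alive = still_alive_guard_;
    const uint32_t index = index_;
    pending_worker_callbacks.push_back([still_alive, index, cb]() {
      // If the slot was destroyed on the main thread before this ran, skip the callback.
      if (still_alive.lock() != nullptr) {
        cb(index);
      }
    });
  }

private:
  const uint32_t index_;
  // Destroyed together with the slot; in-flight callbacks observe this via the weak_ptr.
  const std::shared_ptr<bool> still_alive_guard_;
};

int main() {
  auto slot = std::make_unique<SlotImpl>(0);
  slot->runOnAllThreads([](uint32_t i) { std::cout << "ran for slot " << i << "\n"; });
  // Slot deleted before the "workers" run anything, e.g. a listener that fails during startup.
  slot.reset();
  for (auto& cb : pending_worker_callbacks) {
    cb(); // Prints nothing: the guard is gone, so the callback is a no-op.
  }
  return 0;
}
```

In the real implementation the queued work would be posted to each worker's dispatcher; the point of the sketch is that once the slot (and with it the guard) is destroyed on the main thread, callbacks still in flight become no-ops instead of touching freed state, and the slot index can be recycled immediately.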

In a follow up change I will look into adding a clang-tidy check to prevent
capturing "this" in any TLS callback.

Fixes #12364

Risk Level: Medium
Testing: TBD
Docs Changes: N/A
Release Notes: N/A

At the expense of startup complexity, this removes the bookkeeper
deletion model for slots, and blocks slot destruction until callbacks
have been flushed on workers. This removes an entire class of bugs
in which a user captures something in the lambda which gets destroyed
before the TLS operations run on the workers.

Signed-off-by: Matt Klein <[email protected]>
@mattklein123
Member Author

@jmarantz @htuch @ggreenway @stevenzzzz @lambdai this is seriously lacking tests, but it fixes #12364 and I think it would prevent a bunch of other issues that we have had and have done one-off fixes for (some of those fixes can likely be reverted).

Anyway, before I invest a huge amount of time I wanted to get some initial feedback here. Let me know what you think about this direction. I tried to add a bunch of comments in the code but lmk if I need to explain more. After the community meeting convo it should be pretty clear what is going on.

@mattklein123 mattklein123 changed the title from "tls: refactor to synchronize slot removal on workers" to "[WIP] tls: refactor to synchronize slot removal on workers" on Aug 27, 2020
Contributor

@ggreenway ggreenway left a comment

A couple concerns:

  • Are we creating different, but still hard-to-diagnose issues, by possibly not running the callback at all (if the slot was destroyed)?
  • What's the perf impact? An additional atomic operation for each callback? If it's on all workers, it will probably be contended (in hardware). Plus a wrapper function around the passed-in function.
  • What's the worst-case wait time? (But it's only the main thread that waits, right?)

@mattklein123
Member Author

Are we creating different, but still hard-to-diagnose issues, by possibly not running the callback at all (if the slot was destroyed)?

The default implementation does run the callback. It's not run in the immediate deletion case on startup. IMO it's fine to not run the callback if the slot has been deleted so I think this one is OK.

What's the perf impact? An additional atomic operation for each callback? If it's on all workers, it will probably be contended (in hardware). Plus a wrapper function around the passed-in function.

Yes some additional atomic increments for each callback. There is no blocking on worker updates, but there could be contention around the atomics. With that said, I think contention on stats would be vastly worse, so I don't view this as a large issue.

What's the worst-case wait time? (But it's only the main thread that waits, right?)

Worst case wait time on destruction is the event loop delay across all workers which could be hundreds of milliseconds. This is certainly the part that is not great and I'm happy to abandon this approach if we don't want this. I think blocking is the only way to fix the underlying problem if we don't take a different path and do something with code analysis like here: https://github.com/mesos/clang-tools-extra/blob/mesos_90/clang-tidy/mesos/ThisCaptureCheck.cpp. The code analysis approach wouldn't be perfect, but it might be good enough.
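
For context, here is a hypothetical example of the pattern such a this-capture check would flag; MyFilterConfig and the inline runOnAllThreads stand-in are made up for illustration and are not Envoy code:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Stand-in for ThreadLocal::Slot::runOnAllThreads(); here it simply invokes the callback,
// whereas the real API defers it to each worker's event loop.
void runOnAllThreads(const std::function<void()>& cb) { cb(); }

class MyFilterConfig {
public:
  void update(std::string new_value) {
    // BAD: "this" is captured. With the real thread-local API the lambda runs later on the
    // workers, so MyFilterConfig may already have been destroyed on the main thread.
    runOnAllThreads([this, new_value]() { value_ = new_value; });
  }

private:
  std::string value_;
};

int main() {
  MyFilterConfig config;
  config.update("example");
  std::cout << "update posted\n";
  return 0;
}
```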

Contributor

@ggreenway ggreenway left a comment

One other design alternative: could we remove the need to lock a weak_ptr at every callback if, in startGlobalThreading, we go through all the callbacks and delete the ones that refer to a deleted slot? That would make a clear transition from "we're doing crazy stuff to handle startup-only problems" to "normal running state".

@mattklein123
Member Author

One other design alternative: could we remove the need to lock a weak_ptr at every callback if, in startGlobalThreading, we go through all the callbacks and delete the ones that refer to a deleted slot? That would make a clear transition from "we're doing crazy stuff to handle startup-only problems" to "normal running state".

Yes this might be possible. Let me think about this more.
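
A rough sketch of that alternative, purely for illustration: the ThreadLocalInstance class, its members, and the DeferredCallback struct below are hypothetical, not Envoy's actual bookkeeping.

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

struct DeferredCallback {
  uint32_t slot_index;            // Which slot queued this work.
  std::function<void()> cb;       // The wrapped callback.
};

class ThreadLocalInstance {
public:
  // Hypothetical: once global threading starts, purge callbacks queued during single-threaded
  // startup that refer to slots which have already been freed, so no per-callback weak_ptr
  // check is needed in steady state.
  void startGlobalThreading() {
    std::vector<DeferredCallback> kept;
    for (auto& deferred : deferred_callbacks_) {
      if (live_slot_indexes_.count(deferred.slot_index) > 0) {
        kept.push_back(std::move(deferred));
      }
    }
    deferred_callbacks_ = std::move(kept);
  }

  std::set<uint32_t> live_slot_indexes_;
  std::vector<DeferredCallback> deferred_callbacks_;
};

int main() {
  ThreadLocalInstance tls;
  tls.live_slot_indexes_ = {1};
  tls.deferred_callbacks_.push_back({0, [] {}}); // Slot 0 already freed: will be purged.
  tls.deferred_callbacks_.push_back({1, [] {}}); // Slot 1 still live: kept.
  tls.startGlobalThreading();
  return static_cast<int>(tls.deferred_callbacks_.size()); // 1
}
```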

@lambdai
Contributor

lambdai commented Aug 27, 2020

@mattklein123 IIRC the HTTP route config is published to the workers via TLS. Route config updates probably carry a fair amount of data; this may or may not be acceptable.

If EDS cluster updates go through TLS, then this approach would have a large impact.

@lambdai
Contributor

lambdai commented Aug 27, 2020

@mattklein123 IIRC the HTTP route config is published to the workers via TLS. Route config updates probably carry a fair amount of data; this may or may not be acceptable.

If EDS cluster updates go through TLS, then this approach would have a large impact.

I could be wrong: publishing an update is not impacted, since the slot is not destroyed within the update.

@stevenzzzz
Contributor

@mattklein123 IIRC the HTTP route config is published to the workers via TLS. Route config updates probably carry a fair amount of data; this may or may not be acceptable.

It's likely a problem in the case of SRDS, which updates its scopes relatively frequently: with ~100 workers, deleting 10 scopes (routeConfigProviders) could mean the main thread blocks for up to 10 x (an event loop flush across 100 workers).

If EDS cluster updates go through TLS, then this approach would have a large impact.

@mattklein123
Member Author

It's likely a problem in the case of SRDS, which updates its scopes relatively frequently: with ~100 workers, deleting 10 scopes (routeConfigProviders) could mean the main thread blocks for up to 10 x (an event loop flush across 100 workers).

Yes, the blocking is the biggest risk with this approach. It's hard to say whether this would be a real issue in practice; I'm skeptical that it would be, but it's hard to know.

Member

@lizan lizan left a comment

Q: do you think the still_alive_guard pattern will be used in other places? If so the pattern could be extracted to an abstract class.
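
For illustration, one hypothetical shape such an extraction could take (the StillAliveGuard name and API below are invented for this sketch, not code proposed in this PR):

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Reusable helper: wraps a callback so it becomes a no-op once the owner is destroyed.
class StillAliveGuard {
public:
  std::function<void()> wrap(std::function<void()> cb) const {
    std::weak_ptr<bool> alive = guard_;
    return [alive, cb = std::move(cb)]() {
      if (alive.lock() != nullptr) {
        cb();
      }
    };
  }

private:
  std::shared_ptr<bool> guard_{std::make_shared<bool>(true)};
};

int main() {
  auto owner = std::make_unique<StillAliveGuard>();
  auto wrapped = owner->wrap([] { std::cout << "still alive\n"; });
  wrapped();     // Prints: the guard is alive.
  owner.reset(); // Owner (and guard) destroyed.
  wrapped();     // No-op: the weak_ptr can no longer be locked.
  return 0;
}
```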

@mattklein123
Member Author

@ahedberg @lambdai @lizan @jmarantz @stevenzzzz @ggreenway I want to summarize some of the open lines of thought here so we can have a discussion on how to move forward. I don't want to spend more time until we decide how we want to handle this:

  • This change will prevent many classes of bugs we have had around TLS slot lifetime, but has the large downside of forcing blocking on the main thread while TLS slot updates flush.
  • It's possible we can get pretty far with a clang-tidy plugin that looks like this one: https://github.com/mesos/clang-tools-extra/blob/mesos_90/clang-tidy/mesos/ThisCaptureCheck.cpp. The main thing that concerns me about the plugin approach is that it's only a partial fix: a) depending on what is captured, things might still be broken due to transitive lifetimes; b) there are potential race conditions (e.g., say we capture by weak_ptr and lock(); depending on the structure of transitive lifetimes, we may lock something, but there is still a race with destruction on the main thread).
  • The bookkeeper was already implemented by @stevenzzzz to fix similar problems (see https://docs.google.com/document/d/1_axmNmbkDO3aQE3jS5-WgSLvXyVxKwapOnrkBZn-DSk/edit#heading=h.ui18fonf1w1), but it doesn't fix all such problems and adds complexity. (Basically, the bookkeeper code keeps the slot from getting recycled until all callbacks are flushed, but still relies on the caller to not capture anything they are not supposed to be capturing.)

I will admit I'm not thrilled w/ the blocking proposed in this PR, but it doesn't feel completely terrible to me, especially as it removes the bookkeeper logic and prevents more potential issues.

If we don't want the blocking and want to try for a more systemic fix beyond a clang-tidy plugin, one option that I have been noodling on is to figure out how to defer delete the entire listener, cluster, etc. when all nested TLS callbacks have flushed. This would be effectively extending the bookkeeper method on a larger scale, to prevent main thread objects from dying until all nested callbacks have flushed. I think with the way we pass around TLS slot factories, this may be less horrible to accomplish than it seems.

So to summarize:

  1. Do nothing; fix one-off bugs as they show up.
  2. Try for a clang-tidy plugin; fix one-off bugs as they show up.
  3. Do this PR (or a variant of it).
  4. Figure out how to extend bookkeeper to a larger level.

Thoughts?

@stevenzzzz
Contributor


Basically, Bookkeeper stops a SlotImpl from being torn down while there are outstanding lambdas on the workers (for #7902). It also doesn't block when the destructor is called on the main thread.
This PR extends the lifecycle coverage to part of the captured objects by checking whether the SlotImpl destructor has been called.

I think we can extend the Bookkeeper with the update we have in this PR; that would give us a non-blocking and safer slot implementation.

On the other hand, TLS is used for cross-thread communication; we should probably state clearly in the docs that only stateless data should be captured and passed around. The clang tool will help here as well.

@mattklein123
Member Author

I think we can extend the Bookkeeper with the update we have in this PR; that would give us a non-blocking and safer slot implementation.

How? Can you expand? I don't see any way of making it safer on its own without blocking.

@ggreenway
Contributor

If we went the clang-tidy route, would that be sufficient to also remove the BookKeeper code?

@mattklein123
Member Author

If we went the clang-tidy route, would that be sufficient to also remove the BookKeeper code?

No, I don't think so. The bookkeeper code is preventing the slot from getting recycled while callbacks are still in flight. Fundamentally it's the same root issue: we have callbacks in flight that refer to something on the main thread that is in the process of getting destroyed.

This is why I'm still personally voting for the blocking solution: it doesn't thrill me but my intuition is that the blocking won't cause any issues in practice, especially with a combination of timeouts on the wait and the deadlock detection code we already have.

@stevenzzzz
Contributor

How? Can you expand? I don't see any way of making it safer on its own without blocking.

I am thinking we add the "still_alive_guard_" to the Bookkeeper, and in the wrapped callback capture the weak_ptr and check that still_alive_guard_ is still alive before calling the real callback. That alone is, I'd say, already a huge improvement to the Bookkeeper.
For callbacks that have won the race on "destructing/checking still_alive_guard_", since the SlotImpl is not deleted, the deferred-deletion trick adds some safety to those winner callbacks. (Same as this PR, it won't solve all the problems.)

@mattklein123
Member Author

I am thinking we add the "still_alive_guard_" to the Bookkeeper, and in the wrapped callback capture the weak_ptr and check that still_alive_guard_ is still alive before calling the real callback. That alone is, I'd say, already a huge improvement to the Bookkeeper.
For callbacks that have won the race on "destructing/checking still_alive_guard_", since the SlotImpl is not deleted, the deferred-deletion trick adds some safety to those winner callbacks. (Same as this PR, it won't solve all the problems.)

Yes, I agree this can't hurt, and I can add it. It doesn't actually fix the underlying race condition, but I suppose it's better than nothing.

@mattklein123
Member Author

Status update: I'm going to work on cleaning up the existing thread_local_impl code (I think it can be simplified) and also adding the still alive guard per discussion with @stevenzzzz. This should fix another class of bugs (in particular LDS listener failure during initial startup). After that I will take a look at a clang-tidy plugin. This seems like the least contentious path forward right now.

@stevenzzzz
Contributor

Thanks!

@stale

stale bot commented Sep 11, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Sep 11, 2020
@mattklein123 mattklein123 added the "no stalebot" label (disables stalebot from closing an issue) on Sep 17, 2020
Contributor

@jmarantz jmarantz left a comment

I didn't look in detail at the prior state but I think I understand your use of weak_ptr now to avoid a race.

Are you going to use the thread synchronizer to make a test?

@mattklein123
Member Author

Are you going to use the thread synchronizer to make a test?

Assuming there are no objections to this approach I'm going to add a substantial number of new tests to cover this behavior.

@mattklein123
Member Author

@ggreenway @jmarantz @stevenzzzz updated per comments.

Signed-off-by: Matt Klein <[email protected]>
@mattklein123
Member Author

@ggreenway @jmarantz @stevenzzzz updated again. I fixed a bunch of broken captures and simplified the API. I will take another pass on broken captures when I work on the clang-tidy plugin.

Contributor

@ggreenway ggreenway left a comment

I think this looks good. The code ended up simpler, with no super-complicated relationships between threads/objects.

Contributor

@stevenzzzz stevenzzzz left a comment

great cleanup!

Contributor

@jmarantz jmarantz left a comment

This is great. A shout-out to @stevenzzzz for finding this race, as well as to Matt for fixing it in a very elegant way. One small suggestion; it could be a follow-up.

/**
* Set thread local data on all threads previously registered via registerThread().
* @param initializeCb supplies the functor that will be called *on each thread*. The functor
* returns the thread local object which is then stored. The storage is via
* a shared_ptr. Thus, this is a flexible mechanism that can be used to share
* the same data across all threads or to share different data on each thread.
*
* NOTE: The initialize callback is not supposed to capture the Slot, or its owner. As the owner
Contributor

s/is not supposed to/must not/ ?

Successfully merging this pull request may close these issues: "Envoy segfaults when a dynamic listener fails to bind to a port and gRPC access logging is enabled"