
tls: simplify implementation and fix one class of crashing bug #12833

Merged (16 commits) on Sep 30, 2020

Conversation

mattklein123
Member

@mattklein123 mattklein123 commented Aug 27, 2020

This change does 2 things:

  1. It greatly simplifies the TLS implementation by removing the
    bookkeeper code. The main insight is that the slot itself does not
    need to be captured in any callbacks, just the index. By doing this,
    all bookkeeper complexity can be removed and all callbacks work
    directly with the slot index which can be immediately recycled when
    the slot is deleted.
  2. Adds a "still alive" shared_ptr which is captured by weak_ptr in
    all callbacks. This does not completely prevent broken captures,
    but it does fix the common case of a slot being deleted immediately,
    before any callbacks run on the workers (which can happen, for
    example, when listener creation fails during initial startup).
    A minimal sketch of both points follows this list.
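
Below is a minimal illustrative sketch of both points, under the assumption of a simplified API: the class name SlotImpl, the still_alive_guard_ member, and the deferred callback queue are stand-ins for this sketch, not the exact Envoy interfaces.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

// Stand-in for posting work to the worker dispatchers: callbacks are queued and drained later.
static std::vector<std::function<void()>> pending_worker_callbacks;

class SlotImpl {
public:
  explicit SlotImpl(uint32_t index)
      : index_(index), still_alive_guard_(std::make_shared<bool>(true)) {}

  void runOnAllThreads(std::function<void(uint32_t)> cb) {
    // Point 1: capture only the slot index, never the slot or its owner.
    // Point 2: capture a weak_ptr to the "still alive" guard.
    std::weak_ptr<bool> still_alive = still_alive_guard_;
    const uint32_t index = index_;
    pending_worker_callbacks.push_back([still_alive, index, cb]() {
      // If the slot was destroyed on the main thread before this ran, skip the callback.
      if (still_alive.lock() != nullptr) {
        cb(index);
      }
    });
  }

private:
  const uint32_t index_;
  // Destroyed together with the slot; in-flight callbacks observe this via the weak_ptr.
  const std::shared_ptr<bool> still_alive_guard_;
};

int main() {
  auto slot = std::make_unique<SlotImpl>(0);
  slot->runOnAllThreads([](uint32_t i) { std::cout << "ran for slot " << i << "\n"; });
  // Slot deleted before the "workers" run anything, e.g. a listener that fails during startup.
  slot.reset();
  for (auto& cb : pending_worker_callbacks) {
    cb(); // Prints nothing: the guard is gone, so the callback is a no-op.
  }
  return 0;
}
```

In the real implementation the queued work would be posted to each worker's dispatcher; the point of the sketch is that once the slot (and with it the guard) is destroyed on the main thread, callbacks still in flight become no-ops instead of touching freed state, and the slot index can be recycled immediately.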

In a follow up change I will look into adding a clang-tidy check to prevent
capturing "this" in any TLS callback.

Fixes #12364

Risk Level: Medium
Testing: TBD
Docs Changes: N/A
Release Notes: N/A

At the expense of startup complexity, this removes the bookkeeper
deletion model for slots, and blocks slot destruction until callbacks
have been flushed on workers. This removes an entire class of bugs
in which a user captures something in the lambda which gets destroyed
before the TLS operations run on the workers.

Signed-off-by: Matt Klein <[email protected]>
@mattklein123
Member Author

@jmarantz @htuch @ggreenway @stevenzzzz @lambdai this is seriously lacking tests, but it fixes #12364 and I think it would prevent a bunch of other issues that we have had and have done one-off fixes for (some of those fixes can likely be reverted).

Anyway, before I invest a huge amount of time I wanted to get some initial feedback here. Let me know what you think about this direction. I tried to add a bunch of comments in the code but lmk if I need to explain more. After the community meeting convo it should be pretty clear what is going on.

@mattklein123 mattklein123 changed the title from "tls: refactor to synchronize slot removal on workers" to "[WIP] tls: refactor to synchronize slot removal on workers" on Aug 27, 2020
Contributor

@ggreenway ggreenway left a comment

A couple concerns:

  • Are we creating different, but still hard-to-diagnose issues, by possibly not running the callback at all (if the slot was destroyed)?
  • What's the perf impact? An additional atomic operation for each callback? If it's on all workers, it will probably be contended (in hardware). Plus a wrapper function around the passed-in function.
  • What's the worst-case wait time? (But it's only the main thread that waits, right?)

@mattklein123
Member Author

Are we creating different, but still hard-to-diagnose issues, by possibly not running the callback at all (if the slot was destroyed)?

The default implementation does run the callback. It's not run in the immediate deletion case on startup. IMO it's fine to not run the callback if the slot has been deleted so I think this one is OK.

What's the perf impact? An additional atomic operation for each callback? If it's on all workers, it will probably be contended (in hardware). Plus a wrapper function around the passed-in function.

Yes some additional atomic increments for each callback. There is no blocking on worker updates, but there could be contention around the atomics. With that said, I think contention on stats would be vastly worse, so I don't view this as a large issue.

What's the worst-case wait time? (But it's only the main thread that waits, right?)

Worst case wait time on destruction is the event loop delay across all workers which could be hundreds of milliseconds. This is certainly the part that is not great and I'm happy to abandon this approach if we don't want this. I think blocking is the only way to fix the underlying problem if we don't take a different path and do something with code analysis like here: https://github.com/mesos/clang-tools-extra/blob/mesos_90/clang-tidy/mesos/ThisCaptureCheck.cpp. The code analysis approach wouldn't be perfect, but it might be good enough.
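
For context, here is a hypothetical example of the pattern such a this-capture check would flag; MyFilterConfig and the inline runOnAllThreads stand-in are made up for illustration and are not Envoy code:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Stand-in for ThreadLocal::Slot::runOnAllThreads(); here it simply invokes the callback,
// whereas the real API defers it to each worker's event loop.
void runOnAllThreads(const std::function<void()>& cb) { cb(); }

class MyFilterConfig {
public:
  void update(std::string new_value) {
    // BAD: "this" is captured. With the real thread-local API the lambda runs later on the
    // workers, so MyFilterConfig may already have been destroyed on the main thread.
    runOnAllThreads([this, new_value]() { value_ = new_value; });
  }

private:
  std::string value_;
};

int main() {
  MyFilterConfig config;
  config.update("example");
  std::cout << "update posted\n";
  return 0;
}
```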

Contributor

@ggreenway ggreenway left a comment

One other design alternative: could we remove the need to lock a weak_ptr at every callback if, in startGlobalThreading, we go through all the callbacks and delete the ones that refer to a deleted slot? That would make a clear transition from "we're doing crazy stuff to handle startup-only problems" to "normal running state".

@mattklein123
Member Author

One other design alternative: could we remove the need to lock a weak_ptr at every callback if, in startGlobalThreading, we go through all the callbacks and delete the ones that refer to a deleted slot? That would make a clear transition from "we're doing crazy stuff to handle startup-only problems" to "normal running state".

Yes this might be possible. Let me think about this more.
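
A rough sketch of that alternative, purely for illustration: the ThreadLocalInstance class, its members, and the DeferredCallback struct below are hypothetical, not Envoy's actual bookkeeping.

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <vector>

struct DeferredCallback {
  uint32_t slot_index;            // Which slot queued this work.
  std::function<void()> cb;       // The wrapped callback.
};

class ThreadLocalInstance {
public:
  // Hypothetical: once global threading starts, purge callbacks queued during single-threaded
  // startup that refer to slots which have already been freed, so no per-callback weak_ptr
  // check is needed in steady state.
  void startGlobalThreading() {
    std::vector<DeferredCallback> kept;
    for (auto& deferred : deferred_callbacks_) {
      if (live_slot_indexes_.count(deferred.slot_index) > 0) {
        kept.push_back(std::move(deferred));
      }
    }
    deferred_callbacks_ = std::move(kept);
  }

  std::set<uint32_t> live_slot_indexes_;
  std::vector<DeferredCallback> deferred_callbacks_;
};

int main() {
  ThreadLocalInstance tls;
  tls.live_slot_indexes_ = {1};
  tls.deferred_callbacks_.push_back({0, [] {}}); // Slot 0 already freed: will be purged.
  tls.deferred_callbacks_.push_back({1, [] {}}); // Slot 1 still live: kept.
  tls.startGlobalThreading();
  return static_cast<int>(tls.deferred_callbacks_.size()); // 1
}
```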

@lambdai
Contributor

lambdai commented Aug 27, 2020

@mattklein123 IIRC the HTTP route config is published to the workers via TLS. Route config updates probably carry a fair amount of data; this may or may not be acceptable.

If EDS cluster updates go through TLS, then this approach would have a large impact.

@lambdai
Contributor

lambdai commented Aug 27, 2020

@mattklein123 IIRC the HTTP route config is published to the workers via TLS. Route config updates probably carry a fair amount of data; this may or may not be acceptable.

If EDS cluster updates go through TLS, then this approach would have a large impact.

I could be wrong: publishing an update is not impacted, since the slot is not destroyed within the update.

@stevenzzzz
Contributor

@mattklein123 IIRC the HTTP route config is published to the workers via TLS. Route config updates probably carry a fair amount of data; this may or may not be acceptable.

It's likely a problem in the case of SRDS, which updates its scopes relatively frequently: with ~100 workers, deleting 10 scopes (routeConfigProviders) could mean the main thread blocks for up to 10 x (an event loop flush across 100 workers).

If EDS cluster updates go through TLS, then this approach would have a large impact.

@mattklein123
Member Author

It's likely a problem in the case of SRDS, which updates its scopes relatively frequently: with ~100 workers, deleting 10 scopes (routeConfigProviders) could mean the main thread blocks for up to 10 x (an event loop flush across 100 workers).

Yes, the blocking is the biggest risk with this approach. It's hard to say whether this would be a real issue in practice; I'm skeptical that it would be, but it's hard to know.

Member

@lizan lizan left a comment

Q: do you think the still_alive_guard pattern will be used in other places? If so the pattern could be extracted to an abstract class.
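
For illustration, one hypothetical shape such an extraction could take (the StillAliveGuard name and API below are invented for this sketch, not code proposed in this PR):

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Reusable helper: wraps a callback so it becomes a no-op once the owner is destroyed.
class StillAliveGuard {
public:
  std::function<void()> wrap(std::function<void()> cb) const {
    std::weak_ptr<bool> alive = guard_;
    return [alive, cb = std::move(cb)]() {
      if (alive.lock() != nullptr) {
        cb();
      }
    };
  }

private:
  std::shared_ptr<bool> guard_{std::make_shared<bool>(true)};
};

int main() {
  auto owner = std::make_unique<StillAliveGuard>();
  auto wrapped = owner->wrap([] { std::cout << "still alive\n"; });
  wrapped();     // Prints: the guard is alive.
  owner.reset(); // Owner (and guard) destroyed.
  wrapped();     // No-op: the weak_ptr can no longer be locked.
  return 0;
}
```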

@mattklein123
Member Author

@ahedberg @lambdai @lizan @jmarantz @stevenzzzz @ggreenway I want to summarize some of the open lines of thought here so we can have a discussion on how to move forward. I don't want to spend more time until we decide how we want to handle this:

  • This change will prevent many classes of bugs we have had around TLS slot lifetime, but has the large downside of forcing blocking on the main thread while TLS slot updates flush.
  • It's possible we can get pretty far with a clang-tidy plugin that looks like this one: https://github.com/mesos/clang-tools-extra/blob/mesos_90/clang-tidy/mesos/ThisCaptureCheck.cpp. The main thing that concerns me about the plugin approach is that it's only a partial fix: a) depending on what is captured, things might still be broken due to transitive lifetimes; b) there are potential race conditions (e.g., say we capture by weak_ptr and lock(); depending on the structure of transitive lifetimes, we may lock something, but there is still a race with destruction on the main thread).
  • The bookkeeper was already implemented by @stevenzzzz to fix similar problems (see https://docs.google.com/document/d/1_axmNmbkDO3aQE3jS5-WgSLvXyVxKwapOnrkBZn-DSk/edit#heading=h.ui18fonf1w1), but it doesn't fix all such problems and adds complexity. (Basically, the bookkeeper code keeps the slot from getting recycled until all callbacks are flushed, but still relies on the caller to not capture anything they are not supposed to be capturing.)

I will admit I'm not thrilled w/ the blocking proposed in this PR, but it doesn't feel completely terrible to me, especially as it removes the bookkeeper logic and prevents more potential issues.

If we don't want the blocking and want to try for a more systemic fix beyond a clang-tidy plugin, one option that I have been noodling on is to figure out how to defer delete the entire listener, cluster, etc. when all nested TLS callbacks have flushed. This would be effectively extending the bookkeeper method on a larger scale, to prevent main thread objects from dying until all nested callbacks have flushed. I think with the way we pass around TLS slot factories, this may be less horrible to accomplish than it seems.

So to summarize:

  1. Do nothing; fix one-off bugs as they show up.
  2. Try for a clang-tidy plugin; fix one-off bugs as they show up.
  3. Do this PR (or a variant of it).
  4. Figure out how to extend bookkeeper to a larger level.

Thoughts?

@stevenzzzz
Contributor


Basically, Bookkeeper stops a SlotImpl from being torn down while there are outstanding lambdas on the workers (for #7902). It also doesn't block when the destructor is called on the main thread.
This PR extends the lifecycle coverage to part of the captured objects by checking whether the SlotImpl destructor has been called.

I think we can extend the Bookkeeper with the update we have in this PR; that would give us a non-blocking and safer slot implementation.

On the other hand, TLS is used for cross-thread communication; we should probably state clearly in the docs that only stateless data should be captured and passed around. The clang tool will help here as well.

@mattklein123
Member Author

I think we can extend the Bookkeeper with the update we have in this PR; that would give us a non-blocking and safer slot implementation.

How? Can you expand? I don't see any way of making it safer on its own without blocking.

@ggreenway
Contributor

If we went the clang-tidy route, would that be sufficient to also remove the BookKeeper code?

@mattklein123
Member Author

If we went the clang-tidy route, would that be sufficient to also remove the BookKeeper code?

No, I don't think so. The bookkeeper code is preventing the slot from getting recycled while callbacks are still in flight. Fundamentally it's the same root issue: we have callbacks in flight that refer to something on the main thread that is in the process of getting destroyed.

This is why I'm still personally voting for the blocking solution: it doesn't thrill me but my intuition is that the blocking won't cause any issues in practice, especially with a combination of timeouts on the wait and the deadlock detection code we already have.

@stevenzzzz
Contributor

How? Can you expand? I don't see any way of making it safer on its own without blocking.

I am thinking we add the "still_alive_guard_" to the Bookkeeper, and in the wrapped callback capture the weak_ptr and check that still_alive_guard_ is still alive before calling the real callback. That alone is, I'd say, already a huge improvement to the Bookkeeper.
For callbacks that have won the race on "destructing/checking still_alive_guard_", since the SlotImpl is not deleted, the deferred-deletion trick adds some safety to those winner callbacks. (Same as this PR, it won't solve all the problems.)

@mattklein123
Member Author

I am thinking we add the "still_alive_guard_" to the Bookkeeper, and in the wrapped callback capture the weak_ptr and check that still_alive_guard_ is still alive before calling the real callback. That alone is, I'd say, already a huge improvement to the Bookkeeper.
For callbacks that have won the race on "destructing/checking still_alive_guard_", since the SlotImpl is not deleted, the deferred-deletion trick adds some safety to those winner callbacks. (Same as this PR, it won't solve all the problems.)

Yes, I agree this can't hurt, and I can add it. It doesn't actually fix the underlying race condition, but I suppose it's better than nothing.

@mattklein123
Member Author

Status update: I'm going to work on cleaning up the existing thread_local_impl code (I think it can be simplified) and also adding the still alive guard per discussion with @stevenzzzz. This should fix another class of bugs (in particular LDS listener failure during initial startup). After that I will take a look at a clang-tidy plugin. This seems like the least contentious path forward right now.

@stevenzzzz
Contributor

Thanks!

@stale

stale bot commented Sep 11, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Sep 11, 2020
@mattklein123 mattklein123 added the "no stalebot" label (disables stalebot from closing an issue) on Sep 17, 2020
Contributor

@jmarantz jmarantz left a comment

I didn't look in detail at the prior state but I think I understand your use of weak_ptr now to avoid a race.

Are you going to use the thread synchronizer to make a test?

@mattklein123
Member Author

Are you going to use the thread synchronizer to make a test?

Assuming there are no objections to this approach I'm going to add a substantial number of new tests to cover this behavior.

@mattklein123
Member Author

@ggreenway @jmarantz @stevenzzzz updated per comments.

Signed-off-by: Matt Klein <[email protected]>
@mattklein123
Member Author

@ggreenway @jmarantz @stevenzzzz updated again. I fixed a bunch of broken captures and simplified the API. I will take another pass on broken captures when I work on the clang-tidy plugin.

Contributor

@ggreenway ggreenway left a comment

I think this looks good. The code ended up simpler, with no super-complicated relationships between threads/objects.

Contributor

@stevenzzzz stevenzzzz left a comment

great cleanup!

Contributor

@jmarantz jmarantz left a comment

This is great. A shout-out to @stevenzzzz for finding this race, as well as to Matt for fixing it in a very elegant way. One small suggestion; it could be a follow-up.

/**
* Set thread local data on all threads previously registered via registerThread().
* @param initializeCb supplies the functor that will be called *on each thread*. The functor
* returns the thread local object which is then stored. The storage is via
* a shared_ptr. Thus, this is a flexible mechanism that can be used to share
* the same data across all threads or to share different data on each thread.
*
* NOTE: The initialize callback is not supposed to capture the Slot, or its owner. As the owner
Contributor

s/is not supposed to/must not/ ?

Successfully merging this pull request may close these issues: "Envoy segfaults when a dynamic listener fails to bind to a port and gRPC access logging is enabled"