Improvement to the Batch Span Processor #949

Closed
snehilchopra opened this issue Jul 27, 2020 · 7 comments
Labels: release:required-for-ga (To be resolved before GA release), sdk (Affects the SDK package)

Comments


snehilchopra commented Jul 27, 2020

Context
The Python Batch Span Processor uses two threads, the main thread and a worker thread, in a synchronized manner to export batches of spans to the configured exporter.
However, the current implementation appears susceptible to some unexpected behavior.

Problem
1.) On lines 163, 216, and 241, I think there is a high possibility that the notify calls miss waking up the worker thread. In other words, a notify call may be made BEFORE the worker thread actually starts to wait.

2.) Moreover, handling of spurious wake-ups with correct timeout recalculation seems to be missing too (see line 245). Although I realize they rarely occur, it might be a good idea to account for them for robustness.

Solution
I have just implemented the batch span processor in C++, addressing these concerns.
To prevent notify calls from being missed, the waits can be wrapped in a while loop that checks a flag recording whether a wake-up was actually requested.
Moreover, for spurious wake-ups with timeout recalculation, something similar could be done for Python; a sketch follows below.
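
For example, a minimal sketch of the flag idea (the names here are illustrative, not the SDK's actual ones):

```python
import threading

# Sketch: the notification is recorded in a flag under the lock, so a
# notify() issued before the worker reaches wait() is still observed.
condition = threading.Condition()
wakeup_requested = False

def request_wakeup():
    global wakeup_requested
    with condition:
        wakeup_requested = True  # remember the notification under the lock
        condition.notify()       # only wakes a thread already blocked in wait()

def await_wakeup():
    global wakeup_requested
    with condition:
        while not wakeup_requested:  # the loop also absorbs spurious wake-ups
            condition.wait()
        wakeup_requested = False     # consume the notification

request_wakeup()  # the notify fires first ...
await_wakeup()    # ... yet the flag makes this return immediately
print("notification was not lost")
```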

Please let me know what you think about this! Thanks!

@mauriciovasquezbernal @reyang @Oberon00


reyang commented Aug 21, 2020

@lzchen @ramthi

lzchen added the release:required-for-ga and sdk labels on Aug 24, 2020
Oberon00 commented

> 1.) On lines 163, 216, and 241, I think there is a high possibility that the notify calls miss waking up the worker thread. In other words, a notify call may be made BEFORE the worker thread actually starts to wait.

Indeed, this code looks problematic. Notice how all the notify calls are made within "with condition", i.e. we hold the condition lock when calling notify. The worker thread likewise calls wait() within "with condition". The problem is that the actual conditions are supposed to be checked within the "with condition" block around the wait as well (and the condition flags, like self._flushing, set within the "with condition" block around the notify, though I think setting them just before should work equally well with less lock contention).
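
For illustration, checking the conditions under the lock could look roughly like this in the worker (a sketch only; the attribute names loosely mirror the SDK, but this is not the actual implementation):

```python
import collections
import threading

class Processor:
    def __init__(self, max_export_batch_size=512, schedule_delay=5.0):
        self.condition = threading.Condition()
        self.queue = collections.deque()
        self._flushing = False
        self.max_export_batch_size = max_export_batch_size
        self.schedule_delay = schedule_delay

    def _worker_wait(self):
        with self.condition:
            # Re-check the real conditions while holding the lock: if a flush
            # was requested or enough spans piled up before we got here, skip
            # the wait instead of sleeping through the notification.
            if self._flushing or len(self.queue) >= self.max_export_batch_size:
                return
            self.condition.wait(self.schedule_delay)

p = Processor()
p._flushing = True
p._worker_wait()  # returns immediately: the pending flush is seen under the lock
```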

> 2.) Moreover, handling of spurious wake-ups with correct timeout recalculation seems to be missing too (see line 245). Although I realize they rarely occur, it might be a good idea to account for them for robustness.

What is more, we don't even have a boolean to wait on. This could even result in a deadlock there if the flush completes before we start to wait for it. I guess both problems could be solved at once by replacing the flush_condition with a threading.Event.
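
For illustration, a sketch of the Event-based shape (hypothetical, not the eventual fix): since Event.set() latches, a force_flush that only starts waiting after the worker has already finished still returns promptly.

```python
import threading

class Processor:
    def __init__(self):
        self._flush_done = threading.Event()

    def force_flush(self, timeout_millis=30000):
        self._flush_done.clear()
        self._request_flush()  # hypothetical helper: hand a flush token to the worker
        return self._flush_done.wait(timeout_millis / 1e3)

    def _request_flush(self):
        # Stand-in for the worker draining the queue; here it completes inline,
        # i.e. before force_flush even begins to wait.
        self._flush_done.set()

p = Processor()
print(p.force_flush())  # True: the latched Event is observed despite the ordering
```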

lzchen self-assigned this on Sep 1, 2020

lzchen commented Sep 1, 2020

@snehilchopra
Hey, thanks a lot for finding this issue. I understand the cases where the notify calls miss waking up the worker thread. I'm a little confused as to why, in the C++ implementation, there is a while loop encapsulating the wait. Wouldn't the wait block the main thread until the worker thread notifies that the flush is done processing, which in turn would always set is_force_flush_notified_ to true?

@Oberon00

> What is more, we don't even have a boolean to wait on. This could even result in a deadlock there if the flush completes before we start to wait for it. I guess both problems could be solved at once by replacing the flush_condition with a threading.Event.

Can you explain a bit more how this would cause a deadlock, and why a threading.Event could solve this instead of flush_condition?


Oberon00 commented Sep 1, 2020

@lzchen

> Can you explain a bit more how this would cause a deadlock, and why a threading.Event could solve this instead of flush_condition?

This is the sequence of events that would deadlock:

  1. T1: force_flush appends the flush token and notifies the condition
  2. T2: the worker wakes up, calls drain_queue, calls export, dequeues the flush token, and notifies the condition
  3. T1: force_flush only now starts waiting on the condition

T1 will now wait forever because the flush token has already been dequeued. Notifying a condition only wakes up threads that are already waiting.
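
For illustration, the ordering can be compressed into a single thread (nothing here is SDK code):

```python
import threading
import time

# Condition.notify() does not "remember" anything for threads
# that are not yet waiting, so the notification below is lost.
condition = threading.Condition()

with condition:
    condition.notify()  # step 2 (T2): nobody is waiting yet, so this is a no-op

start = time.monotonic()
with condition:
    woke = condition.wait(timeout=0.5)  # step 3 (T1): starts waiting only now
print(f"woke={woke} after {time.monotonic() - start:.2f}s")  # woke=False, ~0.50s
```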

mariojonke commented

I guess it is not really a deadlock, since thread T1 will continue once the timeout on the flush condition elapses.

Another problem is that calls to force_flush are not thread-safe, since all threads read, set, and unset the same _flushing flag of the span processor instance.

@lzchen I made a PR addressing these issues in #1062.
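
For illustration, one way to make concurrent force_flush calls safe is to give each call its own request object with its own event, instead of sharing one _flushing flag (a sketch only, not necessarily what the PR does):

```python
import threading

class _FlushRequest:  # hypothetical name
    def __init__(self):
        self.done = threading.Event()

class Processor:
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []  # flush requests the worker still has to honor

    def force_flush(self, timeout_millis=30000):
        request = _FlushRequest()
        with self._lock:
            self._pending.append(request)
        self._wake_worker()  # hypothetical: nudge the worker thread
        return request.done.wait(timeout_millis / 1e3)

    def _wake_worker(self):
        # Stand-in for the worker: export everything, then complete each request.
        with self._lock:
            pending, self._pending = self._pending, []
        for request in pending:
            request.done.set()

p = Processor()
print(p.force_flush())  # True; concurrent callers each wait on their own Event
```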

snehilchopra commented

> @snehilchopra
> Hey, thanks a lot for finding this issue. I understand the cases where the notify calls miss waking up the worker thread. I'm a little confused as to why, in the C++ implementation, there is a while loop encapsulating the wait. Wouldn't the wait block the main thread until the worker thread notifies that the flush is done processing, which in turn would always set is_force_flush_notified_ to true?

Hi @lzchen. This is to handle spurious wake-ups: we check a condition to determine whether the thread woke up spuriously. If the condition is satisfied, we break out of the loop; otherwise, we go back to waiting. A Python sketch of the same loop follows below.
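
A sketch with hypothetical names, not SDK code: loop around wait(), re-checking the predicate and recomputing the remaining timeout from a monotonic deadline, so a spurious wake-up neither exits early nor stretches the total wait.

```python
import threading
import time

def wait_for_flag(condition, predicate, timeout):
    deadline = time.monotonic() + timeout
    with condition:
        while not predicate():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False           # timed out before the flag was set
            condition.wait(remaining)  # may return spuriously; the loop re-checks
    return True

condition = threading.Condition()
flushed = False
with condition:
    flushed = True  # simulate the worker finishing before we start to wait
print(wait_for_flag(condition, lambda: flushed, timeout=1.0))  # True
```

Note that the standard library already ships this loop as threading.Condition.wait_for(predicate, timeout).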

lzchen assigned mariojonke and unassigned lzchen on Sep 2, 2020

lzchen commented Sep 13, 2020

Closed with #1062

lzchen closed this as completed on Sep 13, 2020
srikanthccv pushed a commit to srikanthccv/opentelemetry-python that referenced this issue Nov 1, 2020