Official means to put the full trace context back #74

garthk · 2020-07-19T00:22:23Z

TL;DR:

Anyone who misses an ot_tracer:end_span will cause problems for any of their callers using ot_tracer:end_span as the stack unwinds, until we hit an ot_tracer:with_span which can put the full context back
Extracting ot_tracer:set_ctx/2 from ot_tracer.with_span/3 would let us keep our “help implement automatic scope switching and define higher level APIs by SDK components” fix while also covering “and OpenTelemetry instrumentation libraries” at the end of the sentence (both quotes from the context spec; emphasis mine)

Split out from #26 while we're discussing renaming things and other API changes #73…

As someone writing OpenTelemetry integrations for packages I don't control, the current API gives me excellent support for when those packages provide a way to get between then and their back end eg. configurable behaviour. ot_tracer:with_span/3 or /4 wrapped around the next call will restore 100% of the trace context.

If the package instead gives me a chance to get called before and after a significant operation, I can't see a safe way to restore the trace context afterward.ot_tracer:current_ctx/0 will give it to me beforehand, but there's no way to put it back without reverse-engineering the tracer and context manager and hoping nobody changes either.

Elixir demo:

explain """
No, wait. Am I using OT properly? Let's start a span:
"""

loudly do
  previous_span_ctx = :ot_tracer.start_span(tracer, "outer parent", %{parent: :undefined})
  span_ctx = :ot_tracer.start_span(tracer, "benchmark1", %{parent: previous_span_ctx})
end

explain """
... and then start a new one, which we'll pretend is for some code that crashed out without
tidying up:
"""

loudly do
  ^span_ctx = :ot_tracer.current_span_ctx(tracer)
  inner_span_ctx = :ot_tracer.start_span(tracer, "benchmark2", %{parent: span_ctx})
end

explain """
Put back our span_ctx, end_span, and we'll be back at our previous_span_ctx, right?
"""

loudly do
  :ot_tracer.set_span(tracer, span_ctx)
  :ot_tracer.end_span(tracer)

  # desired: I got the span I just ended's parent
  # assert :ot_tracer.current_span_ctx(tracer) == previous_span_ctx

  # not desired: I got the previous argument to some hidden set_span:
  assert :ot_tracer.current_span_ctx(tracer) == inner_span_ctx
end

The text was updated successfully, but these errors were encountered:

tsloughter · 2020-07-19T17:03:08Z

I don't understand what loudly is? Is it a function? A macro?

As for being able to be called before and after, isn't that the reason we use pdict for context? That way the library you are instrumenting doesn't have to be modified to be passed context, instead you start a span in the before call and end it in the after through the process dictionary. And even though the library is not directly passed a context if it calls out to another library that is also instrumented it will be able to create children.

Your concern is simply that you don't trust the context to be intact when you get to the 'end' call? Like if the children don't get end'ed and you end up calling an end to one of them because they are the current span still?

garthk · 2020-07-21T23:12:46Z

Oh, sorry, loudly is a macro which outputs its code and its result, and explain outputs its code… ends up like a Jupyter notebook but it only takes 50 lines of Elixir.

Yes, my concern is that if someone I call doesn't end their spans in exactly the right way, I can't end mine the right way unless I use ot_tracer_default:with_span/4 via ot_tracer:with_span/3. I can't duplicate what it does, because it cheats and calls ot_ctx:set_value/3 in a way I shouldn't:

-spec with_span(opentelemetry:tracer(), opentelemetry:span_name(),
                ot_span:start_opts(), ot_tracer:traced_fun(T)) -> T.
with_span(Tracer, SpanName, Opts, Fun) ->
    %% starting the span makes it active in the `tracer_ctx'
    SpanCtx = start_inactive_span(Tracer, SpanName, Opts),
    PreviousTracerCtx = current_ctx(Tracer),
    set_span(Tracer, SpanCtx),
    try
        Fun(SpanCtx)
    after
        %% passing TracerCtx directly ensures that this `end_span' ends the span started
        %% in this function and sets the active span to the previous active tracer ctx.
        %% If spans in `Fun()' were started and not finished properly they will be
        %% abandoned and it be up to the `ot_span_sweeper' to eventually remove them.
        _ = end_span(Tracer, SpanCtx),
        ot_ctx:set_value(?TRACER_KEY, ?TRACER_CTX, PreviousTracerCtx)
    end.

We're exporting start_inactive_span, current_ctx, and set_span so I can work around an inability to use with_span, but we're not exporting the set_ctx I'd need to do it properly despite the stack.

If I use set_span I can end_span, but if my caller couldn't use with_span then they'll end up with my span context afterward unless they kept their previous span context and put it back, setting up a problem wrapping up their parents… it's a mess, but it doesn't have to be. We just need ot_tracer to give ot_tracer_default a way to share its capability to properly restore its context.

garthk · 2020-07-28T22:28:22Z

Specifically, I propose:

Add callback ot_tracer:set_ctx/2
Define and export ot_tracer:set_ctx/2
Fix the spec for exported ot_tracer:end_span/2 so it matches its callback
Update ot_tracer_default et al

In this repo, that'd be something like this but without any syntax errors?

diff --git src/ot_tracer.erl src/ot_tracer.erl
index eb80d7a..f2c3c83 100644
--- src/ot_tracer.erl
+++ src/ot_tracer.erl
@@ -24,2 +24,3 @@
          with_span/4,
+         set_ctx/2,
          current_ctx/1,
@@ -51,2 +52,3 @@
 -callback current_ctx(opentelemetry:tracer()) -> tracer_ctx().
+-callback set_ctx(opentelemetry:tracer(), tracer_ctx()) -> ok.
 -callback current_span_ctx(opentelemetry:tracer()) -> opentelemetry:span_ctx().
@@ -80,5 +82,5 @@ end_span(Tracer={Module, _}) ->
 
--spec end_span(opentelemetry:tracer(), ot_tracer:tracer_ctx()) -> boolean() | {error, term()}.
-end_span(Tracer={Module, _}, TracerCtx) ->
-    Module:end_span(Tracer, TracerCtx).
+-spec end_span(opentelemetry:tracer(), ot_tracer:span_ctx()) -> boolean() | {error, term()}.
+end_span(Tracer={Module, _}, SpanCtx) ->
+    Module:end_span(Tracer, SpanCtx).
 
@@ -88,2 +90,6 @@ current_ctx(Tracer={Module, _}) ->
 
+-spec set_ctx(opentelemetry:tracer(), ot_tracer:tracer_ctx()) -> ok.
+set_ctx(Tracer={Module, _}, ) ->
+    Module:current_ctx(Tracer).
+
 -spec current_span_ctx(opentelemetry:tracer()) -> opentelemetry:span_ctx().

In open-telemetry/opentelemetry-erlang, something like this?

diff --git src/ot_tracer_default.erl src/ot_tracer_default.erl
index 6f05f77..2e5a1f2 100644
--- src/ot_tracer_default.erl
+++ src/ot_tracer_default.erl
@@ -29,2 +29,3 @@
          current_ctx/1,
+         set_ctx/2,
          current_span_ctx/1,
@@ -133,2 +134,7 @@ current_ctx(_Tracer) ->
 
+-spec set_ctx(opentelemetry:tracer(), ot_tracer:tracer_ctx()) -> ok.
+set_ctx(_Tracer, PreviousTracerCtx) ->
+    ot_ctx:set_value(?TRACER_KEY, ?TRACER_CTX, PreviousTracerCtx),
+    ok.
+
 span_module({_, #tracer{span_module=SpanModule}}) ->

End result: anyone who can't use ot_tracer:with_span/3 will be able to restore the full tracer context before they call end_span, just like ot_tracer:with_span/3 does, even if some of the code they're calling has bugs leading to end_span calls being missed. Adapting the comments in with_span:

%% Wrapping execution in `current_ctx`,`start_span` and `set_ctx`, `end_span' pairs
%% ensures we end the correct span and restore the correct tracer context.
%% If spans in between those pairs were started and not finished properly they will be
%% abandoned and it be up to the `ot_span_sweeper' to eventually remove them.

garthk · 2020-07-29T00:43:50Z

Meanwhile OT-Python removed that API surface entirely open-telemetry/opentelemetry-python#420 citing the new Context spec. Note the spec's emphasis on the calls' availability to OT instrumentation library authors as well as the internal implementation:

These operations SHOULD only be used to implement automatic scope switching and define higher level APIs by SDK components and OpenTelemetry instrumentation libraries.

… and their care around ensuring we can restore context even if people stuff up:

This operation is intended to help making sure the correct Context is associated with the caller's current execution unit. Users can rely on it to identify a wrong call order, i.e. trying to detach a Context that is not the current instance.

garthk · 2020-07-30T23:18:05Z

TL;DR of my take from the SIG meeting:

We should document current_ctx/1, set_ctx/2 and other methods intended for the SDK and OT instrumentation library authors using text similar to the Context spec quoted above.
We wouldn't need set_ctx/2 if ot_tracer_default did it by magic, eg. by storing a map of span contexts to tracer contexts in the process dictionary as @tsloughter suggested. Instrumentation library authors would call current_span/1 before start_span/3 or set_span/2, do work, call span_ctx/1 to put the previous context back, call end_span/1, and everything would be fine. It'll be fiddly, though: set_span/2 would have to avoid updating that map of NewSpanCtx to OldTracerCtx in ot_ctx if an entry was already present, for example.
You could argue set_ctx/2 is simpler, and I'll do so below for other reason, but I think the magic is [also] worth a try because it'll help our users. Get the old span, start the new one, do work, put the old span back just in case, end it. Easy.
It worries me every tracer team would need to duplicate that effort, but maybe there'll never be one? Maybe the stack popping behaviour is an implementation detail of ot_tracer_default only, and they wouldn't copy it. Dunno.
Given the magic might be fiddly and time consuming, I think we should add set_ctx/2 with its scary warning to cover our bases in the mean time.

tsloughter · 2020-07-31T14:10:22Z

Note, I didn't mean a map of span contexts to trace contexts. That would be bad for garbage collection and memory copying.

Instead we would have a process dict like:

Key	Value
current_span_ctx	SpanCtx3
SpanCtx3	[SpanCtx2, SpanCtx1]
SpanCtx2	[SpanCtx1]
SpanCtx1	[]

Ending a span either takes no argument and uses current_span_ctx from the pdict to find the span to end or it can be passed the span ctx in which case it looks up the tracer context -- essentially a list of spans (the span stack).

So if your code created SpanCtx2 and then some third party code it calls creates SpanCtx3 but screws up and doesn't close it when you close SpanCtx2 it will result in SpanCtx1 being the new current_span_ctx.

But there is a clear memory leak here where unended spans stick around forever, along with their list of ancestors (trace context).

Maybe it is worth doing and simply documenting this potential leak while offering a flush function. We could also flush anytime the current span context becomes empty -- though that isn't technically going to resolve the issue because its possible the span never ends while it continues to loop and a child span not getting closed.

Tough call...

garthk · 2020-08-27T23:16:28Z

Yikes didn't see this a month ago; sorry.

Staying zoomed out, strikes me this helps OT instrumentation library authors deal with tricky integration situations, which I keep forgetting was my original goal.

Double-checking: call ot_tracer:start_inactive_span/3 and ot_tracer:set_span/2 at one end, call ot_tracer:end_span/2 at the other, and you'll be back where you started, no? Litmus test: we'd be able to remove the calls to current_ctx/0 and ot_ctx:set_value/3 from ot_tracer.with_span/4, right?

If you're not all 😬 I'm all 😀 as that author because the obvious method is correct. Relief!

Zooming in a little:

I'm comforted knowing the span count will tend to stay predictable unless someone repeatedly starts a span, crashes out of that call stack without ending it, rescues before the process dies, and lives to start another span again. That'll balloon with the current context storage, too.

If a map of span contexts to trace contexts is bad for GC and memory copying, but using the process dictionary to map span contexts to trace contexts is OK, what data structure is the process dictionary using? Can we use it, too? Feels weird taking that much key-space in a shared data structure.

How slow is too slow, re flush/0? timer:tc(fun erlang:get_keys/0) consistently returns with 0 or 1 as its measurement in microseconds. If I add 1000 values with keys by erlang:make_ref/0, it's 1..27µs… the high end worries me but it'll only happen under exceptional circumstances. (Literally.) Perhaps we can just do it automatically every time we end a span?

tsloughter · 2020-08-28T13:04:43Z

I think we should try my example of keying on the span ctx.

I realized even the current solution is bad if the number of levels of spans is large because its all a single pdict value meaning lots of copying and GC.

tsloughter · 2020-08-28T16:44:07Z

Or should context be something like:

Key	Value
`current_context`	`#{ref => #Ref<0.1.1>, parent => #Ref<0.1.0>, active_span_ctx => _, correlation_context => _}`
`{context, Ref}`	_
`{context, Ref}`	_

And ot_ctx:get_context returns either the whole context (these are immutable) or the reference to it.

Ending a span sets the current_context to the parent context if there is one and you can also set it manually with the whole context or the reference.

tsloughter · 2020-08-28T17:24:31Z

Instead of active_span_ctx it should probably be span_ctxs and a list.

Ending a span by passing a SpanCtx to ot_span does not change the context, it simply ends the that Span. Calling ot_tracer:end_span() would end the span at the head of the list and pop it.

To do what you need would instead be like:

CtxToken = ot_ctx:current(),

%% if a current context exists this creates a copy of it so we still have the correlation_context
NewCtxToken = ot_ctx:new(),

SpanCtx1 = ot_tracer:start_span(span1),
ot_span:end_span(SpanCtx1),

ot_ctx:set_context(CtxToken).

tsloughter closed this as completed Oct 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Official means to put the full trace context back #74

Official means to put the full trace context back #74

garthk commented Jul 19, 2020 •

edited

Loading

tsloughter commented Jul 19, 2020

garthk commented Jul 21, 2020

garthk commented Jul 28, 2020 •

edited

Loading

garthk commented Jul 29, 2020

garthk commented Jul 30, 2020

tsloughter commented Jul 31, 2020

garthk commented Aug 27, 2020

tsloughter commented Aug 28, 2020

tsloughter commented Aug 28, 2020 •

edited

Loading

tsloughter commented Aug 28, 2020

Official means to put the full trace context back #74

Official means to put the full trace context back #74

Comments

garthk commented Jul 19, 2020 • edited Loading

tsloughter commented Jul 19, 2020

garthk commented Jul 21, 2020

garthk commented Jul 28, 2020 • edited Loading

garthk commented Jul 29, 2020

garthk commented Jul 30, 2020

tsloughter commented Jul 31, 2020

garthk commented Aug 27, 2020

tsloughter commented Aug 28, 2020

tsloughter commented Aug 28, 2020 • edited Loading

tsloughter commented Aug 28, 2020

garthk commented Jul 19, 2020 •

edited

Loading

garthk commented Jul 28, 2020 •

edited

Loading

tsloughter commented Aug 28, 2020 •

edited

Loading