Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[api] Fix Baggage context leak issue #5208

Merged
merged 9 commits into from
Jan 16, 2024

Conversation

CodeBlanch
Copy link
Member

@CodeBlanch CodeBlanch commented Jan 10, 2024

Fixes #3257

Changes

  • Adds nullable annotations for the Baggage API.
  • Removes the wrapper object to prevent context leaks.

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes

@CodeBlanch CodeBlanch added the pkg:OpenTelemetry.Api Issues related to OpenTelemetry.Api NuGet package label Jan 10, 2024
@CodeBlanch CodeBlanch requested a review from a team January 10, 2024 17:28
Copy link

codecov bot commented Jan 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (6250307) 83.38% compared to head (6a4870f) 83.09%.
Report is 7 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5208      +/-   ##
==========================================
- Coverage   83.38%   83.09%   -0.29%     
==========================================
  Files         297      271      -26     
  Lines       12531    11956     -575     
==========================================
- Hits        10449     9935     -514     
+ Misses       2082     2021      -61     
Flag Coverage Δ
unittests ?
unittests-Solution-Experimental 83.08% <100.00%> (?)
unittests-Solution-Stable 83.03% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
src/OpenTelemetry.Api/Baggage.cs 100.00% <100.00%> (ø)

... and 32 files with indirect coverage changes

Copy link
Member

@reyang reyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to understand if there is a valid use case for TryGetBaggage, and when should the user choose GetBaggage vs. TryGetBaggage.

Comment on lines 269 to 272
Assert.Equal(4, Baggage.Current.Count);
Assert.Equal(2, Baggage.Current.Count);
Assert.Equal("value1", Baggage.GetBaggage("key1"));
Assert.Equal("value2", Baggage.GetBaggage("key2"));
Assert.Equal("value3", Baggage.GetBaggage("key3"));
Copy link
Member

@alanwest alanwest Jan 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To some, this could seem like a very nuanced and perhaps unintuitive behavioral difference that stems from a call to an async method versus a synchronous method call. For documentation purposes, I suggest adding a synchronous test that looks almost identical just to further highlight the nuanced behavior. Something like:

    [Fact]
    public void NonAsyncLocalTests()
    {
        Baggage.SetBaggage("key1", "value1");

        InnerTask();

        Baggage.SetBaggage("key4", "value4");

        Assert.Equal(4, Baggage.Current.Count);
        Assert.Equal("value1", Baggage.GetBaggage("key1"));
        Assert.Equal("value2", Baggage.GetBaggage("key2"));
        Assert.Equal("value3", Baggage.GetBaggage("key3"));
        Assert.Equal("value4", Baggage.GetBaggage("key4"));

        static void InnerTask()
        {
            Baggage.SetBaggage("key2", "value2");
            Baggage.SetBaggage("key3", "value3");

            Assert.Equal(3, Baggage.Current.Count);
            Assert.Equal("value1", Baggage.GetBaggage("key1"));
            Assert.Equal("value2", Baggage.GetBaggage("key2"));
            Assert.Equal("value3", Baggage.GetBaggage("key3"));
        }
    }

That said, though... while I understand the nuanced behavior, how would we want to explain this difference to end users? Can you add some additional description of the why in the PR description? I don't think as it stands it will be super clear to folks how this is more conformant with the specificaton.

Copy link
Member

@alanwest alanwest Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sharing my speculation on what I think (most?) users would intuitively expect in few different scenarios. Admittedly these things can be very subjective, however, this is based on my own experience supporting product features similar to the baggage api.

    public void BaggageBehaviorWithSyncCall()
    {
        Baggage.SetBaggage("key1", "value1");
        InnerTask();
        Assert.Equal(2, Baggage.Current.Count);
        Assert.Equal("value1", Baggage.GetBaggage("key1"));
        Assert.Equal("value2", Baggage.GetBaggage("key2"));

        static void InnerTask()
        {
            Baggage.SetBaggage("key2", "value2");
        }
    }

    public async Task BaggageBehaviorWithAsyncCall()
    {
        Baggage.SetBaggage("key1", "value1");
        await InnerTask();
        Assert.Equal(2, Baggage.Current.Count);
        Assert.Equal("value1", Baggage.GetBaggage("key1"));
        Assert.Equal("value2", Baggage.GetBaggage("key2"));

        static async Task InnerTask()
        {
            Baggage.SetBaggage("key2", "value2");
            await Task.Yield();
        }
    }

    public async Task BaggageBehaviorWithTaskRun()
    {
        Baggage.SetBaggage("key1", "value1");
        Task.Run(InnerTask).GetAwaiter().GetResult();
        Assert.Equal(1, Baggage.Current.Count);
        Assert.Equal("value1", Baggage.GetBaggage("key1"));

        static void InnerTask()
        {
            Baggage.SetBaggage("key2", "value2");
        }
    }

I think my Task.Run scenario represents what folks are after in #3257 and I agree that it is the ideal behavior in this specific scenario most of the time - i.e., a parallel operation should ideally have it's own context. I'd go as far to say that this expectation is sometimes not limited to just baggage, but also to trace context, etc.

When it comes to async/await calls, we're often not talking about a parallel operation in the CPU-bound kind of sense, but rather an I/O-bound operation that is logically considered part of the same unit of work as the caller. In this way, I find that people expect this scenario to behave in the same way as though they had made a synchronous call.

Copy link
Member

@reyang reyang Jan 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When it comes to async/await calls, we're often not talking about a parallel operation in the CPU-bound kind of sense, but rather an I/O-bound operation that is logically considered part of the same unit of work as the caller. In this way, I find that people expect this scenario to behave in the same way as though they had made a synchronous call.

This is also subjective and murky - for example, an async/await call to an authentication library which interacts with backend services. Here is my suggestion when we encounter API design challenges that are subjective:

  1. Follow the specification, if the spec does not make sense, propose a fix to the spec (which is good forcing function to make people think deeper).
  2. Prioritize fail-fast. In this case, if the async/await call set the Baggage inside a Task and the developers were expecting the Baggage update from the caller's view, they will immediately notice things didn't work then figure out why. On the other side, context/baggage leakage is very hard to troubleshoot and would normally be discovered at much later stage of the development (and is normally a privacy/security incident).
  3. Provide explicit way for people to achieve special scenarios, e.g. here are some potential solutions Baggage.Current gets shared/overwritten when running multiple parallel tasks? #3257 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow the specification, if the spec does not make sense, propose a fix to the spec (which is good forcing function to make people think deeper).

Fully agree with all of these principles, but I think this is the most pertinent one at the moment. I am not yet convinced that this PR brings us in closer alignment with the spec, but I do agree that it may be because the spec needs to be fixed if folks think this PR is a more correct behavior. As it stands, I do not find the new behavior proposed in this PR is any less murky or subjective.

Related, there exists critique that OpenTelemetry .NET does not implement the Context spec at all. See: open-telemetry/oteps#232 (comment).

I think we should hold on this PR and do some deeper thinking.

Provide explicit way for people to achieve special scenarios, e.g. here are some potential solutions Baggage.Current gets shared/overwritten when running multiple parallel tasks? #3257 (comment)

I support @CodeBlanch's workaround here #3257 (comment)

there is a workaround you can use right now: Call RuntimeContext.SetValue("otel.baggage", null)

This is effectively how Activity.Current is designed. If I spin up parallel work and wish to force a new trace I can set Activity.Current to null at the start of my new thread. To me, it makes more sense to try and keep the behavior of Baggage.Current aligned with the behavior of Activity.Current.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could be more creative. Here is an idea: https://github.com/CodeBlanch/opentelemetry-dotnet/compare/api-baggage-context...CodeBlanch:opentelemetry-dotnet:api-baggage-context-flow?expand=1

What I'm doing there is add an enum so users can explicitly control the behavior.

public async Task InnerTask()
{
    // Explicitly fork baggage for downstream call
    Baggage.SetBaggage("key2", "value2", flowBehavior: BaggageFlowBehavior.Fork);

    await CallSomething();
}

public async Task InnerTask()
{
    // Modify baggage as it works today. This will flow back
    Baggage.SetBaggage("key2", "value2", flowBehavior: BaggageFlowBehavior.ModifyCurrent);

    await CallSomething();
}

public async Task InnerTask()
{
    // Whatever the default behavior is. Right now it is Fork
    Baggage.SetBaggage("key2", "value2");

    await CallSomething();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another creative solution: https://github.com/CodeBlanch/opentelemetry-dotnet/compare/api-baggage-context...CodeBlanch:opentelemetry-dotnet:api-baggage-context-mutate?expand=1

This is sort of an attempt to do what Reiley proposed ("Get a handle of the current baggage and pass the handle to other methods"):

        Baggage.SetBaggage("appInstanceId", "abcd");

        await ReadTenantIdFromDatabase();

        // Baggage.Current here will contain "appInstanceId" & "tenantId"

        async Task ReadTenantIdFromDatabase()
        {
            // Simulate async call to DB
            await Task.Yield();

            Baggage.MutateCurrentBaggage(b => b.SetBaggage("tenantId", "1234"));
        }

Copy link
Member

@alanwest alanwest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before moving forward with this PR, I think we need to have more foundational discussions with regards to how OpenTelemetry .NET aligns (or does not align) with the Context specification - baggage only being one part of the broader requirements.

#5208 (comment)

@CodeBlanch CodeBlanch marked this pull request as draft January 12, 2024 21:26
@CodeBlanch CodeBlanch mentioned this pull request Jan 12, 2024
4 tasks
@liutiexing
Copy link

liutiexing commented Jan 15, 2024

@alanwest @CodeBlanch @reyang
Async calls are so common in Azure services. We can't call RuntimeContext.SetValue everywhere when we use Baggage APIs. These workarounds will be everywhere and destroy the maintainability.
We need an Immutable container (container changes inside an async Task are invisible in parent Task).
Spec said The Baggage container MUST be immutable, so that the containing Context also remains immutable.
I think current implementation of Baggage is partially mutable.
Baggage.Current is based on AsyncLocal. AsyncLocal is so fundamental to the CLR and so broadly used in various libraries that it is de facto the .NET standard for how to propagate ambient context. If we are to have different mechanisms and semantics for capturing and propagating ambient context, the APIs should be different enough to prevent any confusion. The OTel Baggage uses AsyncLocal, so it inherits some of the behavior but then defeats the rest sneakily through that "holder" hack.
I found OpenTelemetry C++ just provides Baggage, but no Baggage.Current. I think this is because C++ has no standard de facto parallel mechanism. If we have a dispute about thread/parallel behavior, mabe we can provide only Baggage without Baggage.Current (split Baggage and thread/parallel behavior), let the users implements their own Baggage.Current.

Not sure, according to my understanding, PropagationContext is the implementation of Context in Spec.

@alanwest
Copy link
Member

alanwest commented Jan 15, 2024

Spec said The Baggage container MUST be immutable, so that the containing Context also remains immutable.

💯 I did not make this clear in my initial input on this PR, but I fully agree that mistakes have been made in the design of OpenTelemetry's baggage API. The idea that the baggage API has semantics similar to HttpContext.Current is a faulty premise and is definitely a violation of the specification.

I think current implementation of Baggage is partially mutable.

This is the part that bugs me as well. This PR seeks to address this, but I think it still falls short. For example, I'm uneasy with the fact that the following code behaves differently whether MakeHttpCall is an async method or not.

void Main() {
  MakeHttpCall();
  // Baggage.Current contains key/value at this point in the execution.
  // However, if the method were `async Task MakeHttpCall()` it will not contain key/value.
}

void MakeHttpCall() {
  Baggage.SetBaggage("key", "value");
}

Most importantly, I'm uneasy with the fact that, even with this PR, Baggage still has mutable properties.

Baggage.Current is based on AsyncLocal. AsyncLocal is so fundamental to the CLR and so broadly used in various libraries that it is de facto the .NET standard for how to propagate ambient context.

Baggage.Current and our underlying use of AsyncLocal definitely play a role, but I do not think they are the root of our design problems. I think the design of OpenTelemetry .NET's API is flawed. This is a rough idea, but I think the following is a better API design and would achieve spec compliance:

void Main() {
  using (var scope1 = ContextBuilder.NewContext().WithBaggage("key1", "key1").StartScope()) {
    // Baggage.Current contains key1/value1
    using (var scope2 = ContextBuilder.NewContext().WithBaggage({"key2", "value2"}).StartScope()) {;
      // Baggage.Current only contains key2/value2 here.
      
      // If you also wanted key1/value1 in baggage here, then it would need to be explicit.
      // Something like:
      // var scope2 = ContextBuilder.FromContext(scope1).WithBaggage({"key2", "value2"})
      // The context of scope2 would be derived from scope1 and key2/value2 would be additive
    }
    
    // scope2 has been disposed and scope1 has been restored, and
    // Baggage.Current contains key1/value1
  }  
}

In this new design, the baggage API would still have a Baggage.Current and it would still use AsyncLocal to propagate the scope across thread boundaries, but the important part is that Baggage would not be mutable under this design.


I have had some more time to consider this PR, and while I do not think it addresses our spec compliance problems, I believe that it does improve on the behaviors of our non-compliant baggage API, and I am comfortable considering the current behavior buggy. That said, it is a breaking change and I suspect the new behavior will be undesirable to a different set of our end users.

In order to truly fix our spec compliance issues in our existing APIs, I think would require us to make source-breaking changes, and I think we should avoid this.

For this reason, I will approve this PR, but I think we should consider how we can introduce new, actual spec-compliant baggage APIs in a future PR. The old APIs could remain, but would ideally be removed in a future major version.

Copy link
Member

@alanwest alanwest left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving based on my comments here #5208 (comment).

@CodeBlanch CodeBlanch marked this pull request as ready for review January 16, 2024 17:46
@CodeBlanch CodeBlanch merged commit 90a9c6c into open-telemetry:main Jan 16, 2024
37 checks passed
@CodeBlanch CodeBlanch deleted the api-baggage-context branch January 16, 2024 18:23
@CodeBlanch
Copy link
Member Author

I merged this because we have enough approvals but I'm not sure anyone's thinking is totally correct 😄 I'll provide my thoughts just for fun.

The Baggage container MUST be immutable, so that the containing Context also remains immutable.

Context in this context (🤣) is describing a structure IMO.

public readonly struct Context
{
    public SpanContext SpanContext { get; }
    public Baggage Baggage { get; }
}

public readonly struct SpanContext {}
public readonly struct Baggage {}

If two threads have the same context instance, they shouldn't impact each other:

// thread 1
var newContext = someContext.SetBaggage("b2, "v2");
// Will never see b3

// thread 2
var newContext = someContext.SetBaggage("b3, "v3");
// Will never see b2

We don't have the Context structure. We have public readonly struct ActivityContext and public readonly struct Baggage. But both DO have the immutable behaviors so the above will hold true. I don't think the Baggage API is fundamentally flawed in that regard.

Now we have this thing...

public readonly struct Baggage
{
    public static Baggage Current { get; set; }
}

Baggage.Current is an ambient way to access the current "Context" \ Baggage.

This is where it gets less clear for me. The spec seems to bless that languages MAY provide some kind of global ambient thing if the language has support for it. But it doesn't prescribe exactly how it should work.

Baggage.Current.getter is like the spec Get current Context. No issues here I don't think.

Baggage.Current.setter performs some of what "Attach" and "Detach" are supposed to do (by the spec). If I could do it over again, we probably wouldn't have that setter. We could have public static IDisposable Attach(Baggage baggage) instead (which would be much closer to the spec).

The question for me is what happens when that "Attach" (setter) is called. Today the ambient context is like a scoped service. The spec says it should...

Associates a Context with the caller's current execution unit.

What is "execution unit" meaning here? Not sure the spec defines that clearly. Scope of the request could be the execution unit. In .NET we have AsyncLocal and ExecutionContext, that could also be the "execution unit" and that's what this PR does. Since Activity.Current also behaves this way, probably good to be consistent.

PS: If we say in .NET our "execution unit" is going to be ExecutionContext then I don't know if we need the global "attach" and "detach" operations because those APIs exist on ExecutionContext already ("capture" and "restore") 🤷

CodeBlanch added a commit to CodeBlanch/opentelemetry-dotnet that referenced this pull request Jan 17, 2024
@danelson
Copy link
Contributor

Adding some end user context as mentioned in the SIG. It is my understanding that after this PR was merged the following unit test which used to pass will now fail.

public async Task BaggageBehaviorWithAsyncCall()
{
    Baggage.SetBaggage("key1", "value1");
    await InnerTask();
    Assert.Equal(2, Baggage.Current.Count); // will instead be 1
    Assert.Equal("value1", Baggage.GetBaggage("key1"));
    Assert.Equal("value2", Baggage.GetBaggage("key2")); // will not exist

    static async Task InnerTask()
    {
        Baggage.SetBaggage("key2", "value2");
        await Task.Yield();
    }
}

This is going to cause a lot of issues for my organization. At my organization it is very common to execute a similar workflow as above where a value is retrieved in a nested async method (say from the database) and then that value is immediately set on the baggage. Requiring changes in method signatures to pass the information up the call stack will be quite an undertaking. If this was the behavior from the beginning I could understand, but changing this now seems pretty severe to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg:OpenTelemetry.Api Issues related to OpenTelemetry.Api NuGet package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Baggage.Current gets shared/overwritten when running multiple parallel tasks?
5 participants