
[sampling] add distributed tracing capabilities #325

Closed
ufoot wants to merge 41 commits from the christian/issampled branch

Conversation

Contributor

@ufoot ufoot commented Aug 10, 2017

This is a work in progress and must not be merged until ready

@palazzem palazzem self-requested a review August 11, 2017 07:48
@palazzem palazzem added the wip label Aug 11, 2017
@palazzem palazzem changed the title [distributed sampling] WIP, do not merge [sampling] add distributed tracing capabilities Aug 11, 2017
@ufoot ufoot force-pushed the christian/issampled branch 3 times, most recently from 09b2a0a to 9e22203 Compare August 14, 2017 08:54
@LotharSee LotharSee force-pushed the christian/issampled branch 7 times, most recently from d5db637 to a0198b4 Compare August 24, 2017 13:06

@palazzem palazzem left a comment


I will split the overall feedback into two different points:

  1. The API
    I think the primitives we currently have make this PR difficult to use. Mostly, we don't have a clean way to set a new Context as active (missing API), and creating two Contexts will result in wrong data. Also, the fact that the Context is not immutable doesn't help much when working around the current API. Whatever change we make here could result in a breaking change in the future.

  2. The distributed tracing
    Having a sampling_priority alongside a local sampler that takes priority is a good thing. We may cover scenarios where a distributed system sets the overall sampling and:

  • the system doesn't want to sample many traces, but the local sampler does because developers spotted an issue and want all traces for "some time"
  • the system wants to sample all the traces, but the local sampler doesn't because of the really high span cardinality of the underlying system
  • any possible combination of the points above with <put_here_your_reason_to_sample_or_not> seems well covered by the client itself

I probably still need to see the big picture because it's a WIP, but overall I think keeping two different samplers is the way to go. What I'm missing:

  • how the Span sampling interacts with the Context sampling
  • the Writer receives a trace only if the Context is sampled:
    def record(self, context):
        """
        Record the given ``Context`` if it's finished.
        """
        # extract and enqueue the trace if it's sampled
        trace, sampled = context.get()
        if trace and sampled:
            self.write(trace)

    (probably it's one of the missing steps because it's a [WIP])

self._sampled = sampled
self._sampling_priority = sampling_priority

def get_context_attributes(self):


Do we need to change the private API? It was kept internal because a possible refactoring could make the Context immutable so we can get rid of the lock. Anyway, it's something to keep in mind. Let me think of a possible alternative.
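
For illustration only, a minimal sketch of the kind of immutable Context such a refactoring could produce (a hypothetical shape, not this library's API):

    from collections import namedtuple

    # Hypothetical immutable context: all fields are fixed at creation time,
    # so readers never observe a half-updated value and no lock is needed.
    FrozenContext = namedtuple(
        'FrozenContext', ['trace_id', 'span_id', 'sampled', 'sampling_priority']
    )

    def with_priority(ctx, priority):
        # "mutation" returns a new value instead of updating in place
        return ctx._replace(sampling_priority=priority)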

Contributor

I don't especially defend or like this API.
What is sure is that we will need, somewhere, a public way to access the current tuple (tid, sid, priority) for remote propagation (both for our own integrations and for customers to instrument their own inter-service propagation). Do you agree?
After that, it could live anywhere.
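
For illustration, a minimal sketch of such an accessor used for outbound propagation (the helper and the header names are assumptions, not this PR's API; get_call_context() and get_context_attributes() are the calls discussed in this thread):

    def inject_headers(tracer, headers):
        # Hypothetical helper: copy the current (trace_id, span_id, priority)
        # tuple into outbound headers for the downstream service.
        ctx = tracer.get_call_context()
        trace_id, span_id, priority = ctx.get_context_attributes()
        headers['x-trace-id'] = str(trace_id)
        headers['x-parent-id'] = str(span_id)
        if priority is not None:
            headers['x-sampling-priority'] = str(priority)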

@@ -29,24 +31,27 @@ def attach_context(request):
service = app[CONFIG_KEY]['service']
distributed_tracing = app[CONFIG_KEY]['distributed_tracing_enabled']

context = tracer.get_call_context()


Technically this must be None. If it's not and distributed_tracing == True, it means we're going to create a new Context object using this one as a child_of. In the current API, having two different Context objects in one request means having two different traces (and that's wrong).

Also, manually creating a Context doesn't set it as active. This means the request_span lives in a Context that is not propagated, so any attempt to use start_span() or trace() will add the new Span to a different (wrong) context.
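
A minimal sketch of the pitfall described above (signatures mirror the snippets quoted elsewhere in this thread and are not guaranteed to match the released library):

    def handle_request_sketch(tracer, Context, trace_id, parent_id):
        # A manually created Context is not registered as the active one...
        remote_ctx = Context(trace_id=trace_id, parent_id=parent_id)
        request_span = tracer.start_span('web.request', child_of=remote_ctx)

        # ...so the thread-local Context returned here is a different object,
        # and spans started via trace()/start_span() land in that other
        # (wrong) context instead of the request's context.
        active_ctx = tracer.get_call_context()
        return request_span, active_ctx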

Contributor

@LotharSee LotharSee Aug 30, 2017

When implementing, I only looked at ThreadLocalContext (which, when there is no context, creates one and puts it in the current thread-local) and assumed similar behavior (I should have checked!).

So what is the proper thing to do there?

@@ -90,6 +92,7 @@ def __init__(

# sampling
self.sampled = True
self.priority = None


Technically, sampled and priority are duplicates of what we have in the Context. I'm worried that making the Context immutable is a requirement for a cleaner and stable API.


Forgot to say that probably all these "flags" should live only in the Context. Am I missing something about sampled traces with discarded child spans?

Contributor

Overall I agree both should be moved there.

So far we had sampled in the span (old, from before the Context entity), but both should be moved.
I don't know how much pain that would be though (since we used to check span.sampled in various integrations); maybe it could wait for a second PR?

Contributor

But yes, we have to fix this within the scope of this PR.
If we go with the Context approach to get/set the priority, let's not store it here. Plus, whatever method we provide to get/set it has to hit the right place.

(For now, span.set|get_sampling_priority won't work with context.get_context_attributes.)

    parent_span_id = parent.span_id
    sampling_priority = parent.get_sampling_priority()
else:
    trace_id, parent_span_id, sampling_priority = context.get_context_attributes()


This doesn't work with the current API. We don't have a set_call_context() (the complete API was proposed for OpenTracing but we never made that change for ourselves), which means we can't do:

    ctx = Context(trace_id=trace_id, parent_id=parent_id, sampling_priority=sampling_priority)
    tracer.set_call_context(ctx)  # we don't have this API

This means that context.get_context_attributes() always returns the values of the first Context created when you start_span() or trace(). Because the Context is mutable, to get the right values users should do:

    ctx = tracer.get_call_context()
    ctx.set_context_attributes(...)
    # use start_span() or trace() as usual

Of course, the fact that the Context is mutable is something that is planned to change.

Contributor

This line wasn't asserting that trace_id / parent_id / sampling_priority could change over time, nor that we would need a set_call_context.
This context.get_context_attributes was for the case where we create a root span from a context (from a remote tid/sid/priority, like in the aiohttp change) instead of from a parent.

@@ -50,13 +56,28 @@ def get_current_span(self):
        with self._lock:
            return self._current_span

    def _set_current_span(self, span):


Because the class is meant to be thread-safe, whatever API we provide should use the internal lock for any change, even if it's an internal API.
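
For illustration, a minimal sketch of a lock-protected internal setter (the class name is hypothetical; the lock usage mirrors get_current_span above):

    import threading

    class ContextSketch(object):
        def __init__(self):
            self._lock = threading.Lock()
            self._current_span = None

        def _set_current_span(self, span):
            # take the same lock as readers so updates are never observed half-done
            with self._lock:
                self._current_span = span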

Contributor

That was just to avoid duplicating code between add_span and close_span, which themselves take the lock.
If we think that's too dangerous even for an internal method, we can inline it in those two functions.

if not span._parent:
    span.set_tag(system.PID, getpid())

# TODO: add protection if the service is missing?


I think it's out of the scope of this PR.

# When doing client-side sampling, keep the sample rate so that we can
# scale up statistics in the next steps of the pipeline.
if isinstance(self.sampler, RateSampler):
    span.set_metric(SAMPLE_RATE_METRIC_KEY, self.sampler.sample_rate)


set_metric() is a hot topic. We even have a discussion about removing it because it's confusing (it doesn't set a metric, it sets a tag/meta).

Contributor

Yes. That's for compatibility with what we are currently doing (SAMPLE_RATE_METRIC_KEY has always been a metric).

We are likely to change it in the future.

    self.priority = None
else:
    try:
        self.priority = int(sampling_priority)


Are the only possible values here 0 or 1?

Contributor

Long term, it can be any int.
For now, we have 0 == not sampled, 1 == sampled by our sampler. And very soon, 2 == explicitly sampled by the user.
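
For illustration, the convention described above could be captured as constants (names assumed, not part of this PR):

    SAMPLING_PRIORITY_REJECT = 0     # not sampled: the agent may drop the trace
    SAMPLING_PRIORITY_AUTO_KEEP = 1  # sampled by the library's sampler
    SAMPLING_PRIORITY_USER_KEEP = 2  # explicitly sampled by the user (planned)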

else:
    if self.distributed_sampler:
        # If dropped by the local sampler, distributed instrumentation can drop it too.
        span.set_sampling_priority(0)


That approach seems legit to me.

    os.environ.get('TEST_DATADOG_INTEGRATION', False),
    'You should have a running trace agent and set TEST_DATADOG_INTEGRATION=1 env variable'
)
class TestRateByService(TestCase):


What's the purpose of this test? TestAPITransport already checks all of this.

Contributor

Christian anticipated a bit; that's a feature that isn't available yet.

Contributor

LotharSee commented Aug 30, 2017

Thanks for the deep review! To answer your points:

  1. The API. I don't especially care what it looks like. At a high level, the only things we need are:
  • a public API to access the current trace_id/span_id/sampling_priority.
  • a public API to create a local trace from remote/propagated trace_id/span_id/sampling_priority.

And when I say a public API, it can be any combination of public methods, as long as it is clear and documented.

  2. The mix of two samplers is a way to combine the two approaches in a compatible way (see the sketch after this list).
  • sampler is the older client-side sampling, simply dropping data according to a rate. When a trace is dropped, it doesn't even reach the Agent, so no stats, but also no performance footprint. In practice it isn't used that much. Disabled by default.
  • distributed_sampler is the one deciding in advance whether the trace should be dropped by the Agent, and it propagates this decision.
    The goal: have it enabled by default, applying a per-service sampling rate provided by the Agent.

I guess in the future we will be able to merge these two together, but that seemed too dangerous to consider this early.

  3. If we put all of these in the immutable Context, there is the problem of being able to update the sampling_priority after root creation.
    Does that mean we should move these attributes back to the Span? Here or there, we can have it cleaned up and unified in a second PR.
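
For point 2, a rough sketch of how the two samplers might combine inside the tracer (illustrative only; sample(), span.sampled and set_sampling_priority() are the names used elsewhere in this thread, the rest is assumed):

    class DualSamplingSketch(object):
        """Illustrative only: a local rate sampler plus a priority sampler."""

        def __init__(self, sampler, distributed_sampler=None):
            self.sampler = sampler                          # old client-side sampler
            self.distributed_sampler = distributed_sampler  # decides what the Agent keeps

        def apply(self, span):
            if not self.sampler.sample(span):
                # dropped locally: the trace never reaches the Agent
                span.sampled = False
                if self.distributed_sampler:
                    span.set_sampling_priority(0)
                return
            span.sampled = True
            if self.distributed_sampler:
                # still send the trace, but tell the Agent whether to keep it
                keep = self.distributed_sampler.sample(span)
                span.set_sampling_priority(1 if keep else 0)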

@ufoot ufoot force-pushed the christian/issampled branch from 9126ac6 to b0337fe Compare September 4, 2017 15:10
@@ -1 +1,2 @@
FILTERS_KEY = 'FILTERS'
SAMPLING_PRIORITY_KEY = 'sampling.priority'
Contributor

@LotharSee LotharSee Sep 4, 2017

While our initial implementation is still "experimental" and likely to change a lot in the future, what do we think about using a non-final key for this?

Something like _sampling_priority_v1; that way upgrades of our sampling logic will be much simpler.

Using the exact same OT key isn't important, right? It doesn't especially make sense anyway since we aren't compatible, and the exact meaning of this value isn't properly defined by OT.

cc @palazzem @ufoot

@ufoot ufoot force-pushed the christian/issampled branch 3 times, most recently from 8c33f56 to 7c3d1db Compare September 4, 2017 16:25
# Apply the default configuration
self.configure(
    enabled=True,
    hostname=self.DEFAULT_HOSTNAME,
    port=self.DEFAULT_PORT,
    sampler=AllSampler(),
    # TODO: by default, a ServiceSampler periodically updated
    distributed_sampler=RateByServiceSampler(),
Contributor

In fact, as a default we will want distributed_sampler to be None so that we don't use the priority logic and keep using the signature sampling (which is much more mature and more interesting for non-distributed traces).

@ufoot ufoot force-pushed the christian/issampled branch 5 times, most recently from 2b81010 to 0c7cfe3 Compare September 11, 2017 13:07
@ufoot ufoot force-pushed the christian/issampled branch from b240cad to 1f8e5af Compare September 18, 2017 11:33
@ufoot ufoot force-pushed the christian/issampled branch from 1f8e5af to e8f4387 Compare September 18, 2017 11:40
@ufoot ufoot force-pushed the christian/issampled branch from d2e221b to 1dd59b8 Compare September 25, 2017 11:11
The former code would break when a key disappeared from the agent.
This rarely happens in practice, as a given agent often has a rather
constant set of services it consumes, but it still had to be fixed.
@ufoot ufoot force-pushed the christian/issampled branch from 1dd59b8 to b9b9d87 Compare September 25, 2017 12:19
@ufoot ufoot force-pushed the christian/issampled branch from 0f16839 to 5b9d9d2 Compare September 25, 2017 14:59
@ufoot ufoot force-pushed the christian/issampled branch from 3c1ee05 to 48897c0 Compare September 29, 2017 09:41
@ufoot ufoot removed the wip label Oct 4, 2017
@@ -55,8 +82,8 @@ def send_traces(self, traces):
response = self._put(self._traces, data, len(traces))

# the API endpoint is not available so we should downgrade the connection and re-try the call
if response.status in [404, 415] and self._compatibility_mode is False:
log.debug('calling the endpoint "%s" but received %s; downgrading the API', self._traces, response.status)
if response.status in [404, 415] and self._fallback:
Contributor Author

Here @palazzem I'd like to have your opinion. I thought it was OK to cascade down from v0.4 to v0.3 to v0.2, but I'm open to other possibilities. The advantage of doing this is that here I only care about the endpoint; the question of knowing whether the answer is JSON or a plain OK\n string is handled later. This keeps the coupling low between this chunk of code and the one handling the JSON.
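
A rough sketch of the cascade described above (illustrative only; the endpoint strings and the fallback mapping are assumptions, while _put() and _traces mirror the diff):

    # Hypothetical fallback chain: each endpoint knows which older one to try next.
    FALLBACKS = {
        'v0.4/traces': 'v0.3/traces',
        'v0.3/traces': 'v0.2/traces',
        'v0.2/traces': None,
    }

    def send_with_downgrade(api, data, count):
        response = api._put(api._traces, data, count)
        # 404/415 means the agent doesn't know this endpoint: downgrade and retry
        while response.status in (404, 415) and FALLBACKS.get(api._traces):
            api._traces = FALLBACKS[api._traces]
            response = api._put(api._traces, data, count)
        return response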

if body.startswith('OK'):
    # This typically happens when using a priority-sampling enabled
    # library with an outdated agent. It still works, but priority sampling
    # will probably send too many traces, so the next step is to upgrade the agent.
Contributor Author

Here @palazzem, when we reach this point we could downgrade, disable priority sampling, and switch to protocol v3. OTOH it introduces coupling between components and some complexity, I think.

@LotharSee
Contributor

One tip: to simplify this PR, I'd suggest extracting the commits around "set the service at root span creation" into a different PR which we could merge first.

@@ -80,13 +90,16 @@ def configure(self, enabled=None, hostname=None, port=None, sampler=None,
Otherwise they'll be dropped.
:param str hostname: Hostname running the Trace Agent
:param int port: Port of the Trace Agent
:param object sampler: A custom Sampler instance
:param object sampler: A custom Sampler instance, locally deciding to totally drop the trace or not.
:param object priority_sampler: A custom Sampler instance, taking the priority sampling decision.
Contributor

Looks like my comment disappeared, but I still advocate removing this from the configure API and keeping only the priority_sampling flag.

@palazzem palazzem left a comment

This PR is obsolete and has been superseded by #359

@palazzem palazzem closed this Oct 26, 2017
@palazzem palazzem deleted the christian/issampled branch October 26, 2017 07:50