-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sampling] add distributed tracing capabilities #325
Conversation
09b2a0a
to
9e22203
Compare
d5db637
to
a0198b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will split the overall feedback in two different points:
-
The API
I think that primitives we currently have, makes this PR difficult to use. Mostly we don't have a clean way to set a newContext
as active (missing API), considering that creating twoContext
will result in wrong data. Also, the fact that is not immutable doesn't help you a lot to workaround the current API. Whatever change we make here, could result in a breaking change in the future. -
The distributed tracing
Having asampling_priority
with a local sampler that takes priority, is a good thing. We may cover scenarios where a distributed system sets the overall sampling and:
- the system doesn't want to sample a lot of traces, but the local sampler wants because developers spot an issue and they want all traces for "some time"
- the system wants to sample all the traces, but the local sampler doesn't because of really high span cardinality of the underlying system
- any possible combination of the points above with <put_here_your_reason_to_sample_or_not> seems quite covered by the client itself
Probably I still need to see the big picture because it's a WIP, but in the overall I think keeping two different samplers is the way to go. What I'm missing:
- how the
Span
sampling interacts with theContext
sampling - the
Writer
receives a trace only if theContext
is sampled:Lines 279 to 286 in a0198b4
def record(self, context): """ Record the given ``Context`` if it's finished. """ # extract and enqueue the trace if it's sampled trace, sampled = context.get() if trace and sampled: self.write(trace)
self._sampled = sampled | ||
self._sampling_priority = sampling_priority | ||
|
||
def get_context_attributes(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to change the private API? that was set as internal because a possible refactoring could make the Context
immutable so we can get rid of the lock
. Anyway, it's something to keep in mind. Let me think of a possible alternative.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't especially defend nor like this API.
What is sure, it is that we will need somewhere a public way to access the current tuple (tid, sid, priority) for remote propagation (both for our own integrations and for customers to instrument their own inter service propagation). Do you agree too?
After that, it could live anywhere.
@@ -29,24 +31,27 @@ def attach_context(request): | |||
service = app[CONFIG_KEY]['service'] | |||
distributed_tracing = app[CONFIG_KEY]['distributed_tracing_enabled'] | |||
|
|||
context = tracer.get_call_context() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this must be None
. If it's not and we have a distributed_tracing == True
, it means that we're going to create a new Context
object using the new one as a child_of
. In the current API, having two different Context
objects in one request, means having two different traces (and it's wrong).
Also, creating manually a Context
, doesn't set it as active. This means that the request_span
lives in a Context
that is not propagated and so any attempt to use start_span()
or trace()
, will add the new Span
to a different (wrong) context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When implementing, I only looked at ThreadLocalContext
(which, when they are no context, create one and puts it in the current thread local) and assumed a similar behavior (I should have checked!).
So what is the proper thing to do there?
@@ -90,6 +92,7 @@ def __init__( | |||
|
|||
# sampling | |||
self.sampled = True | |||
self.priority = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically, sampled
and priority
are duplicated of what we have in the Context
. I'm worried that having the Context
immutable is a requirement for a cleaner and stable API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to say, that probably all these "flags" should stay only in the Context
. Am I missing something about sampled traces but discarded children spans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I agree both should be moved there.
So far we had sampled
in the span (old, from before the Context
entity) but both should be moved.
I don't know how much pain that would be though (since we used to check span.sampled
in various integrations) ; mayve it could wait a second PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But yes, we have to fix this in the scope of this PR.
If we go with the Context
approach to get/set the priority, let's not store it here. Plus whatever method we provide to get/set it ; it has to hit the right place.
(for now, the span.set|get_sampling_priority
won't work with context.get_context_attributes
).
parent_span_id = parent.span_id | ||
sampling_priority = parent.get_sampling_priority() | ||
else: | ||
trace_id, parent_span_id, sampling_priority = context.get_context_attributes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't work with the current API. We don't have a way to set_call_context()
(the complete API was proposed to OpenTracing but we never made that change for ourselves), so it means that we can't:
ctx = Context(trace_id=trace_id, parent_id=parent_id, sampling_priority= sampling_priority)
tracer.set_call_context(ctx) # we don't have this API
This means that context.get_context_attributes()
returns always the value of the first created Context
when you start_span()
or trace()
. Because the Context
is mutable, to get the right values users should:
ctx = tracer.get_call_context()
ctx.set_context_attributes(...)
# use start_span() or trace() as usual
Of course the fact that Context
is mutable, is something that is planned to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line wasn't asserting that trace_id / parent_id / sampling_priority could change over time nor that we would need a set_call_context.
This context. get_context_attributes
was from the case we create a root span from a context (from a remote tid/sid/priority, like in the iohttp change) instead of a parent.
@@ -50,13 +56,28 @@ def get_current_span(self): | |||
with self._lock: | |||
return self._current_span | |||
|
|||
def _set_current_span(self, span): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the class is meant to be thread-safe, whatever API we provide, it should use the internal lock
for any change even if it's an internal API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was just not to dupe code between add_span
and close_span
which themselves set the lock.
If we think that's too dangerous even if internal, we can inline it in these two functions.
if not span._parent: | ||
span.set_tag(system.PID, getpid()) | ||
|
||
# TODO: add protection if the service is missing? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's out of the scope of this PR.
# When doing client sampling in the client, keep the sample rate so that we can | ||
# scale up statistics in the next steps of the pipeline. | ||
if isinstance(self.sampler, RateSampler): | ||
span.set_metric(SAMPLE_RATE_METRIC_KEY, self.sampler.sample_rate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_metric()
is an hot-topic. We even have a discussion to remove it because it's confusing (it doesn't set a metric, it sets a tag/meta).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. That's for compatibility with what we are currently doing (SAMPLE_RATE_METRIC_KEY always was a metric).
We are likely to change it in the future.
self.priority = None | ||
else: | ||
try: | ||
self.priority = int(sampling_priority) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only possible values here are 0 or 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long term, it can be any int
.
For now, we have 0
== not sampled, 1
sampled by our sampler. And very soon, 2
explicitly sampled by the user.
ddtrace/tracer.py
Outdated
else: | ||
if self.distributed_sampler: | ||
# If dropped by the local sampler, distributed instrumentation can drop it too. | ||
span.set_sampling_priority(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That approach seems legit to me.
os.environ.get('TEST_DATADOG_INTEGRATION', False), | ||
'You should have a running trace agent and set TEST_DATADOG_INTEGRATION=1 env variable' | ||
) | ||
class TestRateByService(TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the purpose of this test? TestAPITransport
already check all of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Christian anticipated a bit, that's a feature not yet available.
Thanks for the deep review! To answer your points.
And when I say a public API, it can be any combination of public methods, as long as it is clear and documented.
I guess in the future we will be able to merge these 2 together, but that was too dangerous to consider so early.
|
9126ac6
to
b0337fe
Compare
ddtrace/constants.py
Outdated
@@ -1 +1,2 @@ | |||
FILTERS_KEY = 'FILTERS' | |||
SAMPLING_PRIORITY_KEY = 'sampling.priority' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While our initial implementation is still "experimental" and is likely to change a lot in the future, what do we think about using a non-finale key for that?
Something like _sampling_priority_v1
, that way upgrades of our sampling logic will be much simpler.
The usage of the exact same OT key isn't important right, and it doesn't especially make sense anyway since we aren't compatible / the exact meaning of this value isn't properly defined by OT.
8c33f56
to
7c3d1db
Compare
ddtrace/tracer.py
Outdated
# Apply the default configuration | ||
self.configure( | ||
enabled=True, | ||
hostname=self.DEFAULT_HOSTNAME, | ||
port=self.DEFAULT_PORT, | ||
sampler=AllSampler(), | ||
# TODO: by default, a ServiceSampler periodically updated | ||
distributed_sampler=RateByServiceSampler(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, as a default we will want the distributed_sampler
to be None
so that we don't use the priority logic / we keep using the signature sampling (which is much more mature / interesting with non-distributed traces ).
2b81010
to
0c7cfe3
Compare
b240cad
to
1f8e5af
Compare
…o if tracer is disabled
1f8e5af
to
e8f4387
Compare
d2e221b
to
1dd59b8
Compare
Former code would bug when a key was disappearing from the agent. This rarely happens in practice as a given agent often has a rather constant set of services it consumes, but still, had to be fixed.
1dd59b8
to
b9b9d87
Compare
0f16839
to
5b9d9d2
Compare
3c1ee05
to
48897c0
Compare
@@ -55,8 +82,8 @@ def send_traces(self, traces): | |||
response = self._put(self._traces, data, len(traces)) | |||
|
|||
# the API endpoint is not available so we should downgrade the connection and re-try the call | |||
if response.status in [404, 415] and self._compatibility_mode is False: | |||
log.debug('calling the endpoint "%s" but received %s; downgrading the API', self._traces, response.status) | |||
if response.status in [404, 415] and self._fallback: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There @palazzem I'd like to have your opinion on this. I thought it was OK to cascade down from v0.4 to v0.3 to v0.2 but I'm open to other possibilities. The advantage of doing this is that here I just care about the endpoint, but the question of knowing wether we have a JSON as an answer or a plain OK\n
string is done later. This enables low coupling between this chunk of code and the one handling the JSON.
if body.startswith('OK'): | ||
# This typically happens when using a priority-sampling enabled | ||
# library with an outdated agent. It still works, but priority sampling | ||
# will probably send too many traces, so the next step is to upgrade agent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There @palazzem when we're here we could downgrade and disable priority sampling and switch on protocol v3. OTOH it introduces coupling between components and some complexity I think.
One tip: to simplify this PR, I'd suggest to extract commits around the "set the service at root span creation" into a different PR which we could merge first. |
@@ -80,13 +90,16 @@ def configure(self, enabled=None, hostname=None, port=None, sampler=None, | |||
Otherwise they'll be dropped. | |||
:param str hostname: Hostname running the Trace Agent | |||
:param int port: Port of the Trace Agent | |||
:param object sampler: A custom Sampler instance | |||
:param object sampler: A custom Sampler instance, locally deciding to totally drop the trace or not. | |||
:param object priority_sampler: A custom Sampler instance, taking the priority sampling decision. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like my comment disappeared, but I still advocate for removing this from the configure
API and only keep the priority_sampling
flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is obsolete and has been superseded by #359
This is a work in progress and must not be merged until ready