
[sampling] add distributed tracing capabilities #325

Closed
ufoot wants to merge 41 commits from the christian/issampled branch

Conversation

Contributor

@ufoot ufoot commented Aug 10, 2017

This is a work in progress and must not be merged until ready

@palazzem palazzem self-requested a review August 11, 2017 07:48
@palazzem palazzem added the wip label Aug 11, 2017
@palazzem palazzem changed the title [distributed sampling] WIP, do not merge [sampling] add distributed tracing capabilities Aug 11, 2017
@ufoot ufoot force-pushed the christian/issampled branch 3 times, most recently from 09b2a0a to 9e22203 Compare August 14, 2017 08:54
@LotharSee LotharSee force-pushed the christian/issampled branch 7 times, most recently from d5db637 to a0198b4 Compare August 24, 2017 13:06

@palazzem palazzem left a comment


I will split the overall feedback into two different points:

  1. The API
    I think the primitives we currently have make this PR difficult to use. Mostly, we don't have a clean way to set a new Context as active (missing API), and creating two Contexts will result in wrong data. Also, the fact that the Context is not immutable doesn't help much when working around the current API. Whatever change we make here could result in a breaking change in the future.

  2. The distributed tracing
    Having a sampling_priority alongside a local sampler that takes priority is a good thing. We may cover scenarios where a distributed system sets the overall sampling and:

  • the system doesn't want to sample many traces, but the local sampler does because developers spotted an issue and want all traces for "some time"
  • the system wants to sample all the traces, but the local sampler doesn't because of the really high span cardinality of the underlying system
  • any possible combination of the points above with <put_here_your_reason_to_sample_or_not> seems well covered by the client itself

I probably still need to see the big picture because it's a WIP, but overall I think keeping two different samplers is the way to go. What I'm missing:

  • how the Span sampling interacts with the Context sampling
  • the Writer receives a trace only if the Context is sampled:
    def record(self, context):
        """
        Record the given ``Context`` if it's finished.
        """
        # extract and enqueue the trace if it's sampled
        trace, sampled = context.get()
        if trace and sampled:
            self.write(trace)

    (probably it's one of the missing steps because it's a [WIP])

self._sampled = sampled
self._sampling_priority = sampling_priority

def get_context_attributes(self):


Do we need to change the private API? It was kept internal because a possible refactoring could make the Context immutable so we can get rid of the lock. Anyway, it's something to keep in mind. Let me think of a possible alternative.
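
For illustration only, a minimal sketch of the kind of immutable Context such a refactoring could produce (a hypothetical shape, not this library's API):

    from collections import namedtuple

    # Hypothetical immutable context: all fields are fixed at creation time,
    # so readers never observe a half-updated value and no lock is needed.
    FrozenContext = namedtuple(
        'FrozenContext', ['trace_id', 'span_id', 'sampled', 'sampling_priority']
    )

    def with_priority(ctx, priority):
        # "mutation" returns a new value instead of updating in place
        return ctx._replace(sampling_priority=priority)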

Contributor

I don't especially defend or like this API.
What is sure is that we will need, somewhere, a public way to access the current tuple (tid, sid, priority) for remote propagation (both for our own integrations and for customers to instrument their own inter-service propagation). Do you agree?
After that, it could live anywhere.
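
For illustration, a minimal sketch of such an accessor used for outbound propagation (the helper and the header names are assumptions, not this PR's API; get_call_context() and get_context_attributes() are the calls discussed in this thread):

    def inject_headers(tracer, headers):
        # Hypothetical helper: copy the current (trace_id, span_id, priority)
        # tuple into outbound headers for the downstream service.
        ctx = tracer.get_call_context()
        trace_id, span_id, priority = ctx.get_context_attributes()
        headers['x-trace-id'] = str(trace_id)
        headers['x-parent-id'] = str(span_id)
        if priority is not None:
            headers['x-sampling-priority'] = str(priority)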

@@ -29,24 +31,27 @@ def attach_context(request):
service = app[CONFIG_KEY]['service']
distributed_tracing = app[CONFIG_KEY]['distributed_tracing_enabled']

context = tracer.get_call_context()


Technically this must be None. If it's not and distributed_tracing == True, it means we're going to create a new Context object using this one as a child_of. In the current API, having two different Context objects in one request means having two different traces (and that's wrong).

Also, manually creating a Context doesn't set it as active. This means the request_span lives in a Context that is not propagated, so any attempt to use start_span() or trace() will add the new Span to a different (wrong) context.
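
A minimal sketch of the pitfall described above (signatures mirror the snippets quoted elsewhere in this thread and are not guaranteed to match the released library):

    def handle_request_sketch(tracer, Context, trace_id, parent_id):
        # A manually created Context is not registered as the active one...
        remote_ctx = Context(trace_id=trace_id, parent_id=parent_id)
        request_span = tracer.start_span('web.request', child_of=remote_ctx)

        # ...so the thread-local Context returned here is a different object,
        # and spans started via trace()/start_span() land in that other
        # (wrong) context instead of the request's context.
        active_ctx = tracer.get_call_context()
        return request_span, active_ctx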

Contributor

@LotharSee LotharSee Aug 30, 2017

When implementing, I only looked at ThreadLocalContext (which, when there is no context, creates one and puts it in the current thread-local) and assumed similar behavior (I should have checked!).

So what is the proper thing to do there?

@@ -90,6 +92,7 @@ def __init__(

# sampling
self.sampled = True
self.priority = None


Technically, sampled and priority are duplicates of what we have in the Context. I'm worried that making the Context immutable is a requirement for a cleaner and stable API.


Forgot to say that probably all these "flags" should live only in the Context. Am I missing something about sampled traces with discarded child spans?

Contributor

Overall I agree both should be moved there.

So far we had sampled in the span (old, from before the Context entity), but both should be moved.
I don't know how much pain that would be though (since we used to check span.sampled in various integrations); maybe it could wait for a second PR?

Contributor

But yes, we have to fix this within the scope of this PR.
If we go with the Context approach to get/set the priority, let's not store it here. Plus, whatever method we provide to get/set it has to hit the right place.

(For now, span.set|get_sampling_priority won't work with context.get_context_attributes.)

    parent_span_id = parent.span_id
    sampling_priority = parent.get_sampling_priority()
else:
    trace_id, parent_span_id, sampling_priority = context.get_context_attributes()


This doesn't work with the current API. We don't have a set_call_context() (the complete API was proposed for OpenTracing but we never made that change for ourselves), which means we can't do:

    ctx = Context(trace_id=trace_id, parent_id=parent_id, sampling_priority=sampling_priority)
    tracer.set_call_context(ctx)  # we don't have this API

This means that context.get_context_attributes() always returns the values of the first Context created when you start_span() or trace(). Because the Context is mutable, to get the right values users should do:

    ctx = tracer.get_call_context()
    ctx.set_context_attributes(...)
    # use start_span() or trace() as usual

Of course, the fact that the Context is mutable is something that is planned to change.

Contributor

This line wasn't asserting that trace_id / parent_id / sampling_priority could change over time, nor that we would need a set_call_context.
This context.get_context_attributes was for the case where we create a root span from a context (from a remote tid/sid/priority, like in the aiohttp change) instead of from a parent.

@@ -50,13 +56,28 @@ def get_current_span(self):
        with self._lock:
            return self._current_span

    def _set_current_span(self, span):


Because the class is meant to be thread-safe, whatever API we provide should use the internal lock for any change, even if it's an internal API.
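
For illustration, a minimal sketch of a lock-protected internal setter (the class name is hypothetical; the lock usage mirrors get_current_span above):

    import threading

    class ContextSketch(object):
        def __init__(self):
            self._lock = threading.Lock()
            self._current_span = None

        def _set_current_span(self, span):
            # take the same lock as readers so updates are never observed half-done
            with self._lock:
                self._current_span = span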

Contributor

That was just to avoid duplicating code between add_span and close_span, which themselves take the lock.
If we think that's too dangerous even for an internal method, we can inline it in those two functions.

if not span._parent:
    span.set_tag(system.PID, getpid())

# TODO: add protection if the service is missing?


I think it's out of the scope of this PR.

# When doing client-side sampling, keep the sample rate so that we can
# scale up statistics in the next steps of the pipeline.
if isinstance(self.sampler, RateSampler):
    span.set_metric(SAMPLE_RATE_METRIC_KEY, self.sampler.sample_rate)


set_metric() is a hot topic. We even have a discussion about removing it because it's confusing (it doesn't set a metric, it sets a tag/meta).

Contributor

Yes. That's for compatibility with what we are currently doing (SAMPLE_RATE_METRIC_KEY has always been a metric).

We are likely to change it in the future.

    self.priority = None
else:
    try:
        self.priority = int(sampling_priority)


Are the only possible values here 0 or 1?

Contributor

Long term, it can be any int.
For now, we have 0 == not sampled, 1 == sampled by our sampler. And very soon, 2 == explicitly sampled by the user.
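
For illustration, the convention described above could be captured as constants (names assumed, not part of this PR):

    SAMPLING_PRIORITY_REJECT = 0     # not sampled: the agent may drop the trace
    SAMPLING_PRIORITY_AUTO_KEEP = 1  # sampled by the library's sampler
    SAMPLING_PRIORITY_USER_KEEP = 2  # explicitly sampled by the user (planned)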

else:
    if self.distributed_sampler:
        # If dropped by the local sampler, distributed instrumentation can drop it too.
        span.set_sampling_priority(0)


That approach seems legit to me.

    os.environ.get('TEST_DATADOG_INTEGRATION', False),
    'You should have a running trace agent and set TEST_DATADOG_INTEGRATION=1 env variable'
)
class TestRateByService(TestCase):


What's the purpose of this test? TestAPITransport already checks all of this.

Contributor

Christian anticipated a bit; that's a feature that isn't available yet.

Contributor

LotharSee commented Aug 30, 2017

Thanks for the deep review! To answer your points:

  1. The API. I don't especially care what it looks like. At a high level, the only things we need are:
  • a public API to access the current trace_id/span_id/sampling_priority.
  • a public API to create a local trace from remote/propagated trace_id/span_id/sampling_priority.

And when I say a public API, it can be any combination of public methods, as long as it is clear and documented.

  2. The mix of two samplers is a way to combine the two approaches in a compatible way (see the sketch after this list).
  • sampler is the older client-side sampling, simply dropping data according to a rate. When a trace is dropped, it doesn't even reach the Agent, so no stats, but also no performance footprint. In practice it isn't used that much. Disabled by default.
  • distributed_sampler is the one deciding in advance whether the trace should be dropped by the Agent, and it propagates this decision.
    The goal: have it enabled by default, applying a per-service sampling rate provided by the Agent.

I guess in the future we will be able to merge these two together, but that seemed too dangerous to consider this early.

  3. If we put all of these in the immutable Context, there is the problem of being able to update the sampling_priority after root creation.
    Does that mean we should move these attributes back to the Span? Here or there, we can have it cleaned up and unified in a second PR.
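
For point 2, a rough sketch of how the two samplers might combine inside the tracer (illustrative only; sample(), span.sampled and set_sampling_priority() are the names used elsewhere in this thread, the rest is assumed):

    class DualSamplingSketch(object):
        """Illustrative only: a local rate sampler plus a priority sampler."""

        def __init__(self, sampler, distributed_sampler=None):
            self.sampler = sampler                          # old client-side sampler
            self.distributed_sampler = distributed_sampler  # decides what the Agent keeps

        def apply(self, span):
            if not self.sampler.sample(span):
                # dropped locally: the trace never reaches the Agent
                span.sampled = False
                if self.distributed_sampler:
                    span.set_sampling_priority(0)
                return
            span.sampled = True
            if self.distributed_sampler:
                # still send the trace, but tell the Agent whether to keep it
                keep = self.distributed_sampler.sample(span)
                span.set_sampling_priority(1 if keep else 0)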

@ufoot ufoot force-pushed the christian/issampled branch from 9126ac6 to b0337fe Compare September 4, 2017 15:10
@@ -1 +1,2 @@
FILTERS_KEY = 'FILTERS'
SAMPLING_PRIORITY_KEY = 'sampling.priority'
Contributor

@LotharSee LotharSee Sep 4, 2017

While our initial implementation is still "experimental" and likely to change a lot in the future, what do we think about using a non-final key for this?

Something like _sampling_priority_v1; that way upgrades of our sampling logic will be much simpler.

Using the exact same OT key isn't important, right? It doesn't especially make sense anyway since we aren't compatible, and the exact meaning of this value isn't properly defined by OT.

cc @palazzem @ufoot

@ufoot ufoot force-pushed the christian/issampled branch 3 times, most recently from 8c33f56 to 7c3d1db Compare September 4, 2017 16:25
# Apply the default configuration
self.configure(
    enabled=True,
    hostname=self.DEFAULT_HOSTNAME,
    port=self.DEFAULT_PORT,
    sampler=AllSampler(),
    # TODO: by default, a ServiceSampler periodically updated
    distributed_sampler=RateByServiceSampler(),
Contributor

In fact, as a default we will want distributed_sampler to be None so that we don't use the priority logic and keep using the signature sampling (which is much more mature and more interesting for non-distributed traces).

@ufoot ufoot force-pushed the christian/issampled branch 5 times, most recently from 2b81010 to 0c7cfe3 Compare September 11, 2017 13:07
@ufoot ufoot force-pushed the christian/issampled branch from b240cad to 1f8e5af Compare September 18, 2017 11:33
@ufoot ufoot force-pushed the christian/issampled branch from 1f8e5af to e8f4387 Compare September 18, 2017 11:40
@ufoot ufoot force-pushed the christian/issampled branch from d2e221b to 1dd59b8 Compare September 25, 2017 11:11
The former code would break when a key disappeared from the agent.
This rarely happens in practice, as a given agent often has a rather
constant set of services it consumes, but it still had to be fixed.
@ufoot ufoot force-pushed the christian/issampled branch from 1dd59b8 to b9b9d87 Compare September 25, 2017 12:19
@ufoot ufoot force-pushed the christian/issampled branch from 0f16839 to 5b9d9d2 Compare September 25, 2017 14:59
@ufoot ufoot force-pushed the christian/issampled branch from 3c1ee05 to 48897c0 Compare September 29, 2017 09:41
@ufoot ufoot removed the wip label Oct 4, 2017
@@ -55,8 +82,8 @@ def send_traces(self, traces):
response = self._put(self._traces, data, len(traces))

# the API endpoint is not available so we should downgrade the connection and re-try the call
if response.status in [404, 415] and self._compatibility_mode is False:
log.debug('calling the endpoint "%s" but received %s; downgrading the API', self._traces, response.status)
if response.status in [404, 415] and self._fallback:
Contributor Author

Here @palazzem I'd like to have your opinion. I thought it was OK to cascade down from v0.4 to v0.3 to v0.2, but I'm open to other possibilities. The advantage of doing this is that here I only care about the endpoint; the question of knowing whether the answer is JSON or a plain OK\n string is handled later. This keeps the coupling low between this chunk of code and the one handling the JSON.
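
A rough sketch of the cascade described above (illustrative only; the endpoint strings and the fallback mapping are assumptions, while _put() and _traces mirror the diff):

    # Hypothetical fallback chain: each endpoint knows which older one to try next.
    FALLBACKS = {
        'v0.4/traces': 'v0.3/traces',
        'v0.3/traces': 'v0.2/traces',
        'v0.2/traces': None,
    }

    def send_with_downgrade(api, data, count):
        response = api._put(api._traces, data, count)
        # 404/415 means the agent doesn't know this endpoint: downgrade and retry
        while response.status in (404, 415) and FALLBACKS.get(api._traces):
            api._traces = FALLBACKS[api._traces]
            response = api._put(api._traces, data, count)
        return response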

if body.startswith('OK'):
    # This typically happens when using a priority-sampling enabled
    # library with an outdated agent. It still works, but priority sampling
    # will probably send too many traces, so the next step is to upgrade the agent.
Contributor Author

Here @palazzem, when we reach this point we could downgrade, disable priority sampling, and switch to protocol v3. OTOH it introduces coupling between components and some complexity, I think.

@LotharSee
Contributor

One tip: to simplify this PR, I'd suggest extracting the commits around "set the service at root span creation" into a different PR which we could merge first.

@@ -80,13 +90,16 @@ def configure(self, enabled=None, hostname=None, port=None, sampler=None,
Otherwise they'll be dropped.
:param str hostname: Hostname running the Trace Agent
:param int port: Port of the Trace Agent
:param object sampler: A custom Sampler instance
:param object sampler: A custom Sampler instance, locally deciding to totally drop the trace or not.
:param object priority_sampler: A custom Sampler instance, taking the priority sampling decision.
Contributor

Looks like my comment disappeared, but I still advocate removing this from the configure API and keeping only the priority_sampling flag.

@palazzem palazzem left a comment

This PR is obsolete and has been superseded by #359

@palazzem palazzem closed this Oct 26, 2017
@palazzem palazzem deleted the christian/issampled branch October 26, 2017 07:50