agent/agent.go
Outdated
@@ -237,6 +240,11 @@ func (a *Agent) Process(t model.Trace) {
		pt.Env = tenv
	}

	go func() {
		defer watchdog.LogOnPanic()
		a.ServiceExtractor.Process(t)
There is a risk of a race condition: `Process` accesses `s.TopLevel`, which itself reads the `s.Metrics` map, but since `sampler.Add` modifies `s.Metrics` it can crash (we know this from experience).
I see two solutions:
- Use `WeightedTrace` instead of `Trace`, where weighted spans contain a `TopLevel` which is safe to access.
- Do it here and not in a goroutine. The only dangerous part is `ts.outServices <- meta`, which can block: we can move just that part to a goroutine (sketch below).
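For the second option, something like this minimal sketch (the types below are simplified stand-ins for the real `model` types, just to show the shape of the idea):

```go
package agent

// Sketch only: Span, Trace, ServicesMetadata and TraceServiceExtractor are
// simplified stand-ins for the real types, to illustrate doing the extraction
// synchronously and moving only the potentially blocking send to a goroutine.

type Span struct {
	Service string
	Type    string
}

type Trace []*Span

type ServicesMetadata map[string]map[string]string

type TraceServiceExtractor struct {
	outServices chan ServicesMetadata
}

func (ts *TraceServiceExtractor) Process(t Trace) {
	meta := make(ServicesMetadata)

	for _, s := range t {
		if s.Service == "" || s.Type == "" {
			continue
		}
		if _, ok := meta[s.Service]; !ok {
			meta[s.Service] = map[string]string{"app_type": s.Type}
		}
	}

	if len(meta) == 0 {
		return
	}

	// The extraction above runs synchronously with the rest of Agent.Process,
	// so span data is not read concurrently with the samplers. Only the send,
	// which can block, is pushed to a goroutine.
	go func() { ts.outServices <- meta }()
}
```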
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If those `s.Metrics` maps can be modified by other goroutines then changing this is a no-brainer. I'll see what I can do about that (I guess it might be worth approaching this in another way, as suggested by the other comments).
agent/trace_service_extractor.go
Outdated
}

// Process extracts service metadata from top-level spans and sends it downstream
func (ts *TraceServiceExtractor) Process(t model.Trace) {
With this approach, for each trace we will:
- generate a `ServicesMetadata` and fill it with data
- send it to the `ServiceWriter`, which then does a `DeepEqual`

I'm curious about a simpler approach where we move the `ServiceWriter.serviceBuffer` into this extractor:
- make the ServiceExtractor (or whatever we call it) ingest both service metadata from the Agent API and the processing of traces
- make it store a single instance of `model.ServicesMetadata` acting as a reference value
- to be faster/simpler, don't allow service metadata updates: when processing a span/service, do nothing if the service is already available in the map
- send the map to the ServiceWriter only if the map got modified. The ServiceWriter no longer needs the `DeepEqual` logic (rough sketch below).
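Roughly something like this (all names below are placeholders, not the real API):

```go
package agent

// Placeholder sketch of the "single reference map" idea; none of these names
// reflect the actual implementation.

type ServicesMetadata map[string]map[string]string

type ServiceExtractor struct {
	reference ServicesMetadata      // single source of truth
	out       chan ServicesMetadata // feeds the ServiceWriter
}

// Update ingests metadata coming either from the Agent API or from trace
// processing. Known services are never updated, so a late span cannot
// overwrite what was already recorded.
func (se *ServiceExtractor) Update(incoming ServicesMetadata) {
	if se.reference == nil {
		se.reference = make(ServicesMetadata)
	}

	changed := false
	for name, meta := range incoming {
		if _, ok := se.reference[name]; ok {
			continue // already known: skip, no DeepEqual needed
		}
		se.reference[name] = meta
		changed = true
	}
	if !changed {
		return
	}

	// Only push downstream when the map actually changed, and push a copy so
	// the writer never races with later updates to the reference map.
	snapshot := make(ServicesMetadata, len(se.reference))
	for k, v := range se.reference {
		snapshot[k] = v
	}
	se.out <- snapshot
}
```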
What do you think?
This comment might be a bit confusing, ping me if you want to discuss it directly.
Hmm, I like the idea of simplifying the service "cache" as you mentioned (by simply checking for the existence of service names instead of doing update operations followed by `DeepEqual` calls).
I just find it a bit weird to move this to the "extractor" because then we would have 3 responsibilities for it:
- Maintain the service cache;
- Extract service metadata from traces and send it downstream;
- Send downstream the service data originating from the receiver (the service endpoint).
By doing these changes we would write less to this service channel, but do we care about that? (It's not a rhetorical question – I know very little about Go 😬)
In terms of separation of concerns, I think the way it is now makes more sense to me. But again, I'm not sure if it's expensive to write to this channel for each trace processed and to keep the cache downstream just to prevent the flushes (that is, the requests to the trace API).
Right, that's why "Extractor" would no longer be a suitable name (more like a Mapper? I don't know, I'm not good at names).
I think it makes sense to have a single component maintaining a single source-of-truth service metadata map (built both from the API and the span extraction). Its job is just to keep and update this map, and on update, send it to the writer.
Plus, we know that in the future we will remove the client service metadata part, making this component only extract meta from spans.
That'd be quite similar to what we do for stats: the writer only receives data to flush. Then the ServiceWriter only cares about serializing/flushing the payload, with potential retry and debounce logic.
I felt it was actually simpler, with uncoupled responsibilities; but maybe you have a different read?
- I agree with removing the service map from the writer and bringing it upwards;
- I just think we shouldn't mix the handling of two "data sources" into the same "component". I think the entry point to this `ServiceMapper` should be agnostic of the origins of the data and should be something as "universal" as a method or channel that accepts `ServicesMetadata`.

I'm of the opinion that if we ever change the source of our metadata, this mapper shouldn't change. If we hook up new sources, we'll just have other parts of the codebase interacting with it. If we delete the service endpoint, for example, we basically confine the change to the service section of the `HTTPReceiver`. Makes sense?
So the change here would be basically:
- Creating the `ServiceMapper`;
- Keeping the `TraceServiceExtractor` as it is, but sending the information to `ServiceMapper`;
- Keep the `HTTPReceiver` as it is, but instead of writing to the service channel, we write to `ServiceMapper`.
And now we have clear responsibilities:
- `ServiceMapper` (or whatever name is best): the entry point for managing the service metadata (a cache layer around `ServiceWriter`; rough sketch below);
- `TraceServiceExtractor`: extracts-service-info-from-traces 😛
- `HTTPReceiver`: extracts-service-info-from-http-requests
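To make it concrete, a rough sketch of what I have in mind (hypothetical names and fields, not actual code):

```go
package agent

// Hypothetical sketch of the proposed ServiceMapper; field and method names
// are illustrative only.

type ServicesMetadata map[string]map[string]string

// Merge copies entries from other into s.
func (s ServicesMetadata) Merge(other ServicesMetadata) {
	for k, v := range other {
		s[k] = v
	}
}

// ServiceMapper is the single, source-agnostic entry point for service
// metadata: both TraceServiceExtractor and HTTPReceiver write to in, and only
// actual changes are forwarded to the ServiceWriter via out.
type ServiceMapper struct {
	in    chan ServicesMetadata
	out   chan ServicesMetadata
	cache ServicesMetadata
}

func (m *ServiceMapper) Run() {
	if m.cache == nil {
		m.cache = make(ServicesMetadata)
	}

	for metadata := range m.in {
		changes := make(ServicesMetadata)
		for name, meta := range metadata {
			if _, seen := m.cache[name]; seen {
				continue // already tracked, nothing to flush
			}
			changes[name] = meta
		}
		if len(changes) == 0 {
			continue
		}
		m.out <- changes
		m.cache.Merge(changes)
	}
}
```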
Great idea 💯 go for it!
Cool, we're almost there. I still believe we need to make sure we don't spam our API with too many service payloads (which, I think, this code could do).
agent/trace_service_extractor.go
Outdated
	meta := make(model.ServicesMetadata)

	for _, s := range t {
		if !s.TopLevel() {
Besides the fact that it can raise a race condition as @LotharSee mentioned, I'm not even sure we care about it. It's slightly more optimized if we skip non-top-level spans, but in practice many of them are top-level (our stats show this for now) and saving a test also makes the code simpler, etc. Reporting service info for non-top-level spans won't break anything.
Actually it would: some web frameworks have the `http` type for the top-level span but other types (like `template`) for some of its sub-spans.
	}

	if len(meta) > 0 {
		ts.outServices <- meta
So I also agree we should not fill this channel for every single trace, only when something actually changed, so at some point we need to cache the data. One way to do it that I could think of is a goroutine ranging over a trace (or maybe even span...) channel and pumping data into that `outServices` channel with a heartbeat. The cache should be cleared at some point and/or provide a way to cope with the fact that users might want to change the type of a service at some point (you rarely get it right on the first try) and won't want to restart their agent to acknowledge this change.
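Something along these lines, maybe (entirely hypothetical, including the channel shapes and the intervals):

```go
package agent

import "time"

// Hypothetical sketch of the heartbeat idea: accumulate metadata coming from
// trace processing and only push it to the outServices channel on a ticker,
// periodically forgetting what was seen so users can change a service type
// without restarting the agent. Names and intervals are made up.

type ServicesMetadata map[string]map[string]string

func pumpServices(in <-chan ServicesMetadata, outServices chan<- ServicesMetadata) {
	flush := time.NewTicker(10 * time.Second)  // heartbeat toward the writer
	expire := time.NewTicker(10 * time.Minute) // clear the cache once in a while
	defer flush.Stop()
	defer expire.Stop()

	pending := make(ServicesMetadata)
	seen := make(ServicesMetadata)

	for {
		select {
		case meta := <-in:
			for name, m := range meta {
				if _, ok := seen[name]; ok {
					continue
				}
				pending[name] = m
				seen[name] = m
			}
		case <-flush.C:
			if len(pending) > 0 {
				outServices <- pending
				pending = make(ServicesMetadata)
			}
		case <-expire.C:
			// Forget everything so a changed service type eventually
			// gets re-sent downstream.
			seen = make(ServicesMetadata)
		}
	}
}
```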
As far as I understand, `ServiceWriter` already implements such a cache. That's why the approach taken here is so simplistic. I'm merely creating another "path" for ingesting the service metadata, but I'm still relying on all the housekeeping done downstream. I might have to reconsider it though, because of the race condition mentioned.
Force-pushed from 14237ab to 75d9fb5 (TraceServiceExtractor).
writer/service_writer.go
Outdated
@@ -179,7 +175,7 @@ func (w *ServiceWriter) updateInfo() {
 	swInfo.Retries = atomic.SwapInt64(&w.stats.Retries, 0)

 	w.conf.StatsClient.Count("datadog.trace_agent.service_writer.payloads", int64(swInfo.Payloads), nil, 1)
-	w.conf.StatsClient.Gauge("datadog.trace_agent.service_writer.services", float64(swInfo.Services), nil, 1)
+	w.conf.StatsClient.Count("datadog.trace_agent.service_writer.services", int64(swInfo.Services), nil, 1)
We should keep a Gauge here: it represents an instant value (number of different services seen) not an increment. Over a given time frame, it doesn't make sense to sum these values.
Actually it's quite the opposite now. Before, due to the cache at the `ServiceWriter` level, we would flush the same services over and over again, so this number would monotonically increase. Now we just keep track of the number of new services flushed. Perhaps we now want a gauge representing the total number of services tracked at the `ServiceMapper`. If we keep the gauge for the `ServiceWriter`, we will have super noisy graphs (they will be blank for the most part).
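To spell out the difference, annotating the two calls from the diff above (these are the existing lines, not new API):

```go
// Count: "this many new services were flushed during the interval";
// values are summed over the flush window, so it fits a delta.
w.conf.StatsClient.Count("datadog.trace_agent.service_writer.services", int64(swInfo.Services), nil, 1)

// Gauge: "this many services are currently tracked"; only the last value is
// kept, so it fits an instant total (e.g. at the ServiceMapper).
w.conf.StatsClient.Gauge("datadog.trace_agent.service_writer.services", float64(swInfo.Services), nil, 1)
```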
Oh right, I missed this order / the fact that we send `changes` and not `s.cache`:

    s.out <- changes
    s.cache.Merge(changes)
Do you want me to add a gauge at the `ServiceMapper`? I already have an instrumentation "ticker" in place (in the event loop).
Nope, it is all good as is :)
Force-pushed from 3a053a6 to 08d0c21.
Force-pushed from 55311e9 to 279fd0b.
This PR adds `TraceServiceExtractor` to our trace processing pipeline. The idea is to extract service metadata out of all top-level spans belonging to a given trace and send it downstream (to `ServiceWriter`). By doing this we intend to simplify instrumentation on the client side, since we no longer require service metadata to be set at the tracer level.
Update
Based on feedback from @LotharSee and @ufoot I've made the following changes:
- Created `ServiceMapper`, which should now be the entry point for ingesting `ServicesMetadata` (coming either from HTTP requests or from metadata embedded in traces). `ServiceMapper` is just a cache layer around `ServiceWriter`;
- Removed the `DeepEqual` logic from `ServiceWriter`.