agent/agent.go
Outdated
@@ -237,6 +240,11 @@ func (a *Agent) Process(t model.Trace) {
		pt.Env = tenv
	}

	go func() {
		defer watchdog.LogOnPanic()
		a.ServiceExtractor.Process(t)
There is a risk of a race condition: `Process` accesses `s.TopLevel`, which itself reads the `s.Metrics` map, but since `sampler.Add` modifies `s.Metrics` it can crash (we know this from experience).
I see two solutions:
- Use `WeightedTrace` instead of `Trace`, where weighted spans contain a `TopLevel` which is safe to access.
- Do it here and not in a goroutine. The only dangerous part is `ts.outServices <- meta`, which can block: we can move just that part to a goroutine (sketch below).
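For the second option, something like this minimal sketch (the types below are simplified stand-ins for the real `model` types, just to show the shape of the idea):

```go
package agent

// Sketch only: Span, Trace, ServicesMetadata and TraceServiceExtractor are
// simplified stand-ins for the real types, to illustrate doing the extraction
// synchronously and moving only the potentially blocking send to a goroutine.

type Span struct {
	Service string
	Type    string
}

type Trace []*Span

type ServicesMetadata map[string]map[string]string

type TraceServiceExtractor struct {
	outServices chan ServicesMetadata
}

func (ts *TraceServiceExtractor) Process(t Trace) {
	meta := make(ServicesMetadata)

	for _, s := range t {
		if s.Service == "" || s.Type == "" {
			continue
		}
		if _, ok := meta[s.Service]; !ok {
			meta[s.Service] = map[string]string{"app_type": s.Type}
		}
	}

	if len(meta) == 0 {
		return
	}

	// The extraction above runs synchronously with the rest of Agent.Process,
	// so span data is not read concurrently with the samplers. Only the send,
	// which can block, is pushed to a goroutine.
	go func() { ts.outServices <- meta }()
}
```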
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If those `s.Metrics` maps can be modified by other goroutines then changing this is a no-brainer. I'll see what I can do about that (I guess it might be worth approaching this in another way, as suggested by the other comments).
agent/trace_service_extractor.go
Outdated
}

// Process extracts service metadata from top-level spans and sends it downstream
func (ts *TraceServiceExtractor) Process(t model.Trace) {
With this approach, for each trace we will:
- generate a `ServicesMetadata` and fill it with data
- send it to the `ServiceWriter`, which then does a `DeepEqual`

I'm curious about a simpler approach where we move the `ServiceWriter.serviceBuffer` into this extractor:
- make the ServiceExtractor (or whatever we call it) ingest both service metadata from the Agent API and the processing of traces
- make it store a single instance of `model.ServicesMetadata` acting as a reference value
- to be faster/simpler, don't allow service metadata updates: when processing a span/service, do nothing if the service is already available in the map
- send the map to the ServiceWriter only if the map got modified. The ServiceWriter no longer needs the `DeepEqual` logic (rough sketch below).
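Roughly something like this (all names below are placeholders, not the real API):

```go
package agent

// Placeholder sketch of the "single reference map" idea; none of these names
// reflect the actual implementation.

type ServicesMetadata map[string]map[string]string

type ServiceExtractor struct {
	reference ServicesMetadata      // single source of truth
	out       chan ServicesMetadata // feeds the ServiceWriter
}

// Update ingests metadata coming either from the Agent API or from trace
// processing. Known services are never updated, so a late span cannot
// overwrite what was already recorded.
func (se *ServiceExtractor) Update(incoming ServicesMetadata) {
	if se.reference == nil {
		se.reference = make(ServicesMetadata)
	}

	changed := false
	for name, meta := range incoming {
		if _, ok := se.reference[name]; ok {
			continue // already known: skip, no DeepEqual needed
		}
		se.reference[name] = meta
		changed = true
	}
	if !changed {
		return
	}

	// Only push downstream when the map actually changed, and push a copy so
	// the writer never races with later updates to the reference map.
	snapshot := make(ServicesMetadata, len(se.reference))
	for k, v := range se.reference {
		snapshot[k] = v
	}
	se.out <- snapshot
}
```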
What do you think?
This comment might be a bit confusing, ping me if you want to discuss it directly.
Hmm, I like the idea of simplifying the service "cache" as you mentioned (by simply checking for the existence of service names instead of doing update operations followed by `DeepEqual` calls).
I just find it a bit weird to move this to the "extractor" because then we would have 3 responsibilities for it:
- Maintain the service cache;
- Extract service metadata from traces and send it downstream;
- Send downstream the service data originating from the receiver (the service endpoint).
By doing these changes we would write less to this service channel, but do we care about that? (It's not a rhetorical question – I know very little about Go 😬)
In terms of separation of concerns, I think the way it is now makes more sense to me. But again, I'm not sure if it's expensive to write to this channel for each trace processed and to keep the cache downstream just to prevent the flushes (that is, the requests to the trace API).
Right, that's why "Extractor" would no longer be a suitable name (more like a Mapper? I don't know, I'm not good at names).
I think it makes sense to have a single component maintaining a single source-of-truth service metadata map (built both from the API and the span extraction). Its job is just to keep and update this map, and on update, send it to the writer.
Plus, we know that in the future we will remove the client service metadata part, making this component only extract meta from spans.
That'd be quite similar to what we do for stats: the writer only receives data to flush. Then the ServiceWriter only cares about serializing/flushing the payload, with potential retry and debounce logic.
I felt it was actually simpler, with uncoupled responsibilities; but maybe you have a different read?
- I agree with removing the service map from the writer and bringing it upwards;
- I just think we shouldn't mix the handling of two "data sources" into the same "component". I think the entry point to this `ServiceMapper` should be agnostic of the origins of the data and should be something as "universal" as a method or channel that accepts `ServicesMetadata`.

I'm of the opinion that if we ever change the source of our metadata, this mapper shouldn't change. If we hook up new sources, we'll just have other parts of the codebase interacting with it. If we delete the service endpoint, for example, we basically confine the change to the service section of the `HTTPReceiver`. Makes sense?
So the change here would be basically:
- Creating the `ServiceMapper`;
- Keeping the `TraceServiceExtractor` as it is, but sending the information to `ServiceMapper`;
- Keep the `HTTPReceiver` as it is, but instead of writing to the service channel, we write to `ServiceMapper`.
And now we have clear responsibilities:
- `ServiceMapper` (or whatever name is best): the entry point for managing the service metadata (a cache layer around `ServiceWriter`; rough sketch below);
- `TraceServiceExtractor`: extracts-service-info-from-traces 😛
- `HTTPReceiver`: extracts-service-info-from-http-requests
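To make it concrete, a rough sketch of what I have in mind (hypothetical names and fields, not actual code):

```go
package agent

// Hypothetical sketch of the proposed ServiceMapper; field and method names
// are illustrative only.

type ServicesMetadata map[string]map[string]string

// Merge copies entries from other into s.
func (s ServicesMetadata) Merge(other ServicesMetadata) {
	for k, v := range other {
		s[k] = v
	}
}

// ServiceMapper is the single, source-agnostic entry point for service
// metadata: both TraceServiceExtractor and HTTPReceiver write to in, and only
// actual changes are forwarded to the ServiceWriter via out.
type ServiceMapper struct {
	in    chan ServicesMetadata
	out   chan ServicesMetadata
	cache ServicesMetadata
}

func (m *ServiceMapper) Run() {
	if m.cache == nil {
		m.cache = make(ServicesMetadata)
	}

	for metadata := range m.in {
		changes := make(ServicesMetadata)
		for name, meta := range metadata {
			if _, seen := m.cache[name]; seen {
				continue // already tracked, nothing to flush
			}
			changes[name] = meta
		}
		if len(changes) == 0 {
			continue
		}
		m.out <- changes
		m.cache.Merge(changes)
	}
}
```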
Great idea 💯 go for it!
Cool, we're almost there. I still believe we need to make sure we don't spam our API with too many service payloads (which, I think, this code could do).
agent/trace_service_extractor.go
Outdated
	meta := make(model.ServicesMetadata)

	for _, s := range t {
		if !s.TopLevel() {
Besides the fact that it can raise a race condition as @LotharSee mentioned, I'm not even sure we care about it. It's slightly more optimized if we skip non-top-level spans, but in practice many of them are top-level (our stats show this for now) and saving a test also makes the code simpler, etc. Reporting service info for non-top-level spans won't break anything.
Actually it would: some web frameworks have the `http` type for the top-level span but other types (like `template`) for some of its sub-spans.
	}

	if len(meta) > 0 {
		ts.outServices <- meta
So I also agree we should not fill this channel for every single trace, only when something actually changed, so at some point we need to cache the data. One way to do it that I could think of is a goroutine ranging over a trace (or maybe even span...) channel and pumping data into that `outServices` channel with a heartbeat. The cache should be cleared at some point and/or provide a way to cope with the fact that users might want to change the type of a service at some point (you rarely get it right on the first try) and won't want to restart their agent to acknowledge this change.
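Something along these lines, maybe (entirely hypothetical, including the channel shapes and the intervals):

```go
package agent

import "time"

// Hypothetical sketch of the heartbeat idea: accumulate metadata coming from
// trace processing and only push it to the outServices channel on a ticker,
// periodically forgetting what was seen so users can change a service type
// without restarting the agent. Names and intervals are made up.

type ServicesMetadata map[string]map[string]string

func pumpServices(in <-chan ServicesMetadata, outServices chan<- ServicesMetadata) {
	flush := time.NewTicker(10 * time.Second)  // heartbeat toward the writer
	expire := time.NewTicker(10 * time.Minute) // clear the cache once in a while
	defer flush.Stop()
	defer expire.Stop()

	pending := make(ServicesMetadata)
	seen := make(ServicesMetadata)

	for {
		select {
		case meta := <-in:
			for name, m := range meta {
				if _, ok := seen[name]; ok {
					continue
				}
				pending[name] = m
				seen[name] = m
			}
		case <-flush.C:
			if len(pending) > 0 {
				outServices <- pending
				pending = make(ServicesMetadata)
			}
		case <-expire.C:
			// Forget everything so a changed service type eventually
			// gets re-sent downstream.
			seen = make(ServicesMetadata)
		}
	}
}
```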
As far as I understand, `ServiceWriter` already implements such a cache. That's why the approach taken here is so simplistic. I'm merely creating another "path" for ingesting the service metadata, but I'm still relying on all the housekeeping done downstream. I might have to reconsider it though, because of the race condition mentioned.
Force-pushed from 14237ab to 75d9fb5 (TraceServiceExtractor).
writer/service_writer.go
Outdated
@@ -179,7 +175,7 @@ func (w *ServiceWriter) updateInfo() {
 	swInfo.Retries = atomic.SwapInt64(&w.stats.Retries, 0)

 	w.conf.StatsClient.Count("datadog.trace_agent.service_writer.payloads", int64(swInfo.Payloads), nil, 1)
-	w.conf.StatsClient.Gauge("datadog.trace_agent.service_writer.services", float64(swInfo.Services), nil, 1)
+	w.conf.StatsClient.Count("datadog.trace_agent.service_writer.services", int64(swInfo.Services), nil, 1)
We should keep a Gauge here: it represents an instant value (number of different services seen) not an increment. Over a given time frame, it doesn't make sense to sum these values.
Actually it's quite the opposite now. Before, due to the cache at the `ServiceWriter` level, we would flush the same services over and over again, so this number would monotonically increase. Now we just keep track of the number of new services flushed. Perhaps we now want a gauge representing the total number of services tracked at the `ServiceMapper`. If we keep the gauge for the `ServiceWriter`, we will have super noisy graphs (they will be blank for the most part).
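To spell out the difference, annotating the two calls from the diff above (these are the existing lines, not new API):

```go
// Count: "this many new services were flushed during the interval";
// values are summed over the flush window, so it fits a delta.
w.conf.StatsClient.Count("datadog.trace_agent.service_writer.services", int64(swInfo.Services), nil, 1)

// Gauge: "this many services are currently tracked"; only the last value is
// kept, so it fits an instant total (e.g. at the ServiceMapper).
w.conf.StatsClient.Gauge("datadog.trace_agent.service_writer.services", float64(swInfo.Services), nil, 1)
```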
Oh right, I missed this order / the fact that we send `changes` and not `s.cache`:

    s.out <- changes
    s.cache.Merge(changes)
Do you want me to add a gauge at the `ServiceMapper`? I already have an instrumentation "ticker" in place (in the event loop).
Nope, it is all good as is :)
Force-pushed from 3a053a6 to 08d0c21.
Force-pushed from 55311e9 to 279fd0b.
This PR adds `TraceServiceExtractor` to our trace processing pipeline. The idea is to extract service metadata out of all top-level spans belonging to a given trace and send it downstream (to `ServiceWriter`). By doing this we intend to simplify instrumentation on the client side, since we no longer require service metadata to be set at the tracer level.
Update
Based on feedback from @LotharSee and @ufoot I've made the following changes:
- Created `ServiceMapper`, which should now be the entry point for ingesting `ServicesMetadata` (coming either from HTTP requests or from metadata embedded in traces). `ServiceMapper` is just a cache layer around `ServiceWriter`;
- Removed the `DeepEqual` logic from `ServiceWriter`.