Store upstream paths in transactions/spans for service maps #364
Right now the assumption is that we should store these paths on both spans and transactions:
Let's suppose we have the following service map: We can describe it with the following events: [
{ "processor.event": "transaction", "service.name": "a" },
{ "processor.event": "span", "service.name": "a", "span.destination.service.resource": "service-b:3000", "span.destination.hash": "hashed-service-a", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "b", "transaction.upstream.hash": "hashed-service-a" },
{ "processor.event": "span", "service.name": "a", "span.destination.service.resource": "service-c:3001", "span.destination.hash": "hashed-service-a", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "c", "transaction.upstream.hash": "hashed-service-a" },
{ "processor.event": "span", "service.name": "b", "span.destination.service.resource": "proxy:3002", "span.destination.hash": "hashed-service-a-b", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "d", "transaction.upstream.hash": "hashed-service-a-b" },
{ "processor.event": "span", "service.name": "c", "span.destination.service.resource": "service-d:3003", "span.destination.hash": "hashed-service-a-c", "event.outcome": "success" },
{ "processor.event": "transaction", "service.name": "d", "transaction.upstream.hash": "hashed-service-a-c" },
{ "processor.event": "span", "service.name": "b", "span.destination.service.resource": "postgres:3004", "span.destination.hash": "hashed-service-a-b", "event.outcome": "failure" },
{ "processor.event": "span", "service.name": "d", "span.destination.service.resource": "postgres:3004", "span.destination.hash": "hashed-service-a-c-d", "event.outcome": "success" }
] To get the global service map:
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : null,
"service.name" : "a",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a",
"service.name" : "b",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a",
"service.name" : "c",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-b",
"service.name" : "d",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : null,
"transaction.upstream.hash" : "hashed-service-a-c",
"service.name" : "d",
"span.destination.service.resource" : null
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a",
"transaction.upstream.hash" : null,
"service.name" : "a",
"span.destination.service.resource" : "service-b:3000"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a",
"transaction.upstream.hash" : null,
"service.name" : "a",
"span.destination.service.resource" : "service-c:3001"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-b",
"transaction.upstream.hash" : null,
"service.name" : "b",
"span.destination.service.resource" : "postgres:3004"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-b",
"transaction.upstream.hash" : null,
"service.name" : "b",
"span.destination.service.resource" : "proxy:3002"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-c",
"transaction.upstream.hash" : null,
"service.name" : "c",
"span.destination.service.resource" : "service-d:3003"
},
"doc_count" : 1
},
{
"key" : {
"span.destination.hash" : "hashed-service-a-c-d",
"transaction.upstream.hash" : null,
"service.name" : "d",
"span.destination.service.resource" : "postgres:3004"
},
"doc_count" : 1
}
] We can then construct the paths by mapping For dependency metrics (e.g. request rate from service A to service B or service A to postgres), we should filter the documents on |
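As a rough illustration of the join described above, the sketch below matches each span bucket's `span.destination.hash` against the `transaction.upstream.hash` of the transaction buckets to derive service-to-service edges. The `service_edges` helper and the abbreviated `BUCKETS` list are hypothetical and only mirror the example data; this is not Kibana code. With these example hashes the join cannot tell which of a service's exit spans reached which downstream service, so only deduplicated service pairs are returned.

```python
from typing import List, Optional, Set, Tuple

# Composite-aggregation keys from the example above, abbreviated to
# (service.name, span.destination.hash, transaction.upstream.hash).
# Span buckets that differ only in destination resource are collapsed.
Bucket = Tuple[str, Optional[str], Optional[str]]

BUCKETS: List[Bucket] = [
    ("a", None, None),
    ("b", None, "hashed-service-a"),
    ("c", None, "hashed-service-a"),
    ("d", None, "hashed-service-a-b"),
    ("d", None, "hashed-service-a-c"),
    ("a", "hashed-service-a", None),
    ("b", "hashed-service-a-b", None),
    ("c", "hashed-service-a-c", None),
    ("d", "hashed-service-a-c-d", None),
]


def service_edges(buckets: List[Bucket]) -> Set[Tuple[str, str]]:
    """Join span buckets to transaction buckets on the propagated hash."""
    # Index downstream services by the upstream hash their transactions carry.
    downstream = {}
    for service, _, upstream_hash in buckets:
        if upstream_hash is not None:
            downstream.setdefault(upstream_hash, set()).add(service)

    edges = set()
    for service, destination_hash, _ in buckets:
        if destination_hash is None:
            continue  # transaction bucket, nothing outgoing to resolve
        for target in downstream.get(destination_hash, ()):
            edges.add((service, target))
    return edges


print(sorted(service_edges(BUCKETS)))
```

A span bucket whose hash matches no transaction (here the one on service `d`) produces no service edge in this sketch; it would have to be drawn from its destination resource instead.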
We should also look into how this affects the cardinality of transaction/span metrics. |
@dgieselaar nice! Say we replaced D with two identical services D1 and D2, and say the proxy load-balances across them. In that case we would have a one-to-many relation from upstream |
Tagging @AlexanderWert who has experience building a metrics-based service map. |
@axw what are two identical services D1 and D2 that are interchangeable (load-balanced)? Shouldn't this be considered a wrong setup where the user should be advised to set the same service name "D" for both and rely on Besides, in @dgieselaar's aggregation example, if they do have different service names configured, the combined key will be different, thus the count can be done separately, or did I misunderstand this? |
Sorry, I meant identical in terms of their input/output and interaction with other services, not necessarily the exact same code. They could be two implementations of a service (e.g. you're migrating from a Java to a Go implementation), and might have slightly different
If we introduce d2:
... and What would we show on the edges |
Regardless, I believe that any interchangeable nodes (ones that can be load-balanced) should belong to the same service in our terminology and concepts. Any other filtering/aggregation should rely on other data like agent type, environment or node name.
I see. Will this be solved if |
Actually, without this, how would there even be edges |
@axw:
I didn't intend for the proxy to be shown on the actual service map, my bad. We would ignore it, as we have a match for a In this example, I think we could show a split edge from service C to D1/D2, and show the edge metrics once, if that makes sense.
Agree that |
@felixbarny thank you for looping me in. I just wanted to drop in a different idea / approach to realize the service map purely on metric data, thus detaching it from the need of collecting 100% of traces / spans, etc. Feels related to this issue. The concept is quite simple, based on the following:
We would get a set of metrics with the following conceptual structure (here illustrated as a table): These metrics represent in their tags (origin-service, service) bilateral dependencies between services, so they can be used to reconstruct a graph / service map with corresponding metric values attached. This is just the core idea; if it is of interest I can elaborate more on the details. With some additional context propagation and tagging of metrics, this approach is quite powerful, and allows for the following (while it is highly scalable in terms of data collection and query/data processing):
|
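To make the point-to-point idea concrete, here is a minimal sketch of rebuilding a graph from metrics tagged with an origin service and a target service. The row layout, the `request_count` field, and the numbers are all made up for illustration, since the original table is not reproduced here.

```python
from collections import defaultdict

# Hypothetical metric rows: each carries the tags (origin_service, service)
# plus an illustrative request count. The values are invented for this sketch.
METRICS = [
    {"origin_service": "a", "service": "b", "request_count": 120},
    {"origin_service": "a", "service": "c", "request_count": 80},
    {"origin_service": "c", "service": "d", "request_count": 80},
    {"origin_service": "a", "service": "b", "request_count": 30},
]


def build_service_map(rows):
    """Sum request counts per (origin, target) pair into an adjacency map."""
    graph = defaultdict(lambda: defaultdict(int))
    for row in rows:
        graph[row["origin_service"]][row["service"]] += row["request_count"]
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {origin: dict(targets) for origin, targets in graph.items()}


print(build_service_map(METRICS))
```

Each key in the result is a node with outgoing edges, and each nested value is the metric attached to that edge.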
If I understand correctly, we would have something like (apologies, I do not have @felixbarny's ASCII art mastery):
I think that works well. Seeing as the edge metrics are meant to be from C's perspective, I suppose it makes sense that they're not attributed to a particular service on the edges. We can still look at transaction/node metrics for the split. How would we know that we should remove the proxy from the graph, and that it's in between C and D? Perhaps like @eyalkoren described above, we include the destination service resource (proxy:...) in the outbound hash, and propagate that? @AlexanderWert thanks for your input!
We don't necessarily have to capture 100% of traces/spans. We have recently started aggregating metrics based on trace events in APM Server, and we scale them based on the configured sampling rate. The metrics are then stored and used for populating charts (currently opt-in, expected to become the default in the future.) I think it would make sense to extend these metrics as described above to power the service map.
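One common way to scale metrics derived from sampled events is to weight each event by the inverse of its sampling rate. The sketch below shows that arithmetic only; whether APM Server computes it exactly this way is an assumption, and both function names are hypothetical.

```python
def representative_count(sample_rate: float) -> float:
    """Number of real events one sampled event stands for (assumed 1 / rate)."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return 1.0 / sample_rate


def scaled_request_count(sample_rates) -> float:
    """Estimate the true request count from the rates of the sampled events."""
    return sum(representative_count(rate) for rate in sample_rates)


# Two events sampled at 50%, one at 25%, one unsampled-rate (100%) event.
print(scaled_request_count([0.5, 0.5, 0.25, 1.0]))
```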
I'd just like to clarify one thing here. IIANM, what you illustrated in the table is a point-to-point graph representation. In that model you're right, it doesn't matter if we propagate the service name or a hash of it (disregarding possible privacy concerns). That's certainly an option, and would keep things fairly simple. What @dgieselaar has described above is instead a path representation of a graph. This will enable the UI to filter the graph down to a subgraph that includes some node(s), and then only show metrics related to the paths through those nodes and not the excluded nodes. I'd be very interested to hear if you have experience with this approach. |
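The difference between the two representations can be sketched with the example services from earlier in the thread. Assuming the full paths are known (here written out by hand as `PATHS`, which is hypothetical data), filtering down to a focused service keeps only the edges on paths that pass through it. The `edges_through` helper is illustrative, not an existing API.

```python
# Hypothetical service-level paths for the example map: a -> b -> d and a -> c -> d.
PATHS = [("a", "b", "d"), ("a", "c", "d")]


def edges_through(paths, focus):
    """Return the edges of every path that contains the focused service."""
    edges = set()
    for path in paths:
        if focus in path:
            # Pair each service with its successor on the path.
            edges.update(zip(path, path[1:]))
    return edges


print(sorted(edges_through(PATHS, "b")))
```

Focusing on `b` drops the `a -> c` and `c -> d` edges, which a purely point-to-point edge list could not do, because it does not record which `d` traffic arrived via `b`.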
It's removed from the graph by virtue of the span on service C being connected to the transaction on service D, via the hash. I'm not sure if we can tell that there is a proxy in between, or a load balancer, or any other non-instrumented services, even if |
I will assume "C" in the last comments was meant to be "B", even though the last one is confusing because there is a
If I read this correctly, it means that given these keys:
there is a However, it looks the same as looking at:
How would you know that
I think you are right, this is not enough by itself to discover a proxy. Maybe this is something we can rely on request headers for - I think that the As for load balancing (@axw's example), assuming we do send the destination, this should be easier - if multiple services (transactions) have the same upstream path AND destination, then you have enough info to add a load-balancer node to the map and have metrics for all edges - the edge to the load balancer and each edge from the load balancer to the service. |
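The load-balancer rule in the comment above could be sketched as follows, assuming the upstream hash already encodes the destination. The function name, the input shape, and the hash strings in the usage example are hypothetical.

```python
from collections import defaultdict


def find_load_balanced_groups(transactions):
    """Group services whose transactions share one upstream path + destination.

    `transactions` is an iterable of (service_name, upstream_hash) pairs.
    Any hash seen on more than one service suggests a load balancer in front.
    """
    by_upstream = defaultdict(set)
    for service, upstream_hash in transactions:
        if upstream_hash is not None:
            by_upstream[upstream_hash].add(service)
    return {
        upstream_hash: sorted(services)
        for upstream_hash, services in by_upstream.items()
        if len(services) > 1
    }


# Made-up hashes: b1 and b2 are reached through the same destination.
example = [
    ("a", None),
    ("b1", "hash(a|lb:80)"),
    ("b2", "hash(a|lb:80)"),
    ("c", "hash(a|service-c:3001)"),
]
print(find_load_balanced_groups(example))
```

Each returned group is a candidate for a synthetic load-balancer node with one incoming edge and one outgoing edge per listed service.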
Ron suggested to do a POC, perhaps we can pivot elastic/kibana#82598 into one? That way we don't need agent support, and we need to calculate paths there anyway. Thoughts? |
I'm a little confused by |
After a quick call with @eyalkoren, I understand what you mean and you are right: the outgoing hash should include the perceived destination. If we don't do that, when service A is talking to service B and postgres via the same hash ( |
One more thing to notice: if service B had two nodes behind a load balancer and the user chose to assign each its own unique service name, say - B1 and B2, then adding the destination helps with that as well - once you see that two services get the same upstream path (including the destination, e.g.
|
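A minimal sketch of an outgoing hash that includes the perceived destination, as discussed in the comments above. The choice of SHA-256, the `|` separator, and the `outgoing_hash` name are assumptions made for illustration; the thread does not specify a hash function or encoding.

```python
import hashlib
from typing import Optional


def outgoing_hash(upstream_hash: Optional[str], service_name: str,
                  destination_resource: str) -> str:
    """Hash the upstream path, the current service, and the destination."""
    # A root service has no upstream hash; use an empty string in that case.
    parts = [upstream_hash or "", service_name, destination_resource]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


# Service a calls b; b then calls both a proxy and postgres.
a_to_b = outgoing_hash(None, "a", "service-b:3000")
b_to_proxy = outgoing_hash(a_to_b, "b", "proxy:3002")
b_to_postgres = outgoing_hash(a_to_b, "b", "postgres:3004")

# The two exit spans on b no longer share a hash.
print(b_to_proxy != b_to_postgres)
```

Because the destination is part of the input, a transaction that reports `b_to_proxy` as its upstream hash can be attributed to the proxy call rather than to the postgres call.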
Sounds like a good idea to me. Perhaps start with a small POC (e.g. using some hand-written data like above) to validate the idea generally, and then expand on that by generating some complex graph data to test the scalability. |
To work around the load balancer issue (which is actually happening on dev-next right now, see elastic/kibana#83152 (comment)), we could consider having a called service reply with a response header containing its own hash. The calling service would then use this hash when storing span metrics. If the response header is not there, the calling service will hash its own hash + |
@dgieselaar response headers are of course an option that opens even more possibilities, however it means an implementation of a new capability by all agents, including the potential added complications (e.g. such related to modifying a response). |
@eyalkoren How would we correctly attribute span metrics to either B1 or B2? I thought metrics would be aggregated for A -> LB only. |
Let's assume we have these data:
Because we append the destination to the path, the fact that two services have the same |
@eyalkoren it does, I was operating under the assumption that we'd use span metrics always. Are there any downsides to mixing those two? |
In this case it is actually straightforward to rely on both I think. In other cases, there may be contradictions, so we will have to decide how we treat those. I agree it will require some thinking. |
Instead of doing a composite agg on
|
I believe with the current discussed approach you would be able to draw B and C with proper metrics, but not D. I don't think we can rely on attributing excess exit counts from A (based on spans) to an "external service" because mismatches between span metrics and transaction metrics will be common. In order to support that, we may add a @dgieselaar do I understand correctly that currently the idea is to do a POC with the discussed approach based on span and transaction documents and to apply that in the future to rely purely on stored metrics? |
If the service responds with its own hash, and the calling service uses that hash to store its span metrics, we would not need transaction metrics, and we would have an "other" bucket where D would fall under, but calls that fail at the gateway or network issues would also fall into that bucket.
Yes, but I'm not sure if we will get to that in the 7.11 timeframe. Might need @sqren or @graphaelli here for some prioritisation. Also, there are a couple of approaches in play, I'm not sure if we decided which one is best. We can investigate some of it in a POC. |
Exactly, so in this regard it means adding complication but being left with the same limitation. Moreover, one limitation to keep in mind with response headers is that they will not be able to support async communication, like messaging systems. If you think of a message bus used to create requests to multiple services, you can support that through the use of different destination resources/sub-resources (e.g. message queues/topics), but response headers are irrelevant for such use cases. |
We are already pretty strapped for time and since service maps is not on the roadmap goals for 7.11 any bigger improvements will have to wait until 7.12. |
Can you elaborate why no data changes are needed for your suggested approach? AFAICT, there needs to be some kind of property on transaction metrics that identifies its upstream, and I don't think we have that yet? |
You're right, we need to implement the addition to |
Just to add here - we don't know ahead of time how the destination route is defined in terms of API gateway rules. Would it be possible to dynamically calculate traffic to D based on overall traffic to host:port minus traffic to A and B? |
My suggestion is to make it opt-in anyway, so you might as well add a complex (meaning non-boolean) config to define the routing factor. But I'd say this is for advanced use cases.
Probably not, especially if we are moving to rely on metrics in the future. When using metrics you must assume discrepancies between metrics reported about the same connection from two different services, even if you enforce rigid synchronization of metric collection and sending. Using counts in such metrics (as opposed to rates/percentages) may be very tricky. |
Another thing to keep in mind when implementing this mechanism: there are cases where the middle-node is such that we DO want to show, even if there is only one direct connection, for example - message queues. |
@sqren @felixbarny and I have been discussing adding configuration to APM Server to set a default value for I don't think it makes sense to include the environment in the hash only sometimes (i.e. only when the environment is known to the agent), as that way we could end up with the same hash for multiple environments (i.e. when the default is changed); so I think we'll have to leave it out altogether. @dgieselaar do you see any problem with that? |
Do you mean leaving out the environment altogether? I think we decided to show calls to a service in different environments separately (correct me if I'm wrong @sqren @formgeist). If the service environment is not included in the hash, I think we will collapse multiple environments into one, is that correct? |
Yes. Kinda. We can still filter on |
I've implemented a POC using edge metrics in elastic/kibana#114468. It roughly does as described here. It also shows message queues and load balancers. However, it does/will not work for OTel, or older agents. My main goal here is to get rid of the scripted metric aggregation, so I want to propose the following alternative for OTel/older agents:
The main difference from the hashed approach is that we'll display any service that was part of an inspected trace. So if service A is talking to service B and service C in the same trace, and we focus on service B, we display both service A => B and service A => C. Is that an acceptable tradeoff? |
@dgieselaar nice! Sorry it took me a while to get to it, I checked out your POC and it looks sensible.
I think this is fine. There are likely other data sources we can consume which will only provide point-to-point edge information (I'm thinking of Istio/service meshes), and not paths, meaning we could not filter down to only the relevant paths for a given service. Do you have an idea of what the performance is like for your proposal for old/OTel agent data, compared to the current scripted metric agg approach? I think it sounds OK, but there will likely be an extensive period where users will be running older agents. |
I have some concerns about the general approach of relying on propagating path hashes downstream and the fact that this makes OTel traces a second-class citizen. The proposal works for situations where all data is coming from up-to-date Elastic APM agents. However, in scenarios where there's a mix of OTel agents and different versions of Elastic agents, it doesn't work reliably anymore. It seems like there are lots of different goals when it comes to service map improvements. Let me try to untangle them.
|
@axw:
Haven't measured, but it's 3 consecutive requests, so it'll be slower. A lot of it will depend on the sampling rate.
We can overlay them, but it won't solve any performance issues.
The scripted metric aggregation is mostly very unpredictable in terms of performance. E.g., if your traces are very long, it will use a lot of memory (and even cause an OOM). The new proposed approach might be a bit slower in some cases but it will definitely be more robust. Fwiw, I think it's reasonable to migrate to that approach first, and then figure out using metrics later.
I think that's a fair concern - we might want to consider making it optional in any case.
Yeah, I don't think there's a way to reliably do this today, or in the near future. Plus there are downsides to post-processing (historical data, searchable snapshots, etc).
It is, but the performance/reliability improvements are pretty important IMO, and I have not seen an alternative yet. Additionally, requiring transactions/spans for the service map to work means that as soon as those are dropped, it'll break, and the accuracy will decrease as the sampling rate goes down. |
Can we limit the number of traces we're looking at to detect connections?
That's great!
ack
I think we should be opinionated and have exactly one way that a service map is drawn. My main concerns are usability, maintenance and code complexity. To maintain backwards compatibility, we may be forced to maintain two implementations at the same time. If that gets multiplied by different ways to draw the service map things get even more dire.
Agreed. But mid- to long-term it's something we could work on together with the stack team.
That's unless we store a persistent representation of the service connections that still works when traces are deleted. |
We have to use a composite agg to get a diverse set of traces, and that means iterating over the whole data set (at least, restricted by transactions and exit spans). The number of traces we inspect can be constant and relatively low.
I too would like to have one way - but I don't think these will result in vastly different implementations, at least not on Kibana's side. One issue I do see is that we cannot (easily) auto-detect which version to use.
IMO these options are so far out that I feel we are better off ignoring them for the sake of this conversation. There's one more downside to the OTel/legacy approach: it can only see connections from the perspective of the caller. Let's say you have service A talking to a messaging system, and that messaging system is talking to service B and C. With OTel, we might only show Service A => messaging system => service B (or service C) due to the sampling approach. This is not an issue with the hash-based approach. |
We met this morning and agreed to move forward with the OTel-based approach as a first step. This achieves the goal of removing the scripted metric aggregation and supporting OTel/legacy agents without any work needed from the agents. Once this is implemented we can evaluate performance and accuracy more appropriately and decide whether or not additional work is needed.
There are some gaps with the OTel-based approach that I will try to list here. The main issue is that it is hard to get a representative set of traces without hashes. This is specifically an issue with the focused service map, but there are also some scenarios where the global service map might not display some connections. I’ve wireframed some example scenarios:
In the first scenario, we are looking at a focused service map for service C. We get sample traces that include A) a transaction happening on service C, and B) an exit span that talks to service D. This means that it is not guaranteed that we’ll see traces for both A and B talking to C, and D talking to F and E. The dashed connections might not show up, which is more likely as the number of traces to inspect goes up and also when certain connections occur more often than others.
In the second scenario, we are looking at a global service map. We get sample traces that include A) a transaction happening on service A, B, C, and D, B) an exit span on service A that talks to an API gateway ( In this scenario, we will see at least the connection between the API gateway and service D, but it is not guaranteed that we’ll see the connection to service C, for the same reason as previously listed - we might have sampled a trace to service D only, based on the exit span. |
I wonder if using a diversified sampler aggregation on the transaction name field could improve the chances of detecting more connections. In cases where a particular transaction group is dominant, we would also detect connections that come from rarer transaction groups. |
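As a sketch of what such a request might look like, the helper below builds an Elasticsearch search body that uses the `diversified_sampler` aggregation on `transaction.name` and collects trace IDs from the sample. The helper name, the default sizes, and the `processor.event` filter are assumptions for illustration and are not taken from the Kibana implementation.

```python
import json


def diversified_trace_sample_query(shard_size=200, max_docs_per_value=10,
                                   trace_id_count=50):
    """Build a search body that samples traces across transaction names."""
    return {
        "size": 0,
        "query": {"term": {"processor.event": "transaction"}},
        "aggs": {
            "sample": {
                # Cap how many docs any single transaction name contributes,
                # so a dominant transaction group cannot fill the sample.
                "diversified_sampler": {
                    "shard_size": shard_size,
                    "field": "transaction.name",
                    "max_docs_per_value": max_docs_per_value,
                },
                "aggs": {
                    "trace_ids": {
                        "terms": {"field": "trace.id", "size": trace_id_count}
                    }
                },
            }
        },
    }


print(json.dumps(diversified_trace_sample_query(), indent=2))
```

The returned trace IDs would then feed whatever step walks the sampled traces to detect connections.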
It might. There's a perf hit that comes with it as well, though. |
@alex-fedotyev if I look at some of the perf issues that our bigger users have, the service map always comes up. I don't think we can easily improve performance with the OTel-compliant approach without making the data it displays even more unreliable. Does it still make sense to prioritise an OTel-compliant approach? |
We currently walk traces (via a scripted metric aggregation) to get paths/connections between services. However, that's untenable for a couple of reasons:
One solution is to store (hashed) paths in transaction or span metrics, per @axw's suggestion.
Here's how that could possibly work:
We should consider the following use cases when deciding where and how to store the hashed paths:
One requirement is that we should be able to resolve all connections with one or two requests, without using a scripted metric aggregation.