Add support for partial success in an OTLP export response #2636
Conversation
Even though the Proto PR is not merged, I thought of opening the spec change already, to kick-start discussions.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@open-telemetry/specs-metrics-approvers please take a look. The corresponding change in OTLP has been approved: open-telemetry/opentelemetry-proto#390
specification/protocol/otlp.md
Outdated
the specific message to use in the [Success](#success),
[Partial Success](#partial-success) and [Failure](#failures) cases).

##### Success

The success response indicates telemetry data is successfully processed by the
"processed" seems very vague. I think producers care the most about the chain of custody, i.e. that the data was handed over to the server who is now responsible for this data, subject to its SLOs (including SLOs for data loss).
If we phrase it in terms of accepting responsibility, then the next paragraph about "processing at a later point" becomes irrelevant.
Yeah, now that you mention it, I think I agree. I will rephrase it.
Changed in 75c4bf1. @yurishkuro let me know what you think.
specification/protocol/otlp.md
Outdated
`ExportLogsPartialSuccess` message for logs), and it MUST set the respective
`accepted_spans`, `accepted_data_points` or `accepted_log_records` field with
the number of spans/data points/log records it accepted. In case the server
rejected everything, the `accepted_*` field MUST be set to `0`.
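A minimal sketch of the rule in this diff; `ExportTracePartialSuccess` here is a stand-in dataclass for the proposed proto message, and `build_partial_success` is a hypothetical helper, not part of any OTLP library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExportTracePartialSuccess:
    # Mirrors the proposed proto fields: how many spans the server accepted,
    # plus a free-form, human-readable error message.
    accepted_spans: int = 0
    error_message: str = ""


def build_partial_success(received: int, rejected: int) -> Optional[ExportTracePartialSuccess]:
    """Return a partial-success message only when some records were rejected."""
    if rejected == 0:
        return None  # full success: no partial_success message is set
    return ExportTracePartialSuccess(
        # MUST be 0 when the server rejected everything.
        accepted_spans=received - rejected,
        error_message=f"{rejected} spans rejected",
    )
```

The full-success case deliberately leaves the message unset, matching the expectation that most responses carry no partial-success payload.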
if it rejected everything, how is it a "success"?
It's tricky. One can see it as a partial success in the sense that the message was "valid": the server was able to parse it and tried to accept/process it, but couldn't for whatever reason. @jmacd and I discussed this in the Proto PR and both of us tend to prefer returning `0` to indicate such a situation rather than returning an error. For example, a `400 Bad Request` doesn't fit well, nor does a `500`.
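Under this reading, a client could interpret the response roughly as below; `describe_export_response` is a hypothetical helper, and `accepted is None` stands for "no partial_success message was set":

```python
from typing import Optional


def describe_export_response(sent: int, accepted: Optional[int], error: str = "") -> str:
    """Sketch of client-side interpretation of an export response (illustrative)."""
    if accepted is None:  # no partial_success message: full success
        return "success"
    # accepted == 0 still arrives as a success-status response: the payload was
    # valid and parsed, but every record was rejected, so the client must not retry.
    return f"partial success: sent {sent}, accepted {accepted}; {error}"
```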
@@ -382,8 +432,9 @@ numbers or strings are accepted when decoding.

#### OTLP/HTTP Response

Response body MUST be the appropriate serialized Protobuf message (see below for
Why are we repeating such a large body of text that is mostly identical to OTLP/gRPC? This constitutes the semantic definition of the protocol, which is independent of the encoding / transport used.
I thought the same 😕. I had an idea about putting the "common" things in a separate section, but I was concerned the PR would touch too many things and be harder to review.
There are a few nuances. For example, for HTTP we indicate which `Content-Type` should be used for the response, the compression, and which status code to return. Such things don't apply to gRPC, so I was not sure what makes sense. Having a combined section and saying "if HTTP, return xyz header" doesn't sound good. Another point is that someone implementing a gRPC server doesn't have to read the HTTP parts, and vice versa.

I'm open to improving it, but I'm not sure to what extent. It could also be a follow-up, to make the review easier? Open to suggestions :)
Co-authored-by: Tigran Najaryan <[email protected]>
### Partial Success
#### Partial Success Retry

Each server has its particularities and its way of treating data. There
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a bit verbose. I think we should specify more explicitly here:
- Clients, by default, will not attempt automatic retry.
- Servers should leverage partial success when they fully understand that resending the exact same bundle of telemetry would lead to another error (preventing retry loops).
The rest of this commentary is nice for WHY, but the WHAT should be called out explicitly.
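The two explicit rules proposed in this comment could be sketched as a retry decision; `should_retry` is a hypothetical client-side helper, not spec text:

```python
def should_retry(status_ok: bool, retryable_status: bool) -> bool:
    """Illustrative client retry decision for an OTLP export response."""
    if status_ok:
        # Success or partial success: by default the client does NOT retry,
        # since resending the exact same bundle would hit the same rejections
        # (this is what prevents retry loops).
        return False
    # Only transport-level failures explicitly marked retryable (e.g. throttling,
    # transient server errors) warrant resending the data.
    return retryable_status
```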
@tigrannajaryan has raised concerns about the use of `optional`. I believe we should revise this PR so that the `partial_success` messages always have a meaningful `0` value, which means counting dropped things rather than accepted things.
I still have a problem with "Errors from the server can be viewed in two separate categories: client errors and server errors."
I feel like the wording in this PR is trying to ignore this complexity and use poorly defined terms like "accepted". So I suggest first starting with a clear motivation: why does the client need to know about partial success? What are the possible outcomes of such a response? This will then inform how the response format should look.
I don't see it that way. This specification refers to conditions that are neither client errors nor server errors, and it does not change the interpretation of either. The data is well formed, there is neither a client error nor a server error, but there is still a problem in the data. One easy example: an OTLP export request arrives with weeks-old data. We are not willing to accept such old data, so we have to drop those points. It's not a client error, not a server error; it's likely a wall-clock error. We have no grounds to fail the request (it was not a client error nor a server error) -- the OTLP was also well formed, it just couldn't be accepted.

My goal in supporting this proposal is that I think we MUST have a way to count the number of points that are being accepted vs. dropped. I support having a free-form error message string too, but I am not willing to include an error message for every point that is not accepted, thus I need a way to count points accepted/dropped one way or another, independent of the error details string.
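The weeks-old-data example could be handled server-side roughly like this; `split_by_age` and the two-week cutoff are illustrative assumptions, not spec requirements:

```python
# Assumed cutoff for this sketch: data older than two weeks is dropped.
MAX_AGE_SECONDS = 14 * 24 * 3600


def split_by_age(timestamps_unix_nano, now_unix_nano):
    """Partition data points into accepted and dropped by staleness (illustrative).

    The dropped count would feed the partial-success response; the request
    as a whole is still a (partial) success, not a client or server error.
    """
    cutoff = now_unix_nano - MAX_AGE_SECONDS * 1_000_000_000
    accepted = [t for t in timestamps_unix_nano if t >= cutoff]
    dropped = [t for t in timestamps_unix_nano if t < cutoff]
    return accepted, dropped
```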
Your example seems quite clearly a client error, which breaks the implied contract that the dates should be within a certain time range. The server can reject it with a "client error" indicator and the client will know it should not resubmit it. Saying that it's some grey area is precisely what bothers me in the proposal. We can argue about the validity of certain server decisions, but the protocol must be unambiguous, with a clear path forward after receiving the partial response.

Currently the only path forward after receiving a partial success is to do nothing functional. The client can log something, but it cannot resubmit the data that was not "accepted" (for reasons unknown in a machine-readable way). In this case, why not just add accepted counts to the regular successful response? It's backwards compatible. It also doesn't address the server-error situation where some items could be retried; that doesn't fit in any of the 3 response categories. I'd still have a conceptual issue with the server returning all accepted counts as zeros: it's equivalent to a failed response, so the duality indicates a design issue.
I do not believe there is any such contract. In any case, what would you advise if half of the data is valid (good clocks) and half of the data is out of date?
I think this is what this proposal does. It adds an optional message that is an indication to the client that it needs to log an error. The message contains a human readable error text and a number that indicates how many records were successful. Both of these items (error text and successful count) seem useful for diagnosing bad situations. I would expect the client to log something like this: "Request partially successful. Sent $sent_records, accepted $accepted_records. Error text: $error_text". Such a message IMO can be very useful for understanding why you send X records, but somehow only Y appear on the backend.
Not necessarily equivalent. Accepted=0 indicates the Server has a problem with individual data records (all of them). To use @jmacd's example: perhaps all data point timestamps are stale and cannot be accepted, although the data itself is understandable to the Server. As opposed to that, a failed response indicates that the Server has a problem with the request as a whole (i.e. it can't be decoded as Protobuf, hence returns a 400 Bad Data response).

Perhaps the design issue here is that we have 2 different ways to indicate an error to the client: one using HTTP response codes and another using Protobuf messages. I think this is inevitable. We do want to return proper HTTP response codes when they are applicable (i.e. 400 when the request payload is impossible to decode). However, HTTP response codes alone are not sufficient to convey more nuanced situations, hence additional data must be present in the response body, which happens to be a Protobuf message. In theory we could say that all failure information should be conveyed via the response body in the Protobuf message (i.e. always return 200 and a body with an "error=true" field inside). Unfortunately we can't do this anymore, since that would be a breaking change.
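The two signalling channels described here can be sketched as follows; `respond` is a hypothetical server-side helper, not part of any OTLP library:

```python
def respond(payload_decodable: bool, received: int, rejected: int):
    """Sketch of the two error channels: HTTP status vs. response body.

    Returns an (http_status, partial_success_body) pair; the body is a plain
    dict standing in for the serialized Protobuf message.
    """
    if not payload_decodable:
        # Problem with the request as a whole -> HTTP status code channel.
        return 400, None
    if rejected == 0:
        # Full success: no partial_success message in the body.
        return 200, None
    # Problem with individual records -> still 200, details in the body.
    return 200, {
        "accepted": received - rejected,
        "error_message": f"{rejected} records rejected",
    }
```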
Another example we have at Lightstep is metrics-type related errors. You cannot begin reporting a Gauge as a Counter, for example, and if you change the metric type we consider it an error. It will be the common case that a client SDK has mostly valid metrics but a few errors of this nature. We first saw this in https://github.com/lightstep/opentelemetry-prometheus-sidecar, a process that consumes Prometheus logs and sends OTLP data. In a large-scale Prometheus system (which presently has no validation errors of this sort), you are certain to have a small percentage of errors. We've seen popular projects change metric kinds in recent years, too. This is not uncommon.
I don't understand this use of "functional". The explicit desire here is for the client to be able to count how many items it is reporting vs. not reporting.
I would be happy with this outcome, for what it's worth. The reason I support putting this information into a partial_success message is that we expect the message to not be set in vast majority of cases, which means that a server handling this data will usually just have a nil pointer, i.e., one word of machine memory devoted to the rare error response.
Here I have a strong disagreement, and evidence that this is confusing to users. When you return an RPC error or a non-200 HTTP status code, alarms go off. It looks like the OTel SDK is doing something wrong, even when it's acting correctly. If the OTel SDK is acting correctly, there should never be a Client error. It's true that we don't want the client to re-send the data. Some of the data could have been valid, as described in the metrics-type examples above. Users need a way to monitor when they are dropping 2 points out of every 1000, for example, and I believe they can't rely on console logging. The case where 1000/1000 points are being dropped is certainly extreme, but it needs to be well defined so that the count is correct when it happens. You're right that the 1000/1000 case could be treated as a Client error--I'll be happy as long as all cases are well defined--but if we return Client errors it's likely to create issues in the wrong place: OTel maintainers will receive complaints instead of vendors.
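A minimal sketch of the client-side monitoring described here, assuming a hypothetical `ExportStats` helper (not part of any OTel SDK):

```python
from typing import Optional


class ExportStats:
    """Hypothetical client-side counter for monitoring dropped points."""

    def __init__(self) -> None:
        self.sent = 0
        self.rejected = 0

    def record(self, sent: int, accepted: Optional[int]) -> None:
        """accepted=None means no partial_success message, i.e. full success."""
        self.sent += sent
        if accepted is not None:
            self.rejected += sent - accepted

    def drop_ratio(self) -> float:
        return self.rejected / self.sent if self.sent else 0.0
```

With counts like these, a user can alert on a drop ratio (e.g. 2 points per 1000) without depending on console logs.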
@yurishkuro I'm not sure I agree, and I'm a bit confused as well. I used the word

The others gave good examples already, but another one I could think of is a customer hitting some sort of quota/limitation on their subscription model. A customer has a subscription allowing 100 GB of logs, and multiple apps are sending logs. A server might start accepting a batch of logs and stop halfway through, because the quota was reached by another app. Is this a client or a server error? IMHO it is neither.

Another case: a back-end might have specific rules about what it accepts for metric keys. What should a server respond in case a client sends a batch of 1000 data points where half of it has "invalid" metric keys and the server rejected them? You can say, ah, but this is a client error, they should know the back-end format for metric keys. OK, but what if the metrics are not coming from the app itself, but from a 3rd-party library the app uses, e.g. a db driver library? Today, the only way a server can signal such a case is via a 400 Bad Request, because we have a payload to put stuff in. I honestly don't think this is a good experience for anyone. With the
This was also discussed in the Proto PR, and it seemed to me we had an agreement that it might make sense to include more detailed and structured information about the errors so they could be retry-able, but this was not a focus for now. The problem is, none of us could come up with well-defined examples of such error cases, nor a good structure. Also, this would blow up the scope of this PR and require massive work around defining semantic conventions first. What we had in these two PRs seemed like a good first iteration that would help many cases, and we can always improve it and add more structure in a non-breaking manner. I even created a separate issue to discuss such extra structure/retry for metrics: open-telemetry/opentelemetry-proto#404. I'm out of ideas on how to continue with this at this point. Should we stop it altogether? Open an OTEP to discuss it?
This PR was marked stale due to lack of activity. It will be closed in 7 days.
As we agreed, I opened a new PR to change the semantics from "accepted" to "rejected": open-telemetry/opentelemetry-proto#414. Let me know your thoughts. If we agree on that approach, I will make the changes to this PR.
This can be reopened if need be, but there appears to be consensus around #2696.
Thank you @joaopgrassi !!!!
Fixes #2454
Changes
This PR adds guidance around how servers can signal a partial success to clients. It indicates in which situations servers can respond with a partial success and how the fields introduced in opentelemetry-proto #390 should be populated.
I also refactored the "OTLP/gRPC Response" section a bit, to make it consistent with the HTTP one. The main inconsistency I found was that the gRPC section did not have clear sections for `Success` and `Failure` as the HTTP one does. Now both share the same sections and the wording should be consistent between the two.

Related
Proto change: open-telemetry/opentelemetry-proto#390