Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for partial success in an OTLP export response #2636

Conversation

joaopgrassi
Copy link
Member

@joaopgrassi joaopgrassi commented Jun 29, 2022

Fixes #2454

Changes

This PR adds guidance around how servers can signal a partial success to clients. It indicates in which situations servers can respond with a partial success and how the fields introduced in opentelemetry-proto #390 should be populated.

I also refactored the OTLP/gRPC Response section a bit, to make it consistent with the HTTP one. The main inconsistency I found was that the gRPC section did not have clear sections for Success, Failure as the HTTP does. Now both share the same sections and the wording should be consistent between the two.

Related
Proto change: open-telemetry/opentelemetry-proto#390

@joaopgrassi joaopgrassi requested review from a team June 29, 2022 14:55
@joaopgrassi
Copy link
Member Author

Even though the Proto PR is not merged, I thought of already opening the spec change, to kick start discussions.

cc @tigrannajaryan @jmacd @jack-berg @reyang

@github-actions
Copy link

github-actions bot commented Jul 7, 2022

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 7, 2022
@jmacd
Copy link
Contributor

jmacd commented Jul 11, 2022

@open-telemetry/specs-metrics-approvers please take a look. The corresponding change in OTLP has been approved: open-telemetry/opentelemetry-proto#390

the specific message to use in the [Success](#success),
[Partial Success](#partial-success) and [Failure](#failures) cases).

##### Success

The success response indicates telemetry data is successfully processed by the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"processed" seems very vague. I think producers care the most about the chain of custody, i.e. that the data was handed over to the server who is now responsible for this data, subject to its SLOs (including SLOs for data loss).

If we phrase it in terms of accepting responsibility, then the next paragraph about "processing at a later point" becomes irrelevant.

Copy link
Member Author

@joaopgrassi joaopgrassi Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah now that you mentioned I think I agree. I will rephrase it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 75c4bf1. @yurishkuro let me know what you think.

`ExportLogsPartialSuccess` message for logs), and it MUST set the respective
`accepted_spans`, `accepted_data_points` or `accepted_log_records` field with
the number of spans/data points/log records it accepted. In case the server
rejected everything, the `accepted_*` field MUST be set to `0`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it rejected everything, how is it a "success"?

Copy link
Member Author

@joaopgrassi joaopgrassi Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's tricky. One can see it as a partial success in the sense the message was "valid", the server was able to parse it and tried to accept/process it but couldn't for whatever reasons. @jmacd and I discussed in the Proto PR and both of us tend to prefer returning 0 to indicate such situation than returning an error. For example, a 400 BadRequest doesn't fit well, nor a 500.

@@ -382,8 +432,9 @@ numbers or strings are accepted when decoding.

#### OTLP/HTTP Response

Response body MUST be the appropriate serialized Protobuf message (see below for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we repeating such a large body of text that is mostly identical to OTLP/gRPC? This constitutes the semantic definition of the protocol, which is independent of the encoding / transport used.

Copy link
Member Author

@joaopgrassi joaopgrassi Jul 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the same 😕. I had an idea about putting the "common" things in a separated section but I was concerned the PR would touch too many things and it would be harder to review.

Copy link
Member Author

@joaopgrassi joaopgrassi Jul 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few nuances, for ex, for HTTP, we indicate which Content-Type should be used for the response, compression, which status code to return. Such things don't apply to gRPC.. so I was not sure what makes sense. Having a combined section and doing "If HTTP, return xyz header" doesn't sound good. Another point is if someone is implementing a gRPC server, they don't have to read the HTTP stuff and vice-versa.

I'm open to improve it, but I'm not sure to what extent. It could also be a follow up to make the review easier? Open to suggestions :)

@github-actions github-actions bot removed the Stale label Jul 12, 2022
### Partial Success
#### Partial Success Retry

Each server has its particularities and its way of treating data. There
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a bit verbose. I think we should specify more explicitly here:

  1. Clients, by default, will not attempt automatic retry.
  2. Servers should leverage partial success when they fully understand that resending the exact same bundle of telemetry would lead to another error (preventing retry loops).

The rest of this commentary is nice for WHY, but the WHAT should be called out explicitly.

Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan has raised concerns about the use of optional. I believe we should revise this PR so that the partial_success messages always have a meaningful 0 value, which means counting dropped things not accepted things.

@yurishkuro
Copy link
Member

I still have a problem with accepted_<signal> = 0, and I think this problem comes from unclear motivation in the original ticket.

Errors from the server can be viewed in two separate categories: client errors and server errors.

  • Client errors mean the request or data is malformed, and the server wants to reject it. In this case there is no point for the client to resent the data.
  • Server errors mean it's not the client fault - the server may be overloaded and unable to accept the data, or the server may be trying to write to storage synchronously and there's a hiccup there. The server may indicate whether the error with the data item is retryable or not to inform the client what it can do.

I feel like the wording in this PR is trying to ignore this complexity and use poorly defined terms like "accepted".

So I suggest to first start with a clear motivation - why does the client need to know partial success? What are the possible outcomes of such response? This will then inform how the response format should look.

@jmacd
Copy link
Contributor

jmacd commented Jul 14, 2022

I feel like the wording in this PR is trying to ignore this complexity

I don't see it that way. This specification refers to conditions that are neither client errors not server errors and it does not change the interpretation of either. The data is well formed, there is neither a client error nor a server error, but there is still a problem in the data.

One easy example -- an OTLP export request arrives with weeks-old data. We are not willing to accept such old data, so we have to drop those points. It's not a client error, not a server error, it's likely a wall clock error. We have no grounds to fail the request (it was not a client error nor a server error) -- the OTLP was also well formed, just couldn't be accepted.

My goal in supporting this proposal is that I think we MUST have a way to count the number of points that are being accepted vs. dropped. I support having a free-form error message string too, but I am not willing to include an error message for every point that is not accepted, thus I need a way to count points accepted/dropped one way or another independent of the error details string.

@yurishkuro
Copy link
Member

Your example seems quite clearly a client error, which breaks the implied contract that the dates should be within a certain time range. The server can reject it with "client error" indicator and the client will know it should not resubmit it.

Saying that it's some grey area is precisely what bothers me in the proposal. We can argue about the validity of certain server decisions, but the protocol must be unambiguous with clear path forward after receiving the partial response.

Currently the only path forward after receiving partial success is to do nothing functional. Client can log something, but it cannot resubmit the data that was not "accepted" (for reasons unknown in machine-readable way). In this case, why not just add accepted counts to the regular successful response? It's backwards compatible.

It also doesn't address the server-error situation where some items could be retried - it doesn't fit in any of the 3 response categories.

I'd still have a conceptual issue with server returning all accepted counts as zeros - it's equivalent to a failed response, so the duality indicates a design issue.

@jmacd
Copy link
Contributor

jmacd commented Jul 14, 2022

breaks the implied contract that the dates should be within a certain time range.

I do not believe there is any such contract.

In any case, what would you advise if half of the data is valid (good clocks) and half of the data is out of date?

@tigrannajaryan
Copy link
Member

Client can log something, but it cannot resubmit the data that was not "accepted" (for reasons unknown in machine-readable way). In this case, why not just add accepted counts to the regular successful response? It's backwards compatible.

I think this is what this proposal does. It adds an optional message that is an indication to the client that it needs to log an error. The message contains a human readable error text and a number that indicates how many records were successful. Both of these items (error text and successful count) seem useful for diagnosing bad situations. I would expect the client to log something like this:

"Request partially successful. Sent $sent_records, accepted $accepted_records. Error text: $error_text".

Such a message IMO can be very useful for understanding why you send X records, but somehow only Y appear on the backend.

I'd still have a conceptual issue with server returning all accepted counts as zeros - it's equivalent to a failed response, so the duality indicates a design issue.

Not necessarily equivalent. Accepted=0 indicates the Server has a problem with individual data records (all of them). To use @jmacd's example - perhaps all data points timestamps are stale and cannot be accepted, although the data itself is understandable to the Server. As opposed to that a failed response indicates that the Server has a problem with the request as a whole (i.e. can't be decoded as Protobuf, hence returns 400 Bad Data response).

Perhaps the design issue here is that we have 2 different ways to indicate an error to the client: one using HTTP response codes and another using Protobuf messages. I think this is inevitable. We do want to return proper HTTP response codes when they applicable (i.e. 400 when request payload is impossible to decode). However, HTTP response codes alone are not sufficient to convey more nuanced situations, hence additional data must be present in the response body, which happens to a Protobug message. In theory we could say that all failure information should be conveyed via response body in the Protobuf message (i.e. always return 200 and a body with an "error=true" field inside). Unfortunately we can't do this anymore, since that would be a breaking change.

@jmacd
Copy link
Contributor

jmacd commented Jul 14, 2022

@yurishkuro

Another example we have at Lightstep are metrics-type related errors. You cannot begin reporting a Gauge as a Counter, for example, and if you change metric type we consider it an error. It will be the common case that a client SDK has mostly valid metrics but a few errors of this nature.

We first saw this in https://github.com/lightstep/opentelemetry-prometheus-sidecar, a process that consumes Prometheus logs and sends OTLP data. In a large-scale Prometheus system (which has no validation errors of this sort, presently), you are certain to have a small percentage of errors. We've seen popular projects change metric kinds in recent years, too. This is not uncommon.

Currently the only path forward after receiving partial success is to do nothing functional.

I don't understand this use of "functional". The explicit desire here is for the client to be able to count how many items it is reporting vs. not reporting.

Client can log something, but it cannot resubmit the data that was not "accepted" (for reasons unknown in machine-readable way). In this case, why not just add accepted counts to the regular successful response? It's backwards compatible.

I would be happy with this outcome, for what it's worth. The reason I support putting this information into a partial_success message is that we expect the message to not be set in vast majority of cases, which means that a server handling this data will usually just have a nil pointer, i.e., one word of machine memory devoted to the rare error response.

The server can reject it with "client error" indicator and the client will know it should not resubmit it.

Here I have a strong disagreement and evidence that this is confusing to users. When you return an RPC error or a non-200 HTTP status code, alarms go off. It looks like the OTel SDK is doing something wrong, even when it's acting correctly. If the OTel SDK is acting correctly, there should never be a Client error.

It's true that we don't want the client to re-send the data. Some of the data could have been valid, as described in the metrics-type examples above. Users need to a way to monitor when they are dropping 2 points out of every 1000, for example, and I believe can't rely on console logging.

The case where 1000/1000 points are being dropped is certainly extreme, but it needs to be well defined so that the count is correct when it happens. You're right that the 1000/1000 case could be treated as a Client error--I'll be happy as long as all cases are well defined--but if we return Client errors it's likely to create issues in the wrong place: OTel maintainers will receive complaints instead of vendors.

@joaopgrassi
Copy link
Member Author

I feel like the wording in this PR is trying to ignore this complexity and use poorly defined terms like "accepted".

@yurishkuro I'm not sure I agree and I'm a bit confused as well. I used the word processing to define what the server did with the data, which you pointed out to be vague and that senders care on chain of custody, which I completely agree. That's why I changed the semantics to "accepted". Once the server accepted the data, what it does after is not to be cared by senders. Maybe I misunderstood you then? 🤔

The others gave good examples already, but another one I could think of is a customer hitting some sort of quota/limitation on their subscription model. A customer with a subscription allowing 100gb of logs in their subscription, multiple apps are sending logs. A server might start accepting a batch of logs, and halfway stop because the quota was reached by another app. Is this a client or a server error? IMHO they are neither.

Another case: A back-end might have specific rules about what they accept for metric keys. What should a server respond in case a client sends a batch of 1000 datapoints where half of it has "invalid" metric keys and the server rejected it? You can say, ah but this is a client error, they should know the back-end format for metric keys. Ok, what if the metrics are not coming from the app itself, but from a 3rd party library the app uses, e.g. a db driver library? Today, the only way a server can signal such case is via a 400 Bad request because we have a payload to put stuff in. I honestly don't think this is a good experience for anyone. With the partial_success we could solve this and stop returning 400s and triggering false alarms on customers.

It also doesn't address the server-error situation where some items could be retried - it doesn't fit in any of the 3 response categories.

This was also discussed in the Proto PR, and it seemed to me we had an agreement that it might make sense to include more detailed and structured information about the error so they could be retry-able, but this was not a focus for now.

The problem is, none of us could come with well-defined examples of such error cases nor a good structure. Also, this would blow up the scope of this PR and require massive work around defining semantic conventions first. What we had in these two PRs seemed like a good first iteration that would help many cases and we can always improve it and add more structure in a non-breaking manner. I even created a separated issue to discuss such extra structure/retry for metrics open-telemetry/opentelemetry-proto#404

I'm out of ideas on how to continue with this at this point. Should we stop it altogether? Open an OTEP to discuss it?

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jul 23, 2022
@joaopgrassi
Copy link
Member Author

As we agreed, I opened a new PR to change the semantics from "accepted" to "rejected". open-telemetry/opentelemetry-proto#414.

Let me know your thoughts. If we agree on that approach I will make the changes to this PR.

@jmacd
Copy link
Contributor

jmacd commented Jul 26, 2022

This can be reopened if need be, but there appears to be consensus around #2696.

@jmacd jmacd closed this Jul 26, 2022
@jmacd
Copy link
Contributor

jmacd commented Jul 26, 2022

Thank you @joaopgrassi !!!!

@arminru arminru deleted the feature/otlp-partial-success branch August 2, 2022 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:protocol Related to the specification/protocol directory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ability for OTLP response to signal partial acceptance of data
6 participants