
Querier error returned as compressed binary data #938

Closed
leth opened this issue Aug 20, 2018 · 17 comments


leth commented Aug 20, 2018

If I run a certain query in graph mode, I am told: Error: rpc error: code = Code(413) desc = exceeded maximum number of samples in a query (100000)
If I run the same query in table mode, instead I am told: Server error (rpc error: code = Code(500) desc = ����������4�M �0�����V-���*ЕW��1��B��{ ��w��g3�<Śo�� U�����^)N�ܨٯ�-N��|�!�H̸���<�FDZ� n���H~[RO�=ݨ(w�Ou�a��xt�a:~��������������+W����� )


leth commented Aug 20, 2018

This issue started after 6024681. @tomwilkie Perhaps this was introduced in #910, as that was the next build to release a new querier image?


leth commented Sep 3, 2018

I'm not sure what fixed this, but it seems to no longer happen on the latest releases!

leth closed this as completed Sep 3, 2018

leth commented Sep 3, 2018

Turns out we didn't test what we thought we had tested!

leth reopened this Sep 3, 2018

leth commented Sep 5, 2018

Can confirm this is still an issue on the latest image


bboreham commented Sep 5, 2018

It's working on 6024681c and broken on 57b71bbf, which is the first commit that produced a Querier image after the working one.

57b71bbf is from #910 which changed httpgrpc.pb.go.


bboreham commented Sep 5, 2018

@leth gave me a tip which revealed what the corruption is: the code generating the error message is writing to a compressedResponseWriter, while ErrorFromHTTPResponse, which pulls it out into gRPC, assumes it can use those bytes directly.
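
Roughly what's happening, as a minimal Go sketch with simplified stand-in names (not the actual Cortex / weaveworks/common code):

```go
// Minimal sketch of the mismatch; the real code lives in the compression
// middleware and in httpgrpc's ErrorFromHTTPResponse.
package main

import (
	"compress/gzip"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// The error path writes through the same compressing writer as normal
// responses, so the bytes recorded in the body are gzip data.
func writeError(w http.ResponseWriter, msg string) {
	w.Header().Set("Content-Encoding", "gzip")
	w.WriteHeader(http.StatusInternalServerError)
	gz := gzip.NewWriter(w)
	defer gz.Close()
	gz.Write([]byte(msg))
}

func main() {
	rec := httptest.NewRecorder()
	writeError(rec, "exceeded maximum number of samples in a query (100000)")

	// The gRPC conversion assumes the captured body is plain text and copies
	// the bytes straight into the error description - hence the garbage.
	err := fmt.Errorf("rpc error: code = Code(%d) desc = %s", rec.Code, rec.Body.String())
	fmt.Println(err)
}
```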

@tomwilkie does that raise any thoughts as to how we should avoid this?


tomwilkie commented Sep 5, 2018 via email


bboreham commented Sep 5, 2018

So this was actually caused by #898, not #910; the reason I was looking at the wrong commit is that a unit test failed on the merge for #910: https://circleci.com/gh/weaveworks/cortex/2700


bboreham commented Sep 5, 2018

@tomwilkie yeah it can wait. Maybe I'll just delete httpgrpc 😉


tomwilkie commented Sep 5, 2018 via email

bboreham changed the title from "Querier error returned as binary data" to "Querier error returned as compressed binary data" Sep 6, 2018

bboreham commented Sep 6, 2018

One approach would be to inspect the content-encoding of the response buffer and decompress if necessary.
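
A rough sketch of that approach (a hypothetical helper and package name, not anything that exists in weaveworks/common today):

```go
package httpgrpcutil

import (
	"bytes"
	"compress/gzip"
	"io/ioutil"
	"net/http"
)

// decodeErrorBody is a hypothetical helper: before packing a captured
// response into a gRPC error, check Content-Encoding and gunzip the body
// if necessary, so the error description stays readable.
func decodeErrorBody(header http.Header, body []byte) ([]byte, error) {
	if header.Get("Content-Encoding") != "gzip" {
		return body, nil // already plain text
	}
	gz, err := gzip.NewReader(bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer gz.Close()
	return ioutil.ReadAll(gz)
}
```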

Seems like a lot of work.

@tomwilkie

> So this was actually caused by #898

I don't see how - would you mind explaining? AFAICT the Prometheus API Register has always compressed results. I don't see how it ever worked...

@tomwilkie

> One approach would be to inspect the content-encoding of the response buffer and decompress if necessary.

Not sure - everything should round-trip. If we return a non-200 compressed response from the querier, it gets propagated as a gRPC error from querier -> authfe, then converted back to an HTTP response and written (compression header and all) back to the client.
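
Sketching that round trip with simplified stand-in types (the real code uses the httpgrpc protos and gRPC status, so treat this only as an outline):

```go
package roundtrip

import "net/http"

// Simplified stand-in for the httpgrpc response carried inside the error.
type httpResponse struct {
	Code    int
	Headers map[string][]string
	Body    []byte
}

type responseError struct{ resp httpResponse }

func (e *responseError) Error() string { return string(e.resp.Body) }

// Querier side: a non-200 (possibly gzip-compressed) response becomes an error.
func errorFromResponse(resp httpResponse) error { return &responseError{resp: resp} }

// authfe side: the error is turned back into an HTTP response; headers
// (including Content-Encoding) and body pass through unchanged, so the
// client can decompress it itself.
func writeErrorResponse(w http.ResponseWriter, err error) {
	re, ok := err.(*responseError)
	if !ok {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	for k, vs := range re.resp.Headers {
		for _, v := range vs {
			w.Header().Add(k, v)
		}
	}
	w.WriteHeader(re.resp.Code)
	w.Write(re.resp.Body)
}
```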

@bboreham

> would you mind explaining

I haven't identified a specific code change, but we can make the issue come and go by switching between the querier versions noted above.

Even if the round-trip were working, we'd still have a smaller problem: the error message is dumped into the OpenTracing span.

@bboreham

I wonder if this relates to #780 which is also about corruption of response data.

@bboreham

One difference between query and query_range is that the Prometheus API turns a storage error into an internal error (500) for query, but not for query_range.

Cortex does send back storage errors, so we should avoid that in cases like 'exceeded maximum number of samples', but there is a deeper problem: that Prometheus switch statement actually treats all errors as storage errors, because a type assertion on an interface type is an 'implements' check, not an exact type match.
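
For illustration of that last point, a self-contained sketch (stand-in interface name, not the actual Prometheus types): a type-switch case on an interface type matches any error that implements it, so every error takes the "storage error" branch:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the interface type the switch matches against.
type storageError interface {
	error
}

func classify(err error) string {
	switch err.(type) {
	case storageError: // an "implements" check: every non-nil error satisfies it
		return "storage error -> 500"
	default:
		return "other"
	}
}

func main() {
	err := errors.New("exceeded maximum number of samples in a query (100000)")
	fmt.Println(classify(err)) // "storage error -> 500", despite being a limits error
}
```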

Checking into where it breaks: at 75de09a5 it returns 500s, but readable, while at 57b71bbf it returns compressed data, so it broke in #910. And if I update our front-end proxy to the same version of weaveworks/common, the garbled data goes away, so I conclude the change to gRPC/proto generation was the culprit.

@bboreham

The problems eventually went away after updating everything to the latest versions following the protobuf changes.
