Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exemplar: Can not set partial response. #4676

Closed
hanjm opened this issue Sep 18, 2021 · 13 comments · Fixed by #4680 or #4681
Closed

Exemplar: Can not set partial response. #4676

hanjm opened this issue Sep 18, 2021 · 13 comments · Fixed by #4680 or #4681

Comments

@hanjm
Copy link
Member

hanjm commented Sep 18, 2021

Thanos, Prometheus and Golang version used:

Object Storage Provider:
COS
What happened:
query_exemplar response error:
error: "retrieving exemplars: proxy Exemplars: receiving exemplars from exemplars client &{0xc000b2a000}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars"

What you expected to happen:
partial response with warning.
How to reproduce it (as minimally and precisely as possible):
Thanos Query 0.23 beta + a old version sidecar not support exemplar
Full logs to relevant components:

Anything else we need to know:

seems missing flag to control exemplar partial reponse

@hanjm
Copy link
Member Author

hanjm commented Sep 18, 2021

if I add a flag and pass it to NewExemplarsHandler, https://github.com/thanos-io/thanos/blob/main/pkg/api/query/v1.go#L798, thanos-query run OOM when query. may be too many exemplars? could we set a number limit to it?

pprof heap is here

File: thanos
Type: inuse_space
Time: Sep 18, 2021 at 9:52pm (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 461.11MB, 97.97% of 470.68MB total
Dropped 66 nodes (cum <= 2.35MB)
Showing top 10 nodes out of 37
      flat  flat%   sum%        cum   cum%
  256.06MB 54.40% 54.40%   256.06MB 54.40%  github.com/pkg/errors.callers
  140.52MB 29.85% 84.26%   142.02MB 30.17%  github.com/pkg/errors.(*withMessage).Error
   23.50MB  4.99% 89.25%   271.56MB 57.69%  github.com/pkg/errors.New (inline)
   16.95MB  3.60% 92.85%   288.50MB 61.30%  github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send
    7.98MB  1.70% 94.54%     7.98MB  1.70%  google.golang.org/grpc/internal/transport.newBufWriter
       6MB  1.27% 95.82%        6MB  1.27%  reflect.New
    3.61MB  0.77% 96.59%     3.61MB  0.77%  bufio.NewReaderSize
    3.50MB  0.74% 97.33%     4.04MB  0.86%  fmt.Sprintf
       2MB  0.42% 97.76%     3.50MB  0.74%  github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tracing.(*opentracingClientReporter).PostCall
       1MB  0.21% 97.97%        3MB  0.64%  google.golang.org/grpc/internal/status.(*Error).Error
(pprof)
(pprof) traces github.com/pkg/errors.callers
File: thanos
Type: inuse_space
Time: Sep 18, 2021 at 9:52pm (CST)
-----------+-------------------------------------------------------
     bytes:  24B
  512.01kB   github.com/pkg/errors.callers
             github.com/pkg/errors.Wrapf
             github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive
             github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1
             golang.org/x/sync/errgroup.(*Group).Go.func1
-----------+-------------------------------------------------------
     bytes:  24B
   20.50MB   github.com/pkg/errors.callers
             github.com/pkg/errors.New
             github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send
             github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive
             github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1
             golang.org/x/sync/errgroup.(*Group).Go.func1
-----------+-------------------------------------------------------
     bytes:  256B
    7.50MB   github.com/pkg/errors.callers
             github.com/pkg/errors.Wrapf
             github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive
             github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1
             golang.org/x/sync/errgroup.(*Group).Go.func1
-----------+-------------------------------------------------------
     bytes:  256B
  227.56MB   github.com/pkg/errors.callers
             github.com/pkg/errors.New
             github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send
             github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive
             github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1
             golang.org/x/sync/errgroup.(*Group).Go.func1
-----------+-------------------------------------------------------


@yeya24
Copy link
Contributor

yeya24 commented Sep 18, 2021

So actually partial response in the Exemplars API works, right? @hanjm Can you help confirm this?

If the flag works, then let's rename the issue to discuss the exemplars limit.

@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

@yeya24 I found seems missing exemplar.partial-response flag, so i add exemplar.partial-response flag in hanjm@681608c, Partial response works in my brach.

@yeya24
Copy link
Contributor

yeya24 commented Sep 19, 2021

I see. @hanjm Looks like we have this config in the struct but forget to add a flag for it. Would you like to open a pr for it?

@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

I see. @hanjm Looks like we have this config in the struct but forget to add a flag for it. Would you like to open a pr for it?

ok.

@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

I am investigate it.
from pprof heap, it seems too many warning message consume the memory.

then i add a debug log to (*exemplarsServer).Send https://github.com/thanos-io/thanos/blob/main/pkg/exemplars/exemplars.go#L42

func (srv *exemplarsServer) Send(res *exemplarspb.ExemplarsResponse) error {
	if res.GetWarning() != "" {
		err := errors.New(res.GetWarning())
		log.Printf("err message size: errors:%d, srv.warnings:%d, res.GetWarning():%d",
			len(err.Error()),
			len(srv.warnings),
			len(res.GetWarning()))
		srv.warnings = append(srv.warnings, err)
		return nil
	}

it will print a lot of logs like

2021-09-19_12:19:54 2021/09/19 04:19:54 err message size: errors:130, srv.warnings:4277283, res.GetWarning():130

@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

then i print svr.warnings first ten message.

func (srv *exemplarsServer) Send(res *exemplarspb.ExemplarsResponse) error {
	if res.GetWarning() != "" {
		err := errors.New(res.GetWarning())
		if len(srv.warnings) == 100 {
			log.Printf("err message size: errors:%d, srv.warnings:%d, res.GetWarning():%d, srv.warnings:%+v",
				len(err.Error()),
				len(srv.warnings),
				len(res.GetWarning()),
				srv.warnings[:10],
			)
		}

		srv.warnings = append(srv.warnings, err)
		return nil
	}

it print a log like

2021/09/19 06:09:37 err message size: errors:130, srv.warnings:100, res.GetWarning():130, srv.warnings:[receiving exemplars from exemplars client &{0xc0000a0700}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars
github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:45
github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:193
github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/jimmiehan/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
        /Users/jimmiehan/sdk/go1.16.3/src/runtime/asm_amd64.s:1371 receiving exemplars from exemplars client &{0xc0000a0700}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars
github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:45
github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:193
github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/jimmiehan/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit
        /Users/jimmiehan/sdk/go1.16.3/src/runtime/asm_amd64.s:1371 receiving exemplars from exemplars client &{0xc0000a0700}: rpc error: code = Unimplemented desc = unknown service thanos.Exemplars
github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsServer).Send
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/exemplars.go:45
github.com/thanos-io/thanos/pkg/exemplars.(*exemplarsStream).receive
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:193
github.com/thanos-io/thanos/pkg/exemplars.(*Proxy).Exemplars.func1
        /Users/jimmiehan/github.com/hanjm/thanos/pkg/exemplars/proxy.go:133
golang.org/x/sync/errgroup.(*Group).Go.func1
        /Users/jimmiehan/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:57
runtime.goexit

@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

Seems it is better if keep one warning in single gRPC bid-stream call. instead of multi warning . @yeya24 do you think so?

@MrYueQ
Copy link

MrYueQ commented Sep 19, 2021

Hi. The version of thanos has been rolled back to 0.22. In addition, in the current version, --query-frontend.downstream-url uses the load balancing mechanism, and the load balancing mechanism is weighted round-robin. No matter how the client's IP changes. The upstream of the backend will only be routed to the same IP.

thanos-query host resource
image

thanos-query-frontend host resource

image

image

@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

I found the root cause: If the store-api target not implement exemplar API, err is rpc error: code = Unimplemented desc = unknown service thanos.Exemplars. the grpc bid-steam loop will be infinite loop due to the wrong error handing.
grpc doc say https://github.com/grpc/grpc-go/blob/v1.40.0/stream.go#L123

	// RecvMsg blocks until it receives a message into m or the stream is
	// done. It returns io.EOF when the stream completes successfully. On
	// any other error, the stream is aborted and the error contains the RPC
	// status.

https://github.com/thanos-io/thanos/blob/main/pkg/exemplars/proxy.go#L195

for {
		exemplar, err := exemplars.Recv()
		if err == io.EOF {
			return nil
		}

		if err != nil {
			err = errors.Wrapf(err, "receiving exemplars from exemplars client %v", stream.client)

			if stream.request.PartialResponseStrategy == storepb.PartialResponseStrategy_ABORT {
				return err
			}

			if err := stream.server.Send(exemplarspb.NewWarningExemplarsResponse(err)); err != nil {
				return errors.Wrapf(err, "sending exemplars error to server %v", stream.server)
			}

			continue
		}

		if w := exemplar.GetWarning(); w != "" {
			if err := stream.server.Send(exemplarspb.NewWarningExemplarsResponse(errors.New(w))); err != nil {
				return errors.Wrapf(err, "sending exemplars warning to server %v", stream.server)
			}
			continue
		}

		select {
		case stream.channel <- exemplar.GetData():
		case <-ctx.Done():
			return ctx.Err()
		}
	}

hanjm added a commit to hanjm/thanos that referenced this issue Sep 19, 2021
…etsStreamStream).receive infinite loop when target response Unimplemented error (thanos-io#4676)

Signed-off-by: hanjm <[email protected]>
hanjm added a commit to hanjm/thanos that referenced this issue Sep 19, 2021
…/(*targetsStreamStream).receive infinite loop when target response Unimplemented error (thanos-io#4676)

Signed-off-by: hanjm <[email protected]>
hanjm added a commit to hanjm/thanos that referenced this issue Sep 19, 2021
…/(*targetsStreamStream).receive infinite loop when target response Unimplemented error (thanos-io#4676)

Signed-off-by: hanjm <[email protected]>
@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

Fix at PR #4681 , should merge #4680 first.

@MrYueQ
Copy link

MrYueQ commented Sep 19, 2021

image
image

@hanjm
Copy link
Member Author

hanjm commented Sep 19, 2021

Hi. The version of thanos has been rolled back to 0.22. In addition, in the current version, --query-frontend.downstream-url uses the load balancing mechanism, and the load balancing mechanism is weighted round-robin. No matter how the client's IP changes. The upstream of the backend will only be routed to the same IP.

@MrYueQ Seems not relevent? Please feel free to open another issue ~

yeya24 pushed a commit that referenced this issue Sep 19, 2021
…/(*targetsStreamStream).receive infinite loop when target response Unimplemented error (#4676) (#4681)

Signed-off-by: hanjm <[email protected]>
bwplotka pushed a commit that referenced this issue Sep 20, 2021
…/(*targetsStreamStream).receive infinite loop when target response Unimplemented error (#4676) (#4681)

Signed-off-by: hanjm <[email protected]>
kakkoyun pushed a commit that referenced this issue Sep 20, 2021
* Sidecar: Fix process external label on promethues v2.28+ use units.Bytes config type (#4657)

* Sidecar: Fix process external label when promethues v2.28+ use units.Bytes config type (#4656)

Signed-off-by: hanjm <[email protected]>

* E2E: Upgrade prometheus image version

Signed-off-by: hanjm <[email protected]>

* upgrade Prometheus dependency version to v2.30.0 (#4669)

* upgrade Prometheus dependency version to v2.30.0

Signed-off-by: Ben Ye <[email protected]>

* fix unit test

Signed-off-by: Ben Ye <[email protected]>
# Conflicts:
#	go.mod
#	go.sum

* Query: Fix (*exemplarsStream).receive/(*metricMetadataStream).receive/(*targetsStreamStream).receive infinite loop when target response Unimplemented error (#4676) (#4681)

Signed-off-by: hanjm <[email protected]>

* Cut 0.23.0-rc.1

Signed-off-by: Bartlomiej Plotka <[email protected]>

Co-authored-by: Jimmiehan <[email protected]>
Co-authored-by: Ben Ye <[email protected]>
bwplotka added a commit that referenced this issue Oct 6, 2021
* Cut release 0.23.0-rc.0 (#4625)

Signed-off-by: Bartlomiej Plotka <[email protected]>

* Updated version.

Signed-off-by: Bartlomiej Plotka <[email protected]>

* Cut 0.23.0-rc.1 and cherry picked 3 critical commits from main. (#4684)

* Sidecar: Fix process external label on promethues v2.28+ use units.Bytes config type (#4657)

* Sidecar: Fix process external label when promethues v2.28+ use units.Bytes config type (#4656)

Signed-off-by: hanjm <[email protected]>

* E2E: Upgrade prometheus image version

Signed-off-by: hanjm <[email protected]>

* upgrade Prometheus dependency version to v2.30.0 (#4669)

* upgrade Prometheus dependency version to v2.30.0

Signed-off-by: Ben Ye <[email protected]>

* fix unit test

Signed-off-by: Ben Ye <[email protected]>
# Conflicts:
#	go.mod
#	go.sum

* Query: Fix (*exemplarsStream).receive/(*metricMetadataStream).receive/(*targetsStreamStream).receive infinite loop when target response Unimplemented error (#4676) (#4681)

Signed-off-by: hanjm <[email protected]>

* Cut 0.23.0-rc.1

Signed-off-by: Bartlomiej Plotka <[email protected]>

Co-authored-by: Jimmiehan <[email protected]>
Co-authored-by: Ben Ye <[email protected]>

* Cut 0.23.0 release. (#4697)

* Endpointset: Do not use info client to obtain metadata (for now) (#4714)

* Do not use info client to obtain metadata

Signed-off-by: Matej Gera <[email protected]>

* Update CHANGELOG.

Signed-off-by: Matej Gera <[email protected]>

* Comment out client.info usage

Signed-off-by: Matej Gera <[email protected]>

* Fix lint error

Signed-off-by: Matej Gera <[email protected]>

* Cutting 0.23.1 (#4718)

Signed-off-by: Bartlomiej Plotka <[email protected]>

* Moved tutorials Thanos versions to 0.23.1

Signed-off-by: Bartlomiej Plotka <[email protected]>

* Added volounteer for shepharding, fixed VERSION.

Signed-off-by: Bartlomiej Plotka <[email protected]>

Co-authored-by: Jimmiehan <[email protected]>
Co-authored-by: Ben Ye <[email protected]>
Co-authored-by: Matej Gera <[email protected]>
someshkoli pushed a commit to someshkoli/thanos that referenced this issue Nov 7, 2021
…/(*targetsStreamStream).receive infinite loop when target response Unimplemented error (thanos-io#4676) (thanos-io#4681)

Signed-off-by: hanjm <[email protected]>
sthaha added a commit to sthaha/thanos that referenced this issue Apr 12, 2022
This cherry-picks upstream patch that fixes the bug

Query: Fix (*exemplarsStream).receive/(*metricMetadataStream).receive/(*targetsStreamStream).receive
  infinite loop when target response Unimplemented error (thanos-io#4676) (thanos-io#4681)

See:
  - thanos-io#4676 (comment)
  - thanos-io#4681

Signed-off-by: hanjm <[email protected]>
(cherry picked from commit 2d4d140)

Signed-off-by: Sunil Thaha <[email protected]>
sthaha added a commit to sthaha/thanos that referenced this issue Apr 12, 2022
This cherry-picks upstream patch that fixes the bug

Query: Fix (*exemplarsStream).receive/(*metricMetadataStream).receive/(*targetsStreamStream).receive
  infinite loop when target response Unimplemented error (thanos-io#4676) (thanos-io#4681)

See:
  - thanos-io#4676 (comment)
  - thanos-io#4681

Signed-off-by: hanjm <[email protected]>
(cherry picked from commit 2d4d140)

Signed-off-by: Sunil Thaha <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants