Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky TestDistributor/caching_unmarshal_data_enabled/series_with_exemplars #7448

Closed
dimitarvdimitrov opened this issue Feb 23, 2024 · 10 comments

Comments

@dimitarvdimitrov
Copy link
Contributor

The test failed on a PR which doesn't work with exemplars or the distributor

https://github.com/grafana/mimir/actions/runs/8016409523/job/21899113333?pr=7173

=== RUN   TestDistributor/caching_unmarshal_data_enabled/series_with_exemplars
08:44:30 querier: ts=2024-02-23T08:44:30.956257016Z caller=loader.go:116 level=warn msg="bucket index not found" user=e2e-user
    distributor_test.go:321: 
        	Error Trace:	/home/runner/work/mimir/mimir/integration/distributor_test.go:321
        	Error:      	Not equal: 
        	            	expected: []v1.ExemplarQueryResult{v1.ExemplarQueryResult{SeriesLabels:model.LabelSet{"__name__":"foobar_with_exemplars"}, Exemplars:[]v1.Exemplar{v1.Exemplar{Labels:model.LabelSet{"test":"test"}, Value:[123](https://github.com/grafana/mimir/actions/runs/8016409523/job/21899113333?pr=7173#step:15:124), Timestamp:1708674267000}}}}
        	            	actual  : []v1.ExemplarQueryResult{}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,16 +1,2 @@
        	            	-([]v1.ExemplarQueryResult) (len=1) {
        	            	- (v1.ExemplarQueryResult) {
        	            	-  SeriesLabels: (model.LabelSet) (len=1) {
        	            	-   (model.LabelName) (len=8) "__name__": (model.LabelValue) (len=21) "foobar_with_exemplars"
        	            	-  },
        	            	-  Exemplars: ([]v1.Exemplar) (len=1) {
        	            	-   (v1.Exemplar) {
        	            	-    Labels: (model.LabelSet) (len=1) {
        	            	-     (model.LabelName) (len=4) "test": (model.LabelValue) (len=4) "test"
        	            	-    },
        	            	-    Value: (model.SampleValue) 123,
        	            	-    Timestamp: (model.Time) 1708674267000
        	            	-   }
        	            	-  }
        	            	- }
        	            	+([]v1.ExemplarQueryResult) {
        	            	 }
        	Test:       	TestDistributor/caching_unmarshal_data_enabled/series_with_exemplars
@dimitarvdimitrov
Copy link
Contributor Author

dimitarvdimitrov commented Feb 23, 2024

there's another instance of the same test in a different PR a few minutes apart https://github.com/grafana/mimir/actions/runs/8017427753/job/21902224686?pr=7446

=== RUN   TestDistributor/caching_unmarshal_data_disabled/series_with_exemplars
    distributor_test.go:321: 
        	Error Trace:	/home/runner/work/mimir/mimir/integration/distributor_test.go:321
        	Error:      	Not equal: 
        	            	expected: []v1.ExemplarQueryResult{v1.ExemplarQueryResult{SeriesLabels:model.LabelSet{"__name__":"foobar_with_exemplars"}, Exemplars:[]v1.Exemplar{v1.Exemplar{Labels:model.LabelSet{"test":"test"}, Value:123, Timestamp:1708679924000}}}}
        	            	actual  : []v1.ExemplarQueryResult{}
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,16 +1,2 @@
        	            	-([]v1.ExemplarQueryResult) (len=1) {
        	            	- (v1.ExemplarQueryResult) {
        	            	-  SeriesLabels: (model.LabelSet) (len=1) {
        	            	-   (model.LabelName) (len=8) "__name__": (model.LabelValue) (len=21) "foobar_with_exemplars"
        	            	-  },
        	            	-  Exemplars: ([]v1.Exemplar) (len=1) {
        	            	-   (v1.Exemplar) {
        	            	-    Labels: (model.LabelSet) (len=1) {
        	            	-     (model.LabelName) (len=4) "test": (model.LabelValue) (len=4) "test"
        	            	-    },
        	            	-    Value: (model.SampleValue) 123,
        	            	-    Timestamp: (model.Time) 1708679924000
        	            	-   }
        	            	-  }
        	            	- }
        	            	+([]v1.ExemplarQueryResult) {
        	            	 }
        	Test:       	TestDistributor/caching_unmarshal_data_disabled/series_with_exemplars

The test was last changed in August 2023 thought. It could be a genuine failure since it started showing up so often

@pstibrany
Copy link
Member

I was just going to add this. I've seen this failing all morning today in 3 separate PRs.

@pracucci pracucci self-assigned this Feb 23, 2024
@pracucci
Copy link
Collaborator

I'm going to look into this.

@pracucci
Copy link
Collaborator

pracucci commented Feb 23, 2024

I reviewed again #7393 to see if I introduced a bug in QueryExemplars() and, just looking at the changes, I can't see it.

@pstibrany
Copy link
Member

I'm going to look into this.

I'm trying to understand it via draft PR: #7454

@pracucci
Copy link
Collaborator

I found the cause. It's not a new thing. It's a timing issue in the ingester. The exemplar is not not ingested (but silently) because max exemplars is set to 0 in TSDB, due to a timing issue.

I'm working on a fix.

@pracucci
Copy link
Collaborator

pracucci commented Feb 23, 2024

I think the issue got worse when we merged this PR yesterday #7440 because it computes the limits using the ingesters ring instead of lifecycler, but the ingester ring may not be updated so it could see "zero ingesters in the ring" when computing the desired "max local exemplars". If it see no ingesters in the ring, then it computes a value of 0, which for the case of exemplars means "no exemplars allowed".

@pracucci
Copy link
Collaborator

Discussed different options on how to fix it with @pstibrany. Going to fix it in #7424 since it's already modifying how we compute the local exemplar limits.

@pracucci
Copy link
Collaborator

I pushed a commit to #7424 to fix this issue, which was introduced with #7440.

In details, the problem:

  • Ingesters local limits are applied looking at the number of ingesters/partitions in the ring
  • For all limits except exemplars, if the ingester sees an empty ring applied "unlimited". For exemplars, it applied "disabled" (0 value in TSDB).
  • After Switch ingester limiter to use ingesterRing rather than lifecycler #7440 the ingester may see "no ingesters in the ring" right after starting because the ingesters ring client is updated async (before that PR we were querying the lifecycler which at least sees the ingester itself). That's why Switch ingester limiter to use ingesterRing rather than lifecycler #7440 introduced the flakyness in this test, because ingester was applying "exemplars disabled" in TSDB until the ingesters ring client was updated.
  • To fix it, I changed the exemplars limit to do something similar to other limits: if the ingester sees an empty ring it applies the exemplars global limit instead of 0 (which, again, means disabled for exemplars).

@pracucci
Copy link
Collaborator

Fixed by #7424.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants