Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics-generator in 2.2 panics with: fatal error: concurrent map iteration and map write #2785

Closed
yvrhdn opened this issue Aug 11, 2023 · 2 comments · Fixed by #2786
Closed

Comments

@yvrhdn
Copy link
Member

yvrhdn commented Aug 11, 2023

Describe the bug
Reported in community Slack: https://grafana.slack.com/archives/C01D981PEE5/p1691583814497569

Hi, we upgraded our production cluster to tempo v2.2.0 today and since then we have been having sporadic Metric generator restarts with the error fatal error: concurrent map iteration and map write. Any idea what might be causing this? Our staging cluster is running no problem and when we downgraded back to 2.1.1 it ran fine again. We are using a helm chart to run

Stack trace

github.com/grafana/tempo/modules/generator.(*instance).diffProcessors(_, _, {{0x2540be400, 0x2710, 0xa, {0xc0001f1600, 0x8, 0x8}, {0x0, 0x0, ...}, ...}, ...})
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/modules/generator/instance.go:229 +0x6c
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- github.com/grafana/tempo/modules/generator.(*instance).updateProcessors(0xc0e3ae1950)
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/modules/generator/instance.go:190 +0x21e
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- github.com/grafana/tempo/modules/generator.(*instance).watchOverrides(0xc0e3ae1950)
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/modules/generator/instance.go:123 +0xf6
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- created by github.com/grafana/tempo/modules/generator.newInstance
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/modules/generator/instance.go:109 +0x405
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- goroutine 1 [select, 48 minutes]:
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- github.com/grafana/dskit/services.(*Manager).AwaitStopped(0xc0008d9f20, {0x2a37078, 0xc000194000})
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/github.com/grafana/dskit/services/manager.go:145 +0x6d
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- github.com/grafana/tempo/cmd/tempo/app.(*App).Run(0xc0007ce000)
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/cmd/tempo/app/app.go:235 +0xa25
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- main.main()
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/cmd/tempo/main.go:110 +0x9b0
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- goroutine 40 [select]:
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- go.opencensus.io/stats/view.(*worker).start(0xc000143a80)
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/go.opencensus.io/stats/view/worker.go:292 +0xad
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- created by go.opencensus.io/stats/view.init.0
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/go.opencensus.io/stats/view/worker.go:34 +0x96
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- goroutine 221 [chan receive, 48 minutes]:
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- github.com/grafana/dskit/services.(*BasicService).AddListener.func1()
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:344 +0x66
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- created by github.com/grafana/dskit/services.(*BasicService).AddListener
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:343 +0x1ca
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- goroutine 165 [select, 1 minutes]:
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- github.com/uber/jaeger-client-go.(*RemotelyControlledSampler).pollControllerWithTicker(0xc000764f70, 0xc000647450)
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/github.com/uber/jaeger-client-go/sampler_remote.go:153 +0x89
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- github.com/uber/jaeger-client-go.(*RemotelyControlledSampler).pollController(0xc000764f70)
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/github.com/uber/jaeger-client-go/sampler_remote.go:148 +0x6d
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- created by github.com/uber/jaeger-client-go.NewRemotelyControlledSampler
 Aug 09 12:33:54 54.246.15.219 tempo[tempo-tempo-distributed-metrics-generator- 	/drone/src/vendor/github.com/uber/jaeger-client-go/sampler_remote.go:87 +0x15b

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo v2.2
  2. Use metrics-generator
  3. ???
  4. Metrics-generator panics and restarts

Expected behavior
No panics. No restarts.

Environment:

  • Infrastructure: [e.g., Kubernetes, bare-metal, laptop]
  • Deployment tool: [e.g., helm, jsonnet]

Additional Context

@yvrhdn
Copy link
Member Author

yvrhdn commented Aug 11, 2023

The panic happened on this line:

for processorName := range desiredProcessors {

We are iterating over the desiredProcessors, this is a map we receive from the overrides module.

🤔 It could be that the overrides module is writing to the map while we iterate over it, but this code has been in place for a really long time so it seems unlikely we didn't spot this before.

diffProcessors only reads the map, but we have another function in here that modifies the map: updateSubProcessors

func (i *instance) updateSubprocessors(desiredProcessors map[string]struct{}, desiredCfg ProcessorConfig) (map[string]struct{}, ProcessorConfig) {

Added in april 19, tempo 2.1.1 was released april 28 --> this might be new to 2.2? I would have to check in more detail.

This function will modify desiredProcessors under certain conditions.

updateProcessor is called on a timer, I can imagine it could be called concurrently if a user has multiple instances --> they get the same config from the overrides and try to modify --> panic 💥

@yvrhdn
Copy link
Member Author

yvrhdn commented Aug 11, 2023

Can confirm the entire updateSubprocessors was added between the tags 2.1.1 and 2.2.

yvrhdn pushed a commit to yvrhdn/tempo that referenced this issue Aug 17, 2023
…efault overrides (grafana#2785)

* Modify concurrency test to consider multitenant / concurrent updateProcessors calls

* Copy desiredProcessors map before modifying it

* Update CHANGELOG.md

(cherry picked from commit 9af6bb2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant