services/horizon: Race condition found in integration tests #4888

Closed
2opremio opened this issue May 31, 2023 · 3 comments · Fixed by #4889
@2opremio (Contributor)

See https://github.com/stellar/go/actions/runs/5136157151/jobs/9242535941

WARNING: DATA RACE
Read at 0x00c00046a758 by goroutine 70689:
  github.com/stellar/go/services/horizon/internal/ingest.(*system).GetCurrentState()
      /home/runner/work/go/go/services/horizon/internal/ingest/main.go:487 +0x3a
  github.com/stellar/go/services/horizon/internal.(*App).UpdateCoreLedgerState()
      /home/runner/work/go/go/services/horizon/internal/app.go:225 +0x1a8
  github.com/stellar/go/services/horizon/internal.(*App).Tick.func1()
      /home/runner/work/go/go/services/horizon/internal/app.go:450 +0x58

Previous write at 0x00c00046a758 by goroutine 70456:
  github.com/stellar/go/services/horizon/internal/ingest.(*system).runStateMachine()
      /home/runner/work/go/go/services/horizon/internal/ingest/main.go:655 +0x444
  github.com/stellar/go/services/horizon/internal/ingest.(*system).Run()
      /home/runner/work/go/go/services/horizon/internal/ingest/main.go:543 +0x49
  github.com/stellar/go/services/horizon/internal.(*App).Serve.func1()
      /home/runner/work/go/go/services/horizon/internal/app.go:104 +0x5e

Goroutine 70689 (running) created at:
  github.com/stellar/go/services/horizon/internal.(*App).Tick()
      /home/runner/work/go/go/services/horizon/internal/app.go:450 +0x259
  github.com/stellar/go/services/horizon/internal.(*App).run()
      /home/runner/work/go/go/services/horizon/internal/app.go:584 +0x19a
  github.com/stellar/go/services/horizon/internal.(*App).Serve.func6()
      /home/runner/work/go/go/services/horizon/internal/app.go:92 +0x39

Goroutine 70456 (running) created at:
  github.com/stellar/go/services/horizon/internal.(*App).Serve()
      /home/runner/work/go/go/services/horizon/internal/app.go:103 +0x4d1
  github.com/stellar/go/services/horizon/internal/test/integration.(*Test).StartHorizon.func4()
      /home/runner/work/go/go/services/horizon/internal/test/integration/integration.go:435 +0x55
@2opremio 2opremio added the bug label May 31, 2023
@2opremio (Contributor, Author) commented May 31, 2023

This seems to have been introduced in #4860

The new GetCurrentState() method on system is racy: it is called from one goroutine while system.currentState is updated by a separate goroutine during ingestion.
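
To make the shape of the race concrete, here is a heavily simplified sketch of the pattern the race detector is flagging; this is not the actual Horizon code (State and the method bodies are stand-ins), only an illustration of how runStateMachine writes currentState while GetCurrentState reads it from another goroutine with no synchronization:

```go
package ingest

// Heavily simplified stand-in for the real system type; field and method
// names mirror the trace above, but the bodies are illustrative only.
type State int

type system struct {
	currentState State
}

// Runs in the ingestion goroutine started by App.Serve; each state
// transition writes currentState without holding any lock.
func (s *system) runStateMachine(next State) {
	s.currentState = next // write (main.go:655 in the trace)
}

// Called from the goroutine spawned by App.Tick via UpdateCoreLedgerState;
// this read is unsynchronized, so it races with the write above.
func (s *system) GetCurrentState() State {
	return s.currentState // read (main.go:487 in the trace)
}
```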

@urvisavla (Contributor)

Here are some options to address the race condition:

  1. Add locking for the currentState variable within the state-machine loop. This may have a performance impact because of the locking overhead.
  2. Represent currentState as an integer and use atomic operations instead of locks. This would have a smaller performance hit than option 1.
  3. Instead of maintaining the current state in the state machine, use a flag to indicate when Horizon is in the build state. Set the flag at the beginning of the buildState::Run function and reset it when the build state completes. With this approach we'd take a write lock only during the build state, which happens only at startup, so the ongoing performance impact is reduced. Additionally, by checking the flag only when we get an error querying captive-core, we can avoid read locks in the other states.
  4. Revert the previous fix that suppresses the error message when Horizon is in the build state and instead focus on improving the error message. This approach won't have any performance impact, but the error message can be very confusing for users.

@2opremio, @tamirms, @tsachiherman, @sreuland I'd appreciate your input on the best approach to resolve the issue.
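
For illustration, here is a minimal sketch of option 2, assuming the state can be represented as an int32; the names are hypothetical and this is not taken from the Horizon codebase or from the fix in #4889:

```go
package ingest

import "sync/atomic"

type State int32

type system struct {
	// currentState is stored atomically, so no mutex is needed for
	// simple get/set access. atomic.Int32 requires Go 1.19+.
	currentState atomic.Int32
}

// setState would be called from the state-machine loop.
func (s *system) setState(next State) {
	s.currentState.Store(int32(next))
}

// GetCurrentState can be called concurrently from other goroutines.
func (s *system) GetCurrentState() State {
	return State(s.currentState.Load())
}
```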

@tamirms (Contributor) commented Jun 1, 2023

@urvisavla I don't think locking will cause a significant performance impact, because I don't expect much contention on the lock. The ingestion loop executes once every few seconds (a new ledger appears on average once every 5 seconds), and UpdateCoreLedgerState(), the caller of GetCurrentState(), executes once every second.
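
Given that, a plain read/write lock around currentState (option 1) is probably the simplest route. A minimal sketch, again with hypothetical names and not the actual change that landed in #4889:

```go
package ingest

import "sync"

type State int

type system struct {
	stateMu      sync.RWMutex // hypothetical field; protects currentState
	currentState State
}

// setState is called from the state-machine loop (the writer goroutine).
func (s *system) setState(next State) {
	s.stateMu.Lock()
	s.currentState = next
	s.stateMu.Unlock()
}

// GetCurrentState takes a read lock, so UpdateCoreLedgerState can call it
// concurrently with the ingestion loop. The critical sections are tiny, so
// contention at one call per second should be negligible.
func (s *system) GetCurrentState() State {
	s.stateMu.RLock()
	defer s.stateMu.RUnlock()
	return s.currentState
}
```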

@urvisavla urvisavla moved this from Backlog to Current Sprint in Platform Scrum Jun 1, 2023
@urvisavla urvisavla self-assigned this Jun 1, 2023
@urvisavla urvisavla moved this from Current Sprint to In Progress in Platform Scrum Jun 1, 2023
@urvisavla urvisavla moved this from In Progress to Done in Platform Scrum Jun 6, 2023