Use single go-routine instead of mutex for shard-controller #447

Merged

merlimat
Collaborator

When the leader election on one shard fails continuously, the node-controller gets stuck as well.

The reason is that the shard-controller mutex is held while the leader election is being retried. If a node fails in the meantime, the node controller tries to acquire the same shard mutex, which leads to a standstill. Technically this is not a deadlock: if the leader election recovers, then the node-controller recovers as well.

1 @ 0x440b0e 0x452018 0x451fef 0x46fc25 0x47ed5d 0x1741f48 0x1741f2a 0x1743328 0x5d4b11 0x5d467c 0x174329b 0x1743234 0xa1c7b3 0xa151fd 0xa1c736 0x473c81
# labels: {"addr":"oxia-2.oxia-svc:6649", "oxia":"node-controller-send-updates"}
#	0x46fc24	sync.runtime_SemacquireMutex+0x24										/usr/local/go/src/runtime/sema.go:77
#	0x47ed5c	sync.(*Mutex).lockSlow+0x15c											/usr/local/go/src/sync/mutex.go:171
#	0x1741f47	sync.(*Mutex).Lock+0x47												/usr/local/go/src/sync/mutex.go:90
#	0x1741f29	github.com/streamnative/oxia/coordinator/impl.(*nodeController).Status+0x29					/src/oxia/coordinator/impl/node_controller.go:139
#	0x1743327	github.com/streamnative/oxia/coordinator/impl.(*nodeController).sendAssignmentsUpdatesWithRetries.func2+0x47	/src/oxia/coordinator/impl/node_controller.go:268
#	0x5d4b10	github.com/cenkalti/backoff/v4.doRetryNotify[...]+0x1d0								/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:107
#	0x5d467b	github.com/cenkalti/backoff/v4.RetryNotifyWithTimer+0x5b							/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:61
#	0x174329a	github.com/cenkalti/backoff/v4.RetryNotify+0x19a								/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:49
#	0x1743233	github.com/streamnative/oxia/coordinator/impl.(*nodeController).sendAssignmentsUpdatesWithRetries+0x133		/src/oxia/coordinator/impl/node_controller.go:265
#	0xa1c7b2	github.com/streamnative/oxia/common.DoWithLabels.func1+0x12							/src/oxia/common/pprof.go:46
#	0xa151fc	runtime/pprof.Do+0x9c												/usr/local/go/src/runtime/pprof/runtime.go:51
#	0xa1c735	github.com/streamnative/oxia/common.DoWithLabels+0x375								/src/oxia/common/pprof.go:42


1 @ 0x440b0e 0x452018 0x451fef 0x46fc25 0x47ed5d 0x1745171 0x1745158 0x173ccb7 0x174269a 0x5d4b11 0x5d467c 0x17423db 0x1742374 0xa1c7b3 0xa151fd 0xa1c736 0x473c81
# labels: {"addr":"oxia-2.oxia-svc:6649", "oxia":"node-controller"}
#	0x46fc24	sync.runtime_SemacquireMutex+0x24									/usr/local/go/src/runtime/sema.go:77
#	0x47ed5c	sync.(*Mutex).lockSlow+0x15c										/usr/local/go/src/sync/mutex.go:171
#	0x1745170	sync.(*Mutex).Lock+0x70											/usr/local/go/src/sync/mutex.go:90
#	0x1745157	github.com/streamnative/oxia/coordinator/impl.(*shardController).HandleNodeFailure+0x57			/src/oxia/coordinator/impl/shard_controller.go:141
#	0x173ccb6	github.com/streamnative/oxia/coordinator/impl.(*coordinator).NodeBecameUnavailable+0x276		/src/oxia/coordinator/impl/coordinator.go:270
#	0x1742699	github.com/streamnative/oxia/coordinator/impl.(*nodeController).healthCheckWithRetries.func2+0x279	/src/oxia/coordinator/impl/node_controller.go:174
#	0x5d4b10	github.com/cenkalti/backoff/v4.doRetryNotify[...]+0x1d0							/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:107
#	0x5d467b	github.com/cenkalti/backoff/v4.RetryNotifyWithTimer+0x5b						/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:61
#	0x17423da	github.com/cenkalti/backoff/v4.RetryNotify+0x19a							/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:49
#	0x1742373	github.com/streamnative/oxia/coordinator/impl.(*nodeController).healthCheckWithRetries+0x133		/src/oxia/coordinator/impl/node_controller.go:153
#	0xa1c7b2	github.com/streamnative/oxia/common.DoWithLabels.func1+0x12						/src/oxia/common/pprof.go:46
#	0xa151fc	runtime/pprof.Do+0x9c											/usr/local/go/src/runtime/pprof/runtime.go:51
#	0xa1c735	github.com/streamnative/oxia/common.DoWithLabels+0x375							/src/oxia/common/pprof.go:42
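
Both traces above end in `sync.(*Mutex).Lock`. As a rough illustration of that contention (not the Oxia code), here is a minimal sketch in which a retry loop holds a mutex for its whole duration. The names `lockedShard`, `electLeader`, `canHandleNodeFailure` and `errElection` are hypothetical. The sketch uses `TryLock` (Go 1.18+) instead of `Lock` so that it reports the contention rather than hanging.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errElection is a hypothetical error standing in for a failed election attempt.
var errElection = errors.New("leader election failed")

// lockedShard is a hypothetical controller guarded by a single mutex.
type lockedShard struct {
	mu sync.Mutex
}

// electLeader keeps mu locked across every retry attempt.
func (s *lockedShard) electLeader(maxAttempts int, attempt func() error) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = attempt(); err == nil {
			return nil
		}
	}
	return err
}

// canHandleNodeFailure reports whether the shard lock is free right now.
// Real code would call Lock() and park until the retries finish.
func (s *lockedShard) canHandleNodeFailure() bool {
	if !s.mu.TryLock() {
		return false
	}
	s.mu.Unlock()
	return true
}

func main() {
	s := &lockedShard{}
	blocked := 0
	_ = s.electLeader(3, func() error {
		// Probe the lock while the election is still retrying.
		if !s.canHandleNodeFailure() {
			blocked++
		}
		return errElection
	})
	fmt.Println(blocked)                  // prints 3
	fmt.Println(s.canHandleNodeFailure()) // prints true
}
```

Every probe made while the election is retrying finds the lock held. The lock only becomes available once `electLeader` returns, which matches the "not a deadlock, but stuck until the election recovers" behaviour described above.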

Modifications

Instead of using a mutex to ensure that only one operation is applied to the shard at a time, we switch to a dedicated goroutine event loop and communicate through channels.
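
For readers unfamiliar with the pattern, the following is a minimal sketch of a single-goroutine event loop, not the code from this PR. It assumes a hypothetical `shardController` whose state is touched only by its `run` goroutine. The type, field, and helper names (`shardState`, `ops`, `pickLeader`) and the buffer size are illustrative assumptions.

```go
package main

import "fmt"

// shardState is owned exclusively by the event-loop goroutine (hypothetical type).
type shardState struct {
	ensemble []string
	leader   string
}

// shardController serializes all operations through one channel instead of a mutex.
type shardController struct {
	ops     chan func(*shardState)
	closeCh chan struct{}
}

func newShardController(ensemble []string) *shardController {
	sc := &shardController{
		ops:     make(chan func(*shardState), 16), // buffered: senders return without waiting while there is room
		closeCh: make(chan struct{}),
	}
	go sc.run(&shardState{ensemble: ensemble, leader: ensemble[0]})
	return sc
}

// run applies operations one at a time, in the order they were enqueued.
func (sc *shardController) run(state *shardState) {
	for {
		select {
		case op := <-sc.ops:
			op(state)
		case <-sc.closeCh:
			return
		}
	}
}

// pickLeader returns the first ensemble member that is not the failed node.
func pickLeader(ensemble []string, failed string) string {
	for _, n := range ensemble {
		if n != failed {
			return n
		}
	}
	return ""
}

// HandleNodeFailure enqueues the event; the caller never takes a shard lock.
func (sc *shardController) HandleNodeFailure(node string) {
	sc.ops <- func(s *shardState) {
		if s.leader == node {
			s.leader = pickLeader(s.ensemble, node)
		}
	}
}

// Leader asks the event loop for the current leader and waits for the reply.
func (sc *shardController) Leader() string {
	reply := make(chan string, 1)
	sc.ops <- func(s *shardState) { reply <- s.leader }
	return <-reply
}

// Close stops the event loop.
func (sc *shardController) Close() { close(sc.closeCh) }

func main() {
	sc := newShardController([]string{"oxia-0", "oxia-1", "oxia-2"})
	defer sc.Close()

	sc.HandleNodeFailure("oxia-0")
	fmt.Println(sc.Leader()) // prints "oxia-1"
}
```

In this sketch a slow operation still delays the operations queued behind it, but a caller such as the node controller only enqueues an event and moves on as long as the buffer has room, rather than parking on a lock.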

@mattisonchao mattisonchao merged commit 128892a into streamnative:main Apr 23, 2024
5 checks passed
@merlimat merlimat deleted the fix-shard-controller-locking branch April 23, 2024 01:13