Use single go-routine instead of mutex for shard-controller #447

merlimat · 2024-04-23T00:31:02Z

When leader election on one shard fails continuously, the node-controller gets stuck as well.

The reason is that the shard-controller mutex is held while the leader election is being retried. If a node fails at the same time, the node controller tries to grab the same shard mutex, leading to a standstill. Technically it's not a deadlock: if the leader election recovers, then the node-controller recovers as well.

1 @ 0x440b0e 0x452018 0x451fef 0x46fc25 0x47ed5d 0x1741f48 0x1741f2a 0x1743328 0x5d4b11 0x5d467c 0x174329b 0x1743234 0xa1c7b3 0xa151fd 0xa1c736 0x473c81
# labels: {"addr":"oxia-2.oxia-svc:6649", "oxia":"node-controller-send-updates"}
#	0x46fc24	sync.runtime_SemacquireMutex+0x24										/usr/local/go/src/runtime/sema.go:77
#	0x47ed5c	sync.(*Mutex).lockSlow+0x15c											/usr/local/go/src/sync/mutex.go:171
#	0x1741f47	sync.(*Mutex).Lock+0x47												/usr/local/go/src/sync/mutex.go:90
#	0x1741f29	github.com/streamnative/oxia/coordinator/impl.(*nodeController).Status+0x29					/src/oxia/coordinator/impl/node_controller.go:139
#	0x1743327	github.com/streamnative/oxia/coordinator/impl.(*nodeController).sendAssignmentsUpdatesWithRetries.func2+0x47	/src/oxia/coordinator/impl/node_controller.go:268
#	0x5d4b10	github.com/cenkalti/backoff/v4.doRetryNotify[...]+0x1d0								/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:107
#	0x5d467b	github.com/cenkalti/backoff/v4.RetryNotifyWithTimer+0x5b							/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:61
#	0x174329a	github.com/cenkalti/backoff/v4.RetryNotify+0x19a								/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:49
#	0x1743233	github.com/streamnative/oxia/coordinator/impl.(*nodeController).sendAssignmentsUpdatesWithRetries+0x133		/src/oxia/coordinator/impl/node_controller.go:265
#	0xa1c7b2	github.com/streamnative/oxia/common.DoWithLabels.func1+0x12							/src/oxia/common/pprof.go:46
#	0xa151fc	runtime/pprof.Do+0x9c												/usr/local/go/src/runtime/pprof/runtime.go:51
#	0xa1c735	github.com/streamnative/oxia/common.DoWithLabels+0x375								/src/oxia/common/pprof.go:42


1 @ 0x440b0e 0x452018 0x451fef 0x46fc25 0x47ed5d 0x1745171 0x1745158 0x173ccb7 0x174269a 0x5d4b11 0x5d467c 0x17423db 0x1742374 0xa1c7b3 0xa151fd 0xa1c736 0x473c81
# labels: {"addr":"oxia-2.oxia-svc:6649", "oxia":"node-controller"}
#	0x46fc24	sync.runtime_SemacquireMutex+0x24									/usr/local/go/src/runtime/sema.go:77
#	0x47ed5c	sync.(*Mutex).lockSlow+0x15c										/usr/local/go/src/sync/mutex.go:171
#	0x1745170	sync.(*Mutex).Lock+0x70											/usr/local/go/src/sync/mutex.go:90
#	0x1745157	github.com/streamnative/oxia/coordinator/impl.(*shardController).HandleNodeFailure+0x57			/src/oxia/coordinator/impl/shard_controller.go:141
#	0x173ccb6	github.com/streamnative/oxia/coordinator/impl.(*coordinator).NodeBecameUnavailable+0x276		/src/oxia/coordinator/impl/coordinator.go:270
#	0x1742699	github.com/streamnative/oxia/coordinator/impl.(*nodeController).healthCheckWithRetries.func2+0x279	/src/oxia/coordinator/impl/node_controller.go:174
#	0x5d4b10	github.com/cenkalti/backoff/v4.doRetryNotify[...]+0x1d0							/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:107
#	0x5d467b	github.com/cenkalti/backoff/v4.RetryNotifyWithTimer+0x5b						/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:61
#	0x17423da	github.com/cenkalti/backoff/v4.RetryNotify+0x19a							/go/pkg/mod/github.com/cenkalti/backoff/[email protected]/retry.go:49
#	0x1742373	github.com/streamnative/oxia/coordinator/impl.(*nodeController).healthCheckWithRetries+0x133		/src/oxia/coordinator/impl/node_controller.go:153
#	0xa1c7b2	github.com/streamnative/oxia/common.DoWithLabels.func1+0x12						/src/oxia/common/pprof.go:46
#	0xa151fc	runtime/pprof.Do+0x9c											/usr/local/go/src/runtime/pprof/runtime.go:51
#	0xa1c735	github.com/streamnative/oxia/common.DoWithLabels+0x375							/src/oxia/common/pprof.go:42
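
The stall described above can be reproduced in miniature. This is only a sketch under assumptions: `lockedShard`, `electWithRetries` and `canHandleNodeFailure` are hypothetical names, not Oxia's actual code, and `TryLock` (Go 1.18+) is used only so the demo reports contention instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedShard is a hypothetical stand-in for a mutex-guarded shard controller.
type lockedShard struct {
	mu sync.Mutex
}

// electWithRetries holds the mutex for the whole retry loop, so every other
// operation on the shard has to wait until the election succeeds or gives up.
func (s *lockedShard) electWithRetries(attempt func() bool, maxAttempts int) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := 0; i < maxAttempts; i++ {
		if attempt() {
			return true
		}
		// A real retry loop would back off here, still holding the lock.
	}
	return false
}

// canHandleNodeFailure reports whether a node-failure handler could take the
// shard lock right now. A plain Lock() would block instead of returning false.
func (s *lockedShard) canHandleNodeFailure() bool {
	if !s.mu.TryLock() {
		return false
	}
	s.mu.Unlock()
	return true
}

func main() {
	s := &lockedShard{}
	blocked := 0
	s.electWithRetries(func() bool {
		// Probe the lock while the (always failing) election is in progress.
		if !s.canHandleNodeFailure() {
			blocked++
		}
		return false
	}, 3)
	fmt.Println("attempts that found the shard locked:", blocked)
	fmt.Println("lock free after the election gives up:", s.canHandleNodeFailure())
}
```

As long as the election keeps failing and retrying, the lock is never released, which matches the two goroutines above parked in `sync.(*Mutex).Lock`.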

Modifications

Instead of using a mutex to ensure that only one operation is applied to a shard at a time, switch to a dedicated go-routine event loop and communicate with it through channels.
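
A minimal sketch of that pattern, assuming a simplified controller: the type, its fields, the `elect` callback and the 10 ms retry delay are all hypothetical and are not taken from the actual change. The point it illustrates is that the election retry becomes a timer event inside the loop, so the loop stays free to accept node-failure events between attempts.

```go
package main

import (
	"fmt"
	"time"
)

// shardController is a hypothetical, simplified controller. All state below
// the channels is touched only by the run goroutine, so no mutex is needed.
type shardController struct {
	elect        func() bool   // one election attempt; true on success
	nodeFailures chan string   // events posted by node controllers
	closeCh      chan struct{} // closed to stop the loop
	done         chan struct{} // closed when the loop has exited

	// Owned by the run goroutine; read them only after Close returns.
	electionAttempts int
	handled          []string
}

func newShardController(elect func() bool) *shardController {
	s := &shardController{
		elect:        elect,
		nodeFailures: make(chan string),
		closeCh:      make(chan struct{}),
		done:         make(chan struct{}),
	}
	go s.run()
	return s
}

// HandleNodeFailure hands the event to the event loop. It waits only for the
// loop to receive it, never for the election to finish.
func (s *shardController) HandleNodeFailure(node string) {
	select {
	case s.nodeFailures <- node:
	case <-s.closeCh:
	}
}

// Close stops the event loop and waits for it to exit.
func (s *shardController) Close() {
	close(s.closeCh)
	<-s.done
}

func (s *shardController) run() {
	defer close(s.done)
	var retry <-chan time.Time // nil (never ready) while no retry is pending
	tryElection := func() {
		s.electionAttempts++
		if s.elect() {
			retry = nil
		} else {
			retry = time.After(10 * time.Millisecond) // assumed fixed delay
		}
	}
	tryElection()
	for {
		select {
		case <-retry:
			tryElection()
		case node := <-s.nodeFailures:
			s.handled = append(s.handled, node)
		case <-s.closeCh:
			return
		}
	}
}

func main() {
	// The election always fails, yet the node failure is still processed.
	s := newShardController(func() bool { return false })
	s.HandleNodeFailure("oxia-2.oxia-svc:6649")
	s.Close()
	fmt.Println("handled:", s.handled, "election attempts:", s.electionAttempts)
}
```

In this sketch a single election attempt still runs on the loop goroutine, so a slow attempt would delay other events. Only the waiting between retries is taken off the critical path.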

Use single go-routine instead of mutex for shard-controller

c6ae17e

merlimat added the type/bug label Apr 23, 2024

merlimat requested review from mattisonchao, coderzc and RobertIndie as code owners April 23, 2024 00:31

merlimat self-assigned this Apr 23, 2024

mattisonchao approved these changes Apr 23, 2024

mattisonchao merged commit 128892a into streamnative:main Apr 23, 2024
5 checks passed

merlimat deleted the fix-shard-controller-locking branch April 23, 2024 01:13
