Fix data race in Get/Set after Close is called #60

Merged · 3 commits · Jan 31, 2025

Conversation

Yiling-J (Owner):

Fix #59

@@ -51,6 +51,7 @@ type Shard[K comparable, V any] struct {
	vgroup  *Group[K, V] // used in secondary cache
	counter uint
	mu      *RBMutex
	closed  bool

miparnisari (Jan 30, 2025):

Shouldn't this be an atomic.Bool, since you can have multiple goroutines trying to read and write it?

The downside is that it will make all operations slightly slower.

It could also be a plain bool guarded by an RWMutex.
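
For reference, a minimal sketch of the two alternatives discussed here; the types and helper names below are illustrative, not from this PR:

// Option A: lock-free flag; safe to read without holding any mutex.
// Needs "sync/atomic".
type shardA struct {
	closed atomic.Bool
}

func (s *shardA) isClosed() bool { return s.closed.Load() }

// Option B: plain bool guarded by a mutex; every reader and writer
// must take the lock. Needs "sync".
type shardB struct {
	mu     sync.RWMutex
	closed bool // guarded by mu
}

func (s *shardB) isClosed() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.closed
}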

Yiling-J (Owner, Author):

The closed field is protected by the shard mutex, and as you can see, I check it immediately before or after accessing the shard hashmap, so the mutex is guaranteed to already be held at that point. Meanwhile, the Close method also first acquires the shard mutex before updating the closed field, making it safe to use a basic bool type here.
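
In other words, the pattern is roughly the following (a simplified sketch with a hypothetical shard type; sync.Mutex stands in for the RBMutex):

type shard[K comparable, V any] struct {
	mu      sync.Mutex // the real code uses an RBMutex
	closed  bool       // guarded by mu
	hashmap map[K]V    // guarded by mu
}

func (s *shard[K, V]) set(key K, value V) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed { // checked under the same mutex that Close writes it under
		return false
	}
	s.hashmap[key] = value
	return true
}

func (s *shard[K, V]) close() {
	s.mu.Lock()
	s.closed = true       // written while holding the mutex, so a plain bool is race-free
	s.hashmap = map[K]V{} // clear the shard's data
	s.mu.Unlock()
}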

miparnisari:

I think it'd be good if you documented which fields are guarded by mu. For example: https://github.com/openfga/openfga/blob/94dd78f03dc99e51f46e7f71deaaea878e8340c9/pkg/storage/memory/memory.go#L125-L128

This doesn't change anything at runtime but it's useful documentation.
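
For instance, annotations along these lines (field list abridged, purely illustrative):

type Shard[K comparable, V any] struct {
	mu *RBMutex

	// The fields below are guarded by mu.
	counter uint
	closed  bool
}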

func (s *Store[K, V]) Close() {
	for _, s := range s.shards {

miparnisari:

Suggested change:
-	for _, s := range s.shards {
+	for _, shard := range s.shards {

Otherwise it gets confusing reading this, as s is already the Store itself.

Comment on lines +756 to +757
// Close waits for all current read and write operations to complete,
// then clears the hashmap and shuts down the maintenance goroutine.

miparnisari:

Suggested change:
-// Close waits for all current read and write operations to complete,
-// then clears the hashmap and shuts down the maintenance goroutine.
+// Close clears the hashmap and shuts down the maintenance goroutine,
+// then waits for all current read and write operations to complete.

Yiling-J (Owner, Author):

I prefer my original version here. For each shard, shard.mu.Lock() is called first, which ensures that the current read/write operations holding the mutex are completed before closing.
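
That is, the rough shape of Close as described above (a sketch, not the actual implementation):

func (s *Store[K, V]) Close() {
	for _, shard := range s.shards {
		shard.mu.Lock()     // blocks until in-flight readers/writers release the mutex
		shard.closed = true // subsequent Get/Set calls observe this and return early
		// clear the shard's hashmap here ...
		shard.mu.Unlock()
	}
	// shut down the maintenance goroutine here ...
}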

@@ -1081,6 +1104,10 @@ func (s *LoadingStore[K, V]) Get(ctx context.Context, key K) (V, error) {
	loaded, err, _ := shard.group.Do(key, func() (Loaded[V], error) {
		// load and store should be atomic
		shard.mu.Lock()
		if shard.closed {
			shard.mu.Unlock()

miparnisari:

I checked out the code locally, and I see that the locking and unlocking code gets weird, in that sometimes a function locks but doesn't unlock (hopefully because it calls a function that does the unlock). You risk introducing deadlock scenarios with this.

My recommendation (not for this PR) is that you at least document each function:

  • if it unlocks a mutex
  • if it assumes that the mutex has already been locked by the caller.

I also recommend that you use this pattern:

mutex.Lock()
defer mutex.Unlock()

to prevent the issue completely (at the expense of holding the mutex for a longer time).
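
For example, the contract could be spelled out in each function's doc comment (hypothetical helpers, not from this PR):

// setLocked stores the entry. It assumes the caller already holds s.mu
// and does not unlock it.
func (s *shard[K, V]) setLocked(key K, value V) { s.hashmap[key] = value }

// Set acquires s.mu itself and releases it before returning.
func (s *shard[K, V]) Set(key K, value V) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.setLocked(key, value)
}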

@@ -201,3 +203,92 @@ func TestStore_SinkWritePolicyWeight(t *testing.T) {
	require.Equal(t, 8, int(store.policy.weightedSize))

}

func TestStore_CloseRace(t *testing.T) {
	store := NewStore[int, int](1000, false, true, nil, nil, nil, 0, 0, nil)

miparnisari (Jan 31, 2025):

Not for this PR, but holy cow that is a lot of parameters 🤣 are all of those mandatory?! If they aren't, I suggest using the "functional options pattern" described here [scroll to option 3]. My team has been using that with quite a lot of success.
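
A minimal sketch of that pattern (illustrative names only, not the library's actual API):

type config struct {
	capacity  int
	secondary bool
}

type Option func(*config)

func WithCapacity(n int) Option    { return func(c *config) { c.capacity = n } }
func WithSecondary(on bool) Option { return func(c *config) { c.secondary = on } }

// newConfig applies any number of options on top of defaults,
// so callers only pass what they need.
func newConfig(opts ...Option) *config {
	cfg := &config{capacity: 1000}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

// Usage: cfg := newConfig(WithCapacity(1000), WithSecondary(true))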

Yiling-J (Owner, Author):

@miparnisari I think the suggestions are very reasonable, and some refactoring is definitely needed. I'll address those in a separate PR and keep this one focused on fixing the close race issue.

Yiling-J merged commit ba3711c into main on Jan 31, 2025. 22 checks passed.

Successfully merging this pull request may close these issues: Data race when Get happens after Close.