Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: sync should only happen when both sides are enabled #764

Merged
merged 35 commits into from
Aug 27, 2024

Conversation

EvanHahn
Copy link
Contributor

Closes #762.

@gmaclennan
Copy link
Member

I added a commit which I think fixes this. I don't particularly like this, and I feel like this code is fragile, given that we are doing something not supported by hypercore, but I have written some more robust unit tests for this now.

I extracted unreplicate into its own function, and added more comprehensive unit tests for different ways that cores could be unreplicated, e.g. in different orders, and with delays. I found that a delay between unreplicating (e.g. if one peer replicates after the other) caused errors, due to the peer being removed from core.peers when the other side closes the channel, which results in not finding the channel that needs to be closed.

The solution I found for this is that rather than iterating through peers attached to a core to get the channel reference (to close it), I iterate through the channels of a protomux. I'm not sure what the consequences are if there are multiple channels with the same id on a protomux, but I don't think we do that so I think it's fine. With the unit tests for unreplicate and the fuzzy sync tests (thanks @EvanHahn !) I think this is ok, even though I don't like how I'm implementing this!

@gmaclennan
Copy link
Member

Correction: not yet fixed! Fuzz tests are showing errors still with re-sync not working as expected. I am unable to reproduce locally, I wonder if there is a way we can make fuzz tests reproducible?

@gmaclennan
Copy link
Member

The issue could be because of edge cases (race conditions) around starting and stopping sync in the same tick. Opening a protomux channel (which happens when replicating a core) is asynchronous, and I'm not sure what happens when we close a protomux channel in the same tick as when it is opening. Our sync api start() should probably be async, and implement start-stop-state-machine so that stop waits for start before attempting to stop.

@EvanHahn
Copy link
Contributor Author

The issue could be because of edge cases (race conditions) around starting and stopping sync in the same tick. Opening a protomux channel (which happens when replicating a core) is asynchronous, and I'm not sure what happens when we close a protomux channel in the same tick as when it is opening. Our sync api start() should probably be async, and implement start-stop-state-machine so that stop waits for start before attempting to stop.

Filed #786.

@EvanHahn EvanHahn marked this pull request as ready for review August 27, 2024 20:20
@EvanHahn
Copy link
Contributor Author

Gregor and I paired on this so I will merge without additional review.

@EvanHahn EvanHahn merged commit b61e132 into main Aug 27, 2024
7 checks passed
@EvanHahn EvanHahn deleted the 2024-08-22-sync-bug-exploration branch August 27, 2024 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sync can unexpectedly occur with devices that have not enabled it
2 participants