Tweak logs and attempt to avoid races around removing nodes #504
Conversation
jsdw commented on Oct 5, 2022
- Updated some of the log messages to the nicer inline fmt style (see the sketch after this list), and tweaked the odd message.
- Don't queue up "remove node" messages from shard to core if disconnected; they won't make sense after a reconnect.
- Try to avoid reusing IDs; reusing them might cause races around disconnects if messages are queued or received with "old" IDs while new nodes have been assigned those same IDs.
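A minimal sketch of the log-style change mentioned in the first bullet (the message and variable names here are made up for illustration, not taken from the actual diff): Rust format strings can capture identifiers inline, so the positional arguments can be dropped.

```rust
use log::info;

fn log_disconnect(node_id: u64, shard_id: u64) {
    // Older positional style:
    info!("Disconnecting node {} from shard {}", node_id, shard_id);

    // The "nicer inline fmt style" (stable since Rust 1.58), capturing the
    // variables directly in the format string:
    info!("Disconnecting node {node_id} from shard {shard_id}");
}
```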
// It's possible that some race between removing and disconnecting shards might lead to
// more than one remove message for the same node. This isn't really a problem, but we
// hope it won't happen so make a note if it does:
log::debug!(
Are we worried about the process consuming too much memory because we may never call remove_nodes_and_broadcast_result here, leading to a potential OOM kill?
Would it be possible for a malicious party to force the shard to submit Remove { random_id }?
And in that case, could we "mute" / disconnect the shard if it makes N wrong Remove calls?
It shouldn't be possible for any party to have any control over any of the ids; we assign them all internally.
And if the id doesn't exist for this remove message, it means the node has been removed already :)
Remove calls are also entirely internal; an external person cannot cause it to be called except by disconnecting (or when messages with certain ids go stale).
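To illustrate the point above, here's a minimal sketch (the `NodeMap`/`handle_remove` names and the plain `HashMap` are illustrative stand-ins, not the actual telemetry-core types) of how a remove message for an already-removed ID is simply noted and dropped, so stale or duplicate removes can't grow state:

```rust
use std::collections::HashMap;

type NodeId = u64;

struct NodeMap {
    // Simplified stand-in for the real id <-> node mapping.
    mapping: HashMap<NodeId, String>,
}

impl NodeMap {
    /// Handle an internally generated `Remove { id }` message. IDs are only
    /// ever assigned internally, so an unknown ID just means the node was
    /// already removed (e.g. by a racing disconnect): note it and move on.
    fn handle_remove(&mut self, id: NodeId) {
        match self.mapping.remove(&id) {
            Some(name) => log::info!("Removed node {name} (id {id})"),
            None => log::debug!("Remove for unknown node id {id}; probably already removed"),
        }
    }
}
```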
// Leave the `current_id` as-is. Why? To avoid reusing IDs and risking
// race conditions where old messages can accidentally screw with new nodes
// that have been assigned the same ID.
self.mapping = BiMap::new();
Hmm, I don't understand. Before, we were resetting the id to 0 via BiMap::new(), so multiple nodes/entities could end up with id == 0 after a clear; now we just keep the counter as-is and the next node gets current_id + 1?
It also looks like the id will wrap in release mode if current_id + 1 > MAX. Can you add an explicit panic or something so we are sure it doesn't wrap around and end up replacing an "old ID"? Or what happens then?
Yup, I did this because I worried that "old" IDs might be reused in some sort of obscure race condition, leading to new nodes being affected by "old" messages or something like that.
Indeed, eventually it'll overflow and panic, but since the ID is a 64-bit value it'll never realistically hit usize::MAX, so I wasn't worried about overflowing (I almost considered adding an explicit wrapping_add, but meh).
Since it was so easy to do, I made it use wrapping_add :)
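Putting both replies together, a minimal sketch of the resulting behaviour (field and method names are illustrative; the real code uses its own BiMap-style mapping rather than a plain HashMap):

```rust
use std::collections::HashMap;

type NodeId = u64;

struct NodeMap {
    current_id: NodeId,
    mapping: HashMap<NodeId, String>,
}

impl NodeMap {
    /// Hand out a fresh ID. `wrapping_add` means this can never panic on
    /// overflow; with a 64-bit counter it won't realistically wrap anyway.
    fn assign_id(&mut self, name: String) -> NodeId {
        let id = self.current_id;
        self.current_id = self.current_id.wrapping_add(1);
        self.mapping.insert(id, name);
        id
    }

    /// Clear all nodes but leave `current_id` alone, so IDs are never reused
    /// and stale messages carrying old IDs can't affect newly added nodes.
    fn clear(&mut self) {
        self.mapping = HashMap::new();
    }
}
```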
…ch#504)
* Tweak logs and attempt to avoid races around removing nodes
* wrapping_add in assign_id