Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(core): Fix panic when reloading config under load #14875

Merged
merged 4 commits into from
Dec 7, 2022

Conversation

Zettroke
Copy link
Contributor

@Zettroke Zettroke commented Oct 18, 2022

Fixes #13814

Fix panic when reloading config under load.
Panic was caused by incorrect pause implementation in SendGroup, which tries to remove senders without handling case when send is in-flight

It appears to me that handling control signals during events sending is useless. Because in Fanout::send before each send we consume and process all control signals.
Also add, replace and pause handlers in SendGroup is equivalent to handlers in Fanout (they didn't affect in-flight sends). Only remove is handled differently, and I am not sure if this special handling is required.

Also removing control signals handling during send simplifies code a bit.

@netlify
Copy link

netlify bot commented Oct 18, 2022

Deploy Preview for vector-project ready!

Name Link
🔨 Latest commit c0ac04d
🔍 Latest deploy log https://app.netlify.com/sites/vector-project/deploys/638f9315719fbd00095f9c55
😎 Deploy Preview https://deploy-preview-14875--vector-project.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@github-actions github-actions bot added the domain: core Anything related to core crates i.e. vector-core, core-common, etc label Oct 18, 2022
@tobz
Copy link
Contributor

tobz commented Oct 18, 2022

@Zettroke I appreciate you taking the time to try and come up with a fix for the linked issue.

Fix panic when reloading config under load. Panic was caused by incorrect pause implementation in SendGroup, which tries to remove senders without handling case when send is in-flight

Can you elaborate more on what the broken logic is?

The main reason that we handle control messages during an in-flight send in Fanout is precisely for the configuration reloading case, in order to avoid a deadlocked component (a component that we're blocked on sending to) from preventing us from reconfiguring Fanout to match the new configured topology. Imagine reconfiguring Vector to get rid of a component that was pointed at a non-existent network endpoint, which was causing the sink to stall and block Vector from making progress... and you now you can't remove it without fully restarting Vector because Fanout has an in-flight send to that component which it's blocked on completing before it responds to control messages to remove that component.

I see your change gets rid of the handling of control messages during an in-flight send, and simply blocks waiting for the send to fully complete... which would certainly avoid issues of handling changes to Fanout during an in-flight send, but as mentioned above, this basically makes Fanout unfit for the intended purpose.

It appears to me that handling control signals during events sending is useless. Because in Fanout::send before each send we consume and process all control signals. Also add, replace and pause handlers in SendGroup is equivalent to handlers in Fanout (they didn't affect in-flight sends). Only remove is handled differently, and I am not sure if this special handling is required.

This is partially addressed but what I've mentioned above, but the reason for the symmetry -- having SendGroup handle not only remove but also add/replace/pause -- is because we need to keep control messages in order, and there's no way to receive a control message, see if it's a remove message, and then push it back onto the front of the control message channel if it's not. So if we want to handle any control messages at all in SendGroup, we have to be able to handle all of them in SendGroup.

@Zettroke
Copy link
Contributor Author

Zettroke commented Oct 18, 2022

@tobz Seems I rushed with fix without completely understanding reload logic. Thanks for explanation.
I've replaced fix in this pull request. Now I am replicating remove logic in pause, by detaching in-flight sends

Can you elaborate more on what the broken logic is?

pause tries to remove sender from SendGroup.senders, but if it happens while send is in-flight, assertion will fails resulting in panic. And removing assertion wont help, because when send completes it will put sender back.

@tobz
Copy link
Contributor

tobz commented Oct 18, 2022

pause tries to remove sender from SendGroup.senders, but if it happens while send is in-flight, assertion will fails resulting in panic. And removing assertion wont help, because when send completes it will put sender back.

I'll preface this with saying that I realize I touched this code most recently, and so I'm still refreshing myself on all the reasons I wrote it the way that I did, so apologies for asking questions that theoretically I should know the answer to. 😂

I'm still unclear on why this bug is occurring. The assert in pause is there to make sure we're not trying to double pause a sender, because that would imply that the topology had lost track of what components should be in what state. Specifically, the panic itself isn't that the send future encounters an error because we moved it out of the SendGroup, but that we tried to pause a sender that had already been previously paused.

This change is still papering over that fact by simply not asserting that we actually got back the sender value when we pause it... which, yes, solves the panic but doesn't answer the question of why the topology is trying to double pause that specific sender.

@Zettroke
Copy link
Contributor Author

Zettroke commented Oct 18, 2022

Topology doesn't send pause twice.
When pause arrives, send is in progress and sender is not in place. So assert is failing.

Also fun fact: problem occurs in release mode, but not in debug mode

There is debug logs which may help with my lousy explanations)
I have simple topology sources.input -> transforms.kappa -> sinks.out
and changing transforms.kappa

image
image

@Zettroke
Copy link
Contributor Author

Considering assert against double pause, and other invalid control sequences. I can add assert!(self.sends.contains_key(id)) before try_detach_send

@tobz
Copy link
Contributor

tobz commented Oct 18, 2022

Topology doesn't send pause twice.
When pause arrives, send is in progress and sender is not in place. So assert is failing.

Ah, right! Sorry, I think I'm finally caught up. Right, so we take the Sender out of Option<Sender> when we actually build the send future and the send is in progress... so of course it won't be there while the send is in progress because the future still owns it.

Thinking it through, I believe your new approach is the correct one. The main constraint we need to uphold is that the sender slot -- Option<Sender> -- is None after a pause command is processed, which this provides. Since the sender slot is already None during a send (as we took ownership of the Sender), and we only give it back when we poll the send future to completion and get the Sender as a result... a detached send has no way to add itself back to the sender slot.

I'd still like to keep the assertion that we aren't double pausing, though. One way we could do that is to have try_detach_send return a bool, where we return true if we actually found a send for the given component and detached it, and false otherwise. With that, we would simply assert that we actually detached the send, since we should never be in a send group unless we created a send future for all senders before awaiting SendGroup::send.

I'm thinking we may also want to do the same type of assert for SendGroup::remove, too, since pause/replace and remove should be mutually exclusive operations, so we wouldn't be removing a paused component, only ever replacing it... but I'll think about that a little more.

@netlify
Copy link

netlify bot commented Dec 6, 2022

Deploy Preview for vrl-playground ready!

Name Link
🔨 Latest commit c0ac04d
🔍 Latest deploy log https://app.netlify.com/sites/vrl-playground/deploys/638f93155daffd0009e40760
😎 Deploy Preview https://deploy-preview-14875--vrl-playground.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@github-actions github-actions bot added the domain: topology Anything related to Vector's topology code label Dec 6, 2022
Copy link
Contributor

@tobz tobz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me, and thinking it through, it also seems good to me. Going to get another review on it, though, as now I myself have added/changed code.

@jszwedko jszwedko requested a review from bruceg December 6, 2022 19:14
@tobz tobz merged commit 550aa8b into vectordotdev:master Dec 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: core Anything related to core crates i.e. vector-core, core-common, etc domain: topology Anything related to Vector's topology code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vector panic on reload
3 participants