Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 659: sync and async timer state rework #906

Merged
merged 14 commits into from
Feb 25, 2021

Conversation

Wraith2
Copy link
Contributor

@Wraith2 Wraith2 commented Feb 10, 2021

Adds in finer grained timerout state and allows checking of timeouts synchronously in case async timers are behaving unreliably. Running the original repro I can no longer get invalid results. @cheenamalhotra @David-Engel

@cheenamalhotra
Copy link
Member

HI @Wraith2

This is good progress! It does improve things, I've extended my repro in last comment in #659 further so we can now consistently reproduce this issue. But I see exceptions where it's not expected:

image

Source: https://gist.github.com/cheenamalhotra/5f4b35603c491484e51be38abf15548a#file-659_repro2-cs
Please try it too! 🤞🏻
[Configure OnTimeout with 5 seconds delay]

@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 10, 2021

I'm not suggesting this is a full fix, at best it's just a mitigation for the timing problem and as you point out it still needs further testing and dev. It is however minimally invasive and builds on the existing timeout code as well as eliminating the problem in the original test case. It's a promising avenue.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 11, 2021

Caught it. The timeout on the 3rd entry is a queued callback from the previous one. I forced the timer to be disposed each run through and set the state value on the timer when it was created. When the 3rd test is run and the callback runs the state value does not match the one it should have been, it's the value from before. Sadly we can't use this to detect the fault because the only way to set the timer on the state is in the constructor. It does mean that we need to identify each callback and link it to the individual request that it was generated for so we can ignore ones that don't match the current query.

ChangeNetworkPacketTimeout(1000,-1) on iteration=1
OnTimeoutAsync(state=1) pre wait
ChangeNetworkPacketTimeout(-1,-1) on iteration=1
ChangeNetworkPacketTimeout(10000,-1) on iteration=2
ChangeNetworkPacketTimeout(-1,-1) on iteration=2
ChangeNetworkPacketTimeout(10000,-1) on iteration=3
OnTimeoutAsync(state=1) post wait <-- this timeout should be ignored.

@cheenamalhotra
Copy link
Member

Yes because this StateObj is reused, we need to also associate it with something that identifies the query/command that's executed.
Just relying on StateObj is going to be unreliable and we can break in.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 11, 2021

I've updated this to include the changes I described in #907 (comment) The timer must be disposed each time, this is unavoidable. I've added the state value, source and timeout state class as well as a spare to allow safe re-use when the timer is behaving (which should be most of the time).

The CI tests just aren't working correctly. It's failing on all sorts of things unreliably. Locally the tests pass. Debug and Release run through the original 4000 thread without encountering any invalid data, up to 400,000 requests in release mode at time or writing. The 2 repros you linked above also behave as expected now @cheenamalhotra and I've added a note and ifdef around the thread sleep so even if it got merged it wouldn't make it to production.

The first commit made timeouts more reliable under load. This second commit adds the fix for late timeout callbacks that @David-Engel @cheenamalhotra and me have converged on by investigation. I think this is a decent candidate for the fix with both pieces of work included even though we strictly only need the second.

Processed: 509942 - Network errors:  25166 - Missing:      0 - Invalid:      0

Copy link
Contributor

@David-Engel David-Engel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that your implementation improves the reliability of timeout to throw under load by calling OnTimeoutSync() directly from ReadSni instead of relying solely on the timer. But I think we need to address the rollover of _timeoutIdentitySource. I'll probably close my PR in favor of moving yours forward.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 11, 2021

I've removed the spare. I realised that it would only be useful when a timeout occurs which should be rare. As such it wasn't worth the complexity.

cheenamalhotra and others added 2 commits February 12, 2021 17:48
@Wraith2 Wraith2 marked this pull request as ready for review February 12, 2021 17:53
@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 12, 2021

Tests merged. CI is running now as i've changed status to open.

Do you want to try and do netfx in this same PR or a separate one? I'd prefer a second one but let me know what you think.

@cheenamalhotra
Copy link
Member

Tests merged. CI is running now as i've changed status to open.

Do you want to try and do netfx in this same PR or a separate one? I'd prefer a second one but let me know what you think.

Let's do it here, I see pipelines are failing in NetFx due to same reason.

@Wraith2
Copy link
Contributor Author

Wraith2 commented Feb 12, 2021

Let's do it here, I see pipelines are failing in NetFx due to same reason.

Ok Done. I was worried that the netfx version was going to be very different but it looks like the relevant pieces of code were highly conserved between the two. The only thing of note is that netfx creates the timer without suppressing execution context flow. I didn't change the behaviour but it might be a perf win on netfx to look at in future if the same thing can be done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants