Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix | Fixes incorrect data errors due to timeout running very late #907

Closed
wants to merge 7 commits into from

Conversation

David-Engel
Copy link
Contributor

Attempted fix for #659.

Many thanks to Cheena for the legwork in locating the troublesome area.

The underlying problem is that the timer that fires OnTimeout can fire very late when there are no free threads in the pool, even after the state object has been recycled and is put back into use on another query. My solution is to track the owning object that the timeout is supposed to be associated with and if it is has changed when the timer runs, OnTimeout should simply return rather than send attention for the now incorrect query.

Cheena also observed timeouts not working on subsequent re-uses of the state object. I think this also fixes those. We should add tests specific to these scenarios to this PR before merging.

This seems to fix the incorrect data repros for 659 that I've run. Hopefully others can validate it, too, so that we can get a solution finalized.

CC: @cheenamalhotra @Wraith2

@Wraith2
Copy link
Contributor

Wraith2 commented Feb 11, 2021

Looks reasonable. I've arrived at the timer state solution as well and mentioned it in the draft i posted earlier. We need a way to identify the item that the callback is targeted at. I'm not sure if the identity of the owner is unique enough for this, if you've traced the lifetime and it is then ok. I was considering just an auto incremented counter. My problem is I really won't want to create a timer for each invocation.

Does this fix even the fast running cases from the original repo?

@David-Engel
Copy link
Contributor Author

I have only run some of Cheena's repros with 6000+ tasks that repro'd well for me and the one (from today) that interleaved multiple queries on the same connection with varying waits + cancels. I have not run the original repro. It would be great if someone else could run that one. I haven't had time to run them all. I'm even okay if someone wants to take my PR and run with it on their own.

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Feb 11, 2021

Yes, the root case related to Timeout delegate delays seems to be fixed here completely, to explain more, below issues are addressed with this PR:

  1. [Fixed] Timeout delegate would sent attention to server and corrupt next queries even when the SqlDataReader was processed and data was sent to user. This eventually triggered wrong data flow.
  2. [Fixed] Future query timeouts on the same connection did not send Attention signals.
  3. [Fixed] In Non-Mars connection (when using transactions), wrong data flow was replaced by Transaction errors and users would get errors like: "The server failed to resume this transaction" and "This SqlTransaction has completed; it is no longer usable."

I'm working on adding tests to ensure we can test these scenarios in pipelines with our repro available.

@David-Engel this also needs to be done in NetFx source, so would request the same. :)

@David-Engel
Copy link
Contributor Author

Another note: In the high load scenario, we haven't been able to reliably dispose of the timer or keep it from firing when it shouldn't. That's the reason for this particular solution.

@David-Engel
Copy link
Contributor Author

David-Engel commented Feb 11, 2021

@Wraith2

I've arrived at the timer state solution as well and mentioned it in the draft i posted earlier. We need a way to identify the item that the callback is targeted at. I'm not sure if the identity of the owner is unique enough for this, if you've traced the lifetime and it is then ok. I was considering just an auto incremented counter. My problem is I really won't want to create a timer for each invocation.

I almost went with an auto-incremented counter, too, but noticed the owner link and thought that would be more reliable and easier than yet another counter. I couldn't think of a reliable way to ensure OnTimeout wouldn't run incorrectly without re-creating the timer with a new state reference. You and Cheena probably have a better overall picture of all the async lifetime and interleaving than I do, so I'd rely more on your confidence in this solution than my own. 😄

@Wraith2
Copy link
Contributor

Wraith2 commented Feb 11, 2021

I thought this through overnight while I was supposed to be sleeping and believe that we have to recreate the timer. It's annoying but the only way to reach the state object is through the constructor because of how the internals of it work, even with private reflection we couldn't reliably do it. So I'm going to write this one off against the performance capital gained from other changes and just accept it.

We have to recreate the timer to set the state object. We need the state object to be a ref type to avoid silly boxing behaviour so we'll probably end up with something like StrongBox<int> and an auto incrementing counter, copied into a field state variable on timer setup and then cleared anytime the state changes. Then on timer callback compare state counter to current counter and discard if the counters aren't equal. We can use the spare object pattern to keep a single clean object around for cases where everything is working correctly, it'll save a little memory.

I believe that the changes in my draft PR #906 are still useful because they increase the reliability of the timeout. I think I should add the logic described above to that draft, what do you think?

@cheenamalhotra
Copy link
Member

Just to complete this PR, tests are here: David-Engel#1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants