Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server: validator - assimilator race condition - should validator set assimilate_state? #5706

Open
bema-aei opened this issue Jul 24, 2024 · 3 comments · May be fixed by #5707
Open

Server: validator - assimilator race condition - should validator set assimilate_state? #5706

bema-aei opened this issue Jul 24, 2024 · 3 comments · May be fixed by #5707

Comments

@bema-aei
Copy link
Contributor

bema-aei commented Jul 24, 2024

Occasionally Einstein@Home encounters the following situation:

  • quorum of a workunit is completed, the workunit is validated, a canonical result is found and the validator sets assimilate_state =1
  • while the assimilator is working on the workunit (and a chunk of others), another result is reported by the scheduler, need_validate is set and the validator reads the workunit record
  • the assimilator finishes assimilation and sets assimilate_state=2. file deletion is triggered
  • the validator finishes validation and sets assimilate_state=1, overriding the value the assimilator just set
  • the assimilator tries to assimilate the same workunit again.
    Depending on whether the file deleter has already finished, the assimilator either tries to assimilate the same canonical result files a second time or it doesn't find these at all. Both conditions cause our assimilator to halt (for good reason), and require manual intervention to analyze and resolve the situation.

Also, when setting assimilate_state, the validator doesn't care about results of the workunit that are still "in progress". Triggering assimilation (and subsequently file deletion) of a workunit immediately after finding a canonical result means that these "late" results can not be (successfully) validated, assimilated and credited even if these are valid and arrive on time, i.e. before their deadline.

I was about to modify the validator to not set assimilate_state directly, but just let the transitioner know to take a look at that workunit (immediately). It does handle such late results correctly and, and the transitioner doesn't have such a delay between reading the workuits and writing modifications as the validator has (at least occasionally).

But then I came across that comment in the validator:

https://github.com/BOINC/boinc/blob/master/sched/validator.cpp#L677-L679

                // if we found a canonical result,
                // trigger the assimilator, but do NOT trigger
                // the transitioner - doing so creates a race condition

What race condition is this comment referring to, and how would we resolve the problems above then?

@bema-aei bema-aei changed the title validator - assimilator race condition - should validator set assimilate_state? Server: validator - assimilator race condition - should validator set assimilate_state? Jul 24, 2024
@brevilo
Copy link
Contributor

brevilo commented Jul 24, 2024

Note, you want to use permalinks when referring to code, so the link stays valid over time (i.e. in context with the issue). See the ellipsis in the upper-right corner of the file view.

@brevilo
Copy link
Contributor

brevilo commented Jul 24, 2024

What race condition is this comment referring to

Does the actual commit help? Maybe even try and tap into Bruce's memory?

@bema-aei
Copy link
Contributor Author

bema-aei commented Jul 24, 2024

Thanks it does.

The interesting thing here is that the commit actually doesn't change the
wu.assimilate_state = ASSIMILATE_READY; line, it is just left untouched. This means the code before that commit did both, set assimilate_state to trigger assimilation and trigger immediate transition. This will surely create a race condition.

Bruce was correct to avoid this, but I think he fixed it at the wrong end. IMHO immediate transition should be triggered here, and not assimilation directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Backlog
Development

Successfully merging a pull request may close this issue.

3 participants