
Migrate to tbb::task_group #305

Merged (31 commits, Feb 15, 2022)
Conversation

makortel
Collaborator

@makortel makortel commented Feb 4, 2022

Resolves #295. Following the approach in cms-sw/cmssw#32804.

@fwyzard
Contributor

fwyzard commented Feb 9, 2022

All the backends I tested (all but hip and kokkos) compile, run, and pass the validation.

With 8 or more threads, the performance is the same as before.
With fewer than 8 threads, the performance is visibly worse than before:

[throughput plot not included]

Is this expected?

@makortel
Collaborator Author

Thanks @fwyzard for the check. Chris measured the performance of this approach a while ago; the results are in https://indico.cern.ch/event/1005030/#6-adapting-cmssw-to-tbbs-inter. On slide 9, this setup corresponds to the green line (single tbb::task_group). With very fast tasks a significant degradation would be expected, but in CMSSW with the "trivial configuration" (slide 13) the degradation turned out to be small (at throughputs orders of magnitude higher than here).

It could be that the setup here lies somewhere in between Chris' toy framework and CMSSW. I'm looking into this.

@makortel
Collaborator Author

The culprit appears to be this spin loop

```cpp
do {
  group.wait();
} while (not globalWaitTask.done());
```

which is needed to guarantee that the program won't terminate while the only remaining work is waiting for asynchronous work, launched from an ExternalWork acquire(), to complete. perf showed a high load on this loop, and the "best" way to hit this code path is one-thread (or low-thread) jobs, where the impact on throughput is visible.

I gave the "deferred task" preview feature of recent oneTBB versions a try (as suggested to us in uxlfoundation/oneTBB#346 (comment)), and it seemed to restore the performance (at least to within ~2 % in my limited tests). I used oneTBB v2021.4.0 (which we use in CMSSW_12_3_X) for that test. While exploring this option I noticed that GCC 8.4.0 crashes when compiling oneTBB v2021.4.0, while GCC 9.3.0 worked fine. I'd suggest we just bump the minimum GCC version to 9; I suppose we have no hard requirement to keep supporting GCC 8 (I had kept it only because it was used in the CMSSW version from which the CUDA code was originally exported).

I had been thinking of updating oneTBB anyway after this PR. The main question is whether it would be better to move to a recent oneTBB and use deferred tasks already in this PR, or in two subsequent PRs. I don't have strong feelings at this point. @fwyzard, which would you prefer?

@fwyzard
Contributor

fwyzard commented Feb 11, 2022

@makortel thanks for the investigation!

I would make one PR to bump the GCC requirement, and then update this PR to include the newer version of TBB and the use of deferred tasks.

@makortel
Collaborator Author

I would make one PR to bump the GCC requirement, and then update this PR to include the newer version of TBB and the use of deferred tasks.

Sounds good, I'll prepare another PR to bump the GCC requirement.

…e task_arena for runToCompletion()

Propagating changes from
cms-patatrack#242
cms-patatrack#245

For consistency, make --numberOfThreads 0 use all CPU cores, and reorder #includes.
@makortel
Collaborator Author

makortel commented Feb 14, 2022

Updates

```cpp
m_task->increment_ref_count();
m_handle = std::make_shared<tbb::task_handle>(m_group->defer([task = m_task]() {
```
Collaborator Author

I'm not terribly happy with the approach I took for tbb::task_handle here, e.g. there are now three members doing reference counting (two shared_ptrs implicitly and m_task explicitly), but I'd leave improvements (likely a redesign) to a later time.

Comment on lines +105 to +108

```cpp
m_group->run([task]() {
  TaskSentry s{task};
  task->execute();
});
```
Contributor

Trying to understand what this pattern does:

  • m_group is a pointer to the tbb::task_group that will run the task
  • m_task is a pointer to the WaitingTask to be run
  • task is a copy of m_task (the comment above explains why the extra copy is needed)
  • m_group takes a functor (here a lambda) to be run
  • the lambda makes a copy of task
  • the TaskSentry takes ownership of the task, guaranteeing that it will be deleted after it has run (also in case an exception is thrown)
  • the task is executed inside the lambda inside the task_group

The only confusing aspect is that TaskSentry deletes the task once the lambda completes, except when it is a FinalWaitingTask, in which case the destructor of the TaskSentry effectively does nothing.

Collaborator Author

That difference comes from the fact that objects of all task types other than FinalWaitingTask are allocated with new and are intended to be deleted right after their execution has finished. The FinalWaitingTask object is allocated on the stack of the function that initiates the "asynchronous region" and waits on the tbb::task_group; it must be kept alive until all the asynchronous tasks of the region have finished (e.g. to check whether an exception was thrown).

@fwyzard
Contributor

fwyzard commented Feb 15, 2022

With these latest changes, the performance using tbb::task_group is the same as using tbb::task:
[overall performance plot not included]

Zoomed-in version of the CUDA performance:
[plot not included]

Zoomed-in version of the CPU performance:
[plot not included]

Note that these plots were made using the tbb::spin_mutex in the caching allocators.

@fwyzard
Contributor

fwyzard commented Feb 15, 2022

@makortel do you prefer to keep the individual commits or squash them?

@makortel
Collaborator Author

Thanks a lot Andrea for the detailed checks!

do you prefer to keep the individual commits or squash them ?

In general I've been trying to touch only one test program's code per commit, and I'd like to keep that. I'd also like to keep "migrate to tbb::task_group", "TBB update", and "use of deferred tasks" as separate sets of commits. But I see now that I have some spurious "ScopedContext" etc. commits that I intended to squash, so let me fix those.

@makortel
Collaborator Author

But I see now that I have some spurious "ScopedContext" etc commits that I intended to squash, so let me fix those.

Done now.


Successfully merging this pull request may close these issues: TBB removed tbb::task