Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PFClusterSoAProducer to read a device collection #46887

Merged

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Dec 6, 2024

PR description:

Fix PFClusterSoAProducer to read a device collection instead of a host collection, when running on a GPU backend.

Make the PFRecHitSoAProducer produce an additional host-only collection with the number of PF rechits. The PFClusterSoAProducer can then consume the device collection of PF rechits, and the host collection with the number of PF rechits.

PR validation:

Tested that the 2024 HLT menu runs.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

May be backported to 14.2.x or earlier if there is interest.

Make the PFRecHitSoAProducer produce an additional host-only collection
with the number of PF rechits.
Make the the PFClusterSoAProducer consume the device collection of PF
rechits, and the host collection with the number of PF rechits.
@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2024

This is an alternative fix to #46830.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2024

type bugfix

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2024

cms-bot internal usage

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2024

enable gpu

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2024

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46887/42922

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2024

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • RecoParticleFlow/PFClusterProducer (reconstruction)
  • RecoParticleFlow/PFRecHitProducer (reconstruction)

@jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@felicepantaleo, @hatakeyamak, @lgray, @missirol, @mmarionncern, @rovere, @sameasy, @seemasharmafnal this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented Dec 6, 2024

+1

Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ad32cd/43293/summary.html
COMMIT: 58094c1
CMSSW: CMSSW_15_0_X_2024-12-05-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46887/43293/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3494042
  • DQMHistoTests: Total failures: 461
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3493561
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 45 files compared)
  • Checked 202 log files, 172 edm output root files, 46 DQM output files
  • TriggerResults: found differences in 1 / 44 workflows

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53058
  • DQMHistoTests: Total failures: 28
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 53030
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

void produce(device::Event& event, const device::EventSetup& setup) override {
event.emplace(pfRecHitsToken_, std::move(*pfRecHits_));
event.emplace(sizeToken_, *size_);
pfRecHits_.reset();

if (synchronise_)
alpaka::wait(event.queue());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment of the synchronise configuration parameter says

Add synchronisation point after execution (for benchmarking asynchronous execution)

so maybe this code should be moved to the end of acquire()? (as no asynchronous activities are launched from the produce())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhm, ok.

@makortel
Copy link
Contributor

makortel commented Dec 6, 2024

From a technical (and performance) perspective I think this PR is better than #46830 (and is probably good enough for the immediate needs). For long term general pattern I'm concerned of the storing the number of filled elements separately from the PortableDeviceCollection.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 6, 2024

Next improvements I'd like to try to implement and measure are

  • filling the size in pinned host memory directly from an existing kernel, so as to avoid the copy (similar to PFClusterSoAProducer);
  • trying to extend the SoA to hold columns or scalars in pinned host memory in addition to global gpu memory.

The latter will require a bit of work on the SoA infrastructure, though, so it will take some time to happen.

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 9, 2024

Here is the impact of this PR on the HCAL+PF reconstruction, measured on a machine with 2× AMD Bergamo CPUs and 4× NVIDIA L4 GPUs.

baseline

Running 4 times over 20500 events with 16 jobs, each with 32 threads, 24 streams, and 1 GPUs
  7717.3 ±   0.2 ev/s (20000 events, 96.5% overlap),   7708.2 ±   0.2 ev/s (⩾ 17570 events, overlap-only)
  7731.3 ±   0.2 ev/s (20000 events, 96.6% overlap),   7726.6 ±   0.2 ev/s (⩾ 17680 events, overlap-only)
  7744.3 ±   0.2 ev/s (20000 events, 97.2% overlap),   7738.5 ±   0.2 ev/s (⩾ 17880 events, overlap-only)
  7737.4 ±   0.2 ev/s (20000 events, 96.9% overlap),   7730.5 ±   0.2 ev/s (⩾ 17690 events, overlap-only)
 --------------------
  7732.6 ±  11.5 ev/s,   7725.9 ±  12.8 ev/s (⩾ 17570 events, overlap-only)

with this PR

Running 4 times over 20500 events with 16 jobs, each with 32 threads, 24 streams, and 1 GPUs
  8570.7 ±   0.1 ev/s (20000 events, 98.6% overlap),   8569.5 ±   0.1 ev/s (⩾ 19170 events, overlap-only)
  8554.2 ±   0.1 ev/s (20000 events, 98.2% overlap),   8552.7 ±   0.1 ev/s (⩾ 18960 events, overlap-only)
  8565.4 ±   0.1 ev/s (20000 events, 98.4% overlap),   8564.1 ±   0.1 ev/s (⩾ 19100 events, overlap-only)
  8569.5 ±   0.2 ev/s (20000 events, 98.4% overlap),   8569.2 ±   0.2 ev/s (⩾ 19140 events, overlap-only)
 --------------------
  8564.9 ±   7.5 ev/s,   8563.9 ±   7.9 ev/s (⩾ 18960 events, overlap-only)

i.e. a > 10% speed up.

@jfernan2
Copy link
Contributor

Next improvements I'd like to try to implement and measure are

* filling the size in pinned host memory directly from an existing kernel, so as to avoid the copy (similar to `PFClusterSoAProducer`);

* trying to extend the SoA to hold columns or scalars in pinned host memory _in addition_ to global gpu memory.

The latter will require a bit of work on the SoA infrastructure, though, so it will take some time to happen.

@fwyzard for the first improvement, do you plan to include it already in this PR or later in another PR? Thanks

@fwyzard
Copy link
Contributor Author

fwyzard commented Dec 11, 2024

for the first improvement, do you plan to include it already in this PR or later in another PR?

I cannot work on it this week, so it could go in a separate PR.

@jfernan2
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @rappoccio, @mandrenguyen, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 16f5cb4 into cms-sw:master Dec 11, 2024
14 checks passed
@fwyzard fwyzard deleted the fix_PFClusterSoAProducer_input_collection_alt branch January 21, 2025 16:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants