GCC 11.4; CUDA 11.8; cleanup for cmsplatf/cmsos use #8545

Merged: 5 commits, Jun 20, 2023

Conversation

smuzaffar
Contributor

  • Update GCC to version 11.4.1, along with its prerequisites, and build it with zstd support (changes backported from the GCC 12 branch)
  • Update CUDA to 11.8, which supports GCC 11.4
  • Fixes "cmsdist changes needed to avoid use of cmsplatf" (#7571), i.e. cleans up the usage of cmsplatf and cmsos
  • Add missing dependencies for sqlite, swig, py3-pillow and gnuplot

@cmsbuild
Contributor

A new Pull Request was created by @smuzaffar (Malik Shahzad Muzaffar) for branch IB/CMSSW_13_2_X/master.

@cmsbuild, @smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@smuzaffar
Contributor Author

smuzaffar commented Jun 14, 2023

test parameters:

  • full_cmssw = true
  • enable = gpu

@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33147/summary.html
COMMIT: 97b1049
CMSSW: CMSSW_13_2_X_2023-06-14-1100/el8_amd64_gcc11
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8545/33147/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33147/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33147/git-merge-result

Comparison Summary

Summary:

  • You potentially removed 201 lines from the logs
  • Reco comparison results: 43 differences found in the comparisons
  • DQMHistoTests: Total files compared: 47
  • DQMHistoTests: Total histograms compared: 3194792
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3194761
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 46 files compared)
  • Checked 205 log files, 158 edm output root files, 47 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor Author

Please test

@smuzaffar
Contributor Author

smuzaffar commented Jun 15, 2023

@makortel @fwyzard, this PR updates CUDA to 11.8 and GCC to 11.4 (along with some cleanup). Tests look good. Do you want to run any local tests before we integrate it for 13.2.X (after the 13.2.0.pre2 build)?

@fwyzard
Contributor

fwyzard commented Jun 15, 2023

I'm OK with the changes.

I had already tested CUDA 11.8.0 standalone, and its performance was the same as 12.0 and 12.1. All three are slightly slower than 11.5 in the standalone test, but the CMSSW measurement with 12.x did not show any impact, so it shouldn't be a problem, and from that point of view we can go ahead.

Since the main update in 11.8.0 is the support for the latest GPU generations, I think we should update the cuda-flags.file accordingly:

cmsdist/cuda-flags.file

Lines 4 to 12 in 97b1049

# on X86 and Power, build support for Pascal, Volta and Turing
%ifarch x86_64 ppc64le
%define cuda_arch 60 70 75
%endif
# on ARM, build support for Volta, Xavier SoC and Turing
%ifarch aarch64
%define cuda_arch 70 72 75
%endif

I would simplify it to

# build support for Pascal, Volta, Turing, Ampere, Lovelace and Hopper
%define cuda_arch 60 70 75 80 86 89 90

The new entries target 80 for the A100 and A30, 86 for the A40, A10 and the RTX 3000 series, 89 for the L4, L40 and the RTX 4000 series, and 90 for Hopper GPUs such as the H100.

The main drawback of building for more architectures is that it will take longer and produce slightly larger libraries.

If that's a problem, a more minimal list could be:

# build support for Pascal, Volta, Turing, Ampere, and Hopper
%define cuda_arch 60 70 75 80 90
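
To make the build-time trade-off concrete, here is a minimal sketch of how a `cuda_arch` list could expand into nvcc `-gencode` options. The helper name `gencode_flags` is hypothetical, and the assumption that each entry maps to one `arch=compute_XX,code=sm_XX` pair is mine; the actual expansion done by cmsdist and the CMSSW build rules may differ (for example by also embedding PTX).

```python
# Sketch only: gencode_flags is a hypothetical helper, not part of cmsdist.
def gencode_flags(cuda_arch: str) -> list[str]:
    """Turn a space-separated cuda_arch list into nvcc -gencode options."""
    return [f"-gencode arch=compute_{cc},code=sm_{cc}" for cc in cuda_arch.split()]


full = gencode_flags("60 70 75 80 86 89 90")
minimal = gencode_flags("60 70 75 80 90")

# Each extra architecture adds one more device-code target per .cu file,
# which is where the longer builds and larger libraries come from.
print(len(full), "targets vs", len(minimal))
for flag in full:
    print(flag)
```

Under this assumption, every `.cu` file is compiled once per listed architecture, so the full list does roughly 7/5 of the device-side compilation work of the minimal one.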

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33167/summary.html
COMMIT: 97b1049
CMSSW: CMSSW_13_2_X_2023-06-14-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8545/33167/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33167/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33167/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS
---> test test_edmPickEvents had ERRORS

Comparison Summary

Summary:

  • You potentially removed 47 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 11999 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3196062
  • DQMHistoTests: Total failures: 127384
  • DQMHistoTests: Total nulls: 60
  • DQMHistoTests: Total successes: 3068596
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.328 KiB( 47 files compared)
  • DQMHistoSizes: changed ( 12434.0,... ): -0.164 KiB L1T/L1TStage2uGT
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: found differences in 4 / 46 workflows

GPU Comparison Summary

Summary:

  • You potentially removed 10 lines from the logs
  • Reco comparison results: 244 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 40086
  • DQMHistoTests: Total failures: 22224
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 17862
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 6 edm output root files, 3 DQM output files
  • TriggerResults: found differences in 2 / 2 workflows

@makortel
Contributor

Just to be explicit, I'm ok with the changes as well.

@cmsbuild
Contributor

Pull request #8545 was updated.

@cmsbuild
Contributor

-1

Failed Tests: Build
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33223/summary.html
COMMIT: 55adc5f
CMSSW: CMSSW_13_2_X_2023-06-18-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8545/33223/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33223/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33223/git-merge-result

Build

I found a compilation error when building:

>> Cuda Device Link tmp/el8_amd64_gcc11/src/HeterogeneousCore/CUDAUtilities/test/gpuOneHistoContainer_t/gpuOneHistoContainer_t_cudadlink.o 
>> Building binary gpuOneHistoContainer_t
Copying tmp/el8_amd64_gcc11/src/HeterogeneousCore/CUDAUtilities/test/gpuOneHistoContainer_t/gpuOneHistoContainer_t to productstore area:
>> Compiling  /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_2_X_2023-06-18-2300/src/HeterogeneousCore/CUDAUtilities/test/OneToManyAssoc_t.cu
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
gmake: *** [tmp/el8_amd64_gcc11/src/HeterogeneousCore/CUDAUtilities/test/gpuOneToManyAssocRT_debug/OneToManyAssoc_t.cu.o] Error 1
>> Cuda Device Link tmp/el8_amd64_gcc11/src/HeterogeneousCore/CUDAUtilities/test/gpuOneToManyAssocRT_debug/gpuOneToManyAssocRT_debug_cudadlink.o 
nvlink fatal   : Could not open input file 'tmp/el8_amd64_gcc11/src/HeterogeneousCore/CUDAUtilities/test/gpuOneToManyAssocRT_debug/OneToManyAssoc_t.cu.o' (target: sm_60)
gmake: *** [tmp/el8_amd64_gcc11/src/HeterogeneousCore/CUDAUtilities/test/gpuOneToManyAssocRT_debug/gpuOneToManyAssocRT_debug_cudadlink.o] Error 1
>> Building binary gpuOneToManyAssocRT_debug
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/bin/../lib/gcc/x86_64-redhat-linux-gnu/11.4.1/../../../../x86_64-redhat-linux-gnu/bin/ld.bfd: cannot find tmp/el8_amd64_gcc11/src/HeterogeneousCore/CUDAUtilities/test/gpuOneToManyAssocRT_debug/OneToManyAssoc_t.cu.o: No such file or directory


@smuzaffar
Contributor Author

please test

@cmsbuild
Contributor

Pull request #8545 was updated.

@smuzaffar
Contributor Author

Please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-81a383/33252/summary.html
COMMIT: 1454e3a
CMSSW: CMSSW_13_2_X_2023-06-19-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8545/33252/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 171 lines from the logs
  • Reco comparison results: 45 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3197958
  • DQMHistoTests: Total failures: 11
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3197925
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • You potentially removed 8 lines from the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 40086
  • DQMHistoTests: Total failures: 23
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 40063
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 6 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor Author

DQMHistoTests: Total failures: 11

Mostly in the message logger, but there are a couple of errors for HLT: https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_13_2_X_2023-06-19-2300+81a383/57586/12434.0_TTbar_14TeV+2023/

Reco comparison results: 45 differences found in the comparisons

Mostly due to the message logger, but there are a few in workflows 4.53 and 9.0 that are non-message-logger failures.

@smuzaffar
Contributor Author

+externals

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_13_2_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@makortel
Contributor

DQMHistoTests: Total failures: 11

Mostly in the message logger, but there are a couple of errors for HLT: https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_13_2_X_2023-06-19-2300+81a383/57586/12434.0_TTbar_14TeV+2023/

Those differences look compatible with random differences reported in cms-sw/cmssw#41200

@smuzaffar
Contributor Author

@perrotta @rappoccio this is good to go in. It includes a GCC minor version update and CUDA 11.8. Note that, due to the GCC changes, it rebuilds all the externals. I would like to get this into the 13.2.X IBs now so that we have a couple of weeks of IBs before we cut the last open pre-release.

@perrotta
Contributor

+1

Successfully merging this pull request may close these issues.

cmsdist changes needed to avoid use of cmsplatf
5 participants