WIP: Add support for NVIDIA nvc++ compiler #17000

Draft
wants to merge 3 commits into
base: master

@cgleggett cgleggett commented Nov 20, 2024

This Pull request:

Adds support for NVIDIA's nvc++ compiler

Fixes #16975

Changes or fixes:

  • check that at least version 24.11 of the nvc++ compiler is being used
  • when building cling, nvc++ uses a different syntax than gcc to extract the system include header locations. Instead of -xc++ -E -v followed by a bunch of sed parsing, it uses the -drygccinc flag to produce a colon-separated list of paths.
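
For illustration only, the two changes above could be sketched as follows. The PR itself makes them in the CMake files; this Python version only mirrors the logic. The function names and the sample path string are hypothetical, and the only assumption about the `-drygccinc` output is what the description states: a colon-separated list of paths.

```python
def parse_version(text):
    """Turn a dotted version string such as '24.11' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_supported_nvhpc(version, minimum="24.11"):
    """True if `version` is at least `minimum` (24.11 per the PR description)."""
    return parse_version(version) >= parse_version(minimum)


def split_include_paths(raw):
    """Split a colon-separated path list, dropping empty entries and duplicates."""
    paths = []
    for entry in raw.strip().split(":"):
        if entry and entry not in paths:
            paths.append(entry)
    return paths


# Hypothetical sample; real output depends on the local NVHPC and GCC installs.
sample = "/usr/include/c++/11:/usr/include/c++/11/x86_64-redhat-linux:/usr/include\n"
print(split_include_paths(sample))
print(is_supported_nvhpc("24.9"), is_supported_nvhpc("24.11"), is_supported_nvhpc("25.1"))
```

Comparing integer tuples rather than strings matters here: as plain strings, "24.9" would sort after "24.11".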

This allows clean compilation on ARM (NVIDIA Grace) CPUs.

Test Results

    18 files      18 suites   4d 15h 23m 21s ⏱️
 2 683 tests  2 683 ✅ 0 💤 0 ❌
46 430 runs  46 430 ✅ 0 💤 0 ❌

Results for commit f41386a.

Member

@vgvassilev vgvassilev left a comment

Have you tried to run the test suite for ROOT with this pull request and nvcc? My worry is that we have to execute cuda code in the interpreter through clang and some changes in the interpreter's cmake files seem to go towards getting the nvcc runtime somehow in cling.

@@ -9,7 +9,14 @@
#---------------------------------------------------------------------------------------------------

if(NOT CMAKE_CXX_COMPILER_ID MATCHES "(Apple|)Clang|GNU|Intel|MSVC")
Member

Don't we need a check here for NVHPC too?

Member

As far as I understand this condition filters out supported compilers. As NVHPC is only partially supported, it must enter this condition - note the NOT in front.
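
As a rough illustration of that reading (not part of the PR): CMake's `MATCHES` performs an unanchored regular-expression search, so the guard can be mimicked with Python's `re.search`. This sketch assumes the compiler ID strings that CMake reports (`GNU`, `Clang`, `AppleClang`, `NVHPC`); for a pattern this simple, CMake and Python regex syntax behave the same. The helper name `enters_guard` is hypothetical.

```python
import re

# Pattern copied from the guard in the diff above.
SUPPORTED = re.compile(r"(Apple|)Clang|GNU|Intel|MSVC")


def enters_guard(compiler_id):
    """Mimic if(NOT CMAKE_CXX_COMPILER_ID MATCHES ...): True when nothing matches."""
    return SUPPORTED.search(compiler_id) is None


for cid in ("GNU", "AppleClang", "Clang", "NVHPC"):
    print(cid, enters_guard(cid))
```

Of the four IDs tried, only `NVHPC` fails to match the pattern, so it is the only one that enters the `if(NOT ...)` branch.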

@vgvassilev vgvassilev requested a review from hahnjo November 22, 2024 09:00
@hahnjo
Member

hahnjo commented Nov 22, 2024

@cgleggett so this means the most recent nvc++ is "fixed enough" to build a working LLVM and ROOT? So all dictionaries build and basic operation works, i.e. running some of the basic tutorials?

@cgleggett
Author

What is the recommended way to test a build? I was able to do I/O and read/write ROOT files, but that obviously doesn't cover everything.

@pcanal
Member

pcanal commented Nov 22, 2024

If you would like to run the full test suite, you can configure with:

cmake -Droottest="ON" -Dtesting="ON" .

and run ctest -j something (it takes a while :) ).

You can also start without -Droottest=ON to get just the "unit" tests and tutorials to run.

@cgleggett cgleggett changed the title from "Add support for NVIDIA nvc++ compiler" to "WIP: Add support for NVIDIA nvc++ compiler" Dec 4, 2024
@cgleggett
Author

Building the tests has revealed a few more issues with flag support in nvc++. Marking as WIP until these are addressed.

@hahnjo hahnjo marked this pull request as draft December 5, 2024 16:53
@cgleggett
Author

@pcanal : running the tests showed 413 failures with nvc++. However, making a build on identical source with gcc 13.3 and then running the tests showed 181 failures. Surprisingly, the overlap was only 15 tests which failed in BOTH nvc++ and gcc. So, any suggestions on which tests are critical and I should really look at?
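
One way to see exactly which tests fail in only one build is to compare the two `LastTestsFailed.log` files that ctest writes under `Testing/Temporary/`. The following is a minimal sketch, assuming each line of that file has the form `number:test-name`; the build directory names are hypothetical.

```python
from pathlib import Path


def failed_tests(log_text):
    """Return the set of test names from LastTestsFailed.log-style text."""
    names = set()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the leading test number; keep the name, which may contain ':'.
        _, _, name = line.partition(":")
        names.add(name or line)
    return names


def compare(gcc_log, nvc_log):
    """Split failures into gcc-only, nvc++-only, and common sets."""
    gcc, nvc = failed_tests(gcc_log), failed_tests(nvc_log)
    return {"gcc_only": gcc - nvc, "nvc_only": nvc - gcc, "both": gcc & nvc}


if __name__ == "__main__":
    # Hypothetical build directories; adjust to the local layout.
    gcc_path = Path("bld_gcc/Testing/Temporary/LastTestsFailed.log")
    nvc_path = Path("bld_nvc/Testing/Temporary/LastTestsFailed.log")
    if gcc_path.exists() and nvc_path.exists():
        result = compare(gcc_path.read_text(), nvc_path.read_text())
        for key, names in result.items():
            print(key, len(names))
```

Matching on test names rather than numbers avoids false mismatches if the two builds register a different number of tests.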

@pcanal
Member

pcanal commented Dec 18, 2024

@cgleggett I assume you have the same branch/corresponding commits from the root and roottest repositories. If so, all tests are meant to succeed with gcc. A first step is to ensure that all the tests from cmake -Droottest="OFF" -Dtesting="ON" . do work.

@cgleggett
Author

@pcanal: I'm building the same sources for both compilers, using
cmake -DCMAKE_CXX_STANDARD=17 -Dx11=OFF -Dtbb=OFF -Dopengl=OFF -Dgviz=OFF -Dimt=OFF -Ddavix=OFF -Dvdt=OFF -DCMAKE_CXX_FLAGS="-w" -DCMAKE_C_FLAGS="-w" -Droottest="ON" -Dtesting="ON" ../src_fork

should this produce zero failed tests with gcc13? I was trying to build a smaller library without a lot of the graphics bits.

@pcanal
Member

pcanal commented Dec 18, 2024

should this produce zero failed tests with gcc13?

It definitely should ... albeit the options -Dimt=OFF -Ddavix=OFF -Dvdt=OFF -DCMAKE_CXX_FLAGS="-w" (in particular the first 3) are unusual, and it is plausible (but wrong and thus needing to be fixed) that it would attempt to run tests that depend on these features. I am going to go ahead and try that combination on my end.

@cgleggett
Author

I forked from master at e344b22. The commits in the fork should have no impact on the gcc builds. Should I have forked from a different branch to test, e.g. latest-stable?

Building with just cmake -DCMAKE_CXX_STANDARD=17 -DCMAKE_CXX_FLAGS="-w" -DCMAKE_C_FLAGS="-w" -Droottest="ON" -Dtesting="ON" ../src_fork still produces 180 ctest failures.

@pcanal
Member

pcanal commented Dec 19, 2024

In my case, I see the following failures:

270:gtest-roofit-roofit-vectorisedPDFs-testLandau
682:tutorial-io-tree-tree502_staff
1826:roottest-root-io-prefetching-make

The 682 is due to a missing protection against building without OpenGL. The 270 might be because of the missing VDT. The last one is due (in my case) to the missing Davix ... but seems to work on other nodes without Davix.

Many Python tests will fail if you are missing the pip install of root/requirements.txt and roottest/requirements.txt. Some TMVA tests will fail if you have a missing BLAS library.

@pcanal
Member

pcanal commented Dec 23, 2024

still produces 180 ctest failures.

Can you give me the list? Actually, just send me the full Testing/Temporary/LastTest.log and Testing/Temporary/LastTestsFailed.log

I forked from master at e344b22

You need to ensure that you have a corresponding roottest. Do:

cd ../src_fork/roottest # or cd ../roottest
git checkout c27bdc053c5792e30ab3e55736c1fcad526e991c

@cgleggett
Author

> > still produces 180 ctest failures.
>
> Can you give me the list? Actually, just send me the full Testing/Temporary/LastTest.log Testing/Temporary/LastTestsFailed.log
>
> > I forked from master at e344b22
>
> You need to ensure that you have a corresponding roottest. Do:
>
>     cd ../src_fork/roottest # or cd ../roottest
>     git checkout c27bdc053c5792e30ab3e55736c1fcad526e991c

Picking this up after the break....

I've installed all the requirements in root/requirements.txt and roottest/requirements.txt, checked out the appropriate version of roottest, and rebuilt.

Attached is the list of failed tests, as well as LastTest.log and LastTestsFailed.log for gcc13.3.0. I'll attach the ones for nvc++ in another post.

@cgleggett
Author

cgleggett commented Jan 14, 2025

for gcc 13.3.0:
gcc_fail.log
LastTestsFailed.log
LastTest.log.gz

@pcanal
Member

pcanal commented Jan 25, 2025

There is something very wrong with the Python interactions; a lot of the tests seem to crash badly.
For example, this simple tutorial fails (i.e. 'just' python tutorials/hist/fillrandom.py).
Can you run that under valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$ROOTSYS/etc/valgrind-root-python.supp ?

1153/2438 Testing: tutorial-hist-fillrandom-py
1153/2438 Test: tutorial-hist-fillrandom-py
Command: "/opt/cmake/3.30.5/bin/cmake" "-DCMD=/usr/bin/python3.9^/bld4/home/leggett/sw/src/root/src_fork/tutorials/hist/fillrandom.py" "-DSYS=/bld4/home/leggett/sw/src/root/bld_gcc_full" "-DENV=PATH=/bld4/home/leggett/sw/src/root/bld_gcc_full/bin:/opt/xrootd/82a3fc0e8_gcc133/bin:/opt/cmake/3.30.5/bin:/opt/gcc/13.3.0/bin:/opt/git/2.26.0/bin:/bld4/home/leggett/scripts:/usr/sue/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/puppetlabs/bin:/usr/local/cuda/bin:/bld4/home/leggett/bin:/bld4/home/leggett/bin/scripts:/bld4/home/leggett/.local/bin:/bld4/home/leggett/bin#LD_LIBRARY_PATH=/bld4/home/leggett/sw/src/root/bld_gcc_full/lib:/opt/xrootd/82a3fc0e8_gcc133/lib64:/opt/gcc/13.3.0/lib64:/opt/gcc/13.3.0/lib#ROOTSYS=/bld4/home/leggett/sw/src/root/bld_gcc_full#PYTHONPATH=/bld4/home/leggett/sw/src/root/bld_gcc_full/lib::/bld4/home/leggett/.local/lib/python3.9/site-packages#OMP_NUM_THREADS=1#OPENBLAS_NUM_THREADS=1#MKL_NUM_THREADS=1" "-P" "/bld4/home/leggett/sw/src/root/bld_gcc_full/RootTestDriver.cmake"
Directory: /bld4/home/leggett/sw/src/root/bld_gcc_full/runtutorials
"tutorial-hist-fillrandom-py" start time: Jan 13 18:44 PST
Output:
----------------------------------------------------------
fillrandom: Real Time =   0.63 seconds Cpu Time =   0.50 seconds
 *** Break *** segmentation violation



===========================================================
There was a crash (#7 0x00007f433b44cfae in SigHandler(ESignals) () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libCore.so).
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f43492d89fa in wait4 () from /lib64/libc.so.6
#1  0x00007f434924b243 in do_system () from /lib64/libc.so.6
#2  0x00007f433b450e6a in TUnixSystem::Exec(char const*) () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libCore.so
#3  0x00007f433b45170d in TUnixSystem::StackTrace() () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libCore.so
#4  0x00007f433badfd94 in (anonymous namespace)::do_trace(int) () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libcppyy_backend.so
#5  0x00007f433badfe10 in (anonymous namespace)::TExceptionHandlerImp::HandleException(int) () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libcppyy_backend.so
#6  0x00007f433b4550f9 in TUnixSystem::DispatchSignals(ESignals) () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libCore.so
#7  0x00007f433b44cfae in SigHandler(ESignals) () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libCore.so
#8  0x00007f433b45504f in sighandler(int) () from /bld4/home/leggett/sw/src/root/bld_gcc_full/lib/libCore.so
#9  <signal handler called>
#10 0x00007f4349707c1a in visit_reachable () from /lib64/libpython3.9.so.1.0
#11 0x00007f434970705c in deduce_unreachable () from /lib64/libpython3.9.so.1.0
#12 0x00007f4349706d47 in collect () from /lib64/libpython3.9.so.1.0
#13 0x00007f434978556e in collect_with_callback () from /lib64/libpython3.9.so.1.0
#14 0x00007f43497b9dee in PyGC_Collect () from /lib64/libpython3.9.so.1.0
#15 0x00007f43497b9957 in Py_FinalizeEx () from /lib64/libpython3.9.so.1.0
#16 0x00007f43497ab34d in Py_RunMain () from /lib64/libpython3.9.so.1.0
#17 0x00007f434977ba8d in Py_BytesMain () from /lib64/libpython3.9.so.1.0
#18 0x00007f4349229590 in __libc_start_call_main () from /lib64/libc.so.6
#19 0x00007f4349229640 in __libc_start_main_impl () from /lib64/libc.so.6
#20 0x000055df90e35095 in _start ()
===========================================================

@cgleggett
Author

here's the output of valgrind:

> valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$ROOTSYS/etc/valgrind-root-python.supp  python tutorials/hist/fillrandom.py
==1051246== Memcheck, a memory error detector
==1051246== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1051246== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==1051246== Command: python tutorials/hist/fillrandom.py
==1051246== 
==1051246== Conditional jump or move depends on uninitialised value(s)
==1051246==    at 0x158A3E38: clang::CodeGen::CodeGenModule::SetLLVMFunctionAttributesForDefinition(clang::Decl const*, llvm::Function*) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x158D1498: clang::CodeGen::CodeGenModule::EmitGlobalFunctionDefinition(clang::GlobalDecl, llvm::GlobalValue*) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x158CDDAB: clang::CodeGen::CodeGenModule::EmitGlobalDefinition(clang::GlobalDecl, llvm::GlobalValue*) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x158D666D: clang::CodeGen::CodeGenModule::EmitDeferred() (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x158D783F: clang::CodeGen::CodeGenModule::Release() (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x15794569: clang::CodeGeneratorImpl::HandleTranslationUnit(clang::ASTContext&) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x15605489: cling::IncrementalParser::codeGenTransaction(cling::Transaction*) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x15605764: cling::IncrementalParser::commitTransaction(llvm::PointerIntPair<cling::Transaction*, 2u, cling::IncrementalParser::EParseResult, llvm::PointerLikeTypeTraits<cling::Transaction*>, llvm::PointerIntPairInfo<cling::Transaction*, 2u, llvm::PointerLikeTypeTraits<cling::Transaction*> > >&, bool) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x15608897: cling::IncrementalParser::Compile(llvm::StringRef, cling::CompilationOptions const&) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x155782A9: cling::Interpreter::declare(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::Transaction**) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x1557937F: cling::Interpreter::DeclareCFunction(llvm::StringRef, llvm::StringRef, bool, cling::Transaction*&) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246==    by 0x15579574: cling::Interpreter::compileFunction(llvm::StringRef, llvm::StringRef, bool, bool) (in /bld6/root/bld_gcc_full/lib/libCling.so)
==1051246== 
fillrandom: Real Time =  19.42 seconds Cpu Time =  19.22 seconds
 *** Break *** segmentation violation
#0  0x00000000580e88b0 in ?? ()
#1  0x000000005809f0ec in ?? ()
#2  0x000000005809a843 in ?? ()
#3  0x000000005809c928 in ?? ()
#4  0x00000000580e8a3e in ?? ()
#5  0x0000000000000000 in ?? ()
 *** Break *** segmentation violation
#0  0x00000000580e88b0 in ?? ()
#1  0x000000005809f0ec in ?? ()
#2  0x000000005809a843 in ?? ()
#3  0x000000005809c928 in ?? ()
#4  0x00000000580e8a3e in ?? ()
#5  0x0000000000000000 in ?? ()
==1051246== 
==1051246== HEAP SUMMARY:
==1051246==     in use at exit: 96,031,569 bytes in 198,774 blocks
==1051246==   total heap usage: 1,022,370 allocs, 823,596 frees, 506,501,860 bytes allocated
==1051246== 
==1051246== LEAK SUMMARY:
==1051246==    definitely lost: 256 bytes in 4 blocks
==1051246==    indirectly lost: 2,384 bytes in 37 blocks
==1051246==      possibly lost: 1,866,639 bytes in 16,883 blocks
==1051246==    still reachable: 93,804,968 bytes in 177,055 blocks
==1051246==                       of which reachable via heuristic:
==1051246==                         newarray           : 27,008 bytes in 46 blocks
==1051246==                         multipleinheritance: 3,432 bytes in 8 blocks
==1051246==         suppressed: 357,322 bytes in 4,795 blocks
==1051246== Rerun with --leak-check=full to see details of leaked memory
==1051246== 
==1051246== Use --track-origins=yes to see where uninitialised values come from
==1051246== For lists of detected and suppressed errors, rerun with: -s
==1051246== ERROR SUMMARY: 23 errors from 1 contexts (suppressed: 2145 from 35)

running with gdb doesn't show much:

> (gdb)  where
#0  0x00007ffff7d07c1a in visit_reachable () from /lib64/libpython3.9.so.1.0
#1  0x00007ffff7d0705c in deduce_unreachable () from /lib64/libpython3.9.so.1.0
#2  0x00007ffff7d06d47 in collect () from /lib64/libpython3.9.so.1.0
#3  0x00007ffff7d8556e in collect_with_callback () from /lib64/libpython3.9.so.1.0
#4  0x00007ffff7db9dee in PyGC_Collect () from /lib64/libpython3.9.so.1.0
#5  0x00007ffff7db9957 in Py_FinalizeEx () from /lib64/libpython3.9.so.1.0
#6  0x00007ffff7dab34d in Py_RunMain () from /lib64/libpython3.9.so.1.0
#7  0x00007ffff7d7ba8d in Py_BytesMain () from /lib64/libpython3.9.so.1.0
#8  0x00007ffff7829590 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007ffff7829640 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000555555555095 in _start ()

the nvc++ build runs this test fine

@pcanal
Member

pcanal commented Jan 30, 2025

Hack :( ... okay, let's sidestep this for a round and see what else might be failing.
Can you rebuild, rerun, and re-share the testing log with:
-Dpyroot=OFF

@cgleggett
Author

probably unrelated, but I get this error message when building:

[ 94%] Linking CXX executable TestRModelParserPyTorch
/usr/bin/ld: cannot find -lFALSE

the compile line looks like:

/opt/gcc/13.3.0/bin/g++ -w -Wno-implicit-fallthrough -Wno-noexcept-type -pipe  -Wshadow -Wall -W -Woverloaded-virtual -fsigned-char -pthread  -rdynamic CMakeFiles/TestRModelParserPyTorch.dir/TestRModelParserPyTorch.C.o ../../../core/testsupport/CMakeFiles/TestSupport.dir/src/TestSupport.cxx.o -o TestRModelParserPyTorch  -Wl,-rpath,/bld4/home/leggett/sw/src/root/bld_gcc_nopyroot/lib ../../../lib/libROOTTMVASofie.so ../../../lib/libTMVA.so /usr/lib64/libpython3.9.so ../../../googletest-prefix/src/googletest-build/lib//libgtest_main.a ../../../googletest-prefix/src/googletest-build/lib//libgmock.a ../../../googletest-prefix/src/googletest-build/lib//libgmock_main.a -lFALSE ../../../lib/libMinuit.so ../../../lib/libMLP.so ../../../lib/libTreePlayer.so ../../../lib/libGraf3d.so ../../../lib/libTree.so ../../../lib/libGpad.so ../../../lib/libGraf.so ../../../lib/libHist.so ../../../lib/libMatrix.so ../../../lib/libMathCore.so ../../../lib/libXMLIO.so ../../../lib/libImt.so ../../../lib/libMultiProc.so ../../../lib/libNet.so ../../../lib/libRIO.so ../../../lib/libThread.so ../../../lib/libCore.so ../../../googletest-prefix/src/googletest-build/lib//libgtest.a

there is in fact another one:

/opt/gcc/13.3.0/bin/g++ -w -Wno-implicit-fallthrough -Wno-noexcept-type -pipe  -Wshadow -Wall -W -Woverloaded-virtual -fsigned-char -pthread  -rdynamic CMakeFiles/TestRModelParserKeras.dir/TestRModelParserKeras.C.o ../../../core/testsupport/CMakeFiles/TestSupport.dir/src/TestSupport.cxx.o -o TestRModelParserKeras  -Wl,-rpath,/bld4/home/leggett/sw/src/root/bld_gcc_nopyroot/lib ../../../lib/libPyMVA.so /usr/lib64/libpython3.9.so ../../../googletest-prefix/src/googletest-build/lib//libgtest_main.a ../../../googletest-prefix/src/googletest-build/lib//libgmock.a ../../../googletest-prefix/src/googletest-build/lib//libgmock_main.a -lFALSE ../../../lib/libROOTTMVASofie.so ../../../lib/libTMVA.so ../../../lib/libMinuit.so ../../../lib/libMLP.so ../../../lib/libTreePlayer.so ../../../lib/libGraf3d.so ../../../lib/libTree.so ../../../lib/libGpad.so ../../../lib/libGraf.so ../../../lib/libHist.so ../../../lib/libXMLIO.so ../../../lib/libMatrix.so ../../../lib/libMathCore.so ../../../lib/libImt.so ../../../lib/libMultiProc.so ../../../lib/libNet.so ../../../lib/libRIO.so ../../../lib/libThread.so ../../../lib/libCore.so ../../../googletest-prefix/src/googletest-build/lib//libgtest.a
/usr/bin/ld: cannot find -lFALSE

if I remove the -lFALSE from both of those, the rest compiles fine.

now to try the ctest....

@cgleggett
Author

The tests that failed with -Dpyroot=OFF are

	261 - test-stressgraphics-firefox-skip3d (Failed)
	289 - PyMVA-AdaBoost-Classification (Failed)
	290 - PyMVA-AdaBoost-Multiclass (Failed)
	295 - PyMVA-Keras-Classification (Failed)
	296 - PyMVA-Keras-Regression (Failed)
	297 - PyMVA-Keras-Multiclass (Failed)
	298 - gtest-tmva-pymva-TestRModelParserKeras (Failed)
	306 - pyunittests-tmva-tmva-rbdt-xgboost (Failed)
	439 - tutorial-tmva-TMVA_SOFIE_GNN_Parser (Failed)
	952 - tutorial-tmva-TMVA_CNN_Classification (Failed)
	953 - tutorial-tmva-TMVA_Higgs_Classification (Failed)
	955 - tutorial-tmva-TMVA_SOFIE_GNN_Application (Failed)
	956 - tutorial-tmva-TMVA_SOFIE_Keras (Failed)
	957 - tutorial-tmva-TMVA_SOFIE_Keras_HiggsModel (Failed)
	959 - tutorial-tmva-TMVA_SOFIE_RDataFrame (Failed)
	961 - tutorial-tmva-TMVA_SOFIE_RSofieReader (Failed)

LastTest.log.gz

@pcanal
Member

pcanal commented Jan 31, 2025

/usr/bin/ld: cannot find -lFALSE

Is fixed by #17459

A workaround is to reconfigure (cmake .)

@pcanal
Member

pcanal commented Jan 31, 2025

test-stressgraphics-firefox-skip3d

is minor:

Test 14: TMathText................................................. OK
         PDF output................................................ OK
         JPG output......................................... 14 FAILED
         Result    = 78712
         Reference = 69198
         Error     = 9514 (was 9500)

@couet What can be done here?

@pcanal
Member

pcanal commented Jan 31, 2025

289 - PyMVA-AdaBoost-Classification (Failed)
290 - PyMVA-AdaBoost-Multiclass (Failed)

Are either a missing install or mismatch in version:

sklearn.utils._param_validation.InvalidParameterError: The 'algorithm' parameter of AdaBoostClassifier must be a str among {'SAMME'}. Got 'SAMME.R' instead.
<FATAL>                         : Failed to train classifier

@guitargeek What can be done here?

@pcanal
Member

pcanal commented Jan 31, 2025

295 - PyMVA-Keras-Classification (Failed)
296 - PyMVA-Keras-Regression (Failed)
297 - PyMVA-Keras-Multiclass (Failed)
439 - tutorial-tmva-TMVA_SOFIE_GNN_Parser (Failed)
953 - tutorial-tmva-TMVA_Higgs_Classification (Failed)
956 - tutorial-tmva-TMVA_SOFIE_Keras (Failed)

Are more 'interesting' and may or may not be due to how TensorFlow was built.

/bld4/home/leggett/.local/lib/python3.9/site-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
  super().__init__(activity_regularizer=activity_regularizer, **kwargs)
I0000 00:00:1738342086.794823 1283147 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 224 MB memory:  -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:3b:00.0, compute capability: 8.6
I0000 00:00:1738342086.795992 1283147 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 6388 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 SUPER, pci bus id: 0000:af:00.0, compute capability: 7.5
2025-01-31 08:48:06.827503: I external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1193] failed to allocate 224.62MiB (235536384 bytes) from device: RESOURCE_EXHAUSTED: : CUDA_ERROR_OUT_OF_MEMORY: out of memory
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1738342087.015640 1284532 gpu_backend_lib.cc:579] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  generateKerasModelMulticlass.py.runfiles/cuda_nvcc
  generat/cuda_nvcc
  
  /usr/local/cuda
  /bld4/home/leggett/.local/lib/python3.9/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /bld4/home/leggett/.local/lib/python3.9/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
  /bld4/home/leggett/.local/lib/python3.9/site-packages/tensorflow/python/platform/../../cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
W0000 00:00:1738342087.034522 1284531 gpu_kernel_to_blob_pass.cc:190] Failed to compile generated PTX with ptxas. Falling back to compilation by driver.

However it is not clear what the real error is. It ends with:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'keras._tf_keras.keras.backend' has no attribute 'set_session'
<FATAL>                         : Failed to run python code
***> abort program execution

@pcanal
Member

pcanal commented Jan 31, 2025

298 - gtest-tmva-pymva-TestRModelParserKeras (Failed)
306 - pyunittests-tmva-tmva-rbdt-xgboost (Failed)
439 - tutorial-tmva-TMVA_SOFIE_GNN_Parser (Failed)

Probably due to the python/pyroot issue. Let's ignore those for now.

@pcanal
Member

pcanal commented Jan 31, 2025

The other tests fail because of the prior failures.

@pcanal
Member

pcanal commented Jan 31, 2025

The conclusion is that in addition to the issue with python/pyroot, the other issues are Keras or TensorFlow related.

Were the results #17000 (comment) with nvcc, gcc or both?

@guitargeek
Contributor

> > 289 - PyMVA-AdaBoost-Classification (Failed)
> > 290 - PyMVA-AdaBoost-Multiclass (Failed)
>
> Are either a missing install or mismatch in version:
>
> sklearn.utils._param_validation.InvalidParameterError: The 'algorithm' parameter of AdaBoostClassifier must be a str among {'SAMME'}. Got 'SAMME.R' instead.
> <FATAL>                         : Failed to train classifier
>
> @guitargeek What can be done here?

Did you install the test environment with ROOT's requirements.txt file? Note that there are maximum supported torch and TensorFlow versions.

@cgleggett
Author

I installed python 3.11 and then rebuilt (without the -Dpyroot=off) and re-ran the tests, and only had a single failure: test-stressgraphics-firefox-skip3d

So maybe there's an issue with the default python 3.9 in Alma9?

Successfully merging this pull request may close these issues.

Allow compilation with NVIDIA nvc++
5 participants