Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ch5n_nbo QA test fails from v7.2.0 #864

Open
drew-parsons opened this issue Sep 21, 2023 · 37 comments
Open

ch5n_nbo QA test fails from v7.2.0 #864

drew-parsons opened this issue Sep 21, 2023 · 37 comments

Comments

@drew-parsons
Copy link

drew-parsons commented Sep 21, 2023

Describe the bug
I'm running QA tests on a debian package build of nwchem 7.2.0. All the CI tests that debian runs are passing, except for ch5n_nbo which fails trivially (energy difference 1e-5 Hartree). What's the best way to handle it?

Describe settings used
Building with gcc 13.2.0 with both openmpi 4.1.5 and mpich 4.1.2.
Linux (Debian unstable)

To Reproduce

  1. cd QA
  2. MPIRUN_PATH=/usr/bin/mpirun.openmpi NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.openmpi NWCHEM_TOP=$PWD/.. ./runtests.mpi.unix procs 1 ch5n_nbo
    2b. MPIRUN_PATH=/usr/bin/mpirun.mpich NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.mpich NWCHEM_TOP=$PWD/.. ./runtests.mpi.unix procs 1 ch5n_nbo

Expected behavior
Precision in energy calculations should be reproducible such that tests pass.

Screenshots

$ MPIRUN_PATH=/usr/bin/mpirun.mpich NWCHEM_TARGET=LINUX64 NWCHEM_EXECUTABLE=/usr/bin/nwchem.mpich NWCHEM_TOP=`pwd`/.. ./runtests.mpi.unix procs 1 ch5n_nbo

 
 Running tests/ch5n_nbo/ch5n_nbo 
 
     cleaning scratch
     copying input and verified output files
     running nwchem (/usr/bin/nwchem.mpich)  with 1 processors 
 
     verifying output ... failed
@@@     Comparison of Output Files
@@ -1,3 +1,3 @@
 Effective nuclear repulsion energy (a.u.) 42.05
-Total SCF energy = -94.67944
-Total SCF energy = -94.67944
+Total SCF energy = -94.67945
+Total SCF energy = -94.67945
 
Failed
$ echo $?
1

Additional context
Debian CI tests only run a subset of available tests. A random launch of various other tests shows that ch4-scf-dft-prop.nw also fails, evidently due to Issue #776 (sets cosmo dielectric constant to 78.40 instead of the 3.90 given in the input file). Likewise ch4-dft-scf-prop.

@edoapra
Copy link
Collaborator

edoapra commented Sep 21, 2023

@drew-parsons could you try this patch for the ch5n_nbo failure?

diff --git a/QA/tests/ch5n_nbo/ch5n_nbo.nw b/QA/tests/ch5n_nbo/ch5n_nbo.nw
index 976585de0f..a291436ff4 100644
--- a/QA/tests/ch5n_nbo/ch5n_nbo.nw
+++ b/QA/tests/ch5n_nbo/ch5n_nbo.nw
@@ -23,6 +23,7 @@ geometry
 end
 
 scf
+ direct
  sym off
  adapt off
 end

@edoapra
Copy link
Collaborator

edoapra commented Sep 21, 2023

Additional context Debian CI tests only run a subset of available tests. A random launch of various other tests shows that ch4-scf-dft-prop.nw also fails, evidently due to Issue #776 (sets cosmo dielectric constant to 78.40 instead of the 3.90 given in the input file). Likewise ch4-dft-scf-prop.

ch4-scf-dft-prop and ch4-dft-scf-prop were not part of the default QA script doqmtests.mpi for release 7.2.0 (and unfortunately are not part of it even in the current master ...), therefore they were not thoroughly tested as the test appearing in doqmtests.mpi

@drew-parsons
Copy link
Author

drew-parsons commented Sep 21, 2023

scf

  • direct
    sym off
    adapt off
    end

Good suggestion, adding direct enables the test to pass. The full calculated energy becomes

Total SCF energy =    -94.679444210504

(was -94.679444920261 without direct)
compared to

Total SCF energy =    -94.679444210479

in the reference file, which is within the tolerance required for the QA tests.

@edoapra
Copy link
Collaborator

edoapra commented Sep 21, 2023

Thanks for the feedback.
Unfortunately, this is not a very good fix. The root cause of the failure is in the rounding method used in the NWChem perl scripts. I might be able to get a better fix sometime soon.

@drew-parsons
Copy link
Author

Ok. I'll apply this patch to the debian tests in the meantime.

@edoapra edoapra mentioned this issue Sep 23, 2023
@edoapra
Copy link
Collaborator

edoapra commented Oct 4, 2023

Current master and hotfix/release-7-2-0 have now a more reliable fix to address this issue.

@edoapra
Copy link
Collaborator

edoapra commented Oct 7, 2023

v7.2.1 is now available for download

@drew-parsons
Copy link
Author

Thanks. Debian builds are underway.

@drew-parsons
Copy link
Author

drew-parsons commented Oct 16, 2023

v7.2.1 is now built and available for Debian, https://buildd.debian.org/status/package.php?p=nwchem
The ch5n_nbo test is now passing.

@edoapra
Copy link
Collaborator

edoapra commented Oct 16, 2023

v7.2.1 is now built and available for Debian, https://buildd.debian.org/status/package.php?p=nwchem The ch5n_nbo test is now passing.

Great.
The patch below should fix the loong64 failure

diff --git a/src/config/makefile.h b/src/config/makefile.h
index f6cd4e44e9..3702fbfb1e 100644
--- a/src/config/makefile.h
+++ b/src/config/makefile.h
@@ -2131,6 +2131,9 @@ ifneq ($(TARGET),LINUX)
             CFLAGS_FORGA   = -mabi=64
         endif
 
+        ifeq ($(_CPU),loong64)
+            DONTHAVEM64OPT=Y
+        endif
         ifeq ($(_CPU),riscv64)
             DONTHAVEM64OPT=Y
             COPTIONS   =  -march=rv64gc -mabi=lp64d

@drew-parsons
Copy link
Author

_CPU is uname -m. I think that's the first part of the multiarch triplet. The loong64 log at
https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=loong64&ver=7.2.1-2&stamp=1697224746&raw=0
shows the triplet as loongarch64-linux-gnu (in /usr/lib/loongarch64-linux-gnu/libblas.so etc).

Should the _CPU be loongarch64 rather than loong64 in that case?

I don't have a loong64 machine to log into and check, but can try it one way or the other in the next build upload.

@edoapra
Copy link
Collaborator

edoapra commented Oct 17, 2023

You are right. Thanks for spotting this

I have no name!@durian:/# uname -a
Linux durian 6.2.0-34-generic #34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep  7 13:12:03 UTC 2 loongarch64 GNU/Linux
I have no name!@durian:/# uname -m
loongarch64

edoapra added a commit to edoapra/nwchem that referenced this issue Oct 17, 2023
@drew-parsons
Copy link
Author

loong64 has now built successfully (via loongarch64), https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=loong64&ver=7.2.1-3&stamp=1697658700&raw=0

I've one more general question related to tests. In this issue we got the ch5n_nbo test passing, and it's passing on all arches (apart from some MPI trouble on s390x and ia64[gcc]).

But on some of the less common architectures some of the other tests fail in trivial ways (just a small difference in energy). Pragmatically it might be simplest to just drop selected failing tests on the less common architectures, so CI can pass on the other tests.

Tests with only trivial differences in energy compared to the amd64 reference include

  • armhf, mips64el, hppa: pspw_md
  • s390x, sparc64: cosmo_h3co, h2o_dk
  • sparc64: cosmo_h2o

Tests with more serious failure include

  • armel, armhf, i386, hurd-i386, x32: n2_ccsd (SIGFPE) (must be 32-bit related, potentially a debian build configuration issue)
  • s390x, sparc64: dft_he2+ (large DFT energy difference)
  • s390x: various tests with mpich: Assertion failed looputil.c at line 815: *lengthp > 0
  • hppa, powerpc: auh2o, h2mp2, n2_ccsd, mp2_scs_n2, tddftgrad_ch2o (energy NaN)
  • hppa: tddft_h2o, tddftgrad_n2_uks (A-B is not positive-definite; try TDA)
  • sparc64: cosmo_h3co (SIGBUS, energy NaN)
  • sparc64: band, pspw, pspw_md, pspw_stress (SIGFPE)

A number of tests ran out of heap memory on s390x, I don't think that's a bug.

Test logs are a combination of run-time (CI) testing at https://ci.debian.net/packages/n/nwchem/ and build-time testing at https://buildd.debian.org/status/package.php?p=nwchem (debian build time test failures are currently ignored, the green bands indicate the build succeeded, not necessarily all the tests)

I can start skipping these failing tests in the debian builds for these minor architectures so that we can monitor reliable passing of the other tests (i.e. not ignore test failures). Let me know if you want me to apply any specific patches.

@edoapra
Copy link
Collaborator

edoapra commented Oct 19, 2023

Thanks for the detailed update. I will try to see what I can replicate/reproduce with a generic build on some of the architectures. I have never tried to use s390x ... I am even surprised that the code can build and run.

@edoapra
Copy link
Collaborator

edoapra commented Oct 19, 2023

I am surprised the code works on x32 ... do you get just a single failure on n2_ccsd?

@drew-parsons
Copy link
Author

For sure it means the code is in good condition, that it builds and passes most tests on most minor architectures.

That's right, all the other tests passed on x32, n2_ccsd is the only one that failed. The x32 build log is
https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=x32&ver=7.2.1-3&stamp=1697647806&raw=0

In patch 02_makefile_flags.patch we added LINUX64 alongside LINUX in one of the blocks in config/makefile.h to set some build flags. But we also deactivated LINUXCPU in the same block to avoid chip-specific optimisations (we don't want i786 instructions in the i386 build, for instance).

@edoapra
Copy link
Collaborator

edoapra commented Oct 21, 2023

I see cosmo_h3co mentioned twice for sparc64 ... does it result in small failure or a SIGBUS?

@drew-parsons
Copy link
Author

The sparc64 log is https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=sparc64&ver=7.2.1-3&stamp=1697877511&raw=0

cosmo_h3co fails with the SIGBUS with mpich,

It was cosmo_h2o not cosmo_h3co that fails with the small energy difference,

-Total SCF energy = -76.04274
+Total SCF energy = -76.05245

@drew-parsons
Copy link
Author

alpha was slow to build but has finally declared it wants to join the DONTHAVEM64OPT group.
https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=alpha&ver=7.2.1-3&stamp=1697807798&raw=0

@edoapra
Copy link
Collaborator

edoapra commented Oct 21, 2023

In patch 02_makefile_flags.patch we added LINUX64 alongside LINUX in one of the blocks in config/makefile.h to set some build flags. But we also deactivated LINUXCPU in the same block to avoid chip-specific optimisations (we don't want i786 instructions in the i386 build, for instance).

The following bit of 02_makefile_flags.patch does not seem a good idea to me, since you end up using the 32-bit options for 64-bit linux
In the proper LINUX64 makefile.h part, you can now set USE_HWOPT=n to disable all the native CPU optimizations.

-ifeq ($(TARGET),$(findstring $(TARGET),LINUX CYGNUS CYGWIN))
+ifeq ($(TARGET),$(findstring $(TARGET),LINUX LINUX64 CYGNUS CYGWIN))

@drew-parsons
Copy link
Author

USE_HWOPT will help the generic debian package builds. I'll try it along with removing the LINUX64 patch in the next upload.

edoapra added a commit to edoapra/nwchem that referenced this issue Oct 21, 2023
@edoapra
Copy link
Collaborator

edoapra commented Oct 23, 2023

powerpc architecture: I am not able to reproduce your failures (other than n2_ccsd).
There two difference in my build from the debian ci: I am using the vanilla 7.2.1 makefile.h and tools/GA using the builting ga 5.8.2 (tested both ARMCI_NETWORK=MPI-TS and ARMCI_NETWORK=ARMCI)

@drew-parsons
Copy link
Author

Strange about powerpc. We have an external ga but it's still 5.8.2.

With alpha (build log), it's getting the wrong flags for gcc building peigs,

make[5]: Entering directory '/<<PKGBUILDDIR>>/build-mpich/src/peigs/src/c'
Compiling clustrfix.c...
cc: error: unrecognized command-line option ‘-fast’; did you mean ‘-Ofast’?
cc: error: unrecognized command-line option ‘-tune’; did you mean ‘-mtune=’?
cc: error: unrecognized command-line option ‘-arch’
make[5]: *** [GNUmakefile:281: clustrfix.o] Error 1

That's coming from

peigs_CC += -O2 -fast -tune host -arch host

What's the best way to handle it?

x32 is apparently not happy having to handle LINUXCPU, https://buildd.debian.org/status/fetch.php?pkg=nwchem&arch=x32&ver=7.2.1-4&stamp=1698161945&raw=0

Making libraries in rtdb
Compiling rtdb_seq.c...
In file included from rtdb_seq.c:5:
/usr/include/stdlib.h:26:10: fatal error: bits/libc-header-start.h: No such file or directory
   26 | #include <bits/libc-header-start.h>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~

I'm a bit confused by the problem. nwchem is just including stdlib.h, stdlib.h should know what bits it needs. There's no special handling for LINUXCPU=x32 anyway, the behaviour shouldn't have changed.

@edoapra
Copy link
Collaborator

edoapra commented Oct 24, 2023

Let me have a look at the alpha build.
x32 should not be considered as a viable NWChem architecture. I would rather deprecated.
I am making some progress on the s390x issues. They might be due to the debian shipped OpenBLAS libraries.

@drew-parsons
Copy link
Author

One point about BLAS. Debian policy is to build against the generic BLAS (libblas-dev). libblas.so is ABI compatible so any (optimised) BLAS package can be installed on end-user systems to provide the desired optimised blas implementation (OpenBLAS or one of the others) at runtime. But in the debian CI tests we haven't configured installation of an optimised BLAS package. So the s390x failure is using generic BLAS.

@edoapra
Copy link
Collaborator

edoapra commented Oct 24, 2023

One point about BLAS. Debian policy is to build against the generic BLAS (libblas-dev). libblas.so is ABI compatible so any (optimised) BLAS package can be installed on end-user systems to provide the desired optimised blas implementation (OpenBLAS or one of the others) at runtime. But in the debian CI tests we haven't configured installation of an optimised BLAS package. So the s390x failure is using generic BLAS.

Could you elaborate a bit on how to switch from generic libblas-dev to libopenblas-dev at runtime?

@drew-parsons
Copy link
Author

drew-parsons commented Oct 24, 2023

It just needs libopenblas-dev to be installed, nothing else. Then the install scripts use the debian alternatives mechanism to set a symlink for the preferred libblas.so. The preference can be set manually using update-alternatives, but the default alternative is determined by priority values defined for each implementation in their installation scripts (openblas is higher priority than generic blas).

There is an additional choice about which openblas build. libopenblas-dev will install the pthread build (libopenblas-pthread-dev) if not otherwise specified, but openmp threads (libopenblas-openmp-dev) and no threading (libopenblas-serial-dev) is also available.

64-bit BLAS is also an option (libblas64-dev, etc). We've had a request to provide a longint nwchem build, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=935993 But that's a separate library, libblas64.so not libblas.so.

@edoapra
Copy link
Collaborator

edoapra commented Oct 24, 2023

64-bit BLAS is not enough, since we have Scalapack, too. Scalapack has been ported to use 64-bit integers for a while (this I what I use by default and do strongly recommend). I wish these changes would make it to debian some time soon.
From https://github.com/Reference-ScaLAPACK/scalapack/releases/tag/v2.2.0

Release 2.2.0

New features:

Allow compilation in ILP64 mode, PR Reference-ScaLAPACK/scalapack#19

@edoapra
Copy link
Collaborator

edoapra commented Oct 24, 2023

I am testing the alpha plaform fixes.

@edoapra
Copy link
Collaborator

edoapra commented Oct 25, 2023

My earlier analysis on the s390x just proved to be wrong. It was an issue with 64bit to 32bit conversion.
I am guessing that s390x and sparc64 being big endian architectures are less forgiving for this kind of issues.

@drew-parsons
Copy link
Author

Endianess can be tricky.

As far as scalapack goes, we have a bug request to provide a 64-bit build at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961186

@edoapra
Copy link
Collaborator

edoapra commented Oct 27, 2023

I might craft a 7.2.2 release with all these latest changes

@drew-parsons
Copy link
Author

drew-parsons commented Oct 27, 2023

776068e the change to ALPHA_CPU, should that apply also to CRAY-T3 ?

peigs_CPU = ALPHA

peigs_CPU = ALPHA

@edoapra
Copy link
Collaborator

edoapra commented Oct 27, 2023

I believe there are no Cray-T3D or Cray-T3E powered on at this point in time. They are only present in computer museum.s We should delete all those lines of makefile at some point.
https://cray-cyber.org/old/systems/t3e.php

@edoapra
Copy link
Collaborator

edoapra commented Nov 3, 2023

@drew-parsons
Copy link
Author

Will be interesting to see how long debian keeps it on the build list. Likely debian will have to drop it too if it's no longer in the linux kernel.

@edoapra
Copy link
Collaborator

edoapra commented Nov 4, 2023

NWChem 7.2.2 is out with all the latest fixes to make the build work on big-endian architectures (the fix for Python 3.12 #892 is there, too).
https://github.com/nwchemgit/nwchem/releases/tag/v7.2.2-release
https://github.com/nwchemgit/nwchem/blob/master/release.notes.7.2.2.md

GA 5.8.2 needs to be patched to work on big-endian architectures with the following patches available in the ga develop branch
https://github.com/GlobalArrays/ga/commit/adccda8f2ac107c70eafa2c9c5b6e83e453107a7.patch
https://github.com/GlobalArrays/ga/commit/c833499616ca8fe2bd13c59dd666a404a80f9405.patch

edoapra added a commit to edoapra/nwchem that referenced this issue Apr 3, 2024
edoapra added a commit to edoapra/nwchem that referenced this issue Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants