
feat: MpiWrapper::allReduce overload for arrays. #3446

Merged: 19 commits into develop on Jan 29, 2025

Conversation

@CusiniM (Collaborator) commented Nov 15, 2024

We had a bug because two arrays of different sizes were exchanged by grabbing their buffers directly. This overload prevents that from happening.

I had to add a small helper because LvArray containers expose ValueType instead of value_type like std containers do.
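For illustration only, here is a minimal sketch of what such a value-type helper could look like; the trait and alias names below are assumptions, not necessarily what TypesHelpers.hpp actually defines:

// Hypothetical sketch, not the actual GEOS code: a trait that extracts the
// element type from either a std-style container (value_type) or an
// LvArray-style container (ValueType).
#include <type_traits>

namespace internal
{
// Primary template: used for containers exposing value_type.
template< typename T, typename = void >
struct GetValueType
{
  using type = typename T::value_type;
};

// Partial specialization: selected when the container exposes ValueType,
// as LvArray containers do.
template< typename T >
struct GetValueType< T, std::void_t< typename T::ValueType > >
{
  using type = typename T::ValueType;
};
} // namespace internal

// Convenience alias usable by container-based MpiWrapper overloads.
template< typename T >
using container_value_type = typename internal::GetValueType< T >::type;

With a helper along these lines, a container-based allReduce can deduce the element type (and hence the MPI datatype) uniformly, whichever naming convention the container follows.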

@CusiniM self-assigned this on Nov 15, 2024
@CusiniM added the label flag: no rebaseline (does not require rebaseline) on Nov 15, 2024
@CusiniM (Collaborator, Author) commented Dec 5, 2024

@rrsettgast , @corbett5 , @wrtobin this is ready for review.

Review thread on src/coreComponents/common/TypesHelpers.hpp (outdated, resolved)
@MelReyCG (Contributor) left a comment

At this point, can we drop, or at least make private, the following signature?

template< typename T >
  static int allReduce( T const * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm = MPI_COMM_GEOS );

It looks like all external calls could be translated to Span<T[]>.
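As a sketch of what that could look like (this is not the code in the PR; the assertion macro, names, and exact layout are placeholders), a container-based overload can forward to the pointer-based one, which would then become a private implementation detail:

// Illustrative sketch only; a member of MpiWrapper in spirit, with placeholder
// checks. The pointer-based allReduce it calls would become private.
template< typename CONTAINER_TYPE >
static int allReduce( CONTAINER_TYPE const & src,
                      CONTAINER_TYPE & dst,
                      MPI_Op const op,
                      MPI_Comm const comm = MPI_COMM_GEOS )
{
  // Guard against the size-mismatch bug that motivated this PR
  // (placeholder check; GEOS's actual error macros may differ).
  GEOS_ERROR_IF_NE( src.size(), dst.size() );
  using T = container_value_type< CONTAINER_TYPE >;  // element type via a helper as sketched above
  return allReduce< T >( src.data(),
                         dst.data(),
                         static_cast< int >( src.size() ),
                         op, comm );
}

A call such as MpiWrapper::allReduce( localVec, globalVec, MPI_SUM ) would then carry the size information with the containers, so a caller can no longer pass an inconsistent count.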

@CusiniM (Collaborator, Author) commented Dec 7, 2024

At this point, can we drop, or at least make private, the following signature?

template< typename T >
  static int allReduce( T const * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm = MPI_COMM_GEOS );

It looks like all external calls could be translated to Span<T[]>.

yes, you are right, let me try to do this.

@CusiniM (Collaborator, Author) commented Jan 2, 2025

At this point, can we drop, or at least make private, the following signature?

template< typename T >
  static int allReduce( T const * sendbuf, T * recvbuf, int count, MPI_Op op, MPI_Comm comm = MPI_COMM_GEOS );

It looks like all external calls could be translated to Span<T[]>.

done.

@CusiniM added the labels ci: run CUDA builds (allows triggering the costly CUDA jobs) and ci: run integrated tests (allows running the integrated tests in GEOS CI) on Jan 3, 2025
@CusiniM (Collaborator, Author) commented Jan 3, 2025

@sframba or @acitrain can you please have a quick look at the changes to the wave solver files and approve this?

Comment on lines 1016 to 1018:

MpiWrapper::allReduce( dasReceivers,
                       dasReceivers,
                       MpiWrapper::Reduction::Sum,

@CusiniM (Collaborator, Author) commented

@sframba @acitrain this change triggers these diffs:

INFO: Total number of log files processed: 1537

WARNING: Found unfiltered diff in: /Users/cusini1/Downloads/integratedTests/TestResults/test_data/wavePropagation/elas3D_DAS_smoke_08/1497.python3_elas3D_DAS_smoke_08_2_restartcheck_.log
INFO: Details of diffs:   ********************************************************************************

  Error: /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers

  	Arrays of types float32 and float32 have 402 values of which 200 fail both the relative and absolute tests.

  		Max absolute difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0

  		Max relative difference is at index (np.int64(200), np.int64(1)): value = 0.2, base_value = 0.0

  	Statistics of the q values greater than 1.0 defined by absolute tolerance: N = 200

  		max = 2000.0001, mean = 500.0, std = 646.787

  ********************************************************************************

  [The same diff block for /Problem/Solvers/elasticSolver/dasSignalNp1AtReceivers is repeated six more times in the pasted output.]

The only explanation I can find is that, in the previous code, m_linearDASGeometry.size( 0 ) was not the size of the array. The fact that arrayView2d< real32 > const dasReceivers = m_dasSignalNp1AtReceivers.toView(); makes me wonder why m_linearDASGeometry.size( 0 ) was passed as the size. I am not sure how these two objects are related...

@sframba (Contributor) commented Jan 13, 2025

Hi @CusiniM, I think that is exactly the reason: we have an extra receiver that simply tracks time:

m_dasSignalNp1AtReceivers.resize( m_nsamplesSeismoTrace, numReceiversGlobal + 1 );

This receiver does not need to be summed across ranks. Therefore only the first dasReceivers.size() - 1 (that is, m_linearDASGeometry.size( 0 )) arrays need to be summed, not the last one. Does this break your new structure, or can it be accommodated?

@CusiniM (Collaborator, Author) commented

Yeah, it does break it, in the sense that the overload I put in place for arrays was written with the idea of summing the full array across ranks. Basically, I wanted to ensure that one cannot provide the wrong size (which caused a couple of bugs in the past). We could add an overload that allows the reduce operation to occur on a specified size <= array.size().

@rrsettgast (Member) commented

@sframba Is there a reason that this all has to be in one array? It seems that the array holds different quantity types. If so, can we just create a separate object to hold the last value, which is of a different quantity type?

@sframba (Contributor) commented Jan 20, 2025

@rrsettgast as discussed together, no, there is no reason why this should all be one array. We could (and should) separate out the extra receiver so that the arrays become of homogeneous dimension. However, this requires some changes in our Python code as well. Could we maybe:

  • Implement the overload suggested by @CusiniM for now, so we unlock this PR, then
  • When the "strong unit typing" work starts officially, add an item about wave solvers' receiver uniformity and assign it to us, so we can take the time to tidy up this problem properly?

What do you guys think?

@CusiniM (Collaborator, Author) commented

I have already implemented it here:

void MpiWrapper::allReduce( SRC_CONTAINER_TYPE const & src, DST_CONTAINER_TYPE & dst, int const count, Reduction const op, MPI_Comm const comm )
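For illustration, a rough usage sketch of that overload for the wave-solver case discussed above (all variable names here are made up, not the actual call site): only the first count entries take part in the sum, and the trailing time-tracking entry is left untouched on each rank.

// Hedged sketch with placeholder names; the real wave-solver code differs.
int const numReceivers = 200;                            // placeholder count of physical receivers
std::vector< real32 > localSignal( numReceivers + 1 );   // last entry only tracks time
std::vector< real32 > globalSignal( numReceivers + 1 );

MpiWrapper::allReduce( localSignal,                  // src container
                       globalSignal,                 // dst container
                       numReceivers,                 // count <= localSignal.size()
                       MpiWrapper::Reduction::Sum,
                       MPI_COMM_GEOS );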

@CusiniM (Collaborator, Author) commented

Tests are now passing, so this just needs @sframba's approval.

@CusiniM added the label ci: run code coverage (enables running the code coverage CI jobs) on Jan 13, 2025

codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 58.97436% with 16 lines in your changes missing coverage. Please review.

Project coverage is 56.74%. Comparing base (5bfdb01) to head (8f6aefe).
Report is 1 commit behind head on develop.

Files with missing lines Patch % Lines
...olvers/solidMechanics/SolidMechanicsStatistics.cpp 0.00% 4 Missing ⚠️
src/coreComponents/common/MpiWrapper.hpp 84.61% 2 Missing ⚠️
...onents/physicsSolvers/PhysicsSolverBaseKernels.hpp 33.33% 2 Missing ⚠️
...hysicsSolvers/solidMechanics/SolidMechanicsMPM.cpp 0.00% 2 Missing ⚠️
src/coreComponents/mesh/ParticleManager.cpp 0.00% 1 Missing ⚠️
...nents/physicsSolvers/contact/ContactSolverBase.cpp 0.00% 1 Missing ⚠️
...olvers/contact/SolidMechanicsEmbeddedFractures.cpp 0.00% 1 Missing ⚠️
...hysicsSolvers/multiphysics/HydrofractureSolver.cpp 0.00% 1 Missing ⚠️
...ers/solidMechanics/SolidMechanicsLagrangianFEM.cpp 0.00% 1 Missing ⚠️
...sicsSolvers/surfaceGeneration/SurfaceGenerator.cpp 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3446      +/-   ##
===========================================
- Coverage    56.74%   56.74%   -0.01%     
===========================================
  Files         1169     1169              
  Lines       101538   101551      +13     
===========================================
+ Hits         57615    57622       +7     
- Misses       43923    43929       +6     


@CusiniM (Collaborator, Author) commented Jan 15, 2025

@sframba please have a look now and approve.

@rrsettgast merged commit ac45ee2 into develop on Jan 29, 2025; 27 checks passed. @rrsettgast then deleted the feature/cusini/addMpiWrapperForArrays branch on January 29, 2025 at 06:10.