
Tuned bcast data corruption #1763

Open

nysal opened this issue Jun 7, 2016 · 16 comments
@nysal
Member

nysal commented Jun 7, 2016

There are multiple issues with bcast algorithms when the tasks involved specify different counts and datatypes for a given bcast operation.

One low-hanging fix is:

--- a/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c
+++ b/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c
@@ -252,7 +252,7 @@ int ompi_coll_tuned_bcast_intra_dec_fixed(void *buff, int count,

     /* Handle messages of small and intermediate size, and
        single-element broadcasts */
-    if ((message_size < small_message_size) || (count <= 1)) {
+    if (message_size < small_message_size) {
         /* Binomial without segmentation */
         segsize = 0;
         return  ompi_coll_base_bcast_intra_binomial(buff, count, datatype,

I'm seeing issues with both the pipelined and split binary tree algorithms. This is probably a known issue with the pipelined algorithms in the tuned module. We should start discussing the right approach to solving this.

Here's a reproducer:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define MAXLEN 10000000

int
main(int argc, char **argv)
{
   int root, *out, i, j, k;
   int myself, tasks;
   MPI_Datatype type;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myself);
   MPI_Comm_size(MPI_COMM_WORLD, &tasks);
   MPI_Type_contiguous(10, MPI_INT, &type);
   MPI_Type_commit(&type);

   /* Allocate on the heap because some OS/architectures can't
      allocate this much memory on the stack */
   out = malloc(sizeof(int) * MAXLEN);
   if (out == NULL) {
      printf("Doh!  Rank %d was not able to allocate enough memory.  MPI test aborted!\n", myself);
      exit(1);
   }

   root = tasks - 1;
   for (j = 1000000; j <= MAXLEN; j *= 10) {
      if (myself == root)
         for (i = 0; i < j; i++)
            out[i] = i;
      fprintf(stderr, "bcast %d integers\n", j);
      /* The root broadcasts j MPI_INTs; the other ranks receive the
         same data as j/10 elements of the contiguous derived type.
         The type signatures match, so this is legal MPI. */
      if (myself == root) {
         MPI_Bcast(out, j, MPI_INT, root, MPI_COMM_WORLD);
      } else {
         MPI_Bcast(out, j / 10, type, root, MPI_COMM_WORLD);
      }

      for (k = 0; k < j; k++) {
         if (out[k] != k) {
            printf("bad answer (%d) at index %d of %d (should be %d)\n", out[k], k, j, k);
            exit(1);
         }
      }
   }
   MPI_Barrier(MPI_COMM_WORLD);
   printf("Rank %d exiting successfully\n", myself);
   MPI_Type_free(&type);
   MPI_Finalize();

   /* Clean up, 'cause it's the Right Thing to do */
   free(out);

   return 0;
}
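A typical way to build and run the reproducer (the source filename and rank count here are illustrative, not from the issue):

```shell
# Compile the reproducer with the Open MPI wrapper compiler
mpicc bcast_repro.c -o bcast_repro
# Run with a few ranks; any count > 1 exercises the bcast code path
mpirun -np 4 ./bcast_repro
```

On an affected build, some rank prints a "bad answer ..." line and exits non-zero instead of printing "Rank N exiting successfully".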
@nysal nysal added the bug label Jun 7, 2016
@nysal nysal added this to the v2.0.1 milestone Jun 7, 2016
@ggouaillardet
Contributor

I confirm this is a known issue.
IIRC, the workaround is to force the segment size to zero.
There have been some discussions on how to fix this; one of them involves flattening datatypes.
Generally speaking, that requires an internal copy, which impacts performance.

@bosilca
Member

bosilca commented Jun 8, 2016

Indeed, the problem is that although all processes must provide the same type signature, they can provide different counts, and the collective algorithm selection will be unable to cut the datatype at the expected boundary. Flattening the datatype will generate a packed type that matches the user-provided type on all processes, a type that can easily be split into multiple fragments.

Another possible approach would be to allow the collective to call into the PML at a lower level, where they can provide the convertor and the length to be sent.

@ggouaillardet
Contributor

I like the second option better.

Consider the following academic case:

  • one rank sends n MPI_BYTE
  • the other rank sends one ddt, which is m MPI_BYTE followed by padding and then (n-m) MPI_BYTE
  • m is a prime number

The only way I can think of to send the data in several pieces is to copy it into a buffer (flatten the datatype) first.
This is very inefficient ...

So let's enhance the PML framework to basically

  • send up to the first n bytes of a message
  • resume and send up to n bytes of a previous send

Does that make sense?

@nysal
Member Author

nysal commented Jun 8, 2016

I like the second suggestion better as well. This will require an addition to the PML send interface. Let's discuss what that might look like. Would an interface where we pass a convertor to the PML be sufficient?

@bosilca
Member

bosilca commented Jun 8, 2016

@ggouaillardet I don't think we can make this work the way you describe. My suggestion was that instead of the usual (pointer, datatype, count) we provide a well-formed convertor and a length to the isend/irecv calls. This means that we maintain the same semantics as today (including per-message matching), but we gain more flexibility on what part of the user data becomes a message by itself.

@ggouaillardet the academic case is trivial. The 2 datatypes have exactly the same type signature, so once flattened the 2 datatypes should be identical. In the worst case the flattened datatype will lack a high-level count, but because the signature is the same, the current collective algorithms will work.

@ggouaillardet
Contributor

@bosilca OK for the PML extension you suggested.

Back to the academic case: this is something I never understood ... but I'd like to understand at some point in time ...

Let's create the ddt with

MPI_Type_create_struct(2, {3,1}, {0,4}, MPI_BYTE, &ddt);

and let's have one task call

MPI_Bcast(buf, 4, MPI_BYTE, ...);

and another one call

MPI_Bcast(buf, 1, ddt, ...);

and we want to split the message into two (two-byte) chunks.

That is trivial if 4 * MPI_BYTE is used.
Now, with ddt, the flattened datatype is also 4 * MPI_BYTE.
That is trivial if

MPI_Bcast(buf, 1, ddt, ...);

ends up doing

MPI_Sendrecv(buf, 1, ddt, ..., buf2, 4, MPI_BYTE, ...);
MPI_Bcast(buf2, 4, MPI_BYTE, ...);

but that requires extra memory and a local send/recv, which impacts performance (and sometimes pretty badly).

Is there a way (and if yes, which one) to use flattened datatypes without requiring extra memory and a local send/recv?

@ggouaillardet
Contributor

I made #1789 as a proof of concept.
Only ob1 implements the extended PML interface for now.
The test case now works, except with the openib btl
(I still need to debug some paths in ob1).
Heterogeneous clusters are currently broken.
I pass a size_t* parameter, but I could also pass a size_t, since the non-blocking subroutines do not update this value (yet).

Any thoughts?

@jjhursey
Member

@bosilca You mentioned (in person) that there is an MCA parameter to route around this issue. What was that parameter?

@bosilca
Member

bosilca commented Aug 17, 2016

@jjhursey I said we should provide one ;) Right now you can use the tuned collective dynamic configuration to change the decision function and prevent any pipelined algorithm from being selected. We might want to create a simpler way, by taking advantage of the PVARs.
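As a concrete (unverified) example of that dynamic configuration, the `coll_tuned` MCA parameters can bypass the size-based decision function and pin one non-segmented bcast algorithm; the algorithm number shown is an assumption and should be checked against `ompi_info --param coll tuned --level 9` on your build:

```shell
# Skip the fixed decision function and pick the bcast algorithm by hand;
# algorithm 1 is assumed here to be basic linear (no segmentation, so no
# pipelining across mismatched datatype boundaries).
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_bcast_algorithm 1 \
       -np 4 ./a.out
```

This trades bandwidth for correctness: non-segmented algorithms never need to cut a datatype mid-element, which is exactly the failure mode described in this issue.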

jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 1, 2016
 * If (legal) non-uniform data type signatures are used in ibcast
   then the chosen algorithm may fail on the request, and worst case
   it could produce wrong answers.
 * Add an MCA parameter that, by default, protects the user from this
   scenario. If the user really wants to use it then they have to
   'opt-in' by setting the following parameter to false:
   - `-mca coll_libnbc_ibcast_skip_dt_decision f`
 * Once the following Issues are resolved then this parameter can
   be removed.
   - open-mpi#2256
   - open-mpi#1763
 * Fixes STG Defect #114680

Signed-off-by: Joshua Hursey <[email protected]>
jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 1, 2016
jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 7, 2016
jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 7, 2016
@jjhursey
Member

jjhursey commented Nov 8, 2016

For anyone fixing this issue: make sure to also fix it in libnbc, per Issue #2256. The resolution of that issue implements a workaround, which should be removed once a full fix is available.

artpol84 pushed a commit to artpol84/ompi that referenced this issue Dec 20, 2016
@jsquyres jsquyres modified the milestones: v3.0.0, v2.1.0 Feb 21, 2017
@jsquyres
Member

Per discussion on the 2017-02-21 call, removed the blocker label and moved to v3.0.0.

@hppritcha
Member

Doesn't look like this is going to make it into 3.0.0. Should we move it to a future milestone?

@bosilca
Member

bosilca commented Jul 25, 2017

👍

@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.0 Sep 12, 2017
@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.2 Mar 1, 2018
@bwbarrett bwbarrett modified the milestones: v3.0.2, v3.0.3 Jun 1, 2018
@bwbarrett bwbarrett modified the milestones: v3.0.3, v3.0.4 Oct 29, 2018
@bwbarrett bwbarrett modified the milestones: v3.0.4, v3.0.5 Apr 16, 2019
@bwbarrett bwbarrett modified the milestones: v3.0.5, v3.0.6 Nov 26, 2019
@jsquyres
Member

jsquyres commented Jan 9, 2020

@nysal @bosilca @ggouaillardet @jjhursey Did this get fixed after v3.0.x?

Regardless, unless the fix is trivial, it's unlikely to get fixed in v3.0.x or v3.1.x. Should this be moved to v4.0.x or closed?

@bosilca
Member

bosilca commented Jan 9, 2020

The underlying cause of this issue cannot be fixed efficiently in the current OMPI infrastructure. @ggouaillardet created a proof-of-concept PR, but due to time constraints it never got completed, which means we are stuck with a suboptimal solution: disabling all tuned decisions for collectives where the user provides different counts on the different nodes (implemented by @jjhursey for libnbc).

@jsquyres
Member

jsquyres commented Jan 9, 2020

Ah, that's #1789? Looks like that was closed with no time to work on it.

Ok, let's leave this issue open, but I'll remove the milestone targets.
