
Tuned bcast data corruption #1763

Open

nysal opened this issue Jun 7, 2016 · 16 comments
@nysal
Member

nysal commented Jun 7, 2016

There are multiple issues with bcast algorithms when the tasks involved specify different counts and datatypes for a given bcast operation.

One low-hanging fix is:

--- a/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c
+++ b/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c
@@ -252,7 +252,7 @@ int ompi_coll_tuned_bcast_intra_dec_fixed(void *buff, int count,

     /* Handle messages of small and intermediate size, and
        single-element broadcasts */
-    if ((message_size < small_message_size) || (count <= 1)) {
+    if (message_size < small_message_size) {
         /* Binomial without segmentation */
         segsize = 0;
         return  ompi_coll_base_bcast_intra_binomial(buff, count, datatype,

I'm seeing issues with both the pipelined and split binary tree algorithms. This is probably a known issue with the pipelined algorithms in the tuned module. We should start discussing the right approach to solving this.

Here's a reproducer:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define MAXLEN 10000000

int
main(int argc, char **argv)
{
   int root, *out, i, j, k;
   int myself, tasks;
   MPI_Datatype type;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myself);
   MPI_Comm_size(MPI_COMM_WORLD, &tasks);
   MPI_Type_contiguous(10, MPI_INT, &type);
   MPI_Type_commit(&type);

   /* Allocate on the heap because some OS/architectures can't
      allocate this much memory on the stack */
   out = malloc(sizeof(int) * MAXLEN);
   if (out == NULL) {
      printf("Doh!  Rank %d was not able to allocate enough memory.  MPI test aborted!\n", myself);
      exit(1);
   }

   root = tasks - 1;
   for (j = 1000000; j <= MAXLEN; j *= 10) {
      if (myself == root)
         for (i = 0; i < j; i++)
            out[i] = i;
      fprintf(stderr, "bcast %d integers\n", j);
      /* The root broadcasts j MPI_INTs; the other ranks receive the
         same data as j/10 elements of the contiguous derived type.
         The type signatures match, so this is legal MPI. */
      if (myself == root) {
         MPI_Bcast(out, j, MPI_INT, root, MPI_COMM_WORLD);
      } else {
         MPI_Bcast(out, j / 10, type, root, MPI_COMM_WORLD);
      }

      for (k = 0; k < j; k++) {
         if (out[k] != k) {
            printf("bad answer (%d) at index %d of %d (should be %d)\n", out[k], k, j, k);
            exit(1);
         }
      }
   }
   MPI_Barrier(MPI_COMM_WORLD);
   printf("Rank %d exiting successfully\n", myself);
   MPI_Type_free(&type);
   MPI_Finalize();

   /* Clean up, 'cause it's the Right Thing to do */
   free(out);

   return 0;
}
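A typical way to build and run the reproducer (the source filename and rank count here are illustrative, not from the issue):

```shell
# Compile the reproducer with the Open MPI wrapper compiler
mpicc bcast_repro.c -o bcast_repro
# Run with a few ranks; any count > 1 exercises the bcast code path
mpirun -np 4 ./bcast_repro
```

On an affected build, some rank prints a "bad answer ..." line and exits non-zero instead of printing "Rank N exiting successfully".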
@nysal nysal added the bug label Jun 7, 2016
@nysal nysal added this to the v2.0.1 milestone Jun 7, 2016
@ggouaillardet
Contributor

I confirm this is a known issue.
IIRC, the workaround is to force the segment size to zero.
There have been some discussions on how to fix this; one of them involves flattening datatypes.
Generally speaking, that requires an internal copy, which impacts performance.

@bosilca
Member

bosilca commented Jun 8, 2016

Indeed, the problem is that although all processes must provide the same type signature, they can provide different counts, and the collective algorithm selection will be unable to cut the datatype at the expected boundary. Flattening the datatype will generate a packed type that matches the user-provided type on all processes, a type that can easily be split into multiple fragments.

Another possible approach would be to allow the collective to call into the PML at a lower level, where they can provide the convertor and the length to be sent.

@ggouaillardet
Contributor

I like the second option better.

Consider the following academic case:

  • one rank sends n MPI_BYTE
  • the other rank sends one ddt, which is m MPI_BYTE followed by padding and then (n-m) MPI_BYTE
  • m is a prime number

The only way I can think of to send the data in several pieces is to copy it into a buffer (flatten the datatype) first.
This is very inefficient ...

So let's enhance the PML framework to basically

  • send up to the first n bytes of a message
  • resume and send up to n bytes of a previous send

Does that make sense?

@nysal
Member Author

nysal commented Jun 8, 2016

I like the second suggestion better as well. This will require an addition to the PML send interface. Let's discuss what that might look like. Would an interface where we pass a convertor to the PML be sufficient?

@bosilca
Member

bosilca commented Jun 8, 2016

@ggouaillardet I don't think we can make this work the way you describe. My suggestion was that instead of the usual (pointer, datatype, count) we provide a well-formed convertor and a length to the isend/irecv calls. This means that we maintain the same semantics as today (including per-message matching), but we gain more flexibility on what part of the user data becomes a message by itself.

@ggouaillardet the academic case is trivial. The 2 datatypes have exactly the same type signature, so once flattened the 2 datatypes should be identical. In the worst case the flattened datatype will lack a high-level count, but because the signature is the same, the current collective algorithms will work.

@ggouaillardet
Contributor

@bosilca OK for the PML extension you suggested.

Back to the academic case: this is something I never understood ... but I'd like to understand at some point in time ...

Let's create the ddt with

MPI_Type_create_struct(2, {3,1}, {0,4}, MPI_BYTE, &ddt);

and let's have one task call

MPI_Bcast(buf, 4, MPI_BYTE, ...);

and another one call

MPI_Bcast(buf, 1, ddt, ...);

and we want to split the message into two (two-byte) chunks.

That is trivial if 4 * MPI_BYTE is used.
Now, with ddt, the flattened datatype is also 4 * MPI_BYTE.
That is trivial if

MPI_Bcast(buf, 1, ddt, ...);

ends up doing

MPI_Sendrecv(buf, 1, ddt, ..., buf2, 4, MPI_BYTE, ...);
MPI_Bcast(buf2, 4, MPI_BYTE, ...);

but that requires extra memory and a local send/recv, which impacts performance (and sometimes pretty badly).

Is there a way (and if yes, which one) to use flattened datatypes without requiring extra memory and a local send/recv?

@ggouaillardet
Contributor

I made #1789 as a proof of concept.
Only ob1 implements the extended PML interface for now.
The test case now works, except with the openib btl
(I still need to debug some paths in ob1).
Heterogeneous clusters are currently broken.
I pass a size_t* parameter, but I could also pass a size_t, since the non-blocking subroutines do not update this value (yet).

Any thoughts?

@jjhursey
Member

@bosilca You mentioned (in person) that there is an MCA parameter to route around this issue. What was that parameter?

@bosilca
Member

bosilca commented Aug 17, 2016

@jjhursey I said we should provide one ;) Right now you can use the tuned collective dynamic configuration to change the decision function and prevent any pipelined algorithm from being selected. We might want to create a simpler way, by taking advantage of the PVARs.
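As a concrete (unverified) example of that dynamic configuration, the `coll_tuned` MCA parameters can bypass the size-based decision function and pin one non-segmented bcast algorithm; the algorithm number shown is an assumption and should be checked against `ompi_info --param coll tuned --level 9` on your build:

```shell
# Skip the fixed decision function and pick the bcast algorithm by hand;
# algorithm 1 is assumed here to be basic linear (no segmentation, so no
# pipelining across mismatched datatype boundaries).
mpirun --mca coll_tuned_use_dynamic_rules 1 \
       --mca coll_tuned_bcast_algorithm 1 \
       -np 4 ./a.out
```

This trades bandwidth for correctness: non-segmented algorithms never need to cut a datatype mid-element, which is exactly the failure mode described in this issue.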

jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 1, 2016
 * If (legal) non-uniform data type signatures are used in ibcast
   then the chosen algorithm may fail on the request, and worst case
   it could produce wrong answers.
 * Add an MCA parameter that, by default, protects the user from this
   scenario. If the user really wants to use it then they have to
   'opt-in' by setting the following parameter to false:
   - `-mca coll_libnbc_ibcast_skip_dt_decision f`
 * Once the following Issues are resolved then this parameter can
   be removed.
   - open-mpi#2256
   - open-mpi#1763
 * Fixes STG Defect #114680

Signed-off-by: Joshua Hursey <[email protected]>
jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 1, 2016
jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 7, 2016
jjhursey added a commit to jjhursey/ompi that referenced this issue Nov 7, 2016
@jjhursey
Member

jjhursey commented Nov 8, 2016

For anyone fixing this issue: make sure to also fix it in libnbc, per Issue #2256. The resolution of that issue implements a workaround, which should be removed once a full fix is available.

artpol84 pushed a commit to artpol84/ompi that referenced this issue Dec 20, 2016
@jsquyres jsquyres modified the milestones: v3.0.0, v2.1.0 Feb 21, 2017
@jsquyres
Member

Per discussion on the 2017-02-21 call, removed the blocker label and moved to v3.0.0.

@hppritcha
Member

Doesn't look like this is going to make it into 3.0.0. Should we move it to a future milestone?

@bosilca
Member

bosilca commented Jul 25, 2017

👍

@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.0 Sep 12, 2017
@bwbarrett bwbarrett modified the milestones: v3.0.1, v3.0.2 Mar 1, 2018
@bwbarrett bwbarrett modified the milestones: v3.0.2, v3.0.3 Jun 1, 2018
@bwbarrett bwbarrett modified the milestones: v3.0.3, v3.0.4 Oct 29, 2018
@bwbarrett bwbarrett modified the milestones: v3.0.4, v3.0.5 Apr 16, 2019
@bwbarrett bwbarrett modified the milestones: v3.0.5, v3.0.6 Nov 26, 2019
@jsquyres
Member

jsquyres commented Jan 9, 2020

@nysal @bosilca @ggouaillardet @jjhursey Did this get fixed after v3.0.x?

Regardless, unless the fix is trivial, it's unlikely to get fixed in v3.0.x or v3.1.x. Should this be moved to v4.0.x or closed?

@bosilca
Member

bosilca commented Jan 9, 2020

The underlying cause of this issue cannot be fixed efficiently in the current OMPI infrastructure. @ggouaillardet created a proof-of-concept PR, but due to time constraints it never got completed, which means we are stuck with a suboptimal solution: disabling all tuned decisions for collectives where the user provides different counts on the different nodes (implemented by @jjhursey for libnbc).

@jsquyres
Member

jsquyres commented Jan 9, 2020

Ah, that's #1789? Looks like that was closed with no time to work on it.

Ok, let's leave this issue open, but I'll remove the milestone targets.
