MPICH Developer Guide

This serves as a README for developers. It provides high-level descriptions of the code architecture and pointers to other notes.

  1. mpi.h
  2. mpiimpl.h
  3. MPI and PMPI functions
  4. MPIR impl functions
  5. Handle Objects
    1. Handle bit-layout
  6. MPID Interface
    1. ch3
    2. ch4
      1. Active Message
      2. Shm
      3. Netmod
      4. RMA
  7. Datatype
    1. Typerep
  8. Collectives
  9. Op
  10. Romio
  11. MPI_T
  12. MPL
  13. CVAR
  14. PM and PMI
  15. Fortran binding
    1. mpif.h
    2. use mpi
    3. use mpi_f08
  16. Test Suite
  17. Build and Debug

mpi.h

The top-most layer is src/include/mpi.h. This is the interface standardized by the MPI Forum and exposed to users. It defines user-level constants, types, and API prototypes.

The types include MPI objects, which we internally refer to as handle objects, such as MPI_Comm, MPI_Datatype, MPI_Request, etc. In MPICH, all handle objects are exposed to users as integer handles. Internally we have a custom allocator system for these objects. How we manage handle objects is a crucial piece in understanding the MPICH architecture. Refer to the dedicated section below.
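
For illustration, handle types in MPICH's mpi.h are simply typedefs of int, with built-in objects encoded as fixed integer constants (the constants below reflect MPICH's current mpi.h and may change between versions):

    typedef int MPI_Comm;
    #define MPI_COMM_WORLD ((MPI_Comm)0x44000000)
    #define MPI_COMM_SELF  ((MPI_Comm)0x44000001)

Note how the constant values themselves encode the handle type and kind bits described in the Handle Objects section below.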

The API prototypes are currently generated by maint/gen_binding_c.py into mpi_proto.h, which is included by mpi.h.

mpi.h is included in mpiimpl.h, which is included by nearly every source file.

Reference:

mpiimpl.h

Nearly all internal source code includes mpiimpl.h, and often only mpiimpl.h. Its all-encompassing nature makes this header quite complex and delicate. The best way to understand it is to refer to the header itself rather than repeat it here. It is crucial for understanding how we organize types and inline functions.

The all-encompassing nature provides the convenience that most source files only need to include this single header. On the other hand, it is also responsible for slow compilation. There is a balance between spending time maintaining headers and spending time waiting for compilation.

Reference:

MPI and PMPI functions

The top-level functions are the MPI functions, e.g. MPI_Send. You will not find these functions directly in our current code base. They are generated by maint/gen_binding_c.py based on src/binding/mpi_standard_api.txt and other meta files. MPI functions handle parameter validation, trivial-case early returns, standard error behaviors, and calling internal implementation routines with the necessary argument transformations. This functionality contains a large amount of boilerplate and thus is well suited for script generation.

PMPI-prefixed function names are used to support the MPI profiling interface. When a user calls an MPI function, e.g. MPI_Send, the symbol may link to a tool or profiling library function, which intercepts the call, does its profiling or analysis, and then calls PMPI_Send. Thus all top-level functions are defined with PMPI_ names. This is why PMPI names often show up in traceback logs. To work when there isn't a tool library (the common case), both PMPI and MPI symbols are defined. If the compiler supports weak symbols, the MPI names are weak symbols that link to the PMPI names. This is how it works on Linux. Without weak symbol support, the top-level functions are compiled twice, once with MPI names and a second time with PMPI names. This is how it works on macOS.
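
For example, with pragma-based weak symbol support, the generated code boils down to a pattern like the following (a simplified sketch; the generated sources choose among several configure-detected variants):

    /* MPI_Send becomes a weak alias of PMPI_Send, so a tool library can
     * interpose its own strong MPI_Send and forward to PMPI_Send. */
    #pragma weak MPI_Send = PMPI_Send

    int PMPI_Send(const void *buf, int count, MPI_Datatype datatype,
                  int dest, int tag, MPI_Comm comm);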

Since this layer is mostly generated, we also refer to it as the binding layer, a term that also applies to the Fortran/CXX bindings.

Reference:

MPIR impl functions

Nearly all MPI functions call MPIR_Xxx_impl functions, except for some trivial functions that we handle directly in the binding layer.

MPIR impl functions use the same parameters as the corresponding MPI function, with some exceptions. First, all handle objects are passed as pointers; e.g. MPI_Comm comm in an MPI function becomes MPIR_Comm * comm_ptr internally. Second, internally we use MPI_Aint as much as we can to avoid hazardous incompatible integer conversions, while MPI functions may use int or MPI_Count. These conversions are all handled by the binding layer.
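
As a rough sketch of this transformation (MPI_Foo and MPIR_Foo_impl are hypothetical placeholder names; the real code is generated and includes full validation and error handling):

    /* Hypothetical sketch of a generated binding-layer function. */
    int MPI_Foo(int count, MPI_Comm comm)
    {
        int mpi_errno = MPI_SUCCESS;
        MPIR_Comm *comm_ptr = NULL;

        MPIR_Comm_get_ptr(comm, comm_ptr);      /* integer handle -> object pointer */
        /* parameter validation and trivial-case early returns omitted */
        mpi_errno = MPIR_Foo_impl((MPI_Aint) count, comm_ptr);  /* int widened to MPI_Aint */
        /* standard error handling omitted */
        return mpi_errno;
    }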

MPI functions are grouped into "chapters" -- referring to the chapters in the MPI specification. The MPIR implementation functions are placed in src/mpi/xxx, where xxx refers to a chapter such as pt2pt, coll, rma, etc.

Reference:

Handle Objects

Handle objects are crucial data structures in MPICH. Currently defined handle objects are here.

Internally, handle objects are defined as structs, e.g. MPIR_Comm. The first two fields of all object structs are:

    int handle;
    Handle_ref_count ref_count;

We use a custom memory allocation system for handle objects. There are 3 tiers:

  1. built-in objects - example
  2. direct objects - example
  3. indirect objects - allocated in slabs

This system incurs no initialization overhead for built-in objects, minimal overhead when only a small number of objects are used, and a managed overhead when a large number of objects are needed.

Handle bit-layout

Use of integer handles provides better stability for bindings (where a pointer type is not guaranteed) and better debuggability, since the handle contains more semantic information than a pointer address. With our handle memory system, an integer handle can be quickly converted to an internal pointer.

Current handle bit layout:

    0 [2]  Handle type (INVALID, BUILTIN, DIRECT, INDIRECT)
    2 [4]  Handle kind (MPIR_COMM, ...)
    6 [26] Block and Index

The 26 bits are further divided into 14 bits for the slab (block) index and 12 bits for the index within the block. For request objects, the 14-bit block index is further divided into 6 bits for request pools and 8 bits for blocks.

Consequently, the maximum number of live objects we can have is limited by handle bits.
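
As a sketch, assuming the fields above are packed starting from the most significant bit of a 32-bit handle (the authoritative macros live in src/include/mpir_objects.h and may differ in names and details), a handle can be decoded as:

    static inline int handle_type(int h)  { return (h >> 30) & 0x3;    }   /* INVALID, BUILTIN, DIRECT, INDIRECT */
    static inline int handle_kind(int h)  { return (h >> 26) & 0xf;    }   /* MPIR_COMM, MPIR_REQUEST, ... */
    static inline int handle_block(int h) { return (h >> 12) & 0x3fff; }   /* 14-bit slab (block) index */
    static inline int handle_index(int h) { return (h) & 0xfff;        }   /* 12-bit index within the block */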

Reference:

MPID Interface

A key design goal of MPICH is to allow downstream vendors to easily create vendor-specific implementations. This is achieved by the Abstract Device Interface (ADI). The ADI is a set of MPID-prefixed functions that implement the functionality of MPI operations. For example, MPID_Send implements MPI_Send. Nearly all MPI functions call their MPID counterparts first, allowing the device layer to either supply full functionality or simply fall back by calling the MPIR implementations.

For performance-critical paths, e.g. the pt2pt and RMA paths, we call the MPID layer directly from the binding layer. This allows a fully inlined build to achieve maximum compiler optimization. The other ADIs are not performance critical, but are provided as hooks -- the MPIR layer calls these hooks at key points -- to allow the device to properly set up and control the implementation behavior.

Note: all pt2pt communication in the ADI is nonblocking -- it only initiates the communication, which is completed during MPID_Progress_wait/test.
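
As an illustration of the nonblocking design, a blocking receive can be built on the ADI roughly as follows (a simplified sketch, not actual MPICH code; error handling is omitted):

    static int blocking_recv_sketch(void *buf, MPI_Aint count, MPI_Datatype datatype,
                                    int source, int tag, MPIR_Comm *comm_ptr,
                                    int context_offset)
    {
        MPIR_Request *rreq = NULL;
        MPID_Progress_state progress_state;

        /* initiation only -- no data may have moved yet */
        MPID_Irecv(buf, count, datatype, source, tag, comm_ptr, context_offset, &rreq);

        /* completion happens inside the progress engine */
        MPID_Progress_start(&progress_state);
        while (!MPIR_Request_is_complete(rreq))
            MPID_Progress_wait(&progress_state);
        MPID_Progress_end(&progress_state);

        MPIR_Request_free(rreq);
        return MPI_SUCCESS;
    }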

Reference:

ch3

Ch3 is currently in maintenance mode. It is still fully supported, since some vendors still base their implementations on ch3.

There are two channels in ch3. ch3:sock is a pure socket implementation. ch3:nemesis adds shared memory communication and also supports network modules (netmod). We currently support ch3:nemesis:tcp and ch3:nemesis:ofi.

ch4

Ch4 is where active research and development currently take place. Many new features, e.g. per-VCI threading, GPU IPC, partitioned communication, etc., are only available in ch4.

Ch4 defines an additional ADI-like interface, commonly referred to as the ch4 API or the shm/netmod API. In most MPID functions, the ch4 layer checks whether the communication is local (i.e. it can be carried out using shared memory) and calls into either the shm API or the netmod API. It is possible to disable shm entirely.
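
Conceptually the dispatch looks like the sketch below (argument lists are elided because they differ across MPICH versions; the locality check is along the lines of MPIDI_rank_is_local):

    /* Conceptual sketch of the ch4-layer locality dispatch (arguments elided). */
    if (MPIDI_rank_is_local(rank, comm)) {
        /* peer is on the same node: take the shared-memory path */
        mpi_errno = MPIDI_SHM_mpi_isend(/* buf, count, datatype, rank, tag, comm, ... */);
    } else {
        /* peer is remote: take the netmod path */
        mpi_errno = MPIDI_NM_mpi_isend(/* buf, count, datatype, rank, tag, comm, ... */);
    }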

The framework for the ch4 API involves a large amount of boilerplate due to the need to support both a fully inlined build and a non-inlined build using function tables. We use scripts to generate most of these API files.

Reference:

Active Message

Similar to MPIR, ch4 provides a default/fallback implementation based on active messages. The ch3 implementation is largely an active message implementation. It is easiest to use an example to explain what active messages are and how they work.

MPID_Send will call either MPIDI_SHM_mpi_isend or MPIDI_NM_mpi_isend based on locality, which may fall back to MPIDIG_mpi_isend. MPIDIG_mpi_isend sends the message along with an AM header to the target process using MPIDI_{SHM,NM}_am_isend. It creates an MPIR_Request object and returns it to the user.

During progress, the receiving process receives the message (inside shm progress or netmod progress), checks the AM header, and calls a registered handler function to process the message. The handler functions are essential pieces of an active message protocol. In the pt2pt eager send case, the handler checks the posted receive buffer list and either copies the message data to the receive buffer (and completes the receive request) or enqueues the message to an unexpected queue.

As a counterpart, MPID_Recv will at some point call MPIDIG_mpi_irecv, which checks the unexpected queue and either copies the data over (and completes the receive request) or enqueues the receive request to a posted queue.

There are complications where we may implement different protocols, such as the rendezvous protocol, IPC, or RDMA, as well as various RMA operations. These protocols are all implemented by handler functions, which in turn may send additional active messages with other handler functions to carry on the protocol.
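
For example, a protocol typically registers its handler functions under a handler id at init time. The sketch below uses the ch4 registration routine MPIDIG_am_reg_cb and the callback signatures listed further down; the handler id and function names are hypothetical:

    /* Hypothetical protocol: the target-side message callback processes an
     * incoming active message, the origin-side callback completes the send. */
    static int my_proto_target_msg_cb(int handler_id, void *am_hdr,
                                      void *data, MPI_Aint data_sz,
                                      int is_local, int is_async, MPIR_Request **req)
    {
        /* examine am_hdr, match or enqueue the message, copy data, ... */
        return MPI_SUCCESS;
    }

    static int my_proto_origin_cb(MPIR_Request *sreq)
    {
        /* sender-side completion: release buffers and complete sreq */
        return MPI_SUCCESS;
    }

    void my_proto_init(void)
    {
        MPIDIG_am_reg_cb(MY_PROTO_HANDLER_ID, my_proto_origin_cb, my_proto_target_msg_cb);
    }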

Reference:

Active message APIs as of MPICH 3.4.2:

/****************** Header and Data Movement APIs ******************/

/* blocking header send */
int MPIDI_[NM|SHM]_am_send_hdr(int rank, MPIR_Comm * comm, int handler_id,
                               const void *am_hdr, MPI_Aint am_hdr_sz)

/* nonblocking header + datatype send */
int MPIDI_[NM|SHM]_am_isend(int rank, MPIR_Comm * comm, int handler_id,
                            const void *am_hdr, MPI_Aint am_hdr_sz,
                            const void *data, MPI_Aint count,
                            MPI_Datatype datatype, MPIR_Request * sreq)

/* nonblocking headers + datatype send */
int MPIDI_[NM|SHM]_am_isendv(int rank, MPIR_Comm * comm, int handler_id,
                             struct iovec *am_hdrs, size_t iov_len,
                             const void *data, MPI_Aint count,
                             MPI_Datatype datatype, MPIR_Request * sreq)

/* blocking header send (callback safe) */
int MPIDI_[NM|SHM]_am_send_hdr_reply(MPIR_Comm * comm, int src_rank,
                                     int handler_id, const void *am_hdr,
                                     MPI_Aint am_hdr_sz)

/* nonblocking header + datatype send (callback safe) */
int MPIDI_[NM|SHM]_am_isend_reply(MPIR_Comm * comm, int src_rank, int handler_id,
                                  const void *am_hdr, MPI_Aint am_hdr_sz,
                                  const void *data, MPI_Aint count,
                                  MPI_Datatype datatype, MPIR_Request * sreq)

/* CTS for pt2pt messages */
int MPIDI_[NM|SHM]_am_recv(MPIR_Request * rreq)

/* largest header (in bytes) supported by the nm/shm */
MPI_Aint MPIDI_[NM|SHM]_am_hdr_max_sz(void)

/* eager size supported by transport (only used internally by nm/shm) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_limit(void)

/* eager buffer size supported by transport, used to assert pipeline protocol will work(?) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_buf_limit(void)

/* return true/false if pt2pt message can be sent eagerly */
bool MPIDI_[NM|SHM]_am_check_eager(MPI_Aint am_hdr_sz, MPI_Aint data_sz,
                                   const void *data, MPI_Aint count,
                                   MPI_Datatype datatype, MPIR_Request * sreq)

/****************** Callback APIs ******************/

/* target-side message callback */
typedef int (*MPIDIG_am_target_msg_cb) (int handler_id, void *am_hdr,
                                        void *data, MPI_Aint data_sz,
                                        int is_local, int is_async, MPIR_Request ** req);

/* target-side completion callback */
typedef int (*MPIDIG_am_target_cmpl_cb) (MPIR_Request * req);

/* origin-side completion callback */
typedef int (*MPIDIG_am_origin_cb) (MPIR_Request * req);

SHM

We currently support shared-memory communication based on POSIX shared memory.

Netmod

We currently support the ch4:ofi and ch4:ucx netmods. A minimal netmod can be implemented by providing just a minimal set of byte-transfer functions, e.g. MPIDI_NM_am_isend, and the corresponding mechanism of calling back handler functions upon receiving the byte payload. To achieve better performance, both the libfabric and UCX libraries provide APIs to directly send messages with tag matching and RDMA operations, which avoid extra data copies and shorten latencies. Thus both netmods have direct implementations of the netmod API that skip active messages.

Both netmods provide direct RMA operations where they can, and fall back to active messages where there are limitations. All window synchronizations are based on active messages.

Collectives

Collectives are implemented in src/mpi/coll/, with separate folders for each collective function.

Using MPI_Bcast as an example, bcast/bcast.c defines 3 functions, mostly boilerplate. The binding layer calls MPIR_Bcast, which calls MPID_Bcast unless the control variables (CVARs) direct otherwise. Typically MPID_Bcast falls back to calling MPIR_Bcast_impl, which checks CVARs and selects a particular algorithm. By default, it calls MPIR_Bcast_allcomm_auto, which automatically chooses the best algorithm based on selection logic defined by a runtime JSON file.
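
A simplified sketch of the CVAR-based selection inside MPIR_Bcast_impl (the CVAR and the binomial algorithm exist in MPICH; most cases and all error handling are omitted here):

    /* Simplified sketch of algorithm selection for an intra-communicator bcast. */
    switch (MPIR_CVAR_BCAST_INTRA_ALGORITHM) {
        case MPIR_CVAR_BCAST_INTRA_ALGORITHM_binomial:
            mpi_errno = MPIR_Bcast_intra_binomial(buffer, count, datatype,
                                                  root, comm_ptr, errflag);
            break;
        case MPIR_CVAR_BCAST_INTRA_ALGORITHM_auto:
        default:
            mpi_errno = MPIR_Bcast_allcomm_auto(buffer, count, datatype,
                                                root, comm_ptr, errflag);
            break;
    }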

Each algorithm is typically implemented in a separate file.

Romio

MPI-IO (MPI_File) APIs are implemented by the Romio library. Romio is implemented entirely on top of other MPI functions and doesn't use MPICH internals. This is similar in a way to how we implement the Fortran bindings.

MPI_T

The MPI tool information interface provides an orthogonal system for controlling and monitoring implementation behavior by means of control variables (CVARs), performance variables (PVARs), and the event callback interface.
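
For example, a user or tool can enumerate MPICH's control variables through the standard MPI_T calls; this standalone program prints the name of every CVAR:

    #include <stdio.h>
    #include <mpi.h>

    /* List the names of all control variables exposed through MPI_T. */
    int main(void)
    {
        int provided, num_cvar;
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_T_cvar_get_num(&num_cvar);
        for (int i = 0; i < num_cvar; i++) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype datatype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &datatype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("%s\n", name);
        }
        MPI_T_finalize();
        return 0;
    }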

MPL

Utilities whose functionality is independent of MPICH internals are implemented inside MPL. All functions from MPL use the MPL_ prefix.

Some of the essential MPICH functionality is provided by MPL. This includes:

  • Debug logging
  • Atomic variables
  • Thread functions
  • Memory tracing
  • GPU support

Sometimes we use an additional MPIR wrapping layer on top of MPL functions to provide additional control. For example, the threading interfaces are wrapped in src/mpid/common/thread/mpidu_thread_fallback.h to allow the device to override the locking mechanism. As another example, the GPU interfaces are wrapped to allow MPIR_CVAR_ENABLE_GPU to bypass GPU functions (when it is set to disabled).
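
A sketch of the GPU example, assuming the MPL query routine MPL_gpu_query_pointer_attr and the attribute type MPL_pointer_attr_t (the real MPIR wrapper differs in details):

    /* Illustrative wrapper, not the actual MPICH code: when GPU support is
     * disabled via MPIR_CVAR_ENABLE_GPU, skip the MPL call and report the
     * buffer as plain host memory. */
    static inline int MPIR_GPU_query_pointer_attr(const void *ptr, MPL_pointer_attr_t *attr)
    {
        if (!MPIR_CVAR_ENABLE_GPU) {
            attr->type = MPL_GPU_POINTER_UNREGISTERED_HOST;
            return MPI_SUCCESS;
        }
        return MPL_gpu_query_pointer_attr(ptr, attr) == 0 ? MPI_SUCCESS : MPI_ERR_OTHER;
    }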

CVAR

Control variables (CVARs) are a mechanism to control library behavior using environment variables. For example, setting MPIR_CVAR_BCAST_INTRA_ALGORITHM=binomial in the environment selects the binomial broadcast algorithm. The MPI tool interface allows control variables to be queried and set in code at runtime as well (however, it is rather clunky and thus not widely used).

Reference:

PM and PMI

References:

Test Suite

We have an extensive test suite in test/mpi/. While it is currently configured together with the mpich configure, it requires mpich to be installed before running make testing.

Tests are controlled by testlist files.

Build and Debug

References: