This serves as a README for developers. It provides high-level descriptions of the code architecture and provides pointers to other notes.
- mpi.h
- mpiimpl.h
- MPI and PMPI functions
- MPIR impl functions
- Handle Objects
- Handle bits layout
- MPID Interface
- Datatype
- Typerep
- Collectives
- Op
- Romio
- MPI_T
- MPL
- CVAR
- PM and PMI
- Fortran binding
- mpif.h
- use mpi
- use mpi_f08
- Test Suite
- Build and Debug
The top-most layer is src/include/mpi.h. This is the interface standardized by the MPI Forum and exposed to users. It defines user-level constants, types, and API prototypes.
The types include MPI objects, which we internally refer to as handle objects, such as MPI_Comm, MPI_Datatype, MPI_Request, etc. In MPICH, all handle objects are exposed to users as integer handles. Internally we have a custom allocator system for these objects. How we manage handle objects is a crucial piece in understanding the MPICH architecture; refer to the dedicated section below.
The API prototypes are currently generated by maint/gen_binding_c.py into mpi_proto.h, which gets included in mpi.h. mpi.h is included in mpiimpl.h, which is included by nearly every source file.
Nearly all internal source code includes mpiimpl.h, and often only mpiimpl.h. The all-encompassing nature makes this header quite complex and delicate. The best way to understand it is to refer to the header itself rather than repeat it here. It is crucial for understanding how we organize types and inline functions.
The all-encompassing nature provides the convenience that most source files only need to include this single header. On the other hand, it is also responsible for slow compilation. There is a balance between time spent maintaining headers and time spent waiting for compilations.
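As a minimal illustration (the file and function names here are hypothetical), a typical internal source file includes only this one header:
/* hypothetical internal source file, e.g. src/mpi/foo/foo_util.c */
#include "mpiimpl.h"    /* brings in MPIR types, handle objects, the MPID interface, CVARs, ... */

int MPIR_Foo_helper(MPIR_Comm * comm_ptr, int tag)
{
    int mpi_errno = MPI_SUCCESS;
    /* ... implementation using types and inline functions from mpiimpl.h ... */
    return mpi_errno;
}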
The top-level functions are the MPI functions, e.g. MPI_Send. You will not find these functions directly in our current code base. They are generated by maint/gen_binding_c.py based on src/binding/mpi_standard_api.txt and other meta files. MPI functions handle parameter validation, trivial-case early returns, standard error behaviors, and calling internal implementation routines with the necessary argument transformations. This functionality contains a large amount of boilerplate and thus is well suited for script generation.
PMPI-prefixed function names are used to support the MPI profiling interface. When a user calls an MPI function, e.g. MPI_Send, the symbol may link to a tool or profiling library function, which intercepts the call, does its profiling or analysis, and then calls PMPI_Send. Thus all top-level functions are defined with PMPI_ names. This is why PMPI names often show up in traceback logs. To work when there isn't a tool library (the common case), both PMPI and MPI symbols are defined. If the compiler supports weak symbols, MPI names are weak symbols that link to PMPI names. This is what we do on Linux. Without weak symbol support, the top-level functions are compiled twice, once with MPI names and a second time with PMPI names. This is how it works on macOS.
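As a sketch of the weak-symbol scheme (illustrative only; the generated code handles more cases and the exact mechanism depends on the toolchain):
#include "mpi.h"

/* With weak-symbol support, MPI_Send is a weak alias of PMPI_Send. A tool
 * library can interpose its own strong MPI_Send and call PMPI_Send itself. */
#pragma weak MPI_Send = PMPI_Send

int PMPI_Send(const void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm)
{
    int mpi_errno = MPI_SUCCESS;
    /* ... validation and call into the internal implementation (elided) ... */
    return mpi_errno;
}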
Since this layer is mostly generated, we also refer to it as the binding layer, a term that also applies to the Fortran/CXX bindings.
Nearly all MPI functions call MPIR_Xxx_impl functions, except for some trivial functions that we handle directly in the binding layer.
MPIR impl functions use the same parameters as the corresponding MPI function, with some exceptions. For one, all handle objects are passed as pointers; e.g. MPI_Comm comm in the MPI function becomes MPIR_Comm * comm_ptr internally. For two, internally we use MPI_Aint as much as we can to avoid hazardous incompatible integer conversions, while MPI functions may use int or MPI_Count. These conversions are all handled by the binding layer.
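As a rough sketch of what the binding layer generates (simplified; MPI_Foo and MPIR_Foo_impl are hypothetical, and real generated code includes full error checking and error handling):
int MPI_Foo(void *buf, int count, MPI_Comm comm)
{
    int mpi_errno = MPI_SUCCESS;
    MPIR_Comm *comm_ptr = NULL;

    /* parameter validation and trivial-case early return (elided) */

    /* convert the integer handle into the internal object pointer */
    MPIR_Comm_get_ptr(comm, comm_ptr);

    /* widen the int count to MPI_Aint before calling the impl routine */
    mpi_errno = MPIR_Foo_impl(buf, (MPI_Aint) count, comm_ptr);

    /* standard error handling (elided) */
    return mpi_errno;
}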
MPI functions are grouped in "chapters" -- referring to the chapters in the MPI specification. The MPIR implementation functions are placed in src/mpi/xxx, where xxx refers to a chapter such as pt2pt, coll, rma, etc.
Handle objects are crucial data structures in MPICH. Currently defined handle objects are here.
Internally, handle objects are defined as structs, e.g. MPIR_Comm. The first two fields of all object structs are:
int handle;
Handle_ref_count ref_count;
We use a custom memory allocation system for handle objects. There are 3 tiers:
- built-in objects
- direct objects
- indirect objects - allocated by slabs
This system allows no initialization or overhead for built-in objects, minimal overhead when only a small number of objects are used, and managed overhead when a large number of objects are needed.
Use of integer handles provides better stability for bindings (where a pointer type is not guaranteed) and better debuggability, since the handle contains more semantic information than a pointer address. With our handle memory system, an integer handle can be quickly converted to an internal pointer.
Current handle bits layout:
0 [2] Handle type (INVALID, BUILTIN, DIRECT, INDIRECT)
2 [4] Handle kind (MPIR_COMM, ...)
6 [26] Block and Index
The 26 index bits are further divided into 14 bits for the slab (block) index and 12 bits for the index within the block. For request objects, the 14-bit block index is further divided into 6 bits for request pools and 8 bits for blocks.
Consequently, the maximum number of live objects we can have is limited by the handle bits.
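A minimal sketch of decoding this layout (the macro names here are illustrative; see src/include/mpir_objects.h for the real macros):
/* illustrative decoding of a 32-bit handle, following the layout above */
#define HANDLE_TYPE(h)       (((h) >> 30) & 0x3)      /* INVALID, BUILTIN, DIRECT, INDIRECT */
#define HANDLE_KIND(h)       (((h) >> 26) & 0xf)      /* MPIR_COMM, MPIR_DATATYPE, ...      */
#define HANDLE_BLOCK(h)      (((h) >> 12) & 0x3fff)   /* 14-bit slab (block) index          */
#define HANDLE_BLOCK_IDX(h)  ((h) & 0xfff)            /* 12-bit index within the block      */
For example, MPI_COMM_WORLD is defined in mpi.h as 0x44000000, which decodes as a built-in handle of the communicator kind.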
A key design goal of MPICH is to allow downstream vendors to easily create vendor-specific implementations. This is achieved by the Abstract Device Interface (ADI). The ADI is a set of MPID-prefixed functions that implement the functionality of MPI operations. For example, MPID_Send implements MPI_Send. Nearly all MPI functions will call their MPID counterparts first, allowing the device layer to either supply full functionality or simply fall back by calling the MPIR implementations.
For performance-critical paths, e.g. the pt2pt and rma paths, we call the MPID layer directly from the binding layer. This allows a fully inlined build to achieve maximum compiler optimization. The other ADIs are not performance critical, but are provided as hooks -- the MPIR layer calls these hooks at key points -- to allow the device to properly set up and control the implementation behavior.
Note: all pt2pt communication in the ADI is nonblocking -- the ADI call only initiates the communication, which is completed during MPID_Progress_wait/test.
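For example, a blocking operation at the MPIR level is conceptually the nonblocking ADI call plus progress. The sketch below uses simplified signatures, no error handling, and a hypothetical wrapper name:
int example_blocking_send(const void *buf, MPI_Aint count, MPI_Datatype datatype,
                          int dest, int tag, MPIR_Comm * comm_ptr, int context_offset)
{
    int mpi_errno = MPI_SUCCESS;
    MPIR_Request *request_ptr = NULL;

    /* initiate the send; the ADI call never blocks */
    mpi_errno = MPID_Send(buf, count, datatype, dest, tag, comm_ptr,
                          context_offset, &request_ptr);

    /* the operation completes inside the progress engine */
    if (request_ptr != NULL) {
        MPID_Progress_state state;
        MPID_Progress_start(&state);
        while (!MPIR_Request_is_complete(request_ptr)) {
            mpi_errno = MPID_Progress_wait(&state);
        }
        MPID_Progress_end(&state);
        MPIR_Request_free(request_ptr);
    }
    return mpi_errno;
}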
Ch3 is currently in maintenance mode. It is still fully supported since some vendors still base their implementations on ch3.
There are two channels in ch3. ch3:sock is a pure socket implementation. ch3:nemesis adds shared memory communication and also supports network modules (netmod). We currently support ch3:nemesis:tcp and ch3:nemesis:ofi.
Ch4 is where active research and development currently take place. Many new features, e.g. per-VCI threading, GPU IPC, partitioned communication, etc., are only available in ch4.
Ch4 defines an additional ADI-like interface, commonly referred to as the ch4 API or the shm/netmod API. In most MPID functions, the ch4 layer checks whether the communication is local (i.e. can be carried out using shared memory) and calls into either the shm API or the netmod API, as sketched below. It is possible to disable shm entirely.
The framework for the ch4 API involves a large amount of boilerplate due to the need to allow both fully inlined builds and non-inlined builds using function tables. We use scripts to generate most of these API files.
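A conceptual sketch of that dispatch (names and signatures are approximate; example_isend_dispatch and rank_is_local are placeholders, not the real ch4 functions):
static int example_isend_dispatch(const void *buf, MPI_Aint count, MPI_Datatype datatype,
                                  int rank, int tag, MPIR_Comm * comm,
                                  int context_offset, MPIR_Request ** request)
{
    int mpi_errno;
    if (rank_is_local(rank, comm)) {
        /* peer is on the same node: use the shared memory (shm) path */
        mpi_errno = MPIDI_SHM_mpi_isend(buf, count, datatype, rank, tag,
                                        comm, context_offset, request);
    } else {
        /* peer is remote: use the network module (netmod) path */
        mpi_errno = MPIDI_NM_mpi_isend(buf, count, datatype, rank, tag,
                                       comm, context_offset, request);
    }
    return mpi_errno;
}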
Reference:
- Notes on ch4 api autogen
- Notes on ch4 inline mechanism
- Notes on ch4 namespace convention
- ch4_api.txt
- request.md
Similar to MPIR, ch4 provides a default/fallback implementation based on active messages. The ch3 implementation is largely an active message implementation. It is easier to use an example to explain what an active message is and how it works.
MPID_Send will call either MPIDI_SHM_mpi_isend or MPIDI_NM_mpi_isend based on locality, which may fall back to MPIDIG_mpi_isend. MPIDIG_mpi_isend will send the message, along with an am header, to the target process using MPIDI_{SHM,NM}_am_isend. It creates an MPIR_Request object and returns it to the user.
During progress, the receiving process receives the message (inside shm or netmod progress), checks the am header, and calls a registered handler function to process the message. The handler functions are essential pieces of an active message protocol. In the pt2pt eager send case, the handler checks the posted receive buffer list and either copies the message data to the receive buffer (and completes the receive request) or enqueues the message to an unexpected queue.
As the counterpart, MPID_Recv will at some point call MPIDIG_mpi_irecv, which checks the unexpected queue and either copies the data over (and completes the receive request) or enqueues the receive request to a posted queue.
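The matching on the receive side can be sketched as follows (all helper names are placeholders, not the actual MPIDIG functions):
/* conceptual sketch of the MPIDIG receive-side matching logic */
static int example_irecv_matching(void *buf, MPI_Aint count, MPI_Datatype datatype,
                                  int rank, int tag, MPIR_Comm * comm, MPIR_Request * rreq)
{
    MPIR_Request *msg = dequeue_matching_unexpected(rank, tag, comm);
    if (msg != NULL) {
        /* the message already arrived: copy the payload and complete the receive */
        copy_to_user_buffer(msg, buf, count, datatype);
        complete_request(rreq);
    } else {
        /* no match yet: post the receive so the target callback can match it later */
        enqueue_posted(rreq, rank, tag, comm);
    }
    return MPI_SUCCESS;
}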
There are complications where we may implement different protocols, such as the rendezvous protocol, IPC, or RDMA, as well as various RMA operations. These protocols are all implemented by handler functions, which in turn may send additional active messages with other handler functions to carry on the protocol.
Reference:
- src/mpid/ch4/src/mpidig_send.h
- src/mpid/ch4/src/mpidig_recv.h
- src/mpid/ch4/src/mpidig_pt2pt_callbacks.h
- src/mpid/ch4/src/mpidig_recvq.h
- src/mpid/ch4/src/mpidig_rma.h
- src/mpid/ch4/src/mpidig_rma_callbacks.h
Active message APIs as of MPICH 3.4.2:
/****************** Header and Data Movement APIs ******************/
/* blocking header send */
int MPIDI_[NM|SHM]_am_send_hdr(int rank, MPIR_Comm * comm, int handler_id,
const void *am_hdr, MPI_Aint am_hdr_sz)
/* nonblocking header + datatype send */
int MPIDI_[NM|SHM]_am_isend(int rank, MPIR_Comm * comm, int handler_id,
const void *am_hdr, MPI_Aint am_hdr_sz,
const void *data, MPI_Aint count,
MPI_Datatype datatype, MPIR_Request * sreq)
/* nonblocking headers + datatype send */
int MPIDI_[NM|SHM]_am_isendv(int rank, MPIR_Comm * comm, int handler_id,
struct iovec *am_hdrs, size_t iov_len,
const void *data, MPI_Aint count,
MPI_Datatype datatype, MPIR_Request * sreq)
/* blocking header send (callback safe) */
int MPIDI_[NM|SHM]_am_send_hdr_reply(MPIR_Comm * comm, int src_rank,
int handler_id, const void *am_hdr,
MPI_Aint am_hdr_sz)
/* nonblocking header + datatype send (callback safe) */
int MPIDI_[NM|SHM]_am_isend_reply(MPIR_Comm * comm, int src_rank, int handler_id,
const void *am_hdr, MPI_Aint am_hdr_sz,
const void *data, MPI_Aint count,
MPI_Datatype datatype, MPIR_Request * sreq)
/* CTS for pt2pt messages */
int MPIDI_[NM|SHM]_am_recv(MPIR_Request * rreq)
/* largest header (in bytes) supported by the nm/shm */
MPI_Aint MPIDI_[NM|SHM]_am_hdr_max_sz(void)
/* eager size supported by transport (only used internally by nm/shm) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_limit(void)
/* eager buffer size supported by transport, used to assert pipeline protocol will work(?) */
MPI_Aint MPIDI_[NM|SHM]_am_eager_buf_limit(void)
/* return true/false if pt2pt message can be sent eagerly */
bool MPIDI_[NM|SHM]_am_check_eager(MPI_Aint am_hdr_sz, MPI_Aint data_sz,
const void *data, MPI_Aint count,
MPI_Datatype datatype, MPIR_Request * sreq)
/****************** Callback APIs ******************/
/* target-side message callback */
typedef int (*MPIDIG_am_target_msg_cb) (int handler_id, void *am_hdr,
void *data, MPI_Aint data_sz,
int is_local, int is_async, MPIR_Request ** req);
/* target-side completion callback */
typedef int (*MPIDIG_am_target_cmpl_cb) (MPIR_Request * req);
/* origin-side completion callback */
typedef int (*MPIDIG_am_origin_cb) (MPIR_Request * req);
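A sketch of how a protocol is typically built on top of these APIs (MY_PROTO_ID, my_hdr_t, my_proto_target_msg_cb, and my_proto_send are all hypothetical; real handler ids are enumerated and registered together with their callbacks during initialization):
/* hypothetical handler id for this protocol */
#define MY_PROTO_ID 0

/* protocol-specific header carried as the am header */
typedef struct {
    int tag;
    MPI_Aint data_sz;
} my_hdr_t;

/* target-side message callback, matching the MPIDIG_am_target_msg_cb typedef;
 * the progress engine invokes it when a message with MY_PROTO_ID arrives */
static int my_proto_target_msg_cb(int handler_id, void *am_hdr,
                                  void *data, MPI_Aint data_sz,
                                  int is_local, int is_async, MPIR_Request ** req)
{
    my_hdr_t *hdr = (my_hdr_t *) am_hdr;
    /* match hdr against posted requests, then copy or enqueue the payload ... */
    return MPI_SUCCESS;
}

/* origin side: send the header plus the user payload in one active message */
static int my_proto_send(const void *buf, MPI_Aint count, MPI_Datatype datatype,
                         int rank, int tag, MPIR_Comm * comm, MPIR_Request * sreq)
{
    my_hdr_t hdr;
    hdr.tag = tag;
    hdr.data_sz = 0;    /* filled in by the real protocol */
    return MPIDI_NM_am_isend(rank, comm, MY_PROTO_ID, &hdr, sizeof(hdr),
                             buf, count, datatype, sreq);
}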
We currently support shared memory communication based on POSIX shared memory.
We currently support the ch4:ofi and ch4:ucx netmods. A minimal netmod can be implemented with just a minimal set of byte-transfer functions, e.g. MPIDI_NM_am_isend, and the corresponding mechanism of calling back the handler function upon receiving the byte payload. To achieve better performance, both the libfabric and UCX libraries provide APIs to directly send messages with tag matching and RDMA operations that avoid extra data copies and shorten latency. Thus both netmods have direct implementations of the netmod API that skip active messages.
Both netmods provide direct RMA operations where they can, and fall back to active messages where there are limitations. All window synchronizations are based on active messages.
Collectives are implemented in src/mpi/coll/, with separate folders for each collective function.
Take MPI_Bcast for example: bcast/bcast.c defines 3 functions, mostly boilerplate. The binding layer will call MPIR_Bcast, which will call MPID_Bcast unless the control variables (CVARs) direct differently. Typically MPID_Bcast will fall back to calling MPIR_Bcast_impl, which will check CVARs and select a particular algorithm. By default, it will call MPIR_Bcast_allcomm_auto, which will automatically choose the best algorithm based on selection logic defined by a runtime JSON file.
Each algorithm is typically implemented in a separate file.
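A rough sketch of that selection logic (CVAR and function names are approximate; see bcast.c for the real code):
/* inside MPIR_Bcast_impl (simplified) */
switch (MPIR_CVAR_BCAST_INTRA_ALGORITHM) {
    case MPIR_CVAR_BCAST_INTRA_ALGORITHM_binomial:
        /* user forced a specific algorithm via the CVAR */
        mpi_errno = MPIR_Bcast_intra_binomial(buffer, count, datatype, root,
                                              comm_ptr, errflag);
        break;
        /* ... other explicitly selectable algorithms ... */
    case MPIR_CVAR_BCAST_INTRA_ALGORITHM_auto:
    default:
        /* consult the JSON-driven selection logic */
        mpi_errno = MPIR_Bcast_allcomm_auto(buffer, count, datatype, root,
                                            comm_ptr, errflag);
        break;
}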
MPI-IO (MPI_File) APIs are implemented as the Romio library. Romio is implemented entirely on top of other MPI functions and doesn't use MPICH internals. This is similar, in a way, to how we implement the Fortran bindings.
The MPI tool information interface (MPI_T) provides an orthogonal system for controlling and monitoring implementation behavior by means of control variables (CVARs), performance variables (PVARs), and the event callback interface.
All utilities whose functionality is independent of MPICH internals are implemented inside MPL. All functions from MPL use the MPL_ prefix.
Some essential MPICH functionality is provided by MPL, including:
- Debug logging
- Atomic variables
- Thread functions
- Memory tracing
- GPU support
Sometimes we use an additional MPIR wrapping layer on top of MPL functions to provide additional control. For example, the threading interfaces are wrapped in src/mpid/common/thread/mpidu_thread_fallback.h to allow the device to override the locking mechanism. As another example, the GPU interfaces are wrapped to allow MPIR_CVAR_ENABLE_GPU to bypass GPU functions (when it is set to disabled).
Control variables (CVARs) are a mechanism to control library behavior using environment variables. The MPI tool interface also allows using control variables in code at runtime (however, it is rather clunky and thus not widely used).
References:
- Notes on Process Manager
- Notes on Process Manager Interface
- Notes on Namepub
- Using the Hydra Process Manager
We have an extensive test suite in test/mpi/. While it is currently configured along with the mpich configure, it requires mpich to be installed before running make testing.
Tests are controlled by testlist files.