This issue was moved to a discussion.

Business Card Exchange for Process-to-Process Wire-up #191

Closed
jjhursey opened this issue Jun 1, 2019 · 14 comments
Labels
Use Case Description of a Use Case

Comments

@jjhursey
Member

jjhursey commented Jun 1, 2019

Brief Description

Multi-process communication libraries, such as MPI, need to establish communication channels between a set of processes. Each process needs to share connectivity information (a.k.a. Business Cards) with all other processes before communication channels can be established. The runtime environment must provide a mechanism for the efficient exchange of this connectivity information. Additional information about the current state of the job (e.g., number of processes globally and locally) and about how the process was started (e.g., process binding) is also helpful.

Use Case Details

Note: The Instant-On wire-up mechanism is a separate, related use case.

Multi-process communication libraries, such as MPI, need to establish communication channels between a set of processes. Each process needs to share connectivity information (a.k.a. Business Cards) with all other processes before communication channels can be established. This connectivity information may take the form of one or more unique strings that allow a different process to establish a communication channel with the originator.

Each process provides its business card to PMIx via one or more PMIx_Put operations, each of which stores a tuple of {UID, key, value}. The UID is the unique name for this process in the PMIx universe (i.e., namespace and rank). The key is a name that other processes can reference generically (note that since the UID is also associated with the key, there is no need to make the key uniquely named per process). The value is the string representation of the connectivity information.
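To make the {UID, key, value} tuple concrete, here is a minimal sketch written in Python rather than against the PMIx C API. It is a toy in-memory model under stated assumptions: `ProcId`, `put`, `get`, the namespace `job-1`, and the endpoint strings are all hypothetical names invented for illustration. It shows why a key does not need to be unique per process: the UID is part of every lookup.

```python
from typing import Dict, NamedTuple


class ProcId(NamedTuple):
    """Hypothetical stand-in for a PMIx UID: (namespace, rank)."""
    nspace: str
    rank: int


# Toy store: one dictionary of key -> value per process UID.
store: Dict[ProcId, Dict[str, str]] = {}


def put(uid: ProcId, key: str, value: str) -> None:
    """Record the tuple {UID, key, value}."""
    store.setdefault(uid, {})[key] = value


def get(uid: ProcId, key: str) -> str:
    """Look up the value a given process published under a key."""
    return store[uid][key]


# Two processes publish under the same generic key without colliding.
put(ProcId("job-1", 0), "tcp.endpoint", "10.0.0.1:5000")
put(ProcId("job-1", 1), "tcp.endpoint", "10.0.0.2:5000")
```

A reader that wants the address of rank 1 asks for `get(ProcId("job-1", 1), "tcp.endpoint")`; the same key string works for every peer.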

Some business card information is meant for remote processes (e.g., TCP or InfiniBand addresses) while other information is meant only for local processes (e.g., shared memory information). As such, a scope should be associated with the PMIx_Put operation to differentiate this intention.
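The effect of a scope can be sketched as a visibility filter. The following Python toy model is an illustration only: the `Scope` enum, its three levels, the `visible_keys` helper, and the key names are assumptions of this sketch, not definitions taken from the use case text.

```python
from enum import Enum
from typing import List, Tuple


class Scope(Enum):
    """Hypothetical visibility levels for a published value."""
    LOCAL = "local"    # only processes on the same node
    REMOTE = "remote"  # only processes on other nodes
    GLOBAL = "global"  # every process


def visible_keys(entries: List[Tuple[str, Scope]], same_node: bool) -> List[str]:
    """Return the keys a reader may see, given whether it shares the publisher's node."""
    allowed = {Scope.GLOBAL, Scope.LOCAL if same_node else Scope.REMOTE}
    return [key for key, scope in entries if scope in allowed]


# One process's business card, with a scope attached to each entry.
card = [
    ("shmem.segment", Scope.LOCAL),
    ("tcp.endpoint", Scope.REMOTE),
    ("lib.version", Scope.GLOBAL),
]
```

In this model a peer on the same node sees the shared memory entry but not the TCP endpoint, and a peer on another node sees the reverse.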

The PMIx_Put operations may be cached local to the process. Once all PMIx_Put operations have been called, each process should call PMIx_Commit to push those values to the local PMIx server. Note that in a multi-library configuration each library may call PMIx_Put and then PMIx_Commit, so there may be multiple PMIx_Commit calls before a Business Card Exchange is activated.
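The put-then-commit behavior can be sketched as a client-side cache that is flushed to a server. This is a toy Python model, not the PMIx implementation: `ToyServer`, `ToyClient`, and the `libA`/`libB` key names are hypothetical.

```python
from typing import Dict


class ToyServer:
    """Hypothetical stand-in for the local PMIx server."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}


class ToyClient:
    """Hypothetical client that caches puts until commit is called."""

    def __init__(self, server: ToyServer) -> None:
        self.server = server
        self.pending: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        # Cached in the process; nothing reaches the server yet.
        self.pending[key] = value

    def commit(self) -> None:
        # Push everything cached so far, then clear the cache.
        self.server.data.update(self.pending)
        self.pending.clear()


server = ToyServer()
client = ToyClient(server)

# Library A publishes and commits.
client.put("libA.endpoint", "addr-a")
visible_before_commit = dict(server.data)  # snapshot taken before the commit
client.commit()

# Library B does the same later, giving a second commit before any exchange.
client.put("libB.endpoint", "addr-b")
client.commit()
```

After both commits the server holds the values from both libraries, which is the state an exchange would then distribute.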

After calling PMIx_Commit, a process can activate the Business Card Exchange collective operation by calling PMIx_Fence. The PMIx_Fence operation is collective over the set of processes specified in the argument set. This allows the collective to span a subset of a namespace or multiple namespaces. After the completion of the PMIx_Fence operation, the data PMIx_Put by other processes is available to the local process through a call to PMIx_Get, which returns the key/value pairs necessary to establish the connection(s) with the other processes.
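The role of the participant set can be sketched with a sequential toy model in Python. The names `committed`, `known`, `fence`, and `get`, and the namespaces `job-1` and `job-2`, are hypothetical. The sketch also leaves out the blocking behavior of a real collective and only shows who ends up with whose data.

```python
from typing import Dict, Iterable, Tuple

Uid = Tuple[str, int]  # hypothetical (namespace, rank) pair

# Data each process has already committed to its server.
committed: Dict[Uid, Dict[str, str]] = {
    ("job-1", 0): {"endpoint": "addr-0"},
    ("job-1", 1): {"endpoint": "addr-1"},
    ("job-2", 0): {"endpoint": "addr-x"},
}

# What each process can read after an exchange: reader -> publisher -> key -> value.
known: Dict[Uid, Dict[Uid, Dict[str, str]]] = {uid: {} for uid in committed}


def fence(participants: Iterable[Uid]) -> None:
    """Exchange committed data among the listed participants only."""
    members = list(participants)
    for reader in members:
        for publisher in members:
            known[reader][publisher] = dict(committed[publisher])


def get(reader: Uid, publisher: Uid, key: str) -> str:
    """Read a value that a previous fence made available to the reader."""
    return known[reader][publisher][key]


# The participant set spans two namespaces but omits rank 1 of job-1.
fence([("job-1", 0), ("job-2", 0)])
```

Rank 0 of `job-1` can now read the endpoint of rank 0 of `job-2`, while rank 1 of `job-1`, which was not in the set, has received nothing.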

The PMIx_Fence operation must have a "Synchronize Only" mode that works as a barrier operation. This is helpful if the communication library requires a synchronization before leaving initialization or starting finalization, for example.
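Barrier semantics can be illustrated with Python's standard `threading.Barrier`, using threads as stand-ins for processes. This is an analogy under stated assumptions, not PMIx code: `NUM_PROCS`, `worker`, and the `"init"`/`"run"` markers are hypothetical.

```python
import threading
from typing import List

NUM_PROCS = 4  # assumed number of participants
barrier = threading.Barrier(NUM_PROCS)
events: List[str] = []
lock = threading.Lock()


def worker() -> None:
    with lock:
        events.append("init")  # work done before the synchronization point
    barrier.wait()  # synchronize only: no data is exchanged here
    with lock:
        events.append("run")  # work that must not start until everyone arrived


threads = [threading.Thread(target=worker) for _ in range(NUM_PROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No thread records `"run"` until every thread has recorded `"init"`, which is the guarantee a communication library needs before leaving initialization.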

The PMIx_Fence operation should have a "Sparse" mode in addition to a "Full" mode for the data exchange. The "Full" mode will fully exchange all Business Card information to all other processes. This is helpful for tightly communicating applications. The "Sparse" mode will dynamically pull the connectivity information on demand from inside of PMIx_Get (if it is not already available locally). This is helpful for sparsely communicating applications. Since the PMIx library cannot infer which mode is best for an application, the caller must specify which mode works best for their application.

The PMIx_Fence operation should have an option for the end user to specify which mode they desire for this operation.
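The difference between the two modes can be sketched as eager versus lazy population of a local cache. This Python toy model is an illustration only: `ToyProcess`, `remote_cards`, the `full` flag, and the `remote_fetches` counter are hypothetical names, and the real on-demand path involves server-to-server communication that is not modeled here.

```python
from typing import Dict

# Business cards committed at each rank's own server (assumed values).
remote_cards: Dict[int, str] = {0: "addr-0", 1: "addr-1", 2: "addr-2", 3: "addr-3"}


class ToyProcess:
    """Hypothetical process with a local cache of peers' business cards."""

    def __init__(self) -> None:
        self.cache: Dict[int, str] = {}
        self.remote_fetches = 0

    def fence(self, full: bool) -> None:
        # Full mode copies every card up front; sparse mode copies nothing.
        if full:
            self.cache = dict(remote_cards)

    def get(self, rank: int) -> str:
        # On a cache miss, fall through to an on-demand fetch.
        if rank not in self.cache:
            self.remote_fetches += 1
            self.cache[rank] = remote_cards[rank]
        return self.cache[rank]


full_proc = ToyProcess()
full_proc.fence(full=True)

sparse_proc = ToyProcess()
sparse_proc.fence(full=False)
sparse_proc.get(2)
sparse_proc.get(2)  # second lookup is served from the cache
```

The full-mode process pays for all four cards during the fence and never fetches again; the sparse-mode process holds only the one card it asked for, at the cost of one fetch inside `get`.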

Additional information about the current state of the job (e.g., number of processes globally and locally) and about how the process was started (e.g., process binding) is also helpful. This "job level" information must be available immediately after PMIx_Init without the need for any explicit synchronization.

The number of processes globally in the namespace and this process's rank within that namespace are important to know before establishing the Business Card information to best allocate resources.

The number of processes local to the node and this process's local rank are important to know before establishing the Business Card information because they help the caller determine the scope of the put operation. For example, the local rank can be used to designate a leader that sets up a shared memory segment of the proper size before putting that information into the locally scoped Business Card information.
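A minimal sketch of the leader example, in Python, might look as follows. The convention that local rank 0 is the leader, the per-process slot size, and the names `is_leader` and `segment_size` are assumptions made for illustration.

```python
def is_leader(local_rank: int) -> bool:
    """Assumed convention: local rank 0 sets up the shared memory segment."""
    return local_rank == 0


def segment_size(local_size: int, bytes_per_proc: int) -> int:
    """Size the segment so that every process on this node gets one slot."""
    return local_size * bytes_per_proc


LOCAL_SIZE = 8         # assumed number of processes on this node
BYTES_PER_PROC = 4096  # assumed per-process slot size

# Exactly one local rank qualifies as the leader.
leaders = [rank for rank in range(LOCAL_SIZE) if is_leader(rank)]
```

Only the leader would create the segment and publish its name with a node-local scope; the other local ranks would read that entry after the exchange.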

The number of processes local to a remote node is also helpful to know before establishing the Business Card information. This information is useful to pre-establish local resources before that remote node starts to initiate a connection or to determine the number of connections that need to be advertised in the Business Card when it is sent out.
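As a sketch of how remote process counts could feed into resource planning, the following Python fragment counts the peers on each remote node. The job layout, the node names, and the one-connection-per-remote-peer policy are assumptions of this example; a real library may multiplex connections differently.

```python
from typing import Dict

# Assumed job layout: node name -> number of this job's processes on that node.
procs_per_node: Dict[str, int] = {"node-a": 4, "node-b": 2, "node-c": 2}


def expected_remote_peers(my_node: str) -> Dict[str, int]:
    """Number of peers on each remote node."""
    return {node: n for node, n in procs_per_node.items() if node != my_node}


def connections_to_advertise(my_node: str) -> int:
    """Assumed policy: one connection per remote peer."""
    return sum(expected_remote_peers(my_node).values())
```

A process on `node-a` would prepare for four incoming connections in this layout, while a process on `node-b` would prepare for six.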

Note that some of the job level information may change over the course of the job in a dynamic application.

Interfaces

PMIx_Put
PMIx_Get
PMIx_Commit
PMIx_Fence
PMIx_Init

Keys

The following job level information is useful to have before establishing Business Card information:

  • PMIX_NODE_LIST: List of nodes in the job
  • PMIX_NUM_NODES: Number of nodes in the job
  • PMIX_NODEID: Node ID where this process is located
  • PMIX_JOB_SIZE: Number of processes globally in the job
  • PMIX_PROC_MAP: Mapping of processes to nodes in this job
  • PMIX_LOCAL_PEERS: List of local processes on this node in this job
  • PMIX_LOCAL_SIZE: Number of processes local to this node

For each process this information is also useful (note that any one process may want to access this list of information about any other process in the system):

  • PMIX_RANK: My global rank in the job
  • PMIX_LOCAL_RANK: My local rank on this node for this job
  • PMIX_GLOBAL_RANK: My global rank across all namespaces
  • PMIX_LOCALITY_STRING: Process binding on this node
  • PMIX_HOSTNAME: Hostname associated with this process (useful for queries about remote processes)
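Several of the values listed above are related to one another. The Python sketch below derives the counts, the local peer list, and a local rank from a node list and a process map. The plain list and dictionary layout is an assumption made for readability and does not reflect how PMIx encodes these values; `node_list`, `proc_map`, `local_peers`, and `local_rank` are hypothetical names.

```python
from typing import Dict, List

# Assumed plain-Python layout; PMIx's own encoding of these values differs.
node_list: List[str] = ["node-a", "node-b"]
proc_map: Dict[str, List[int]] = {"node-a": [0, 1, 2], "node-b": [3, 4]}

num_nodes = len(node_list)                                 # number of nodes in the job
job_size = sum(len(ranks) for ranks in proc_map.values())  # processes in the job


def local_peers(node: str) -> List[int]:
    """Global ranks of the processes placed on the given node."""
    return proc_map[node]


def local_rank(node: str, rank: int) -> int:
    """Position of a global rank among the processes on its node."""
    return proc_map[node].index(rank)
```

Because the map covers every node, a process can answer the same questions about a remote node, for example `len(local_peers("node-b"))` for the number of processes there.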

There are other keys that are helpful to have before a synchronization point; this is not meant to be a comprehensive list.

References

  • PMI v2 link
  • PMIx: Process management for exascale environments (Nov. 2018) DOI
@jjhursey added the WorkInProgress and Use Case labels Jun 1, 2019
@jjhursey
Member Author

jjhursey commented Jun 1, 2019

Note: Still a Work In Progress. There are some more details that I want to add about job level information and the behavior of put/get.

@SteVwonder
Contributor

Thanks @jjhursey for putting this together. As you mention, MPI has this use-case. Presumably SHMEM does as well. Are there other popular libraries that we can list as having this use-case?

@jjhursey
Member Author

jjhursey commented Jun 5, 2019

I updated the description with a few of the specific job/process level information items. I'll take the WIP tag off this so we can have a discussion and further expand on this.

@jjhursey removed the WorkInProgress label Jun 5, 2019
@jjhursey
Member Author

jjhursey commented Jun 5, 2019

For a list of the job/process level information, see section 11.1.3 (PMIx_server_register_nspace). Document Link

Suggestion from the teleconf:

  • Add a marking for optional vs required. For each, a small explanation, especially for the optional values, explain the use case for it and the implications (e.g., performance or design/workaround) if it is not provided.

@kathrynmohror
Collaborator

There are lots of other use cases for this functionality, e.g. tools (I/O middleware, performance tools, etc.), programming model runtimes (MPI of course and others), probably anything that doesn't rely on MPI and wants to bootstrap communication.

This is kind of a meta point: Do we want to write all of these use cases wrt to how it is done in the current version of PMIx (i.e. naming the API calls to use), or do we want to simply map out the general functionality needed by the use case (independent of what PMIx does)? I thought we were doing the latter since in some cases we may be describing use cases for which there is no interface yet.

@jjhursey
Member Author

jjhursey commented Jun 6, 2019

The goal of the use case is to present a description of a problem with guidance on how it might be solved in PMIx. The author can describe it in a broad sense or with specifics about interfaces they think would be helpful from PMIx.

The PMIx community then can help the author link it to existing PMIx interfaces that might address the use case. If new interfaces are needed then they can be worked out.

The idea was to have a low bar for suggesting use cases and engaging with the community (no need to be a PMIx expert). Then the PMIx community can engage to help see how it fits into what is currently defined by PMIx, and what might need to be defined still.

@SteVwonder
Contributor

Notes from the meeting yesterday relevant to this issue:

  • We may want to add a sub use-case under bootstrap for cloud services which require the exchange of authentication information.
  • It would be useful to add links to projects that have this use-case and what interfaces/attrs/keys they leverage. In the case of this use-case, OpenMPI and Spectrum MPI were mentioned.

> This is kind of a meta point: Do we want to write all of these use cases wrt to how it is done in the current version of PMIx (i.e. naming the API calls to use), or do we want to simply map out the general functionality needed by the use case (independent of what PMIx does)?

I'm not sure exactly what we want the final form of the use-cases to look like, but I think the most useful would be both of what you describe: the general functionality/use-case and the specifics for each sub use-case (what does MPI require, what does a debugger require, what does a kubelet (or other cloud technology) require, etc). If a particular sub use-case is exactly the same as another (which may be the case for MPI and SHMEM, idk), then that can be called out too.

@SteVwonder
Contributor

Updated the original post with a snapshot of the use-case from our Google Drive drafts folder: https://drive.google.com/open?id=1eN7aBxyzPD0a_GJFq1KH2ZHpoONj76op

@artpol84
Member

@artpol84
Member

The above is an example of a business card exchange.

@jjhursey
Member Author

jjhursey commented Mar 2, 2020

I updated the Google Drive version of this use case to make a clearer distinction of the role of PMIx_Fence and PMIx_Get in the Direct Business Card Exchange and Full Business Card Exchange.

The new text starts in the paragraph starting with:

> After the completion of the PMIx_Commit

Let me know what you think.

@SteVwonder
Contributor

Thanks @jjhursey! I left a comment in the google drive version too, but for anyone having trouble accessing that, my question was w.r.t. this sentence:

> The PMIx_Fence operation takes an additional parameter indicating that a full exchange of business card information is requested of the runtime environment.

What is the "runtime environment" in this scenario? I presume the PMIx server, but I'm not sure.

@jjhursey
Member Author

I updated the Google Drive version to clarify that "runtime environment" refers to the RM.

@jjhursey
Member Author

The v5.0.x PR #328 included this use case. The issue will remain open for further discussion on this topic.

@jjhursey removed their assignment Jan 12, 2023
@pmix locked and limited conversation to collaborators Feb 9, 2023
@raffenet converted this issue into discussion #438 Feb 9, 2023
