Amends and extends:
- RFC0012: Add APIs and internal support for RM-network library interactions
- Add attributes to support queries of Fabric Manager (FM) state and capability information
- Add an API by which the RM can specify fabric configuration requirements for quality of service, network topology, etc.
- Define events by which the FM can notify the rest of the system management stack (SMS) and applications about fabric events
Fabric Manager Interaction
This RFC extends the existing network integration to give workload managers (for scheduling purposes), resource managers (for fault detection), tools (e.g., debuggers), and applications greater access to fabric information and control over fabric resources.
[EXTENSION][CLIENT-API][SERVER-API][RM-INTERFACE]
[IN PROGRESS]
Copyright (c) 2018 Intel, Inc. All rights reserved.
This document is subject to all provisions relating to code contributions to the PMIx community as defined in the community's LICENSE file. Code Components extracted from this document must include the License text as described in that file.
The PMIx standard already includes support by which the resource manager can request fabric configuration information for a given allocation and forward it to the allocated nodes for local configuration of their NICs. This can include environment variables to be set prior to spawning application processes to direct the network library.
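For context, a host RM would typically exercise that existing support through the server setup APIs introduced by RFC0012. The following is a minimal sketch assuming those APIs; the wrapper function name is illustrative and error handling is omitted:

#include <pmix_server.h>

/* Callback invoked once the application-level setup data (fabric
 * configuration, environment variables, etc.) has been assembled */
static void setup_cbfunc(pmix_status_t status,
                         pmix_info_t info[], size_t ninfo,
                         void *provided_cbdata,
                         pmix_op_cbfunc_t cbfunc, void *cbdata)
{
    /* the host RM forwards the info array to its daemons on the
     * allocated nodes; each daemon then hands it to its local PMIx
     * server via PMIx_server_setup_local_support prior to spawning
     * the application's local processes */
    if (NULL != cbfunc) {
        cbfunc(PMIX_SUCCESS, cbdata);
    }
}

/* Request the setup data (including fabric configuration) for a namespace */
pmix_status_t request_fabric_setup(const char *nspace)
{
    return PMIx_server_setup_application(nspace, NULL, 0, setup_cbfunc, NULL);
}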
This RFC seeks to extend this integration to give workload managers (for scheduling purposes), resource managers (for fault detection), tools (e.g., debuggers), and applications greater access to fabric information and control over fabric resources. This includes defining:
(a) attributes for querying network topologies for scheduling and process placement, as well as traffic reports for application debugging and identification of bottlenecks;
(b) APIs by which the SMS can notify the fabric manager of configuration requirements such as quality of service (QoS) changes and network topology for a given allocation; and
(c) PMIx events for asynchronous notification of network issues.
The RFC will also bring the description in RFC0012 up to current PMIx standards practices.
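As an illustration of item (c), an application or tool could register for fabric-related notifications through the existing PMIx event machinery. The event code below is a placeholder, as the actual fabric event codes remain to be defined by this RFC:

#include <pmix.h>

/* Placeholder event code for illustration only - the actual fabric
 * event codes are to be defined by this RFC */
#define PMIX_FABRIC_UPDATE_EVENT (PMIX_EXTERNAL_ERR_BASE - 1)

/* Handler invoked when a fabric-related event is delivered; the info
 * array carries the details supplied by the fabric manager */
static void fabric_evhandler(size_t evhdlr_registration_id,
                             pmix_status_t status,
                             const pmix_proc_t *source,
                             pmix_info_t info[], size_t ninfo,
                             pmix_info_t results[], size_t nresults,
                             pmix_event_notification_cbfunc_fn_t cbfunc,
                             void *cbdata)
{
    /* ... react to the event, e.g., log the change or reroute traffic ... */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

void register_for_fabric_events(void)
{
    pmix_status_t code = PMIX_FABRIC_UPDATE_EVENT;
    PMIx_Register_event_handler(&code, 1, NULL, 0,
                                fabric_evhandler, NULL, NULL);
}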
Schedulers generally require knowledge of the fabric topology as part of their scheduling procedure to support optimized communication between processes on different nodes. Current methods utilize files, read by the scheduler when it starts, to communicate this information. However, the format of these files tends to be both scheduler- and fabric-vendor-specific, making them more difficult to maintain as cluster configurations evolve (e.g., as a cluster reconfigures its network topology, or as vendors offer new fabrics).
PMIx defines two modes for supporting inventory collection:
- Fabric manager query. If the FM supports a global report of resources, then the PMIx server hosted by the scheduler can obtain the report for the entire system. This is the preferred method for obtaining inventory information, as it supports dynamic requests (e.g., triggered by notification of a network change) and directly provides information on inter-switch connectivity.
- Rollup of information from individual daemons at boot. This procedure involves direct discovery of NICs on individual nodes, with the resulting information passed in a scalable manner back to the scheduler. If individual resource plugins do not support local NIC discovery, then this mode may require that PMIx be configured with third-party libraries (e.g., HWLOC) to enable discovery; a sketch of such third-party discovery appears below. Given that switches are not likely to have RM daemons on them, this mode requires an alternative method (e.g., a file) for providing information on the switch/connectivity map.
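For the rollup mode, a minimal sketch of HWLOC-based local NIC discovery follows, assuming hwloc 2.x; actual discovery would be performed by the resource plugins:

#include <stdio.h>
#include <hwloc.h>

/* Enumerate fabric NICs (OpenFabrics OS devices) visible on the local node */
static void discover_local_nics(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t dev = NULL;

    hwloc_topology_init(&topo);
    /* enable discovery of I/O devices, which are filtered out by default */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_ALL);
    hwloc_topology_load(topo);

    while (NULL != (dev = hwloc_get_next_osdev(topo, dev))) {
        if (HWLOC_OBJ_OSDEV_OPENFABRICS == dev->attr->osdev.type) {
            /* e.g., "mlx5_0" - contact info such as the GID comes from
             * the device's info attributes */
            printf("Found fabric NIC: %s\n", dev->name);
        }
    }
    hwloc_topology_destroy(topo);
}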
Regardless of which mechanism is used, the final inventory needs to include both a topology description of switch connectivity and a description of the NICs on each node. The latter must include at least all fabric-related contact information (e.g., GID/LID) as well as the network plane to which each NIC is attached. Additional information on NIC capabilities (e.g., firmware version, memory size) is desirable, but not required.
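As a sketch of how one NIC's inventory entry might be expressed, assuming hypothetical key strings (the actual attribute names remain to be defined):

#include <pmix_common.h>

/* Illustrative packing of one NIC's inventory entry - the key strings
 * here are hypothetical, not defined PMIx attributes */
static void pack_nic_inventory(pmix_info_t info[3])
{
    PMIX_INFO_LOAD(&info[0], "pmix.fabric.device", "mlx5_0", PMIX_STRING);
    PMIX_INFO_LOAD(&info[1], "pmix.fabric.lid", "0x0004", PMIX_STRING);    /* contact info */
    PMIX_INFO_LOAD(&info[2], "pmix.fabric.plane", "plane0", PMIX_STRING);  /* network plane */
}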
To aid in implementation, PMIx provides an API to request that the PMIx server execute the discovery and report the information back to the host RM daemon. Attributes are provided to help tailor the request (e.g., to include specific types of interfaces, processor architecture, or physical memory size and configuration).
/* Inventory attributes */
#define PMIX_MASTER_SERVER "pmix.mstr.svr"  // (bool) indicates that this PMIx server directly
                                            // supports the system controller (often the
                                            // workload manager, or scheduler)

/* Collect an inventory of local resources. If the plugin for a
 * particular resource is capable of obtaining a global map of its
 * resources, then only the server given the PMIX_MASTER_SERVER
 * attribute shall collect it - all other servers will ignore the
 * request.
 *
 * This is a non-blocking API as it may involve somewhat lengthy
 * operations to obtain the requested information. The collected
 * inventory is returned as an array of pmix_info_t via the provided
 * callback function.
 */
PMIX_EXPORT pmix_status_t PMIx_server_collect_inventory(pmix_info_t directives[], size_t ndirs,
                                                        pmix_info_cbfunc_t cbfunc, void *cbdata);
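For illustration, a host daemon supporting the system controller might invoke the collection as follows. This is a minimal sketch; the function names other than the PMIx calls are hypothetical:

#include <pmix_server.h>

/* Callback receiving the collected inventory as an array of pmix_info_t */
static void inventory_cbfunc(pmix_status_t status,
                             pmix_info_t info[], size_t ninfo,
                             void *cbdata,
                             pmix_release_cbfunc_t release_fn,
                             void *release_cbdata)
{
    /* ... roll the inventory up to the scheduler ... */
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
}

void start_inventory_collection(void)
{
    /* static so the directive remains valid until the callback executes */
    static pmix_info_t directive;
    bool flag = true;

    /* mark this server as the one directly supporting the system controller */
    PMIX_INFO_LOAD(&directive, PMIX_MASTER_SERVER, &flag, PMIX_BOOL);
    PMIx_server_collect_inventory(&directive, 1, inventory_cbfunc, NULL);
}

Servers not given the PMIX_MASTER_SERVER attribute would, per the description above, simply ignore a request for a globally available report.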
Inventory collection is expected to be a rare event, occurring at system startup and upon command from a system administrator. Inventory updates, in contrast, are expected to be smaller operations covering only the changed information. For example, replacement of a node would generate an event notifying the scheduler of an inventory update without invoking the global inventory operation, as sketched below.
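A daemon hosting a replaced node might raise such a notification along the following lines; again, the event code is a placeholder for whatever this RFC ultimately defines:

#include <pmix.h>

/* Placeholder status code - the actual update-event code is to be
 * defined by this RFC */
#define PMIX_INVENTORY_UPDATE_EVENT (PMIX_EXTERNAL_ERR_BASE - 2)

/* Alert interested parties (e.g., the scheduler) that local inventory
 * has changed, delivering only the changed information */
void notify_inventory_update(const pmix_proc_t *myproc,
                             pmix_info_t *changed, size_t nchanged)
{
    PMIx_Notify_event(PMIX_INVENTORY_UPDATE_EVENT, myproc,
                      PMIX_RANGE_SESSION, changed, nchanged,
                      NULL, NULL);
}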
Ralph H. Castain
Intel, Inc.
GitHub: rhc54