Rm scon

RM SCON Annapurna Dasari Intel Inc.

Transcript of Rm scon

Page 1: Rm scon

RM SCON

Annapurna Dasari, Intel Inc.

Page 2: Rm scon

RM Overlay Networks

• An overlay network of the resident RM daemons (i.e., the slurmd/orcmd overlay network).
– Spans the daemons in the allocation.
– Lifetime: session start to session end, covering single or multiple job executions.

• An overlay network of the job step daemons responsible for launching and managing the MPI processes.
– One participant per node.
– Spans all step daemons where the job is executing.
– Lifetime: job launch to job termination.

• Embedded overlay networks that use the RM's native communication.

Page 3: Rm scon

Additional RM Overlay Networks

• The RM may create an additional special-purpose overlay network outside of its own integrated communication system for:
– Reliable broadcast, for event notification and job launch.
– Scalable collectives, for initial wire-up and out-of-band (OOB) requests for job information by debuggers or other tools.
– High-speed fabrics: RM daemons can speed up communications by using an overlay network that sends messages over a high-speed fabric and has the hooks needed to take advantage of the fabric's capabilities (for example, providing the required QoS).

Page 4: Rm scon

Additional RM Overlay Networks

• RMs can create an ad hoc overlay network on demand for a specific purpose, or they can offload all further communications to the newly created overlay network.

• The SCON library provides an interface to create a SCON, send and receive point-to-point messages, and perform broadcast, allgather, and barrier operations.

Page 5: Rm scon

PMIx Event Notification SCON

• PMIx provides a capability that allows MPI applications to ask the RM to notify them of error events relevant to them. They can also ask the RM to propagate events they detect to their peers.

• The RM local PMIx server is required to provide reliable notification of the error events to all interested parties (the processes may belong to other jobs).

• The RM local PMIx server could create a SCON among the interested parties and broadcast the event notification on that SCON.

Page 6: Rm scon

PMIx Event Notification SCON

[Figure: diagram with eight nodes, each labeled "local RM PMIx Server"]

Page 7: Rm scon

Event Notification SCON Creation

• The SCON participants consist of:
– All local RM (PMIx server) daemons for the specific job, plus
– All local RM daemons of any job to which a process of the specified job is connected: the parent process that spawned the job and any explicit connections requested via PMIx connect.

• The SCON can be created during PMIx server initialization or on demand.

• RM PMIx server daemons may join and leave the SCON according to the PMIx connect/disconnect requests?
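
The participant rule on this slide can be read as a set union. The following is a minimal illustration in Python, not SCON or PMIx code: the job names, daemon names, the `connections` model, and the `scon_participants` helper are all hypothetical.

```python
def scon_participants(job, daemons_by_job, connections):
    """Return the sorted daemon list for the event notification SCON of `job`.

    daemons_by_job: job id -> set of local RM daemons hosting that job.
    connections: set of frozenset({job_a, job_b}) pairs. In this hypothetical
    model a pair stands for either a parent/child spawn link or an explicit
    connect request.
    """
    participants = set(daemons_by_job[job])
    for pair in connections:
        if job in pair:
            # Add the daemons of every job connected to `job`.
            for other in pair - {job}:
                participants |= daemons_by_job[other]
    return sorted(participants)


daemons_by_job = {
    "job1": {"node0", "node1"},
    "job2": {"node1", "node2"},  # assumed to be spawned by job1
    "job3": {"node3"},           # unrelated job
}
connections = {frozenset({"job1", "job2"})}

print(scon_participants("job1", daemons_by_job, connections))
# prints ['node0', 'node1', 'node2']
```

A daemon that hosts processes of both jobs (node1 here) appears once, since participation is per daemon rather than per process.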

Page 8: Rm scon

Creating an RM SCON

• The participant daemons create the SCON by calling the create API.
• The participant list is provided by the daemon.
• Specify info keys for fabric selection and topology.
• Request reliable broadcast capability during create by specifying the info key.
• The topology is a tree with the rank 0 daemon at the root.
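
The slide only says that the topology is a tree rooted at rank 0. As an illustration of what each daemon would need to know to wire up, here is a sketch that assumes a k-ary tree with a default fan-out of 2. The fan-out, the rank numbering, and the `tree_neighbors` helper are assumptions, not part of the SCON library.

```python
def tree_neighbors(rank, size, fanout=2):
    """Return (parent, children) of `rank` in a k-ary tree rooted at rank 0."""
    if not 0 <= rank < size:
        raise ValueError("rank out of range")
    # The root has no parent; every other rank hangs off (rank - 1) // fanout.
    parent = None if rank == 0 else (rank - 1) // fanout
    first = rank * fanout + 1
    children = [c for c in range(first, first + fanout) if c < size]
    return parent, children


for rank in range(6):
    print(rank, tree_neighbors(rank, 6))
```

With six daemons, rank 0 has children 1 and 2, rank 2 has the single child 5, and ranks 3 to 5 are leaves.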

Page 9: Rm scon

Creating an RM SCON

• In response, the SCON library wires up the SCON according to the topology and completes the create operation when all participants have joined.

• The SCON is then operational. Typical operations include:
– Broadcast of event notifications.
– Allgather/barrier for fence operations.
– Point-to-point send/recv?
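
The completion rule for create (the operation finishes only when every listed participant has joined) can be modeled with a small tracker. This is a toy sketch under that one assumption. The `PendingCreate` class and daemon names are hypothetical and do not reflect how the SCON library tracks membership internally.

```python
class PendingCreate:
    """Tracks a create operation that completes once every participant joins."""

    def __init__(self, participants):
        self.expected = set(participants)
        self.joined = set()

    @property
    def complete(self):
        return self.joined == self.expected

    def join(self, daemon):
        """Record that `daemon` has joined; return True once create is complete."""
        if daemon not in self.expected:
            raise ValueError(f"{daemon} is not in the participant list")
        self.joined.add(daemon)
        return self.complete


create = PendingCreate(["d0", "d1", "d2"])
print(create.join("d0"))  # prints False
print(create.join("d2"))  # prints False
print(create.join("d1"))  # prints True
```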

Page 10: Rm scon

Join an RM SCON

• Additional RM daemons may join an existing RM SCON in response to connect requests.
• The joining daemon must give the library the information identifying the existing SCON it wants to join.
• The library would join the new member using internal xcast and allgather operations among all members.
• Should we provide a new scon_join API, or enable join by specifying a join directive in the create info keys?
• Do we need to support a join?
• Do existing participants learn about the new member outside of the SCON? How do RMs handle connect requests today?

Page 11: Rm scon

Reliable xcast on RM SCON

• The notifying daemon broadcasts the event notification message to all members by calling the scon_xcast API.

• Is it required, nice to have, or not required to know:
– When the xcast has completed?
– The xcast completion status, with detailed error information?

• What would the RM daemon do if the xcast cannot be delivered to some of the participant daemons?
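
To make the completion-status question concrete, here is one possible shape for the answer: a simulated xcast that reports which ranks were reached and which were not. It is a sketch of one design option, not the behavior of scon_xcast. It assumes the sender is rank 0, a k-ary tree with fan-out 2, and a relay policy in which a daemon that finds a child down forwards directly to that child's children.

```python
def xcast(size, down=frozenset(), fanout=2):
    """Simulate a broadcast from rank 0 over a k-ary tree.

    Ranks in `down` cannot receive the message. Their children are still
    visited, which models the assumed policy of routing around a failed relay.
    Returns (delivered, failed) rank lists in breadth-first order.
    """
    if 0 in down:
        raise ValueError("the notifying daemon (rank 0) must be alive")

    def children(rank):
        first = rank * fanout + 1
        return [c for c in range(first, first + fanout) if c < size]

    delivered, failed = [], []
    pending = [0]
    while pending:
        rank = pending.pop(0)
        (failed if rank in down else delivered).append(rank)
        pending.extend(children(rank))
    return delivered, failed


print(xcast(7, down={1}))
# prints ([0, 2, 3, 4, 5, 6], [1])
```

Returning the failed ranks alongside the delivered ones gives the caller the detailed error information the slide asks about. Whether the RM daemon retries, drops the member, or aborts is left open, as on the slide.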

Page 12: Rm scon

Allgather/Barrier on RM SCON

• Perform a barrier operation by calling the scon_barrier API.

• Allgather API to be added.

• Brainstorm:
– Reliable allgather/barrier? Being able to synchronize among the live participants.
– Allgather/barrier timeout: each process releases itself from the barrier upon timeout?
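
The timeout idea from the brainstorm can be tried out with Python's standard `threading.Barrier`, where threads stand in for daemons. This only illustrates the semantics being discussed (each waiter releases itself when the timeout expires) and says nothing about how scon_barrier is or will be implemented. The `run_barrier` helper is hypothetical.

```python
import threading


def run_barrier(live, expected, timeout=0.5):
    """Have `live` of `expected` participants enter a barrier; report outcomes."""
    barrier = threading.Barrier(expected)
    results = {}

    def participant(rank):
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "released"
        except threading.BrokenBarrierError:
            # Raised in the waiter whose timeout expired and in every other
            # waiter, because a timeout puts the barrier in the broken state.
            results[rank] = "timed out"

    threads = [threading.Thread(target=participant, args=(r,)) for r in range(live)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_barrier(3, 3))  # all three participants arrive
print(run_barrier(2, 3))  # one participant never arrives
```

In the second call the two live participants both leave the barrier after the timeout rather than blocking forever. Synchronizing only among the live participants, as the first brainstorm item suggests, would additionally require agreeing on who is still alive.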