Minutes from the June 1999 meeting of the Data Reorganization Forum

Location: Lockheed Martin GES, Moorestown, NJ

June 10 meeting: 1pm – 5pm

The June 10 meeting was an executive session, held at Tony’s request, to organize the June 11, 1999 discussion and future near-term activities.

Attendance:

Tony Skjellum (MPI Software Technology, Inc. / MSU)

Ken Cain (MITRE)

Karen Lauro (Mercury Computer)

James Lebak (MIT Lincoln Laboratory)

Jon Greene (Mercury Computer)

Tom McClean (Lockheed Martin GES)

[EDITORIAL NOTE: These notes contain a lot of detail from our discussions, but the basic goal was to generate an agenda of discussion topics for the June 11 meeting. That agenda is listed as item 6 in this (June 10 meeting minutes) section of the document. Items 1-5 show how we arrived at that agenda.

ANOTHER NOTE: "gdo" means "global data object" throughout the discussion.]

 

1) Discussion of standalone API versus layered or co-layered DRI with other middleware implementations

 

2) Discussion of need for high-level data distribution:

3) Process set dimensionality discussion:

4) Need to address multiple data buffers issue

[ EDITORIAL NOTE: This document will refer to the first approach as "multiple buffering" and to the second approach as "flow control". The forum’s discussions to date show some consensus for these terms, so let’s adopt them for clarity in future communications. ]

5) MPI/RT bindings?

6) Set preliminary agenda for June 11 meeting

 

 

June 11 meeting: 8:45am – 4pm

Attendance:

Ken Cain (MITRE)

Karen Lauro (Mercury Computer)

James Lebak (MIT Lincoln Laboratory)

Jon Greene (Mercury Computer)

Tom McClean (Lockheed Martin GES)

Rick Pancoast (Lockheed Martin GES)

Nathan Doss (Lockheed Martin GES)

Shane Hebert (MPI Software Technology, Inc. / MSU)

Steve Paavola (Sky Computer)

Dennis Cottel (SPAWARSYSCEN, S.D.)

Randy Judd (SPAWARSYSCEN, S.D.)

Arkady Kanevsky (Mercury Computer)

James made some opening remarks to start the meeting at 8:45 a.m.

Jon recapped the executive session.

Question: Is there a document available?

Ken suggested prioritizing the agenda: handle items 1 and 2 (group dimensionality and high-level partitioning descriptors) before returning to multiple buffering and flow control.

James starts off the process set (group) dimensionality discussion:

[ EDITORIAL NOTE: I don’t think we reached a concrete decision on the policy for dynamically changing the group dimensionality. Most people did agree that providing multiple "views" of a group is not an attractive option. ]

High-level partitioning description discussion:

For clarity, James reviews how a "dist" object is created in the current API:

James introduces a potential problem in combining the ideas of group dimensionality and partitioning

Nathan summarizes approaches to handling high-level partitioning by using group dimensionality

[ EDITORIAL NOTE: The group elected to continue the discussion using Option 2 as the basis. ]

Discussion on the form of the high-level partitioning information (written very loosely like a C struct or union)

ON A PER-DIMENSION BASIS, USER SPECIFIES THE FOLLOWING INFORMATION:

{

    DRI_dist_type dt; (encodes either block, indivisible, or block-cyclic)

    Nprocs being applied to the partitioning in this dimension (default = 0, implementation decides)

    IF dt specifies a block partitioning:

        Minimum acceptable size of local partitioning (default = 0)

        "modulo" requirement (local size should be a multiple of this # of elements) – (default is 1)

        Left overlap specification (see below for details)

        Right overlap specification

    IF dt specifies an indivisible axis (no partitioning):

    IF dt specifies a block-cyclic partitioning:

        Cyclic block size (default size = 1)

        Left overlap (yikes!)

        Right overlap

}
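As a rough illustration only, the per-dimension information listed above could be written as a real C struct with a union along the following lines. Only the name DRI_dist_type and the listed fields and defaults come from the discussion; every other identifier (the enumerator names, DRI_overlap_spec, DRI_dim_partition, dri_dim_partition_defaults) is hypothetical and is not part of any agreed API.

```c
#include <assert.h>
#include <string.h>

/* Distribution kinds named in the discussion; enumerator names are hypothetical. */
typedef enum {
    DRI_DIST_BLOCK,
    DRI_DIST_INDIVISIBLE,
    DRI_DIST_BLOCK_CYCLIC
} DRI_dist_type;

/* Edge policies named in the overlap discussion; names are hypothetical. */
typedef enum {
    DRI_OVERLAP_PAD,
    DRI_OVERLAP_TRUNCATE,
    DRI_OVERLAP_TOROIDAL
} DRI_overlap_type;

/* Hypothetical overlap descriptor: position count plus edge policy. */
typedef struct {
    int              positions;   /* default 0   */
    DRI_overlap_type type;        /* default pad */
} DRI_overlap_spec;

/* Hypothetical per-dimension partitioning descriptor. */
typedef struct {
    DRI_dist_type dt;
    int           nprocs;         /* 0 = implementation decides */
    union {
        struct {                  /* used when dt is block */
            int              min_local_size;  /* default 0 */
            int              modulo;          /* default 1 */
            DRI_overlap_spec left, right;
        } block;
        struct {                  /* used when dt is block-cyclic */
            int              block_size;      /* default 1 */
            DRI_overlap_spec left, right;
        } cyclic;
        /* indivisible: the minutes list no additional fields */
    } u;
} DRI_dim_partition;

/* Fill one descriptor with the defaults recorded in the minutes. */
static DRI_dim_partition dri_dim_partition_defaults(DRI_dist_type dt)
{
    DRI_dim_partition p;
    DRI_overlap_spec none = { 0, DRI_OVERLAP_PAD };

    assert(dt == DRI_DIST_BLOCK || dt == DRI_DIST_INDIVISIBLE ||
           dt == DRI_DIST_BLOCK_CYCLIC);

    memset(&p, 0, sizeof p);
    p.dt = dt;
    p.nprocs = 0;
    if (dt == DRI_DIST_BLOCK) {
        p.u.block.min_local_size = 0;
        p.u.block.modulo = 1;
        p.u.block.left = none;
        p.u.block.right = none;
    } else if (dt == DRI_DIST_BLOCK_CYCLIC) {
        p.u.cyclic.block_size = 1;
        p.u.cyclic.left = none;
        p.u.cyclic.right = none;
    }
    return p;
}
```

Under this sketch, a user describing a two-dimensional gdo would fill an array of two such descriptors, one per dimension, and override only the fields that matter (for example, setting modulo in the block case).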

OVERLAP DETAIL (RECALL, THIS IS SPECIFIED ON A PER-DIMENSION BASIS FOR BLOCK DISTRIBUTIONS)

{

    Number of positions (default = 0)

    Type (either pad, truncate, or toroidal) – (default = pad)

    Overlap type specifies the policy to implement at the "edges" of the gdo

        Pad: pad either with a pad value (e.g., zero), or with replicated data from the last local position

        Truncate: do not allocate additional space in the local memory for overlap storage

        Toroidal: Store a copy of the "adjacent" processor’s data (adjacency wraps around to the processor that owns the data on the opposite "edge" of the gdo)

}
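To make the three edge policies concrete, here is a minimal C sketch of how an implementation might decide how much overlap storage to allocate and where each overlap element comes from. All names (overlap_type, overlap_alloc, overlap_source_index, SRC_PAD, SRC_NONE) are hypothetical, and the sketch assumes a one-dimensional gdo indexed from 0 and the pad-value flavor of padding; the minutes do not specify any of this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three edge policies in the minutes. */
typedef enum { OVERLAP_PAD, OVERLAP_TRUNCATE, OVERLAP_TOROIDAL } overlap_type;

#define SRC_PAD  (-1L)  /* element is filled with the pad value          */
#define SRC_NONE (-2L)  /* element does not exist (no storage allocated) */

/*
 * Number of extra elements allocated on one side of a local partition.
 * Interior sides always get the requested positions; at a gdo edge,
 * "truncate" allocates nothing.
 */
static size_t overlap_alloc(size_t positions, overlap_type type, int at_gdo_edge)
{
    if (at_gdo_edge && type == OVERLAP_TRUNCATE)
        return 0;
    return positions;
}

/*
 * Map a (possibly out-of-range) global index g to the global index that
 * supplies its data. In-range indices map to themselves; out-of-range
 * indices follow the edge policy.
 */
static long overlap_source_index(long g, long global_size, overlap_type type)
{
    assert(global_size > 0);
    if (g >= 0 && g < global_size)
        return g;
    switch (type) {
    case OVERLAP_TOROIDAL:
        /* wrap around to the opposite edge of the gdo */
        return ((g % global_size) + global_size) % global_size;
    case OVERLAP_TRUNCATE:
        return SRC_NONE;
    case OVERLAP_PAD:
    default:
        return SRC_PAD;
    }
}
```

For a gdo of 10 elements, the toroidal policy in this sketch fetches global index -1 from element 9 and global index 10 from element 0, while pad and truncate report that no remote data is needed.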

Issue: do we still allow the user to specify a partitioning based on a low-level (e.g., four-tuple for the block-partitioning case) description ONLY?

 

The issue of supporting SPE-like behavior (where a source process doesn't need to specify all the destination processes) came up again. After some discussion, the group decided that our current interface would need a global commit to implement this, and the consensus was that no global commit was desired. We may try to implement this functionality later by requiring a "registry" of desired connections among all processes at system startup.

Tom brings up a deficiency in the current API:

Discussion on what parts of the API involve "collective" communication among groups of processes

Discussion of local memory "layout" object

 

Dennis: could use stride specification in memory layout to facilitate flow-control

 

Revisit where the layout object should be specified in the chain of calls to establish a data transfer:

[EDITORIAL NOTE: Now that we have required the memory layout to be described before dist_create time, the user cannot specify explicit strides (because the user doesn't know what the exact size of the local data buffer will be).]

 

Multiple buffering discussion

At the end of the meeting, we talked about overlapping buffers in memory when two sides of a transfer are the same process. The most prominent subset of this case is the in-place reuse of buffers in clique data reorganizations. The group decided to disallow this, with the idea that we could relax this restriction later.
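One way an implementation might enforce this restriction is a simple address-range check when both sides of a transfer live in the same process. The sketch below is only an assumption about how such a check could look; the function name buffers_overlap is hypothetical, and it treats each buffer as one contiguous byte range (strided layouts would need a finer test).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if the byte ranges [a, a + a_bytes) and [b, b + b_bytes)
 * share at least one byte, 0 otherwise. Zero-length buffers never overlap.
 * The comparison is written with differences to avoid address overflow.
 */
static int buffers_overlap(const void *a, size_t a_bytes,
                           const void *b, size_t b_bytes)
{
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;

    if (a_bytes == 0 || b_bytes == 0)
        return 0;
    if (a0 < b0)
        return (b0 - a0) < a_bytes;   /* b starts inside a */
    return (a0 - b0) < b_bytes;       /* a starts inside b (or same start) */
}
```

A same-process transfer whose send and receive buffers make this check return 1, such as an in-place clique reorganization, would be rejected under the current decision.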

Nathan's comment on the collective nature of the transfer_connect call: