December 1999 Data Reorganization Forum meeting minutes
 

Attendance:

 
Name               Organization
Dennis Cottel      SPAWAR (host)
Keith Bromley      SPAWAR
Ken Cain           The MITRE Corporation
Zenquian Cui       MPI Software Technology
Jon Greene         Mercury Computer
Randy Judd         SPAWAR
Arkady Kanevsky    Mercury Computer
Rodric Rabbah      NYU
Rick Steinberger   Lockheed Martin GES
Sonetra Wilburn    Litton TASC

Agenda:

1. Review September 1999 meeting's minutes, call for additional agenda items
2. Read through the API document, comments by committee members
3. Buffer management specification for the 'do' call needed - currently ambiguous
4. Late-binding feasibility discussion
5. Form of the final document presentation
6. A way to specify low-level data partitionings (i.e., distspecs) is needed
7. A way to specify detailed memory layout objects is needed
8. Object management / reference counting
9. Rename acquire/insert/extract/release to something more clear?
10. Buffer management scenarios for acquire/insert/extract/release
11. Error status codes and how to get detailed reporting of errors
12. Integration of DR with MPI and MPI/RT
13. Discussion of a plan for finalizing the committee's work
14. Miscellaneous topics covered

* Items 3 - 7 were identified by forum members during the September 1999 meeting. Items 8 and higher are new issues; any further additions to the list should be raised at the beginning of the meeting.
 

Summary (mostly based on planned agenda items):

Agenda item #1. Review September 1999 meeting minutes

Minutes reviewed and accepted informally by the group
 

Agenda item #2. Review September 1999 version of API

[EDITORIAL NOTE: Parts of this API review are more naturally organized into the other planned agenda item sections. This section, therefore, describes only the issues raised that do not already fit into one of the planned agenda categories. Some of the most significant changes to the API occurred as a result of the "DRI_part and bounds_t" discussion.]
 

Topic: "query routines":

Jon expressed an interest in having only 1 query routine instead of a query routine per attribute. The argument for this is to reduce the number of interfaces we specify. Taking this a step further, we could forego specification of query routines in general, opting instead to specify only the "critical path" functions (i.e., those that are crucial to setting up and executing a data reorganization). Ken noted likely portability problems that would result from not specifying anything.
 

Topic: DRI_group accessor functions -  naming convention:

Arkady noted that the call should be named dri_group_get_size(), following the convention of our other query/accessor routines (get_attributename).
The same should be true for dri_group_myrank (corrected to be dri_group_get_myrank).
 

Topic: dri_group_get_myrank() call:

Jon mentioned that this function should return a -1 when the caller is not a member of the DRI_group object being queried. This erroneous condition could occur in pipelined data-parallel applications, in which each process must have a DRI_group object referring to both process groups associated with the data reorganization to be performed.
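The behavior Jon described can be sketched as follows. This is only an illustration: the sketch_group type and the sketch_group_get_myrank signature are hypothetical stand-ins, since the API draft's actual DRI_group representation is not reproduced in these minutes.

```c
#include <stddef.h>

/* Hypothetical stand-in for a DRI_group: a list of global process ranks. */
typedef struct {
    const int *members;  /* global ranks that belong to the group */
    size_t     size;     /* number of members */
} sketch_group;

/* Returns the caller's rank within the group (its position in the member
 * list), or -1 when the caller is not a member of the group. */
int sketch_group_get_myrank(const sketch_group *g, int my_global_rank)
{
    for (size_t i = 0; i < g->size; i++) {
        if (g->members[i] == my_global_rank)
            return (int)i;
    }
    return -1;
}
```

In a pipelined application, a producer process holding a group object for the consumer processes would get -1 from this query, and could use that to tell which side of the data reorganization it is on.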
 

Topic: dri_distspec_create() calls (block and blockcyclic):

Arkady expressed concern about how dri_distspec_create_block() and dri_distspec_create_blockcyclic() are presented together in the API draft. The function arguments are all listed together, but some of them are common to both routines and others are unique to one routine. He called for a separate presentation of each function to simplify the description.

Jon would like to remove the word distspec from the function names. This would go against the dri_objectname_action function naming convention that we have informally adopted to this point. Perhaps a shorter object name for DRI_distspec is in order?
 

Topic: dri_layout_create() call:

Group discovered a bug in the example presented in the discussion associated with this call in the API draft. Ken will fix it in the next version.

Jon expressed an interest in setting up default DRI_layout objects that cover the most common cases and could be referred to by name, instead of making the user call dri_layout_create. Defaults should cover all possible arrangements of 1, 2, or 3-dimensional densely-packed data sets. This call could be considered out of the "critical path" of calls needed to set up most data reorganizations.
 

Topic: dri_dist_create() call:

Jon mentioned that the group_dims argument should be optional - the user should be able to leave the logical process set dimensionality details up to the interface.

The following question was raised: if the user specifies logical process set dimensionality with the group_dims argument, then is the implementation forced to follow that guidance? The group decided that, as long as the group_dims argument is correct, this parameter should be respected by the interface as it performs the data partitioning calculations.

Cui suggested that the order of entries in the group_dims array parameter needs to be clarified in the API draft discussion. A good way to do this is to specify that the group_dims order corresponds directly to the order supplied in the dimsizes array parameter to the dri_global_data_create function.

Arkady would like to have additional explanation in the RESTRICTIONS/POLICY section for this call in the API document. Users should be aware that this call may not be collective, and therefore could calculate an incorrect partitioning unless all processes call this function with the same parameters. He also suggested that the discussion make clear that the API guarantees that every global element is assigned to a process - but that not every process is guaranteed to be assigned data in a partitioning. These suggestions will be incorporated into the API draft.
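A small worked case shows why a process can end up with no data. The sketch below assumes a plain block partitioning in which each process receives a contiguous block of ceil(n / nprocs) elements; that rule and the function name are assumptions made for illustration, not something taken from the API draft.

```c
/* Number of elements owned by `rank` when `n` elements are split into
 * contiguous blocks of ceil(n / nprocs) elements (an assumed rule). */
int sketch_block_count(int n, int nprocs, int rank)
{
    int block = (n + nprocs - 1) / nprocs;  /* ceiling division */
    int start = rank * block;               /* first global index owned */
    if (start >= n)
        return 0;                           /* this process gets no data */
    int end = start + block;
    if (end > n)
        end = n;                            /* last block may be short */
    return end - start;
}
```

With 5 elements and 4 processes, the block size is 2 and the per-process counts are 2, 2, 1, and 0: every element is assigned, but the last process owns nothing. Because each process computes this locally, the result is only consistent if all processes pass the same parameters.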

Another issue was raised in the DRI_dist discussion: we need to have a DRI_DIST_INDIVISIBLE pre-defined object that indicates no partitioning should be applied to a dimension of global data. Ken indicated that this got removed accidentally from the API draft and will be reinstated in the next revision.
 

Topic: DRI_part and bounds_t:

It was noted that the DRI_part object no longer provides value in this API. The only purpose it serves is to be queried for the low-level partitioning information resulting from a dri_dist_create call (yielding a bounds_t structure). A better design would be to query the DRI_dist object directly for this information. Eliminating DRI_part means that many accessor functions need to be renamed so that they are prefixed with dri_dist instead of dri_part.

It was further discovered that we do not have all of the necessary information stored in the bounds_t structure and that we need to rethink how the low-level partitioning information is provided to the user. An example of "missing information" is the (multi-dimensional) index into the local buffer where the "first owned element" of data can be found (esp. relevant when dealing with overlapped data partitionings). The API draft has been updated as follows (see the API draft document for details):

A function named dri_dist_get_blockinfo() returns detailed information about a (single) locally owned block of data in a structure of type DRI_blockinfo. Because this function only returns detailed partitioning information for a single block, multiple calls are needed when querying for the details of a block-cyclic partitioning. This is consistent with the group's prior thinking on block-cyclic partitionings (because this approach is more memory-efficient than returning the partitioning details of every local block). A new function named dri_dist_get_numblocks has been added so that users can query for the number of local blocks resulting from a block-cyclic partitioning.

The DRI_blockinfo structure contains two types of information:

  1. Partitioning information specific to each dimension (e.g., left and right overlap, block length, stride, ...)
  2. Other information about the block not specific to a single dimension (e.g., element size in bytes)

The per-dimension partitioning information is stored in a structure of type DRI_blockdim.
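To make the one-block-per-call query pattern concrete, here is a minimal sketch for a one-dimensional block-cyclic partitioning. All type names, field names, and signatures (sketch_blockdim, sketch_blockinfo, sketch_dist, and the two functions) are hypothetical; the authoritative definitions of DRI_blockinfo and DRI_blockdim are in the API draft.

```c
#include <stddef.h>

#define SKETCH_MAX_DIMS 3

/* Per-dimension details (assumed fields, modeled on the list above). */
typedef struct {
    int global_start;   /* global index of the first owned element */
    int length;         /* number of owned elements in this block */
    int left_overlap;
    int right_overlap;
    int stride;
} sketch_blockdim;

/* Block-wide details plus one sketch_blockdim per dimension. */
typedef struct {
    int             ndims;
    size_t          elem_size;  /* element size in bytes */
    sketch_blockdim dims[SKETCH_MAX_DIMS];
} sketch_blockinfo;

/* Assumed description of a 1-D block-cyclic partitioning for one process. */
typedef struct {
    int n, block, nprocs, rank;
    size_t elem_size;
} sketch_dist;

/* Number of blocks owned locally: global blocks rank, rank+nprocs, ... */
int sketch_dist_get_numblocks(const sketch_dist *d)
{
    int total = (d->n + d->block - 1) / d->block;  /* blocks overall */
    if (d->rank >= total)
        return 0;
    return (total - d->rank + d->nprocs - 1) / d->nprocs;
}

/* Fills *out for a single local block; returns 0 on success, -1 on a bad index. */
int sketch_dist_get_blockinfo(const sketch_dist *d, int local_block,
                              sketch_blockinfo *out)
{
    if (local_block < 0 || local_block >= sketch_dist_get_numblocks(d))
        return -1;
    int start = (d->rank + local_block * d->nprocs) * d->block;
    int len = (d->n - start < d->block) ? d->n - start : d->block;
    *out = (sketch_blockinfo){0};
    out->ndims = 1;
    out->elem_size = d->elem_size;
    out->dims[0] = (sketch_blockdim){ start, len, 0, 0, 1 };
    return 0;
}
```

A caller would first ask for the number of local blocks and then loop over them, holding only one sketch_blockinfo at a time, which is the memory-efficiency argument made above.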
 

[EDITORIAL NOTE: We need some guidance from members of the group who have implemented block-cyclic based applications. Does the user need the ability to specify the order of access to the local blocks? Also, do we need a multi-dimensional "block number" index when referring to locally owned blocks, or is a single block number value sufficient?]

Another decision was made to simplify the API in the following way: a process can only query the interface for its own low-level partitioning information. Library implementations will obviously have per-processor partitioning information stored internally, but it is unlikely that this information needs to be exposed to the application.

Topic: Buffer management:

NOTE: dri_part_calc_local_size function has been renamed to dri_dist_calc_local_size

The group agreed that (for now), the dri_dist_calc_local_size function should return the size of a buffer that can hold all local data blocks. This is because, at this time, the interface only supports "monolithic" data reorganizations.

The group decided to create a datatype for a "buffer set", instead of using array data types.
 

Topic: Transfers:

[EDITORIAL NOTE: The group decided to change the name of the DRI_transfer object to DRI_channel. This document will use the new DRI_channel name to describe this object. Also note that the decision was made to simplify the way that data reorganizations are called in the API, replacing acquire/insert/extract/release with a simpler put/get approach. See the notes on this topic in the agenda item #9 summary.]

Group revisited the issue of whether we need both the dri_channel_create and dri_channel_connect calls, or whether the API can be consolidated to have only a connect call. The reason for having both calls is to allow (in the future) more than two process groups to participate in a commonly-named data reorg (e.g., 1 input process group, 2 output process groups). The dri_channel_create call would then act as a "registration" tool. Assuming that a barrier synchronization is placed between the channel create and connect stages, this type of data reorganization could be supported. Group decided to leave the API in its current form so that implementations may provide this feature.

Agenda item #3. Buffer management with the "do" call

This topic was merged into the discussion for agenda item #10

Agenda item #4. Late-binding feasibility discussion

There was some thought that a late binding capability could be folded into the September 1999 version of the API by modifying the "do" call to accept a user-supplied pointer. The idea is that the user should be able to specify exact memory buffer locations at the time that a data reorganization is initiated (instead of the current approach in which memory is either managed entirely by the interface or user-managed memory is "registered" with the interface prior to the start of an associated data reorganization).

In agenda items 9 and 10, the group decided to forego the "do" call (tentatively) until the "put/get" interfaces have been well defined and proven (there have been substantial changes to the way that data reorgs are initiated in the API). It still remains to be seen whether late-binding functionality can be folded into the API.  The majority of attendees agreed, however, that late-binding functionality is not a top priority.
 

Agenda item #5. Final document form

The group agreed that the API should be organized into at least two sections, the first of which introduces the high-level "productivity-oriented" calls that will find the most use. The second section shall introduce alternative software interfaces that allow the caller much more explicit control over how the data reorganization library works (e.g., in controlling data partitioning and local data layouts in memory).

A list of the critical calls was identified by the group:

dri_global_data_create
dri_group_create
    dri_overlap_create (optional - max of 2 calls per dimension)
dri_distspec_block_create (per dimension)
dri_distspec_blockcyclic_create (per dimension)
    dri_layout_create (optional - 1 call)
dri_dist_create
    dri_dist_get_numblocks
    dri_dist_get_blockinfo
dri_bufferset_create
dri_bufferset_create_user
dri_channel_create_send
dri_channel_create_recv
dri_channel_create_sendrecv
dri_channel_connect
dri_channel_put
dri_channel_get
 

The broader "final document" form was also discussed, and the following rough list of topics was generated by the group:

Regarding co-layering with other standards: The group agrees that an MPI co-layering specification is most important to specify in the document, and is in scope for this group. MPI/RT co-layering with Data Reorg can be best performed by that body.
 

Agenda item #6: Low-level data partitioning control

There was a short discussion on this topic just before the end of the meetings.

Here is a proposed call to generate a user-specified DRI_distspec object.

dri_distspec_create (nblocks, extents[], lov, rov)

extents is a structure: { int start, int extent }; (DRI_extent type name?)
lov and rov are DRI_overlap objects, just like in the other calls
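A possible shape for the extents argument is sketched below. The sketch_extent type follows the { start, extent } structure proposed above, while the coverage check is purely an assumption: the minutes do not say what validation, if any, dri_distspec_create would perform, or whether gaps between blocks would be allowed.

```c
/* Hypothetical rendering of the proposed extents element. */
typedef struct {
    int start;   /* global index of the first element in the block */
    int extent;  /* number of elements in the block */
} sketch_extent;

/* Returns 1 if the nblocks extents are in order, contiguous, and together
 * cover the index range [0, dim_size) exactly; returns 0 otherwise.
 * This check is an illustrative assumption, not part of the proposal. */
int sketch_extents_cover(const sketch_extent *ext, int nblocks, int dim_size)
{
    int next = 0;  /* next global index that must be covered */
    for (int i = 0; i < nblocks; i++) {
        if (ext[i].start != next || ext[i].extent <= 0)
            return 0;
        next += ext[i].extent;
    }
    return next == dim_size;
}
```

For example, three blocks { 0, 3 }, { 3, 5 }, { 8, 2 } cover a dimension of size 10, whereas replacing the second block with { 4, 4 } leaves global index 3 unassigned.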
 

Agenda item #7: Low-level specification of local memory layouts

No discussions were held on this topic because of time constraints.
 

Agenda item #8: Object and memory management / reference counting

What happens to user-supplied data (e.g., arrays) when it is passed as an input argument into a DR function (copied or referenced)? Copying protects the library implementation from bad user code that accidentally overwrites the original input data.

What is returned to the user when a DR library function returns a structure to the user (a pointer/ reference to a structure stored internally by the library, or a copy of that structure)? An example is the DRI_blockinfo structure described earlier. Pointers would be problematic, with erroneous applications potentially corrupting memory in library-space. Using a reference to an object would provide some safety, but then would require query functions for every piece of information stored in the object.
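The two return styles under discussion can be contrasted with a minimal sketch. The sketch_info structure and both accessor functions are invented for illustration and do not correspond to calls in the API draft.

```c
/* Hypothetical structure standing in for something like DRI_blockinfo. */
typedef struct {
    int length;
    int left_overlap;
    int right_overlap;
} sketch_info;

/* State held privately by the (mock) library. */
static sketch_info sketch_lib_state = { 128, 2, 2 };

/* Copy style: the caller receives its own copy, so later writes by the
 * application cannot disturb the library's internal state. */
void sketch_get_info_copy(sketch_info *out)
{
    *out = sketch_lib_state;
}

/* Pointer style: no copy is made, but the application now holds an address
 * inside library-owned memory. The const qualifier discourages writes but
 * cannot prevent an erroneous program from casting it away. */
const sketch_info *sketch_get_info_ptr(void)
{
    return &sketch_lib_state;
}
```

With the copy style, an application that overwrites a field in its copy sees the original value again on the next query; with the pointer style, the same mistake (after a cast) would corrupt the library's data.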

Many group members prefer no change to the current specification. Users must (of course) write correct programs in order to get reasonable results.
 

Agenda item #9: Rename acquire/insert/extract/release series of data reorg calls

The group addressed the current complexity in the interface for invoking data reorganizations and managing access to data buffers. Four DRI_transfer (now DRI_channel) object functions (acquire/insert/extract/release) currently provide this capability. Many in the group prefer a simpler get/put based syntax, and addressed whether it is possible to adopt that syntax.

A quick review of what motivated the current API design is in order. In a SPMD program, each process must:

We previously called these four operations, respectively, acquire/insert/extract/release. Acquire and release referred to data buffer access operations, while insert and extract referred to parallel communication operations. These four operations cannot be expressed with the get/put syntax as the interface currently stands (that is, not without some redesign).

During the meeting, the group successfully redesigned the interface  to support the universal use of get/put to accomplish either clique/SPMD or pipeline data reorganizations:

For pipeline data reorgs:

No change is necessary in the setup. dri_channel_create_send or dri_channel_create_recv is called, followed by a dri_channel_connect.  A single DRI_channel object results, and is used in subsequent dri_channel_get and dri_channel_put calls according to the following table:
 
 
Sender (producer):
  dri_channel_put: Send a full (produced) data buffer "downstream" in the pipeline.
  dri_channel_get: Get an available buffer for producing data (to be used in a subsequent put operation).

Receiver (consumer):
  dri_channel_put: Return a consumed buffer back to the library (to be used in a subsequent get operation).
  dri_channel_get: Receive a data buffer from the upstream processes in the pipeline for consumption (processing).
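The role-dependent meaning of put and get can be illustrated with a single-process mock. Everything here is hypothetical: the sketch_channel type, the role argument, and the two internal queues are stand-ins for whatever the library does internally, and a real dri_channel_get would presumably block rather than return NULL when no buffer is available.

```c
#include <stddef.h>

#define SKETCH_NBUF   2   /* buffers in the mock channel */
#define SKETCH_BUFLEN 4   /* elements per buffer */

typedef enum { SKETCH_SENDER, SKETCH_RECEIVER } sketch_role;

typedef struct {
    int  storage[SKETCH_NBUF][SKETCH_BUFLEN];
    int *empty_q[SKETCH_NBUF]; int n_empty;  /* buffers ready to be filled */
    int *full_q[SKETCH_NBUF];  int n_full;   /* buffers "sent downstream" */
} sketch_channel;

void sketch_channel_init(sketch_channel *c)
{
    c->n_empty = SKETCH_NBUF;
    c->n_full = 0;
    for (int i = 0; i < SKETCH_NBUF; i++)
        c->empty_q[i] = c->storage[i];
}

/* Remove and return the oldest buffer in a queue, or NULL if it is empty. */
static int *sketch_pop(int **q, int *n)
{
    if (*n == 0)
        return NULL;
    int *buf = q[0];
    for (int i = 1; i < *n; i++)
        q[i - 1] = q[i];
    (*n)--;
    return buf;
}

/* Append a buffer to a queue; silently ignored if the queue is full. */
static void sketch_push(int **q, int *n, int *buf)
{
    if (*n < SKETCH_NBUF)
        q[(*n)++] = buf;
}

/* Sender: obtain an empty buffer to fill. Receiver: obtain a full buffer. */
int *sketch_channel_get(sketch_channel *c, sketch_role role)
{
    return role == SKETCH_SENDER ? sketch_pop(c->empty_q, &c->n_empty)
                                 : sketch_pop(c->full_q, &c->n_full);
}

/* Sender: send a full buffer downstream. Receiver: return a consumed buffer. */
void sketch_channel_put(sketch_channel *c, sketch_role role, int *buf)
{
    if (role == SKETCH_SENDER)
        sketch_push(c->full_q, &c->n_full, buf);
    else
        sketch_push(c->empty_q, &c->n_empty, buf);
}
```

In this mock, a producer loop is get, fill, put, and a consumer loop is get, process, put; the same two verbs serve both ends of the pipeline.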

For clique/SPMD reorgs:

In this case, the user needs to create two channel objects, one to manage the send buffers (via dri_channel_create_send), and another to manage the receive buffers (via dri_channel_create_recv). This has implications on the channel "connect" process. There needs to be a new function that accepts both channel objects as input arguments: dri_channel_connect_sendrecv is that function. In order to accomplish the communications for a data reorg, the program must call both dri_channel_put and dri_channel_get (in that order). The get and put calls work in the same way as described above with respect to acquiring buffers from and releasing buffers to the library.
 

Agenda item #10: Think through various buffer management scenarios

Considerable discussion about whether buffer sets could be shared between two different channels. This capability would allow buffers to be treated as a receive buffer, used for in-place processing, and then used as a send buffer in a subsequent data reorg (served by a different channel object). We want this capability in both pipeline and clique/SPMD programs.

Group agreed that it is possible to share a bufferset between two channel objects by keeping some state information inside the bufferset object.
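One way such state information might look is sketched below. The minutes do not record which states the group had in mind, so the three states, the sketch_bufferset type, and the function names are all assumptions made for illustration.

```c
#define SKETCH_SET_SIZE 2  /* buffers in the mock bufferset */

/* Assumed per-buffer states for a bufferset shared by two channels. */
typedef enum {
    SKETCH_BUF_IDLE,         /* available to the receive-side channel */
    SKETCH_BUF_HELD_BY_APP,  /* received; application processes in place */
    SKETCH_BUF_SENDING       /* handed to the send-side channel */
} sketch_buf_state;

typedef struct {
    sketch_buf_state state[SKETCH_SET_SIZE];
} sketch_bufferset;

/* Receive-side "get": IDLE -> HELD_BY_APP. Returns buffer index or -1. */
int sketch_shared_recv_get(sketch_bufferset *bs)
{
    for (int i = 0; i < SKETCH_SET_SIZE; i++) {
        if (bs->state[i] == SKETCH_BUF_IDLE) {
            bs->state[i] = SKETCH_BUF_HELD_BY_APP;
            return i;
        }
    }
    return -1;
}

/* Send-side "put": HELD_BY_APP -> SENDING. Returns 0, or -1 if the
 * buffer is not currently held by the application. */
int sketch_shared_send_put(sketch_bufferset *bs, int i)
{
    if (i < 0 || i >= SKETCH_SET_SIZE ||
        bs->state[i] != SKETCH_BUF_HELD_BY_APP)
        return -1;
    bs->state[i] = SKETCH_BUF_SENDING;
    return 0;
}

/* Send completion: SENDING -> IDLE, making the buffer receivable again. */
int sketch_shared_send_done(sketch_bufferset *bs, int i)
{
    if (i < 0 || i >= SKETCH_SET_SIZE ||
        bs->state[i] != SKETCH_BUF_SENDING)
        return -1;
    bs->state[i] = SKETCH_BUF_IDLE;
    return 0;
}
```

Because the state lives in the bufferset rather than in either channel, both channels can consult it to decide whether a given buffer may be handed out, which is what lets a received buffer be reused in place as a send buffer.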

Agenda item #11: Error status codes in the API

Not covered in this meeting

Agenda item #12: Integration of Data Reorganization with MPI, MPI/RT

No progress in defining specific interfaces during this meeting. Group did agree that MPI co-layering is the first priority and should be included in our final product. MPI/RT co-layering is something that should be taken up by that standards definition body.

The group agreed on the concept that the interfaces we define as part of this activity should be flexible enough to interoperate directly (via co-layering) with basic MPI and MPI/RT infrastructure (e.g., data types, communicators/groups, channels). Ken argued for a minimum "standalone" specification for these areas of the data reorganization API to allow for completely layered implementations of the API on top of other libraries.

Agenda item #13: Game plan for finalizing this group's work

A final document (see earlier notes on content ideas) and a meta-API specification are planned. Ken argued that a layered reference implementation on top of MPI would provide proof-of-concept validation of the software interfaces developed here. No concrete date for completion was set at this meeting, although some suggested HPEC 2000 as a reasonable goal.

Agenda item #14: Miscellaneous topics


***** Topic #1: adding start/stop activate/deactivate-like calls to a DRI_channel

Jon: motivation is to get back to a "primed" state following some activity. The application would call a stop-like function to turn off the ability to transfer data on that channel object and then, at some later time, resume with a "primed" channel.

Jon: suggested a single reset call that is collective: effectively a barrier, plus resetting the DMA engines, then maybe another barrier, then return to the application.

Dennis: how does  a reset propagate through a pipeline? Race condition?

Dennis: there are already ways to do control / synchronization in MPI & MPI/RT with barriers

Decision: let's defer on this issue - MPI/RT forum may want to address whether it has the facilities already in its channel abstraction to deal with this.

***** Topic #2: Non-blocking dri_channel_get
Jon: suggestion is to eventually add a non-blocking get, and a test operation to the interface