Minutes from the September 21, 1999 meeting of the Data Reorganization Forum

Location: MIT Lincoln Laboratory, Lexington, MA
 

Attendance:
 

Individual              Organization

James Lebak, host       MIT Lincoln Laboratory
Ed Rutledge             MIT Lincoln Laboratory
Tom McClean             Lockheed Martin GES
Arkady Kanevsky         Mercury Computer Systems
Dennis Cottel           SPAWARSYSCEN, S.D.
Ken Cain                The MITRE Corporation
Randy Judd              SPAWARSYSCEN, S.D.
Karen Lauro             Mercury Computer Systems
Nathan Doss             Lockheed Martin GES
Myra Prelle             Mercury Computer Systems
Jon Greene              Mercury Computer Systems

Summary Section


Recall our convention as defined after our June 1999 meeting in Moorestown.  There are 2 types of "multiple buffering" that we are considering in this forum:
 

[ EDITORIAL NOTE: This document will refer to the first approach as "multiple buffering", and the second approach as "flow control". There is some consensus for this based on the forum’s discussions to date, so let’s adopt this language for clarity in future communications ]

In the September meeting, "policy" was decided on 4 key issues. The idea here is that we want to distinguish what features will be provided in the first version of the API vs. later enhancements.

  1. Who allocates memory? The user, the library, or both?
  2. Early vs. late binding: do we support early-only (our assumption to date), or both early and late binding?
  3. Multiple buffering - do we allow it in the preliminary version of the API?
  4. Finer-grained "flow control" - do we allow it in the preliminary version of the API?

Resolutions:
1. We decided to support both user-allocated buffers and buffers allocated in "special" memory through a call to the data reorganization interface.
2. Ken will do a feasibility study to see whether late binding can be added to our existing API (probably in the "do" call that executes a transfer).
3. We will support multiple buffering in the first version of the API, with restrictions. The restrictions stem from the observation that we do not yet have rules governing the order of access to the multiple local buffers associated with a transfer; a few simplifying assumptions would allow a simple buffer access protocol.
4. We will not support flow control in the first version of the API, but will support it in subsequent enhancements. After considerable discussion during the meeting, the group was convinced that the API can evolve to include this feature later. The topic is tabled for now so that we can get out the preliminary version of the API, which will cover 90% of the cases users will want.



Another major thread of discussion was how to handle the need to keep the API simple, yet also serve the power users who want to have explicit control (examples are found in creating local memory layouts, low-level data partitionings, process set dimensionality, etc.).

Resolution:
The group agreed that the API itself could be segmented - the first part of the API presented would be limited to show only the "high level" calls that capture 90% of the cases.  Low-level functions that give the user the most explicit control will be presented in a later section of the API document, and can be referred to in the earlier part of the API presentation.
 
 

Detail Section


Karen: noted that the Object Management Group (OMG) recently issued a request for information about the use of CORBA technology in aggregated computing environments. Some of the responses received to date have a data reorg component, and it would be nice to point the forum members to those responses:

 http://www.omg.org/techprocess/meetings/schedule/Supporting_Aggreg._Computing_RFI.html
 

A variety of discussion topics followed during the first 30-45 minutes of the meeting
 

Major Topic #1: 4 key areas for "policy" decisions

Arkady identifies 4 key areas where "policy" needs to be decided by the members:

  1. Who allocates memory? The user, the library, or both?
  2. Early vs. late binding: do we support early-only (our assumption to date), or both early and late binding?
  3. Multiple buffering - do we allow it in the preliminary version of the API?
  4. Finer-grained "flow control" - do we allow it in the preliminary version of the API?


Issue 1) Who allocates memory?

Jon & Myra: need for a data buffer abstraction is apparent from their recent prototyping activities built on the use of PAS with the LIS DRI API
Group: agrees that we need a call to allocate memory for the user to account for highly-optimized approaches used in the implementation
 

Issue 2) Early vs. late binding: should we allow both?

We have only considered early-binding in the API development to date.
Nathan: would like to preserve MPI's late-binding characteristics for applications that are transitioning from the use of MPI (e.g., specify the pointer to the local data at transfer-time, not in advance when creating the transfer object)

Ken: this is related to an earlier question about whether the current "dri_transfer_do()" call works as written: it does not appear able to tell which of the multiple buffers provided by the user is the "current" buffer for the communication performed by the do call. Perhaps we could keep the "do" call and just add another argument (a pointer to the data to be transferred). Will look into this possibility when rewriting the LIS API.

So, to summarize: we'll look into it.  If it's possible, then we can vote as a group whether to keep late-binding functionality, or require an early-binding-only API.
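To make the early vs. late binding distinction concrete, here is a minimal C sketch of Ken's suggestion to add a data-pointer argument to the "do" call. It is not the LIS definition: the transfer_sketch type, its fields, and both function names are hypothetical, and a local memcpy stands in for the actual communication.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a transfer object; not the LIS definition. */
typedef struct {
    void  *bound_buf;  /* buffer bound when the object was created (early binding) */
    size_t nbytes;     /* bytes moved per transfer */
    void  *dest;       /* stand-in for "the other side" of the transfer */
} transfer_sketch;

/* Early binding: the data pointer was fixed at creation time. */
void transfer_do_early(transfer_sketch *t) {
    assert(t->bound_buf != NULL);
    memcpy(t->dest, t->bound_buf, t->nbytes);
}

/* Late binding: the caller names the buffer at transfer time.
   Passing NULL falls back to the early-bound buffer (an assumed convention). */
void transfer_do_late(transfer_sketch *t, const void *data) {
    const void *src = (data != NULL) ? data : t->bound_buf;
    assert(src != NULL);
    memcpy(t->dest, src, t->nbytes);
}
```

With a signature like the second one, an application migrating from MPI could keep passing its data pointer at transfer time, while early-binding users would pass NULL.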
 

Issue 3) Multiple buffering

Jon: note that there is no buffer ordering policy specified in the API for accessing the array of buffers specified for a given transfer
Jon: seems appropriate to specify restrictions in order to keep multi-buffering capability, but make it simple to implement (e.g., FIFO)
Ken: will look into what those restrictions should be as the API is modified
One suggestion for policy was that for any given transfer, the same buffer "index" had to be used uniformly by all of the processes in the parallel environment.
Dennis: what if different numbers of buffers are supplied by different processes when they set up the transfer object?
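
One way to picture the FIFO restriction Jon mentioned is a counter that walks the buffer array in index order. The sketch below assumes exactly that rule; the buf_ring_sketch type and both function names are hypothetical and do not come from the API.

```c
#include <assert.h>

/* Hypothetical per-process state for one transfer's buffer array. */
typedef struct {
    int  nbuf;       /* number of local buffers supplied by this process */
    long completed;  /* transfers completed so far */
} buf_ring_sketch;

/* FIFO rule: buffers are used in index order, wrapping around. */
int current_buffer(const buf_ring_sketch *r) {
    assert(r->nbuf > 0);
    return (int)(r->completed % r->nbuf);
}

/* Mark the current transfer complete and move on to the next buffer. */
void advance(buf_ring_sketch *r) {
    r->completed++;
}
```

Under this rule, two processes that supply the same number of buffers always agree on the current index. If one supplies 2 buffers and another supplies 3, their indices diverge after the second transfer, which is the situation Dennis's question points at.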
 

Issue 4) Fine-grained flow control

Dennis: outlined 2 arguments "for" having this type of functionality (he did not necessarily endorse these ideas, but stated them to organize the discussion)

  1. Global data object is too large to fit in the aggregate memory, but it would be convenient to write your software as if it could fit, and as if you were really executing 1 large transfer - not the N "piecemeal" transfers that are actually required in order to fit data into the available memory resources
  2. The need to get data "downstream" sooner rather than later (example: process a row, ship it out, process the next row, ship it out, ...). This is in contrast to processing all rows, and then performing a "monolithic" data reorganization operation
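
The bookkeeping that the first argument would like to hide from the user can be sketched in a few lines of C. The function names are hypothetical, and sizes are counted in elements for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Number of piecemeal transfers needed to move `total` elements when only
   `capacity` elements fit in the available memory at once. */
size_t num_pieces(size_t total, size_t capacity) {
    assert(capacity > 0);
    return total / capacity + (total % capacity != 0 ? 1 : 0);
}

/* Element count of piece `i` (0-based); the last piece may be short. */
size_t piece_len(size_t total, size_t capacity, size_t i) {
    size_t start = i * capacity;
    assert(capacity > 0 && start < total);
    size_t left = total - start;
    return left < capacity ? left : capacity;
}
```

Without flow control in the API, the application would loop over these pieces itself and issue one transfer per piece.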

Consensus after some discussion: this type of support is probably at the next "tier" of support that users would want, and should be provided in a future enhanced version of the API. Discussion did take place regarding whether it is technically feasible to add this functionality to our API without "breaking" fundamental components (i.e., we don't want to solidify important parts of the API now and then have to change them when flow control is added later).

The group agreed that the API as currently constituted does not need to support flow control, and that adding such features later will not change the fundamental components of the API.
 
 

Major Topic #2: General discussion of June 1999 version of the Language-Independent Specification (LIS) of the API

Topic 2.1) Specifying process set dimensionality - where?

Nathan: does not like specifying "nprocs" parameter in the dri_distspec_create() function.  Recall that this function is called on a per-dimension basis to specify (at a high level) the type of data partitioning that is requested for the application data on one "side" of a transfer.

Ken: Specifying nprocs associated with a single dimension allows one to effectively specify a process set dimensionality (can provide greater control in the way that the user specifies how data is split up among processes)

Myra: note that the dri_distspec_create() function arguments are all very generic (scalable), except for the nprocs parameter.  This prevents re-use of the resulting object that is created by the call.

Ken: recall from last meeting's minutes that there was a lot of resistance to specifying dimensionality as a parameter to the DRI_group constructor (DRI_group objects are the process set abstraction in our API), so we need to find another place to put this information. The only other candidate is as an argument to the dri_dist_create() function. Will look into whether moving this specification changes things radically or not. There seems to be good reason to do this, since the DRI_dist objects that get created will never be "re-usable" in the sense that they calculate a specific partitioning of a dataset over a specific process set. So, in this case, adding process set dimensions as an argument will not hurt the reusability of a DRI_dist.
 

Topic 2.2) dri_distspec_create() call (generality vs. different instantiations)

dri_distspec_create() currently takes one form, even though it relates to different ways to specify a high-level partitioning of data (block vs. block-cyclic vs. "indivisible")
The problem is that the arguments may not mean anything (or their meaning is less clear) in certain cases
Example: "blksz" parameter doesn't mean anything when you're trying to create a block partitioning specification.

The suggestion is to create multiple forms of this call corresponding to the different types of partitioning (block, block-cyclic).  The calls will still create an object of a single type (DRI_distspec) - the partitioning differences will be maintained inside this opaque object type.

dri_distspec_block_create()
dri_distspec_blockcyclic_create()
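
The following C sketch illustrates the idea of several constructors that all produce a single object type. The minutes do not give argument lists for the two proposed calls, so the names, arguments, and struct fields here are hypothetical; in the API the struct would be opaque to the user.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical internals; in the API this type would be opaque to users. */
typedef enum { DIST_BLOCK, DIST_BLOCKCYCLIC } dist_kind_sketch;

typedef struct {
    dist_kind_sketch kind;
    int blksz;  /* meaningful only for block-cyclic partitioning */
} distspec_sketch;

/* Block partitioning: there is no block-size argument to misinterpret. */
distspec_sketch *distspec_block_create_sketch(void) {
    distspec_sketch *s = malloc(sizeof *s);
    if (s != NULL) {
        s->kind = DIST_BLOCK;
        s->blksz = 0;
    }
    return s;
}

/* Block-cyclic partitioning: the block size is required and must be positive. */
distspec_sketch *distspec_blockcyclic_create_sketch(int blksz) {
    assert(blksz > 0);
    distspec_sketch *s = malloc(sizeof *s);
    if (s != NULL) {
        s->kind = DIST_BLOCKCYCLIC;
        s->blksz = blksz;
    }
    return s;
}
```

Each constructor takes only the arguments that are meaningful for its partitioning type, and later calls can accept either result because both return the same type.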
 

Topic 2.3) In the API spec document, DRI_overlap should be defined before it is referenced (Group agreement)

Topic 2.4) DRI_layout object

Myra suggested that we allow a specification of layout on a per-dimension basis, and that the user can specify both stride and alignment requirements. She noted that some alignment specs across dimensions may actually be faulty, so those would need to be checked at some point in the protocol.

Ken: we'll need to get 2 forms of the dri_layout_create call: one that is specifically for dense (so-called packed) layouts, and one for the more general specification

Some members suggested a mnemonic symbol for some reasonable number of cases (up through 3 dimensions) that allow the user to easily specify dense (packed) local data: (e.g., DRI_LAYOUT_PACKED_XYZ, DRI_LAYOUT_PACKED_ZXY, ....)

Alternative is to create 2 forms of the dri_dist_create() call:

Group consensus was that we should try this approach in the API.
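
For reference, the information that a packed-layout mnemonic such as DRI_LAYOUT_PACKED_XYZ would stand for can be written out as a stride computation. The sketch below is in C and rests on an assumption the minutes do not settle: that the letters list dimensions from slowest-varying to fastest-varying. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compute element strides for a dense (packed) 3-D layout.
   order[0] is the slowest-varying dimension and order[2] the fastest
   (an assumed convention). dims[] and stride[] are indexed by dimension
   (0 = X, 1 = Y, 2 = Z). */
void packed_strides(const size_t dims[3], const int order[3], size_t stride[3]) {
    size_t s = 1;
    for (int k = 2; k >= 0; k--) {
        int d = order[k];
        assert(d >= 0 && d < 3);
        stride[d] = s;   /* distance between neighbors along dimension d */
        s *= dims[d];    /* the next-slower dimension steps over this one */
    }
}
```

Under that assumption, a 2 x 3 x 4 array in "XYZ" order has strides 12, 4, and 1 for X, Y, and Z; the general form of the layout call would let the user supply such strides (and alignment) directly.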
 

Topic 2.5) DRI_gdo object should be renamed to something more straightforward (Group agreement)
 

Topic 2.6) General issues in providing "low-level" vs. "high-level" calls

This has come up in two key areas:

The problem is that in 90% of user cases, the user will want a high-level call to specify the data reorganizations.  Some users will require more sophistication in the API and therefore, we need to support that (within reasonable limits).

Suggestion was made to put all of the "mainstream" calls into an earlier section of the API/document that results from our activity, and to leave the low-level specification functions in a later section. The functions that provide the most powerful specification (and the sometimes unnecessary generality) can be grouped into a section for perusal by people who really need to learn about it.  General agreement by participants in the meeting for this type of API/document organization.
 

Topic 2.7) bounds_t structure (in)efficiency issue for storing details of a block-cyclic partitioning

Nathan: When a user queries a block-cyclic partitioning in the current API, a list of bounds_t structures is returned on a per-dimension basis.  The suggestion is that maintaining this type of information is pretty heavyweight (compared to the very compact "formulaic" representations that exist).
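
The compact "formulaic" representation can be illustrated with the standard block-cyclic index arithmetic for one dimension. This is a sketch in C, not part of the API: the function names are hypothetical, and the layout of bounds_t is not assumed.

```c
#include <assert.h>

/* One dimension of n elements distributed block-cyclically over nprocs
   processes with block size blksz. */

/* Process that owns global index g. */
int bc_owner(long g, int blksz, int nprocs) {
    assert(g >= 0 && blksz > 0 && nprocs > 0);
    return (int)((g / blksz) % nprocs);
}

/* Local index of global index g on its owning process. */
long bc_local_index(long g, int blksz, int nprocs) {
    assert(g >= 0 && blksz > 0 && nprocs > 0);
    return (g / ((long)blksz * nprocs)) * blksz + g % blksz;
}

/* Number of (possibly partial) blocks owned by process p, i.e. how many
   entries an explicit per-dimension bounds list would need for p. */
long bc_num_blocks(long n, int blksz, int nprocs, int p) {
    assert(n >= 0 && blksz > 0 && nprocs > 0 && p >= 0 && p < nprocs);
    long total_blocks = (n + blksz - 1) / blksz;
    return total_blocks / nprocs + (p < total_blocks % nprocs ? 1 : 0);
}
```

The formulas need only blksz and nprocs per dimension, whereas an explicit list grows with the number of blocks each process owns (bc_num_blocks), which is the cost Nathan describes.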