Minutes from the September 21, 1999 meeting of the Data Reorganization Forum
Location: MIT Lincoln Laboratory, Lexington, MA
Attendance:
Individual | Organization
James Lebak, host | MIT Lincoln Laboratory
Ed Rutledge | MIT Lincoln Laboratory
Tom McClean | Lockheed Martin GES
Arkady Kanevsky | Mercury Computer Systems
Dennis Cottel | SPAWARSYSCEN, S.D.
Ken Cain | The MITRE Corporation
Randy Judd | SPAWARSYSCEN, S.D.
Karen Lauro | Mercury Computer Systems
Nathan Doss | Lockheed Martin GES
Myra Prelle | Mercury Computer Systems
Jon Greene | Mercury Computer Systems
Recall our convention as defined after our June 1999 meeting in Moorestown. There are 2 types of "multiple buffering" that we are considering in this forum:
[ EDITORIAL NOTE: This document will refer to the first approach as "multiple buffering", and the second approach as "flow control". There is some consensus for this based on the forum’s discussions to date, so let’s adopt this language for clarity in future communications ]
In the September meeting, "policy" was decided on 4 key issues. The idea here is that we want to distinguish what features will be provided in the first version of the API vs. later enhancements.
Resolutions:
1. We decided to support both user-allocated buffers and buffers allocated in "special" memory by a call to the data reorg. interface
2. Ken will do a feasibility study to see if we can add late-binding to our existing API (probably in the "do" call to execute a transfer)
3. We will support multiple-buffering in the first version of the API with restrictions (restrictions based on the observation that we do not yet have rules to govern the order of access to the multiple local buffers associated with a transfer - we can make some assumptions that make for a simple buffer access protocol)
4. We will not support flow control in the first version of the API, but will support it in subsequent enhancements to the API. After some considerable discussion on this topic during the meeting, the group was convinced that it is possible to evolve the API to include this new feature at a later time. Tabled for now in order to get the preliminary version of the API out that will cover 90% of the cases users will want.
Another major thread of discussion was how to handle the need to keep the API simple, yet also serve the power users who want to have explicit control (examples are found in creating local memory layouts, low-level data partitionings, process set dimensionality, etc.).
Resolution:
The group agreed that the API itself could be segmented - the first part of the API presented would be limited to show only the "high level" calls that capture 90% of the cases. Low-level functions that give the user the most explicit control will be presented in a later section of the API document, and can be referred to in the earlier part of the API presentation.
Karen: noted that there was a recent request for information from the Object Management Group (OMG) about the use of CORBA technology in aggregated computing environments. Some of the responses received to date have a data reorg component, and it would be nice to point the forum members to those responses:
http://www.omg.org/techprocess/meetings/schedule/Supporting_Aggreg._Computing_RFI.html
A variety of discussion topics followed during the first 30-45 minutes of the meeting
Major Topic #1: 4 key areas for "policy" decisions
Arkady identifies 4 key areas where "policy" needs to be decided by the members:
Issue 1) Who allocates memory?
Jon & Myra: need for a data buffer abstraction is apparent from their recent prototyping activities built on the use of PAS with the LIS DRI API
Group: agrees that we need a call to allocate memory for the user to account for highly-optimized approaches used in the implementation
Issue 2) Early vs. late binding: should we allow both?
We have only considered early-binding in the API development to date.
Nathan: would like to preserve MPI's late-binding characteristics for applications that are transitioning from the use of MPI (e.g., specify the pointer to the local data at transfer-time, not in advance when creating the transfer object)
Ken: this is related to an earlier question about whether the current "dri_transfer_do()" call works as written (it seems to not have the ability to tell which of the multiple buffers provided by the user is the "current" buffer to be used in the communication being performed by the do call). Perhaps we could keep the "do" call, and just add another argument (pointer to the data to be transferred). Will look into this possibility when re-writing the LIS API.
So, to summarize: we'll look into it. If it's possible, then we can vote as a group whether to keep late-binding functionality, or require an early-binding-only API.
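The difference between early and late binding discussed under Issue 2 can be illustrated with a minimal sketch. All names below (sketch_transfer, sketch_transfer_create, sketch_transfer_do) are hypothetical stand-ins and are not the LIS signatures; the only idea taken from the discussion is that the "do" call gains one extra argument, a pointer to the data to be transferred. The sketch assumes that passing NULL means "use the buffer bound at creation time".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical transfer object; not the forum's DRI definition. */
typedef struct {
    void  *bound_buf;  /* buffer fixed at creation (early binding), may be NULL */
    size_t nbytes;     /* size of the local portion of the data */
    void  *last_used;  /* buffer chosen by the most recent "do" call */
} sketch_transfer;

/* Early binding: the local buffer is supplied when the object is created. */
sketch_transfer sketch_transfer_create(void *buf, size_t nbytes)
{
    sketch_transfer t = { buf, nbytes, NULL };
    return t;
}

/* "do" call with an extra data-pointer argument (assumed design).
 * data != NULL: late binding, use the pointer given at transfer time.
 * data == NULL: fall back to the buffer bound at creation.
 * Returns 0 on success, -1 if no buffer is available. */
int sketch_transfer_do(sketch_transfer *t, void *data)
{
    assert(t != NULL);
    void *buf = (data != NULL) ? data : t->bound_buf;
    if (buf == NULL)
        return -1;
    t->last_used = buf;  /* a real implementation would communicate here */
    return 0;
}
```

With this shape, an application moving over from MPI could keep passing its data pointer at transfer time, while early-binding users would pass NULL and rely on the buffer given at creation.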
Issue 3) Multiple buffering
Jon: note that there is no buffer ordering policy specified in the API for accessing the array of buffers specified for a given transfer
Jon: seems appropriate to specify restrictions in order to keep multi-buffering capability, but make it simple to implement (e.g., FIFO)
Ken: will look into what those restrictions should be as the API is modified
One suggestion for policy was that for any given transfer, the same buffer "index" had to be used uniformly by all of the processes in the parallel environment.
Dennis: what if different numbers of buffers are supplied by different processes when they set up the transfer object?
Issue 4) Fine-grained flow control
Dennis: outlined 2 arguments "for" having this type of functionality (he did not necessarily endorse these ideas, but stated them to organize the discussion)
Consensus after some discussion: this type of support is probably at the next "tier" of support that users would want. Should be supported in a future enhanced version of the API. Discussion did take place regarding whether it is technically feasible to add this functionality to our API without "breaking" fundamental components (i.e., we don't want to solidify important parts of the API now, and then have to change those parts later when flow-control is added at a later date).
The group agreed that the API as currently constituted does not need to support flow control, and that adding such features later will not change the fundamental components of the API.
Major Topic #2: General discussion of June 1999 version of the Language-Independent Specification (LIS) of the API
Topic 2.1) Specifying process set dimensionality - where?
Nathan: does not like specifying the "nprocs" parameter in the dri_distspec_create() function. Recall that this function is called on a per-dimension basis to specify (at a high level) the type of data partitioning that is requested for the application data on one "side" of a transfer.
Ken: Specifying nprocs associated with a single dimension allows one to effectively specify a process set dimensionality (can provide greater control in the way that the user specifies how data is split up among processes)
Myra: note that the dri_distspec_create() function arguments are all very generic (scalable), except for the nprocs parameter. This prevents re-use of the resulting object that is created by the call.
Ken: recall from last meeting's minutes that there was a lot of resistance to specifying dimensionality as a parameter to the DRI_group constructor (DRI_group objects are the process set abstraction in our API), so we need to find another place to put this information. The only other candidate is as an argument to the dri_dist_create() function. Will look into whether moving this specification changes things radically or not. There seems to be good reason to do this, since the DRI_dist objects that get created will never be "re-usable" in the sense that they calculate a specific partitioning of a dataset over a specific process set. So, in this case, adding process set dimensions as an argument will not hurt the reusability of a DRI_dist.
Topic 2.2) dri_distspec_create() call (generality vs. different instantiations)
dri_distspec_create() currently takes one form, even though it relates to different ways to specify a high-level partitioning of data (block vs. block-cyclic vs. "indivisible")
The problem is that the arguments may not mean anything (or their meaning is less clear) in certain cases
Example: "blksz" parameter doesn't mean anything when you're trying to create a block partitioning specification.
The suggestion is to create multiple forms of this call corresponding to the different types of partitioning (block, block-cyclic). The calls will still create an object of a single type (DRI_distspec) - the partitioning differences will be maintained inside this opaque object type.
dri_distspec_block_create()
dri_distspec_blockcyclic_create()
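The suggestion in Topic 2.2 can be sketched as two constructors that return the same object type, each taking only the arguments that are meaningful for its partitioning. The minutes give no argument lists for the proposed calls, so the sketch_ names, the struct contents, and the parameters below are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { SKETCH_BLOCK, SKETCH_BLOCKCYCLIC } sketch_kind;

/* Hypothetical stand-in for the opaque DRI_distspec type. Users would
 * only hold a pointer; the fields are visible here for illustration. */
typedef struct {
    sketch_kind kind;
    int blksz;  /* meaningful only for block-cyclic; 0 otherwise */
} sketch_distspec;

/* Block partitioning: no block size argument, since it has no meaning. */
sketch_distspec *sketch_distspec_block_create(void)
{
    sketch_distspec *s = malloc(sizeof *s);
    if (s != NULL) {
        s->kind = SKETCH_BLOCK;
        s->blksz = 0;
    }
    return s;
}

/* Block-cyclic partitioning: the block size is required. */
sketch_distspec *sketch_distspec_blockcyclic_create(int blksz)
{
    assert(blksz > 0);
    sketch_distspec *s = malloc(sizeof *s);
    if (s != NULL) {
        s->kind = SKETCH_BLOCKCYCLIC;
        s->blksz = blksz;
    }
    return s;
}

void sketch_distspec_destroy(sketch_distspec *s)
{
    free(s);
}
```

Both constructors produce the same type, so any later call that consumes a distribution specification would not need to change; the partitioning kind is carried inside the object.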
Topic 2.3) In the API spec document, DRI_overlap should be defined before it is referenced (Group agreement)
Topic 2.4) DRI_layout object
Myra suggested that we allow a specification of layout on a per-dimension basis, and that the user can specify both stride and alignment requirements. She noted that some alignment specs across dimensions may actually be faulty, so those would need to be checked at some point in the protocol.
Ken: we'll need to get 2 forms of the dri_layout_create call: one that is specifically for dense (so-called packed) layouts, and one for the more general specification
Some members suggested a mnemonic symbol for some reasonable number of cases (up through 3 dimensions) that allow the user to easily specify dense (packed) local data: (e.g., DRI_LAYOUT_PACKED_XYZ, DRI_LAYOUT_PACKED_ZXY, ....)
Alternative is to create 2 forms of the dri_dist_create() call:
Group consensus was that we should try this approach in the API.
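As a rough illustration of what a dense (packed) layout in Topic 2.4 implies, the sketch below derives per-dimension strides from a dimension ordering. The function name and the convention that order[0] names the contiguous (fastest-varying) dimension are assumptions; the minutes do not say how a mnemonic such as DRI_LAYOUT_PACKED_ZXY would map to an ordering, and alignment requirements are not modeled here.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DIMS 3

/* Compute element strides for a dense (packed) local layout.
 * extent[d] is the local size of dimension d.
 * order[k] is the dimension that varies k-th fastest in memory,
 * so order[0] is the contiguous dimension (assumed convention).
 * On return, stride[d] is the distance in elements between
 * consecutive indices of dimension d. */
void sketch_packed_strides(int ndims, const size_t extent[],
                           const int order[], size_t stride[])
{
    assert(ndims >= 1 && ndims <= SKETCH_MAX_DIMS);
    size_t s = 1;
    for (int k = 0; k < ndims; k++) {
        int d = order[k];
        assert(d >= 0 && d < ndims);
        stride[d] = s;   /* everything faster than d is already packed */
        s *= extent[d];
    }
}
```

A per-dimension stride array of this kind is also what the more general, non-packed form of the layout call would take directly from the user, which is why two forms of the call are a natural split.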
Topic 2.5) DRI_gdo object should be renamed to something more straightforward (Group agreement)
Topic 2.6) General issues in providing "low-level" vs. "high-level" calls
This has come up in two key areas:
The problem is that in 90% of user cases, the user will want a high-level call to specify the data reorganizations. Some users will require more sophistication in the API and therefore, we need to support that (within reasonable limits).
Suggestion was made to put all of the "mainstream" calls into an earlier section of the API/document that results from our activity, and to leave the low-level specification functions in a later section. The functions that provide the most powerful specification (and the sometimes unnecessary generality) can be grouped into a section for perusal by people who really need to learn about it. General agreement by participants in the meeting for this type of API/document organization.
Topic 2.7) bounds_t structure (in)efficiency issue for storing details of a block-cyclic partitioning
Nathan: When a user queries a block-cyclic partitioning in the current API, a list of bounds_t structures is returned on a per-dimension basis. The suggestion is that maintaining this type of information is pretty heavyweight (compared to the very compact "formulaic" representations that exist).
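The compact "formulaic" representation mentioned in Topic 2.7 is not spelled out in the minutes. A minimal sketch, assuming the usual one-dimensional block-cyclic rule (blocks of size b dealt round-robin to nprocs processes, starting with process 0), needs only three integers per dimension instead of a list of bounds. The function names are hypothetical.

```c
#include <assert.h>

/* Process that owns global index g. */
long sketch_bc_owner(long g, long b, long nprocs)
{
    assert(g >= 0 && b > 0 && nprocs > 0);
    return (g / b) % nprocs;
}

/* Position of global index g within its owner's local storage. */
long sketch_bc_local_index(long g, long b, long nprocs)
{
    assert(g >= 0 && b > 0 && nprocs > 0);
    return (g / (b * nprocs)) * b + g % b;
}

/* Number of elements process p holds out of n global elements. */
long sketch_bc_local_count(long n, long b, long nprocs, long p)
{
    assert(n >= 0 && b > 0 && nprocs > 0 && p >= 0 && p < nprocs);
    long cycle = b * nprocs;          /* elements dealt per full round */
    long full  = n / cycle;           /* complete rounds */
    long rem   = n % cycle - p * b;   /* leftover reaching process p */
    if (rem < 0) rem = 0;
    if (rem > b) rem = b;
    return full * b + rem;
}
```

For 10 elements, block size 2, and 3 processes, the blocks [0,1], [2,3], [4,5], [6,7], [8,9] go to processes 0, 1, 2, 0, 1. A bounds list would store five intervals, while the formulas recover the same information from (n, b, nprocs) on demand.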