Participants
| Name | Organization |
| Murali Beddhu | MPI Software Technology, Inc. |
| Ken Cain | Mercury Computer Systems, Inc. |
| Dennis Cottel | SPAWAR Systems Center, S.D. |
| Zhenqian Cui | MPI Software Technology, Inc. |
| Jon Godwin | Northrop Grumman MEC |
| Jon Greene | Mercury Computer Systems, Inc. |
| Steve Paavola | Sky Computers, Inc. |
| Anna Rounbehler | MPI Software Technology, Inc. |
| Anthony Skjellum | MPI Software Technology, Inc., Mississippi State University |
| Brian Sroka | The MITRE Corporation |
Agenda
-
Ken&Tony: share recent DRI API concerns (corner-turn example too long)
-
revisit whether to explicitly specify "single step" data reorg setup calls
(as opposed to implementation-specific helper functions idea discussed
previously)
-
Open discussion on DRI relationship to new DUSD S&T embedded software
initiative (HPEC-SI)
-
Consider completion timeframe proposal
-
are dates feasible for all participants?
-
discuss multiple document editors configuration -- version control, tools,
document format
-
what major API features should be voted on during June meeting? (for exclusion/inclusion/specification-approach)
-
what are the priority areas for proposals?
-
June meeting dates (goal: coincide with DUSD S&T HPEC-SI meetings
week of 6/11/2001)
-
Review recent changes to the API (as listed in Annex A)
-
Review recent email proposals
-
Buffer processing
-
connect
-
processing block definition
-
get/put functions
-
Start to consider the design of remaining parts of the API (based on high
priority areas discussed earlier?) [not covered due to time constraints]
Minutes for the May 2001 meeting of the Data Reorganization Forum
Meeting Minutes
Agenda item #1: feedback that corner-turn example is too long
-
Ken noted that example annexes and discussion chapters are based on January
2001 API, and therefore do not include some of the benefits of the more
recent API developments (e.g., pre-defined default object instances for
DRI_Overlap, DRI_Layout, DRI_Partition, etc.)
-
many in the group feel that the detailed calls should not be eliminated
in favor of helper functions
-
most think that helper functions are ok to be put in the spec, as opposed
to implementation-specific interfaces
-
Ken: concrete proposals will be needed to make this happen
Agenda item #2 - DRI relationship to HPEC-SI
-
strong interest among the group in moving forward with the current effort, because
of the short-term improvements that DRI can make in performance, portability,
and especially ease of use
-
little interest in deciding now whether the DRI forum will address
integrated comm&comp libraries (a technology of interest for HPEC-SI) after
the 1.0 specification. Strong interest in completing 1.0
first, and then determining the appropriate way to proceed.
-
many in the group feel that HPEC-SI will provide a higher level of abstraction
without making other APIs such as MPI and DRI obsolete. Many users will still
expect MPI to be fully supported -- DRI would be in the same category upon
its completion, implementation, and adoption.
Agenda item #3 - DRI 1.0 milestone proposal submitted by Ken
-
steve: we should have telecons for when people generate proposals between
meetings
-
tony: we need to be specific about meeting dates in order to vote
-
tony: voting rules issue: one vote per org, orgs to include vendors (Sky,
Mercury), MSTI, SPAWAR, MITRE, LMCO, other active participant organizations.
-
steve: voting approach could be to either 1) agree, 2) agree with comments,
or 3) disagree with comments
-
steve&tony: perhaps it's time to sub-committee-ize portions of the
spec, and have each subgroup present proposals on their assigned topics.
Then, have coarse synchronization points where broad announcements are
made to ask for votes from organizations. (where votes are agree, agree
with comments, disagree with comments)
-
tony: key thing for remainder of this meeting is to come up with list of
remaining high priority issues
-
List of issues (some smaller, some larger) below. Sub-bullets either
1) record specific resolutions decided at the meeting, 2) refer
to a later part of the minutes, or 3) indicate that no discussion occurred
on the topic at this meeting
-
MPI middleware adapter/binding/instantiation
-
Steve (Sky) and Cui (MSTI) will work together to draft a proposal
-
distribution create scenario - legal or not? : parameters parts[i] == WHOLE
and groups[i] > 1
-
discussed later, reported in minutes for agenda item #5
-
multi-dimensional (random access) indices to the DRI_Blockinfo structures
associated with block-cyclic distributions
-
sequential access / iterator approach to multiple DRI_Blockinfo in block-cyclic
-
VSIPL views returned from DRI to facilitate processing before/after a data
reorg
-
Ken will determine if it is feasible to incorporate rudimentary VSIPL interoperation
in DRI 1.0 (based on degree of complexity, may table until after DRI 1.0)
-
NULL objects vs. "DEFAULT" objects (do we need NULL? we mostly use NULL
to represent default behavior)
-
all objects have DEFAULT, remove NULLS
-
error codes definition, and error handling in user application code
-
group agrees that the approach should be to have integer return codes from
DRI functions
-
integer return codes will be processed by a function that converts the
error code into a string description of the error just encountered.
-
left/right overlap terminology ==> before/after (consider this change)
-
group decides to keep left/right terminology
-
"pad" terminology and overlap specifications (toroidal overlap should not
be named DRI_OVERLAP_PAD_TOROIDAL)
-
replace PAD with EDGE, in all ovr_type parameter options for the DRI_Overlap_create
function
-
remove DRI_GROUP_WORLD, given new import/export model
-
agreement to remove this
-
tony suggests renaming import to alias, and to remove export
-
how to deal with conflicting overlap/pad specifications
-
we now call this "edge" and not pad per previous discussion
-
happens in multi-dimensional partitionings -- problem surfaces on the corners
of the data.
-
steve: 3 options: 1) these cases are erroneous, 2) the implementation decides
how to resolve them, 3) specify precedence rules for which overlap spec will
control the actual run-time behavior
-
buffer linking proposal (dennis)
-
buffer processing proposal (steve)
-
discussion reported later in agenda item #6
-
connect proposal (steve)
-
discussion reported later in agenda item #6
-
get/put proposal (steve)
-
discussion reported later in agenda item #6
-
processing block proposal (steve)
-
discussion reported later in agenda item #6
-
alternative data movement functions (murali)
-
this proposal does not exist, but Murali has volunteered to draft one after
the meeting
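Illustration of the error-handling resolution in the list above (integer return codes from DRI functions, plus a function that converts a code into a string description). This is only a sketch: the code values and the names DRI_SUCCESS, DRI_ERR_ARG, DRI_ERR_NOMEM, DRI_Error_string and example_call are hypothetical and were not decided at the meeting.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical return codes. The minutes only decide that DRI functions
 * return integers, not which names or values are used. */
enum {
    DRI_SUCCESS = 0,
    DRI_ERR_ARG = 1,
    DRI_ERR_NOMEM = 2,
    DRI_ERR_COUNT      /* number of codes defined above */
};

/* Hypothetical converter from a return code to a description string. */
const char *DRI_Error_string(int code)
{
    static const char *const text[DRI_ERR_COUNT] = {
        "success",
        "invalid argument",
        "out of memory"
    };
    if (code < 0 || code >= DRI_ERR_COUNT)
        return "unknown error code";
    return text[code];
}

/* Stand-in for any DRI call: rejects a non-positive dimension count. */
int example_call(int ndims)
{
    return (ndims > 0) ? DRI_SUCCESS : DRI_ERR_ARG;
}
```

Application code would test each return value against the success code and pass any other value to the string function for reporting.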
Miscellaneous Discussion
Dennis: there is a problem of replicated edge overlap:
-
take last element and replicate num_pos times (not in current spec, but
probably what most apps would want)
-
take last num_pos elements and replicate them as a group (current specification)
-
steve suggests also a "mirror image" option to replicate the group in reverse
order (not in current spec)
-
suggests a function pointer to a user function that is called only on edge
processes (which is known by DRI) to precondition the edges. Benefit of
this approach is that it takes the testing of edge conditions out of the
application code.
-
table this decision until we talk about the buffer processing proposal
later, which also suggests handlers in that area
-
for now: change to replicate the last single element num_pos times (group
feels that this is needed by more applications than the other options)
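The three replicated-edge behaviours discussed above can be compared with a small C sketch on a one-dimensional array. The mode names and the function fill_right_edge are hypothetical, chosen only for this illustration. The sketch also assumes num_pos does not exceed the number of valid elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode names; the minutes only describe the behaviours. */
typedef enum {
    EDGE_REPEAT_LAST,   /* last element repeated num_pos times (chosen for now) */
    EDGE_REPEAT_GROUP,  /* last num_pos elements copied as a group */
    EDGE_MIRROR         /* last num_pos elements copied in reverse order */
} edge_mode;

/* data holds n valid elements followed by room for num_pos edge elements.
 * For simplicity this sketch requires num_pos <= n in every mode.
 * Returns 0 on success, -1 if the arguments are inconsistent. */
int fill_right_edge(float *data, size_t n, size_t num_pos, edge_mode mode)
{
    size_t i;

    if (data == NULL || n == 0 || num_pos > n)
        return -1;
    for (i = 0; i < num_pos; i++) {
        switch (mode) {
        case EDGE_REPEAT_LAST:
            data[n + i] = data[n - 1];
            break;
        case EDGE_REPEAT_GROUP:
            data[n + i] = data[n - num_pos + i];
            break;
        case EDGE_MIRROR:
            data[n + i] = data[n - 1 - i];
            break;
        default:
            return -1;
        }
    }
    return 0;
}
```

With valid data 1 2 3 4 and num_pos = 2, the three modes append 4 4, 3 4, and 4 3 respectively.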
Steve: another potentially missing area of the spec
-
wants to be able to specify that on global data edges the process will
not use elements at the edges along specified dimensions.
-
A variation on the overlap specification -- instead of extending the local
storage (as overlap does), constrict the local storage
-
suggests possibly writing a proposal to handle this -- no action to be
taken right now.
Anna: provided feedback on the current document, especially the introduction/discussion
sections
-
encourages DRI to clearly state up front its:
-
target applications
-
technical approach (esp. as compared to MPI/RT, MPI)
-
ability or not to gracefully be used in the context of other standard,
proprietary, or third-party middleware
-
and more...
-
Anna agrees to put together a list of "frequently asked questions" about
DRI that the forum can answer, and publish on its web server
Agenda item #4 - next meeting
6/11 week is leading candidate - ESI probably Tues/Wed
1/2 day Wed 6/13 1:00 - 5:00, all day Thurs 6/14 would be best for
DRI
Most likely scenario:
-
HPEC-SI meetings at MIT/LL full day 6/12 and morning 6/13
-
DRI meetings at MITRE half day 6/13, full day 6/14
Agenda item #5 - Review recent changes to the API (as listed
in Annex A) -- See discussions in Agenda item #6 minutes
-
better explain DRI_LAYOUT_DEFAULT pre-defined objects
-
add conditions of usage for DRI_LAYOUT_ORDER_DEFAULT
-
When this default value is used in only a subset of the order[] array entries,
the remaining non-default order[] values must refer to the most contiguously
stored dimensions, starting with an order[] array entry = 0 and then continuing
with 1, 2, ... without skipping
any intermediate values.
-
Here is an example of an erroneous case:
-
3 dimensional data object
-
order[] array parameter to DRI_Layout_create_* is order[0] = 0, order[1]
= DRI_LAYOUT_ORDER_DEFAULT, order[2] = 2.
-
here, the user specified the layout order of some dimensions explicitly,
but failed to assign one of the data dimensions to the "second most contiguous"
storage (would correspond to an order[] array entry = 1).
-
Jon: naming approach for layout creation functions is confusing -- suggests
the following approach
-
DRI_Layout_create_packed (leave as-is)
-
DRI_Layout_create_aligned (ndims, order[], align[], &layout)
-
order[]: can be DRI_LAYOUT_ORDER_GDO_DEFAULT
-
order[i]: can be DRI_LAYOUT_ORDER_DEFAULT
-
align[]: can be DRI_LAYOUT_ALIGN_DEFAULT
-
align[i]: if align[] is non-null, each align[i] must contain a user-supplied
byte alignment for each dimension
-
DRI_PARTITION_WHOLE expression in terms of basic params (minsz, etc.) -
add a note that minsz MUST be equal to the size of the associated global
data dim
-
DRI_PARTITION_WHOLE discussion should say that it is equivalent to DRI_Partition_create_BLOCK
with the minsz, mod, etc. parameters equal to: ...
-
in DRI_Distribution_create discussion of restrictions on consistency of
input parameters across a process set
-
add another restriction that layout parameter should be identically specified
across all processes on the same DR "side"
-
in DRI_Distribution_create, communication properties should say that this
is definitely a local call now (impacted by decisions made at this meeting)
-
DRI_Group_export is removed from the API
-
DRI_Distribution_create conflicting? parameters
-
what happens when the input parameters for a given dimension i are
(dists[i] == WHOLE) && (group_dims[i] > 1)?
-
current document says that the data in dimension i is "replicated"
-
As a receiver, this is a broadcast.
-
As a sender, this allows the implementation the flexibility to transfer
those data items from whichever process it wants (or, even transfer from
those processes in parallel to better scale the data reorg)
-
On the proposal to put more info (actual overlap due to data overlap, actual
overlap due to edge padding) in DRI_Blockdim:
-
Group agrees to provide two quantities -- number of true data overlap positions,
and number of positions due to edge padding
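The usage condition for DRI_LAYOUT_ORDER_DEFAULT recorded earlier under this agenda item can be expressed as a check on the order[] array. The following C sketch is one possible reading of that rule, not text from the specification: the constant ORDER_DEFAULT (standing in for DRI_LAYOUT_ORDER_DEFAULT, whose value is not given in the minutes), the MAX_DIMS limit, and the function order_is_valid are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define ORDER_DEFAULT (-1)  /* hypothetical stand-in for DRI_LAYOUT_ORDER_DEFAULT */
#define MAX_DIMS 16         /* arbitrary limit for this sketch */

/* Returns 1 if order[] satisfies the rule, 0 otherwise.
 * Rule as read from the minutes: the explicitly given entries must be
 * distinct and must be exactly 0, 1, ..., k-1, where k is the number of
 * explicit entries (no intermediate value may be skipped). */
int order_is_valid(const int *order, int ndims)
{
    int seen[MAX_DIMS] = {0};
    int explicit_count = 0;
    int i;

    if (order == NULL || ndims < 1 || ndims > MAX_DIMS)
        return 0;
    for (i = 0; i < ndims; i++) {
        if (order[i] == ORDER_DEFAULT)
            continue;
        if (order[i] < 0 || order[i] >= ndims || seen[order[i]])
            return 0;               /* out of range or duplicated */
        seen[order[i]] = 1;
        explicit_count++;
    }
    for (i = 0; i < explicit_count; i++) {
        if (!seen[i])
            return 0;               /* a value below k was skipped */
    }
    return 1;
}
```

The erroneous case from the minutes (order[0] = 0, order[1] = default, order[2] = 2) is rejected because the value 1 is skipped.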
Agenda item #6 - reviewing recent proposals
-
Buffer processing (author: steve -- refer to prior email reflector
activity for proposal details)
-
connect (author: steve -- refer to prior email reflector activity
for proposal details)
-
wants to have an MPI/RT-like commit function
-
cui: should be allowed to provide a list of channels to start
-
tony: current global start/stop is problematic (e.g., what if you want
to stop a specific channel but not all others)
-
tony: suggests scoping DRI_Init to be over a specific aliased/imported
group from other middleware. This would define the synchronization scope
of the DRI_Start (or commit, if we call it that).
-
allows some subsets of an application to use DRI, and others to not use
DRI simultaneously -- limits the # of processes that have to call DRI_Start()
to connect all the channels
-
MPI ex option #1:
-
assume MPI_COMM_WORLD exists, after MPI_Init(&argc, &argv)
-
DRI_Group_import (MPI_COMM_WORLD, &dri_group_world)
-
DRI_Init (&argc, &argv, &dri_group_world)
-
DRI_Commit() (or DRI_Start())
-
DRI_Stop()
-
problem: this calls a DRI function (DRI_Group_import) before DRI_Init
-
Ex#2
-
application command lines have something like "-dri-mpi-ranks 0..7" inserted
by the DRI implementation
-
injected into command line by implementation-specific "run script" machinery
(a-la mpirun)
-
MPI_Init(&argc, &argv)
-
DRI_Init(&argc, &argv, &handle)
-
processes the command line, looking for implementation-specific arguments,
and retrieves the "DRI scope" from the command line. Stores this scope
in the library "handle" output object
-
This handle is passed to all subsequent DRI setup functions requiring communication.
This is minimally Channel_create, and the new library-scoped connect/disconnect
functions (DRI_Connect, _Disconnect) described below. The handle represents
a single independent "DRI network". Future DRI specs may allow a single
process to participate in different DRI networks simultaneously.
-
Still to be done (by Cui and Steve in the MPI middleware binding work)
- how to use MPI communicators to specify the scope of the DRI network
handle
-
"Connect" Resolutions below:
-
DRI_Channel_create takes the DRI library handle developed above
-
this starts a registration process for the channel names
-
DRI_Connect takes a handle, and connects all channels that were created
using that handle
-
DRI_Disconnect takes the handle, and stops activities on all channels created
using that handle
-
remove DRI_Channel_connect
-
DRI_Finalize takes the handle parameter too
-
get/put functions (author: steve -- refer to prior email reflector
activity for proposal details)
-
goal: query for blockinfo at channel get time (to enable dynamic distributions)
-
Ken: suggest putting blockinfo reference in returned Buffer_ID (and not
the other way around to protect CORE/mwadapter separation already developed)
-
steve: also wants get calls to work on a block granularity only (not a
buffer granularity, in which multiple blocks may be stored -- a-la block-cyclic
distribs)
-
steve&jon: 2 possible interface approaches, depending on what granularity
you want the communications to happen:
-
DRI_Channel_get_block (channel, &blockinfo)
-
Blockinfo would contain a pointer to the data
-
DRI_Channel_put_block(channel, blockinfo)
-
DRI_Channel_get_buffer (channel, &bufferid)
-
buffer would contain a pointer to the data, and a pointer to the blockinfo
-
DRI_Channel_put_buffer(channel, bufferid)
-
Jon suggests adding a third argument where the pointer can be returned,
instead of packaging it inside DRI_Blockinfo or another DRI object
-
tony: would like to add a boolean to the DRI_Blockinfo that can be queried
to tell you if it reflects the first block of a new buffer. (assuming the
new model is just get_block put_block, and the middleware keeps track of
when buffers are fully produced / consumed)
-
tony: suggests specifying the granularity of transfers at channel create
time (block or buffer)
-
alternative to get_buffer would be get_allblocks
-
jon: we may have no choice about transport granularity - it may require
buffer-level in order to maintain reasonable implementation complexity.
Consider that we allow any distribution to any distribution reorgs -- this
can get into very difficult synchronization scenarios (flow control requirements
are different on one side than another, because they each specify different
communication granularities).
-
dennis: suggests for 1.0 keeping current buffer-level transport granularity.
Return buffer id, within which is stored all associated blockinfos. User
can query the bufferid to get various properties of a particular blockinfo
-
"Get/put" resolutions below, based on Dennis' suggestions:
-
DRI 1.0 has buffer-level granularity for channel get/put
-
DRI_Buffer_get_blockcount(buffer_id, &count)
-
DRI_Buffer_get_blockinfo (buffer_id, index_value, &blockidhandle)
-
DRI_Channel_get_buffer is new function name to indicate buffer-level granularity
(indicates desire to expand in 1.1 to DRI_Channel_get_block approach)
-
We get a buffer id back from channel get
-
Buffer ids store the actual pointer, and a list of blockinfo objects
-
Buffer object is queried to get a blockinfo object (based on a random access
index)
-
DRI_Buffer_get_blockinfo(buf, index, &blockinfo)
-
DRI_Distribution_get_numblocks goes away (to enable dynamic distrs)
-
DRI_Distribution_get_blockinfo goes away
-
question: is a returned DRI_Blockinfo structure read-only, or read-write?
-
readonly would require returning DRI_Blockinfo **blockinfo
-
high performance can be enabled in some cases by forcing read-only (implementation
can pre-compute the DRI_Blockinfo structures associated with a distribution,
and return references to those structures to the calling application
in its "inner loop")
-
DRI_Blockinfo doesn't have to be a structure (could be object or structure)
-- with the object approach the user would be required to call get-only
accessors that would enforce the read-only attribute. Can get aforementioned
performance benefits too
-
processing block definition (author: steve -- refer to prior email
reflector activity for proposal details)
-
this proposes to replace buffer link proposal
-
DRI_Compute_create(void *(routine_name)(void *user_defined), void *user_defined,
DRI_Compute *compute)
-
routine_name can be null function pointer
-
DRI_Compute_single_channel(DRI_Compute compute, DRI_Channel channel)
-
DRI_Compute_inplace (DRI_Compute compute, DRI_Channel chan1, DRI_Channel
chan2)
-
ken: problem with callback approach is that the called function will not
know the channel context in which it is called.
-
group: what is really needed is the ability to distinguish between in-place
and not-in-place processing stages, not necessarily the specific processing
routine that will be invoked
-
The callback approach outlined here tries to address a different problem
(automated data transfer without any explicit calls by the user to DRI_Channel_get/put)
-
"Processing block definition" resolutions:
-
DRI_Reorg_inplace (reorg_in, reorg_out)
-
no output parameter. Instead, the state of reorg_in and reorg_out gets
changed
-
(i.e., an internally-managed linked list consisting of [at least] these
two channels is created/extended)
-
refers to in-place processing, not in-place data reorganization
-
some discussion on removing bufferset abstraction
-
also included discussion on how to accomplish system vs. user allocation
-- user allocation done by passing a function pointer to channel create,
and then at the latest possible time (most likely library connect) the
buffers assoc. with the channel are actually allocated (either by system
or by user-provided routine). The channel will provide the arguments to
the user routine (index, length, alignment)
-
prolonged group discussion on scope of DRI and whether channels are needed
-
one issue: the meaning of a channel is not clear from its name -- perhaps
consider renaming the object to a "reorg"
-
group agrees to make this name change (channel --> reorg)
-
another issue: what about avoiding the channel object altogether?
-
some applications may want to create DRI objects only up to the DRI_Distribution.
-
Murali agrees to draft a proposal on this subject
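Illustration of the "get/put" resolutions above: with buffer-level granularity, the application receives a buffer and then walks its blockinfo entries by index. The C sketch below uses mock types and mock function bodies. The struct layouts, the pointer-based signatures, the const (read-only) blockinfo reference, and the helper sum_blocks are assumptions made for this example, not the DRI specification.

```c
#include <assert.h>
#include <stddef.h>

/* Mock types; real DRI objects would be defined by the specification. */
typedef struct {
    size_t offset;   /* start of the block within the buffer, in elements */
    size_t length;   /* number of elements in the block */
} DRI_Blockinfo;

typedef struct {
    float *data;                   /* pointer to the buffer storage */
    const DRI_Blockinfo *blocks;   /* blockinfo list kept with the buffer */
    int nblocks;
} DRI_Buffer;

/* Mock bodies; both return 0 on success and nonzero on error. */
int DRI_Buffer_get_blockcount(const DRI_Buffer *buf, int *count)
{
    if (buf == NULL || count == NULL)
        return 1;
    *count = buf->nblocks;
    return 0;
}

/* Hands back a read-only reference to a blockinfo held by the buffer. */
int DRI_Buffer_get_blockinfo(const DRI_Buffer *buf, int index,
                             const DRI_Blockinfo **blockinfo)
{
    if (buf == NULL || blockinfo == NULL || index < 0 || index >= buf->nblocks)
        return 1;
    *blockinfo = &buf->blocks[index];
    return 0;
}

/* Application-side loop: visit every block of one buffer and sum its data. */
int sum_blocks(const DRI_Buffer *buf, float *total)
{
    const DRI_Blockinfo *info;
    int count, i;
    size_t k;

    if (total == NULL || DRI_Buffer_get_blockcount(buf, &count) != 0)
        return 1;
    *total = 0.0f;
    for (i = 0; i < count; i++) {
        if (DRI_Buffer_get_blockinfo(buf, i, &info) != 0)
            return 1;
        for (k = 0; k < info->length; k++)
            *total += buf->data[info->offset + k];
    }
    return 0;
}
```

In a real application the buffer would come from DRI_Channel_get_buffer rather than being filled in by hand, and would be handed back with the matching put call.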
Summary of API changes suggested at the meeting
DRI_Reorg_create_user (dri_lib_handle, side, name, datatype, dist, num_buffers,
void * (*alloc_routine)(channel, buffer_index, len, alignment), void *
(*dealloc_routine)(channel, buffer_index, void *bufptr))
DRI_Reorg_create_system(dri_lib_handle, side, name, datatype,
dist, num_buffers, &reorg)
direction flag is DRI_CHANNEL_SIDE_IN, DRI_CHANNEL_SIDE_OUT
DRI_Reorg_inplace (side_in, side_out)
DRI_Reorg_get_bufaddr(DRI_Reorg reorg, int index, void **addr)
DRI_Reorg_get_buflen(DRI_Reorg reorg, int *index)
DRI_Reorg_get_bufcount(DRI_Reorg reorg, int *bufcount)
DRI_Reorg_get_buffer(DRI_Reorg reorg, DRI_Buffer *buf)
DRI_Reorg_put_buffer(DRI_Reorg reorg, DRI_Buffer *buf)
NULL objects go away; DEFAULT object instances for everything where
appropriate
error codes and error string generator function needed
remove PAD from overlap specification types ovr_type (use EDGE instead)
EDGE_REPLICATED takes the last single element and replicates it num_pos times
buffersets are gone
DRI_get_version
DRI_get_subversion
#define DRI_VERSION
#define DRI_SUBVERSION
DRI_Buffer_get_blockinfo(DRI_Buffer buf, DRI_Blockinfo *blockinfo)
DRI_Blockinfo_get_*(DRI_Blockinfo *blockinfo, attribute * attribute);
Layout changes (repeated from agenda item #5 section of the minutes)
-
DRI_Layout_create_aligned (ndims, order[], align[], &layout)
-
order[]: can be DRI_LAYOUT_ORDER_GDO_DEFAULT
-
order[i]: can be DRI_LAYOUT_ORDER_DEFAULT
-
align[]: can be DRI_LAYOUT_ALIGN_DEFAULT
-
align[i]: if align[] is non-null, each align[i] must contain a user-supplied
byte alignment for each dimension
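Illustration of the user-allocation routines passed to DRI_Reorg_create_user in the summary above. The minutes list the callback arguments (channel, buffer index, length, alignment) but not their types, so the parameter types below, the names user_alloc and user_dealloc, and the use of a plain void pointer for the reorg object are assumptions. The sketch over-allocates with malloc and aligns by hand so that it stays within standard C99.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical allocation routine: returns len bytes aligned to `alignment`
 * (a power of two), or NULL on failure. The original malloc pointer is
 * stored just in front of the aligned address so it can be freed later. */
void *user_alloc(void *reorg, int buffer_index, size_t len, size_t alignment)
{
    void *raw;
    uintptr_t start, aligned;

    (void)reorg;          /* unused in this sketch */
    (void)buffer_index;   /* unused in this sketch */
    if (alignment == 0 || (alignment & (alignment - 1)) != 0)
        return NULL;      /* not a power of two */
    if (alignment < sizeof(void *))
        alignment = sizeof(void *);   /* keep the stored pointer aligned */
    raw = malloc(len + alignment + sizeof(void *));
    if (raw == NULL)
        return NULL;
    start = (uintptr_t)raw + sizeof(void *);
    aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

/* Hypothetical deallocation routine matching user_alloc. It returns a
 * pointer only because the summary lists that return type; NULL is returned. */
void *user_dealloc(void *reorg, int buffer_index, void *bufptr)
{
    (void)reorg;
    (void)buffer_index;
    if (bufptr != NULL)
        free(((void **)bufptr)[-1]);
    return NULL;
}
```

Per the discussion under agenda item #6, the library would call such routines at the latest possible time (most likely at library connect) and would supply the index, length, and alignment arguments itself.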
Action items
Ken - address VSIPL views and rudimentary interoperation proposal
Murali - alternative data movement functions proposal (e.g., those not
involving channels)
Steve & Cui - MPI middleware adapter proposal
Ken - document the version control and LaTeX tool chain for the group
Brian - address shortcut/helper functions that would simplify the expression
of commonly-needed cases (e.g., corner-turn)