Minutes for the August 2001 meeting of the DRI Forum
Attendees:
Ray Artz, Lockheed Martin Eagan
Murali Beddhu, MPI Software Technology Inc.
Ken Cain, Mercury Computer
Dennis Cottel, SPAWAR Systems Center, San Diego
Zhenqian Cui, MPI Software Technology Inc.
Nathan Doss, Lockheed Martin Moorestown
Jon Greene, Mercury Computer
Jamie Kenny, Mercury Computer
Steve Paavola, Sky Computers
Myra Jean Prelle, Mercury Computer
Bob Rowlands, Sky Computers
Sharon Sacco, Sky Computers
Brian Sroka, MITRE
Agenda
-
Solidify interfaces through DRI_Distribution, and DRI_Blockinfo
-
(A) Discuss DRI_Distribution_create function (whether we should remove
the DRI_Group input parameter)
-
(B) Discuss whether we can query for DRI_Blockinfo immediately after creating
a DRI_Distribution object
-
Define the roadmap for DRI (see Ken Cain's suggestions later in the minutes)
-
Time permitting, address specific issues:
-
(A) memory management, buffer sharing among DRI_Reorgs (see Ken's skeleton
working briefing)
-
(B) VSIPL interoperation proposal -- version 2 (Jamie Kenny, Ken Cain)
-
(C) blocking, non-blocking communication support -- add more options via
additional parameter to get_buffer and put_buffer? (see Ken's skeleton
briefing)
-
(D) general object memory management (opaque object design, reference counting,
etc.)
Complementary Materials Discussed at the Meeting (see www.data-re.org
to gain access to these materials)
-
(A) Skeleton working briefing from Ken Cain describing memory management
and buffer sharing in DRI
-
(B) Steve Paavola's briefing characterizing Sky Computers DRI activities,
including a specifically designed "shortcut" function to set up the matrix
transpose
-
(C) requirements of DRI expressed by a defense contractor (distributed
by Mercury participants at the meeting) -- a summary of the contents of
this document will be posted on the DRI web site
-
(D) Revised VSIPL/DRI inter-operation approach (Jamie Kenny and Ken Cain)
Legend
Resolutions are shown in bold and italic font
agenda item #1: solidifying the interfaces through DRI_Distribution and
DRI_Blockinfo
-
arguments for doing these things are in the August 20 version of the document,
Annex A "list of current proposals" (items 3, 4)
-
Issue 1-A: should we remove DRI_Group input parameter to DRI_Distribution_create?
-
steve: is not against moving DRI_Group input into the DRI_Reorg_create
-
jon: supports removing it
-
would also like to remove group_dims input parameter (seems logical with
the relocation of the DRI_Group parameter)
-
DRI_Partition_create would take an additional num_parts scalar argument
[recall DRI_Partition_create refers to a single data dimension and would
require only a scalar argument for process group topology]
-
ken: this suggestion is good in a way, because the DRI_Partition object
represents a mathematical function that is applied to the global data set
properties, and the scalar "num_parts" would represent the topology input
parameter to that function
-
steve & ken: on the other hand, this approach reduces the utility of
the DRI_PARTITION_* shortcut objects.
-
Leaving group_dims in DRI_Distribution_create permits DRI_PARTITION_* shortcuts
to be used with an arbitrary process group topology (a "good thing")
-
The proposed approach prevents the flexibility already available with DRI_PARTITION_*
shortcuts
-
With the proposed approach, the user would have to create an array of DRI_Partition
objects using DRI_Partition_create (supplying default parameters for minsz,
mod, etc.)
-
Nathan: for dynamic cases (either changing #procs or data sizes), a user
may want to change process group topology frequently, so moving the topology
specification to DRI_Partition_create may not be a good design
-
group agrees that we will address dynamic functionality later
-
After further discussion, 6 different approaches were identified:
-
keep topology input specification where it is in DRI_Distribution_create
-
move topology input earlier in the typical DRI function calling sequence
into DRI_Partition_create (a scalar "num_parts" input value)
-
move topology input later in the typical calling sequence into DRI_Reorg_create
-
Dennis suggests using approach #2, and ALSO passing the topology to DRI_Reorg_create
-
create a new DRI_Partition_create function that takes num_parts, but NOT
minsz, mod (because they are seemingly independent)
-
there was some debate about this claim among the participants
-
take away the user's ability to specify the process topology, leaving it
to the implementation
-
ken: there are a handful of good reasons to move it into different _create
functions, but they all tend to conflict with each other. Recommends keeping
group_dims as input to DRI_Distribution_create
-
STRAW VOTE TAKEN (on leaving group_dims as input to DRI_Distribution_create):
see summary in "Resolutions" below
-
STRAW VOTE TAKEN (on removing DRI_Group input parameter from DRI_Distribution_create):
see summary in "Resolutions" below
-
STRAW VOTE TAKEN (on moving DRI_Group input parameter to DRI_Reorg_create):
see summary in "Resolutions" below
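To make the "partition as a mathematical function" view concrete, the following is a minimal sketch in C, assuming a plain block partition with no minsz or mod constraints. The names block_extent and local_elements, and the rule that leftover elements go to the lowest-numbered parts, are assumptions made for illustration; they are not part of the DRI specification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a DRI function): number of elements of one data
 * dimension owned by part `part` when `global_size` elements are split into
 * `num_parts` contiguous blocks. The first (global_size % num_parts) parts
 * each receive one extra element. */
static size_t block_extent(size_t global_size, size_t num_parts, size_t part)
{
    assert(num_parts > 0 && part < num_parts);
    size_t base = global_size / num_parts;
    size_t extra = global_size % num_parts;
    return base + (part < extra ? 1 : 0);
}

/* Hypothetical helper: local element count of a 2-D data set for the
 * process at grid coordinates (row, col), where group_dims[] plays the
 * role of the process group topology discussed above. */
static size_t local_elements(const size_t global_dims[2],
                             const size_t group_dims[2],
                             size_t row, size_t col)
{
    return block_extent(global_dims[0], group_dims[0], row) *
           block_extent(global_dims[1], group_dims[1], col);
}
```

The same partition rule is applied unchanged while only group_dims varies, which is the flexibility attributed above to keeping group_dims as an input to DRI_Distribution_create.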
-
Issue 1-B: should we allow user to query DRI_Blockinfo from a DRI_Distribution
object after it has been created?
-
steve: it is hard to solidify the DRI_Distribution and consequently DRI_Blockinfo
early in some cases. For example:
-
you may want to have the DRI_Blockinfo change dynamically from buffer to
buffer to support load balancing application design
-
iteration 0: p0 gets 3 blocks, p1 and p2 each get 2 blocks
-
iteration 1: p0 gets 2 blocks, p1 gets 3 blocks, p2 gets 2 blocks
-
iteration 2: p0 gets 2 blocks, p1 gets 2 blocks, p2 gets 3 blocks
-
...
-
over multiple iterations, the load assigned to each process is balanced
(this is important for applications that have no synchronization requirements
among the processes)
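The rotation in this example can be written down as a small function. This is only a sketch of one possible rule, assuming 7 blocks shared by 3 processes as in the iterations listed above; the name blocks_for and the one-process-per-iteration rotation are assumptions, not something the Forum specified.

```c
#include <assert.h>

/* Hypothetical helper (not a DRI function): blocks assigned to process
 * `rank` on a given iteration when `nblocks` blocks are shared by `nprocs`
 * processes and the leftover blocks rotate by one process per iteration. */
static unsigned blocks_for(unsigned nblocks, unsigned nprocs,
                           unsigned rank, unsigned iteration)
{
    assert(nprocs > 0 && rank < nprocs);
    unsigned base = nblocks / nprocs;
    unsigned extra = nblocks % nprocs;
    /* Position of this rank relative to the process that currently
     * receives the first leftover block. */
    unsigned offset = (rank + nprocs - iteration % nprocs) % nprocs;
    return base + (offset < extra ? 1u : 0u);
}
```

With nblocks = 7 and nprocs = 3, every process receives 7 blocks in total over any 3 consecutive iterations, which is the sense in which the load is balanced over time.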
-
ken: such scenarios can be supported by the "flag" argument of DRI_Reorg_create.
A load balanced DRI_Reorg object could be specified to cover this case.
-
ken: need to acknowledge that there will be different DRI use cases (load
balanced reorgs is one use case)
-
myra: problem with deferring until DRI_Reorg_create is that we assume that
the user is using DRI_Reorg style communication (fixed-size data sizes,
fixed-size transfers, queueing support, etc.). If using a different communication
approach, user needs way to find out the specific assignment of elements
to processes. DRI_Distribution is the last object in the sequence of objects
created to get this information, without going down the DRI_Reorg path.
-
Proposal (formulated by the meeting participants):
-
DRI_Datapart_create (DRI_Distribution dist, unsigned int group_size, unsigned
int rank, DRI_Datapart *dpo);
-
DRI_Datapart_get_buffer_size (DRI_Datapart dpo, DRI_Dataspec datatype,
unsigned int *nbytes)
-
In the back-end of the API, change all instances of DRI_Buffer to DRI_Datapart,
and make similar replacement for uses of "buffer" in DRI function names
-
DRI_Datapart_get_ptr will return NULL, unless the DRI_Datapart object was
returned from DRI_Reorg_get
-
STRAW VOTE TAKEN: see summary in "Resolutions" below
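A rough mock of how the proposed DRI_Datapart calls might be used without going down the DRI_Reorg path is given below. The function names and parameter lists follow the proposal above; everything else is an assumption made so the sketch compiles: the int return codes, the struct layouts, the 1-D block distribution rule, the use of a byte count as DRI_Dataspec, and the signature of DRI_Datapart_get_ptr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mock types, invented for this sketch. Real DRI objects are opaque. */
typedef struct { unsigned int global_elems; } DRI_Distribution; /* 1-D only */
typedef size_t DRI_Dataspec;                 /* mock: bytes per element */
typedef struct { unsigned int local_elems; void *ptr; } *DRI_Datapart;

/* Mock: describe the part of the data owned by `rank` in a group of
 * `group_size` processes. Returns 0 on success (assumed convention). */
int DRI_Datapart_create(DRI_Distribution dist, unsigned int group_size,
                        unsigned int rank, DRI_Datapart *dpo)
{
    if (group_size == 0 || rank >= group_size || dpo == NULL)
        return -1;
    DRI_Datapart dp = malloc(sizeof *dp);
    if (dp == NULL)
        return -1;
    unsigned int base = dist.global_elems / group_size;
    unsigned int extra = dist.global_elems % group_size;
    dp->local_elems = base + (rank < extra ? 1u : 0u);
    dp->ptr = NULL; /* no memory attached: not returned from DRI_Reorg_get */
    *dpo = dp;
    return 0;
}

/* Mock: bytes needed to hold this part for the given element type. */
int DRI_Datapart_get_buffer_size(DRI_Datapart dpo, DRI_Dataspec datatype,
                                 unsigned int *nbytes)
{
    if (dpo == NULL || nbytes == NULL)
        return -1;
    *nbytes = dpo->local_elems * (unsigned int)datatype;
    return 0;
}

/* Mock with an assumed signature: NULL unless memory has been attached. */
void *DRI_Datapart_get_ptr(DRI_Datapart dpo)
{
    return dpo != NULL ? dpo->ptr : NULL;
}
```

In this mock, a process can ask about any rank, not only its own, and a DRI_Datapart created this way is released with free(), since no destroy function was discussed.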
-
Agenda item 1-A and 1-B Resolutions:
-
1-A: remove DRI_Group from DRI_Distribution_create: yes (STRAW
VOTE: 3 abstain, 10 in favor)
-
1-A: moving DRI_Group input parameter into DRI_Reorg_create: yes
(STRAW VOTE: 2 abstain, 11 in favor)
-
1-A: leave group_dims as an input parameter to DRI_Distribution_create:
yes
(3 abstain, 1 not in favor, 9 in favor)
-
1-B: querying for DRI_Blockinfo from DRI_Distribution, and querying
for buffer size from DRI_Distribution object: yes (STRAW VOTE: 1
abstain, 12 in favor)
-
Nathan: final thoughts on agenda items 1-A and 1-B:
-
gaining stability through DRI_Distribution object creation is fine for
now, but will these objects remain stable once we start thinking about
how to make DRI dynamic in various ways (processes, data sizes, etc.)?
agenda item #2: defining the DRI Forum roadmap discussion
Suggested DRI Forum Plan (suggested by Ken Cain)
-
Scrub API spec for correctness & clarity through DRI_Distribution object
creation, and associated DRI_Distribution object attribute queries
-
Solidify these interfaces earlier than the full-blown DRI 1.0 spec (whose
goal is January 2002 completion), and announce their availability
-
perhaps assign a distinct version number for these parallel mapping features
of DRI
-
call for middleware developers to start to integrate the solidified API
constructs
-
finish the "second half" of the API, as planned, to result in the complete
DRI 1.0 specification (goal January 2002)
-
Static transfer sizes, Buffered (library-managed buffer synchronization)
communication support -- this is what has been referred to as "early binding"
-
call for defense contractors to actively participate in subsequent (post
January 2002) DRI Forum efforts
-
to effectively communicate their future application requirements
-
Address budget/funding concerns of contractors by re-emphasizing online
collaboration via email reflector
-
Define the next phases of DRI development (post January 2002) to be the
following, with each item being a separate revision of DRI spec:
-
Dynamic transfer sizes, Buffered (library-managed buffer synchronization)
communication support -- a preliminary form of late binding
-
Static and dynamic transfer sizes, Non-Buffered (user-managed synchronization)
communication support
-
Other forms of "late binding"
-
dynamically changing process sets to support load balancing and fault tolerance
requirements
Group discussion of suggested DRI Forum Roadmap
-
steve: is generally not in favor of an intermediate version number or subset
"profile" of DRI (ken's suggestion was to define a profile of sorts for
the parallel mapping portions of DRI)
-
group: basic requirements for future DRI Forum work (beyond the initial
early-binding specification to be produced) would seem to include:
-
sustained contractor participation to define requirements
-
sponsors to fund a reference implementation
-
dennis: HPEC-SI potential. The original hope was that HPEC-SI would be an
umbrella organization. It seems that HPEC-SI is leaning toward VSIPL++ and
combined computation and communication, not toward intermediate middleware
such as an MPI/DRI reference
-
dennis: right thing to do would be to propose DRI reference implementation
to HPEC-SI when RFPs occur
-
defining dynamic application requirements (DRI Forum best guess):
-
change DRI_Globaldata dimension sizes on the fly (instead of pre-planning
a bunch of different cases, each with their own gdo, distribution, reorg,
etc.)
-
jamie: the idea is to modify the basic objects (that have already been
created) in the inner loop instead of creating new instances (incurring
memory allocation overhead) in the inner loop
-
change destination process groups from a common source group
-
dennis: willing to entertain increased communication contexts in future
DRI specifications
-
dennis: we should consider finishing static DRI_Reorgs (channels), without
worrying about buffer sharing (yielding a very simple static channel approach,
but something achievable in January timeframe)
-
dennis: then, consider buffer sharing when we increase our scope
-
jon: propose we agree that DRI 1.0 in January timeframe should still
have a basic early-binding transport
-
group: agrees with jon's assertion
Miscellaneous: discussion of more aggressive shortcut functions to explicitly
set up a distributed matrix transpose (cornerturn)
-
action item: try to integrate Steve Paavola's cornerturn shortcut in the
next edit of the document (it was discussed at last meeting, but has not
yet been integrated into the doc)
-
See handout (B)
Agenda item #3-A - discussion of buffer sharing scenarios
-
much of this discussion refers to portions of Ken Cain's skeleton working
briefing
-
see handout (A)
discussion of chart titled "Which Buffer Sharing Scenarios Should
We Support"
-
specifically, regarding in-place vs. out-of-place use of buffers for data
reorg communication:
-
dennis: DRI Forum previously decided to let the implementation (not the
user) control whether data reorg communication is performed in-place vs.
out-of-place
-
steve: thinks implementation should determine at run-time whether in-place
or out-of-place communication is appropriate (based on its knowledge of
the run-time environment)
-
myra: thinks the user should be able to specify which approach is used
-- especially to specify in-place DRs to save memory when it is important
discussion of chart titled "Basic ``Tenets'' of Early-Binding
Reorgs"
-
[Bullets 1-2]: About the order of access to buffers in a DRI bufferset:
-
dennis: doesn't think that we need to constrain implementations to rotate
through buffers in order in the underlying bufferset
-
steve: the application has to give buffers back to the library in the order
that they were provided
-
myra: wants users to be able to express where the associated memory exists
(i.e., a bufferset)
-
myra: wants a single dataset to stay together through a series of stages
(e.g., buffer 0 from all of the associated buffersets in a chain of data
reorganizations will be used for the same data set [associated with an
application iteration] through those stages)
-
steve: disagrees -- we should not confuse the buffer identification with
the buffer contents
-
steve: the requirement should be that the implementation keeps datasets
together
-
myra: agrees that this is the real issue
-
jamie: the way to say this in the document is that the i'th invocation
of put or get on a single side of a DR should refer to the same dataset
-
[Bullet 3]: proposed restriction that #buffers should be the same on both
source and destination sides of a data reorganization
-
dennis: the problem with this is that processes on one side may need fewer
buffer resources. Or, it may be appropriate to perform only single buffering
instead of double buffering on some resources (or double buffering instead
of triple, ...)
-
the requirement is that the DR implementation must abide by the user's
buffer specifications, even if the #buffers is different on both sides.
-
DRI implementation is responsible for making the application run
correctly, but can elect to degrade performance (e.g., by performing associated
DMA transfer setups dynamically if there are too many combinations to set
up)
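The rule that the i'th put or get on one side refers to the same dataset, even when the two sides have different numbers of buffers, can be sketched as follows. The sketch assumes the implementation rotates through each bufferset in order, which is only one possible choice (it was noted above that implementations need not be constrained this way). The types reorg_side and slot and the function next_slot are hypothetical.

```c
#include <assert.h>

/* Hypothetical bookkeeping for one side of a data reorg. */
typedef struct {
    unsigned nbuffers;   /* buffers in this side's bufferset */
    unsigned long calls; /* number of put/get calls made so far */
} reorg_side;

typedef struct {
    unsigned long dataset; /* which dataset (application iteration) */
    unsigned buffer;       /* buffer index within this side's bufferset */
} slot;

/* Returns the dataset number and buffer index used by the next put or get
 * on this side, then advances the call counter. The dataset number depends
 * only on the call count; the buffer index is just where it happens to
 * live on this side. */
static slot next_slot(reorg_side *side)
{
    assert(side != 0 && side->nbuffers > 0);
    slot s = { side->calls, (unsigned)(side->calls % side->nbuffers) };
    side->calls++;
    return s;
}
```

With 2 buffers on the source side and 3 on the destination side, the fourth call on each side refers to dataset 3 on both sides, but to buffer 1 on the source and buffer 0 on the destination, which keeps buffer identification separate from buffer contents.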
examination of selected charts that describe different application
designs (chart #s 8-13, 17-18)
-
case: pipeline / in-place-processing / out-of-place data reorg (chart #9)
-
consider the single-threaded environment sub-case:
-
middle process set (labeled P2) is of interest
-
R1-Recv->get ==> B2,0
-
Calling the "get" method of DRI_Reorg object R1-Recv will result in
buffer 0 from bufferset B2 returned to the user
-
B2,0 ==> B2,0
-
In-place processing: buffer 0 of bufferset B2 is both the input
and output buffer
-
B2,0 ==> R2-Send->put
-
Calling "put" method of DRI_Reorg object R2-Send: perform data
reorg communication with source buffer 0 from bufferset B2
-
Because "put" is a non-blocking, non-waiting call, no status on this communication
will be available until the next time the application calls the DRI library
(perhaps on next call to R2-Send->put?)
-
Library knows where to "recycle" the buffer so that it will next be used
by R1-Recv because of the prior call to DRI_Reorg_process_inplace that
binds R1-Recv and R2-Send
-
this case seems to work with the current DRI specification
-
case: clique / in-place processing / out-of-place data reorg (chart #11)
-
consider the single-threaded / no progress engine environment sub-case:
-
this case seems to work with the current DRI specification (if we
use DRI_Reorg_process_inplace to connect R2-Recv to R1-Send -- this is a somewhat
unconventional use of DRI_Reorg_process_inplace, but it would seem to work
here)
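The single-threaded pipeline case (chart #9) can be imitated with a small state machine. This is a toy model, not DRI code: mock_get stands in for the "get" on R1-Recv, mock_put for the "put" on R2-Send, and mock_progress for whatever the library does on entry to any call. It assumes that a pending send always completes by the next library call, which a real implementation would not guarantee.

```c
#include <assert.h>

#define NBUF 2 /* buffers in bufferset B2 (illustrative value) */

enum buf_state { BUF_FREE, BUF_USER, BUF_SENDING };

typedef struct {
    enum buf_state state[NBUF];
    unsigned next_get; /* rotation index used by the receive side */
} mock_bufferset;

/* Stand-in for library progress: every pending send completes and its
 * buffer is recycled to the receive side. Returns the number recycled. */
static int mock_progress(mock_bufferset *bs)
{
    int n = 0;
    for (int i = 0; i < NBUF; i++) {
        if (bs->state[i] == BUF_SENDING) {
            bs->state[i] = BUF_FREE;
            n++;
        }
    }
    return n;
}

/* Stand-in for "get" on R1-Recv: returns a buffer index, or -1 if the
 * next buffer in rotation is not available. */
static int mock_get(mock_bufferset *bs)
{
    mock_progress(bs);
    unsigned i = bs->next_get;
    if (bs->state[i] != BUF_FREE)
        return -1;
    bs->state[i] = BUF_USER;
    bs->next_get = (i + 1) % NBUF;
    return (int)i;
}

/* Stand-in for "put" on R2-Send: non-blocking, so the buffer only becomes
 * reusable at a later library call. Returns 0 on success. */
static int mock_put(mock_bufferset *bs, int i)
{
    mock_progress(bs);
    if (i < 0 || i >= NBUF || bs->state[i] != BUF_USER)
        return -1;
    bs->state[i] = BUF_SENDING;
    return 0;
}
```

The application loop is then get, process the buffer in place, put; the buffer handed to mock_put shows up again from mock_get one rotation later without the user ever naming the recycling path.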
further discussion on buffer sharing, and the role of bufferset
objects, commit and connect functionality
-
myra: thinks that the remedy to some of the existing buffer sharing problems
is to reintroduce the bufferset object (exposed to user)
-
note that constructing a bufferset DOES NOT have to imply actual creation
of memory. It is a placeholder to indicate buffer sharing (of many possible
types)
-
thinks buffersets can also establish other types of reorg arrangements
seen in some application requirements (e.g., sideways sharing -- common
source reorg, many different dest reorgs).
-
about commit/connect functionality as it pertains to buffer sharing:
-
jamie: we need a way to specify the scope of memory buffer sharing among
reorgs
-
not in favor of having a network wide commit function figuring out what
buffer sharing mechanism to use
-
we could scope the commit function to a list of DRI_Reorg objects, for
example.
-
alternatively, we could "connect" each reorg individually
-
jon: perceived problem with individual reorg connect (in earlier meetings)
was that there is a danger of deadlock if user calls the functions in the
wrong order.
-
one approach that could solve the problem is a 2-stage connect (first stage
is a non-blocking call that registers a DRI_Reorg's information with the
library, second call blocks)
-
a second approach is a "lazy evaluation" in which the application
connects the individual DRI_Reorg objects (those calls having non-blocking
semantics) and on a DRI_Reorg object's first use (get/put), then the connection
could be finalized
-
zhenqian: in favor of a model that has the user pass the same bufferset
object reference as an input parameter to each call to DRI_Reorg_create
-
jon: we can't allocate memory for the bufferset shared by DRI_Reorg objects
until the sharing relationship is finalized.
-
each reorg will have different buffer size and alignment requirements.
-
the idea is to pick the "worst case" among all reorg requirements, and
then allocate according to that case.
-
ken: recall that the old approach (when we had a DRI_Bufferset object in
the API) required the user to call DRI_Bufferset_create with an "nbytes"
input parameter
-
This of course is bad in the context that Jon just described (without having
additional API to acquire the needed buffer size information from a list
of DRI_Reorg objects)
-
This is one of the reasons why we ended up leaning toward a network wide
DRI_Commit function
-
What is new from this meeting is that we can consider bufferset creation
without having to specify the size initially
-
scoping the level of connect/commit in conjunction with the modified bufferset
creation approach may be an appropriate solution
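The "worst case" selection described above amounts to taking a maximum over the reorgs that share a bufferset. A minimal sketch follows, assuming each reorg reports a byte count and a power-of-two alignment; the type buf_req and the function worst_case are hypothetical, and rounding the size up to the alignment is an added assumption so that back-to-back buffers stay aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-reorg buffer requirement. */
typedef struct {
    size_t nbytes;    /* bytes needed per buffer */
    size_t alignment; /* required alignment, assumed to be a power of two */
} buf_req;

/* Returns a requirement that satisfies every entry in reqs[0..n-1]. */
static buf_req worst_case(const buf_req *reqs, size_t n)
{
    buf_req w = { 0, 1 };
    for (size_t i = 0; i < n; i++) {
        assert(reqs[i].alignment != 0 &&
               (reqs[i].alignment & (reqs[i].alignment - 1)) == 0);
        if (reqs[i].nbytes > w.nbytes)
            w.nbytes = reqs[i].nbytes;
        /* For powers of two, the largest alignment satisfies all others. */
        if (reqs[i].alignment > w.alignment)
            w.alignment = reqs[i].alignment;
    }
    /* Round the size up to a multiple of the chosen alignment. */
    w.nbytes = (w.nbytes + w.alignment - 1) & ~(w.alignment - 1);
    return w;
}
```

Because the result depends on every sharing reorg, it cannot be computed until the sharing relationship is final, which is why a size supplied up front at bufferset creation is problematic.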
discussion of candidate approaches to address buffer sharing,
commit/connect functionality
-
there is no resolution as to which of the approaches presented below
will be used (if any) in the official DRI API. The goal of the discussion
was only to get some technical ideas on the table
-
dennis: we could take an inverse specification approach (a call to create
a bufferset that gives a list of created DRI_Reorgs that will share it)
-
this approach uses a single line of code to express a buffer sharing relationship
-
some in group gravitating toward the following type of approach:
-
DRI_Bufferset_create (..., &bufset1) (provide no buffer sizing details)
-
DRI_Reorg_create(..., bufset1, &reorg1) (bufset1 attributes get modified
to reflect "worst case" sizing of all of the reorgs affiliated with it
so far)
-
DRI_Reorg_create(..., bufset1, &reorg2) (bufset1 attributes get modified
to reflect "worst case" sizing of all of the reorgs affiliated with it
so far)
-
...
-
<must perform all DRI_Reorg_create calls before calling a series of
DRI_Reorg_connect calls, each scoped to a single DRI_Reorg object>
-
DRI_Reorg_connect (..., reorg1);
-
DRI_Reorg_connect (..., reorg2);
-
...
-
dennis: believes we need to transform the API so it doesn't use a network
wide commit before we finish DRI 1.0.
-
candidate approach #1
-
summary: using per-Reorg connect calls, and DRI_Bufferset objects:
-
for system-allocated memory:
-
DRI_Bufferset_create (num_buffers, &bs_handle)
-
DRI_Reorg_create(network, side, name, dataspec, group, dist, bs_handle,
options, &reorg1) /* internal state of bs_handle is modified to reflect
data sizes associated with reorg1 */
-
DRI_Reorg_create(network, side, name, dataspec, group, dist, bs_handle,
options, &reorg2) /* internal state of bs_handle is modified to reflect
worst case data size among reorg1 and reorg2 */
-
DRI_Reorg_connect(reorg1); /* reorg1 notices that the memory for its bufferset,
bs_handle, has not yet been allocated. So, memory allocation is performed
at this time */
-
DRI_Reorg_connect(reorg2); /* reorg2 connect notices that the memory for
its bufferset, bs_handle, has been allocated already in a prior connect
call */
-
<no longer need DRI_Reorg_process_inplace to express in-place processing
buffer sharing relationships>
-
In above examples, we need to consider what happens when user wants to
express NO sharing between the reorgs created
-
dennis: suggests a different version of DRI_Reorg_create that would tell
the system to create a distinct bufferset internally
-
application would have to pass num_buffers input argument instead of a
bufferset handle
-
benefit: user does not have to go through cumbersome approach of constructing
a DRI_Bufferset object for each DRI_Reorg object
-
For user-allocated buffers:
-
DRI_Bufferset_create (num_buffers, &bs_handle)
-
DRI_Reorg_create(network, side, name, dataspec, group, dist, bs_handle,
options, &reorg1)
-
DRI_Reorg_create(network, side, name, dataspec, group, dist, bs_handle,
options, &reorg2)
-
DRI_Bufferset_get_size (bs_handle, &nbytes)
-
<user allocates num_buffers regions of memory, each of size nbytes>
-
DRI_Bufferset_bind(bs_handle, ptrs); /* ptrs is declared as void *ptrs[num_buffers] */
-
DRI_Reorg_connect(reorg1); /* this function notices that bs_handle [referenced
by reorg1] has user memory bound to it already, and so there is no memory
allocation to be done */
-
DRI_Reorg_connect(reorg2);/* this function notices that bs_handle [referenced
by reorg2] has user memory bound to it already, and so there is no
memory allocation to be done */
-
<no longer need DRI_Reorg_process_inplace to express in-place processing
buffer sharing relationships>
-
In above examples, we need to consider clique connections
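The lifecycle in candidate approach #1 can be mocked up to check that the two memory cases behave as the comments above describe. The sketch below is not the DRI API: the sketch_ types and the shortened function names are invented, each reorg's data size is passed in directly as `need`, and the allocations counter exists only so the behavior can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    unsigned num_buffers;
    size_t   nbytes;      /* worst-case size among attached reorgs */
    void   **bufs;        /* NULL until allocated or bound */
    int      user_bound;  /* nonzero if the memory came from the user */
    int      allocations; /* library allocations performed so far */
} sketch_bufset;

typedef struct { sketch_bufset *bs; int connected; } sketch_reorg;

/* Create a bufferset without any sizing information. */
static void bufset_init(sketch_bufset *bs, unsigned num_buffers)
{
    *bs = (sketch_bufset){ num_buffers, 0, NULL, 0, 0 };
}

/* Attach a reorg; the bufferset tracks the worst case seen so far. */
static void reorg_create(sketch_reorg *r, sketch_bufset *bs, size_t need)
{
    r->bs = bs;
    r->connected = 0;
    if (need > bs->nbytes)
        bs->nbytes = need;
}

/* User-allocated case: bind caller-owned buffers before connecting. */
static void bufset_bind(sketch_bufset *bs, void **ptrs)
{
    bs->bufs = ptrs;
    bs->user_bound = 1;
}

/* Connect one reorg; allocate only if no memory is attached yet. */
static int reorg_connect(sketch_reorg *r)
{
    sketch_bufset *bs = r->bs;
    if (bs->bufs == NULL) {
        if (bs->num_buffers == 0 || bs->nbytes == 0)
            return -1;
        void **bufs = calloc(bs->num_buffers, sizeof *bufs);
        if (bufs == NULL)
            return -1;
        for (unsigned i = 0; i < bs->num_buffers; i++) {
            bufs[i] = malloc(bs->nbytes);
            if (bufs[i] == NULL) {       /* undo partial allocation */
                while (i > 0)
                    free(bufs[--i]);
                free(bufs);
                return -1;
            }
        }
        bs->bufs = bufs;
        bs->allocations++;
    }
    r->connected = 1;
    return 0;
}

/* Release library-owned memory; user-bound memory is left alone. */
static void bufset_free(sketch_bufset *bs)
{
    if (bs->bufs != NULL && !bs->user_bound) {
        for (unsigned i = 0; i < bs->num_buffers; i++)
            free(bs->bufs[i]);
        free(bs->bufs);
    }
    bs->bufs = NULL;
}
```

In the system-allocated case only the first connect allocates; in the user-allocated case no connect allocates. The mock also shows why all creates must precede the first connect: a reorg attached afterwards could raise the worst-case size beyond what was already allocated.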
-
candidate approach #2 (suggested by jon)
-
summary: like approach #1 above, but removes DRI_Network input parameter
from DRI_Reorg_create
-
for system-allocated memory:
-
DRI_Bufferset_create (num_buffers, &bufferset_handle);
-
DRI_Reorg_create(side, name, dataspec, group, dist, bufferset_handle, options,
&reorg_s);
-
DRI_Reorg_create(side, name, dataspec, group, dist, bufferset_handle, options,
&reorg_r);
-
DRI_Reorg_connect (reorg_s);
-
DRI_Reorg_connect (reorg_r);
-
for user-allocated memory:
-
DRI_Bufferset_create (num_buffers, &bufferset_handle);
-
DRI_Reorg_create(side, name, dataspec, group, dist, bufferset_handle, options,
&reorg_s);
-
DRI_Reorg_create(side, name, dataspec, group, dist, bufferset_handle, options,
&reorg_r);
-
DRI_Bufferset_get_size(bufferset_handle, &nbytes);
-
<user allocates num_buffers regions of memory, each of size nbytes>
-
DRI_Bufferset_bind(bufferset_handle, ptrs); /* ptrs is declared as void *ptrs[num_buffers] */
-
DRI_Reorg_connect(reorg_s);
-
DRI_Reorg_connect(reorg_r);
-
myra suggests DRI_Reorg_create_send (name, ...) and DRI_Reorg_create_recv
(name, dataspec, ..), so that the "side" input parameter can be removed
-
candidate approach #3 (suggested by jon)
-
summary:
-
no "side" argument to DRI_Reorg_create (replaced by 2 different calls
- DRI_Reorg_create_send, and DRI_Reorg_create_recv)
-
DRI_Bufferset_create call is optional.
-
No bufferset input parameter to DRI_Reorg_create.
-
List of DRI_Reorgs supplied as input parameter to DRI_Bufferset_create
(if called at all).
-
If no bufferset is affiliated with a DRI_Reorg when DRI_Reorg_create
is called, then the function will perform a distinct memory allocation
associated only with the single DRI_Reorg object.
-
for system-allocated memory:
-
DRI_Reorg_create_send (name, dataspec, group, dist, options, &reorg_s)
-
DRI_Reorg_create_recv (name, dataspec, group, dist, options, &reorg_r)
-
DRI_Bufferset_create (nreorgs, reorg_list, &bs_handle)
-
DRI_Bufferset_get_size (bs_handle, &nbytes)
-
DRI_Reorg_connect(reorg_s)
-
DRI_Reorg_connect(reorg_r)
-
for user-allocated memory:
-
DRI_Reorg_create_send (name, dataspec, group, dist, options, &reorg_s)
-
DRI_Reorg_create_recv (name, dataspec, group, dist, options, &reorg_r)
-
DRI_Bufferset_create (nreorgs, reorg_list, &bs_handle)
-
DRI_Bufferset_get_size (bs_handle, &nbytes)
-
DRI_Reorg_connect(reorg_s)
-
DRI_Reorg_connect(reorg_r)
Miscellaneous
-
nathan: has 2 concerns, and one request for future DRI functionality:
-
concern #1: about solidifying the API now through Distribution/Blockinfo/Datapart
objects.
-
when we start thinking about dynamic functionality in DRI, it might cause
the API to look much different
-
it may not be as simple as providing modify/clone functions on the existing
objects to provide dynamic behavior
-
for example: 2 transfers executing at the same time, and they each reference
the same underlying attribute objects. Modifying an underlying attribute
object won't work because it only refers to 1 of the 2 transfers (but the
attribute change might affect both).
-
concern #2: DRI (and VSIPL and MPI/RT) try to control memory management.
MPI does not explicitly control memory
-
users can build memory management abstractions such as DRI_Reorg objects
on top of transports like MPI
-
suggestion: when we get to late binding support, consider an approach in
which the caller specifies both sides of the reorg being requested,
instead of the current one-sided "connect" approach, in which the sides
are connected by a name.
-
jamie: for example, you could layer the buffered (Reorg/channel) DRI functionality
on top of the non-buffered (no Reorg/channel) DRI capability (which is likely
to require the 2-sided specification approach nathan is requesting)
-
this would allow you to layer the 1-sided, name-based DRI_Reorg_connect
on top of such a 2-sided specification from the late binding API
-
steve: concerned that we've turned back the clock on DRI Forum progress
at this meeting (perhaps affecting our goal of finishing DRI 1.0 by January
2002)
-
ken: schedule slip risk would be due to DRI not covering important application
design cases, and not due to this meeting in particular