March 2001 Data Reorganization meeting minutes
Attendance:
| Name | Organization |
| Murali Beddhu | MPI Software Technology, Inc. |
| Ken Cain | Mercury Computer Systems, Inc. |
| Dennis Cottel | SPAWAR Systems Center, S.D. |
| Zenqian Cui | MPI Software Technology, Inc. |
| Jon Greene | Mercury Computer Systems, Inc. |
| Randy Judd | SPAWAR Systems Center, S.D. |
| James Lebak | MIT Lincoln Laboratory |
| Steve Paavola | Sky Computers |
| Anthony Skjellum | Mississippi State University / MPI Software Technology, Inc. |
| Brian Sroka | The MITRE Corporation |
Steve drew attention to several areas in the document needing interpretation
or correction. The group addressed these issues, which are outlined
here:
DRI_Init
-
the following problems (observed at prior DR meeting) have not been fixed
in the document:
-
order of arguments needs to be reversed (argvp, argcp ==> argcp, argvp)
-
datatype of argv parameter should be changed (char **argvp ==> char ***argvp)
-
group confirms the changes are needed
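A minimal C sketch of the corrected DRI_Init parameter list (argcp first, then char ***argvp). The extra level of indirection lets an implementation edit or replace the caller's argument vector, in the same way MPI_Init does. The function name sketch_dri_init and the "--dri-" argument prefix are hypothetical; the minutes do not say which arguments, if any, DRI_Init consumes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for DRI_Init using the corrected order and types.
 * The "--dri-" prefix is invented for illustration only. */
int sketch_dri_init(int *argcp, char ***argvp)
{
    assert(argcp != NULL && argvp != NULL && *argvp != NULL);

    char **argv = *argvp;
    int kept = 0;

    for (int i = 0; i < *argcp; i++) {
        /* Keep the program name and every argument that is not ours. */
        if (i == 0 || strncmp(argv[i], "--dri-", 6) != 0) {
            argv[kept++] = argv[i];
        }
    }
    argv[kept] = NULL;  /* argv[argc] is NULL by C convention; kept <= argc */
    *argcp = kept;      /* report the reduced count back to the caller */
    return 0;           /* 0 means success in this sketch */
}
```

This sketch compacts the vector in place; with char ***argvp an implementation could equally store a pointer to a new vector in *argvp, which a plain char **argvp parameter would not allow.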
DRI_Partition_create_whole function and/or DRI_PARTITION_WHOLE
built-in object:
-
function produces an object equivalent in semantics to DRI_PARTITION_WHOLE
-
should we only allow 1 way to specify "whole" (unpartitioned) datasets?
-
steve: suggests one of the following 2 approaches:
-
remove the function, leave the built-in object
-
keep the function (and add more arguments to make it more meaningful),
and remove the built-in object
-
dennis: future DRI versions may allow setting object attributes (including
DRI_Partition) -- in that (likely) event, we would want to have both approaches
available to the user
-
group decides to keep both specification approaches
Other "shortcut" opportunities by defining additional built-in
objects:
-
jon: Let's create a built-in object for block partitionings, called DRI_PARTITION_BLOCK.
Implies the following:
-
minsz = 1
-
mod = 1
-
left overlap elements = 0
-
right overlap elements = 0
-
can't really do the same thing for block-cyclic, since its blocksize parameter
can never be assigned a (usable) default
-
group agrees to add DRI_PARTITION_BLOCK built-in object
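The defaults implied by the proposed DRI_PARTITION_BLOCK object can be written down as a sketch. The struct, its field names, and the helper below are hypothetical and are not the DRI types; only the four default values come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a block partition. Field names follow the
 * parameters listed in the minutes; this is not the DRI_Partition type. */
typedef struct {
    unsigned int minsz;          /* "minsz" parameter */
    unsigned int mod;            /* "mod" parameter */
    unsigned int left_overlap;   /* left overlap elements */
    unsigned int right_overlap;  /* right overlap elements */
} sketch_partition;

/* Values implied by the proposed DRI_PARTITION_BLOCK built-in object. */
static const sketch_partition SKETCH_PARTITION_BLOCK = { 1u, 1u, 0u, 0u };

/* True when a partition description matches the built-in block defaults. */
bool sketch_partition_is_plain_block(const sketch_partition *p)
{
    assert(p != NULL);
    return p->minsz == SKETCH_PARTITION_BLOCK.minsz
        && p->mod == SKETCH_PARTITION_BLOCK.mod
        && p->left_overlap == SKETCH_PARTITION_BLOCK.left_overlap
        && p->right_overlap == SKETCH_PARTITION_BLOCK.right_overlap;
}
```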
DRI_Partition_create API documentation:
-
steve: document should be clear about what "whole" partitionings mean in
terms of the key parameters:
-
minsz
-
mod
-
left overlap
-
right overlap
-
steve: make sure that documentation (esp. for "block size" references)
is clear that this operation is performed in a single-process context
-
api document will be modified according to these suggestions
DRI_Layout_create typo
-
steve: there is no output parameter
-
api document will fix this problem
DRI_Layout -- "packed" is a dimensionless concept
-
group: observes that current API requires DRI_LAYOUT_PACKED_<N> specifications
in each dimension, but that this is not necessary
-
group agrees to change DRI_Layout_create accordingly
final form of DRI_Layout_create depends on the results of other Layout-related
discussion, and is presented elsewhere in these minutes
-
group will investigate re-defining DRI_Layout object to apply to all
dimensions
DRI_Layout packing order -- "I don't care" option
-
steve: in some applications, the user needs 1 dimension of the data to
be most contiguous. Beyond that, it is not important which subsequent dimension
is packed more contiguously than the others. Would like a "don't care"
flag
-
group considers DRI_LAYOUT_PERMUTE_NOPREFERENCE flag
DRI_Layout: DRI_LAYOUT_UNIFORM specification -- revisited and
transformed
-
There was a lot of discussion on this issue, and Ken was tasked to complete
the analysis after the meeting -- here are the results
-
DRI_LAYOUT_UNIFORM was not correctly specified...
-
Here are 2 properties of local data storage that we really want to control:
-
to align the first data element in a dimension according to a user-specified
byte-alignment
(this is mainly a processing-motivated optimization)
-
to make the size of the local storage in a dimension a multiple of
a specified number of bytes (by padding the storage of the dimension beyond
the extent of the "assigned" data from a data distribution)
(this is mainly a communication-motivated optimization -- can perform
transfers completely using DMA [no programmed I/O] if the local storage
is a multiple of the size required by the DMA engine)
-
The 2 properties are related, and we can probably use one (per-dimension)
byte alignment parameter to effectively specify these properties:
-
Assume that buffers can be created (separately) with a user-specified byte-alignment
(e.g., in Bufferset_create)
-
The DRI_Layout per-dimension byte alignment parameter specifies a
"relative" alignment (to the beginning of the buffer)
-
if the buffer is suitably aligned, then we can specify very precise per-dimension
byte alignments using this relative parameter
-
there are some (not easy to describe) rules regarding the selection of
the byte alignment values
-
e.g., byte-alignments should be equal or higher (but not lower) with increasing
layout "order"
-
The second local memory property can be controlled (force dimension's storage
length to be a multiple of a specified number of bytes) by specifying byte
alignment parameter at least as large as that required by the system's
DMA facility.
-
The user needs these 3 levels of control:
-
Force local storage to be densely packed, with no holes between data elements
in any dimension
-
Allow the system to select the per-dimension relative alignment that should
be used
-
Explicitly specify the per-dimension relative alignment that should be
used
-
Re-designed overall DRI_Layout approach - 3 functions:
-
DRI_Layout_create_packed (unsigned int ndims, int *order[ndims])
-
DRI_Layout_create_systemaligned (unsigned int ndims, int *order[ndims])
-
DRI_Layout_create_useraligned (unsigned int ndims, int *order[ndims],
unsigned int *rel_align[ndims])
-
Individual elements in the order[] array can be DRI_LAYOUT_ORDER_DEFAULT,
or can be a value in the range 0..(ndims-1)
-
implementors are advised to select a negative integer constant for DRI_LAYOUT_ORDER_DEFAULT
-
It is possible for the user to specify DRI_LAYOUT_ORDER_DEFAULT in every
position in the order[] array
-
For order[] array parameters with a mixture of DRI_LAYOUT_ORDER_DEFAULT
and legitimate dimension numbers, there must be at least one entry such
that order[j] = 0
-
The user can choose to bypass the entire DRI_Layout_create_ process, and
pass a built-in DRI_Layout object named DRI_LAYOUT_DEFAULT to a later DRI_Distribution_create()
call
-
This layout says that we want to use the GDO dimension number as the layout
order parameter (i.e., global data dimension 0 runs fastest in linear memory,
etc.)
-
The gory details of this approach will be presented in a proposal embedded
in the API document, to be reviewed at May 2001 meeting
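The rules stated above for the order[] array can be checked mechanically. The following is a sketch only: the constant value -1, the dimension limit, and the rejection of duplicate dimension numbers are assumptions that the minutes do not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value: the minutes only advise a negative integer constant. */
#define SKETCH_LAYOUT_ORDER_DEFAULT (-1)
/* Arbitrary limit chosen for this sketch. */
#define SKETCH_MAX_DIMS 16

/* Returns true when order[] follows the rules recorded in the minutes:
 * each entry is the DEFAULT marker or a value in 0..(ndims-1), all entries
 * may be DEFAULT, and any explicit entries must include a 0. */
bool sketch_order_is_valid(unsigned int ndims, const int *order)
{
    assert(order != NULL && ndims <= SKETCH_MAX_DIMS);

    bool seen[SKETCH_MAX_DIMS] = { false };
    unsigned int explicit_count = 0;
    bool has_zero = false;

    for (unsigned int i = 0; i < ndims; i++) {
        int v = order[i];
        if (v == SKETCH_LAYOUT_ORDER_DEFAULT) {
            continue;                       /* "don't care" position */
        }
        if (v < 0 || (unsigned int)v >= ndims) {
            return false;                   /* outside 0..(ndims-1) */
        }
        if (seen[v]) {
            return false;                   /* assumption: no duplicates */
        }
        seen[v] = true;
        explicit_count++;
        if (v == 0) {
            has_zero = true;
        }
    }
    /* All-DEFAULT is allowed; otherwise some order[j] must equal 0. */
    return explicit_count == 0 || has_zero;
}
```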
API change needed to determine buffer sizes
-
steve: how to accomplish buffer sizing when layout alignments have been
specified
-
data types are not passed to DRI_Distribution_create. However, DRI_Layout
is a parameter (and, now, that Layout object can contain byte-alignment
values)
-
DRI_Distribution_get_local_count returns number of elements that need to
be stored in local buffers. Previous approach to sizing buffers was for
user to get this element count, then multiply by the datatype size. Now,
we need to factor in the impact of the byte-alignment specification made
in the DRI_Layout.
-
group agreed to add a new function:
-
DRI_Distribution_get_buffersize (DRI_Distribution *dist, DRI_Dataspec
*dataspec, unsigned int *nbytes)
-
NOTE: This issue was revisited later, and the API changed in the "channel
linking" discussion!!!
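To illustrate why the element count times the datatype size is no longer enough, here is a sketch for a two-dimensional local block whose fastest-running dimension is padded to a byte multiple. The function names and the single row_align parameter are hypothetical; the actual padding rules are deferred to the DRI_Layout proposal and may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Round nbytes up to the next multiple of align (align must be non-zero). */
size_t sketch_round_up(size_t nbytes, size_t align)
{
    assert(align > 0);
    return ((nbytes + align - 1) / align) * align;
}

/* Bytes needed for a rows x cols local block when each row's storage is
 * padded to a multiple of row_align bytes. Hypothetical sizing rule. */
size_t sketch_buffersize_2d(size_t rows, size_t cols,
                            size_t elem_size, size_t row_align)
{
    size_t row_bytes = sketch_round_up(cols * elem_size, row_align);
    return rows * row_bytes;
}
```

For example, 10 rows of 25 four-byte elements occupy 1000 bytes when packed, but 1280 bytes when each row is padded to a 64-byte multiple.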
A problem that may or may not yet be solved with the new DRI_Layout
approach
-
steve: demonstrated a matrix transpose case in which the amount of padding
needed in local buffers (to satisfy alignment requirements) is a function
of the number of processes in the DRI_Group on the other "side" of the
data reorg
-
making it impossible to appropriately size the buffers prior to channel_connect
Clique Transpose: memory savings using application buffers as
temp storage during the reorg (Can't determine buffer sizes before
channel_connect)
-
Assume a clique matrix transpose implementation works in the following
way to save memory:
-
2 application buffers are created (size = ceiling of size requirements
on source and destination "sides" of the reorg)
-
step 1: local out-of-place transpose (source_buf => dest_buf)
-
step 2: collective communication (dest_buf => source_buf)
-
step 3: local out-of-place data reordering to complete the transpose behavior
(source_buf => dest_buf)
-
The application source buffer gets overwritten, but many applications are
written this way
-
PROBLEM: to appropriately size the source/dest application buffers, we
need to wait until channel_connect time, when the details of the "other
side" of the data reorg are made known in a rendez-vous.
-
group agrees that this problem must be handled by temporary buffers
created and managed internally by the channel
DRI_Distribution_create
-
Problem: no function ever takes a DRI_Group object as an input parameter!
-
We did modify the API to remove DRI_Group from DRI_Distribution parameter
list because we wanted to "cleanly" separate CORE API components from "Middleware
Adapter" components (DRI_Group is a MW adapter component).
-
Logical place to input the DRI_Group parameter is in DRI_Distribution_create
-
participants consider how to make this work, considering CORE/Adapter issues
-
DRI_Group_get_rank, DRI_Group_get_size remain in the API and their meanings
do not change
-
DRI_Group_create should not be supported
-
We assume the existence of process group technology (e.g., MPI, MPI/RT,
PAS, SCL, etc.)
-
DRI_Group_import and DRI_Group_export can be defined to map between DRI_Group
type and supporting IPC middleware infrastructure
-
DRI_Distribution_create API remains "fixed" even though it has a "middleware
adapter" object as an input parameter
-
implementations cannot change this API to accept MPI communicators, MPI/RT
group arguments, etc.
group decides that this design will work well, and that DRI_Group can
become an input parameter to DRI_Distribution_create. DRI_Distribution_create
no longer needs to take "myrank" and "group_size" integer parameters
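A sketch of how this design might fit together: a middleware adapter fills in a group object, and the distribution constructor reads rank and size from it instead of taking separate integers. Every type and function name below is a hypothetical stand-in; the minutes do not give signatures for DRI_Group_import, DRI_Group_get_rank, DRI_Group_get_size, or the revised DRI_Distribution_create.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for whatever the IPC middleware exposes (e.g. values read
 * from an MPI communicator). Hypothetical. */
typedef struct { int my_rank; int num_procs; } sketch_mw_handle;

/* Stand-ins for DRI_Group and DRI_Distribution. Hypothetical. */
typedef struct { int rank; int size; } sketch_group;
typedef struct { int rank; int size; unsigned int global_len; } sketch_distribution;

/* Adapter side: map a middleware handle onto a group. Returns 0 on success. */
int sketch_group_import(const sketch_mw_handle *mw, sketch_group *group)
{
    assert(mw != NULL && group != NULL);
    if (mw->num_procs <= 0 || mw->my_rank < 0 || mw->my_rank >= mw->num_procs) {
        return -1;  /* inconsistent middleware information */
    }
    group->rank = mw->my_rank;
    group->size = mw->num_procs;
    return 0;
}

int sketch_group_get_rank(const sketch_group *group, int *rank)
{
    assert(group != NULL && rank != NULL);
    *rank = group->rank;
    return 0;
}

int sketch_group_get_size(const sketch_group *group, int *size)
{
    assert(group != NULL && size != NULL);
    *size = group->size;
    return 0;
}

/* Core side: takes the group, so no myrank/group_size integers are needed. */
int sketch_distribution_create(const sketch_group *group,
                               unsigned int global_len,
                               sketch_distribution *dist)
{
    assert(group != NULL && dist != NULL);
    sketch_group_get_rank(group, &dist->rank);
    sketch_group_get_size(group, &dist->size);
    dist->global_len = global_len;
    return 0;
}
```

The core function signature mentions only the group type, so it stays the same whichever middleware sits behind the import call.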
More DRI_Distribution_create issues
-
steve: documentation is not clear that collective scope of this function
is over the process group on a single "side" of the reorg
-
document will be improved to be clear on this point
DRI_Distribution_create_simple
-
recall this allows one to create a distribution without specifying a DRI_Layout
(i.e., asking for a "default", GDO-ordered layout)
-
group agrees to kill this function altogether
-
NOTE: DRI_Layout proposed changes will have a DRI_LAYOUT_DEFAULT pre-defined
object that can be passed to DRI_Distribution_create
Channel linking (sharing the same bufferset over multiple channels)
-
this is important for applications that need to save memory
-
it is often used in clique data-parallel applications
-
common approach is to "toggle" between 2 buffers over a series of consecutive
data-parallel processing stages
-
2 ways to design the interface
-
user control of buffer state - user supplies the same bufferset to the
Channel_create functions (one for each channel sharing the bufferset),
and uses care when calling Channel_get and Channel_put with all the resulting
channels.
-
Channel control of buffer state - user may only perform a Channel_get on
a receive channel, and may only perform a Channel_put on a send channel.
The library enforces these rules by generating errors.
-
group decides to have channels control buffer state
-
Summary of approach discussed at the meeting for chaining channels together
with a single bufferset:
-
When calling DRI_Channel_create for all affected channels, pass DRI_BUFFERSET_NULL
-
Call DRI_Channel_link with an array of DRI_Channel objects, stored in
the order in which the application will use them, and a DRI_Bufferset object
that will be shared by the channels to allow in-place processing.
-
To determine the size and alignment of the buffers in the bufferset,
we need to look at all associated DRI_Distributions and DRI_Dataspecs
-
DRI_Distribution_get_bufparams (int numdists, DRI_Distribution *dists,
DRI_Dataspec *dataspecs, int *max_bufsize, int *max_align)
-
This will replace DRI_Distribution_get_local_count
-
DRI_Bufferset_system_create (int nbufs, int buffersize, int byte_alignment,
DRI_Bufferset *bufset);
-
DRI_Bufferset_user_create (int nbufs, int buffersize, int byte_alignment,
void **user_allocated_buffers, DRI_Bufferset *bufset);
-
Dennis agrees to draft a proposal -- see March 13 and March 31 emails
to data-reorg@data-re.org mailing list on this topic
-
Group also discussed how to set up a chain of mixed in-place and out-of-place
processing channels
-
Group also discussed how to associate a bufferset with 2 channels that
are not used immediately in succession
-
e.g., in a clique, setting up a "toggling" between 2 buffersets -- here,
a bufferset is associated with processing stage 0, 2, 4, ... (and the other
bufferset associated with 1,3,5, ...)
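Two of the ideas above can be sketched in a few lines: the library-enforced rule that a get is only legal on a receive channel and a put only on a send channel, and the "toggling" assignment of two buffersets to alternating processing stages. All names and error codes are hypothetical; the real DRI_Channel_get and DRI_Channel_put take more arguments than shown here.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_SEND, SKETCH_RECV } sketch_direction;
typedef enum { SKETCH_OK = 0, SKETCH_ERR_DIRECTION = -1 } sketch_status;

/* Hypothetical channel that only records its direction. */
typedef struct { sketch_direction dir; } sketch_channel;

/* A get is accepted only on a receive channel. */
sketch_status sketch_channel_get(const sketch_channel *ch)
{
    assert(ch != NULL);
    return ch->dir == SKETCH_RECV ? SKETCH_OK : SKETCH_ERR_DIRECTION;
}

/* A put is accepted only on a send channel. */
sketch_status sketch_channel_put(const sketch_channel *ch)
{
    assert(ch != NULL);
    return ch->dir == SKETCH_SEND ? SKETCH_OK : SKETCH_ERR_DIRECTION;
}

/* Toggling: stages 0, 2, 4, ... use bufferset 0; stages 1, 3, 5, ... use 1. */
unsigned int sketch_bufferset_for_stage(unsigned int stage)
{
    return stage % 2u;
}
```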
Load Leveling
-
steve: would be nice to be able to give work to the "next available" processor
or process-set. Dynamically dispatch work. Also allow the data ("work")
to vary in size over time
DRI_Channel_get: request to also return the associated
DRI_Blockinfo structure
-
steve: currently DRI_Channel_get only returns a DRI_Buffer_Id. For convenience,
it should also return the corresponding DRI_Blockinfo structure. This will
reduce the lines of code needed before being able to traverse the local
data block.
Channel_connect and potential for deadlock in applications
-
Consider the following scenario:
-
3 process sets (1, 2, and 3)
-
set 1 ==> set 2 is connected by channel "red"
-
set 2 ==> set 3 is connected by channel "blue"
-
set 3 ==> set 1 is connected by channel "green"
-
User must perform DRI_Channel_connect in the following way to avoid deadlock:
-
On set 1 nodes:
-
DRI_Channel_connect(red)
-
DRI_Channel_connect(green)
-
On set 2 nodes:
-
DRI_Channel_connect(red)
-
DRI_Channel_connect(blue)
-
On set 3 nodes:
-
DRI_Channel_connect(blue)
-
DRI_Channel_connect(green)
-
Some ensuing discussion about a better Channel_connect approach (2-stage)
that can prevent deadlock
-
somebody needs to write a proposal with the details of this approach
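The scenario above can be checked with a small simulation. It assumes that DRI_Channel_connect is a blocking rendezvous that completes only when both process sets on a channel have reached it; that assumption, and all the names below, are this sketch's and not part of the API document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NSETS 3      /* process sets 1, 2, 3 (indexed 0..2 here) */
#define SKETCH_NCHANNELS 3
#define SKETCH_PER_SET 2    /* each set connects exactly two channels */

enum { SKETCH_RED = 0, SKETCH_BLUE = 1, SKETCH_GREEN = 2 };

/* order[s][k] is the k-th channel that set s connects. Returns true when
 * every connect completes, false when the sets are stuck waiting. */
bool sketch_connects_complete(int order[SKETCH_NSETS][SKETCH_PER_SET])
{
    int next[SKETCH_NSETS] = { 0 };   /* index of each set's pending connect */
    bool progress = true;

    assert(order != NULL);
    while (progress) {
        progress = false;
        for (int ch = 0; ch < SKETCH_NCHANNELS; ch++) {
            int waiting[SKETCH_NSETS];
            int nwaiting = 0;
            for (int s = 0; s < SKETCH_NSETS; s++) {
                if (next[s] < SKETCH_PER_SET && order[s][next[s]] == ch) {
                    waiting[nwaiting++] = s;
                }
            }
            if (nwaiting == 2) {      /* both sides arrived: rendezvous */
                next[waiting[0]]++;
                next[waiting[1]]++;
                progress = true;
            }
        }
    }
    for (int s = 0; s < SKETCH_NSETS; s++) {
        if (next[s] != SKETCH_PER_SET) {
            return false;             /* some set is blocked forever */
        }
    }
    return true;
}
```

Under this model the ordering listed in the minutes completes, because every set connects its channels in one shared order (red, then blue, then green). An ordering in which set 2 connects blue before red and set 3 connects green before blue leaves all three sets waiting on each other.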