Included below is an initial attempt at an MPI binding for the DRI.
I've tried to put in what was decided at the last meeting but no doubt
I've left out things. In addition, I've:
* made names longer (e.g., gdo => global_data, part => partition,
dist => distribution)
* introduced iterators as a way to access indices of a distributed
data structure (and gotten rid of the partition object as the
way to determine which data a process owns)
* used the partition object as the way to specify partition information
for each dimension of a distribution
* used an MPI-like approach for the communication calls
* introduced predefined block and block cyclic distributions, and
predefined packed layout
* attempted to make it easy to do simple redistributions (e.g., corner
turns)
In writing this "binding", I've taken the approach that the DRI
interface would sit side-by-side with the MPI implementation and work
in conjunction with the MPI implementation. For example, instead of
using DRI_Dataspec, I used MPI_Datatypes.
The first included file is the corner turn example that James put out
a while back. The second file is the proposal for an MPI binding for
DRI.
--
Nathan Doss
II.F Distribution
-----------------
The DRI distribution object combines the following information into
one object:
* the global data,
* the process group over which the global data is distributed,
* and partitions for each dimension.
The distribution object pulls together in one place all the
information needed to describe a distributed data distribution. This
object is used not only to describe a data reorganization operation
but also to allow the user to determine on which processor individual
pieces of a distributed data structure data are actually located.
* Constructors
The following routine is used to create a distribution object:
int DRI_Distribution_create(DRI_Global_data global_data,
DRI_Group group,
DRI_Partition partition[],
DRI_Distribution *dist);
global_data - describes the shape of the data structure to
be distributed
group - group of processors over which the data will be
distributed
partition - array of partitions. The size of this array must
equal the number of dimensions in "global_data".
dist - newly created distribution
If one or more of the partitions in the partition array was specified
as having 0 "procs", the distribution constructor will attempt to
provide a reasonable value for that particular dimension. It attempts
to select a balanced distribution of processors per coordinate
direction, depending on the number of processes in the group to be
balanced and upon the number of processors specified in the other
partitions. The dimensions are set to be as close to each other as
possible, using an appropriate divisibility algorithm. An error will
occur if the number of nodes in the group is not a multiple of the
product of non-zero "procs" in the partitions. The call will
allocate processors to each partition in non-increasing order.
For example, if two "procs" values are specified as 0 for a two
dimensional distributed data structure, the value of the
"procs" associated with the first (index 0 element in the partition
array) will be greater than or equal to the "procs" value associated
with the second entry (index 1 element in the partition array).
Although the distribution constructor associates processor counts with
partitions that were originally created with "procs" = 0, the original
partitions are not modified. Upon return from the constructor, each
of the partitions in the partition array will be as they were before
the call to the distribution constructor.
In the following examples, assume we have been given:
Object Contents
------------ ----------------------------------
Global data 3 dims, 100 x 500 x 10
Group 20 processors
Example 1:
Object Contents
------------ ----------------------------------
Partitions { p1(procs=0), p2(procs=5), p3(procs=2) }
Creating the distribution object will assign 2 processors to the
p1 partition.
Example 2:
Object Contents
------------ ----------------------------------
Partitions { p1(procs=0), p2(procs=0), p3(procs=0) }
Creating the distribution object will assign: p1(procs=5),
p2(procs=2), p3(procs=2).
Example 3:
Object Contents
------------ ----------------------------------
Partitions { p1(procs=3), p2(procs=2), p3(procs=0) }
An error will be generated since the product of the number of
processors(20) is not a multiple of the product of non-zero
procs(6).
Example 4:
Object Contents
------------ ----------------------------------
Partitions { p1(procs=5), p2(procs=4), p3(procs=0) }
Creating the distribution object will assign: p3(procs=1).
* Destructors
The following routine frees a distribution. :
int DRI_Distribution_free(DRI_Distribution *dist);
After this call returns, the value of dist will equal
DRI_DISTRIBUTION_NULL;
* Accessors
DRI distributions require a wide range of accessors. These
accessors answer questions such as:
- What are the objects that make up this distribution (i.e.,
what were the objects passed into the constructor)?
- How much data do I own? How much space does a processor
need to allocate for a particular distribution? How much
data does a particular processor have?
- Who has a certain element (e.g., who owns array element (3,2,4)
of a 3 dimensional array)? What elements does a certain
processor own?
The following routines return information about the arguments
passed into the distribution constructor:
int DRI_Distribution_get_global_data(DRI_Distribution dist,
DRI_Global_data *gd);
int DRI_Distribution_get_group(DRI_Distribution dist,
DRI_group *group);
int DRI_Distribution_get_partition(DRI_Distribution dist,
DRI_partition partition[]);
The following calls are used to determine how much data individual
processors hold (data with overlap and pad) and own (data without
overlap or pad).
int DRI_Distribution_get_count(DRI_Distribution dist,
int rank, int *count);
dist - the distribution
rank - rank of the processor we want to find out about
count - how many elements are stored by processor "rank".
This includes overlap or pad. The count is not a byte
count -- it is a count of how many elements relative to
the datatype specified during construction of the global
data.
int DRI_Distribution_get_owned_count(DRI_Distribution dist,
int rank, int *count);
dist - the distribution
rank - rank of the processor we want to find out about
count - how many elements are owned by processor "rank".
This does not include overlap or pad.
This function can be used in conjunction with the MPI extent function
to determine how much memory space is required by a distribution on a
particular processor.
For any distribution, an element has a local index (for each
dimension) and a global index (for each dimension). The local index
is the index relative to the processors local buffer. The global
index is the index relative to the global data structure. The
following functions are used to aid in mapping between local and
global indices:
int DRI_Distribution_location(DRI_Distribution dist,
int global[],
int *rank);
dist - the distribution
global - the global indices of the data element
rank - rank of the processor that owns the element
int DRI_Distribution_global_to_local(DRI_Distribution dist,
int rank, int global[],
int local[]);
dist - the distribution
rank - rank of the processor relative to which the mapping should
be made
global - the global indices of the data element
local - the resulting local mapping. If the data element does
not exist on the "rank" processor, DRI_UNDEFINED is
returned for each of the indices.
int DRI_Distribution_local_to_global(DRI_Distribution dist,
int rank, int local[],
int global[]);
dist - the distribution
rank - rank of the processor relative to which the mapping should
be made
local - the local indices of the data element
global - where the element exists in the global data structure
It's sometimes useful to obtain a listing of the indices in
each dimension that a particular processor either owns or
has access to (i.e., overlap/pad). Distribution iterators
are used to allow the user to iterate through the indices owned
by a particular processor.
The following functions is used to create a distribution iterator.
The constructed iterator will iterate through each index
resident in the specified processors memory (i.e., including
pad & overlap).
int DRI_Distribution_iterator_create(DRI_Distribution dist,
int rank, int dim,
DRI_Distribution_iterator *it)
dist - the distribution
rank - rank of the processor to which this iterator
will refer
dim - dimension of the global data this iterator will
cover
it - newly created distribution iterator
The following functions is used to create a distribution iterator.
The constructed iterator will iterate only through each index owned by
the specified processor (i.e., it will NOT include pad & overlap).
int DRI_Distribution_iterator_create_owned(DRI_Distribution dist,
int rank, int dim,
DRI_Distribution_iterator
*it)
dist - the distribution
rank - rank of the processor to which this iterator
will refer
dim - dimension of the global data this iterator will
cover
it - newly created distribution iterator
The following operation frees a distribution iterator.
int DRI_Distribution_iterator_free(DRI_Distribution_iterator *it)
it - distribution iterator to free. It will point to
DRI_DISTRIBUTION_ITERATOR_NULL after the routine returns.
The following operations are used to traverse an iterator.
DRI_Distribution_iterator_next increments the iterator to the next
available index and returns its value. If
DRI_Distribution_iterator_next is called on a previously unused
iterator, it returns the first index.
int DRI_Distribution_iterator_next(DRI_Distribution_iterator it,
int *index, int *flag)
it - distribution iterator
index - the "next" index. The index is a global index.
flag - is used to indicate whether or not the returned index
value is valid. If flag is 1, the end of the iterator
has not been reached -- the value of index is valid.
If flag is 0, the end of the iterator has been reached,
the value of index is undefined.
DRI_Distribution_iterator_prev decrements the iterator to the previous
index and returns its value. If DRI_Distribution_iterator_prev is
called on a previously unused iterator, it returns the last index.
int DRI_Distribution_iterator_prev(DRI_Distribution_iterator it,
int *index, int *flag)
it - distribution iterator
index - the "previous" index. The index is a global index.
flag - is used to indicate whether or not the returned index
value is valid. If flag is 1, the end of the iterator
has not been reached -- the value of index is valid.
If flag is 0, the end of the iterator has been reached,
the value of index is undefined.
The following function resets an iterator to it's original "unused"
state.
int DRI_Distribution_iterator_reset(DRI_Distribution_iterator it);
The following function is used to determine whether the current
index pointed to by the iterator is an overlap/pad index.
int DRI_Distribution_iterator_overlap(DRI_Distribution_iterator it,
int *flag)
it - distribution iterator
flag - returns 1 if the current index is overlap or pad, returns
0 otherwise
As an example of how distribution iterators might be used, consider
how one would iterate through each element of a 3 dimensional
matrix available in processor 0:
DRI_Distribution_iterator it_i, it_j, it_k;
int i, j, k;
int f_i, f_j, f_k;
DRI_Distribution_iterator_create(dist, 0, 0, &it_i);
DRI_Distribution_iterator_create(dist, 0, 1, &it_j);
DRI_Distribution_iterator_create(dist, 0, 2, &it_k);
DRI_Distribution_iterator_next(it_i,&i,&f_i);
while (f_i) {
DRI_Distribution_iterator_next(it_j,&j,&f_j);
while (f_j) {
DRI_Distribution_iterator_next(it_k,&k,&f_k);
while (f_k) {
/* do whatever to global element i,j,k */
DRI_Distribution_iterator_next(it_k,&k,&f_k);
}
DRI_Distribution_iterator_reset(it_k);
DRI_Distribution_iterator_next(it_j,&j,&f_j);
}
DRI_Distribution_iterator_reset(it_j);
DRI_Distribution_iterator_next(it_i,&i,&f_i);
}
DRI_Distribution_iterator_free(&it_i);
DRI_Distribution_iterator_free(&it_j);
DRI_Distribution_iterator_free(&it_k);
(Discussion: As a variation of this, you could have the iterator_next
and iterator_prev return blocks of information. In other words,
instead of returning a single index, it would return two indices --
begin and end.)
II.G Communication
------------------
This section details the routines used for performing the actual
data reorganizations from one distribution to another.
The basic calls used to perform a redistribution operation
are:
int DRI_Send(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist)
buffer - buffer containing data local to the processor
tag - tag value must match on sender and receiver
local_dist - describes the distribution of data on the
sending side of the reorganization operation
remote_dist - describes the distribution of data on the
receiving side of the reorganization operation
int DRI_Recv(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist)
buffer - buffer containing data local to the processor
tag - tag value must match on sender and receiver
local_dist - describes the distribution of data on the
receiving side of the reorganization operation
remote_dist - describes the distribution of data on the
sending side of the reorganization operation
These functions require that the local distribution of the
sender match the remote distribution of the receiver and
vice-versa. The DRI_Send call has semantics simlar to the
standard MPI_Send call with respect to buffering requirements.
(To do: Describe "safe" and "unsafe" situations with simply
using DRI_Send and DRI_Recv. For example, if the groups
of the local and remote distributions are the same, calling
DRI_Send followed by DRI_Recv may or may not work -- i.e.,
it's an unsafe program. Whether it worked or not would
depend on how it's implemented. One possible implementation
is to issue a bunch of MPI_Isend calls followed by a
MPI_Waitall. If the MPI implementation has enough buffer
space, the call will work. If not, the call will fail.)
(Rationale: Both the local and remote distribution information
is specified by the sender and receiver. It's possible to
provide a layered library that does not require distribution
information about both sides. The layered library could use
a name server approach to finding out the remote distribution
and then use these calls to actually implement the redistribution)
In addition to the standard mode DRI_Send, the DRI interface
has 3 additional modes that correspond to the buffered, ready,
and synchronous MPI send modes:
int DRI_Bsend(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist)
This function has similar semantics to that of MPI_Bsend. It either
uses an MPI_Bsend variant to implement itself or (if the
implementation is by the MPI vendor) uses the buffer space provided to
the implementation by MPI_Buffer_attach.
int DRI_Ssend(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist)
This call will not return until all processors in the remote_dist
processor
group have called a matching receive operation.
int DRI_Rsend(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist)
This call assumes all processors in the remote_dist processor group
have called a matching receive operation prior to the sender issuing
the DRI_Rsend.
Each of these send and receive functions are blocking where the terms
blocking and non-blocking are defined as in the MPI-1 specification.
After a blocking send call returns, the application may write into the
buffer with full confidence that the changes will not be transmitted
to the receiving group. After a non-blocking send call returns, the
application may not write into the buffer since changes made at that
point may transmitted to the receiving group. After a blocking
receive, valid data is guaranteed to be in the buffer (i.e., data
received from the send group). Data is not guaranteed to be valid
after a non-blocking receive.
For DRI_Recv and each of the various DRI send functions, a non-blocking
version of the operation is provided:
int DRI_Irecv(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Isend(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Ibsend(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Issend(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Irsend(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
Various functions can be used to determine if a non-blocking operation
has completed. The following routine waits until a non-blocking
request has completed.
DRI_Wait(DRI_Request *request);
request - request will be set to DRI_REQUEST_NULL upon completion
of this function (unless the request is a persistant
request).
The following routine tests to see if a non-blocking operation has
finished.
DRI_Test(DRI_Request *request, int *flag);
request - request will be set to DRI_REQUEST_NULL if the request
has completed (and it is not a persistant request),
otherwise it will not be modified.
flag - is set to 1 if the request has completed, 0 otherwise.
(To do: Add DRI variations of the other MPI completion functions:
DRI_Waitany, DRI_Waitsome, DRI_Waitall,
DRI_Testany, DRI_Testsome, DRI_Testall)
In order to reduce the overhead associated with setting up a
distribution operation, persistant communication operations are also
available (based on MPI persistant communication):
int DRI_Recv_init(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Send_init(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Bsend_init(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Ssend_init(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
int DRI_Rsend_init(void *buffer, int tag,
DRI_Distribution local_dist,
DRI_Distribution remote_dist,
DRI_Request *request)
The requests created with these operations can be started by
calling:
int DRI_Start(DRI_Request *request);
A request must complete before it can be restarted.
The following routine is provided for cases where a processor
participates as both sender and receiver:
int DRI_Sendrecv(void *send_buffer, int stag, DRI_Distribution send_dist,
void *recv_buffer, int rtag, DRI_Distribution recv_dist);
A persistant version of this call is available:
int DRI_Sendrecv_init(void *sendbuffer, int stag, DRI_Distribution senddist,
void *recvbuffer, int rtag, DRI_Distribution recvdist,
DRI_Request *request);
(Discussion: Should we have DRI_Bsendrecv, DRI_Ssendrecv,
DRI_Rsendrecv?
Should we then have persistant versions of each of these?)
Appendix A. Constants
=====================
DRI_GLOBAL_DATA_NULL
DRI_GROUP_NULL
DRI_PARTITION_NULL
DRI_DISTRIBUTION_NULL
DRI_LAYOUT_NULL
DRI_DISTRIBUTION_ITERATOR_NULL
DRI_UNDEFINED
DRI_PARTITION_TYPE_BLOCK
DRI_PARTITION_TYPE_BLOCK_CYCLIC
DRI_LAYOUT_PACKED
<add missing ones>