$Id: DRI_LIS.txt,v 1.7 2000/11/21 05:02:53 kcain Exp $ This is a language-independent specification of a meta-API for data reorganization. This version of the specification is based on the concensus of the Data Reorganization Forum following its September 2000 meeting. DATE OF THIS DRAFT: 09/29/2000 There are 4 sections to this document: 1) Administrative notes 2) Change summary between this version and the prior version 3) Current version of "critical path" interfaces in the API 4) Current version of other interfaces in the API ------------------- SECTION 1: Administrative Notes -------------------------- These notes are mostly re-iterated from previous versions of this draft. NOTE #1: A change has been made in the organization of this file so that the functions that are in the "critical path" to creating and initiating a data reorganization function are presented first. Other functions such as object query functions that are not in this "critical path" are presented later. As of this date (09/29/2000), this draft contains _only_ the critical path calls (i.e., Section 4 is empty). Since there have been many changes to these functions, we should agree on their specification before handling the low-level ("power user") and non-critical fucntions. NOTE #2: A new feature is being added for the committee members - a list of changes that have been introduced in this newer version of the API and the corresponding reasons. The goal is to prevent revisiting issues that have been resolved in past working meetings. ----------------------- SECTION 2: Change Summary -------------------------- ***** Changes from July 2000 draft to September 2000 (POST SEPTEMBER MEETING) ***** draft: 0. Reorganization of API and associated terminology: CORE Standalone / pure Standalone / middleware adapter CORE Data Reorg interfaces must appear in all implementations. They are inherently part of Data Reorg and are not likely to appear in other existing standard APIs. List of CORE objects/interfaces: For non-CORE interfaces, there is an implicit overlap with existing middleware services. Implementers can fully implement these services according to the Data Reorg "standalone" specification (akin to re-inventing the wheel), or they can make reasonable choices about how to leverage the existing services in so-called "middleware adapters". In this respect, the Data Reorg specification can be thought of as "META" (i.e., not precisely specified, because it could be instantiated in many ways corresponding to the many middlewares that might "co-layer" with Data Reorg). The Data Reorg committee will define, along with the first CORE and pure standalone interfaces, an MPI middleware adapter. List of Standalone objects/interfaces: 1. Removed dataspec parameter from DRI_Global_Data_create We are deferring data type specification until Channel creation time. Having a data type input parameter to DRI_Global_Data_create would confuse its distinction as a Data Reorg "CORE" interface. Data types can take many forms, since they are implemented in many other middlewares (e.g., MPI, MPI/RT, VSIPL, ...). 2. Added "name" parameter to DRI_Global_Data_create This string parameter was added as a way to connect off-line configuration file specifications to run-time calls to the Data Reorg library. Some architectures use a configuration file approach to specifying communication resources that will be used at run-time. Another benefit of adding this parameter is to facilitate any future debugging or profiling features either included in DRI itself or provided by a third party as a useful complementary capability. 3. Changed DRI_Distribution_calc_local_size to DRI_Distribution_get_local_count In addition to the name change, the return value has changed from number of bytes to number of data elements. The change from returning bytes to returning elements is necessary because the data type of the data being partitioned has not yet been bound in any prior calls (see related change to DRI_Global_Data_create). 4. In DRI_Distribution_create documentation, clarified a special case that corresponds to "broadcasting" data. The case is when the DRI_Partition object refers to a "whole" partitioning of a data dimension (i.e., indivisible), but the process group logical topology (group_dims parameter) specifies more than 1 process. Here, such distributions will effectively replicate the data across all associated processes. 5. In DRI_Bufferset_system_create, renamed dist parameter to disth to be consistent with other documentation in this document. 6. DRI_Distribution_create - wholesale modifications!! 6a) Changed how layouts input array works. There are 3 cases supported: - layouts is NULL (no layout specified at all - default operation request) - layouts is a mixture of user-instantiated DRI_Layout and pre-defined DRI_LAYOUT_NULL objects - layouts consists entirely of user-instantiated DRI_Layout objects See documentation for DRI_Distribution_create for details 6b) Removed group parameter, in favor of user specifying group_size and myrank This was motivated by the need to remove so-called "META" arguments out of "CORE" Data Reorg functions like this 7. Moved DRI_Dataspec input parameter into DRI_Channel_create_send and DRI_Channel_create_recv functions (from DRI_Global_Data_create). See reasoning above in item 1 8. Changed DRI_Bufferset_system_create and DRI_Bufferset_user_create to take bufsize (in bytes) instead of DRI_Distribution object parameter. This, of course, is because of the change described in item 1 (no dataspec is bound early on in the API - it is deferred until DRI_Channel creation time) 9. Added built-in datatypes and DRI_Dataspec_get_size ***** Changes from March 2000 draft to July 2000 (POST JUNE DR MEETING) draft: 1. Changed capitalization conventions as decided in prior meetings, but has not yet been implemented. The new convention is: DRI_ prefix for everything (objects/types, function names, etc.) DRI_Data_Type (capitalized first letter of object/type names) DRI_Data_Type_method (lowercase method/function names) DRI_Method (for standalone functions without associated objects) 2. Changed DRI_Distspec object to DRI_Partition, per group's decision at the June 2000 meeting An associated change is that the pre-defined object DRI_DISTSPEC_INDIVISIBLE has been changed to DRI_PARTITION_WHOLE (2 changes - DISTSPEC to PARTITION, and INDIVISIBLE to WHOLE) 3. Changed DRI_Dist object to DRI_Distribution, per group's decision at the June 2000 meeting 4. Changed DRI_Bufferset_create to DRI_Bufferset_system_create 5. Modified DRI_Bufferset_system_create function to take a DRI_Distribution object parameter (as decided in June 2000 meeting) 6. Created DRI_Bufferset_user_create function to import user-allocated memory into a DRI_Bufferset object (instead of calling DRI_Bufferset_system_create to have the library/run-time perform the memory allocation). 7. Added DRI_Init and DRI_Finalize functions to the API specification 8. Specified a number of pre-defined "NULL" objects (one for each major data type in the API specification). DRI_Init creates these NULL objects at runtime. 9. Added a new function DRI_Partition_whole_create that allows the user to get a DRI_Partition object that specifies no partitioning at all (per June meeting). 10. Changed DRI_Group_create parameters to require an "original_group" from which a subset process group is to be created 11. REMOVED the following language from parts of this document, based on decisions made in June 2000 meeting. What if we are not using standard middleware, but a process set construct exists? Do we want to leverage those (non-portable) approaches to alleviate the need to do process set management in DRI? If we choose to do this, then this object and its methods become unnecessary? ***** Changes from December 1999 draft to March 2000 (PRE DR MEETING) draft: 1. Marked appropriate functions as "" to indicate areas where Data Reorg will likely use infrastructure from other middleware Each meta function now has a "META NOTES" section to try to organize the discussion on the meta-nature of the DRI API. 2. Inserted notation in this document where some remaining decisions are needed. Before final API is settled, we need to go back and remove these notes and replace with the final decisions 3. Inserted _destroy functions for the following objects: DRI_Global_Data DRI_Group DRI_Overlap DRI_Distspec DRI_Bufferset DRI_Channel. ***** Changes from September 1999 to December 1999 drafts: 1. DRI_Group_myrank name changed to DRI_Group_get_rank() 2. DRI_Dist_create - added a note in RESTRICTIONS/POLICY section reiterating that this call may not involve collective communication, at the implementation's discretion. This of course means that erroneous programs could cause DRI_Dist_create to calculate an incorrect data partitioning. 3. DRI_Dist_create - Added useful default layout parameters so the user doesn't have to use DRI_Layout_create to create commonly-needed objects (e.g., DRI_LAYOUT_PACKED_012). All possible packed layouts for 1, 2, and 3 dimensions are provided. This effectively makes DRI_Layout_create a non "critical-path" function in the API. 4. DRI_Dist_create - A NULL pointer argument for the group_dims parameter specifies that the user wants to have the implementation determine an appropriate logical process set topology to use in dividing up the data during the execution of this call. 5. DRI_Dist_create - Added a note in the DESCRIPTION section that says that a valid entry in the distspecs array parameter is DRI_DISTSPEC_INDIVISIBLE (this capability was accidentally removed from the API in prior edits). 6. DRI_Dist_create - Noted that the group_dims array parameter corresponds directly to the dimsizes array parameter of the DRI_Global_Data_create function. 7. Added a DRI_Dist_get_numblocks function 8. Added a DRI_Dist_get_blockinfo function to return a single structure that gives all needed information about a locally owned block of data following the partitioning process. Returns a new DRI_Blockinfo structure type. internally, the DRI_Blockinfo structure contains an array of DRI_Blockdim structures. This pair of structures replaces the old DRI_Part object and bounds_t structure. DRI_Part was not adding anything beyond DRI_Dist, so we have opted directly query DRI_Dist for the low level partitioning information. The bounds_t structure was poorly named, and needed to be more descriptive (we now call it DRI_Blockdim). Additional information was needed beyond what the old bounds_t provided, so we just 9. Changed DRI_Part_calc_local_size to DRI_Dist_calc_local_size, since the DRI_Part object has been removed and we now just query DRI_Dist objects. 10. Added the DRI_Bufferset object and its DRI_Bufferset_create function 11. DRI_Transfer object has been renamed to DRI_Channel ------- SECTION 3: Current API for "critical path" functions --------------- /******************** DRI_Init ********************/ DRI_Init - Initialize the data reorganization run-time environment SYNOPSIS DRI_Init(argvp, argcp) - C language binding DRI_Init(???) - other language bindings PARAMETERS INOUT: argvp (pointer to array of strings) - application command line arguments INOUT: argcp (pointer to integer) - address of integer variable that stores the number of command line arguments contained in argvp STANDALONE/ADAPTER NOTES Co-layer (middleware adapter) implementations based on MPI or MPI/RT may be able to accomplish the necessary Data Reorg init actions within their respective Init functions, depending on how they are implemented. DESCRIPTION Parses the application command line for implementation-specific data reorganization library options. Synchronizes with all other data reorganization processes in the environment, and produces the DRI_GROUP_WORLD object that is used to represent all processes in a data reorganization based application. If they are not already provided at compile-time, this function creates a pre-defined "null" objects: - DRI_GROUP_NULL - DRI_GLOBAL_DATA_NULL - DRI_DATASPEC_NULL - DRI_OVERLAP_NULL - DRI_PARTITION_NULL - DRI_DISTRIBUTION_NULL - DRI_LAYOUT_NULL - DRI_CHANNEL_NULL - DRI_BUFFERSET_NULL - DRI_BUFFER_ID_NULL Also, if necessary, creates the following pre-defined objects: - DRI_PARTITION_WHOLE COMMUNICATION BEHAVIOR Collective. Synchronizes with all other data reorganization processes in the environment, and produces the DRI_GROUP_WORLD object that is used to represent all processes in a data reorganization based application. RESTRICTIONS / POLICY DRI_Init must be the first data reorganization library function called. /******************** DRI_Global_Data_create ********************/ DRI_Global_Data_create - Create a global data object SYNOPSIS DRI_Global_Data_create(ndims, dimsizes[ndims], name, gdo) PARAMETERS IN: ndims (integer) - number of dimensions in the global data IN: dimsizes (integer array) - size of each dimension of the global data IN: name (string) - symbolic name of data object represented by gdo OUT: gdo (DRI_Global_Data) - object that describes the global data DESCRIPTION Creates a global data object to describe application data. The size information supplied by the user refers to the size of the application data _without_ considering how the data will eventually be partitioned across a group of processes in the parallel environment. COMMUNICATION BEHAVIOR Local. All processes that will participate in a future data reorganization involving this data must create this object independently. RESTRICTIONS / POLICY All processes that will participate in a data reorganization on the described data must call this function with identical ndims and dimsizes parameters. Implementations may place an upper limit on the ndims parameter. However, all implementations must minimally support 1 <= ndims <= 3 /***************** DRI_Group_create *****************/ DRI_Group_create - Create an object to represent a group of processes SYNOPSIS DRI_Group_create(original_group, num_ranks, rank_list, new_group) PARAMETERS IN: original_group (DRI_Group) - group from which a subset will be taken to produce the new_group of processes IN: num_ranks (integer) - total number of processes in the group to be created IN: rank_list (array of integer) - list of logical process ranks from the original_group that will form the new_group of processes OUT: new_group (DRI_Group) - process group object STANDALONE/ADAPTER NOTES This is meta at the "object level". That is, some implementations may choose to completely leverage constructs from other middleware APIs (e.g., MPI Communicators) as part of a co-layered implementation with Data Reorg. Implementations that take this approach may elect not to implement DRI_Group objects in the standalone format shown here. DESCRIPTION Creates an object to represent a group of unique processes in the parallel processing environment. Groups are one-dimensional logical orderings of processes. Each process is assigned an integer rank, numbered between zero and the total number of processes - 1. The original_group parameter must be a valid data reorganization group. The pre-defined DRI_Group object DRI_GROUP_WORLD must be used to create the first subset group of processes. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY /**************** DRI_Group_get_rank ****************/ DRI_Group_get_rank - Return the rank of the calling process in specified group SYNOPSIS DRI_Group_get_rank(group, rank) PARAMETERS IN: group (DRI_Group) - group object OUT: rank (integer) - rank of the calling process in the group STANDALONE/ADAPTER NOTES See DRI_Group_create DESCRIPTION Returns the rank (logical process id) in the given group to the caller. COMMUNICATION BEHAVIOR Local RESTRICTIONS / POLICY Only members of the specified group may call this function successfully /*************** DRI_Group_get_size ***************/ DRI_Group_get_size - Return the size of the specified group SYNOPSIS DRI_Group_get_size(group, size) PARAMETERS IN: group (DRI_Group) - group object OUT: size (integer) - size of the specified group STANDALONE/ADAPTER NOTES See notes for DRI_Group_create. DESCRIPTION Returns the number of participating processes in the given group COMMUNICATION BEHAVIOR Local RESTRICTIONS / POLICY Only members of the specified group may call this function successfully /******************** DRI_Overlap_create ********************/ DRI_Overlap_create - Create an overlap data partitioning object SYNOPSIS DRI_Overlap_create(ovr_type, num_pos, overlaph) PARAMETERS IN: ovr_type (DRI_Overlap_type) - overlap policy to implement at the edges of a global data object. Can be one of: DRI_OVERLAP_TRUNCATE DRI_OVERLAP_TOROIDAL DRI_OVERLAP_PAD_ZEROS DRI_OVERLAP_PAD_REPLICATED IN: num_pos (integer) - number of positions to overlap OUT: overlaph (DRI_Overlap) - overlap object DESCRIPTION Creates the overlap attribute used in the data distribution high-level specification. The resulting DRI_Overlap object is to be passed into either DRI_Partition_block_create or DRI_Partition_blockcyclic_create as a left or right overlap argument. NOTE: Just like the DRI_Partition object, the user is expected to create a DRI_Overlap object specification for each dimension of global data (where a nonzero overlap is desired). In the event that no overlap is requested by the user, DRI_NO_OVERLAP can be passed as the left and right overlap arguments to one of the the DRI_Partition_create functions. In general, overlap is the storage of extra data in a processor's local data buffer to hold data that is adjacent in the global data context, but that is assigned to another processor, based on the data partitioning. Overlap therefore refers to data that is stored on processor boundaries in the partitioning of the global data. There are different overlap policies supported: 1) ovr_type == DRI_OVERLAP_TRUNCATE The local buffer should contain enough space to store copies of num_pos adjacent, non-local elements. At the ends of the global data object, extra storage is not required in the local data buffer, and is truncated accordingly. 2) ovr_type == DRI_OVERLAP_TOROIDAL The local buffer should contain enough space to store copies of num_pos adjacent, non-local elements. At the ends of the global data object, extra storage is required in the local data buffer, and will be filled with data from the num_pos elements that start at the opposite end of the global data dimension. 3) ovr_type == DRI_OVERLAP_PAD_ZEROS The local buffer should contain enough space to store copies of num_pos adjacent, non-local elements. At the ends of the global data object, extra storage is required in the local data buffer, and will be filled with zeros. 4) ovr_type == DRI_OVERLAP_PAD_REPLICATED The local buffer should contain enough space to store copies of num_pos adjacent, non-local elements. At the ends of the global data object, extra storage is required in the local data buffer, and will be filled with a copy of the last num_pos _locally_ held elements. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY /******************** DRI_Partition_block_create ********************/ /******************** DRI_Partition_blockcyclic_create ********************/ /******************** DRI_Partition_whole_create ********************/ DRI_Partition_block_create - Create a block distribution specification DRI_Partition_blockcyclic_create - Create a block cyclic distribution DRI_Partition_whole_create - Create an indivisible (whole) distribution SYNOPSIS DRI_Partition_block_create(minsz, mod, lov, rov, part) DRI_Partition_blockcyclic_create(lov, rov, blksz, part) DRI_Partition_whole_create(part) PARAMETERS IN: minsz (integer) - minimum number of local elements required (user specifies 0 to indicate no preference) IN: mod (integer) - modulo requirement (user specifies 1 to indicate no preference) IN: lov (DRI_Overlap) - left overlap (DRI_NO_OVERLAP specifies no overlap) IN: rov (DRI_Overlap) - right overlap (DRI_NO_OVERLAP specifies no overlap) IN: blksz (integer) - block-cyclic partitioning block size (user specifies 1 for pure cyclic partition) OUT: part (DRI_Partition) - high-level data distribution object DESCRIPTION These functions create a DRI_Partition object that stores information about either a block, blockcyclic, or indivisible (whole) partitioning of global application data. Users must associate a separate DRI_Partition object with each dimension of partitioned global data. The output object, part, is only a high-level specification of the requested data partitioning. It does not store exact partitioning details such as specific global data indices assigned to a particular process. Because a DRI_Partition object is not associated with any single global data array, it can be reused for many different data partitionings. The more exact partitioning information for a particular global data array is stored in the DRI_Distribution object that can be queried for detailed partitioning information following the DRI_Distribution_create operation. Calling DRI_Partition_whole_create will produce an object equivalent to the pre-defined object DRI_PARTITION_WHOLE. Implementations may in fact return a reference to this pre-defined object as the output of DRI_Partition_whole_create. Parameter mod specifies that the number of local elements ultimately assigned to the calling process must be some multiple of mod. Parameters lov and rov specify element overlaps (left and right, respectively). These parameters do not change the mapping of global data indices to processors in the data partitioning. They allow copies of adjacent global data elements at the (left or right) boundaries of the data partitining to be stored locally. A right overlap refers to overlap in the direction of _higher_ global indices. Consult the section on the DRI_Overlap object for additional details about the overlap specification. Parameter blksz is used in block-cyclic partitionings to define the size (in number of elements) of the blocks that get assigned to processors in the global data partitioning. COMMUNICATION BEHAVIOR Local RESTRICTIONS / POLICY This object may NOT be queried until the completion of a subsequent DRI_Distribution_create call. COMMUNICATION BEHAVIOR Local /*************** DRI_Distribution_create ***************/ DRI_Distribution_create - Create a distribution object for a global data object over a group of processes SYNOPSIS DRI_Distribution_create(gdo, group_size, myrank, group_dims, parts, layout, disth) PARAMETERS IN: gdo (DRI_Global_Data) - global data object IN: group_size (integer) - total number of processes in group dividing data IN: myrank (integer) - my logical process rank within the group IN: group_dims (array of integer) - logical dimensions of process group IN: parts (array of DRI_Partition) - high-level data distribution specs (one array entry per gdo dimension) IN: layouts (DRI_Layout) - memory layout of local data buffers OUT: disth (DRI_Distribution) - data distribution object STANDALONE/ADAPTER NOTES This function can be implemented in a completely standalone fashion DESCRIPTION This function aggregates all of the input objects into a single container, a DRI_Distribution object. It also calculates explicitly the data block(s) of the global data that will be assigned to processes (and stores that detailed information in the resulting DRI_Distribution object). The user will be able to query this low-level information following the execution of this call. Note that the data partitioning performed here guarantees that each global data element is assigned to a process. It is unlikely, but possible that some processes could be assigned NO global data elements as a result of this call. The group_size parameter defines the size of the group (number of processes) that will be dividing the data set. The myrank parameter uniquely defines the calling process so that a unique portion of the global data set can be assigned to it by this call. The group_dims array specifies a logical process set dimensionality for the process group identified by the "group" parameter. The number of elements in group_dims must be equal to the number of dimensions specified for the gdo parameter in a prior call to DRI_Global_Data_create. The product of all values in group_dims must equal the total number of processes, defined by the group_size parameter. group_dims gives the caller more explicit control over the global data partitioning process performed by DRI_Distribution_create. The group_dims array can take one of three forms: 1. NULL (no process set dimensionality specified) User has no preference and the implementation can choose any values that make the product of group_dims equal to group_size 2. Mixture of zero and nonzero/positive values Nonzero/positive values represent a specific process set dimensionality that should be respected for the associated global data dimensions. Zero valued elements effectively represent a "don't care", and the implementation is free to select values as long as they satisfy the overall requirement that the group_dims product be equal to the group_size 3. All nonzero/positive values User is specifying a specific process set dimensionality that should be used in the partitioning process. The layouts parameter specifies, for each dimension of the global data represented by gdo, how the locally stored data is to be arranged in linear memory space. The form of the layouts array parameter will control the following 2 characteristics of locally stored data: - "ordering" of multi-dimensional data (e.g., which dimension is ordered "fastest" in linear memory) - "striding" of local data (with or without strides between consecutive values in memory) The layouts array can take one of three forms: 1. NULL array (no layout specification for any of the data dimensions) - Assigns natural ordering of multidimensional data in linear memory. The most contiguous dimension corresponds to the first specified dimension in the dimsizes input array to the earlier DRI_Global_Data_create call - the data in each dimension is stored with no strides between consecutive data values 2. Individual elements with value DRI_LAYOUT_NULL - for layouts array elements not equal to DRI_LAYOUT_NULL, the order and striding of the corresponding data dimensions are defined by the properties of the supplied DRI_Layout objects - for a layouts array element equal to DRI_LAYOUT_NULL, the order of the corresponding data dimensions is equal to its position within the layouts array. This is similar to the first case. - the data in the dimensions whose layouts are specified DRI_LAYOUT_NULL are stored with no strides between consecutive data values 3. Fully populated array of DRI_Layout objects created earlier using the DRI_Layout_create call - both multidimensional ordering and striding are defined precisely by the supplied DRI_Layout objects stored in the array The parts parameter is an array of DRI_Partition objects, one entry per dimension of data being partitioned. The entries in the array are created prior to this call by using one of the DRI_Partition__create functions. Special case: If one of the data dimensions has a partitioning description (parts parameter) of "whole", but the group_dims parameter specifies more than one process "dividing" the data, then the data distribution approach is to provide copies of all global data in that dimension to each of the processes in that dimension. This can effectively be used to implement a broadcast or replication of the affected data. (see earlier DRI_Partition_create_whole, and DRI_PARTITION_WHOLE descriptions) COMMUNICATION BEHAVIOR At the implementation's discretion, this can be performed either as a collective operation, or as a local operation. RESTRICTIONS / POLICY This function may or may not be implemented in a collective fashion. It is therefore possible that the constituent processes that make up the group could (erroneously) supply different specifications for the following important parameters: gdo, group_dims, parts. In that case the resulting data distribution that is computed and stored in the DRI_Distribution output parameter may be incorrect. The user _must_ be able to query the low-level partitioning details that result from this call immediately following completion of this call. This is true even if the implementation does not perform this call collectively among the affected processes. /******************** DRI_Distribution_get_numblocks ********************/ DRI_Distribution_get_numblocks - Return the number of locally stored blocks from a data partitioning SYNOPSIS DRI_Distribution_get_numblocks (disth, nblocks) PARAMETERS IN: disth (DRI_Distribution) - data distribution object OUT: nblocks (integer) - number of blocks associated with the low-level partitioning referred to by the disth parameter DESCRIPTION This function returns the number of blocks assigned as part of a low-level data partitioning (described by the disth parameter that was created in an earlier DRI_Distribution_create call). For block data partitionings, this function will return a value of 1 in the nblocks output parameter. For block-cyclic partitionings, a value greater than 1 may be returned in the nblocks parameter. COMMUNICATION BEHAVIOR Local. RESTRICTIONS/POLICY /******************** DRI_Distribution_get_blockinfo ********************/ DRI_Distribution_get_blockinfo - Get detailed information about a local block of data SYNOPSIS DRI_Distribution_get_blockinfo (disth, block_num, blockinfo) PARAMETERS IN: disth (DRI_Distribution) - data distribution object IN: block_num (integer) - local block number for which information is needed OUT: blockinfo (DRI_Blockinfo) - returned structure containing detailed information about the block DESCRIPTION For a specified low-level data partitioning object (disth) and local block index (block_num), allocates and returns a structure to the user containing the following: { ndims (integer) - number of dimensions in the local data block described first_offset (integer) - offset (in elements) from the beginning of the local application's memory buffer to the first "owned" element of this data block. It therefore in some cases does not identify the first data element in the block, since the first element in storage could be the result of an overlapped data partitioning. elem_size (integer) - number of bytes per data element in the local block. This can be obtained by querying other objects (DRI_Global_Data and DRI_Partition), but is provided in this structure for user convenience. dims[ndims] (array of DRI_Blockdim structures) - detailed information (on a per-dimension basis) about the range of global indices covered by the local block of data referred to by this DRI_Blockinfo structure } The DRI_Blockdim structure referred to above is defined as follows: { lov (DRI_Overlap) - left overlap in this dimension rov (DRI_Overlap) - right overlap in this dimension global_begin_ix (integer) - global index of the first "owned" data element in the block in this dimension length (integer) - number of "owned" data elements in this dimension stride (integer) - number of elements between consecutive data elements in the local data buffer in this dimension. If this value is 1, then the data is densely packed, with no spacing between consecutive elements. } COMMUNICATION BEHAVIOR Local. RESTRICTIONS/POLICY /******************** DRI_Distribution_get_local_count ********************/ DRI_Distribution_get_local_count - Calculate size of local buffers associated with one side of a data reorganization SYNOPSIS DRI_Distribution_get_local_count(disth, local_size) PARAMETERS IN: disth (DRI_Distribution) - low-level data partitioning object OUT: local_size (integer) - number of elements in the data buffers associated with this data distribution DESCRIPTION This function tells the caller the size of data buffers (in elements) associated with the disth data distribution. The returned local_size parameter is calculated based on a combination of - user-specified partitioning parameters - user-specified memory layout parameters - and implementation-imposed local memory layout policies The number of elements returned in the local_size parameter specifies the size of a memory buffer large enough to hold all local blocks from a data partitioning. This is relevant for block-cyclic partitionings, in which it is possible and likely that multiple blocks of data are assigned to a single process. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY /******* DRI_Bufferset_system_create *******/ DRI_Bufferset_system_create - create shared application/library buffers for processing and data reorganization SYNOPSIS DRI_Bufferset_system_create (nbufs, bufsize, bufset) PARAMETERS IN: nbufs (integer) - number of buffers of size bufsize that will make up the buffer set to be created by this function IN: bufsize (integer) - number of bytes needed for each buffer in bufferset buffer sizes that will be created in the set OUT: bufset (DRI_Bufferset) - buffer set object created STANDALONE/ADAPTER NOTES: Because of similar constructs defined in other APIs (most notably MPI/RT), buffersets can be implemented in a middleware adapter, or in a completely standalone fashion, at the implementation's discretion. DESCRIPTION Creates a buffer set object that will be associated with a later data reorganization (represented by a DRI_Channel object). After this call, the user will never directly query or manipulate the DRI_Bufferset object created. Once the association of the buffer set is made with a channel object (in a later call do DRI_Channel_create), all access to the buffer set's constituent buffers will be made through that associated channel object. In that interaction, the user will work with individual DRI_Buffer_Id objects that are obtained with a call to DRI_Channel_get and returned to the channel with a call to DRI_Channel_put. See the documentation for the get/put functions for additional details on buffer set management. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY /******** DRI_Bufferset_user_create ********/ DRI_Bufferset_user_create - create shared application/library buffers from user-allocated memory for processing and data reorganization SYNOPSIS DRI_Bufferset_user_create (nbufs, buffer_ptrs[], bufsize, bufset) PARAMETERS IN: nbufs (integer) - number of buffers that will make up the buffer set to be created by this function IN: buffer_ptrs (array of pointer) - addresses of user-allocated buffers IN: bufsize (integer) - number of bytes needed for each buffer in bufferset OUT: bufset (DRI_Bufferset) - buffer set object created STANDALONE/ADAPTER NOTES See notes for DRI_Bufferset_system_create DESCRIPTION Creates a buffer set object (to be used in conjunction with an associated DRI_Channel object) from user-allocated memory. Although the application programmer will have access to the addresses of each buffer using this approach, "safe" use of these memory areas must be negotiated by calling DRI_Channel_get and DRI_Channel_put for the associated channel object. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY The buffers that are supplied as input parameters here must provide a sufficient amount of memory to insure correct function of the associated channel operation (data reorg) to take place. See DRI_Channel_create_send and DRI_Channel_create_recv for guidance on how to appropriately size the buffers. /********** DRI Built-In Datatypes **********/ Equivalence table between DRI built in datatypes, and other middleware: DRI MPI(C/FORTRAN) VSIPL Scalar ANSI C DRI_Dataspec MPI_Datatype Datatype ------------ -------------- ------------ ------ DRI_FLOAT MPI_FLOAT vsipl_scalar_f float DRI_DOUBLE MPI_DOUBLE vsipl_scalar_d double DRI_COMPLEX NA / MPI_COMPLEX vsipl_cscalar_f NA DRI_DOUBLE_COMPLEX NA / MPI_DOUBLE_COMPLEX vsipl_cscalar_d NA DRI_COMPLEX_SPLIT NA / NA vsipl_cscalar_f NA DRI_DOUBLE_COMPLEX_SPLIT NA / NA vsipl_cscalar_d NA DRI_INTEGER MPI_INTEGER vsipl_scalar_i int DRI_SHORT MPI_SHORT vsipl_scalar_si signed short int DRI_UNSIGNED_SHORT MPI_UNSIGNED_SHORT vsipl_scalar_us unsigned short int DRI_LONG MPI_LONG vsipl_scalar_li signed long int DRI_UNSINGED_LONG MPI_UNSIGNED_LONG vsipl_scalar_ul unsigned long int /********** DRI_Dataspec_get_size **********/ DRI_Dataspec_get_size - Get the number of bytes needed to store a data type SYNOPSIS DRI_Dataspec_get_size (dataspec, nbytes) PARAMETERS IN dataspec (DRI_Dataspec) data type descriptor OUT nbytes: (integer) number of bytes needed to store data of type described by the dataspec parameter STANDALONE/ADAPTER NOTES Datatypes are commonly implemented in other APIs (see table above). This is an area where DRI implementations can take advantage of this existing functionality, which will almost always offer a broader feature set compared to the standalone DRI interfaces for data types. DESCRIPTION Returns amount of memory (in bytes) needed to store a single data element of the type specified in the dataspec parameter COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY /********** DRI_Channel_create_send ****************/ /********** DRI_Channel_create_recv ****************/ DRI_Channel_create - create data reorganization communication channel SYNOPSIS There are two forms of this call: DRI_Channel_create_send(name, srcDist, srcBufs, channel); DRI_Channel_create_recv(name, destDist, destBufs, channel); PARAMETERS IN name: (string/integer?) Identifier for the channel IN dataspec (DRI_Dataspec) type of data stored in associated buffers IN srcDist: (DRI_Distribution) distribution object on the send side IN destDist: (DRI_Distribution) distribution object on the receive side IN srcBufs: (DRI_Bufferset) send side data buffers IN destBufs: (DRI_Bufferset) receive side data buffers OUT channel: (DRI_Channel) Data reorganization (channel) object created STANDALONE/ADAPTER NOTES Channel constructs appear in other APIs (e.g., MPI/RT) and so DRI_Channel can be implemented in a middleware adapter or in a completely standalone fashion, at the discretion of the implementation. DESCRIPTION The send channel object allows the calling process to participate in a data reorganization as a sender. The receive channel object has a similar (obvious) function. To properly set up data reorganizations in which the caller is both a sender and receiver of data, both forms must be called, resulting in two DRI_Channel objects. COMMUNICATION BEHAVIOR Local. Processes create channel objects independently and in any order. RESTRICTIONS / POLICY Buffers supplied here are assumed to be large enough to contain all the data transferred. To find the appropriate size for these buffers, use the functions DRI_Distribution_get_local_count (number of elements) and DRI_Dataspec_get_size (number of bytes/element). Alternatively, the user can have the DRI implementation allocate the associated bufferset using the DRI_Bufferset_system_create call. Currently, we assume that data reorganizations are either bi-partite (pipeline) or clique-based (Single Program Multiple Data). Intermediate cases, that is, partially overlapping process groups, are disallowed. If any process is both a sender and a receiver, all processes must be both senders and receivers, or an error will result at the time of the subsequent DRI_Channel_connect call. On a given "side" of a channel (send or receive), all of the participating processes must provide buffersets that contain the same number of local buffers as every other process. The number of buffers on the send side of a channel _can_ be different than the number of buffers in the bufferset associated with the receive side of the same channel. The reason for the restriction is to enable high performance implementations. The middleware will be able to compute in advance: - the explicit pairings of send/recv buffers in the data reorganizations to be performed with this channel - the precise order in which the pairings will occur (if there are multiple buffers on the send and receive sides of the channel) /************ DRI_Channel_connect ********************/ DRI_Channel_connect(chan) - Pipeline channel connect SYNOPSIS DRI_Channel_connect(chan) PARAMETERS INOUT chan: (DRI_Channel) channel object to be connected STANDALONE/ADAPTER NOTES See notes for DRI_Channel_create. DESCRIPTION Enables a given pipeline data reorganization: calculates which processors are Sending to and receiving from which other processors. COMMUNICATION BEHAVIOR The connect call is a synchronization point between all processors in the send and receive sides of the data reorganization identified by the chan parameter: it is collective and blocking. RESTRICTIONS / POLICY Multiple channel objects must be connected in the correct order by the involved parties or deadlock may (will probably) result. /********** DRI_Channel_connect_sendrecv ***********/ SYNOPSIS DRI_Channel_connect_sendrecv(c_send, c_recv) - Clique channel connect PARAMETERS INOUT c_send: (DRI_Channel) object managing the "send side" of a data reorg INOUT c_recv: (DRI_Channel) object managing the "receive side" of a data reorg STANDALONE/ADAPTER NOTES See notes for DRI_Channel_create. DESCRIPTION Enables a given clique data reorganization: calculates which processors are Sending to and receiving from which other processors. COMMUNICATION BEHAVIOR The connect call is a synchronization point between all processors in the send and receive sides of the given data reorganization: it is collective and blocking. RESTRICTIONS / POLICY Multiple channel objects must be connected in the correct order by the involved parties or deadlock may (will probably) result. /**************** DRI_Channel_get ****************/ /**************** DRI_Channel_put ****************/ SYNOPSIS DRI_Channel_get (chan, buf) - Receive data reorg buffer / Get free buffer DRI_Channel_put (chan, buf) - Send data reorg buffer / Return used buffer PARAMETERS INOUT chan: (DRI_Channel) channel object managing a data reorganization OUT buf: (DRI_Buffer_Id) handle to memory buffer STANDALONE/ADAPTER NOTES See notes for DRI_Channel_create. DESCRIPTION Discussion of DRI_Channel_get: If the channel object argument refers to the "receive side" of a data reorganization, this function returns a buffer that represents the received data from a set of sending processes. If the channel object argument refers to the "send side" of a data reorganization, this function returns an available buffer to the application so that it can produce the data that will be sent in a subsequent data reorganization operation. Discussion of DRI_Channel_put: If the channel object argument refers to the "send side" of a data reorganization, this function initiates the communication using the data provided in the input buffer argument. If the channel object refers to the "receive side" of a data reorganization, then this call simply returns the buffer to the DRI library so that it can be filled up with received data in a subsequent data reorganization operation. General discussion of put and get in context: In pipeline data reorganizations, incoming buffers are obtained by calling DRI_Channel_get with a "receive side" channel object input argument. If, after processing the received buffer, the program needs to send the data "downstream" in the pipeline, the same buffer can be used as input to a DRI_Channel_put call, but with a separate channel object (representing the "send side" of a different data reorganization). In cases where the calling program is at the beginning or end of an application pipeline, the buffer may be returned to the buffer set by calling DRI_Channel_put with the same channel object parameter that was used in the earlier DRI_Channel_get. For clique data parallel applications, there are two channel objects associated with the same data reorganization (one for the send side, one for the receive side). To execute clique data reorganizations, the program calls DRI_Channel_get with the send-side channel object as input. The returned buffer is filled and a data reorganization is initiated with a call to DRI_Channel_put (passing again as input the send-side channel object and the buffer id). The program then calls DRI_Channel_get, using the second channel object (associated with the receive side of the data reorganization). COMMUNICATION BEHAVIOR DRI_Channel_get is a blocking call and does not return until a full buffer of received data is available DRI_Channel_put is a non-blocking call and returns immediately to the calling application, regardless of whether the associated communication has completed. The channel object will manage the availability of the buffers associated with the data reorganization, protecting the buffer from future application use (via DRI_Channel_get) until the communication has completed and it is safe to reuse the buffer. RESTRICTIONS / POLICY It is possible to use the same DRI_Channel object for two different data reorganizations when using a clique data-parallel design. The receive-side channel object from the first data reorganization executed can also act as the send-side channel object for a second, distinct data reorganization. This is permissible when the data buffer sizes do not change as a result of application processing between the two data reorganizations. ----------- SECTION 4: Current API for remaining functions --------------- /******************** DRI_Global_Data_destroy ********************/ DRI_Global_Data_destroy - destroy a global data object SYNOPSIS DRI_Global_Data_destroy(global_data) PARAMETERS INOUT: global_data (DRI_Global_Data) - object that describes the global data DESCRIPTION Destroys the global data object referred to by the global_data input parameter. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY This function should only free resources associated with the global_data object when necessary. That is, all references to the global data object must be "destroyed" via this call before the actual internal resources used by the global data object are freed and returned to the system. /*************** DRI_Group_destroy ***************/ DRI_Group_destroy - Destroy an object representing a group of processes SYNOPSIS DRI_Group_destroy(grp) PARAMETERS INOUT: grp (DRI_Group) - process group object STANDALONE/ADAPTER NOTES See notes for DRI_Group_create. DESCRIPTION Destroys the process set group object referred to by the grp input parameter. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY This function should only free resources associated with the group object when necessary. That is, all references to the group object must be "destroyed" via this call before the actual internal resources used by the group object are freed and returned to the system. /******************** DRI_Overlap_destroy ********************/ DRI_Overlap_destroy - Destroy an overlap data partitioning object SYNOPSIS DRI_Overlap_destroy(ov) PARAMETERS INOUT: ov (DRI_Overlap) - overlap object DESCRIPTION Destroys the object referred to by the ov parameter COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY This function should only free resources associated with the overlap object when necessary. That is, all references to the overlap object must be "destroyed" via this call before the actual internal resources used by the overlap object are freed and returned to the system. /******************** DRI_Partition_destroy ********************/ DRI_Partition_destroy - Destroy a data distribution specification object SYNOPSIS DRI_Partition_destroy(part) PARAMETERS INOUT: part (DRI_Partition) - high-level data distribution object DESCRIPTION Destroys the object referred to by the part parameter. This parameter can refer to either a block or block-cyclic distribution object (created by DRI_Partition_block_create or DRI_Partition_blockcyclic_create, respectively). COMMUNICATION BEHAVIOR Local RESTRICTIONS / POLICY This function should only free resources associated with the part object when necessary. That is, all references to the part object must be "destroyed" via this call before the actual internal resources used by the part object are freed and returned to the system. /************** DRI_Bufferset_destroy ***************/ DRI_Bufferset_destroy - destroy shared application/library buffers for processing and data reorganization SYNOPSIS DRI_Bufferset_destroy (nbufs, bufsize bufset) PARAMETERS INOUT: bufset (DRI_Bufferset) - buffer set object destroyed STANDALONE/ADAPTER NOTES See DRI_Bufferset_create notes. DESCRIPTION Destroys the object referred to by the bufset parameter. COMMUNICATION BEHAVIOR Local. RESTRICTIONS / POLICY This function should only free resources associated with the bufferset object when necessary. That is, all references to the bufferset object must be "destroyed" via this call before the actual internal resources used by the bufferset object are freed and returned to the system. /*************** DRI_Channel_destroy ***************/ DRI_Channel_destroy - destroy data reorganization communication channel SYNOPSIS DRI_Channel_destroy(chan); PARAMETERS INOUT chan: (DRI_Channel) Data reorganization (channel) object destroyed STANDALONE/ADAPTER NOTES See DRI_Channel_create notes. DESCRIPTION Destroys the channel referred to by the chan parameter. Frees all internal resources used by the channel, including temporary buffers that may have been created during the earlier DRI_Channel_connect() call. COMMUNICATION BEHAVIOR It would be nice to be able to "shut down" a channel gracefully (i.e., in a "collective" fashion). This could be difficult with respect to process synchronization in pipeline application architectures, where many processes participate in two data reorganization channels. This scenario forces a specific order in which channels must be destroyed (or else deadlock could occur). Since this will apparently be pushed to the application level, a proposal would be to make DRI_Channel_destroy have local communication behavior, and to have applications use other middlewares for the necessary "graceful synchronization". DRI_Finalize ********************/ DRI_Finalize - Free resources used by the data reorganization run-time environment SYNOPSIS DRI_Finalize() PARAMETERS STANDALONE/ADAPTER NOTES Co-layer (middleware adapter) implementations based on MPI or MPI/RT may be able to accomplish the necessary Data Reorg finalize actions within their respective Finalize functions, depending on how they are implemented. DESCRIPTION Frees any internal resources used by the data reorganization implementation. COMMUNICATION BEHAVIOR Local RESTRICTIONS / POLICY