NOTE: This is not an official meeting. The purpose of the meeting is to coordinate the end-game plan for DR and to discuss technical issues.

*** Arkady: temporary buffers vs. those created in a bufferset.

What happens when unexposed temp buffers contain the final result of a data reorg (e.g., when there is no local reordering on the recv side)? In this case the data sits in a buffer that is not available to the user, so an extra memory copy would be needed to satisfy the user's get() call on the recv channel.

Options to solve this problem:

  1. User supplies a bufferset that has enough buffers to cover both temp and result buffers
  2. Implementation modifies the bufferset at channel connect time to include additional temp buffers (which can be exposed to application code later in channel get)
  3. At some future time, we add "directives" to the API to help the user communicate such details to the lib

Question: has the bufferset definition to date been targeted exclusively at user buffers?

The argument for future "directives" that control the temp vs. user buffer distinction is that users who know certain things about resource requirements would get appropriate control at the API level.

Seems like it is doable (today) to have implementation add buffers "under the covers" - user access
 

Dennis: API doc should make clear that if user allocates X buffers, then the policy is _either_ to:


*** Typographical issue: DRI_Bufferset_system_create (use disth instead of dist as name of parameter)
 

*** Suggested areas to cover today:


*** Bcast issue:

The intent is that when DRI_WHOLE partitioning is used in each dimension of the gdo, all processes on the recv side of the corresponding channel get a copy of the entire global data (i.e., it is a "broadcast").

Action item: clarify this in the API document

Scenario: what happens when the source nodes all use DRI_WHOLE in each dimension of the gdo?
Dennis: by definition, this is the user saying that the dataset on each node holds identical values
Dennis: say, in a 5 to 100 node pipeline stage (where the 5 nodes hold copies of the data), the 5 nodes could "split up" the duties of sending the data copies to the 100 nodes downstream (e.g., 1 node in the src grp sends to the "first 20" nodes of the downstream grp, etc.)
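
A minimal sketch of the "split up" idea above, assuming receivers are assigned to senders in contiguous groups whose sizes differ by at most one. The function name bcast_sender_for is hypothetical and is not part of the DRI API; an implementation is free to choose a different assignment.

```c
#include <assert.h>

/* Hypothetical helper (not DRI API): returns which source rank
 * (0..nsrc-1) sends the broadcast copy to receiver rank dst
 * (0..ndst-1). Receivers are handed out in contiguous groups; the
 * first (ndst % nsrc) sources each serve one extra receiver. */
int bcast_sender_for(int dst, int nsrc, int ndst)
{
    assert(nsrc > 0 && ndst > 0 && dst >= 0 && dst < ndst);

    int base = ndst / nsrc;          /* receivers every source serves */
    int extra = ndst % nsrc;         /* sources that serve base + 1 */
    int cutoff = extra * (base + 1); /* first receiver of the "base" groups */

    if (dst < cutoff)
        return dst / (base + 1);
    /* base > 0 here: base == 0 implies cutoff == ndst, handled above */
    return extra + (dst - cutoff) / base;
}
```

With nsrc = 5 and ndst = 100, source 0 serves receivers 0 through 19, source 1 serves 20 through 39, and so on, which matches the "first 20" example.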

NOTE: there is a big distinction between this type of distribution and when DRI_BLOCK is used and the "minimum # of elements" requested from the partitioning is equal to the entire size of the global data dimension. In this latter case, it means that only 1 processor will get the data in that dimension, and the other nodes in the process grp are going to be "idle" during that stage of the application

Arkady/Myra: please specify interpretation of these cases in the API doc.
Ken: should add advice to users on how to check partitioning results to see whether calling process actually got anything!

*** Dennis/Arkady: we need to look at distribution results structure to make sure that subsequent "for loops" over local data will _still work_ after the partitioning is done (even in cases when the range of data is really only ZERO positions!)
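
A rough sketch of the two points above: checking whether the calling process received anything, and keeping "for loops" valid when the local range is empty. It assumes a 1-D block partition in which the block size is the even split rounded up, raised to the requested minimum. The block_range struct and all function names are hypothetical; the real distribution results structure may look different.

```c
#include <assert.h>

/* Hypothetical result record (not the DRI results structure). */
typedef struct {
    long start;  /* first global index owned by this process */
    long count;  /* number of elements owned; 0 means the process is idle */
} block_range;

/* Assumed block partition: block size is ceil(n / nprocs), raised to
 * min_elems when the even split would be smaller than the minimum. */
block_range block_partition(long n, int nprocs, int rank, long min_elems)
{
    assert(n >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    assert(min_elems >= 1);

    long bs = (n + nprocs - 1) / nprocs;
    if (bs < min_elems)
        bs = min_elems;

    block_range r;
    r.start = (long)rank * bs;
    if (r.start >= n) {
        r.start = n;   /* empty range, kept inside the global bounds */
        r.count = 0;
    } else {
        r.count = (n - r.start < bs) ? (n - r.start) : bs;
    }
    return r;
}

/* The check a user could make after partitioning. */
int process_has_data(block_range r)
{
    return r.count > 0;
}

/* Sums the locally owned elements. The loop body never runs when
 * r.count == 0, so an idle process needs no special case. */
double sum_local(const double *local, block_range r)
{
    double total = 0.0;
    for (long i = 0; i < r.count; ++i)
        total += local[i];
    return total;
}
```

Storing a start and a count (rather than inclusive first/last indices) is what lets the same loop cover the zero-element case. With n = 1024, 4 processes, and min_elems = 1024, only rank 0 gets data and ranks 1 through 3 see count == 0.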
 
 

*** Visibility into partitioning across the environment issue:

In her own prototyping, Myra has added functions that return partitioning info for all processes (the remote side can only be determined after channel connect). Myra thinks impls should be allowed to expose such a call, but it should definitely _NOT_ be part of the official API

Ken: having such calls sounds like MPE lib extensions to MPICH in the sense that it gives detailed visibility into application behavior
 

*** Data distribution object issue:

Mercury has raised some problems with the existing approach in the API and has submitted an alternative approach. Key issues are:

*** Adding/modifying 2 routines:
1) dri_partition_blockcyclic_create (adding new parameters, min + modulo)

It's as simple as adding minsz and mod parameters into this function.

min=1
mod=1
The parameter settings above give a classical block-cyclic partitioning, in which blocking factors and process group dimensionality are provided as inputs.
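
For reference, a small sketch of the classical 1-D block-cyclic mapping that the min=1, mod=1 settings are said to reduce to. The minutes do not spell out what min and mod do for other values, so the sketch deliberately leaves them out. Both function names are hypothetical and are not DRI calls.

```c
#include <assert.h>

/* Classical 1-D block-cyclic mapping: blocks of `blocksize` consecutive
 * elements are dealt round-robin to `nprocs` processes.
 * Returns the rank that owns global index i. */
int blockcyclic_owner(long i, long blocksize, int nprocs)
{
    assert(i >= 0 && blocksize > 0 && nprocs > 0);
    return (int)((i / blocksize) % nprocs);
}

/* Returns the position of global index i inside its owner's local
 * buffer, assuming the owner stores its blocks back to back. */
long blockcyclic_local_index(long i, long blocksize, int nprocs)
{
    assert(i >= 0 && blocksize > 0 && nprocs > 0);
    long block = i / blocksize;         /* global block number */
    long local_block = block / nprocs;  /* block number on the owner */
    return local_block * blocksize + i % blocksize;
}
```

With a blocking factor of 2 and 3 processes, global indices 0-1 go to rank 0, 2-3 to rank 1, 4-5 to rank 2, and 6-7 wrap around to rank 0 as its second local block.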

Ken: concern that third parties who later want the "classical" way to specify block-cyclic partitionings (blocking factor and #procs in each dimension) will see these additional parameters (min & modulo) as too much specification.
 
 

2) Iterator function that traverses through local buffer and returns block meta-info (useful for block cyclic partitionings)
 

We want to specify the order in which the dimensions are accessed (there is some # of blocks in each dimension assigned to the local process).
This is because the order of access and the operations performed on these blocks affect what is hot and cold in cache. In this case, the user selects a specific block by indexing it in each dimension. So that user code can set its loop limits, there is a setup call that returns the number of blocks in each dimension.

Also can provide a traversal that saves state for the user, and the user just calls get_next_block. This function also requires a setup function.
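
One possible shape for the stateful traversal, as a sketch only. The type block_iter and the functions block_iter_setup and block_iter_next are hypothetical stand-ins for the setup call and get_next_block; the sketch returns only per-dimension block indices, not the full block meta-info, and it assumes the caller already knows the local block counts.

```c
#include <assert.h>

#define MAX_DIMS 4

/* Hypothetical iterator state (names are illustrative, not DRI API). */
typedef struct {
    int ndims;
    int nblocks[MAX_DIMS]; /* local block count in each dimension */
    int order[MAX_DIMS];   /* order[0] is the dimension that varies fastest */
    int index[MAX_DIMS];   /* per-dimension index of the next block */
    int done;              /* nonzero once every block has been visited */
} block_iter;

/* Setup call: records block counts and the dimension access order. */
void block_iter_setup(block_iter *it, int ndims,
                      const int *nblocks, const int *order)
{
    assert(it && nblocks && order && ndims > 0 && ndims <= MAX_DIMS);
    it->ndims = ndims;
    it->done = 0;
    for (int d = 0; d < ndims; ++d) {
        assert(nblocks[d] >= 0 && order[d] >= 0 && order[d] < ndims);
        it->nblocks[d] = nblocks[d];
        it->order[d] = order[d];
        it->index[d] = 0;
        if (nblocks[d] == 0)
            it->done = 1;  /* no local blocks: the traversal is empty */
    }
}

/* Writes the next block's per-dimension indices into out and returns 1,
 * or returns 0 (leaving out untouched) when the traversal is finished. */
int block_iter_next(block_iter *it, int *out)
{
    if (it->done)
        return 0;
    for (int d = 0; d < it->ndims; ++d)
        out[d] = it->index[d];

    /* Advance like an odometer, fastest-varying dimension first. */
    int k = 0;
    for (; k < it->ndims; ++k) {
        int d = it->order[k];
        if (++it->index[d] < it->nblocks[d])
            break;
        it->index[d] = 0;
    }
    if (k == it->ndims)
        it->done = 1;
    return 1;
}
```

With 2 blocks in dimension 0, 3 blocks in dimension 1, and dimension 1 chosen as fastest, the blocks come back as (0,0), (0,1), (0,2), (1,0), (1,1), (1,2). Swapping the order array changes the traversal without touching the user's loop body, which is the cache-behavior control described above.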
 

***
Arkady: discussed problems associated with conflicting pad/overlap specifications in different dimensions. What happens with the corner elements of the data? The problem is worse for higher-dimensional data, because the amount of affected data grows beyond a single element. Which dimension of the data governs in these conflicting cases? Should this be a user-specifiable thing? Should it be an error?

Jon: suggestion - perform the overlap/pad according to the "lowest" dimension (in terms of memory layout) first, then apply overlap specification of higher ordered dimensions in order
 
 

*** dri_partition_whole_create

Myra: why doesn't dri_partition_whole_create take overlap arguments? She recalls an email where somebody argued for this capability

Actually, this function was created in the March 2000 meeting. There were also pre-defined objects DRI_PARTITION_WHOLE, BLOCK, BLOCKCYCLIC that did "reasonable" things.

Action item: group needs to vote on whether to:

  1. keep or kill dri_partition_whole_create
  2. If dri_partition_whole_create is kept, whether "overlap" specification is needed as additional parameters
  3. keep or kill DRI_PARTITION_WHOLE pre-defined object


*** Layout proposal from Mercury

Defines the sub-region of a buffer in which a data reorg channel will read or write

Dennis looking for ways to have multiple channels operate on the same buffer, but each channel on different sub-regions of the buffer
(scatter/gather type operation)
 

"order" parameter - indicates degree of contiguousness. 0 indicates that the affected dimension of the global data object is most contiguous. 1 indicates that there is 1 other dimension that is ordered faster. 2 indicates that there are 2 dimensions ordered faster. So, in 3 dimensions, order=2 is the least contiguous dimension, and order=0 is the most contiguous.
 
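To make the "order" description above concrete, here is a sketch that turns per-dimension order values into element strides. It assumes a densely packed buffer with no padding and that the order values form a permutation of 0..ndims-1. The function name strides_from_order is hypothetical and is not part of the Mercury proposal.

```c
#include <assert.h>

/* Computes element strides from per-dimension sizes and "order" values.
 * order[d] == 0 marks the most contiguous dimension (stride 1);
 * order[d] == k means k other dimensions are ordered faster than d.
 * Assumes dense packing and that order is a permutation of 0..ndims-1. */
void strides_from_order(int ndims, const long *size,
                        const int *order, long *stride)
{
    assert(ndims > 0 && size && order && stride);

    long step = 1;  /* stride of the dimension being placed */
    for (int k = 0; k < ndims; ++k) {
        int found = 0;
        for (int d = 0; d < ndims; ++d) {
            if (order[d] == k) {
                assert(size[d] > 0);
                stride[d] = step;
                step *= size[d];
                found = 1;
                break;
            }
        }
        assert(found);  /* every order value must appear exactly once */
    }
}
```

For a 4 x 5 x 6 object with order = {2, 1, 0}, the strides come out as {30, 6, 1}: dimension 2 is the most contiguous and dimension 0 the least, which is the usual C array layout.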
 

**** Action items to prepare for Sep meeting
 

  1. Should have the API document updated 1 week prior to the Sep meeting (reflecting the proposals discussed today; the Sep meeting requires voting on these & on the current state of the doc)
  2. Include new format for API doc???
  3. In a formal document, make sure that examples are shown. Examples should show pictures, and the actual numeric contents of the structures/objects that get specified and/or created
  4. Ken sends minutes to all participants
  5. Mercury updates their proposal documents to reflect decisions made in this mtg.