*** Arkady: temporary buffers vs. those created in a bufferset.
What happens when unexposed temp buffers contain the final result of a data reorg (e.g., there is no local reordering on the recv side)? Here, the data is in a buffer not available to the user, so an extra memory copy would be needed to satisfy a user's get() call on the recv channel.
Options to solve this problem:
The argument for future "directives" to control the temp vs. user buffer distinction is that they would give users who know certain things about their resource requirements appropriate control at the API level.
Seems like it is doable (today) to have implementation add buffers "under
the covers" - user access
Dennis: API doc should make clear that if user allocates X buffers, then the policy is _either_ to:
*** Typographical issue: DRI_Bufferset_system_create (use disth
instead of dist as name of parameter)
*** Suggested areas to cover today:
*** Bcast issue:
The intention is that when DRI_WHOLE partitioning is performed in each dimension of the gdo, all processes on the recv side of the corresponding channel will get a copy of the entire global data (i.e., it will be a "broadcast").
Action item: clarify this in the API document
Scenario: what happens when source nodes all use DRI_WHOLE in each dimension
of gdo?
Dennis: by definition, this is the user saying that the dataset on
each node holds identical values
Dennis: say, in a 5 to 100 node pipeline stage (where the 5 nodes have
data copies), the 5 nodes could "split up" the duties of sending the data
copies to the 100 nodes downstream (1 node in src grp can send to the "first
20" nodes of the downstream grp, etc.)
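Sketch of the "split up" idea Dennis describes, assuming receivers are simply divided into contiguous rank ranges among the senders. The function name assign_receivers is hypothetical and not part of the DRI API; the notes do not say how an implementation would actually divide the work.

```python
def assign_receivers(num_senders, num_receivers):
    """Return, for each sender rank, the contiguous range of receiver ranks it serves.

    Hypothetical helper: receivers are split as evenly as possible, with
    any remainder going to the lowest-ranked senders.
    """
    base, extra = divmod(num_receivers, num_senders)
    ranges = []
    start = 0
    for sender in range(num_senders):
        count = base + (1 if sender < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


# 5 source nodes holding identical copies, 100 downstream nodes.
plan = assign_receivers(5, 100)
for sender, receivers in enumerate(plan):
    print(f"sender {sender} serves {len(receivers)} receivers, "
          f"starting at rank {receivers.start}")
```

With 5 senders and 100 receivers, each sender serves 20 receivers, which matches the "first 20" example above.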
NOTE: there is a big distinction between this type of distribution and when DRI_BLOCK is used and the "minimum # of elements" requested from the partitioning is equal to the entire size of the global data dimension. In this latter case, it means that only 1 processor will get the data in that dimension, and the other nodes in the process grp are going to be "idle" during that stage of the application
Arkady/Myra: please specify interpretation of these cases in the API
doc.
Ken: should add advice to users on how to check partitioning results
to see whether calling process actually got anything!
*** Dennis/Arkady: we need to look at distribution results structure
to make sure that subsequent "for loops" over local data will _still work_
after the partitioning is done (even in cases when the range of data is
really only ZERO positions!)
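Sketch of the idle-process case and of the two points above (checking whether the calling process got anything, and loops that still work over an empty range). This assumes a one-dimensional block partitioning with half-open [start, stop) ranges; block_partition and min_elems are hypothetical names, not the DRI distribution results structure.

```python
def block_partition(global_size, num_procs, min_elems=1):
    """Split [0, global_size) into contiguous blocks, one (start, stop) per process.

    Hypothetical model: only as many processes as the minimum block size
    allows receive data; the rest get an empty range (start == stop).
    If global_size < min_elems, one process still receives everything.
    """
    if global_size > 0:
        active = max(1, min(num_procs, global_size // min_elems))
    else:
        active = 0
    ranges = []
    start = 0
    for rank in range(num_procs):
        if rank < active:
            base, extra = divmod(global_size, active)
            count = base + (1 if rank < extra else 0)
        else:
            count = 0
        ranges.append((start, start + count))
        start += count
    return ranges


# Minimum block size equals the whole dimension: only rank 0 gets data.
ranges = block_partition(8, 4, min_elems=8)
sums = []
for rank, (lo, hi) in enumerate(ranges):
    if lo == hi:
        print(f"rank {rank}: no local data in this dimension")
    total = 0
    for i in range(lo, hi):  # runs zero times for an empty range
        total += i
    sums.append(total)
```

Representing an empty assignment as start == stop, rather than with a sentinel, is what lets the same loop run unchanged on idle processes.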
*** Visibility into partitioning across the environment issue:
Myra in her own prototyping has added such functions to get partitioning info for all processes (remote side has to be determined after channel connect). Myra thinks impls should be allowed to expose such a call, but it should definitely _NOT_ be part of the official API
Ken: having such calls sounds like MPE lib extensions to MPICH in the
sense that it gives detailed visibility into application behavior
*** Data distribution object issue:
Mercury has raised some problems with the existing approach in the API and has submitted an alternative approach. Key issues are:
It's as simple as adding minsz and mod parameters into this function.
min=1
mod=1
Above parameter settings allow a classical block-cyclic partitioning
approach, in which blocking factors and process group dimensionality are
provided as inputs.
Ken: concern that third parties later in the future who want the "classical"
way to specify block-cyclic partitionings (blocking factor and #procs in
each dimension) will see these additional parameters (min & modulo)
as too much specification.
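For reference, a sketch of the "classical" one-dimensional block-cyclic rule, taking only a blocking factor and a process count as inputs. The minsz and mod parameters are deliberately not modeled, since these notes do not spell out their semantics; the function names are hypothetical.

```python
def block_cyclic_owner(index, block_size, num_procs):
    """Rank that owns global element `index` under block-cyclic distribution."""
    return (index // block_size) % num_procs


def local_indices(rank, global_size, block_size, num_procs):
    """Global indices held by `rank`, in ascending order."""
    return [i for i in range(global_size)
            if block_cyclic_owner(i, block_size, num_procs) == rank]


# 10 elements, blocking factor 2, 3 processes:
# blocks [0,1] [2,3] [4,5] [6,7] [8,9] are dealt out to ranks 0,1,2,0,1.
for rank in range(3):
    print(rank, local_indices(rank, 10, 2, 3))
```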
2) Iterator function that traverses through local buffer and returns
block meta-info (useful for block cyclic partitionings)
We want to specify the order in which the dimensions are accessed (there
is some # of blocks in each dimension assigned to the local process)
This is because order of access, and the operations performed on these
blocks will impact what is hot and cold in cache. In this case, the user
specifies the specific block number by indexing it in each dimension. To
address how the user code should set its loop limits, there is a setup
call that would return the number of blocks in each dimension.
Also can provide a traversal that saves state for the user, and the
user just calls get_next_block. This function also requires a setup function.
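Sketch of both traversal styles discussed above, under assumed names: block_counts stands in for the setup call that returns the number of blocks in each dimension, and iter_blocks for a stateful traversal in the spirit of get_next_block. The `order` argument (dimensions listed from slowest- to fastest-varying) is an assumption about how the access order might be specified.

```python
from itertools import product


def block_counts(local_extent, block_shape):
    """Setup step: number of blocks per dimension (ceiling division)."""
    return tuple(-(-n // b) for n, b in zip(local_extent, block_shape))


def iter_blocks(counts, order):
    """Yield block index tuples.

    `order` lists dimensions from slowest- to fastest-varying, so the
    last dimension in `order` changes on every step.
    """
    for combo in product(*(range(counts[d]) for d in order)):
        idx = [0] * len(counts)
        for d, i in zip(order, combo):
            idx[d] = i
        yield tuple(idx)


counts = block_counts((4, 6), (2, 2))  # 2 blocks by 3 blocks

# Style 1: user code sets its own loop limits from the setup call.
explicit = [(i, j) for i in range(counts[0]) for j in range(counts[1])]

# Style 2: the traversal keeps the state; the user just asks for the next block.
blocks = iter_blocks(counts, order=(1, 0))  # dimension 0 varies fastest
first = next(blocks)
second = next(blocks)
```

The generator holds the traversal state, so user code does not have to track block indices itself.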
***
Arkady: discussed problems associated with conflicting pad/overlap
specifications in different dimensions. What happens with corner elements
of the data? This problem is exacerbated in higher dimensional data, because
the amount of data affected grows beyond just a single element
of data. Which dimension of data governs in these conflicting cases? Should
this be a user-specifiable thing? Should it be an error?
Jon: suggestion - perform the overlap/pad according to the "lowest"
dimension (in terms of memory layout) first, then apply overlap specification
of higher ordered dimensions in order
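Sketch of one possible reading of Jon's suggestion on a 2-D example, assuming row-major layout so that the column index is the "lowest" (most contiguous) dimension. pad_dim is a hypothetical helper, and the fill markers "c" and "r" only show which pass wrote each cell; the notes do not settle what corner values should be.

```python
def pad_dim(grid, dim, width, fill):
    """Pad a 2-D list of rows by `width` cells on both sides of `dim`.

    dim 1 is the column index (most contiguous in row-major layout);
    dim 0 is the row index.
    """
    if dim == 1:
        return [[fill] * width + list(row) + [fill] * width for row in grid]
    ncols = len(grid[0])
    top = [[fill] * ncols for _ in range(width)]
    bottom = [[fill] * ncols for _ in range(width)]
    return top + [list(row) for row in grid] + bottom


grid = [[1, 2],
        [3, 4]]

# Lowest dimension first, then the higher-ordered dimension.
step1 = pad_dim(grid, dim=1, width=1, fill="c")
step2 = pad_dim(step1, dim=0, width=1, fill="r")

for row in step2:
    print(row)
```

Under this ordering the second pass sees the already widened rows, so the corner cells are written by the higher-ordered dimension's specification.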
*** dri_partition_whole_create
Myra: why doesn't dri_partition_whole_create take overlap arguments? She recalls an email where somebody argued for this capability
Actually, this function was created at the March 2000 meeting. There were also pre-defined objects DRI_PARTITION_WHOLE, BLOCK, BLOCKCYCLIC that did "reasonable" things.
Action item: group needs to vote on whether to:
*** Layout proposal from Mercury
Defines the sub-region of a buffer in which a data reorg channel will read or write
Dennis is looking for ways to have multiple channels operate on the same
buffer, but each channel on different sub-regions of the buffer
(scatter/gather type operation)
"order" parameter - indicates degree of contiguousness. 0 indicates
that the affected dimension of the global data object is most contiguous.
1 indicates that there is 1 other dimension that is ordered faster. 2 indicates
that there are 2 dimensions ordered faster. So, in 3 dimensions, order=2
is the least contiguous dimension, and order=0 is the most contiguous.
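Sketch of how the "order" values map to memory strides, assuming a densely packed buffer with no padding between elements. strides_from_order is a hypothetical helper, not a function from the Mercury proposal.

```python
def strides_from_order(shape, order):
    """Return the element stride of each dimension.

    order[d] == k means exactly k other dimensions vary faster than d,
    so order == 0 marks the most contiguous dimension.
    Assumes dense packing (no gaps between elements).
    """
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of 0..ndim-1")
    strides = [0] * len(shape)
    stride = 1
    # Visit dimensions from most contiguous to least contiguous.
    for d in sorted(range(len(shape)), key=lambda d: order[d]):
        strides[d] = stride
        stride *= shape[d]
    return strides


# 3-D object of shape 4 x 5 x 6.
row_major = strides_from_order((4, 5, 6), order=(2, 1, 0))
col_major = strides_from_order((4, 5, 6), order=(0, 1, 2))
print(row_major, col_major)
```

In the first call the last dimension has order=0 (stride 1) and the first has order=2 (the largest stride), matching the 3-dimension description above.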
**** Action items to prepare for Sep meeting