1. Define specific communication semantics, and use-case rules for
DRI_Reorg connection/disconnection.
Per-Reorg granularity: DRI_Reorg_connect, DRI_Reorg_disconnect
Library-wide granularity: DRI_connect, DRI_disconnect
2. Discuss any remaining API changes that may be needed before
finalization of the spec.
- this will not include any re-hashing of previous decisions
- only new, minor enhancements should be discussed
- this agenda item must end in a timely manner to give enough time to
item #3 which is more important.
3. Work out the logistics for completing DRI 1.0 spec.
3a. Define overall schedule/tasks and be prepared to take action items.
3b. Determine date for next meeting
3c. Determine whether formal voting on API can take place at next meeting.
Discussing the "glossary of terms" in Ken's proposal:
myra: uses the term non-blocking differently from the definition in the doc. To her, a non-blocking put, for example, "fails" when it cannot perform the action and gives a return status to the caller to indicate this. The user can then "try again" by calling put again.
looking at ken's definition of synchronous, we decided that synchronous means barrier, and that operations such as the side A connect / side B create completion is really just "collective". (see Steve's email on the reflector where he proposed this)
one thing that needs to be clarified is how inconsistent creates (e.g., non-matching distribution parameters) on the other side are detected, and whether we should _not_ detect such errors because doing so would unnecessarily increase the complexity of the collective communication (the Connect on side A talking to Creates on side B). For example, to forgo error detection, you could talk to only one representative process on side B to make sure that it has completed DRI_Reorg_create, and use its information to determine the properties of the "other side".
steve: suggests that maybe we could have a development mode vs. production mode type of approach (but not necessarily exactly that) to help users know exactly when certain errors will be caught (or not)
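The tradeoff between checking every side-B process and trusting one representative can be sketched as below. This is only an illustration, not spec text: the `sk_dist_params` struct, its two fields, and the `SK_CHECK_ALL` / `SK_CHECK_NONE` modes are hypothetical names standing in for whatever a create call advertises and for the development vs. production idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of what one side-B process advertised at create time. */
typedef struct {
    int block_size;  /* assumed distribution parameter */
    int num_procs;   /* assumed process-group size */
} sk_dist_params;

/* Hypothetical checking modes (development-like vs. production-like). */
typedef enum { SK_CHECK_NONE, SK_CHECK_ALL } sk_check_mode;

/* Returns 0 and fills *out on success, -1 if a mismatch is detected.
 * SK_CHECK_NONE reads only the representative (index 0), so a mismatch
 * elsewhere goes unnoticed. SK_CHECK_ALL compares every entry against it. */
static int sk_collect_remote(const sk_dist_params *adv, size_t n,
                             sk_check_mode mode, sk_dist_params *out)
{
    assert(adv != NULL && out != NULL && n > 0);
    if (mode == SK_CHECK_ALL) {
        for (size_t i = 1; i < n; i++) {
            if (adv[i].block_size != adv[0].block_size ||
                adv[i].num_procs != adv[0].num_procs)
                return -1;
        }
    }
    *out = adv[0];
    return 0;
}
```

In the unchecked mode the cost is one lookup regardless of the size of side B; the price is that an inconsistent create on a non-representative process is silently accepted.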
Decisions on glossary of terms component of proposal:
Discussing ken's proposal for use case #1:
reorg_create: advertises parameters
reorg_connect: retrieves create information provided by other side's create
jon: when should buffers be allocated? thinks it should be at reorg create-time
steve: thinks buffer allocation should be at reorg connect-time
jamie: concurs with jon because connect-time allocation requires a
barrier-synchronous connect call (to allow processes to exchange buffer addresses)
ken: there is a tradeoff; creating buffers at create-time avoids the slower
barrier-style connect. But, creating them at connect-time has some benefits
for mode-change support (create reorgs, only connect the appropriate ones;
and by extension only create the buffer resources for the initially active
reorgs).
steve: potential problem with cliques is that the first connect call will
block; the 2nd call (corresponding to the other side) doesn't even get a
chance to run.
jamie: at first reorg_connect call, it will be known that the reorg is a
clique, and could just return. The second connect call would do all the work.
ken: yes, this is consistent with some previous discussions we have had on
this topic
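jamie's suggestion for cliques can be sketched roughly as follows. The struct and function names are hypothetical, and the "work" of connecting is reduced to setting a flag; a real connect on a non-clique reorg would involve the other side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reorg state. In a clique, the send and receive sides live in
 * the same process, so connect is called twice from that process. */
typedef struct {
    bool is_clique;
    int  connect_calls;  /* how many sides have called connect so far */
    bool connected;
} sk_clique_reorg;

/* Returns 0 on success. For a clique, the first call only records itself and
 * returns without blocking; the second call does the connection work. */
static int sk_clique_connect(sk_clique_reorg *r)
{
    assert(r != NULL);
    r->connect_calls++;
    if (r->is_clique && r->connect_calls < 2)
        return 0;         /* defer: the other side has not called yet */
    r->connected = true;  /* stand-in for the real exchange */
    return 0;
}
```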
steve: if we can get the reorg_connect to work, then library-wide DRI_connect
would seem to be just a convenience function (one that loops over all the
DRI_Reorg_connects)
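As a sketch of the "convenience function" view, a library-wide connect reduces to a loop over per-reorg connects. All names here are hypothetical; in this version the caller supplies the array of reorgs, so the library does not have to track them itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal reorg object. */
typedef struct { int connected; } sk_reorg;

/* Stand-in for a per-reorg connect; returns 0 on success. */
static int sk_reorg_connect(sk_reorg *r)
{
    assert(r != NULL);
    r->connected = 1;
    return 0;
}

/* Connects every reorg in the array; stops at the first failure and
 * returns its error code. */
static int sk_connect_all(sk_reorg *const *reorgs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int rc = sk_reorg_connect(reorgs[i]);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```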
steve: what happens if you call DRI_Reorg_disconnect?
ken: esp. need to consider what should happen if buffers are checked out
by caller
jamie: require user to have put all buffers back into Reorg before calling
disconnect
steve: on send side, how do we flush?
could be the same issue, make sure all puts have been done, and all buffers
are checked in before disconnecting
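The "all buffers checked in before disconnect" rule could look like the following. This is a sketch with hypothetical names and error codes; whether disconnect should return an error or do something else when buffers are still out was not settled in this part of the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes. */
enum { SK_OK = 0, SK_ERR_BUFFERS_OUT = -1, SK_ERR_NO_BUFFER = -2 };

typedef struct {
    int num_bufs;     /* buffers owned by the reorg */
    int checked_out;  /* buffers currently held by the caller */
    int connected;
} sk_buf_reorg;

/* Caller checks a buffer out of the reorg. */
static int sk_buf_get(sk_buf_reorg *r)
{
    assert(r != NULL);
    if (r->checked_out == r->num_bufs)
        return SK_ERR_NO_BUFFER;
    r->checked_out++;
    return SK_OK;
}

/* Caller checks a buffer back in. */
static int sk_buf_put(sk_buf_reorg *r)
{
    assert(r != NULL && r->checked_out > 0);
    r->checked_out--;
    return SK_OK;
}

/* Refuses to disconnect while the caller still holds buffers. */
static int sk_buf_disconnect(sk_buf_reorg *r)
{
    assert(r != NULL);
    if (r->checked_out > 0)
        return SK_ERR_BUFFERS_OUT;
    r->connected = 0;
    return SK_OK;
}
```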
steve: if alloc occurs at connect-time, how would the information actually
get exchanged?
create advertises distribution/procgroup information
connect mallocs and advertises the allocation information, collects distribution/procgroup
info from earlier create call
first invocation of get/put collects the buffer address information
steve: wants to avoid extra work in first get/put, wants that to be very
much an inner loop function
ken: one way to get reuse of memory disjoint in time by different reorgs
assoc. with different processing modes is to have disconnect NOT free the
memory. Next Reorg_connect (assoc with different mode) could examine the last
used memory area and see if it is big enough to satisfy this new reorg.
jamie: potential complications with this approach if you are executing multiple
application threads
jon: we can accomplish this alternatively by having user-allocated memory
input to DRI_Reorg_create (via some sort of handle). User could size and
allocate memory before any modes start, then DRI_Reorg_connect would bind
the Reorg with its memory (as supplied to Reorg_create). Reorg_disconnect
could disassociate the Reorg with that memory, marking it as available for
re-use disjoint in time with another Reorg
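jon's alternative can be sketched as a user-owned memory region that a reorg binds to at connect and releases at disconnect. Everything below is hypothetical (names, fields, the -1 error code), and the sketch does no locking, so jamie's concern about multiple application threads is not addressed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handle for a user-allocated region, shared disjoint in time. */
typedef struct {
    void  *base;
    size_t size;
    int    in_use;  /* nonzero while some reorg is bound to the region */
} sk_mem_region;

typedef struct {
    sk_mem_region *mem;     /* supplied at create time */
    size_t         needed;  /* bytes this reorg requires */
    int            bound;
} sk_mem_reorg;

/* Create only records which region the reorg will use later. */
static void sk_mem_create(sk_mem_reorg *r, sk_mem_region *m, size_t needed)
{
    assert(r != NULL && m != NULL);
    r->mem = m;
    r->needed = needed;
    r->bound = 0;
}

/* Connect binds the reorg to its region; -1 if busy or too small. */
static int sk_mem_connect(sk_mem_reorg *r)
{
    assert(r != NULL && !r->bound);
    if (r->mem->in_use || r->mem->size < r->needed)
        return -1;
    r->mem->in_use = 1;
    r->bound = 1;
    return 0;
}

/* Disconnect marks the region as available for another reorg. */
static void sk_mem_disconnect(sk_mem_reorg *r)
{
    assert(r != NULL && r->bound);
    r->mem->in_use = 0;
    r->bound = 0;
}
```

A user would size and allocate the region once, before any mode starts, and hand the same region to the reorgs of different modes.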
steve: what if we did supply a bufferset arg to Reorg_create? Thinks that
we'll need a bufferset "finalize" to solidify the size of the bufferset before
doing any DRI_Reorg_connect calls. This finalize call would publish the addresses
to all processes in the DRI network (myra: alternatively, it is just known
that the data is available from the appropriate sources). Then, the individual
DRI_Reorg_connect calls could retrieve the information (DRI_Reorg_connect
is collective with completion of DRI_Bufferset_"post" on other side of reorg).
ken: bufferset based specification for disjoint in time sharing among reorgs
is really awkward (because you can't even use buffersets to express sharing
of memory within the same mode -- user will then have to create separate
buffersets within a single mode -- makes disjoint in time sharing tedious
at best).
ken: ok, so it seems that disjoint in time sharing is difficult with the
individual Reorg connect/disconnect model (without having simultaneous in
time sharing). This means that disconnect must free memory, and the next
Reorg connect must malloc memory. Is it even worth it to pursue this approach
if mode switches are going to take so much time?
group: it is not worth it until we can resolve the larger sharing issues in
a later version of DRI spec
discussing whether to have library-wide DRI_connect or individual DRI_Reorg_connect approach.
group: decides to remove DRI_Network from the spec (because given either
of above approaches, it isn't clear what role Network is providing. It _may_
be a placeholder in which to collect information about all of the DRI_Reorgs
created by the calling process, but it doesn't seem necessary).
ken: if you use individual reorg_connects, do you have to do all connects
before any get/put calls?
jamie: within the same thread of control, yes
myra/jamie: in PAS, connect call is optional and allows deferring to first
put/get
steve: wants to force user for a reorg R to call connect before calling
get/put (i.e., don't allow deferring of connect process to first execution
of put/get)
group: agrees
ok, now what about disconnecting a reorg?
ken: halt or abort like situation when you want to quickly mode-change
jon: what about ensuring all pending communications have gone through all
the reorgs in a chain
steve: to address completing all pending communications, first reorg in the chain could send a 32-bit control word that gets sent down the chain.
jamie: we need to send the control potentially by itself (not necessarily with a "full" buffer of data)
jamie: we could create a special control Datapart object that could be sent in place of a regular Datapart object
steve: have a putcontrol function that puts the 32-bit value, and the get function on the other side gives an error return code indicating that the received data contains this control value
DRI_Reorg_"notify" semantics:
- require user to send the same value from all source processes
- don't require implementation to actually check that received values
from all sources are same
- pass by value DRI_Reorg_notify(Reorg, value)
- jamie: value should take up a "buffer" (or dpo) slot internally in
Reorg. That way ordering can be preserved between sending values and sending
real buffers
- Reorg_get_datapart returns DRI_GOT_VALUE (instead of VALUE, use whatever
you select [ken] for the name of the put function)
- steve proposes we just use the existing void **ptr to point to the
value (ken agrees because we're treating the value as just another buffer
within the data reorg, so the pointer points to the value itself)
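The notify semantics above (value occupies a slot, ordering preserved with real buffers, get reports that it received a value and points at it) can be sketched with a small FIFO. All names, the return codes, and the fixed slot count are hypothetical; this models one process locally and does no communication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes. */
enum { SK_GOT_DATA = 0, SK_GOT_VALUE = 1, SK_EMPTY = -1, SK_FULL = -2 };
#define SK_SLOTS 4

/* One slot holds either a data buffer or a 32-bit control value. */
typedef struct {
    int      is_value;
    uint32_t value;
    void    *data;
} sk_slot;

typedef struct {
    sk_slot slots[SK_SLOTS];
    size_t  head, count;
} sk_chan;

static void sk_chan_init(sk_chan *c) { c->head = 0; c->count = 0; }

static int sk_chan_push(sk_chan *c, sk_slot s)
{
    if (c->count == SK_SLOTS)
        return SK_FULL;
    c->slots[(c->head + c->count) % SK_SLOTS] = s;
    c->count++;
    return 0;
}

static int sk_chan_put(sk_chan *c, void *buf)
{
    sk_slot s = { 0, 0, buf };
    return sk_chan_push(c, s);
}

/* The value is passed by value and takes up a slot like a buffer would. */
static int sk_chan_notify(sk_chan *c, uint32_t v)
{
    sk_slot s = { 1, v, NULL };
    return sk_chan_push(c, s);
}

/* *ptr points at the data buffer, or at the value stored in the slot.
 * The value pointer stays valid only until the slot is reused. */
static int sk_chan_get(sk_chan *c, void **ptr)
{
    if (c->count == 0)
        return SK_EMPTY;
    sk_slot *s = &c->slots[c->head];
    c->head = (c->head + 1) % SK_SLOTS;
    c->count--;
    if (s->is_value) {
        *ptr = &s->value;
        return SK_GOT_VALUE;
    }
    *ptr = s->data;
    return SK_GOT_DATA;
}
```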
RESOLUTIONS:
- we are removing the DRI_Network handle from the specification
- all memory is created at reorg_create_time
- reorg process inplace called with 2 reorg objects to indicate in-place
processing relationship
< here, the "send side" reorg must have been created
with num_bufs=0, else Reorg_process_inplace returns an error>
- we have individual reorg_connects, instead of library-wide DRI_connect
- an individual reorg must have its connect called before get/put can be
called on it
- we will add a new in-band control mechanism for Reorgs that sends
a 32-bit value
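Two of the resolutions above are simple precondition checks, sketched here. The names, error codes, and the argument order of the in-place call are assumptions; the reading that the send-side reorg reuses the receive-side reorg's buffers is also an interpretation of "in-place processing relationship", not something stated in these notes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes. */
enum { SK_OK = 0, SK_ERR_NOT_CONNECTED = -1, SK_ERR_INPLACE_BUFS = -2 };

typedef struct sk_res_reorg {
    int num_bufs;
    int connected;
    struct sk_res_reorg *inplace_peer;  /* reorg whose buffers are reused */
} sk_res_reorg;

static int sk_res_connect(sk_res_reorg *r)
{
    assert(r != NULL);
    r->connected = 1;
    return SK_OK;
}

/* get/put are rejected until connect has been called on this reorg. */
static int sk_res_get(sk_res_reorg *r)
{
    assert(r != NULL);
    return r->connected ? SK_OK : SK_ERR_NOT_CONNECTED;
}

/* The send-side reorg must have been created with num_bufs == 0. */
static int sk_res_process_inplace(sk_res_reorg *recv_side,
                                  sk_res_reorg *send_side)
{
    assert(recv_side != NULL && send_side != NULL);
    if (send_side->num_bufs != 0)
        return SK_ERR_INPLACE_BUFS;
    send_side->inplace_peer = recv_side;
    return SK_OK;
}
```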
currently, spec is ambiguous about when put_buffer returns to you. There
are at least 3 possibilities:
- when library has queued your request internally (but not yet started
or completed xfer)
- when library has started xfer
- when library has finished xfer
steve: wants contract to be that "the data will be delivered" and that
there is no need for the user to call put again on the same datapart. wants
it left to the put implementation to perform any or all of the above three
internal steps on behalf of the user.
put will have to at least queue the request internally (in software) in order
to fulfill the contract
jamie: is worried about deadlock that could occur with sequence of gets and puts. Depending on how put behaves, user might need to be careful to order puts/gets on the send and receive side of the reorg. e.g., a clique.
resolution: adopt steve's approach. To deal with "making progress" issues, we will issue guidance to implementors that if a progress engine implementation is not possible, then progress must be made by having the user call another DRI function. This applies to both clique scenarios (in which the recv->get could notice that the send xfer wasn't actually executed, and do it), and pipeline (where, if the put can't deliver data, then the receive side could orchestrate a pull). Advice to users will include that they shouldn't count on a progress engine implementation, so a DRI_library_call; while(1); DRI_library_call could result in deadlock
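The "progress is made inside other DRI calls" guidance can be sketched for the clique case as follows. This is a single-process model with hypothetical names; the transfer itself is reduced to moving a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical clique reorg in one process, with no background progress
 * engine. */
typedef struct {
    int pending;    /* puts queued but not yet transferred */
    int delivered;  /* buffers available to get on the receive side */
} sk_prog_reorg;

/* Called from inside library entry points; completes any queued transfers. */
static void sk_progress(sk_prog_reorg *r)
{
    while (r->pending > 0) {
        r->pending--;
        r->delivered++;
    }
}

/* Put only queues the request; the contract is that the data will be
 * delivered, not that it has been delivered when put returns. */
static void sk_prog_put(sk_prog_reorg *r)
{
    assert(r != NULL);
    r->pending++;
}

/* Get drives progress first, so a queued put is completed here. */
static bool sk_prog_get(sk_prog_reorg *r)
{
    assert(r != NULL);
    sk_progress(r);
    if (r->delivered == 0)
        return false;
    r->delivered--;
    return true;
}
```

In this model nothing moves between library calls, which is why a user who spins in their own loop after a put, without calling back into the library, cannot count on the transfer completing.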
jon <arrived shortly after discussion>: should we now add more user control as to the behavior of put? For example, allow the user to call put requesting it to return only after xfer has started? finished?
this reintroduces possibility of deadlock based on user's call order
steve: maybe better is a test function to test whether a previously put buffer has been transferred to the destination processes.
this would require a "transfer handle" returned by put. User would input the xfer handle to the test function.
jamie: thinks we should defer on this test functionality until later versions
of DRI spec. individual implementations can do their own thing for now so
we can get experience.
jon: suggests another option to use put call as a "trial" that would return an error status if it can't deliver data to the other side. user could respond to this condition by trying again, or by doing something else.
jamie: thinks that this is more valuable as an option to get
group: generally concurs
jon: suggests that we can design a test architecture that can test any of the 4 conditions (recv/put, recv/get, send/put, and send/get)
steve: any test functions should only test local knowledge, and not require communication to complete the test
ken: send-side put tests are not going to be effective; you've already gotten the buffer with a prior get call; you have no choice but to put that buffer back into the same reorg, which will force the xfer to occur (so a test is not useful in this case).
ken: recv-side get test is also problematic. You could get a positive test result, but then when you actually call get, the buffer may no longer be available
resolution: we'll do the following for get calls (both send and recv
sides):
recv-side: get (as written today),
recv-side: TRYget (tries once to see if a buffer is full, and returns
it to caller if so. If not, you'll have to try again).
Also, a Reorg test function that tells you whether there are NO full
buffers (send/get would have failed, recv/get would have blocked), SOME
full buffers (get would have succeeded), ALL full buffers (get would have
succeeded). In addition to a return code describing SOME/NONE/ALL, an information
block is filled in (a transparent structure) with information about the state
of the reorg. The two fields to be mandated for inclusion in this structure
are the number of buffers available for "get", and the total # of buffers
(so we don't have to go look it up elsewhere).
steve: suggests that information block returned can contain vendor-dependent
information
group: agrees with steve's suggestion
jon: the test is not specific to a "get", it is specific to a Reorg
ken: information block in Reorg test function should be optional
(user could supply null handle)
group: agrees
There is no use for an "iget" (immediate get). We don't feel the need to "queue" a receive request.
jamie suggests we replace NONE/SOME/ALL with just returning the # of available
buffers for subsequent get.
jon: suggests we combine a return status code with an information block output
parameter
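A sketch of the TRYget plus Reorg test combination, using a status code together with an optional information block. The names, the enum, and the struct layout are hypothetical; only the two mandated fields (buffers available for get, total buffers) come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for the Reorg test function. */
typedef enum { SK_NONE_FULL, SK_SOME_FULL, SK_ALL_FULL } sk_fill_state;

/* Transparent information block with the two mandated fields. */
typedef struct {
    int num_available;  /* buffers a get could return right now */
    int num_total;      /* total number of buffers in the reorg */
    /* vendor-dependent fields could follow here */
} sk_reorg_info;

/* Hypothetical local view of a reorg's buffers. */
typedef struct { int total; int full; } sk_q_reorg;

/* Uses local knowledge only. info is optional and may be NULL. */
static sk_fill_state sk_reorg_test(const sk_q_reorg *r, sk_reorg_info *info)
{
    assert(r != NULL);
    if (info != NULL) {
        info->num_available = r->full;
        info->num_total = r->total;
    }
    if (r->full == 0)
        return SK_NONE_FULL;
    return (r->full == r->total) ? SK_ALL_FULL : SK_SOME_FULL;
}

/* Tries once: returns 1 and consumes a full buffer, or 0 if none is full. */
static int sk_tryget(sk_q_reorg *r)
{
    assert(r != NULL);
    if (r->full == 0)
        return 0;
    r->full--;
    return 1;
}
```

Because the try and the get happen in one call, TRYget avoids the gap ken points out between a positive test result and a later get.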
group thinking about whether to now have a TRYput call
goal would be to do the put if all conditions are ripe for actually starting
the transfer
steve: problem is that if MPI is your underlying transport, you may not know
whether the DRI destination buffers are ready to be written. In MPI, you're
just going to send the data, and MPI middleware deals with buffer availability
and buffer copying issues
brian: since it's unclear, we should punt on this for now and defer to a later specification. users will chime in as they start to use the library as to whether this functionality is needed in later versions of DRI
group agrees to not specify a tryput function
Ken: here are some action items:
1. Boston area people will get together about every 3 weeks, 1/2 day per
meeting, and review the document together. Will address remaining clarity
& correctness issues. We will audio/video conference to remote sites
2. somebody has to deal with error code specification
myra: proposes that we focus on error codes and handling during the first
mtg. or 2 of doc review. then task somebody to rigorously handle codes
3. Formal voting planning
Rules: 1 vote per institution
institution has to qualify in order to be able to vote (must have attended
last 2 out of 3 general meetings)
list of institutions (if we were to vote today) is:
group: decides that we need to vote on the doc in its entirety instead of on individual functions
jon proposes tentative plan:
betw now & next gen meeting:
start doc scrubbing w/ meetings in Boston area
work on final details
next gen meeting: finalize connect/disconnect issues, any remaining small
API stuff
continue document scrubbing
2nd gen meeting following this: voting meeting on whole document
<looking @ summer timeframe to make this happen>
jon: publish result in hard copy form or just electronic?
group: we will publish the final specification in electronic form
Reason for removing DRI_connect from the library: library developer
considerations. Implementors would have to maintain an internal list/array
of all DRI_Reorgs, and this list would have to be referenced [and locked for
thread safety] at DRI_connect time. We also believe that the functionality
is equivalent to the user calling individual DRI_Reorg_connects.