DRI Forum meeting minutes, 01/31/2002


Attendance

Ken Cain, Mercury Computer Systems
Jon Greene, Mercury Computer Systems
Jamie Kenny, Mercury Computer Systems
Steve Paavola, Sky Computers
Myra Jean Prelle, Mercury Computer Systems
Brian Sroka, The MITRE Corporation
Chris Young, MPI Software Technology
 

Agenda

1. Define specific communication semantics, and use-case rules for 
DRI_Reorg connection/disconnection.
Per-Reorg granularity: DRI_Reorg_connect, DRI_Reorg_disconnect
Library-wide granularity: DRI_connect, DRI_disconnect


2. Discuss any remaining API changes that may be needed before 
finalization of the spec.
- this will not include any re-hashing of previous decisions
- only new, minor enhancements should be discussed
- this agenda item must end in a timely manner to give enough time to 
item #3 which is more important.
3. Work out the logistics for completing DRI 1.0 spec.
3a. Define overall schedule/tasks and be prepared to take action items.
3b. Determine date for next meeting
3c. Determine whether formal voting on API can take place at next meeting.

Legend

Decisions are denoted in bold and italic text
 
 

Agenda item #1: DRI_Reorg connect/disconnect use cases

Ken Cain prepared a proposal, located at http://www.data-re.org/Reorg_create_connect_destroy_proposal.html
 

Discussing the "glossary of terms" in Ken's proposal:

myra: uses the term non-blocking differently from the definition in the doc. non-blocking put for example, to her, means that the put "fails" because it couldn't perform the action, and gives a return status to the caller to indicate this. The user then can "try again" by calling put again.

looking at ken's definition of synchronous, we decided that synchronous means barrier, and that operations such as the side A connect / side B create completion is really just "collective". (see Steve's email on the reflector where he proposed this)

one thing that needs to be clarified is how inconsistent creates (e.g., non-matching distribution parameters) on the other side are detected, and whether we should _not_ detect such errors because it would unnecessarily increase the complexity of the collective communication (the Connect on side A talking to Creates on side B). For example, to forgo error detection, you could talk to only one representative process on side B to make sure that it has completed DRI_Reorg_create, and use its information to determine the properties of the "other side".

steve: suggests that maybe we could have a development mode vs. production mode type of approach (but not necessarily exactly that) to help users know exactly when certain errors will be caught (or not)

Decisions on glossary of terms component of proposal:

Discussing ken's proposal for use case #1:

reorg_create: advertises parameters
reorg_connect: retrieves create information provided by other side's create

jon: when should buffers be allocated? thinks it should be at reorg create-time
steve: thinks buffer allocation should be at reorg connect-time
jamie: concurs with jon because connect-time allocation requires a barrier synchronous connect call (to allow processes to exchange buffer addresses)
ken: there is a tradeoff; creating buffers at create-time avoids a slower barrier-style connect. But creating them at connect-time has some benefits for mode-change support (create reorgs, only connect the appropriate ones; and by extension only create the buffer resources for the initially active reorgs).

steve: potential problem with cliques is that the first connect call will block; the 2nd call (corresponding to the other side) doesn't even get a chance to run.
jamie: at first reorg_connect call, it will be known that the reorg is a clique, and could just return. The second connect call would do all the work.
ken: yes, this is consistent with some previous discussions we have had on this topic
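
A minimal C sketch of jamie's clique suggestion follows. It is illustrative only: the CliqueReorg type, its fields, and clique_reorg_connect are hypothetical names, not part of the DRI API, and the sketch assumes both connect calls in a clique can see one shared piece of state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state shared by both sides of a reorg. */
typedef struct {
    bool is_clique;      /* same processes act as both send and recv side */
    int  connects_seen;  /* number of connect calls made so far */
    bool connected;      /* true once the connection work has been done */
} CliqueReorg;

/* Returns 0 on success.
 * Clique: the first call only records itself and returns, so the caller
 * can go on to make the second call; the second call does all the work.
 * Non-clique: a real implementation would wait here for the other side's
 * create to complete; the sketch simply marks the reorg connected. */
int clique_reorg_connect(CliqueReorg *r)
{
    assert(r != NULL);
    r->connects_seen++;
    if (r->is_clique && r->connects_seen < 2)
        return 0;  /* first call in a clique: defer, do not block */
    r->connected = true;
    return 0;
}
```

In this sketch a clique reorg is not usable after the first connect call; it becomes connected only when the second call returns.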

steve: if we can get the reorg_connect to work, then library-wide DRI_connect would seem to be just a convenience function (one that loops over all the DRI_Reorg_connects)
 

steve: what happens if you call DRI_Reorg_disconnect?
ken: esp. need to consider what should happen if buffers are checked out by caller
jamie: require user to have put all buffers back into Reorg before calling disconnect
steve: on send side, how do we flush?
could be the same issue, make sure all puts have been done, and all buffers are checked in before disconnecting

steve: if alloc occurs at connect-time, how would the information actually get exchanged?
create advertises distribution/procgroup information
connect mallocs and advertises the allocation information, collects distribution/procgroup info from earlier create call
first invocation of get/put collects the buffer address information
steve: wants to avoid extra work in first get/put, wants that to be very much an inner loop function

ken: one way to get reuse of memory disjoint in time by different reorgs assoc. with different processing modes is to have disconnect NOT free the memory. Next Reorg_connect (assoc with different mode) could examine the last used memory area and see if it is big enough to satisfy this new reorg.
jamie: potential complications with this approach if you are executing multiple application threads
jon: we can accomplish this alternatively by having user-allocated memory input to DRI_Reorg_create (via some sort of handle). User could size and allocate memory before any modes start, then DRI_Reorg_connect would bind the Reorg with its memory (as supplied to Reorg_create). Reorg_disconnect could disassociate the Reorg with that memory, marking it as available for re-use disjoint in time with another Reorg
steve: what if we did supply a bufferset arg to Reorg_create? Thinks that we'll need a bufferset "finalize" to solidify the size of the bufferset before doing any DRI_Reorg_connect calls. This finalize call would publish the addresses to all processes in the DRI network (myra: alternatively, it is just known that the data is available from the appropriate sources). Then, the individual DRI_Reorg_connect calls could retrieve the information (DRI_Reorg_connect is collective with completion of DRI_Bufferset_"post" on other side of reorg).
ken: bufferset based specification for disjoint in time sharing among reorgs is really awkward (because you can't even use buffersets to express sharing of memory within the same mode -- user will then have to create separate buffersets within a single mode -- makes disjoint in time sharing tedious at best).

ken: ok, so it seems that disjoint in time sharing is difficult with the individual Reorg connect/disconnect model (without having simultaneous in time sharing). This means that disconnect must free memory, and the next Reorg connect must malloc memory. Is it even worth it to pursue this approach if mode switches are going to take so much time?
group: it is not worth it until we can resolve the larger sharing issues in a later version of DRI spec
 

discussing whether to have library-wide DRI_connect or individual DRI_Reorg_connect approach.

group: decides to remove DRI_Network from the spec (because given either of above approaches, it isn't clear what role Network is providing. It _may_ be a placeholder in which to collect information about all of the DRI_Reorgs created by the calling process, but it doesn't seem necessary).
 

ken: if you use individual reorg_connects, do you have to do all connects before any get/put calls?
jamie: within the same thread of control, yes
myra/jamie: in PAS, connect call is optional and allows deferring to first put/get

steve: wants to force user for a reorg R to call connect before calling get/put (i.e., don't allow deferring of connect process to first execution of put/get)
group: agrees
 

ok, now what about disconnecting a reorg?

ken: halt or abort like situation when you want to quickly mode-change
jon: what about ensuring all pending communications have gone through all the reorgs in a chain

steve: to address completing all pending communications, first reorg in the chain could send a 32-bit control word that gets sent down the chain.

jamie: we need to send the control potentially by itself (not necessarily with a "full" buffer  of data)

jamie: we could create a special control Datapart object that could be sent in place of a regular Datapart object

steve: have a putcontrol function that puts the 32-bit value, and the get function on the other side gives an error return code indicating that the received data contains this control value

DRI_Reorg_"notify" semantics:
 - require user to send the same value from all source processes
 - don't require implementation to actually check that received values from all sources are same
 - pass by value DRI_Reorg_notify(Reorg, value)
 - jamie: value should take up a "buffer" (or dpo) slot internally in Reorg. That way ordering can be preserved between sending values and sending real buffers
 - Reorg_get_datapart returns DRI_GOT_VALUE (instead of VALUE, use whatever you select [ken] for the name of the put function)
 - steve proposes we just use the existing void **ptr to point to the value (ken agrees because we're treating the value as just another buffer within the data reorg, so the pointer points to the value itself)
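
The notify semantics listed above can be sketched in C roughly as follows. This is a toy model under stated assumptions, not the DRI API: the NotifyReorg type, the nr_* functions, the NR_* codes, and the fixed slot count are all hypothetical. It shows only that a control value occupies a slot in the same queue as regular buffers (so ordering is preserved) and that get hands the value back through the existing void **ptr with a distinct return code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NR_OK = 0, NR_GOT_VALUE = 1, NR_EMPTY = -1, NR_FULL = -2 };
#define NR_SLOTS 4  /* arbitrary queue depth for the sketch */

typedef struct {
    int      is_value;  /* nonzero: slot carries a control value, not data */
    uint32_t value;     /* the 32-bit control word */
    void    *data;      /* regular buffer when is_value == 0 */
} NrSlot;

typedef struct {
    NrSlot slots[NR_SLOTS];
    size_t head;   /* index of the oldest queued slot */
    size_t count;  /* number of queued slots */
} NotifyReorg;

/* Append one slot to the FIFO shared by data buffers and control values. */
static int nr_push(NotifyReorg *r, NrSlot s)
{
    assert(r != NULL);
    if (r->count == NR_SLOTS)
        return NR_FULL;
    r->slots[(r->head + r->count) % NR_SLOTS] = s;
    r->count++;
    return NR_OK;
}

/* Queue a regular data buffer. */
int nr_put(NotifyReorg *r, void *buf)
{
    NrSlot s = { 0, 0, buf };
    return nr_push(r, s);
}

/* Queue a control value, passed by value; it takes up a slot like a buffer. */
int nr_notify(NotifyReorg *r, uint32_t value)
{
    NrSlot s = { 1, value, NULL };
    return nr_push(r, s);
}

/* Dequeue the oldest slot. For a control value, *ptr points at the value
 * itself (valid until the slot is reused) and NR_GOT_VALUE is returned. */
int nr_get(NotifyReorg *r, void **ptr)
{
    assert(r != NULL && ptr != NULL);
    if (r->count == 0)
        return NR_EMPTY;
    NrSlot *s = &r->slots[r->head];
    r->head = (r->head + 1) % NR_SLOTS;
    r->count--;
    if (s->is_value) {
        *ptr = &s->value;
        return NR_GOT_VALUE;
    }
    *ptr = s->data;
    return NR_OK;
}
```

Because the value and the buffers share one FIFO, a value queued after a buffer is always received after that buffer.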
 

RESOLUTIONS:

- we are removing the DRI_Network handle from the specification
- all memory is created at reorg create-time
- reorg process inplace called with 2 reorg objects to indicate in-place processing relationship
   < here, the "send side" reorg must have been created with num_bufs=0, else Reorg_process_inplace returns an error>
- we have individual reorg_connects, instead of library-wide DRI_connect
- an individual reorg must have its connect called before get/put can be called on it
- we will add a new in-band control mechanism for Reorgs that sends a 32-bit value
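
Two of the resolutions above lend themselves to a small C sketch: the connect-before-get/put rule and the num_bufs=0 requirement for in-place processing. Everything here is hypothetical (the RuleReorg type, the rule_* functions, the RULE_* codes, and the argument order of the in-place call); it only illustrates where the error returns would come from.

```c
#include <assert.h>
#include <stddef.h>

enum { RULE_OK = 0, RULE_ERR_NOT_CONNECTED = -1, RULE_ERR_INPLACE = -2 };

typedef struct {
    int connected;  /* set once connect has been called on this reorg */
    int num_bufs;   /* buffers allocated at create-time */
} RuleReorg;

/* All memory is sized at create-time; the reorg starts out unconnected. */
RuleReorg rule_reorg_create(int num_bufs)
{
    RuleReorg r = { 0, num_bufs };
    return r;
}

int rule_reorg_connect(RuleReorg *r)
{
    assert(r != NULL);
    r->connected = 1;
    return RULE_OK;
}

/* get (and likewise put) refuses to run until connect has been called;
 * connection is never deferred to the first get/put. */
int rule_reorg_get(const RuleReorg *r)
{
    assert(r != NULL);
    return r->connected ? RULE_OK : RULE_ERR_NOT_CONNECTED;
}

/* In-place processing takes two reorg objects; the send-side reorg must
 * have been created with num_bufs == 0, otherwise an error is returned. */
int rule_process_inplace(const RuleReorg *recv_side, const RuleReorg *send_side)
{
    assert(recv_side != NULL && send_side != NULL);
    if (send_side->num_bufs != 0)
        return RULE_ERR_INPLACE;
    return RULE_OK;
}
```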
 
 

Agenda item #2: any remaining API changes needed?

Being more clear about DRI_Reorg_put_buffer's communication semantics

currently, spec is ambiguous about when put_buffer returns to you. There are at least 3 possibilities:
 - when library has queued your request internally (but not yet started or completed xfer)
 - when library has started xfer
 - when library has finished xfer

steve: wants contract to be that "the data will be delivered" and that there is no need for the user to call put again on the same datapart. wants it left to the put implementation to perform any or all of the above three internal steps on behalf of the user.
put will have to at least queue the request internally (in software) in order to fulfill the contract

jamie: is worried about deadlock that could occur with sequence of gets and puts. Depending on how put behaves, user might need to be careful to order puts/gets on the send and receive side of the reorg. e.g., a clique.

resolution: adopt steve's approach. To deal with "making progress" issues, we will issue guidance to implementors that if a progress engine implementation is not possible, then progress must be made by having the user call another DRI function. This applies to both clique scenarios (in which the recv->get could notice that the send xfer wasn't actually executed, and do it), and pipeline (where, if the put can't deliver data, then the receive side could orchestrate a pull). Advice to users will include that they shouldn't count on a progress engine implementation, so a DRI_library_call; while(1); DRI_library_call could result in deadlock
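
The "progress is made by calling another DRI function" guidance can be sketched in C as below, for the clique case. This is a single-process toy with hypothetical names (ProgReorg, prog_put, prog_get); it assumes room for one queued request and one delivered buffer, whereas a real implementation would need enough internal queueing to honor the "data will be delivered" contract.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const int *pending;        /* queued by put, not yet transferred */
    const int *delivered;      /* transferred, waiting for get */
    int        transfers_done; /* how many transfers have been executed */
} ProgReorg;

/* No background progress engine: every library entry point calls this
 * to execute any transfer that an earlier put left queued. */
static void prog_make_progress(ProgReorg *r)
{
    if (r->pending != NULL && r->delivered == NULL) {
        r->delivered = r->pending;
        r->pending = NULL;
        r->transfers_done++;
    }
}

/* put only queues the request here; the transfer happens on a later call.
 * Returns 0 if accepted, -1 if the single queue slot is still occupied
 * (a limitation of the sketch, not of the proposed contract). */
int prog_put(ProgReorg *r, const int *buf)
{
    assert(r != NULL);
    prog_make_progress(r);
    if (r->pending != NULL)
        return -1;
    r->pending = buf;
    return 0;
}

/* get first drives progress, so it notices a send that was queued but
 * never executed and performs it. Returns NULL if nothing is available. */
const int *prog_get(ProgReorg *r)
{
    assert(r != NULL);
    prog_make_progress(r);
    const int *out = r->delivered;
    r->delivered = NULL;
    return out;
}
```

If the caller spun in its own loop between prog_put and prog_get instead of calling back into the library, the queued transfer would never run, which is the deadlock the advice to users warns about.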

jon <arrived shortly after discussion>: should we now add more user control as to the behavior of put? For example, allow the user to call put requesting it to return only after xfer has started? finished?

this reintroduces possibility of deadlock based on user's call order

steve: maybe better is a test function to test whether a previously put buffer has been transferred to the destination processes.

this would require a "transfer handle" returned by put. User would input the xfer handle to the test function.

jamie: thinks we should defer on this test functionality until later versions of DRI spec. individual implementations can do their own thing for now so we can get experience.
 

jon: suggests another option to use put call as a "trial" that would return an error status if it can't deliver data to the other side. user could respond to this condition by trying again, or by doing something else.

jamie: thinks that this is more valuable as an option to get
group: generally concurs

jon: suggests that we can design a test architecture that can test any of the 4 conditions (recv/put recv/get send/put and send/get)

steve: any test functions should only test local knowledge, and not require communication to complete the test

ken: send-side put tests are not going to be effective ; you've already gotten the buffer with a prior get call; you have no choice but to put that buffer back into the same reorg, which will force the xfer to occur (so a test is not useful in this case).

ken: recv-side get test is also problematic.  You could get a positive test result, but then when you actually call get, the buffer may no longer be available

resolution: we'll do the following for get calls (both send and recv sides):
recv-side:  get (as written today),
recv-side: TRYget (tries once to see if a buffer is full, and returns it to caller if so. If not, you'll have to try again).

Also, a Reorg test function that tells you whether there are NO full buffers (send/get would have failed, recv/get would have blocked), SOME  full buffers (get would have succeeded), ALL full buffers (get would have succeeded). In addition to a return code describing SOME/NONE/ALL, an information block is filled in (a transparent structure) with information about the state of the reorg. The two fields to be mandated for inclusion in this structure are the number of buffers available for "get", and the total # of buffers (so we don't have to go look it up elsewhere).
steve: suggests that information block returned can contain vendor-dependent information
group: agrees with steve's suggestion

jon: the test is not specific to a "get", it is specific to a Reorg

ken: information block in Reorg test function should be optional (user could supply null handle) 
group: agrees

There is no use for an "iget" (immediate get). We don't feel the need to "queue" a receive request.

jamie suggests we replace NONE/SOME/ALL with just returning the # of available buffers for subsequent get.
jon: suggests we combine a return status code with an information block output parameter
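
One possible shape for the Reorg test function and TRYget discussed above, as a C sketch. All names (TestReorg, TestInfo, t_reorg_test, t_reorg_tryget, the T_* codes) are hypothetical, and a counter stands in for real buffers. The sketch follows jon's suggestion of a return status code combined with an information block, with the block optional as ken proposed, and uses only local state as steve asked.

```c
#include <assert.h>
#include <stddef.h>

enum { T_NONE = 0, T_SOME = 1, T_ALL = 2 };

/* Transparent information block; the two mandated fields come first.
 * Vendor-dependent fields could be added after them. */
typedef struct {
    int num_available;  /* buffers a subsequent get could return */
    int num_total;      /* total number of buffers in the reorg */
} TestInfo;

typedef struct {
    int full;   /* number of full buffers right now */
    int total;  /* total number of buffers */
} TestReorg;

/* Reports NONE/SOME/ALL from local knowledge only; fills in the
 * information block when the caller supplies one (info may be NULL). */
int t_reorg_test(const TestReorg *r, TestInfo *info)
{
    assert(r != NULL);
    if (info != NULL) {
        info->num_available = r->full;
        info->num_total = r->total;
    }
    if (r->full == 0)
        return T_NONE;
    return (r->full == r->total) ? T_ALL : T_SOME;
}

/* Tries once: returns 1 and consumes a full buffer if there is one,
 * otherwise returns 0 without blocking so the caller can try again. */
int t_reorg_tryget(TestReorg *r)
{
    assert(r != NULL);
    if (r->full == 0)
        return 0;
    r->full--;
    return 1;
}
```

Unlike a test followed by a separate get, the try call checks and takes the buffer in one step, which sidesteps ken's concern that a buffer reported as available may be gone by the time get is called.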

group thinking about whether to now have  a TRYput call
goal would be to do the put if all conditions are ripe for actually starting the transfer
steve: problem is that if MPI is your underlying transport, you may not know whether the DRI destination buffers are ready to be written. In MPI, you're just going to send the data, and MPI middleware deals with buffer availability and buffer copying issues

brian: since it's unclear, we should punt on this for now, defer to later specification. users will chime in as they start to use the library as to whether this functionality is needed in later versions of DRI

group agrees to not specify a tryput function
 
 
 

Agenda item #3: logistics to complete DRI 1.0


Ken: here are some action items:
1. Boston area people will get together about every 3 weeks, 1/2 day per meeting, and review the document together. Will address remaining clarity & correctness issues. We will audio/video conference to remote sites

2. somebody has to deal with error code specification

myra: proposes that we focus on error codes and handling during the first mtg. or 2 of doc review. then task somebody to rigorously handle codes
 

3. Formal voting planning

Rules: 1 vote per institution
institution has to qualify in order to be able to vote (must have attended last 2 out of 3 general meetings)

list of institutions (if we were to vote today) is:

steve: suggests we vote yes, no, or no with list of reasons

group: decides that we need to vote on the  doc in its entirety instead of on individual functions

jon proposes tentative plan:

betw now & next gen meeting:
start doc scrubbing w/ meetings in Boston area
work  on final details

next gen meeting: finalize connect/disconnect issues, any remaining small API stuff
continue document scrubbing

2nd gen meeting following this: voting meeting on whole document
<looking @ summer timeframe to make this happen>

jon: publish result in hard copy form or just electronic?
group: we will publish the final specification in electronic form
 
 

Miscellaneous


The reason for removing DRI_connect from the library is library-developer considerations: implementors would have to maintain an internal list/array of all DRI_Reorgs, and this list would have to be referenced [and locked for thread safety] at DRI_connect time. We also believe that the functionality is equivalent to the user calling individual DRI_Reorg_connects.
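
The claimed equivalence can be sketched from the user's side in C. The names here (LoopReorg, loop_reorg_connect, loop_connect_all) are hypothetical stand-ins rather than DRI functions; the point is that the user already holds the reorg handles, so the library needs no internal, lock-protected list of them.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int connected;  /* set once this reorg has been connected */
} LoopReorg;

/* Stand-in for a per-reorg connect call; returns 0 on success. */
int loop_reorg_connect(LoopReorg *r)
{
    assert(r != NULL);
    r->connected = 1;
    return 0;
}

/* User-side replacement for a library-wide connect: loop over the
 * reorgs the application created and connect each one in turn.
 * Stops at the first failure and returns that error code. */
int loop_connect_all(LoopReorg **reorgs, size_t n)
{
    assert(reorgs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int rc = loop_reorg_connect(reorgs[i]);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```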