DARPA Data Re-organization Interface Effort

San Diego, February 2, 1999

Attendance:

Darwin Ammala, MSTI dammala@mpi-softtech.com

Clayborne Taylor Jr., MSTI cdtaylor@mpi-softtech.com

Steve Paavola, Sky paavola@sky.com

Shane Hebert, MSU shane@erc.msstate.edu

Ken Cain, MITRE kcain@mitre.org

Arkady Kanevsky, MITRE arkady@mitre.org

Dennis Cottel, SPAWAR (host) dennis@spawar.navy.mil

Nathan Doss, LM/GES nathan.e.doss@lmco.com

James Lebak, MIT/LL jlebak@ll.mit.edu

Jon Greene, Mercury greene@mc.com

Richard Massary, NAWCAD massaryrj@navair.navy.mil

Robert Grim, NAWCAD rob@sai19.nawcad.navy.mil

Randall Judd, SSC-50 judd@spawar.navy.mil

The goal for the meeting is to finish the discussion on API pre-requisite ideas, and begin to develop function calls.

Data Re-organization Steps

Data à Logical Arrangement à Physical Arrangement à Processing Resource

(These last 3 steps must be scalable)

à Decompose data into a logical arrangement

à Derive a physical arrangement from the logical

à Map the physical arrangement onto the processing resource.

Figure 1: Basic steps in data decomposition

 

 

 

Datatypes: Data representation

For datatypes, we could adopt MPI/RT's versions: short, int, long, float, etc., the implementation will be free to extend them.

We don't want to specify a size for each type, as this would not be uniform for all architectures.

For global data re-organizations, do we want a group or communicator? - describe a collective entity for all participants?

Partitioning:

Partitioning objects should describe what is needed for the computation; e.g., I need a contiguous row.

We must account for 2 consumers who want different parts of the data.

In a process set, we should agree on the global data object, and how it is to be partitioned.

A strategy for dividing data needs to be Euclidean, slices must be parallel to a dimension.

No striding (every nth element).

Jon's PART_XY discusses slicing along dimensions, you ‘can’ do so, vs. ‘must’

Use of 4-tuples, and reasonable default partitioning.

We could allow both, specific or a reasonable default. A user can specify, or we can pick the default for them.

Default, and specific API.

There could be multiple APIs for partitioning, e.g., part_x_no_overlap.

Support a sizeof(record). A record is arbitrary in type and size.

It may have memory holes, e.g. 8 byte aligned and a struct of size 6 or 7.

.

Local Processing

We need n (from D) 4-tuples for the partition object.

On the sending side of D, we have pieces coming from somewhere, .."this is what I am providing (…)"

There is a notion of ownership. On sending side, overlap may not make sense. On input, the data set

is partitioned disjointedly (otherwise conflicts on writes would arise).

On the receiving side, not every partition might be removed, this could be ok.

(index of start, index of end, …) ?

Figure 2: Local Processing

Take D, everyone owns a piece.

How to Get from the Partition to Layout Object

 

 

Figure 3: Global View – Use of DRI

 

Partitioning Parameters

PART_XY

  1. Accept the default partitioning
  2. Set some constraints
  3. Fully specify the partitioning

Figure 4: Partition in 2 dimensions – options

In partition specification:

Part_xy is how to slice the data in D. It can also mean, what part is independent.

Partition class, describes a min, max for a dimension, and which dimension is to be contiguous.

Additionally, the left and right offsets are specified. We have a constraint portion, and a requirement portion as stages. This would allow a failure to be detected earlier.

Min and max are constraints, and overlap is the requirement.

For each dimension, a min and max number of elements can be specified. To say exactly how many, set max and min equal. E.g., if we want 50 elements set both min and max to 50.

 L: Overlap

Min

Max

R: Overlap

optimizations

 

Figure 5: Overlap Parameters – See also in Figure 2

 

Partitioning should support heterogeneity since we might have specialized Resource Constrained processors which might have a filter in firmware, and a 2nd in software.

 

Figure 6: Two stage FFT : One FFT could be in firmware, the Data, partition D, and process set object is transformed to a transfer object. The distribution object must have D, partition in, process set in, partition out, process set out. The transformation object is made of 2 distribution objects. Assumption, D is constant through a transform

Building of the Transfer and Layout Objects

The transfer object is made by collective call to all processes, this object has a handle, which is used in transfer operations.

It is also possible to give named objects. Knowing all distribution aspects, the sender and receiver could rendezvous at some point and transfer the data.

Figure 7: Collective nature of building Transfer and Layout Objects

 

A single process only needs to know its role (sender), a distribution is defined, we say we are the source, we can create a name to the transfer.

The receiver does the same thing. They rendezvous. Thus, a producer doesn't need to specify the source and destination distributions. A process is never both producer and consumer in the same step. It can be done in one operation a producer/consumer needn't do both steps in place. There can a buffer for each role.

In a clique corner turn all are both producer and consumer, D is the same. Figure 7 shows the creation of the objects from the input side, output works analogously.

Commit Operation

Figure 8: Local and Global aspects of Commit operation

The user must create buffers in the commit stage of an operation. D and partition are local. The process set has not been addressed. The distribute is scoped to a process set. After the collective call, all checks have been made to stay within the definitions of D Part, Process Set. The Distribution object creation is synchronizing, we must avoid deadlock.

Named transform stage, we must be collective across both ends of the transfer. With a name given to the transfer object, more than 1 destination can synchronize. However, we are postponing error detection for the commit operation.

Outside of the transform call, we may not know if the operation will succeed, or be legal since the transform object is local.

 

 

 

 

 

 

 

 

 

Named Transfers

 

The name represents a communication; at commit time the name and resources can be known,

P1 is the source of transform zebra. Processes p2, p3 consume it, and p2/3 may not know of each other.

What if p3 requests call before p1 is ready? We have blocking and non-blocking calls.

At the end of the commit, we get objects for all handles that were set up in earlier stages.

What about 1 sided push, pull, or 2 sided?

Push - p1 does a do data goes to both p2,3

Pull from p3 implies p2 gets the data too, if only 1 'zebra' name is used for transfer.

During transfer, p1 has blocking calls (collective d) thus no push/pull.

In tight resource situations, we can either write a smaller D, to deal with less buffer space, or back off as an error.

In a large operation, such as a 17k x 96k reduction and corner turn in parallel. The corner turn and computation are overlapped. You could set smaller than 17k pieces. The API won't implicitly support this type of operation. An Option would be to double buffer the input or turned object.

Question: how static is the buffer assignment? We would not want to have to rebuild everything. We also don't want to re-invent all of RT's buffer constructs - pools etc.

The Nature and Intention of the DRI API

Assumption: API will change, as it is implemented with different libraries. E.g., MPI, and RT, and PAS will all look different. DRI could also exist on its own. The goal is to be conceptually portable vs. literally. DIR should be viewed as a concept unifier versus and end unto itself.

 

 

Figure 10: Where does DRI fit in?

 

In early binding: we allocate buffers, and link the buffer into distribution object. The buffer could be an array. No formal queuing implied, simply cycles through data.

Buffers are allocated locally, in the global case it is allocated at commit.

The application owns input/output buffers, also inter-process buffers at the interfaces would have other buffers. - used in Go call. Call buffer is in the layout, which the user gives to the interface. Local buffers are bound early, and at runtime the go call buffers are allocated It marries go, pack/unpack.

Within the transfer, each side must issue the go call.

Good, fast memory is given early to the interface before runtime. Size is provided by the user. The memory is owned by the application. The transfer for local to global forces a copy. DRI must figure optimal way to move data – lest we have a commit cart and horse problem. You won’t know until commit time on the receive side how big the buffers are. The distribution side does know their size. If no transpose, you may not need local buffers but you won't know this until commit time. We would like to allocate memory early, but we don't know how much to allocate that early

.

The forced copy will hurt performance. Is the local implementation 'smart' enough to avoid local copy. You could look at the address and test. You are likely to have to do the copy anyway.

Can we use double buffers on both sides to avoid the copy? The buffers would be associated with the distribution object. Both allocated by user. One buffer is temporary, while the other is active. It’s implied that the active buffer is sent by the go call. This would imply that the same address would be used for the transfer - hence early binding. Sometimes we need more than two buffers. You may not know how many buffers will be needed until commit time. In clique operations you'd have input, output and temp (3) buffers. Might be able to use the temp buffer for other actions so this may not be as wasteful.

Addressing the Commit cart and horse: We know who is interconnected - we can make the 'netlist'. This whole netlist issue can be resolved with respect to the buffer space allocation since all of the important information could be known early. Commit may still have to be at the end. Commit would set up DMA engines.

Dennis' proposal: Before you make calls to create the transform object, make a register for named connection calls e.g., Make (use zebra). All those must complete before transform_create is done. Real buffers must be known by both ends in order to avoid copies.

Distribute call: you get the size of buffer, you bind to object. At commit time all is known. You then decide you need a big temp buffer. Internally tries to create. User might provide specialized knowledge on which transfers are not overlapped.

Ken: data endpoint buffers, must be created internally. It must be acceptable for user to do a malloc etc. we need this to be portable.

Layout object simplest, packed xyz orientation.

Summary:

6 Objects:

1. D: data shape object

2. Partition object, of 4-tuples per dimension. We haven't resolved this fully

We don't yet know details they are euclidian, not sparse. We will have guidance on partitioning.

3. Process set object an array of handles for processes in the communication.

4. These three are fed into a Distribution object creation call (collective over the process set). Error checking will be done here. Named vs. ordered. Named implies asynchronous. It must be a synchronized call. Even if named it must be blocking - hence decision is to have it be ordered over the process set. We know our part of D after this, our 4-tuples are then also known.

5. A layout object is a packed permutation of dimensions.

6.. A Transfer object. This can be asynchronous across all process sets. But it should be synchronous in local group..

Within a process set, all data belongs to a process set, no data belongs to more than one process in the set.

Do we want to push burden of partitioning up to the user to ensure scalability. They could write a program to do the mapping. Partition should be precisely defined on the source side specify such as to not imply overlap. Communication steps are the proper place to handle pack/unpack You can chose not to send overlapped data, but then you need to jump over it.

Convenience functions create D' as function of D. How do we enter to API. Also do convenience functions from :P to P'

Ned to agree on low level partition attributes to support them directly.

The go call

We now can write down a set of functions, elaborate on objects, the go and Do calls, synchronization, moving of buffers, etc.

GO: is it blocking and synchronous. User declares buffers range cycle through buffers double, triple just cycle through them - then they can be used by a receiver who sends etc.

Steve: we may not know which piece of global data we are generating at a given time. We are one of the processes in the set,

The interface is static, a change implies a new transfer object. We might want to consider an offset change.

A future Go could have extra arguments to handle the more difficult cases as we approach them.

If Go returned a structure with descriptors and status, you could learn of partitioning info

Question of buffer usage. We may not want Go to determine this.

We should write up a simple case of this.

 

API in brief:

Several families of functions will be defined in the DRI_API.

Create family – does the first portion of the committal of resources

Query family – enquiry of status

Destroy family – return resources to the system

Init family – sets up initial state

The Create family is outlined here, specifications for the others will follow.

The format for the specification will be language independent.

 

DRI_gdo_create (n-dimensions, DRI_datatype, sizes[], &gdo_h)

// Create a generalized data object.

IN n-dimensions is number of dimensions,

IN datatype is fixed for sizes array of sizes,

IN sizes is the array of sizes

INOUT gdo_h gdo object handle.

 

DRI_part_create (n-dimensions, DRI_part_tuples[], &part_h)

// Create a partition object

IN n-dimensions is the number of dimensions

IN DRI_part_tuples is the array of partition 4-tuples, one tuple for each dimension

INOUT part_h is the partition object handle.

Tuples[] === left_v, left_x, right_x, right_v === _v is overlap, _x is index

 

DRI_group_create (n-processes, procs[], &group_h)

// Create a process set (group)

IN n-processes is number of processes in the group,

IN procs is the array of processes

INOUT group_h is a group object handle

 

DRI_dist_create (gdo_h, part_h, group_h, &dist_h)

// Create a distribution object

IN gdo_h is the gdo object handle

IN part_h is the partition object handle

INOUT dist_h is the distribution object handle

This call is collective

Creates a distribution object from gdo, partition, and group

 

DRI_layout_create (n-dimensions, strides[], &layout_h)

// Create a layout object

IN n-dimensions is dimensions

IN strides is the array of strides for each dimension,

INOUT layout_h is the layout object handle

creates enumerated type of permutations of dimension combinations e.g., 3 d has 6 combinationss . orientations and strides (strides, layout type, order (array of dimensions, order,

 

DRI_xfer_create (xfer_name, dist_h, layout_h, n-buffs, buffs[], &xfer_h)

// Create a transfer object

IN xfer_name is a text symbolic name for the transfer

IN Dist_h is the distribution object handle,

IN Layout_h is the layout object handle

IN n-buffs is the number of buffers to allocate for the transfer

INOUT buffs[] is the array of buffers for the transfer

INOUT xfer_h is the transfer object handle.

 

DRI_do (xfer_name, xfer_h)

// the do call

IN xfer_name is the name of a transfer to execute

IN xfer_h is the transfer object handle.