DARPA Data Reorganization Effort

Meeting Minutes

MIT/Lincoln Laboratory, Lexington, MA

September 25, 1998

ATTENDEES:

James Lebak, MIT/LL (host) jlebak@ll.mit.edu

Dennis Cottel, SPAWAR dennis@spawar.navy.mil

Randall Judd, SPAWAR judd@spawar.navy.mil

Arkady Kanevsky, MITRE arkady@mitre.org

Nathan Doss, Lockheed-Martin, GES nathan.e.doss@lmco.com

Jonathan Greene, Mercury greene@mc.com

Sharon Sacco, CSPI ssacco@cspi.com

Rick Pancoast, Lockheed-Martin, GES rick.pancoast@lmco.com

Ken Cain, MITRE kcain@mitre.org

Steve Paavola, SKY paavola@sky.com

Richard Games, MITRE rg@mitre.org

Joe Fogler, AHPCC/KRI fogler@orca.unm.edu

Anna Rounbehler, Raytheon anna_c_rounbehler@res.ray.com

Clayborne Taylor Jr., MPI Software Technology, Inc. cdtaylor@mpi-softtech.com

Darwin E. Ammala, MPI Software Technology, Inc. dammala@mpi-softtech.com

Brief Summary

The prominent activity was a discussion of the intention of this group to produce advice for the High Performance computing community (in the form of a recommended Data Reorganization API framework, (a meta-API), or whether the best contribution would be a true API. This being the close of the first year’s work, the advice to the ‘ HPC community’ is due. Upon promulgating the advice, it would be reasonable to proceed with developing the API. An initial phase would be the creation of the capability to handle two dimensional corner turns.

TOPICS Discussed

Meta vs. True API

The question of whether a real API is required was discussed.. Within HPEC, the CFD community using Cray’s have generally non, or few, real time constraints on their computations. However, the defense and medical communities, for example, must deploy and live under time and space constraints. They need higher utilization, and thus standard and optimal algorithms for their data reorganization requirements.

Richard Games mentioned that an essential capability is that of a two-dimensional corner turn. Almost everyone needs it, and in examining it, the salient features and points of the API would be revealed and opened for understanding.

The MPI/RT Forum intends to incorporate the Data Re-org API into its specification in version 1.1 (version 1.0 is now in the public comment phase) "RT" needed early binding, and a real API which is compatible with early bindings.

Overlap

Overlap is important and must be included, but we must accurately and
exhaustively describe what overlap means in all cases.

We need to keep overlap. PAS centers non-overlapped data in the same location within the buffer.

What is an overlap of 1? A Processor could in effect share 2 planes from its 2 adjacent neighbors.

We need to be specific

Overlap persistence problem. Apply a sequence of transforms, more pixels get contaminated as you progress. You might need to do more than the necessary amount of overlap up front – in the initial phases of the transforms. Otherwise over time, "garbage" will accumulate. Gather only relevant parts, "data growth" over the sequence of transforms must be watched carefully.

Data Growth: occurs, as the output object is different from input object.

To support overlap, you might want to redistribute over a subset – defined a priori, - can be static, and support redistribution of a part of data.

The thrust of this effort has trained focus on dense matrix problems, however problems such as target tracking, needed by the HPEC community, do not involve dense data.

Dynamic splitting of data requires late binding behavior, which is counter productive to the goal of an early binding behavior for this API.

Does overlapping of partitions impact the process space topology? E.g., A 3-D set vs. a linear one might have some meaning or performance gain.

In some cases, only overlap might be exchanged with neighbors – CFD flow.

Tiles vs. striping: more parallelism with tiles. More degrees of freedom in rows.

Processor topology implies a geometry in the processor organization.

 

Data Distribution

Users should have a way to specify a non-default algorithm for distributing
data. This could be allowed by providing a way for the user to specify a
table or a function with the appropriate arguments. For instance, in
two-dimensional row distribution, the function could take in row and column
arguments and return the rank of the process, which should get that row. We
may want to restrict what the user can do to contiguous data.

The user needs a way to specify constraints on the distribution, e.g., a
minimum number of rows for a row distribution.

Need to have whole system, or whole rows to break up processing.

We want to specify base dependencies – e.g., smallest chunk to give to a process

We might want a row –to- tile distribution or vice versa.

Element size: Tile – unit of distribution vs. fixed size was not defined since we were byte shuffling. Although we do need to account for endian-ness somewhere. We didn’t specify exactly what a tile is.

 

The Draft Document

We also need to Finish object description, list attributes and what has to be included - data description

List of attributes:

Object – Need to query descriptions (since your code might be getting the data from someone else’s), Suggested to present current data organization, and the intended data organization, and allow implementation to change it.

Partitioning – block oriented vs. cyclic (card dealing).

Describe how much each process gets, make it flexible – could even be table driven to allow for flexibility.

 

Dennis’ Independence ideas. (part x,y,z) we can maintain a linear description of processes. You hand the number of processes and partition and it can be done any old way – planes, tiles rows etc. part (x,y) divide

Any way practical. This could be set up as an option, this should be separate calls.

 

Layout is orientation in memory

Tile is a subset of band.

Provide Hints on configuration of data, minimum dimensions.

An object yzy can be partx(yz) implies freedom, part xy – can’t touch z. e.g., rectangle or tube.

Need to handle odd size chunks at end.

Mitre STAP Example

Ken Cain submitted the process partitioning performed by STAP. It is included at the end of these minutes.

Goals for the Group to Consider:

Give knowledgeable MPI users some help and tools on top 10 data re-org problems.

Proposal: Do a corner turn implementation, define pieces needed for it, don’t get wrapped around transform logistics (memory, parameters et al).

Develop advice for the HPC community regarding data re-organization techniques, best practices, and announce the intentions of the API, and reveal what it will address, and what it will not.

 

 

ASSUMPTIONS

The following assumptions were summarized at the meeting as being understood in considering the design of a Data reorganization API.

Summary assumption; data blocks are contiguous.

Assumptions: Heterogeneous systems? Yes. Data will be converted ala MPI.
There may be restrictions (initially?) to simple data types.

Assumptions: Support process geometry? No.

Assumptions: Thread-safe calls (i.e., calls can be made to the API at the
same time by multiple threads)? Yes.

Assumptions: Assume MPI-like functionality is available underneath -- this
document/API does not deal with how to actually move the data at the lowest
level.

Assumptions: Memory will be allocated explicitly by users (e.g., with
malloc()) and provided to the API when needed.

Assumptions: Blocking or non-blocking calls? Unresolved, I believe.
[Initially, use blocking calls? The API should not prevent the addition of
non-blocking in the future?]

Assumptions: Avoid endian-ness by saying that X bytes are contiguous.

Design Considerations

Ideally, we should define the API to avoid multiple instances.

A way to do this is to set out some basic capabilities akin to the VSIP Forum’s "Core Lite". Extensions would be allowable.

Define a small API, test it to ensure it is useful. It could layer on top of other libraries. The 3-D transformation would be useful, but starting with a 2-D would help us ‘prototype’ the API.

We should disallow partitions which are non Euclidean.

Offline partition configuration should be offered.

The interface for data movement is rigid, but a configuration type API will allow good decision support to optimally break down the problems.

We are able to view the set of processes as a linear set.

Also account for speed, load, and separation of available processors; giving them all the same amount of data may not be wise. We also want some fault tolerance built in.

API composition – offline, and online. Postulate existence of process groups, we move data between groups. If everyone knew of groups, you could reference the groups

Name of connection is a transformation object, which is built.

Offline process could also figure out allocation needs and provide to allocation function as input.

Figure out largest single use/application/partition, and malloc 2 of them for usage.

Systems which change modes and sizes, such as small memory to large memory usage in next mode.

We would want to inform the local program what its memory needs are.

Issue of thread safety – for SMP it’s required. Memory usage overlap is a consideration in safety as well, so as not to mess around with memory belonging to another computation. We can’t preclude thread safety in the specification.

INERTIA RT prototype example Do we want a prototype for DR? E.g., protoyping or evolution development or an idea merge.

Synchronization issues, layering DR on message passing, we only have loose synchronization. RT does by design imply this intrinsic synchronization.

Strict API has name spaces to define the ops and steps.

Sender and receiver don’t have to attach to a common object.

Do we want to name the transformations? (the namespace problem).

We should prototype the clique vs. pipeline. Pipeline is favored. Experiment with both models.

Nathan: Make reusable components which work for either model. E.g., pipeline to self in clique case.

Richard: clique 2-D corner turn. Clique done wrong never begets a pipeline.

Postulate naming, allocation, groups, network, universal send/receive, e.g., the start., synchronization

We build partitioning, sizing,

Do we worry about blocking, synch – or classifying as user mistake?

Strive for useful, and good enough vs. perfection

Vendors will compete on speed of corner turns, FFTs customized paradigms,

Row-column first

Row-tile next.

Advice – work in area math models exist …

Get more ideas from Dennis, Ken, James,

Support linear process set.

API support of 1 or 0 copy. E.g., shared memory MPI.

DR could bind distributed and shared memory communities. Data intensive computing Re: (Munoz) and Berkley IRAM project page.

Memory management – even in-place operations need some temporary space.

You could give the receiver a pointer to a location and avoid a copy.

 

As a goal we don’t want the API to impose constraints which impede MPI or other middleware. E.g., adding extra copies to the data movements is unacceptable.

Meeting Schedule:

December 1, MITRE/Bedford MA (prior to MPI/RT)

February 1999

March 1999, Atlanta (in conjunction with MPIDC)

June 1999

August 1999, Arlington VA, Final Meeting

September 1999, HPEC Debut of the API

 

MITRE RT_STAP

This is a diagram of the RT_STAP initial data "cube". The data consists of a number of time samples, stored contiguously in memory. The most contiguous dimension is "samples" (marked "Time" in this figure). Each pulse (marked PRI in this figure) contains N samples (N is 1920 for our problem). There are P=64 pulses in the initial data cube. There are L=16 channels, each contaning P=64 pulses, which in-turn contain N=1920 samples.

The partitioning discussion for RT_STAP centered around the so-called "Preprocessing" phase of the software, where the initial data cube is processed prior to performing the space-time adaptive processing (STAP) algorithm. The preprocessing can be applied independently for each pulse of the data cube. That is, each unit of 1920 contiguous samples (a pulse) may be preprocessed independently of any other pulse. So, the natural "degrees of parallelism" in this problem is L times P (16 times 64). However, it is important to know that we will divide up the processing first in the "channel" dimension. It is helpful to think in terms of channels, because the data reorganization that follows preprocessing effectively transposes those channel "faces" of the datacube. So, if we have NP=16 (16 processors), then we would store one channel of data on each available processor (i.e., a "face" of the data cube would be stored in each processor's memory). The example that we discussed at the Data Reorg involved a case where we had NP=48 processors. In that case, we would assign 3 processors to each channel "face" of the datacube.