Proposed API approach for handling data partitioning, layout, and resulting local memory buffer access/management
 

History:

Summary of Suggested Goals and Approaches / Discussion

Goal: Define the interface for the programmer to traverse through local memory to retrieve and process the data block(s) assigned to the calling process in a data reorg "distribution"

Goal: Make the interface as compact and simple to use as possible.

Suggested Approach: Hide all details of local memory buffer "striding", whether it results from a memory layout specification or from the local storage of "overlapped" data elements. After all, the user (usually) just wants to get to the "real" data and doesn't want to worry about how to calculate memory offsets to that data.

As the discussion below shows, this costs some compactness of expression in user code, which must move away from raw pointer math and/or array indexing, but it saves the programmer from writing offset calculations by hand. Object oriented language features such as operator overloading may recover a compact notation that looks a lot like traditional array indexing. Such an interface can also make each memory lookup cheaper: the library can compute one address with pointer math (a single memory load) instead of following one level of indirection per dimension of a multidimensional array (several dependent memory loads).
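To make the pointer-math argument concrete, here is a minimal C sketch. The names `view2d`, `view2d_get`, and `rowptr_get` are hypothetical illustrations and are not part of any proposed DRI or VSIPL interface; the sketch assumes a real-valued `float` buffer described by one offset and one stride per dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-D view descriptor (illustration only, not a DRI type). */
typedef struct {
    float  *base;        /* start of the local memory buffer */
    size_t  offset;      /* positions to skip before the first "real" element */
    size_t  row_stride;  /* positions between vertically adjacent elements */
    size_t  col_stride;  /* positions between horizontally adjacent elements */
    size_t  rows, cols;  /* extent of the "real" data */
} view2d;

/* One address computation followed by one load: the caller never sees
 * the offset or the strides. */
static float view2d_get(const view2d *v, size_t row, size_t col)
{
    assert(row < v->rows && col < v->cols);
    return v->base[v->offset + row * v->row_stride + col * v->col_stride];
}

/* For contrast: an array-of-row-pointers lookup needs two dependent
 * loads (first the row pointer, then the element itself). */
static float rowptr_get(float *const *rows, size_t row, size_t col)
{
    return rows[row][col];
}
```

The application code only names a row and a column; whether the buffer has an up-front offset, padding, or overlap elements is carried entirely by the descriptor.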

In the discussion below, note the similarity in concepts between the term "data view" and the VSIPL view object approach.
 
 

The approach


Hypothesize / design an API that hides all of the local memory buffer striding details from application source code. Then, show that this API can cover many/all cases currently supported in the data partitioning part of the API.
 

Notional interfaces


Important notation conventions:


To achieve the goals stated earlier, the following interfaces are proposed:
 


 

Definition of Datatypes Needed to Support Interfaces Described Above

We have not yet adequately defined Data Reorg datatype support (DRI_Dataspec). However, this is needed to define the interfaces that I describe above. Here is a first draft of datatype support in DRI, shown as an equivalence table between DRI, MPI, VSIPL, and ANSI C data types. (DRI and MPI use built-in type descriptors, whereas VSIPL and ANSI C use conventional data types.)
 
 
DRI datatype descriptor     MPI (C language) datatype    VSIPL scalar          ANSI C datatype
(DRI_Dataspec)              descriptor (MPI_Datatype)    datatype
--------------------------  ---------------------------  --------------------  ------------------
DRI_FLOAT                   MPI_FLOAT                    vsip_scalar_f         float
DRI_DOUBLE                  MPI_DOUBLE                   vsip_scalar_d         double
DRI_COMPLEX                 N/A [1]                      vsip_cscalar_f        N/A [3]
DRI_DOUBLE_COMPLEX          N/A [2]                      vsip_cscalar_d        N/A [3]
DRI_COMPLEX_SPLIT           N/A                          vsip_cscalar_f [4]    N/A [3]
DRI_DOUBLE_COMPLEX_SPLIT    N/A                          vsip_cscalar_d [4]    N/A [3]
DRI_INTEGER                 MPI_INT                      vsip_scalar_i         int
DRI_SHORT                   MPI_SHORT                    vsip_scalar_si        signed short int
DRI_UNSIGNED_SHORT          MPI_UNSIGNED_SHORT           vsip_scalar_us        unsigned short int
DRI_LONG                    MPI_LONG                     vsip_scalar_li        signed long int
DRI_UNSIGNED_LONG           MPI_UNSIGNED_LONG            vsip_scalar_ul        unsigned long int

[1] MPI_COMPLEX exists only for the FORTRAN bindings.
[2] MPI_DOUBLE_COMPLEX exists only for the FORTRAN bindings.
[3] No language support.
[4] Becomes an issue at DRI bufferset creation time.
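The split versus interleaved complex rows in the table above matter mostly for how much memory, and how many separate arrays, a buffer needs. The following C sketch illustrates that difference. The `sk_` names are hypothetical stand-ins invented for this illustration (DRI_Dataspec itself is not yet defined), and the sketch assumes that "split" means real and imaginary parts are kept in two separate arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical encoding of some descriptors from the table (not a DRI type). */
typedef enum {
    SK_FLOAT, SK_DOUBLE, SK_COMPLEX, SK_DOUBLE_COMPLEX,
    SK_COMPLEX_SPLIT, SK_DOUBLE_COMPLEX_SPLIT
} sk_dataspec;

/* Number of separate memory planes one buffer of this type needs:
 * split complex keeps real and imaginary parts in two arrays. */
static int sk_num_planes(sk_dataspec t)
{
    return (t == SK_COMPLEX_SPLIT || t == SK_DOUBLE_COMPLEX_SPLIT) ? 2 : 1;
}

/* Bytes occupied by one element within a single plane. */
static size_t sk_plane_elem_size(sk_dataspec t)
{
    switch (t) {
    case SK_FLOAT:                return sizeof(float);
    case SK_DOUBLE:               return sizeof(double);
    case SK_COMPLEX:              return 2 * sizeof(float);   /* interleaved re, im */
    case SK_DOUBLE_COMPLEX:       return 2 * sizeof(double);  /* interleaved re, im */
    case SK_COMPLEX_SPLIT:        return sizeof(float);       /* one part per plane */
    case SK_DOUBLE_COMPLEX_SPLIT: return sizeof(double);      /* one part per plane */
    }
    return 0; /* unknown descriptor */
}
```

Both complex forms occupy the same total number of bytes per element; only the number of planes that must be allocated and tracked differs.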

 
 
 

Specific proposed interfaces (using "CORE" DR interfaces, plus integrating VSIPL)


This section still needs to be written.
 
 

Examples (using "CORE" DR interfaces, plus integrating VSIPL)


/** EXAMPLE #1 - lots of crazy options used ("stressing the interface") **
 ** Assumptions for this example:
 * matrix size: 64 columns by 20 rows
 * data type is single-precision floating point complex
 * 5 processes will partition the matrix
 *   (logically viewed as a 1 column by 5 row process group)
 * processes participate in a reorg as a RECEIVER group
 * columns will not be partitioned
 * rows will be partitioned in block-cyclic fashion, block size=2
 * left overlap in row dimension will be 1 element
 * right overlap in row dimension will be 1 element
 * overlap "policy" on global data edges: DRI_OVERLAP_TOROIDAL
 *
 * Row major ordering of data (columns dimension is ordered "fastest"
 *    in linear memory space)
 *
 * stride between consecutive elements in a single row = 1 position
 * stride between consecutive elements in a single column = 2 positions
 *
 * there is an offset of 8 rows of local memory _before_ the "real"
 *   data is actually stored
 *   (this works out to 8 rows * 64 elements / row = 512 elements)
 *
 * the local memory buffers will be allocated by DRI middleware
 *   (OPEN QUESTION: Can "VSIPL Data" be allocated internally?)
 *
 *
 */

#include <stdio.h>   /* printf */
#include <dri.h>
#include <vsip.h>

int global_dims[2] = {64, 20}; /* global data size (no memory order specified) */
int mem_dim_order[2] = {0, 1}; /* data dimension containing 64 elements will be ordered "fastest" */
int mem_dim_strides[2] = {1, 2}; /* stride of 1 within a row, 2-position stride within a column */
int proc_dims[2] = {1, 5}; /* logical process set dimensionality to help partitioning */
int num_channel_bufs = 2; /* number of local memory blocks used for multi-buffering */
int chan_id = 2000; /* numeric data reorg channel identifier */
DRI_Global_Data *GDO; /* handle to represent the global data and its attributes */
DRI_Group *P; /* handle to represent process set dividing the global data (its creation is not shown in this example) */
DRI_Overlap *l_ovr, *r_ovr; /* left and right overlaps used in data partitioning */
DRI_Partition parts[2]; /* per-dimension specification of how to partition data */
DRI_Layout layout; /* describes how to order multi-dimensional data stored in local memory */
DRI_Distribution *distr; /* stores all details of how data is partitioned */
DRI_Bufferset *bufset; /* handle to local memory blocks (buffers) shared by app and library */
DRI_Channel *rcv_chan; /* persistent communication handle used for invoking data reorgs */
DRI_Buffer_Id *buf; /* handle to a local memory block (buffer) */
int num_data_slices; /* # of subset regions of the global data assigned to this process */
int sli_ct; /* loop control variable to loop over multiple locally stored data slices */
vsip_fftm_f *FFT_handle;
vsip_cmview_f *matrix; /* A matrix view of a data slice (VSIPL object) */
vsip_cmattr_f *matrix_attributes;
vsip_cscalar_f cval; /* storage for a single-precision complex value */
int row_ct, col_ct; /* loop indices for traversing data elements in a data slice */

DRI_Init (&argc, &argv);

/* 2D, 64 by 20, complex data */
DRI_Global_Data_create (2, global_dims, DRI_COMPLEX, &GDO);

/* Toroidal left and right overlap of 1 position */
DRI_Overlap_create (DRI_OVERLAP_TOROIDAL, 1, &l_ovr);
DRI_Overlap_create (DRI_OVERLAP_TOROIDAL, 1, &r_ovr);
 

/* Dimension 0: not partitioned
 * Dimension 1: block-cyclic partitioned, block size = 2, subject to overlap specs
 */
DRI_Partition_whole_create (&parts[0]);
DRI_Partition_blockcyclic_create (l_ovr, r_ovr, 2, &parts[1]);

/* 512 position "up-front" offset */
DRI_Layout_create (512, mem_dim_order, mem_dim_strides, &layout);

/* Determine explicit data partitioning details:
 * Split data (GDO) over process set (P),
 * using partitioning approach (parts),
 * using local memory offset/ordering/striding (layout),
 * store results in (distr)
 */
DRI_Distribution_create (GDO, P, proc_dims, parts, layout, &distr);

/* Create local memory (accessed by bufset) that will store local data:
 * Create 1 or more buffers (num_channel_bufs) to support multi-buffering
 */
DRI_Bufferset_system_create (num_channel_bufs, distr, &bufset);

/* Create persistent communication handles to invoke data reorg operations
 * This application receives data in a pipeline data-parallel system,
 *    so create a "recv" channel.
 */
DRI_Channel_create_recv (chan_id, distr, bufset, &rcv_chan);
 

/* Initialize the receive channel, all processes coordinate here */
DRI_Channel_connect (rcv_chan);
 

DRI_Distribution_get_numslices (distr, &num_data_slices);

/* NOTE: Could use another DRI_Distribution operation that
 * returns a list of all  "view attributes" that
 * will result from the partitioning of the data
 * (e.g., unevenly divisible block-cyclic cases where
 *  more than one data slice size results)
 *
 * This can provide run-time performance improvement
 * in operational loop below because you wouldn't have
 * to query the VSIPL view for stride, offset, and length
 * attributes for each data slice
 */
 

/* Set up VSIPL multiple, in-place row FFT's */
FFT_handle = vsip_ccfftmip_create_f ((vsip_length) global_dims[1],   /* number of row FFTs = 20 */
                                     (vsip_length) global_dims[0],   /* length of row FFTs = 64 */
                                     (vsip_scalar_f) 1.0,            /* FFT scale factor */
                                     (vsip_fft_dir) VSIP_FFT_FWD,    /* Forward (not inverse) FFTs */
                                     (vsip_major) VSIP_ROW,          /* FFT the rows - row-major spec */
                                     (vsip_length) 0,                /* # of times FFT_handle to be invoked
                                                                      * (0 means "semi-infinite")
                                                                      */
                                     (vsip_alg_hint) VSIP_ALG_TIME); /* optimize for latency */

/* OK, finally entering operational loop! */
for (; ;) {

    /* Get the data from the upstream process group */
    DRI_Channel_get (rcv_chan, &buf);

    /* Traverse all blocks assigned to this process
     * by the earlier data partitioning/distribution
     *
     * This uses a simple linear block number approach.
     * Another interface is needed to use multi-dimensional
     * block numbers.
     */
    for (sli_ct = 0; sli_ct < num_data_slices; sli_ct++) {
       /* Get VSIPL matrix view _reference_
        * (library is managing view objects internally)
        *
        * IMPORTANT NOTE: This is using one _version_ of
        * the "get view" interface, based on a _linear_
        * data slice number being provided as input.
        * We are leaving the order of access to the data slices
        * up to the middleware implementation.
        *
        * The alternative interface would look something
        * like DRI_Buffer_multid_get_cmview_f (buf, sli_x_ct, sli_y_ct);
        *
        * Of course, object oriented languages could handle this
        * with fewer function names by having multiple "get view"
        * signatures.
        */
        matrix = DRI_Buffer_get_cmview_f (buf, sli_ct);

        /* PROCESSING stage:
         * It's somewhat unrealistic for a block-cyclic partitioning
         * example, but perform a "multiple, in-place FFT" operation
         * on the rows of the data slice (all row elements are local)
         */

        /* PROCESSING APPROACH:
         * Process by using VSIPL matrix view & VSIPL routines
         * Use VSIPL matrix view attributes to get stride, extent info (OPTIONAL)
         * Use vsip_cmget_f() and vsip_cmput_f() to read/write values (OPTIONAL)
         */
 

        /* Processing using VSIPL view */
        vsip_ccfftmip_f (FFT_handle, matrix);
 

        /* OPTIONAL STEP: If I need to know number of rows, columns, other attributes:
         * the query fills in a local attribute structure (C99 declaration) */
        vsip_cmattr_f attr;
        vsip_cmgetattrib_f (matrix, &attr);
 

        /* OPTIONAL STEP: If I need to get/set values at a specific local data position */
        for (row_ct = 0; row_ct < (int) attr.col_length; row_ct++) {
            for (col_ct = 0; col_ct < (int) attr.row_length; col_ct++) {

                cval = vsip_cmget_f (matrix, row_ct, col_ct);

                /* Print complex values as (real, imag) pairs */
                printf ("matrix[%d][%d] = (%f, %f)\n", row_ct, col_ct,
                         (float) vsip_real_f (cval),
                         (float) vsip_imag_f (cval));

            } /* end of for loop over columns of data matrix */
        } /* end of for loop over rows of data matrix */

    }
}

DRI_Finalize();
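
The block-cyclic assignment assumed by Example #1 can be worked out with a few lines of integer arithmetic. The following C sketch is only an illustration: the function names are hypothetical (they are not DRI calls), blocks are assumed to be dealt out to processes in round-robin order, and DRI_OVERLAP_TOROIDAL is assumed to mean that overlap indices wrap around the global data edges.

```c
#include <assert.h>

/* Process that owns global row `row` under block-cyclic partitioning. */
static int bc_owner(int row, int block, int nprocs)
{
    return (row / block) % nprocs;
}

/* Local slice number (0-based) that global row `row` falls into on its owner. */
static int bc_local_slice(int row, int block, int nprocs)
{
    return (row / block) / nprocs;
}

/* First global row of local slice `slice` on process `proc`. */
static int bc_slice_first_row(int proc, int slice, int block, int nprocs)
{
    return (slice * nprocs + proc) * block;
}

/* Wrap a row index onto [0, nrows): the assumed meaning of toroidal overlap. */
static int wrap_row(int row, int nrows)
{
    return ((row % nrows) + nrows) % nrows;
}
```

With the example's parameters (20 rows, block size 2, 5 processes), `bc_slice_first_row(0, 1, 2, 5)` is 10, so process 0 would hold rows 0-1 and 10-11 as its two slices. Under the wraparound assumption, the one-row left overlap of its first slice would be row `wrap_row(-1, 20)`, that is, row 19.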
 
 

Specific proposed interfaces (using "standalone" DR interfaces)


The proposal here is to use a subset of the VSIPL 1.0 API to provide memory block and data view constructs. "Standalone" implementations can implement only this subset of VSIPL, which deals with the way that data is stored in memory (and not how it is processed). Other layered implementations can require a VSIPL implementation as a prerequisite to writing DR programs. Early co-layered implementations will bundle VSIPL and Data Reorg as a single library infrastructure to more cleanly specify the memory sharing that is needed between communication and processing operations. Over the long term, perhaps a parallel VSIPL would incorporate DR API constructs as a more completely unified interface.
 

Assume that VSIPL blocks are created and maintained internally by the
DRI implementation. Therefore, the user sees no block create/destroy
calls.

Most of the vector view support calls would need to be part of the
DRI implementation in order to be most useful. A minimal
implementation could remove the individual put/get attribute calls.

DRI_Buffer_get_view
vsip_vcloneview
vsip_vdestroy
vsip_vget
vsip_vput
vsip_vgetattrib
vsip_vputattrib
vsip_vimagview
vsip_vrealview

Example description of a particular function:
DRI_Buffer_get_vview, DRI_Buffer_get_cvview - Get VSIPL view of a buffer as a vector

SYNOPSIS
DRI_Buffer_get_vview(buffer, slice, view)
DRI_Buffer_get_cvview(buffer, slice, view)

PARAMETERS
IN: buffer (DRI_Buffer_Id) DRI storage area for data
IN: slice (integer) Block-cyclic slice index
OUT: view (vsip_vview or vsip_cvview) VSIPL view to be manipulated by the user

META-NOTES
The output view is meta; other middleware that allows the concept of
a local vector object (though probably not a simple pointer
construct) could be used instead.

DESCRIPTION
This function allows the user to view the local portion of a global
data object as a vector. The offset, length, and stride of the view
will be set appropriately by the DRI implementation to reflect the
layout of the local portion.

If the global data object is distributed in a block-cyclic fashion,
the local portion of the object consists of a series of contiguous
blocks. The 'slice' parameter is an index into this series of blocks
that gives the current local block of interest. If the global data
object is not distributed in a block-cyclic way, this parameter is
ignored.

COMMUNICATION BEHAVIOR
Local.

RESTRICTIONS
It is assumed that the local portion has no more than one dimension.
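
As a rough illustration of the description above, the C sketch below shows how an implementation might derive the offset, length, and stride of a vector view from a slice index. Everything here is hypothetical: `sk_vview`, `sk_get_vview`, `sk_vget`, and `sk_vput` are invented for this sketch and are not VSIPL or DRI calls, and the sketch assumes that local blocks of equal length are stored back to back after an up-front offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vector view: the three attributes the description names. */
typedef struct {
    float  *data;    /* underlying local buffer */
    size_t  offset;  /* buffer position of the view's first element */
    size_t  length;  /* number of elements visible through the view */
    size_t  stride;  /* buffer positions between consecutive view elements */
} sk_vview;

/* Build the view for local block `slice`, assuming blocks of `block_len`
 * elements are stored back to back after `base_offset` positions. */
static sk_vview sk_get_vview(float *buffer, size_t base_offset,
                             size_t block_len, size_t stride, size_t slice)
{
    sk_vview v;
    v.data   = buffer;
    v.offset = base_offset + slice * block_len * stride;
    v.length = block_len;
    v.stride = stride;
    return v;
}

/* Read element i of the view; the caller never computes a buffer offset. */
static float sk_vget(const sk_vview *v, size_t i)
{
    assert(i < v->length);
    return v->data[v->offset + i * v->stride];
}

/* Write element i of the view. */
static void sk_vput(sk_vview *v, size_t i, float value)
{
    assert(i < v->length);
    v->data[v->offset + i * v->stride] = value;
}
```

The user indexes elements 0 through length-1 of the current slice, and moving to the next local block only means asking for a view with a different slice index.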
 
 

A similar approach can be taken for VSIPL matrix and tensor descriptors. This still needs to be specified.
 
 

Examples (using "standalone" DR interfaces)

In this proposal, there is no distinction between the CORE interfaces presented earlier and the standalone DR interfaces, because the proposal adopts a subset of VSIPL to cover both cases.