The Parallel Architectures Library (PAL) is a compact C library with optimized routines for math, synchronization, and inter-processor communication.
-
Library API reference
7.0 Syntax
7.1 Program Flow
7.2 Data Movement
7.3 Synchronization
7.4 Basic Math
7.5 Basic DSP
7.6 Image Processing
7.7 FFT (FFTW)
7.8 Linear Algebra (BLAS)
7.9 System Calls
##Why?
Any sane and informed person knows that the future of computing is massively parallel. Unfortunately the energy needed to escape the current "von Neumann potential well" seems to be approaching infinity. The legacy programming stack is so effective and so easy to use that developers and companies simply cannot afford to choose the better (parallel) solution. To make parallel computing ubiquitous our only choice is to rewrite the whole software stack from scratch, including: algorithms, run-times, libraries, and applications. The goal of the Parallel Architectures Library project is to establish the lowest layer of this brave new programming stack.
##Design Goals
- Fast (Super fast but no "belt AND suspenders")
- Compact (Small enough to work for memory limited processors with <32KB RAM)
- Scalable (Thread and data scalable)
- Portable (Portable across different ISAs and systems)
- Permissive (Apache 2.0 license to maximize industry adoption)
##License
The PAL source code is licensed under the Apache License, Version 2.0. See LICENSE for full license text unless otherwise specified.
##Contribution
Our goal is to make PAL a broad community project from day one. If just 100 people contribute one function each, we'll be done in a couple of days! If you know C, you are ready to contribute!!
Instructions for contributing can be found HERE.
##Build Instructions
###Install Prerequisites
$ sudo apt-get install libtool build-essential pkg-config autoconf automake doxygen
###Build Sequence
$ ./bootstrap
$ ./configure
$ make
###Testing
To run the automated unit tests, run:
$ make check
##A Simple Example
The following sample shows how to use PAL to launch a simple task on a remote processor within the system. The program flow should be familiar to anyone who has used accelerator programming frameworks.
Manager Code

    #include <pal.h>
    #include <stdio.h>
    #define N 16
    int main(int argc, char *argv[])
    {
        // Stack variables
        char *file = "./hello_task.elf";
        char *func = "main";
        int status, i, all, nargs = 1;
        char *args[nargs];
        char argbuf[20];

        // References as opaque structures
        p_dev_t dev0;
        p_prog_t prog0;
        p_team_t team0;
        p_mem_t mem[4];

        // Execution setup
        dev0 = p_init(P_DEV_DEMO, 0);        // initialize device and team
        prog0 = p_load(dev0, file, func, 0); // load a program from file system
        all = p_query(dev0, P_PROP_NODES);   // find number of nodes in system
        team0 = p_open(dev0, 0, all);        // create a team

        // Running program: launch one instance per node
        for (i = 0; i < all; i++) {
            sprintf(argbuf, "%d", i); // string args needed to run main as is
            args[0] = argbuf;
            status = p_run(prog0, team0, i, 1, nargs, args, 0);
        }
        p_wait(team0);    // not needed
        p_close(team0);   // close team
        p_finalize(dev0); // clean up the run time
        return 0;
    }

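Worker Code (hello_task.elf)

    #include <stdio.h>
    #include <stdlib.h> // atoi()
    int main(int argc, char *argv[])
    {
        int pid = 0;
        // The manager passes the processor index as its only argument,
        // so read the last entry of argv
        if (argc > 0)
            pid = atoi(argv[argc - 1]);
        printf("--Processor %d says hello!--\n", pid);
        return 0;
    }
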
##SYNTAX
##PROGRAM FLOW
These program flow functions are used to manage the system and to execute programs. All PAL objects are referenced via handles (opaque objects).
FUNCTION | NOTES |
---|---|
p_init() | initialize the run time |
p_query() | query a device object |
p_load() | load binary elf file into memory |
p_run() | run a program on a team of processors |
p_open() | open a team of processors |
p_append() | add members to team |
p_remove() | remove members from team |
p_close() | close a team of processors |
p_barrier() | team barrier |
p_wait() | wait for team to finish |
p_fence() | memory fence |
p_finalize() | clean up the run time |
p_get_err() | get error code (if any). |
##MEMORY ALLOCATION
These functions are used for creating memory objects.
The functions return a unique PAL handle for each new memory object. This handle can then be used by functions like p_read() and p_write() to access data within the memory object.
FUNCTION | NOTES | STATUS |
---|---|---|
p_malloc() | allocate memory on local processor | |
p_rmalloc() | allocate memory on remote processor | |
p_free() | free memory | |
##DATA MOVEMENT
The data movement functions move blocks of data between opaque memory objects and locations specified by pointers. The memory object is specified by a PAL handle returned by a previous API call. The exception is p_memcpy(), which copies blocks of bytes within a shared memory architecture only.
FUNCTION | NOTES |
---|---|
p_gather() | gather operation |
p_memcpy() | fast memcpy() |
p_read() | read from a memory object |
p_scatter() | scatter operation |
p_write() | write to a memory object |
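
To make the handle-based model concrete, here is a minimal plain-C sketch of a memory object with bounds-checked read and write. It is not PAL's implementation: `sketch_mem_t`, `sketch_read()` and `sketch_write()` are hypothetical names, and the real p_read() and p_write() signatures may differ. In a real library the struct layout would be hidden from callers.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical stand-in for an opaque memory object (not PAL's p_mem_t)
typedef struct {
    unsigned char *base; // start of the backing storage
    size_t size;         // size of the object in bytes
} sketch_mem_t;

// Copy nb bytes from src into the object at byte offset off.
// Returns 0 on success, -1 if the range falls outside the object.
int sketch_write(sketch_mem_t *mem, const void *src, size_t off, size_t nb)
{
    if (off > mem->size || nb > mem->size - off)
        return -1;
    memcpy(mem->base + off, src, nb);
    return 0;
}

// Copy nb bytes from the object at byte offset off into dst.
// Returns 0 on success, -1 if the range falls outside the object.
int sketch_read(const sketch_mem_t *mem, void *dst, size_t off, size_t nb)
{
    if (off > mem->size || nb > mem->size - off)
        return -1;
    memcpy(dst, mem->base + off, nb);
    return 0;
}
```

The caller only ever passes the handle plus a plain pointer, which is the split the paragraph above describes: the library owns the object, the caller owns the buffer.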
##SYNCHRONIZATION
The synchronization functions are useful for program sequencing and resource locking in shared memory systems.
FUNCTION | NOTES |
---|---|
p_mutex_lock() | lock a mutex |
p_mutex_trylock() | try locking a mutex once |
p_mutex_unlock() | unlock (clear) a mutex |
p_mutex_init() | initialize a mutex |
p_atomic_add() | atomic fetch and add |
p_atomic_sub() | atomic fetch and subtract |
p_atomic_and() | atomic fetch and 'and' |
p_atomic_xor() | atomic fetch and 'xor' |
p_atomic_or() | atomic fetch and 'or' |
p_atomic_swap() | atomic exchange |
p_atomic_compswap() | atomic compare and exchange |
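
For readers unfamiliar with the terms in the table, the sketch below shows what "fetch and add" and "compare and exchange" mean, using the standard C11 `<stdatomic.h>` header, and how a try-lock can be built on top of compare and exchange. The `sketch_` functions are hypothetical illustrations and make no claim about how PAL implements or declares its own p_atomic_*() and p_mutex_*() routines.

```c
#include <stdatomic.h>
#include <stdbool.h>

// Fetch and add: atomically add val to *p and return the value held before
int sketch_atomic_add(atomic_int *p, int val)
{
    return atomic_fetch_add(p, val);
}

// Compare and exchange: store desired only if *p currently equals expected.
// Returns true if the exchange happened.
bool sketch_atomic_compswap(atomic_int *p, int expected, int desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}

// Try-lock built on compare and exchange: 0 = unlocked, 1 = locked.
// Makes a single attempt and returns true if the lock was taken.
bool sketch_mutex_trylock(atomic_int *m)
{
    return sketch_atomic_compswap(m, 0, 1);
}

// Unlock by clearing the lock word
void sketch_mutex_unlock(atomic_int *m)
{
    atomic_store(m, 0);
}
```

A blocking lock could be written as a loop around `sketch_mutex_trylock()`, which is one reason a "try once" variant is useful as a separate primitive.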
##MATH
The math functions replace the traditional math lib functions and extend them to include support for data as well as task parallelism.
FUNCTION | NOTES |
---|---|
p_abs() | absolute value |
p_absdiff() | absolute difference |
p_add() | add |
p_acos() | arc cosine |
p_acosh() | arc hyperbolic cosine |
p_asin() | arc sine |
p_asinh() | arc hyperbolic sine |
p_cbrt() | cube root |
p_cos() | cosine |
p_cosh() | hyperbolic cosine |
p_div() | division |
p_dot() | dot product |
p_exp() | exponential |
p_ftoi() | float to integer conversion |
p_itof() | integer to float conversion |
p_inv() | inverse |
p_invcbrt() | inverse cube root |
p_invsqrt() | inverse square root |
p_ln() | natural log |
p_log10() | base-10 log |
p_max() | finds max val |
p_min() | finds min val |
p_mean() | mean operation |
p_median() | finds middle value |
p_mode() | finds most common value |
p_mul() | multiplication |
p_popcount() | count the number of bits set |
p_pow() | element raised to a power |
p_rand() | random number generator |
p_randinit() | init random number generator |
p_sort() | heap sort |
p_sin() | sine |
p_sinh() | hyperbolic sine |
p_sqrt() | square root |
p_stddev() | calculates standard deviation |
p_sub() | subtract |
p_sum() | sum of all vector elements |
p_sumsq() | sum of all squared elements |
p_tan() | tangent |
p_tanh() | hyperbolic tangent |
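
One way to read "data parallelism" here is that each routine works on whole arrays instead of single scalars, so the loop body can be vectorized or split across processors. The plain-C sketch below illustrates that shape for an elementwise add, a dot product, and a sum of squares. The `sketch_` names, the `float` element type, and the argument order are assumptions made for illustration and are not taken from the PAL headers.

```c
#include <stddef.h>

// Elementwise add: c[i] = a[i] + b[i] for i in [0, n)
void sketch_add_f32(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// Dot product: returns the sum of a[i] * b[i] for i in [0, n)
float sketch_dot_f32(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

// Sum of squares, expressed as the dot product of a vector with itself
float sketch_sumsq_f32(const float *a, size_t n)
{
    return sketch_dot_f32(a, a, n);
}
```

Each iteration of the elementwise loop is independent, so it parallelizes trivially. The reductions (dot product, sum of squares) need partial sums to be combined at the end when split across processors.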
##DSP
The digital signal processing (DSP) functions follow the same convention as the math function set.
FUNCTION | NOTES |
---|---|
p_acorr() | autocorrelation: r[j] = sum ( x[j+k] * x[k] ), k=0..(n-j-1) |
p_conv() | convolution: r[j] = sum ( h[k] * x[j-k] ), k=0..(nh-1) |
p_xcorr() | correlation: r[j] = sum ( x[j+k] * y[k] ), k=0..(nx+ny-1) |
p_fir() | FIR filter direct form: r[j] = sum ( h[k] * x[j-k] ), k=0..(nh-1) |
p_firdec() | FIR filter with decimation: r[j] = sum ( h[k] * x[j*D-k] ), k=0..(nh-1) |
p_firint() | FIR filter with interpolation: r[j] = sum ( h[k] * x[j*D-k] ), k=0..(nh-1) |
p_firsym() | FIR symmetric form |
p_iir() | IIR filter |
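
The formulas in the table translate almost directly into loops. The sketch below implements the direct-form FIR and autocorrelation formulas in plain C. The `sketch_` names and argument order are hypothetical, and treating samples before `x[0]` as zero is an assumption of this sketch, since the table does not say how PAL handles the boundary.

```c
#include <stddef.h>

// Direct-form FIR: r[j] = sum ( h[k] * x[j-k] ), k=0..(nh-1), for j in [0, nx).
// Assumption: samples before x[0] are zero, so terms with k > j are skipped.
void sketch_fir_f32(const float *x, const float *h, float *r,
                    size_t nx, size_t nh)
{
    for (size_t j = 0; j < nx; j++) {
        float acc = 0.0f;
        for (size_t k = 0; k < nh && k <= j; k++)
            acc += h[k] * x[j - k];
        r[j] = acc;
    }
}

// Autocorrelation: r[j] = sum ( x[j+k] * x[k] ), k=0..(n-j-1),
// computed for the first nr lags (lags at or beyond n are not written)
void sketch_acorr_f32(const float *x, float *r, size_t n, size_t nr)
{
    for (size_t j = 0; j < nr && j < n; j++) {
        float acc = 0.0f;
        for (size_t k = 0; k + j < n; k++)
            acc += x[j + k] * x[k];
        r[j] = acc;
    }
}
```

For example, filtering `{1, 2, 3, 4}` with the two-tap filter `{1, 1}` sums each sample with its predecessor, giving `{1, 3, 5, 7}`.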
##IMAGE PROCESSING
The image processing functions follow the same convention as the math function set.
FUNCTION | NOTES |
---|---|
p_box3x3() | box filter (3x3) |
p_conv2d() | 2d convolution |
p_gauss3x3() | gaussian blur filter (3x3) |
p_median3x3() | median filter (3x3) |
p_laplace3x3() | laplace filter (3x3) |
p_prewitt3x3() | prewitt filter (3x3) |
p_sad8x8() | sum of absolute differences (8x8) |
p_sad16x16() | sum of absolute differences (16x16) |
p_sobel3x3() | sobel filter (3x3) |
p_scharr3x3() | scharr filter (3x3) |
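
As an illustration of what two of these kernels compute, the sketch below shows a 3x3 box filter and an 8x8 sum of absolute differences in plain C. The `sketch_` names, the row-major layout, the element types, and the choice to produce output only for interior pixels are all assumptions of this sketch, not details taken from PAL.

```c
#include <stddef.h>
#include <stdlib.h>

// 3x3 box filter on a row-major rows x cols image.
// Output covers interior pixels only, so r holds (rows-2) x (cols-2) values.
void sketch_box3x3_f32(const float *x, float *r, size_t rows, size_t cols)
{
    if (rows < 3 || cols < 3)
        return; // image too small for a 3x3 window
    for (size_t i = 0; i + 2 < rows; i++) {
        for (size_t j = 0; j + 2 < cols; j++) {
            float acc = 0.0f;
            for (size_t di = 0; di < 3; di++)
                for (size_t dj = 0; dj < 3; dj++)
                    acc += x[(i + di) * cols + (j + dj)];
            r[i * (cols - 2) + j] = acc / 9.0f; // mean of the 9 neighbors
        }
    }
}

// Sum of absolute differences between two 8x8 blocks.
// stride is the row pitch, in elements, of both source images.
unsigned sketch_sad8x8_u8(const unsigned char *a, const unsigned char *b,
                          size_t stride)
{
    unsigned sad = 0;
    for (size_t i = 0; i < 8; i++)
        for (size_t j = 0; j < 8; j++)
            sad += (unsigned)abs((int)a[i * stride + j] - (int)b[i * stride + j]);
    return sad;
}
```

The other 3x3 filters in the table (Gaussian, Laplace, Prewitt, Sobel, Scharr) follow the same window loop with different weights in place of the uniform 1/9.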
##FFT
- An FFTW-like interface
##BLAS
- A port of the BLIS library?
##SYSTEM CALLS
- Bionic libc implementation as starting point.