diff --git a/README.rst b/README.rst
index ae7ffdd..75fb6f5 100644
--- a/README.rst
+++ b/README.rst
@@ -3,7 +3,7 @@ mpi4py-ve
#########
*mpi4py-ve* is an extension to *mpi4py*, which provides Python bindings for the Message Passing Interface (MPI).
-This package also supports to communicate array objects of `NLCPy `_ (nlcpy.ndarray) between MPI processes on x86 servers of SX-Aurora TSUBASA systems.
+This package also supports communication of `NLCPy `_ array objects (nlcpy.ndarray) between MPI processes on x86 servers of SX-Aurora TSUBASA systems.
Combining NLCPy with *mpi4py-ve* enables Python scripts to utilize multi-VE computing power.
The current version of *mpi4py-ve* is based on *mpi4py* version 3.0.3.
For details of API references, please refer to `mpi4py manual `_.
@@ -14,25 +14,27 @@ Requirements
Before the installation, the following components are required to be installed on your x86 Node of SX-Aurora TSUBASA.
-- `NEC SDK `_
- - required NEC C/C++ compiler version: >= 3.2.1
- - required NLC version: >= 2.3.0
+- `Alternative VE Offloading (AVEO) `_
+ - required version: >= 2.13.0
-- `VEOS `_
- - required version: >= 2.11.1
-
-- `NEC MPI `_
- - required NEC MPI version: >= 2.20.0
+- `NEC MPI `_
+ - required NEC MPI version: > 2.22.0 (for Mellanox OFED 4.x) or >= 3.1.0 (for Mellanox OFED 5.x)
- `Python `_
- required version: 3.6, 3.7, or 3.8
-- `NLCPy `_
- - required version: v2.1.1
-
- `NumPy `_
- required version: v1.17, v1.18, v1.19, or v1.20
+- `NLC (optional) `_
+ - required version: >= 2.3.0
+
+- `NLCPy (optional) `_
+ - required version: >= 2.2.0
+
+Since December 2022, *mpi4py-ve* has been provided as part of NEC SDK (NEC Software Development Kit for Vector Engine).
+If NEC SDK on your machine was installed or updated after that date, *mpi4py-ve* is available via the /usr/bin/python3 command, as the quick check below illustrates.
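+
+As a quick check (a sketch, assuming the NEC SDK installation above), verify that the bundled *mpi4py-ve* can be imported:
+
+::
+
+    $ /usr/bin/python3 -c "from mpi4pyve import MPI; print(MPI.Get_version())"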
+
******************
Install from wheel
******************
@@ -57,16 +59,14 @@ You can install *mpi4py-ve* by executing either of the following commands.
$ pip install
-The shared objects for Vector Engine, which are included in the wheel package, are compiled and tested by using following software:
- +-------------------+---------------+
- | NEC C/C++ Compiler| Version 3.2.1 |
- +-------------------+---------------+
- | NEC MPI | v2.20.0 |
- +-------------------+---------------+
- | NumPy | v1.19.2 |
- +-------------------+---------------+
- | NLCPy | v2.1.1 |
- +-------------------+---------------+
+The shared objects for Vector Host, which are included in the wheel package, are compiled with gcc 4.8.5 and tested with the following software:
+ +---------+--------------------+
+   | NEC MPI | v2.22.0 and v3.1.0 |
+ +---------+--------------------+
+ | NumPy | v1.19.2 |
+ +---------+--------------------+
+ | NLCPy | v2.2.0 |
+ +---------+--------------------+
***********************************
Install from source (with building)
@@ -178,7 +178,7 @@ When using the *mpirun* command:
::
- $ mpirun -vh -np N $(which python) sample.py
+ $ mpirun -veo -np N $(which python) sample.py
| Here, N is the number of MPI processes that are created on an x86 server.
| NEC MPI 2.21.0 or later supports the environment variable `NMPI_USE_COMMAND_SEARCH_PATH`.
@@ -187,9 +187,141 @@ When using the *mpirun* command:
::
$ export NMPI_USE_COMMAND_SEARCH_PATH=ON
- $ mpirun -vh -np N python sample.py
+ $ mpirun -veo -np N python sample.py
+
+| For details of the mpirun command, refer to `NEC MPI User's Guide `_.
+
+******************
+Execution Examples
+******************
+
+The following examples show how to launch MPI programs that use mpi4py-ve and NLCPy on the SX-Aurora TSUBASA.
+
+| *ncore* : Number of cores per VE.
+| a.py: Python script using mpi4py-ve and NLCPy.
+|
+
+* Interactive Execution
+
+ * Execution on one VE
+
+    Example of using 4 processes on local VH and 4 VE processes (*ncore* / 4 OpenMP threads per process) on VE#0 of local VH
+
+ ::
+
+ $ mpirun -veo -np 4 python a.py
+
+ * Execution on multiple VEs on a VH
+
+    Example of using 4 processes on local VH and 4 VE processes (1 process per VE, *ncore* OpenMP threads per process) on VE#0 to VE#3 of local VH
+
+ ::
+
+ $ VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 4 python a.py
+
+
+    Example of using 32 processes on local VH and 32 VE processes (8 processes per VE, *ncore* / 8 OpenMP threads per process) on VE#0 to VE#3 of local VH
+
+ ::
+
+ $ VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 32 python a.py
+
+ * Execution on multiple VEs on multiple VHs
+
+    Example of using a total of 32 processes on two VHs, host1 and host2, and a total of 32 VE processes on VE#0 and VE#1 of each VH (8 processes per VE, *ncore* / 8 OpenMP threads per process)
+
+ ::
+
+ $ VE_NLCPY_NODELIST=0,1 mpirun -hosts host1,host2 -veo -np 32 python a.py
+
+* NQSV Request Execution
+
+ * Execution on a specific VH, on a VE
+
+    Example of using 32 processes on logical VH#0 and 32 VE processes on logical VE#0 to logical VE#3 of logical VH#0 (8 processes per VE, *ncore* / 8 OpenMP threads per process)
+
+ ::
+
+ #PBS -T necmpi
+ #PBS -b 2 # The number of logical hosts
+ #PBS --venum-lhost=4 # The number of VEs per logical host
+ #PBS --cpunum-lhost=32 # The number of CPUs per logical host
+
+ source /opt/nec/ve/mpi/2.22.0/bin/necmpivars.sh
+ export NMPI_USE_COMMAND_SEARCH_PATH=ON
+ mpirun -host 0 -veo -np 32 python a.py
+
+ * Execution on a specific VH, on a specific VE
-| For details of mpirun command, refer to `NEC MPI User's Guide `_.
+    Example of using 16 processes on logical VH#0 and 16 VE processes in total on logical VE#0 and logical VE#3 of logical VH#0 (8 processes per VE, *ncore* / 8 OpenMP threads per process)
+
+ ::
+
+ #PBS -T necmpi
+ #PBS -b 2 # The number of logical hosts
+ #PBS --venum-lhost=4 # The number of VEs per logical host
+ #PBS --cpunum-lhost=16 # The number of CPUs per logical host
+
+ source /opt/nec/ve/mpi/2.22.0/bin/necmpivars.sh
+ export NMPI_USE_COMMAND_SEARCH_PATH=ON
+ VE_NLCPY_NODELIST=0,3 mpirun -host 0 -veo -np 16 python a.py
+
+ * Execution on all assigned VEs
+
+    Example of using 32 processes in total on 4 VHs and 32 VE processes in total on logical VE#0 to logical VE#7 of each VH (1 process per VE, *ncore* OpenMP threads per process).
+
+ ::
+
+ #PBS -T necmpi
+ #PBS -b 4 # The number of logical hosts
+ #PBS --venum-lhost=8 # The number of VEs per logical host
+ #PBS --cpunum-lhost=8 # The number of CPUs per logical host
+ #PBS --use-hca=2 # The number of HCAs
+
+ source /opt/nec/ve/mpi/2.22.0/bin/necmpivars.sh
+ export NMPI_USE_COMMAND_SEARCH_PATH=ON
+ mpirun -veo -np 32 python a.py
+
+*********
+Profiling
+*********
+NEC MPI provides a facility for displaying MPI communication information.
+Two formats of MPI communication information are available, as described below:
+
++-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Reduced Format | The maximum, minimum, and average values of MPI communication information of all MPI processes are displayed. |
++-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Extended Format | MPI communication information of each MPI process is displayed in the ascending order of their ranks in the communicator MPI_COMM_WORLD after the information in the reduced format. |
++-----------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+
+You can control the display and format of MPI communication information by setting the environment variable NMPI_COMMINF at runtime as shown in the following table.
+
+The Settings of NMPI_COMMINF:
+
++--------------+-----------------------+
+| NMPI_COMMINF | Displayed Information |
++--------------+-----------------------+
+| NO | (Default) No Output |
++--------------+-----------------------+
+| YES | Reduced Format |
++--------------+-----------------------+
+| ALL | Extended Format |
++--------------+-----------------------+
+
+When using the *mpirun* command:
+
+ ::
+
+ $ export NMPI_COMMINF=ALL
+ $ mpirun -veo -np N python sample.py
+
+***************************************************
+Use mpi4py-ve with homebrew classes (without NLCPy)
+***************************************************
+
+The following links may be useful for using *mpi4py-ve* with homebrew classes (without NLCPy); a minimal sketch follows the link:
+
+* `use mpi4py-ve with homebrew classes (without NLCPy) `_
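+
+As a minimal sketch (the ``Point`` class below is hypothetical), the lowercase API inherited from *mpi4py* pickles arbitrary Python objects, so homebrew classes can be communicated directly:
+
+.. code-block:: python
+
+    from mpi4pyve import MPI
+
+    class Point(object):
+        def __init__(self, x, y):
+            self.x, self.y = x, y
+
+    comm = MPI.COMM_WORLD
+    if comm.Get_rank() == 0:
+        comm.send(Point(1.0, 2.0), dest=1)  # pickled on send
+    elif comm.Get_rank() == 1:
+        p = comm.recv(source=0)             # arrives as a Point instance
+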
***************
Other Documents
@@ -197,13 +329,62 @@ Other Documents
Below links would be useful to understand *mpi4py-ve* in more detail:
-* `mpi4py-ve tutorial `_
+* `mpi4py-ve tutorial `_
***********
Restriction
***********
-* The value specified by np must not exceed the number of VE cards.
-* The current version of *mpi4py-ve* does not support some functions that are listed in the section "List of Unsupported Functions" of `mpi4py-ve tutorial `_.
+* The current version of *mpi4py-ve* does not support some functions that are listed in the section "List of Unsupported Functions" of `mpi4py-ve tutorial `_.
+* Communication of bool arrays between NumPy and NLCPy will fail because the two libraries use different numbers of bytes for the bool type; a workaround is sketched below.
+
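+A minimal workaround sketch (our illustration, not an API of the package): cast boolean arrays to a fixed-width integer type for the transfer:
+
+.. code-block:: python
+
+    from mpi4pyve import MPI
+    import numpy as np
+    import nlcpy as vp
+
+    comm = MPI.COMM_WORLD
+    n = 8  # hypothetical array length
+    if comm.Get_rank() == 0:
+        flags = np.array([True, False] * (n // 2))
+        comm.Send([flags.astype('int32'), MPI.INT], dest=1)  # same element size on both sides
+    else:
+        tmp = vp.empty(n, dtype='int32')
+        comm.Recv([tmp, MPI.INT], source=0)
+        flags = tmp.astype('bool')                           # back to booleans on the VE side
+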
+*******
+Notices
+*******
+* If you import NLCPy before calling MPI_Init()/MPI_Init_thread(), a runtime error will be raised.
+
+ Not recommended usage: ::
+
+ $ mpirun -veo -np 1 $(which python) -c "import nlcpy; from mpi4pyve import MPI"
+ RuntimeError: NLCPy must be import after MPI initialization
+
+ Recommended usage: ::
+
+ $ mpirun -veo -np 1 $(which python) -c "from mpi4pyve import MPI; import nlcpy"
+
+ MPI_Init() or MPI_Init_thread() is called when you import the MPI module from the mpi4pyve package.
+
+* If you use the Lock/Lock_all functions for one-sided communication on NLCPy array data, you need to insert NLCPy synchronization control.
+
+ Synchronization usage:
+
+ .. code-block:: python
+
+ import mpi4pyve
+ from mpi4pyve import MPI
+ import nlcpy as vp
+
+ comm = MPI.COMM_WORLD
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ array = vp.array(0, dtype=int)
+
+ if rank == 0:
+ win_n = MPI.Win.Create(array, comm=MPI.COMM_WORLD)
+ else:
+ win_n = MPI.Win.Create(None, comm=MPI.COMM_WORLD)
+ if rank == 0:
+ array.fill(1)
+ array.venode.synchronize()
+        comm.Barrier()
+ if rank != 0:
+ comm.Barrier()
+ win_n.Lock(MPI.LOCK_EXCLUSIVE, 0)
+ win_n.Get([array, MPI.INT], 0)
+ win_n.Unlock(0)
+ assert array == 1
+ comm.Barrier()
+ win_n.Free()
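+
+  A note on the ``synchronize()`` call above: NLCPy may execute array operations on the VE asynchronously, so ``array.venode.synchronize()`` ensures the update to ``array`` has completed before the other ranks read it through the window.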
*******
License
diff --git a/conf/mpiconfig.py b/conf/mpiconfig.py
index 4194146..a33b548 100644
--- a/conf/mpiconfig.py
+++ b/conf/mpiconfig.py
@@ -111,7 +111,7 @@ def setup_library_info(self, options, environ):
else:
section = mpiopt
if not filename: filename = "mpi.cfg"
- if not section: section = "mpi"
+ if not section: section = "necmpi"
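+        # e.g. mpi.cfg would then provide a [necmpi] section; the values
+        # below are illustrative only (keys follow mpi4py's mpi.cfg style):
+        #   [necmpi]
+        #   mpi_dir = /opt/nec/ve/mpi/2.22.0
+        #   libraries = mpi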
mach = platform.machine()
arch = platform.architecture()[0]
diff --git a/coverage_device_test/coverage_device_util.py b/coverage_device_test/coverage_device_util.py
new file mode 100644
index 0000000..010de1b
--- /dev/null
+++ b/coverage_device_test/coverage_device_util.py
@@ -0,0 +1,118 @@
+from mpi4pyve import MPI
+from numpy.testing import assert_array_equal
+from functools import reduce
+import os
+
+
+if os.environ.get("MPI4PYVE_TEST_PATTERN") == "small":
+ _shapes = [5, (2, 20), (2, 10, 20), (2, 10, 20, 4)]
+else:
+ _shapes = [5, 5**10,
+ (2,), (2, 2), (2, 2, 2), (2, 20), (20, 2),
+ (2, 10, 20), (10, 2, 20), (10, 20, 2),
+ (2, 10, 20, 4), (10, 2, 4, 20), (4, 10, 20, 2)]
+
+np = None
+vp = None
+_devices = None
+
+if os.environ.get("MPI4PYVE_TEST_DEVICE") == "vh":
+ import numpy as _np
+ _devices = [_np]
+ np = _np
+elif os.environ.get("MPI4PYVE_TEST_DEVICE") == "ve":
+ import nlcpy as _vp
+ import numpy as _np
+ _devices = [_vp]
+ vp = _vp
+ np = _np
+else:
+ import nlcpy as _vp
+ import numpy as _np
+ _devices = [_np, _vp]
+ vp = _vp
+ np = _np
+
+_dtypes = ['int32', 'int64',
+ 'uint32', 'uint64',
+ 'float32', 'float64',
+ 'complex64', 'complex128',
+ 'bool']
+_flush_dtypes = ['int32']
+_order = ['C', 'F']
+_patterns = [(dev1, dev2, shape, dtype, order)
+ for dev1 in _devices
+ for dev2 in _devices
+ for shape in _shapes
+ for dtype in _dtypes
+ for order in _order]
+_flush_test_patterns = [(dev1, dtype, order)
+ for dev1 in _devices
+ for dtype in _flush_dtypes
+ for order in _order]
+_default_fill_value = -1
+_rbuf_np_bool_size_adjust = 4
+
+
+def _get_array(a):
+ if vp is not None and isinstance(a, vp.ndarray):
+ return a.get()
+ return a
+
+
+def _get_type(dtype):
+ if dtype == 'int32':
+ return MPI.INT
+ elif dtype == 'int64':
+ return MPI.LONG
+ elif dtype == 'uint32':
+ return MPI.UNSIGNED
+ elif dtype == 'uint64':
+ return MPI.UNSIGNED_LONG
+ elif dtype == 'float32':
+ return MPI.FLOAT
+ elif dtype == 'float64':
+ return MPI.DOUBLE
+ elif dtype == 'complex64':
+ return MPI.COMPLEX
+ elif dtype == 'complex128':
+ return MPI.DOUBLE_COMPLEX
+ elif dtype == 'bool':
+ return MPI.BOOL
+
+
+def _get_sbuf(dev, shape, dtype, order):
+ if dtype != 'bool':
+ if isinstance(shape, tuple) or isinstance(shape, list):
+ n = reduce((lambda x, y: x * y), shape)
+ return dev.arange(n, dtype=dtype).reshape(shape, order=order)
+ else:
+ return dev.arange(shape, dtype=dtype).reshape(shape, order=order)
+ else:
+ return dev.random.randint(0, 2, shape).astype('?', order=order)
+
+
+def _get_rbuf(dev, shape, dtype, order, fromdev=None):
+ if dtype == 'bool':
+ val = False
+ else:
+ val = _default_fill_value
+ return dev.full(shape, val, dtype=dtype, order=order)
+
+
+def _assert_array(a, desired):
+ if desired is None:
+ return np.all(a == 0)
+ else:
+ return assert_array_equal(a, desired)
+
+
+# MPINODEID is expected to be set per host by the MPI runtime; if ranks
+# report different node IDs, the job spans multiple hosts.
+IS_MULTI_HOST = None
+
+try:
+ if IS_MULTI_HOST is None:
+ comm = MPI.COMM_WORLD
+ nodes = comm.allgather(os.environ['MPINODEID'])
+ IS_MULTI_HOST = (len(list(set(nodes))) != 1)
+except KeyError:
+ pass
diff --git a/coverage_device_test/run.sh b/coverage_device_test/run.sh
new file mode 100644
index 0000000..6086034
--- /dev/null
+++ b/coverage_device_test/run.sh
@@ -0,0 +1,126 @@
+#!/bin/bash
+
+TEST='full'
+NP=4
+PYTHON_CMD=python
+MPIRUN_CMD=mpirun
+TEST_DEVICE='ve_vh'
+
+function usage() {
+ echo 'Usage: run.sh [ARGUMENT]...'
+ echo ''
+ echo ' ARGUMENT:'
+ echo ' --test MODE or --test=MODE: specify the test MODE'
+ echo ' available MODE are [full|small]'
+ echo ' (default: full)'
+ echo ' --np NO or --np=NO: specify the total number of processes'
+ echo ' (default: 4)'
+ echo ' --python-cmd CMD or --python-cmd=CMD: specify python command'
+ echo ' (default: python)'
+ echo ' --mpirun-cmd CMD or --mpirun-cmd=CMD: specify mpirun command'
+ echo ' (default: mpirun)'
+ echo ' --device DEVICE or --device=DEVICE: specify test device'
+ echo ' available DEVICE are [ve_vh|ve|vh]'
+ echo ' (default: ve_vh)'
+}
+
+while (( $# > 0 ))
+do
+ case $1 in
+ --test | --test=*)
+ if [[ "$1" =~ ^--test= ]]; then
+ TEST=$(echo $1 | sed -e 's/^--test=//')
+ elif [[ -z "$2" ]] || [[ "$2" =~ ^-+ ]]; then
+ usage
+ exit 1
+ else
+ TEST="$2"
+ shift
+ fi
+ if [[ $TEST != "full" ]] && [[ $TEST != "small" ]]; then
+ usage
+ exit 1
+ fi
+ ;;
+ --np | --np=*)
+ if [[ "$1" =~ ^--np= ]]; then
+ NP=$(echo $1 | sed -e 's/^--np=//')
+ elif [[ -z "$2" ]] || [[ "$2" =~ ^-+ ]]; then
+ usage
+ exit 1
+ else
+ NP="$2"
+ shift
+ fi
+            if ! [[ ${NP} =~ ^[0-9]+$ ]]; then
+                usage
+                exit 1
+            fi
+ ;;
+ --python-cmd | --python-cmd=*)
+ if [[ "$1" =~ ^--python-cmd= ]]; then
+ PYTHON_CMD=$(echo $1 | sed -e 's/^--python-cmd=//')
+ elif [[ -z "$2" ]] || [[ "$2" =~ ^-+ ]]; then
+ usage
+ exit 1
+ else
+ PYTHON_CMD="$2"
+ shift
+ fi
+ if [[ -z ${PYTHON_CMD} ]]; then
+ usage
+ exit 1
+ fi
+ ;;
+ --mpirun-cmd | --mpirun-cmd=*)
+ if [[ "$1" =~ ^--mpirun-cmd= ]]; then
+ MPIRUN_CMD=$(echo $1 | sed -e 's/^--mpirun-cmd=//')
+ elif [[ -z "$2" ]] || [[ "$2" =~ ^-+ ]]; then
+ usage
+ exit 1
+ else
+ MPIRUN_CMD="$2"
+ shift
+ fi
+ if [[ -z ${MPIRUN_CMD} ]]; then
+ usage
+ exit 1
+ fi
+ ;;
+ --device | --device=*)
+ if [[ "$1" =~ ^--device= ]]; then
+ TEST_DEVICE=$(echo $1 | sed -e 's/^--device=//')
+ elif [[ -z "$2" ]] || [[ "$2" =~ ^-+ ]]; then
+ usage
+ exit 1
+ else
+ TEST_DEVICE="$2"
+ shift
+ fi
+ if [[ $TEST_DEVICE != "ve_vh" ]] && [[ $TEST_DEVICE != "ve" ]] && [[ $TEST_DEVICE != "vh" ]]; then
+ usage
+ exit 1
+ fi
+ ;;
+ -h | --help)
+ usage
+ exit 1
+ ;;
+ esac
+ shift
+done
+
+
+echo 'VE_NLCPY_NODELIST='${VE_NLCPY_NODELIST}
+
+export NMPI_USE_COMMAND_SEARCH_PATH=ON
+set -x
+MPI4PYVE_TEST_PATTERN=${TEST} MPI4PYVE_TEST_DEVICE=${TEST_DEVICE} ${MPIRUN_CMD} -veo -np ${NP} ${PYTHON_CMD} test_coverage_device_comm.py
+MPI4PYVE_TEST_PATTERN=${TEST} MPI4PYVE_TEST_DEVICE=${TEST_DEVICE} ${MPIRUN_CMD} -veo -np ${NP} ${PYTHON_CMD} test_coverage_device_file.py
+MPI4PYVE_TEST_PATTERN=${TEST} MPI4PYVE_TEST_DEVICE=${TEST_DEVICE} ${MPIRUN_CMD} -veo -np ${NP} ${PYTHON_CMD} test_coverage_device_win.py
+MPI4PYVE_TEST_PATTERN=${TEST} MPI4PYVE_TEST_DEVICE=${TEST_DEVICE} ${MPIRUN_CMD} -veo -np ${NP} ${PYTHON_CMD} test_coverage_device_datatype.py
+MPI4PYVE_TEST_PATTERN=${TEST} MPI4PYVE_TEST_DEVICE=${TEST_DEVICE} ${MPIRUN_CMD} -veo -np ${NP} ${PYTHON_CMD} test_coverage_device_message.py
+set +x
+
diff --git a/coverage_device_test/test_coverage_device_comm.py b/coverage_device_test/test_coverage_device_comm.py
new file mode 100644
index 0000000..1cb1b99
--- /dev/null
+++ b/coverage_device_test/test_coverage_device_comm.py
@@ -0,0 +1,1618 @@
+import unittest # NOQA
+from unittest import TestCase # NOQA
+from parameterized import parameterized # NOQA
+from mpi4pyve import MPI # NOQA
+from numpy.testing import (
+ assert_equal,
+)
+import coverage_device_util # NOQA
+from coverage_device_util import (
+ _patterns, _get_array, _get_type, _get_sbuf,
+ _get_rbuf, _assert_array, vp, np
+)
+
+
+def create_topo_comms(comm):
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ # Cartesian
+    n = int(round(size ** (1 / 2.0)))
+    m = int(round(size ** (1 / 3.0)))
+ if m * m * m == size:
+ dims = [m, m, m]
+ elif n * n == size:
+ dims = [n, n]
+ else:
+ dims = [size]
+ periods = [True] * len(dims)
+ yield comm.Create_cart(dims, periods=periods)
+ # Graph
+ index, edges = [0], []
+ for i in range(size):
+ pos = index[-1]
+ index.append(pos + 2)
+ edges.append((i - 1) % size)
+ edges.append((i + 1) % size)
+ yield comm.Create_graph(index, edges)
+ # Dist Graph
+ sources = [(rank - 2) % size, (rank - 1) % size]
+ destinations = [(rank + 1) % size, (rank + 2) % size]
+ yield comm.Create_dist_graph_adjacent(sources, destinations)
+
+
+def get_neighbors_count(comm):
+ topo = comm.Get_topology()
+ if topo == MPI.CART:
+ ndim = comm.Get_dim()
+ return 2 * ndim, 2 * ndim
+ if topo == MPI.GRAPH:
+ rank = comm.Get_rank()
+ nneighbors = comm.Get_neighbors_count(rank)
+ return nneighbors, nneighbors
+ if topo == MPI.DIST_GRAPH:
+ indeg, outdeg, w = comm.Get_dist_neighbors_count()
+ return indeg, outdeg
+ return 0, 0
+
+
+class TestComm(unittest.TestCase):
+
+ COMM = MPI.COMM_WORLD
+
+ @parameterized.expand(_patterns)
+ def test_Send_Recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Send([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_send_recv_offset(self, dev1, dev2, shape, dtype, order):
+ if dev1 is np or dev2 is np:
+ self.skipTest('buffer is not contiguous case is not testable')
+ if isinstance(shape, tuple):
+ self.skipTest('unsupported tuple offset case is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ offset = 2
+
+ desired = _get_sbuf(np, shape, dtype, order)[offset:]
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Send([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape - offset, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Sendrecv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ a = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Sendrecv([x, mtype], dest=peer, sendtag=0, recvbuf=[a, mtype],
+ source=peer, recvtag=0)
+ if peer != MPI.PROC_NULL:
+ self.assertFalse(_assert_array(a, x))
+ else:
+ y = dev1.array(desired, dtype=dtype, order=order)
+ z = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Sendrecv([y, mtype], dest=peer, sendtag=0, recvbuf=[z, mtype],
+ source=peer, recvtag=0)
+ if peer != MPI.PROC_NULL:
+ self.assertFalse(_assert_array(z, y))
+
+ @parameterized.expand(_patterns)
+ def test_Sendrecv_replace(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired_0 = _get_sbuf(np, shape, dtype, order)
+ desired_1 = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired_0, mtype], root=0)
+ comm.Bcast([desired_1, mtype], root=1)
+
+ if rank % 2 == 0:
+ x = dev1.array(desired_1, dtype=dtype, order=order)
+ comm.Sendrecv_replace([x, mtype], dest=peer, sendtag=0, source=peer,
+ recvtag=0)
+ if peer != MPI.PROC_NULL:
+ self.assertFalse(_assert_array(x, desired_0))
+ else:
+ y = dev2.array(desired_0, dtype=dtype, order=order)
+ comm.Sendrecv_replace([y, mtype], dest=peer, sendtag=0, source=peer,
+ recvtag=0)
+ if peer != MPI.PROC_NULL:
+ self.assertFalse(_assert_array(y, desired_1))
+
+ @parameterized.expand(_patterns)
+ def test_Isend_Recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Isend([x, mtype], dest=peer).Wait()
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Send_Irecv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Send([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Irecv([y, mtype], source=peer).Wait()
+
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Send_init(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ req = comm.Send_init([x, mtype], dest=peer)
+ req.Start()
+ req.Wait()
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Recv_init(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+            comm.Send([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ req = comm.Recv_init([y, mtype], source=peer)
+ req.Start()
+ req.Wait()
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Rsend_Recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Rsend([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Ssend_Recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Ssend([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Issend_Recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Issend([x, mtype], dest=peer).Wait()
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Irsend_Recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.Irsend([x, mtype], dest=peer).Wait()
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Ssend_init(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ req = comm.Ssend_init([x, mtype], dest=peer)
+ req.Start()
+ req.Wait()
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Rsend_init(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ req = comm.Rsend_init([x, mtype], dest=peer)
+ req.Start()
+ req.Wait()
+ else:
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
+ comm.Recv([y, mtype], source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Bcast(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ sbuf = _get_sbuf(np, size, dtype, order)
+ comm.Bcast([sbuf, mtype], root=root)
+
+ if rank == root:
+ x = dev1.array(sbuf, dtype=dtype, order=order)
+ else:
+ x = _get_rbuf(dev2, size, dtype, order, dev1)
+
+ comm.Bcast([x, mtype], root=root)
+ self.assertFalse(_assert_array(x, sbuf))
+
+ @parameterized.expand(_patterns)
+ def test_Gather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype,
+ order=order)
+ else:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.array([], dtype=dtype, order=order)
+ comm.Gather([sbuf, mtype], [rbuf, mtype], root=root)
+
+ if rank == root:
+ desired = np.full((size, root + 1), root, dtype=dtype,
+ order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Gatherv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype,
+ order=order)
+ else:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.array([], dtype=dtype)
+ comm.Gatherv([sbuf, mtype], [rbuf, mtype], root=root)
+
+ if rank == root:
+ desired = np.full((size, root + 1), root, dtype=dtype,
+ order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Scatter(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full((size, size), root, dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ else:
+ sbuf = dev1.array([], dtype=dtype)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ comm.Scatter([sbuf, mtype], [rbuf, mtype], root=root)
+
+ desired = np.full(size, root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Scatterv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full((size, size), root, dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ else:
+ sbuf = dev1.array([], dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ comm.Scatterv([sbuf, mtype], [rbuf, mtype], root=root)
+
+ desired = np.full(size, root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Allgather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype, order=order)
+ comm.Allgather([sbuf, mtype], [rbuf, mtype])
+
+ desired = np.full((size, root + 1), root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Allgatherv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ if isinstance(shape, tuple) or isinstance(shape, list):
+ self.skipTest('shape case in tuple or list is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+        # shrink very large buffers to avoid the job being terminated by signal 9
+ if size > 4 and np.isscalar(shape) and shape >= 5**10:
+ shape = 5**5
+
+ for root in range(size):
+ sbuf = _get_sbuf(dev1, shape, dtype, order)
+ rbuf = _get_rbuf(dev2, size * shape, dtype, order, dev1)
+ self.COMM.Allgatherv(sbuf, rbuf)
+
+ actual_sbuf = _get_sbuf(np, shape, dtype, order)
+ actual_rbuf = _get_rbuf(np, size * shape, dtype, order, np)
+            self.COMM.Allgatherv(actual_sbuf, actual_rbuf)
+            assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Alltoall(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+
+ for root in range(size):
+ sbuf = _get_sbuf(dev1, (size, root + 1), dtype, order)
+ rbuf = _get_rbuf(dev2, (size, root + 1), dtype, order, dev1)
+ self.COMM.Alltoall(sbuf, rbuf)
+
+ actual_sbuf = np.array(sbuf, dtype=dtype, order=order)
+ actual_rbuf = _get_rbuf(np, (size, root + 1), dtype, order, np)
+ self.COMM.Alltoall(actual_sbuf, actual_rbuf)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Alltoallv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+
+ for root in range(size):
+ sbuf = _get_sbuf(dev1, (size, size), dtype, order)
+ rbuf = _get_rbuf(dev2, (size, size), dtype, order, dev1)
+ self.COMM.Alltoallv(sbuf, rbuf)
+
+ actual_sbuf = np.array(sbuf, dtype=dtype, order=order)
+ actual_rbuf = _get_rbuf(np, (size, size), dtype, order, np)
+ self.COMM.Alltoallv(actual_sbuf, actual_rbuf)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Alltoallw(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ sdt, rdt = mtype, mtype
+ sbuf = _get_sbuf(dev1, (size, 1), dtype, order)
+ rbuf = _get_rbuf(dev2, (size, 1), dtype, order, dev1)
+ sdsp = list(range(0, size * sbuf.itemsize, sbuf.itemsize))
+ rdsp = list(range(0, size * rbuf.itemsize, rbuf.itemsize))
+ smsg = (sbuf, ([1] * size, sdsp), [sdt] * size)
+ rmsg = (rbuf, ([1] * size, rdsp), [rdt] * size)
+ self.COMM.Alltoallw(smsg, rmsg)
+
+ actual_sbuf = np.array(sbuf, dtype=dtype, order=order)
+ actual_rbuf = _get_rbuf(np, (size, 1), dtype, order, np)
+ actual_sdsp = list(range(0, size * actual_sbuf.itemsize,
+ actual_sbuf.itemsize))
+ actual_rdsp = list(range(0, size * actual_rbuf.itemsize,
+ actual_rbuf.itemsize))
+ actual_smsg = (actual_sbuf, ([1] * size, actual_sdsp), [sdt] * size)
+ actual_rmsg = (actual_rbuf, ([1] * size, actual_rdsp), [rdt] * size)
+ self.COMM.Alltoallw(actual_smsg, actual_rmsg)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Reduce(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = dev1.array(range(size), dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+
+ comm.Reduce([sbuf, mtype], [rbuf, mtype], op, root)
+
+ actual_sbuf = np.array(range(size), dtype=dtype, order=order)
+ actual_rbuf = np.full(size, -1, dtype=dtype, order=order)
+ comm.Reduce([actual_sbuf, mtype], [actual_rbuf, mtype], op,
+ root)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Allreduce(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = dev1.array(range(size), dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+
+ comm.Allreduce([sbuf, mtype], [rbuf, mtype], op)
+
+ actual_sbuf = np.array(range(size), dtype=dtype, order=order)
+ actual_rbuf = np.full(size, -1, dtype=dtype, order=order)
+ comm.Allreduce([actual_sbuf, mtype], [actual_rbuf, mtype], op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Reduce_scatter(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ rcnt = list(range(1, size + 1))
+ sbuf = dev1.array([rank + 1] * sum(rcnt), dtype=dtype,
+ order=order)
+ rbuf = dev2.full(rank + 1, -1, dtype=dtype, order=order)
+
+ comm.Reduce_scatter([sbuf, mtype], [rbuf, mtype], None, op)
+
+ actual_sbuf = np.array([rank + 1] * sum(rcnt), dtype=dtype,
+ order=order)
+ actual_rbuf = np.full(rank + 1, -1, dtype=dtype, order=order)
+ comm.Reduce_scatter([actual_sbuf, mtype],
+ [actual_rbuf, mtype], None, op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Reduce_scatter_block(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ for rcnt in range(1, size):
+ sbuf = dev1.array([rank] * rcnt * size, dtype=dtype,
+ order=order)
+ rbuf = dev2.full(rcnt, -1, dtype=dtype, order=order)
+ if op == MPI.PROD:
+ sbuf = dev1.array([rank + 1] * rcnt * size,
+ dtype=dtype, order=order)
+ comm.Reduce_scatter_block([sbuf, mtype], [rbuf, mtype],
+ op=op)
+
+ actual_sbuf = np.array([rank] * rcnt * size, dtype=dtype,
+ order=order)
+ actual_rbuf = np.full(rcnt, -1, dtype=dtype, order=order)
+ if op == MPI.PROD:
+ actual_sbuf = dev1.array([rank + 1] * rcnt * size,
+ dtype=dtype, order=order)
+ comm.Reduce_scatter_block([actual_sbuf, mtype],
+ [actual_rbuf, mtype], op=op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_send_recv(self, dev1, dev2, shape, dtype, order):
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.send(x, dest=peer)
+ else:
+ y = comm.recv(source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_Ibcast(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ sbuf = _get_sbuf(np, size, dtype, order)
+ comm.Bcast([sbuf, mtype], root=root)
+
+ if rank == root:
+ x = dev1.array(sbuf, dtype=dtype, order=order)
+ else:
+ x = _get_rbuf(dev2, size, dtype, order, dev1)
+
+ comm.Ibcast([x, mtype], root=root).Wait()
+ self.assertFalse(_assert_array(x, sbuf))
+
+ @parameterized.expand(_patterns)
+ def test_Igather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype,
+ order=order)
+ else:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.array([], dtype=dtype, order=order)
+ comm.Igather([sbuf, mtype], [rbuf, mtype], root=root).Wait()
+
+ if rank == root:
+ desired = np.full((size, root + 1), root, dtype=dtype,
+ order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Igatherv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype,
+ order=order)
+ else:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.array([], dtype=dtype, order=order)
+ comm.Igatherv([sbuf, mtype], [rbuf, mtype], root=root).Wait()
+
+ if rank == root:
+ desired = np.full((size, root + 1), root, dtype=dtype,
+ order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Iscatter(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full((size, size), root, dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ else:
+ sbuf = dev1.array([], dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ comm.Iscatter([sbuf, mtype], [rbuf, mtype], root=root).Wait()
+
+ desired = np.full(size, root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Iscatterv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full((size, size), root, dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ else:
+ sbuf = dev1.array([], dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ comm.Iscatterv([sbuf, mtype], [rbuf, mtype], root=root).Wait()
+
+ desired = np.full(size, root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Iallgather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype, order=order)
+ comm.Iallgather([sbuf, mtype], [rbuf, mtype]).Wait()
+
+ desired = np.full((size, root + 1), root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_Iallgatherv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ actual_sbuf = _get_sbuf(np, size, dtype, order)
+ actual_rbuf = _get_rbuf(np, size * size, dtype, order, np)
+ self.COMM.Iallgatherv(actual_sbuf, actual_rbuf).Wait()
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, size * size, dtype, order, dev1)
+ self.COMM.Iallgatherv(sbuf, rbuf).Wait()
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Ialltoall(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+
+ for root in range(size):
+ actual_sbuf = _get_sbuf(np, (size, size), dtype, order)
+ actual_rbuf = _get_rbuf(np, (size, size), dtype, order, np)
+ self.COMM.Ialltoall(actual_sbuf, actual_rbuf).Wait()
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (size, size), dtype, order, dev1)
+ self.COMM.Ialltoall(sbuf, rbuf).Wait()
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Ialltoallv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+
+ for root in range(size):
+ actual_sbuf = _get_sbuf(np, (size, size), dtype, order)
+ actual_rbuf = _get_rbuf(np, (size, size), dtype, order, np)
+ self.COMM.Ialltoallv(actual_sbuf, actual_rbuf).Wait()
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (size, size), dtype, order, dev1)
+ self.COMM.Ialltoallv(sbuf, rbuf).Wait()
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Ialltoallw(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ sdt, rdt = mtype, mtype
+
+ actual_sbuf = _get_sbuf(np, (size, 1), dtype, order)
+ actual_rbuf = _get_rbuf(np, (size, 1), dtype, order, np)
+ actual_sdsp = list(range(0, size * actual_sbuf.itemsize,
+ actual_sbuf.itemsize))
+ actual_rdsp = list(range(0, size * actual_rbuf.itemsize,
+ actual_rbuf.itemsize))
+ actual_smsg = (actual_sbuf, ([1] * size, actual_sdsp), [sdt] * size)
+ actual_rmsg = (actual_rbuf, ([1] * size, actual_rdsp), [rdt] * size)
+ self.COMM.Ialltoallw(actual_smsg, actual_rmsg).Wait()
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (size, 1), dtype, order, dev1)
+ sdsp = list(range(0, size * sbuf.itemsize, sbuf.itemsize))
+ rdsp = list(range(0, size * rbuf.itemsize, rbuf.itemsize))
+ smsg = (sbuf, ([1] * size, sdsp), [sdt] * size)
+ rmsg = (rbuf, ([1] * size, rdsp), [rdt] * size)
+ self.COMM.Ialltoallw(smsg, rmsg).Wait()
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Ireduce(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = dev1.array(range(size), dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+
+ comm.Ireduce([sbuf, mtype], [rbuf, mtype], op, root).Wait()
+
+ actual_sbuf = np.array(range(size), dtype=dtype, order=order)
+ actual_rbuf = np.full(size, -1, dtype=dtype, order=order)
+ comm.Ireduce([actual_sbuf, mtype],
+ [actual_rbuf, mtype], op, root).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Iallreduce(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = dev1.array(range(size), dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+
+ comm.Iallreduce([sbuf, mtype], [rbuf, mtype], op).Wait()
+
+ actual_sbuf = np.array(range(size), dtype=dtype, order=order)
+ actual_rbuf = np.full(size, -1, dtype=dtype, order=order)
+ comm.Iallreduce([actual_sbuf, mtype],
+ [actual_rbuf, mtype], op).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Ireduce_scatter_block(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ mtype = _get_type(dtype)
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ for rcnt in range(1, size):
+ sbuf = dev1.array([rank] * rcnt * size, dtype=dtype,
+ order=order)
+ rbuf = dev2.full(rcnt, -1, dtype=dtype, order=order)
+ if op == MPI.PROD:
+ sbuf = dev1.array([rank + 1] * rcnt * size,
+ dtype=dtype, order=order)
+ comm.Ireduce_scatter_block([sbuf, mtype],
+ [rbuf, mtype],
+ op=op).Wait()
+
+ actual_sbuf = np.array([rank] * rcnt * size, dtype=dtype,
+ order=order)
+ actual_rbuf = np.full(rcnt, -1, dtype=dtype, order=order)
+ if op == MPI.PROD:
+ actual_sbuf = np.array([rank + 1] * rcnt * size,
+ dtype=dtype, order=order)
+ comm.Ireduce_scatter_block([actual_sbuf, mtype],
+ [actual_rbuf, mtype],
+ op=op).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_ssend_recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.ssend(x, dest=peer)
+ else:
+ y = comm.recv(None, source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_sendrecv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ a = comm.sendrecv(x, peer, source=peer)
+ if peer != MPI.PROC_NULL:
+ self.assertFalse(_assert_array(a, x))
+ else:
+ y = dev2.array(desired, dtype=dtype, order=order)
+ z = comm.sendrecv(y, peer, source=peer)
+ if peer != MPI.PROC_NULL:
+ self.assertFalse(_assert_array(z, y))
+
+ @parameterized.expand(_patterns)
+ def test_isend_irecv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+ mtype = _get_type(dtype)
+
+        # keep the message small enough for irecv's default receive buffer
+ if not (isinstance(shape, tuple) or isinstance(shape, list)):
+ if shape >= 5 ** 5:
+ shape = 5 ** 4
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.isend(x, dest=peer).wait()
+ else:
+ y = comm.irecv(source=peer).wait()
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_issend_recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+ if rank % 2 == 0:
+ x = dev1.array(desired, dtype=dtype, order=order)
+ comm.issend(x, dest=peer).wait()
+ else:
+ y = comm.recv(source=peer)
+ self.assertFalse(_assert_array(y, desired))
+
+ @parameterized.expand(_patterns)
+ def test_bcast(self, dev1, dev2, shape, dtype, order):
+ comm = self.COMM
+ size = comm.Get_size()
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+
+ x = dev1.array(desired, dtype=dtype, order=order)
+        x = comm.bcast(x, root=0)
+ self.assertFalse(_assert_array(x, desired))
+
+ @parameterized.expand(_patterns)
+ def test_gather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype,
+ order=order)
+ else:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.array([], dtype=dtype, order=order)
+ rbuf = comm.gather(sbuf, root=root)
+
+ if rank == root:
+ desired = np.full((size, root + 1), root, dtype=dtype,
+ order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_scatter(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full((size, size), root, dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ else:
+ sbuf = dev1.array([], dtype=dtype, order=order)
+ rbuf = dev2.full(size, -1, dtype=dtype, order=order)
+ rbuf = comm.scatter(sbuf, root=root)
+
+ desired = np.full(size, root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_allgather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ for root in range(size):
+ if rank == root:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype,
+ order=order)
+ else:
+ sbuf = dev1.full(root + 1, root, dtype=dtype, order=order)
+ rbuf = dev2.full((size, root + 1), -1, dtype=dtype,
+ order=order)
+ rbuf = comm.allgather(sbuf)
+
+ desired = np.full((size, root + 1), root, dtype=dtype, order=order)
+ assert_equal(_get_array(rbuf), desired)
+
+ @parameterized.expand(_patterns)
+ def test_alltoall(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+
+ for root in range(size):
+ sbuf = dev1.full((size, root + 1), root, dtype=dtype, order=order)
+ rbuf = self.COMM.alltoall(sbuf)
+
+ actual_sbuf = np.full((size, root + 1), root, dtype=dtype,
+ order=order)
+ actual_rbuf = self.COMM.alltoall(actual_sbuf)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_reduce(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD):
+ sbuf = dev1.array(range(size), dtype=dtype, order=order)
+ rbuf = comm.reduce(sbuf, op, root)
+
+ actual_sbuf = np.array(range(size), dtype=dtype, order=order)
+ actual_rbuf = comm.reduce(actual_sbuf, op, root)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_allreduce(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+
+ for root in range(size):
+ for op in (MPI.SUM, MPI.PROD):
+ sbuf = dev1.array(range(size), dtype=dtype, order=order)
+ rbuf = comm.allreduce(sbuf, op)
+
+ actual_sbuf = np.array(range(size), dtype=dtype, order=order)
+ actual_rbuf = comm.allreduce(actual_sbuf, op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Scan(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ mtype = _get_type(dtype)
+ size = self.COMM.Get_size()
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = _get_sbuf(dev1, size, dtype, order)
+ rbuf = _get_rbuf(dev2, size, dtype, order, dev1)
+ comm.Scan([sbuf, mtype], [rbuf, mtype], op)
+
+ actual_sbuf = _get_sbuf(np, size, dtype, order)
+ actual_rbuf = _get_rbuf(np, size, dtype, order, np)
+ comm.Scan([actual_sbuf, mtype], [actual_rbuf, mtype], op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Exscan(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ mtype = _get_type(dtype)
+ size = self.COMM.Get_size()
+
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = _get_sbuf(dev1, size, dtype, order)
+ rbuf = _get_rbuf(dev2, size, dtype, order, dev1)
+ comm.Exscan([sbuf, mtype], [rbuf, mtype], op)
+
+ actual_sbuf = _get_sbuf(np, size, dtype, order)
+ actual_rbuf = _get_rbuf(np, size, dtype, order, np)
+ comm.Exscan([actual_sbuf, mtype], [actual_rbuf, mtype], op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Iscan(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+
+ comm = self.COMM
+ mtype = _get_type(dtype)
+ size = self.COMM.Get_size()
+
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = _get_sbuf(dev1, size, dtype, order)
+ rbuf = _get_rbuf(dev2, size, dtype, order, dev1)
+ comm.Iscan([sbuf, mtype], [rbuf, mtype], op).Wait()
+
+ actual_sbuf = _get_sbuf(np, size, dtype, order)
+ actual_rbuf = _get_rbuf(np, size, dtype, order, np)
+ comm.Iscan([actual_sbuf, mtype], [actual_rbuf, mtype], op).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Iexscan(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ mtype = _get_type(dtype)
+ size = self.COMM.Get_size()
+
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ sbuf = _get_sbuf(dev1, size, dtype, order)
+ rbuf = _get_rbuf(dev2, size, dtype, order, dev1)
+ comm.Iexscan([sbuf, mtype], [rbuf, mtype], op).Wait()
+
+ actual_sbuf = _get_sbuf(np, size, dtype, order)
+ actual_rbuf = _get_rbuf(np, size, dtype, order, np)
+ comm.Iexscan([actual_sbuf, mtype], [actual_rbuf, mtype], op).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_scan(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = self.COMM.Get_size()
+
+ for op in (MPI.SUM, MPI.PROD):
+ sbuf = _get_sbuf(dev1, size, dtype, order)
+ rbuf = comm.scan(sbuf, op)
+
+ actual_sbuf = _get_sbuf(np, size, dtype, order)
+ actual_rbuf = comm.scan(actual_sbuf, op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_exscan(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = self.COMM.Get_size()
+
+ for op in (MPI.SUM, MPI.PROD):
+ sbuf = _get_sbuf(dev1, size, dtype, order)
+ rbuf = comm.exscan(sbuf, op)
+
+ actual_sbuf = _get_sbuf(np, size, dtype, order)
+ actual_rbuf = comm.exscan(actual_sbuf, op)
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ @parameterized.expand(_patterns)
+ def test_Neighbor_allgather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+
+ actual_sbuf = _get_sbuf(np, 3, dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Neighbor_allgather([actual_sbuf, mtype], [actual_rbuf, mtype])
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Neighbor_allgather([sbuf, mtype], [rbuf, mtype])
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Neighbor_allgatherv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+ actual_sbuf = _get_sbuf(np, 3, dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Neighbor_allgatherv([actual_sbuf, mtype],
+ [actual_rbuf, mtype])
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Neighbor_allgatherv([sbuf, mtype], [rbuf, mtype])
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Ineighbor_allgather(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+
+ actual_sbuf = _get_sbuf(np, 3, dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Ineighbor_allgather([actual_sbuf, mtype],
+ [actual_rbuf, mtype]).Wait()
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Ineighbor_allgather([sbuf, mtype], [rbuf, mtype]).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Ineighbor_allgatherv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+
+ actual_sbuf = _get_sbuf(np, 3, dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Ineighbor_allgatherv([actual_sbuf, mtype],
+ [actual_rbuf, mtype]).Wait()
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Ineighbor_allgatherv([sbuf, mtype], [rbuf, mtype]).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Neighbor_alltoall(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+
+ actual_sbuf = _get_sbuf(np, (ssize, 3), dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Neighbor_alltoall([actual_sbuf, mtype], [actual_rbuf, mtype])
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Neighbor_alltoall([sbuf, mtype], [rbuf, mtype])
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Neighbor_alltoallv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+ actual_sbuf = _get_sbuf(np, (ssize, 3), dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Neighbor_alltoallv([actual_sbuf, mtype], [actual_rbuf, mtype])
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Neighbor_alltoallv([sbuf, mtype], [rbuf, mtype])
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Neighbor_alltoallw(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+ sdt, rdt = mtype, mtype
+
+ actual_sbuf = _get_sbuf(np, (ssize, 1), dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 1), dtype, order, np)
+ actual_sdsp = list(range(0, ssize * actual_sbuf.itemsize,
+ actual_sbuf.itemsize))
+ actual_rdsp = list(range(0, rsize * actual_rbuf.itemsize,
+ actual_rbuf.itemsize))
+ actual_smsg = [actual_sbuf, ([1] * ssize, actual_sdsp),
+ [sdt] * ssize]
+ actual_rmsg = (actual_rbuf, ([1] * rsize, actual_rdsp),
+ [rdt] * rsize)
+ comm.Neighbor_alltoallw(actual_smsg, actual_rmsg)
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 1), dtype, order, dev1)
+ sdsp = list(range(0, ssize * sbuf.itemsize, sbuf.itemsize))
+ rdsp = list(range(0, rsize * rbuf.itemsize, rbuf.itemsize))
+ smsg = [sbuf, ([1] * ssize, sdsp), [sdt] * ssize]
+ rmsg = (rbuf, ([1] * rsize, rdsp), [rdt] * rsize)
+ comm.Neighbor_alltoallw(smsg, rmsg)
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Ineighbor_alltoall(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+
+ actual_sbuf = _get_sbuf(np, (ssize, 3), dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Ineighbor_alltoall([actual_sbuf, mtype],
+ [actual_rbuf, mtype]).Wait()
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Ineighbor_alltoall([sbuf, mtype], [rbuf, mtype]).Wait()
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Ineighbor_alltoallv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+
+ actual_sbuf = _get_sbuf(np, (ssize, 3), dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 3), dtype, order, np)
+ comm.Ineighbor_alltoallv([actual_sbuf, mtype],
+ [actual_rbuf, mtype]).Wait()
+
+ sbuf = dev1.array(actual_sbuf, dtype=dtype, order=order)
+ rbuf = _get_rbuf(dev2, (rsize, 3), dtype, order, dev1)
+ comm.Ineighbor_alltoallv([sbuf, mtype], [rbuf, mtype]).Wait()
+
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+ @parameterized.expand(_patterns)
+ def test_Ineighbor_alltoallw(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ for comm in create_topo_comms(self.COMM):
+ rsize, ssize = get_neighbors_count(comm)
+ mtype = _get_type(dtype)
+ sbuf = _get_sbuf(dev1, (ssize, 1), dtype, order)
+ rbuf = _get_rbuf(dev2, (rsize, 1), dtype, order, dev1)
+ sdt, rdt = mtype, mtype
+ sdsp = list(range(0, ssize * sbuf.itemsize, sbuf.itemsize))
+ rdsp = list(range(0, rsize * rbuf.itemsize, rbuf.itemsize))
+ smsg = [sbuf, ([1] * ssize, sdsp), [sdt] * ssize]
+ rmsg = (rbuf, ([1] * rsize, rdsp), [rdt] * rsize)
+ comm.Ineighbor_alltoallw(smsg, rmsg).Wait()
+
+ actual_sbuf = _get_sbuf(np, (ssize, 1), dtype, order)
+ actual_rbuf = _get_rbuf(np, (rsize, 1), dtype, order, np)
+ actual_sdsp = list(range(0, ssize * actual_sbuf.itemsize,
+ actual_sbuf.itemsize))
+ actual_rdsp = list(range(0, rsize * actual_rbuf.itemsize,
+ actual_rbuf.itemsize))
+ actual_smsg = [actual_sbuf, ([1] * ssize, actual_sdsp),
+ [sdt] * ssize]
+ actual_rmsg = (actual_rbuf, ([1] * rsize, actual_rdsp),
+ [rdt] * rsize)
+ comm.Ineighbor_alltoallw(actual_smsg, actual_rmsg).Wait()
+ assert_equal(_get_array(rbuf), actual_rbuf)
+
+ comm.Free()
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/coverage_device_test/test_coverage_device_datatype.py b/coverage_device_test/test_coverage_device_datatype.py
new file mode 100644
index 0000000..8249bb8
--- /dev/null
+++ b/coverage_device_test/test_coverage_device_datatype.py
@@ -0,0 +1,47 @@
+import unittest # NOQA
+from unittest import TestCase # NOQA
+from parameterized import parameterized # NOQA
+from mpi4pyve import MPI # NOQA
+from mpi4pyve import util # NOQA
+from numpy.testing import (
+ assert_array_equal,
+)
+import coverage_device_util # NOQA
+from coverage_device_util import (
+ _patterns, _get_type, _get_sbuf, _get_rbuf, vp, np
+)
+
+
+class TestCoverageDeviceDataType(unittest.TestCase):
+
+ COMM = MPI.COMM_WORLD
+
+ @parameterized.expand(_patterns)
+ def test_PackUnpack(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+ mtype = _get_type(dtype)
+
+ desired = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast([desired, mtype], root=0)
+
+ x = dev1.array(desired, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, dev1)
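+ # Pack_size gives an upper bound in bytes for packing that many elements;
+ # the 16x factor simply leaves generous headroom in the byte staging buffer.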
+ size_x = mtype.Pack_size(x.size, self.COMM)
+ size_y = mtype.Pack_size(y.size, self.COMM)
+ z = np.empty(max(size_x, size_y) * 16, dtype='b')
+ mtype.Pack(x, z, 0, self.COMM)
+ mtype.Unpack(z, 0, y, self.COMM)
+
+ assert_array_equal(x, y)
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/coverage_device_test/test_coverage_device_file.py b/coverage_device_test/test_coverage_device_file.py
new file mode 100644
index 0000000..e1b8cc1
--- /dev/null
+++ b/coverage_device_test/test_coverage_device_file.py
@@ -0,0 +1,452 @@
+import unittest # NOQA
+from unittest import TestCase # NOQA
+from parameterized import parameterized # NOQA
+from mpi4pyve import MPI # NOQA
+from mpi4pyve import util # NOQA
+from numpy.testing import (
+ assert_equal,
+)
+import sys, os, tempfile # NOQA
+import coverage_device_util # NOQA
+from coverage_device_util import (
+ _patterns, _get_array, _get_type, _get_sbuf, _get_rbuf, vp, np, IS_MULTI_HOST
+)
+
+
+class TestCoverageDeviceFile(unittest.TestCase):
+
+ COMM = MPI.COMM_WORLD
+ FILE = MPI.FILE_NULL
+
+ prefix = 'mpi4pyve'
+ tmpname = './tmp'
+
+ def setUp(self):
+ comm = self.COMM
+ fname = None
+ if comm.Get_rank() == 0:
+ if not os.path.exists(self.tmpname):
+ try:
+ os.mkdir(self.tmpname)
+ except OSError as e:
+ if e.errno != 17: # re-raise anything other than EEXIST ("File exists")
+ raise
+ fd, fname = tempfile.mkstemp(prefix=self.prefix, dir=self.tmpname)
+ os.close(fd)
+ fname = comm.bcast(fname, 0)
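+ # Delete-on-close removes the temp file automatically when the handle is closed.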
+ amode = MPI.MODE_RDWR | MPI.MODE_CREATE
+ amode |= MPI.MODE_DELETE_ON_CLOSE
+ amode |= MPI.MODE_UNIQUE_OPEN
+ info = MPI.INFO_NULL
+ try:
+ self.FILE = MPI.File.Open(comm, fname, amode, info)
+ except Exception:
+ if comm.Get_rank() == 0:
+ os.remove(fname)
+ raise
+
+ def tearDown(self):
+ if self.FILE:
+ self.FILE.Close()
+ self.COMM.Barrier()
+
+ @parameterized.expand(_patterns)
+ def test_ReadWriteAt(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+
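+ # The Sync / Barrier / Sync sequence publishes each rank's write before any rank reads it back.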
+ fh.Write_at(rank * nbytes * 10, x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Read_at(rank * nbytes * 10, y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_ReadWriteAtAll(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ fh = self.FILE
+ nbytes = max(x.nbytes, y.nbytes)
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+
+ fh.Write_at_all(rank * nbytes * 10, x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Read_at_all(rank * nbytes * 10, y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_IReadIWriteAt(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+
+ fh.Iwrite_at(rank * nbytes * 10, x).Wait()
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Iread_at(rank * nbytes * 10, y).Wait()
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_IReadIWriteAtAll(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Iwrite_at_all(rank * nbytes * 10, x).Wait()
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Iread_at_all(rank * nbytes * 10, y).Wait()
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_ReadWrite(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Write(x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Read(y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_ReadWriteAll(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Write_all(x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Read_all(y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_IreadIwrite(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Iwrite(x).Wait()
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Iread(y).Wait()
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_IreadIwriteAll(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Iwrite_all(x).Wait()
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Iread_all(y).Wait()
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ @unittest.skipIf(IS_MULTI_HOST, 'necmpi-multi-host')
+ def test_ReadWriteShared(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Write_shared(x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Read_shared(y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ @unittest.skipIf(IS_MULTI_HOST, 'necmpi-multi-host')
+ def test_IreadIwriteShared(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Iwrite_shared(x).Wait()
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Iread_shared(y).Wait()
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ @unittest.skipIf(IS_MULTI_HOST, 'necmpi-multi-host')
+ def test_ReadWriteOrdered(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Write_ordered(x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Read_ordered(y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_ReadWriteAtAllBegin(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Write_at_all_begin(rank * nbytes * 10, x)
+ fh.Write_at_all_end(x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Read_at_all_begin(rank * nbytes * 10, y)
+ fh.Read_at_all_end(y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_ReadWriteAllBegin(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Write_all_begin(x)
+ fh.Write_all_end(x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Read_all_begin(y)
+ fh.Read_all_end(y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ @unittest.skipIf(IS_MULTI_HOST, 'necmpi-multi-host')
+ def test_ReadWriteOrderedBegin(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np)
+ or (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ actual = _get_sbuf(np, shape, dtype, order)
+ comm.Bcast(actual, root=0)
+
+ x = dev1.array(actual, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype, order, fromdev=dev1)
+ nbytes = max(x.nbytes, y.nbytes)
+ fh = self.FILE
+ fh.Set_size(0)
+ fh.Set_view(0, _get_type(dtype))
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Write_ordered_begin(x)
+ fh.Write_ordered_end(x)
+ fh.Sync()
+ comm.Barrier()
+ fh.Sync()
+ fh.Seek_shared(rank * nbytes * 10, MPI.SEEK_SET)
+ fh.Read_ordered_begin(y)
+ fh.Read_ordered_end(y)
+
+ assert_equal(_get_array(y), actual)
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/coverage_device_test/test_coverage_device_message.py b/coverage_device_test/test_coverage_device_message.py
new file mode 100644
index 0000000..dfb0c98
--- /dev/null
+++ b/coverage_device_test/test_coverage_device_message.py
@@ -0,0 +1,72 @@
+import unittest # NOQA
+from unittest import TestCase # NOQA
+from parameterized import parameterized # NOQA
+from mpi4pyve import MPI # NOQA
+from mpi4pyve import util # NOQA
+from numpy.testing import (
+ assert_equal,
+)
+import coverage_device_util # NOQA
+from coverage_device_util import (
+ _patterns, _get_array, _get_type, _get_sbuf, _get_rbuf, vp, np
+)
+
+
+class TestCoverageDeviceMessage(unittest.TestCase):
+
+ COMM = MPI.COMM_WORLD
+
+ @parameterized.expand(_patterns)
+ def test_Recv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+
+ actual = _get_sbuf(np, shape, dtype=dtype, order=order)
+ comm.Bcast(actual, root=0)
+ x = dev1.array(actual, dtype=dtype, order=order)
+ if rank % 2 == 0:
+ comm.Send([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
+ m = MPI.Message.Probe(comm)
+ m.Recv(y)
+
+ assert_equal(_get_array(y), actual)
+
+ @parameterized.expand(_patterns)
+ def test_Irecv(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ mtype = _get_type(dtype)
+ actual = _get_sbuf(np, shape, dtype=dtype, order=order)
+ comm.Bcast(actual, root=0)
+ x = dev1.array(actual, dtype=dtype, order=order)
+ if rank % 2 == 0:
+ comm.Send([x, mtype], dest=peer)
+ else:
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
+ m = MPI.Message.Probe(comm)
+ m.Irecv(y).Wait()
+
+ assert_equal(_get_array(y), actual)
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/coverage_device_test/test_coverage_device_win.py b/coverage_device_test/test_coverage_device_win.py
new file mode 100644
index 0000000..6d5efec
--- /dev/null
+++ b/coverage_device_test/test_coverage_device_win.py
@@ -0,0 +1,486 @@
+import unittest # NOQA
+from unittest import TestCase # NOQA
+from parameterized import parameterized # NOQA
+from mpi4pyve import MPI # NOQA
+from mpi4pyve import util # NOQA
+from numpy.testing import (
+ assert_equal,
+)
+import coverage_device_util # NOQA
+from coverage_device_util import (
+ _patterns, _get_array, _get_sbuf, _get_rbuf, vp, np,
+ _flush_test_patterns
+)
+
+
+class TestCoverageDeviceWin(unittest.TestCase):
+
+ COMM = MPI.COMM_WORLD
+ INFO = MPI.INFO_NULL
+
+ def memzero(self, m):
+ try:
+ m[:] = 0
+ except IndexError: # cffi buffer
+ m[0:len(m)] = b'\0' * len(m)
+
+ def setUp(self):
+ nbytes = 5 ** 11 * MPI.DOUBLE.size
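+ # Prefer MPI-allocated memory for the window; fall back to a plain array.array buffer if Alloc_mem is unsupported.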
+ try:
+ self.mpi_memory = MPI.Alloc_mem(nbytes)
+ self.memory = self.mpi_memory
+ self.memzero(self.memory)
+ except MPI.Exception:
+ import array
+ self.mpi_memory = None
+ self.memory = array.array('B', [0] * nbytes)
+ self.WIN = MPI.Win.Create(self.memory, 1, self.INFO, self.COMM)
+ try:
+ self.actual_mpi_memory = MPI.Alloc_mem(nbytes)
+ self.actual_memory = self.actual_mpi_memory
+ self.memzero(self.actual_memory)
+ except MPI.Exception:
+ import array
+ self.actual_mpi_memory = None
+ self.actual_memory = array.array('B', [0] * nbytes)
+ self.actual_WIN = MPI.Win.Create(self.actual_memory, 1, self.INFO,
+ self.COMM)
+
+ def tearDown(self):
+ self.WIN.Free()
+ if self.mpi_memory:
+ MPI.Free_mem(self.mpi_memory)
+ self.actual_WIN.Free()
+ if self.actual_mpi_memory:
+ MPI.Free_mem(self.actual_mpi_memory)
+
+ @parameterized.expand(_patterns)
+ def test_PutGet(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ actual_y = _get_sbuf(np, shape, dtype=dtype, order=order)
+ comm.Bcast(actual_y, root=0)
+ x = dev1.array(actual_y, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
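+ # The window was created with disp_unit=1, so this target displacement is a one-element byte offset.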
+ target = x.itemsize
+ self.WIN.Fence()
+ self.WIN.Put(x, rank, target)
+ self.WIN.Fence()
+ self.WIN.Get(y, rank, target)
+ self.WIN.Fence()
+
+ assert_equal(_get_array(y), actual_y)
+
+ @parameterized.expand(_patterns)
+ def test_Accumulate(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ x = _get_sbuf(dev1, shape, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
+ actual_x = _get_sbuf(np, shape, dtype=dtype, order=order)
+ actual_y = _get_rbuf(np, shape, dtype=dtype, order=order)
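+ # Mirror every device-side RMA sequence with NumPy buffers on a second window as the reference result.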
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
+ self.WIN.Fence()
+ self.WIN.Accumulate(x, rank, op=op)
+ self.WIN.Fence()
+ self.WIN.Get(y, rank)
+ self.WIN.Fence()
+
+ self.actual_WIN.Fence()
+ self.actual_WIN.Accumulate(actual_x, rank, op=op)
+ self.actual_WIN.Fence()
+ self.actual_WIN.Get(actual_y, rank)
+ self.actual_WIN.Fence()
+
+ assert_equal(_get_array(x), actual_x)
+ assert_equal(_get_array(y), actual_y)
+
+ @parameterized.expand(_patterns)
+ def test_GetAccumulate(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ x = _get_sbuf(dev1, shape, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
+ actual_x = _get_sbuf(np, shape, dtype=dtype, order=order)
+ actual_y = _get_rbuf(np, shape, dtype=dtype, order=order)
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN, MPI.NO_OP):
+ self.WIN.Fence()
+ self.WIN.Put(x, rank)
+ self.WIN.Get_accumulate(x, y, rank, op=op)
+ self.WIN.Fence()
+ self.WIN.Get(y, rank)
+ self.WIN.Fence()
+
+ self.actual_WIN.Fence()
+ self.actual_WIN.Put(actual_x, rank)
+ self.actual_WIN.Get_accumulate(actual_x, actual_y, rank, op=op)
+ self.actual_WIN.Fence()
+ self.actual_WIN.Get(actual_y, rank)
+ self.actual_WIN.Fence()
+
+ assert_equal(_get_array(x), actual_x)
+ assert_equal(_get_array(y), actual_y)
+
+ @parameterized.expand(_patterns)
+ def test_Fetch_and_op(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ x = _get_sbuf(dev1, 1, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, 1, dtype=dtype, order=order)
+ x.fill(1)
+ y.fill(-1)
+ actual_x = _get_sbuf(np, 1, dtype=dtype, order=order)
+ actual_y = _get_rbuf(np, 1, dtype=dtype, order=order)
+ actual_x.fill(1)
+ actual_y.fill(-1)
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN, MPI.REPLACE,
+ MPI.NO_OP):
+ self.WIN.Fence()
+ self.WIN.Fetch_and_op(x, y, rank, 0, op=op)
+ self.WIN.Fence()
+
+ self.actual_WIN.Fence()
+ self.actual_WIN.Fetch_and_op(actual_x, actual_y, rank, 0, op=op)
+ self.actual_WIN.Fence()
+
+ assert_equal(_get_array(x), actual_x)
+ assert_equal(_get_array(y), actual_y)
+
+ @parameterized.expand(_patterns)
+ def test_Compare_and_swap(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ x = _get_sbuf(dev1, 1, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, 1, dtype=dtype, order=order)
+ z = _get_rbuf(dev2, 1, dtype=dtype, order=order)
+ x.fill(1)
+ y.fill(0)
+ z.fill(-1)
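+ # Compare_and_swap replaces the target word with x only if it equals y; z receives the prior value.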
+ self.WIN.Fence()
+ self.WIN.Compare_and_swap(x, y, z, rank, 0)
+ self.WIN.Fence()
+
+ actual_x = _get_sbuf(np, 1, dtype=dtype, order=order)
+ actual_y = _get_rbuf(np, 1, dtype=dtype, order=order)
+ actual_z = _get_rbuf(np, 1, dtype=dtype, order=order)
+ actual_x.fill(1)
+ actual_y.fill(0)
+ actual_z.fill(-1)
+ self.actual_WIN.Fence()
+ self.actual_WIN.Compare_and_swap(actual_x, actual_y, actual_z, rank, 0)
+ self.actual_WIN.Fence()
+
+ assert_equal(_get_array(x), actual_x)
+ assert_equal(_get_array(y), actual_y)
+ assert_equal(_get_array(z), actual_z)
+
+ @parameterized.expand(_patterns)
+ def test_Rput_Rget(self, dev1, dev2, shape, dtype, order):
+ if dtype == 'bool' and ((dev1 is vp and dev2 is np) or
+ (dev1 is np and dev2 is vp)):
+ self.skipTest('Booleans case in vp-to-np is not testable')
+
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ actual_y = _get_sbuf(np, shape, dtype=dtype, order=order)
+ comm.Bcast(actual_y, root=0)
+ x = dev1.array(actual_y, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
+ self.WIN.Fence()
+ self.WIN.Rput(x, rank).Wait()
+ self.WIN.Fence()
+ self.WIN.Rget(y, rank).Wait()
+ self.WIN.Fence()
+
+ assert_equal(_get_array(y), actual_y)
+
+ @parameterized.expand(_patterns)
+ def test_Raccumulate(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ x = _get_sbuf(dev1, shape, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
+ actual_x = _get_sbuf(np, shape, dtype=dtype, order=order)
+ actual_y = _get_rbuf(np, shape, dtype=dtype, order=order)
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN, MPI.REPLACE):
+ self.WIN.Fence()
+ x.fill(1)
+ self.WIN.Rput(x, rank).Wait()
+ self.WIN.Fence()
+ self.WIN.Raccumulate(x, rank, op=op).Wait()
+ self.WIN.Fence()
+ self.WIN.Rget(y, rank).Wait()
+ self.WIN.Fence()
+
+ self.actual_WIN.Fence()
+ actual_x.fill(1)
+ self.actual_WIN.Rput(actual_x, rank).Wait()
+ self.actual_WIN.Fence()
+ self.actual_WIN.Raccumulate(actual_x, rank, op=op).Wait()
+ self.actual_WIN.Fence()
+ self.actual_WIN.Rget(actual_y, rank).Wait()
+ self.actual_WIN.Fence()
+
+ assert_equal(_get_array(x), actual_x)
+ assert_equal(_get_array(y), actual_y)
+
+ @parameterized.expand(_patterns)
+ def test_Rget_accumulate(self, dev1, dev2, shape, dtype, order):
+ if dtype in ('complex64', 'complex128', 'bool'):
+ self.skipTest('Datatype is not testable')
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ x = _get_sbuf(dev1, shape, dtype=dtype, order=order)
+ y = _get_rbuf(dev2, shape, dtype=dtype, order=order)
+ actual_x = _get_sbuf(np, shape, dtype=dtype, order=order)
+ actual_y = _get_rbuf(np, shape, dtype=dtype, order=order)
+ for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN, MPI.REPLACE,
+ MPI.NO_OP):
+ self.WIN.Fence()
+ x.fill(1)
+ self.WIN.Rput(x, rank).Wait()
+ self.WIN.Fence()
+ self.WIN.Rget_accumulate(x, y, rank, op=op).Wait()
+ self.WIN.Fence()
+ self.WIN.Rget(y, rank).Wait()
+ self.WIN.Fence()
+
+ self.actual_WIN.Fence()
+ actual_x.fill(1)
+ self.actual_WIN.Rput(actual_x, rank).Wait()
+ self.actual_WIN.Fence()
+ self.actual_WIN.Rget_accumulate(actual_x, actual_y,
+ rank, op=op).Wait()
+ self.actual_WIN.Fence()
+ self.actual_WIN.Rget(actual_y, rank).Wait()
+ self.actual_WIN.Fence()
+
+ assert_equal(_get_array(x), actual_x)
+ assert_equal(_get_array(y), actual_y)
+
+ # Fence : No NLCPy synchronization required.
+ @parameterized.expand(_flush_test_patterns)
+ def test_Fence_synchronize(self, dev1, dtype, order):
+ comm = self.COMM
+ rank = comm.Get_rank()
+
+ n = dev1.array(0, dtype=dtype, order=order)
+ expect = dev1.array(1, dtype=dtype, order=order)
+
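+ # Only rank 0 exposes local memory; the other ranks join the window without a buffer.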
+ if rank == 0:
+ win_n = MPI.Win.Create(n, comm=MPI.COMM_WORLD)
+ else:
+ win_n = MPI.Win.Create(None, comm=MPI.COMM_WORLD)
+ if rank == 0:
+ n.fill(1)
+ win_n.Fence()
+ if rank != 0:
+ win_n.Get([n, MPI.INT], 0)
+ win_n.Fence()
+ win_n.Free()
+ assert_equal(_get_array(n), _get_array(expect))
+
+ # Sync : No NLCPy synchronization required.
+ @parameterized.expand(_flush_test_patterns)
+ def test_Sync_synchronize(self, dev1, dtype, order):
+ comm = self.COMM
+ rank = comm.Get_rank()
+
+ n = dev1.array(0, dtype=dtype, order=order)
+ expect = dev1.array(1, dtype=dtype, order=order)
+
+ if rank == 0:
+ win_n = MPI.Win.Create(n, comm=MPI.COMM_WORLD)
+ else:
+ win_n = MPI.Win.Create(None, comm=MPI.COMM_WORLD)
+ if rank == 0:
+ n.fill(1)
+ if dev1 == vp:
+ n.venode.synchronize()
+ if rank != 0:
+ win_n.Lock(MPI.LOCK_EXCLUSIVE, 0)
+ n.fill(0)
+ win_n.Sync()
+ win_n.Get([n, MPI.INT], 0)
+ win_n.Unlock(0)
+ comm.Barrier()
+ win_n.Free()
+ assert_equal(_get_array(n), _get_array(expect))
+
+ # Post : No NLCPy synchronization required.
+ @parameterized.expand(_flush_test_patterns)
+ def test_Post_synchronize(self, dev1, dtype, order):
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ n = dev1.array(0, dtype=dtype, order=order)
+ expect = dev1.array(1, dtype=dtype, order=order)
+ comm_group = comm.Get_group()
+
+ if rank % 2 == 0:
+ win_n = MPI.Win.Create(n, comm=MPI.COMM_WORLD)
+ else:
+ win_n = MPI.Win.Create(None, comm=MPI.COMM_WORLD)
+ if peer != MPI.PROC_NULL:
+ group = comm_group.Incl([peer])
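+ # The exposing side calls Post/Wait; the accessing side calls Start/Complete (active-target sync).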
+ if rank % 2 == 0:
+ n.fill(1)
+ win_n.Post(group)
+ win_n.Wait()
+ else:
+ win_n.Start(group)
+ win_n.Get([n, MPI.INT], peer)
+ win_n.Complete()
+ group.Free()
+ win_n.Free()
+ comm_group.Free()
+ if peer != MPI.PROC_NULL:
+ assert_equal(_get_array(n), _get_array(expect))
+
+ # Start : No NLCPy synchronization required.
+ @parameterized.expand(_flush_test_patterns)
+ def test_Start_synchronize(self, dev1, dtype, order):
+ comm = self.COMM
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ peer = rank ^ 1
+ if peer >= size:
+ peer = MPI.PROC_NULL
+
+ n = dev1.array(0, dtype=dtype, order=order)
+ expect = dev1.array(1, dtype=dtype, order=order)
+ comm_group = comm.Get_group()
+
+ if rank % 2 == 0:
+ win_n = MPI.Win.Create(n, comm=MPI.COMM_WORLD)
+ else:
+ win_n = MPI.Win.Create(None, comm=MPI.COMM_WORLD)
+ if peer != MPI.PROC_NULL:
+ group = comm_group.Incl([peer])
+ if rank % 2 == 0:
+ win_n.Post(group)
+ win_n.Wait()
+ else:
+ n.fill(1)
+ win_n.Start(group)
+ win_n.Put([n, MPI.INT], peer)
+ win_n.Complete()
+ group.Free()
+ win_n.Free()
+ comm_group.Free()
+ if peer != MPI.PROC_NULL:
+ assert_equal(_get_array(n), _get_array(expect))
+
+ # Lock : NLCPy synchronization required.
+ @parameterized.expand(_flush_test_patterns)
+ def test_Lock_synchronize(self, dev1, dtype, order):
+ comm = self.COMM
+ rank = comm.Get_rank()
+
+ n = dev1.array(0, dtype=dtype, order=order)
+ expect = dev1.array(1, dtype=dtype, order=order)
+
+ if rank == 0:
+ win_n = MPI.Win.Create(n, comm=MPI.COMM_WORLD)
+ else:
+ win_n = MPI.Win.Create(None, comm=MPI.COMM_WORLD)
+ if rank == 0:
+ n.fill(1)
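+ # NLCPy executes asynchronously on the VE, so flush pending updates before other ranks access the window.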
+ if dev1 == vp:
+ n.venode.synchronize()
+ comm.Barrier()
+ if rank != 0:
+ win_n.Lock(MPI.LOCK_EXCLUSIVE, 0)
+ win_n.Get([n, MPI.INT], 0)
+ win_n.Unlock(0)
+ comm.Barrier()
+ win_n.Free()
+ assert_equal(_get_array(n), _get_array(expect))
+
+ # Lock_all : NLCPy synchronization required.
+ @parameterized.expand(_flush_test_patterns)
+ def test_Lock_all_synchronize(self, dev1, dtype, order):
+ comm = self.COMM
+ rank = comm.Get_rank()
+
+ n = dev1.array(0, dtype=dtype, order=order)
+ expect = dev1.array(1, dtype=dtype, order=order)
+
+ if rank == 0:
+ win_n = MPI.Win.Create(n, comm=MPI.COMM_WORLD)
+ else:
+ win_n = MPI.Win.Create(None, comm=MPI.COMM_WORLD)
+ if rank == 0:
+ n.fill(1)
+ if dev1 == vp:
+ n.venode.synchronize()
+ comm.Barrier()
+ if rank != 0:
+ win_n.Lock_all()
+ win_n.Get([n, MPI.INT], 0)
+ win_n.Unlock_all()
+ comm.Barrier()
+ win_n.Free()
+ assert_equal(_get_array(n), _get_array(expect))
+
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/demo/osu_allgather.py b/demo/OSU/osu_allgather.py
similarity index 100%
rename from demo/osu_allgather.py
rename to demo/OSU/osu_allgather.py
diff --git a/demo/osu_alltoall.py b/demo/OSU/osu_alltoall.py
similarity index 100%
rename from demo/osu_alltoall.py
rename to demo/OSU/osu_alltoall.py
diff --git a/demo/osu_alltoallv.py b/demo/OSU/osu_alltoallv.py
similarity index 100%
rename from demo/osu_alltoallv.py
rename to demo/OSU/osu_alltoallv.py
diff --git a/demo/osu_barrier.py b/demo/OSU/osu_barrier.py
similarity index 100%
rename from demo/osu_barrier.py
rename to demo/OSU/osu_barrier.py
diff --git a/demo/osu_bcast.py b/demo/OSU/osu_bcast.py
similarity index 100%
rename from demo/osu_bcast.py
rename to demo/OSU/osu_bcast.py
diff --git a/demo/osu_bibw.py b/demo/OSU/osu_bibw.py
similarity index 100%
rename from demo/osu_bibw.py
rename to demo/OSU/osu_bibw.py
diff --git a/demo/osu_bw.py b/demo/OSU/osu_bw.py
similarity index 100%
rename from demo/osu_bw.py
rename to demo/OSU/osu_bw.py
diff --git a/demo/osu_gather.py b/demo/OSU/osu_gather.py
similarity index 100%
rename from demo/osu_gather.py
rename to demo/OSU/osu_gather.py
diff --git a/demo/osu_latency.py b/demo/OSU/osu_latency.py
similarity index 100%
rename from demo/osu_latency.py
rename to demo/OSU/osu_latency.py
diff --git a/demo/osu_multi_lat.py b/demo/OSU/osu_multi_lat.py
similarity index 100%
rename from demo/osu_multi_lat.py
rename to demo/OSU/osu_multi_lat.py
diff --git a/demo/osu_scatter.py b/demo/OSU/osu_scatter.py
similarity index 100%
rename from demo/osu_scatter.py
rename to demo/OSU/osu_scatter.py
diff --git a/demo/README.txt b/demo/README.rst
similarity index 81%
rename from demo/README.txt
rename to demo/README.rst
index 71c40d9..0fe57d7 100644
--- a/demo/README.txt
+++ b/demo/README.rst
@@ -1,6 +1,6 @@
Issuing at the command line::
- $ mpiexec -n 5 python helloworld.py
+ $ mpiexec -veo -n 5 python helloworld.py
will launch a five-process run of the Python interpreter and execute
the test script ``helloworld.py``, a parallelized version of the
diff --git a/demo/compute-pi/README.txt b/demo/compute-pi/README.rst
similarity index 100%
rename from demo/compute-pi/README.txt
rename to demo/compute-pi/README.rst
diff --git a/demo/compute-pi/cpi-cco.py b/demo/compute-pi/cpi-cco.py
index 325f2f8..11c47d8 100644
--- a/demo/compute-pi/cpi-cco.py
+++ b/demo/compute-pi/cpi-cco.py
@@ -5,7 +5,7 @@
usage::
- $ mpiexec -n <nprocs> python cpi-buf.py
+ $ mpiexec -veo -n <nprocs> python cpi-buf.py
"""
from mpi4pyve import MPI
diff --git a/demo/compute-pi/cpi-dpm.py b/demo/compute-pi/cpi-dpm.py
deleted file mode 100644
index 619ee7d..0000000
--- a/demo/compute-pi/cpi-dpm.py
+++ /dev/null
@@ -1,183 +0,0 @@
-#!/usr/bin/env python
-"""
-Parallel PI computation using Dynamic Process Management (DPM)
-within Python objects exposing memory buffers (requires NumPy).
-
-usage:
-
- + parent/child model::
-
- $ mpiexec -n 1 python cpi-dpm.py [nchilds]
-
- + client/server model::
-
- $ [xterm -e] mpiexec -n <nprocs> python cpi-dpm.py server [-v] &
- $ [xterm -e] mpiexec -n 1 python cpi-dpm.py client [-v]
-"""
-
-import sys
-from mpi4pyve import MPI
-import numpy as N
-
-def get_n():
- prompt = "Enter the number of intervals: (0 quits) "
- try:
- n = int(input(prompt))
- if n < 0: n = 0
- except:
- n = 0
- return n
-
-def view(pi, np=None, wt=None):
- from math import pi as PI
- prn = sys.stdout.write
- if pi is not None:
- prn("computed pi is: %.16f\n" % pi)
- prn("absolute error: %.16f\n" % abs(pi - PI))
- if np is not None:
- prn("computing units: %d processes\n" % np)
- if wt is not None:
- prn("wall clock time: %g seconds\n" % wt)
- sys.stdout.flush()
-
-def comp_pi(n, comm, root=0):
- nprocs = comm.Get_size()
- myrank = comm.Get_rank()
- n = N.array(n, 'i')
- comm.Bcast([n, MPI.INT], root=root)
- if n == 0: return 0.0
- h = 1.0 / n;
- s = 0.0;
- for i in range(myrank, n, nprocs):
- x = h * (i + 0.5);
- s += 4.0 / (1.0 + x**2);
- mypi = s * h
- mypi = N.array(mypi, 'd')
- pi = N.array(0, 'd')
- comm.Reduce([mypi, MPI.DOUBLE],
- [pi, MPI.DOUBLE],
- root=root, op=MPI.SUM)
- return pi
-
-def master(icomm):
- n = get_n()
- wt = MPI.Wtime()
- n = N.array(n, 'i')
- icomm.Send([n, MPI.INT], dest=0)
- pi = N.array(0, 'd')
- icomm.Recv([pi, MPI.DOUBLE], source=0)
- wt = MPI.Wtime() - wt
- if n == 0: return
- np = icomm.Get_remote_size()
- view(pi, np, wt)
-
-def worker(icomm):
- myrank = icomm.Get_rank()
- if myrank == 0:
- source = dest = 0
- else:
- source = dest = MPI.PROC_NULL
- n = N.array(0, 'i')
- icomm.Recv([n, MPI.INT], source=source)
- pi = comp_pi(n, comm=MPI.COMM_WORLD, root=0)
- pi = N.array(pi, 'd')
- icomm.Send([pi, MPI.DOUBLE], dest=dest)
-
-
-# Parent/Child
-
-def main_parent(nprocs=1):
- assert nprocs > 0
- assert MPI.COMM_WORLD.Get_size() == 1
- icomm = MPI.COMM_WORLD.Spawn(command=sys.executable,
- args=[__file__, 'child'],
- maxprocs=nprocs)
- master(icomm)
- icomm.Disconnect()
-
-def main_child():
- icomm = MPI.Comm.Get_parent()
- assert icomm != MPI.COMM_NULL
- worker(icomm)
- icomm.Disconnect()
-
-# Client/Server
-
-def main_server(COMM):
- nprocs = COMM.Get_size()
- myrank = COMM.Get_rank()
-
- service, port, info = None, None, MPI.INFO_NULL
- if myrank == 0:
- port = MPI.Open_port(info)
- log(COMM, "open port '%s'", port)
- service = 'cpi'
- MPI.Publish_name(service, port, info)
- log(COMM, "service '%s' published.", service)
- else:
- port = ''
-
- log(COMM, "waiting for client connection ...")
- icomm = COMM.Accept(port, info, root=0)
- log(COMM, "client connection accepted.")
-
- worker(icomm)
-
- log(COMM, "disconnecting from client ...")
- icomm.Disconnect()
- log(COMM, "client disconnected.")
-
- if myrank == 0:
- MPI.Unpublish_name(service, port, info)
- log(COMM, "service '%s' unpublished", port)
- MPI.Close_port(port)
- log(COMM, "closed port '%s' ", port)
-
-
-def main_client(COMM):
- assert COMM.Get_size() == 1
-
- service, info = 'cpi', MPI.INFO_NULL
- port = MPI.Lookup_name(service, info)
- log(COMM, "service '%s' found in port '%s'.", service, port)
-
- log(COMM, "connecting to server ...")
- icomm = COMM.Connect(port, info, root=0)
- log(COMM, "server connected.")
-
- master(icomm)
-
- log(COMM, "disconnecting from server ...")
- icomm.Disconnect()
- log(COMM, "server disconnected.")
-
-
-def main():
- assert len(sys.argv) <= 2
-
- if 'server' in sys.argv:
- main_server(MPI.COMM_WORLD)
- elif 'client' in sys.argv:
- main_client(MPI.COMM_WORLD)
- elif 'child' in sys.argv:
- main_child()
- else:
- try: nchilds = int(sys.argv[1])
- except: nchilds = 2
- main_parent(nchilds)
-
-
-VERBOSE = False
-
-def log(COMM, fmt, *args):
- if not VERBOSE: return
- if COMM.rank != 0: return
- sys.stdout.write(fmt % args)
- sys.stdout.write('\n')
- sys.stdout.flush()
-
-if __name__ == '__main__':
- if '-v' in sys.argv:
- VERBOSE = True
- sys.argv.remove('-v')
- main()
diff --git a/demo/compute-pi/cpi-rma.py b/demo/compute-pi/cpi-rma.py
index 66e4035..dc6b9f2 100644
--- a/demo/compute-pi/cpi-rma.py
+++ b/demo/compute-pi/cpi-rma.py
@@ -5,7 +5,7 @@
usage::
- $ mpiexec -n <nprocs> python cpi-rma.py
+ $ mpiexec -veo -n <nprocs> python cpi-rma.py
"""
from mpi4pyve import MPI
diff --git a/demo/compute-pi/makefile b/demo/compute-pi/makefile
index 3ec0dff..4e9cc5d 100644
--- a/demo/compute-pi/makefile
+++ b/demo/compute-pi/makefile
@@ -1,9 +1,8 @@
.PHONY: test
-MPIEXEC=mpiexec -n 1
+MPIEXEC=mpiexec -veo -n 1
PYTHON=python
test:
echo 100 | ${MPIEXEC} ${PYTHON} cpi-cco.py
echo 100 | ${MPIEXEC} ${PYTHON} cpi-rma.py
- echo 100 | ${MPIEXEC} ${PYTHON} cpi-dpm.py
diff --git a/demo/cython/helloworld.pyx b/demo/cython/helloworld.pyx
deleted file mode 100644
index f3f88ea..0000000
--- a/demo/cython/helloworld.pyx
+++ /dev/null
@@ -1,67 +0,0 @@
-cdef extern from "mpi-compat.h": pass
-
-# ---------
-
-
-# Python-level module import
-# (file: mpi4pyve/MPI.so)
-
-from mpi4pyve import MPI
-
-# Python-level objects and code
-
-size = MPI.COMM_WORLD.Get_size()
-rank = MPI.COMM_WORLD.Get_rank()
-pname = MPI.Get_processor_name()
-
-hwmess = "Hello, World! I am process %d of %d on %s."
-print (hwmess % (rank, size, pname))
-
-
-
-# ---------
-
-
-# Cython-level cimport
-# this make available mpi4pyve's Python extension types
-# (file: mpi4pyve/include/mpi4pyve/MPI.pxd)
-
-from mpi4pyve cimport MPI
-from mpi4pyve.MPI cimport Intracomm as IntracommType
-
-# C-level cdef, typed, Python objects
-
-cdef MPI.Comm WORLD = MPI.COMM_WORLD
-cdef IntracommType SELF = MPI.COMM_SELF
-
-
-# ---------
-
-
-# Cython-level cimport with PXD file
-# this make available the native MPI C API
-# with namespace-protection (stuff accessed as mpi.XXX)
-# (file: mpi4pyve/include/mpi4pyve/libmpi.pxd)
-
-from mpi4pyve cimport libmpi as mpi
-
-cdef mpi.MPI_Comm world1 = WORLD.ob_mpi
-
-cdef int ierr1=0
-
-cdef int size1 = 0
-ierr1 = mpi.MPI_Comm_size(mpi.MPI_COMM_WORLD, &size1)
-
-cdef int rank1 = 0
-ierr1 = mpi.MPI_Comm_rank(mpi.MPI_COMM_WORLD, &rank1)
-
-cdef int rlen1=0
-cdef char pname1[mpi.MPI_MAX_PROCESSOR_NAME]
-ierr1 = mpi.MPI_Get_processor_name(pname1, &rlen1)
-pname1[rlen1] = 0 # just in case ;-)
-
-hwmess = "Hello, World! I am process %d of %d on %s."
-print (hwmess % (rank1, size1, pname1))
-
-
-# ---------
diff --git a/demo/cython/makefile b/demo/cython/makefile
deleted file mode 100644
index 179f542..0000000
--- a/demo/cython/makefile
+++ /dev/null
@@ -1,31 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-PYTHON_CONFIG = ${PYTHON} ../python-config
-
-CYTHON = cython
-.PHONY: src
-src: helloworld.c
-helloworld.c: helloworld.pyx
- ${CYTHON} $<
-
-
-MPICC = mpicc
-CFLAGS = -fPIC ${shell ${PYTHON_CONFIG} --includes}
-LDFLAGS = -shared ${shell ${PYTHON_CONFIG} --libs}
-SO = ${shell ${PYTHON_CONFIG} --extension-suffix}
-.PHONY: build
-build: helloworld${SO}
-helloworld${SO}: helloworld.c
- ${MPICC} ${CFLAGS} -o $@ $< ${LDFLAGS}
-
-
-.PHONY: test
-test: build
- ${PYTHON} -c 'import helloworld'
-
-
-.PHONY: clean
-clean:
- ${RM} helloworld.c helloworld${SO}
diff --git a/demo/cython/mpi-compat.h b/demo/cython/mpi-compat.h
deleted file mode 100644
index 367c58a..0000000
--- a/demo/cython/mpi-compat.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/* Author: Lisandro Dalcin */
-/* Contact: dalcinl@gmail.com */
-
-#ifndef MPI_COMPAT_H
-#define MPI_COMPAT_H
-
-#include <mpi.h>
-
-#if (MPI_VERSION < 3) && !defined(PyMPI_HAVE_MPI_Message)
-typedef void *PyMPI_MPI_Message;
-#define MPI_Message PyMPI_MPI_Message
-#endif
-
-#endif/*MPI_COMPAT_H*/
diff --git a/demo/embedding/helloworld.c b/demo/embedding/helloworld.c
deleted file mode 100644
index cf5a0b7..0000000
--- a/demo/embedding/helloworld.c
+++ /dev/null
@@ -1,42 +0,0 @@
-/*
- * You can safely use mpi4pyve between multiple
- * Py_Initialize()/Py_Finalize() calls ...
- * but do not blame me for the memory leaks ;-)
- *
- */
-
-#include <mpi.h>
-#include <Python.h>
-
-const char helloworld[] = \
- "from mpi4pyve import MPI \n"
- "hwmess = 'Hello, World! I am process %d of %d on %s.' \n"
- "myrank = MPI.COMM_WORLD.Get_rank() \n"
- "nprocs = MPI.COMM_WORLD.Get_size() \n"
- "procnm = MPI.Get_processor_name() \n"
- "print (hwmess % (myrank, nprocs, procnm)) \n"
- "";
-
-int main(int argc, char *argv[])
-{
- int i,n=5;
-
- MPI_Init(&argc, &argv);
-
- for (i=0; i<n; i++) {
- Py_Initialize();
- PyRun_SimpleString(helloworld);
- Py_Finalize();
- }
-
- MPI_Finalize();
- return 0;
-}
diff --git a/demo/futures/perf_crawl.py b/demo/futures/perf_crawl.py
deleted file mode 100644
--- a/demo/futures/perf_crawl.py
+++ /dev/null
-"""
-Compare the speed of downloading URLs sequentially vs. using futures.
-"""
-
-import sys
-import time
-import functools
-
-try:
- from urllib.request import urlopen
-except ImportError:
- from urllib2 import urlopen
-
-from concurrent.futures import ThreadPoolExecutor
-from concurrent.futures import ProcessPoolExecutor
-from concurrent.futures import as_completed
-from mpi4pyve.futures import MPIPoolExecutor
-
-URLS = [
- 'http://www.google.com/',
- 'http://www.apple.com/',
- 'http://www.ibm.com/',
- 'http://www.slashdot.org/',
- 'http://www.python.org/',
- 'http://www.bing.com/',
- 'http://www.facebook.com/',
- 'http://www.yahoo.com/',
- 'http://www.youtube.com/',
- 'http://www.blogger.com/',
-]
-
-def load_url(url, timeout):
- kwargs = {'timeout': timeout} if sys.version_info >= (2, 6) else {}
- return urlopen(url, **kwargs).read()
-
-def download_urls_sequential(urls, timeout=60):
- url_to_content = {}
- for url in urls:
- try:
- url_to_content[url] = load_url(url, timeout=timeout)
- except:
- pass
- return url_to_content
-
-def download_urls_with_executor(executor, urls, timeout=60):
- if executor is None: return {}
- try:
- url_to_content = {}
- future_to_url = dict((executor.submit(load_url, url, timeout), url)
- for url in urls)
- for future in as_completed(future_to_url):
- try:
- url_to_content[future_to_url[future]] = future.result()
- except:
- pass
- return url_to_content
- finally:
- executor.shutdown()
-
-def main():
- for meth, fn in [('sequential',
- functools.partial(download_urls_sequential,
- URLS)),
- ('threads',
- functools.partial(download_urls_with_executor,
- ThreadPoolExecutor(10), URLS)),
- ('processes',
- functools.partial(download_urls_with_executor,
- ProcessPoolExecutor(10), URLS)),
- ('mpi4pyve',
- functools.partial(download_urls_with_executor,
- MPIPoolExecutor(10), URLS))]:
- sys.stdout.write('%s: ' % meth.ljust(11))
- sys.stdout.flush()
- start = time.time()
- url_map = fn()
- elapsed = time.time() - start
- sys.stdout.write('%5.2f seconds (%2d of %d downloaded)\n' %
- (elapsed, len(url_map), len(URLS)))
- sys.stdout.flush()
-
-if __name__ == '__main__':
- main()
diff --git a/demo/futures/perf_primes.py b/demo/futures/perf_primes.py
deleted file mode 100644
index 77ad5d9..0000000
--- a/demo/futures/perf_primes.py
+++ /dev/null
@@ -1,78 +0,0 @@
-"""
-Compare the speed of primes sequentially vs. using futures.
-"""
-
-import sys
-import time
-import math
-try:
- range = xrange
-except NameError:
- range = range
-
-try:
- from concurrent.futures import ThreadPoolExecutor
-except ImportError:
- ThreadPoolExecutor = None
-try:
- from concurrent.futures import ProcessPoolExecutor
-except ImportError:
- ProcessPoolExecutor = None
-
-from mpi4pyve.futures import MPIPoolExecutor
-
-PRIMES = [
- 112272535095293,
- 112582705942171,
- 112272535095293,
- 115280095190773,
- 115797848077099,
- 117450548693743,
- 993960000099397,
-]
-
-def is_prime(n):
- if n % 2 == 0:
- return False
- sqrt_n = int(math.floor(math.sqrt(n)))
- for i in range(3, sqrt_n + 1, 2):
- if n % i == 0:
- return False
- return True
-
-def sequential():
- return list(map(is_prime, PRIMES))
-
-def with_thread_pool_executor():
- if not ThreadPoolExecutor: return None
- with ThreadPoolExecutor(4) as executor:
- return list(executor.map(is_prime, PRIMES))
-
-def with_process_pool_executor():
- if not ProcessPoolExecutor: return None
- with ProcessPoolExecutor(4) as executor:
- return list(executor.map(is_prime, PRIMES))
-
-def with_mpi_pool_executor():
- with MPIPoolExecutor(4) as executor:
- return list(executor.map(is_prime, PRIMES))
-
-def main():
- for name, fn in [('sequential', sequential),
- ('threads', with_thread_pool_executor),
- ('processes', with_process_pool_executor),
- ('mpi4pyve', with_mpi_pool_executor)]:
- sys.stdout.write('%s: ' % name.ljust(11))
- sys.stdout.flush()
- start = time.time()
- result = fn()
- if result is None:
- sys.stdout.write(' not available\n')
- elif result != [True] * len(PRIMES):
- sys.stdout.write(' failed\n')
- else:
- sys.stdout.write('%5.2f seconds\n' % (time.time() - start))
- sys.stdout.flush()
-
-if __name__ == '__main__':
- main()
diff --git a/demo/futures/run_crawl.py b/demo/futures/run_crawl.py
deleted file mode 100644
index 72f47b0..0000000
--- a/demo/futures/run_crawl.py
+++ /dev/null
@@ -1,34 +0,0 @@
-from __future__ import print_function
-from __future__ import division
-
-try:
- from urllib.request import urlopen
-except ImportError:
- from urllib2 import urlopen
-
-from mpi4pyve.futures import MPIPoolExecutor
-
-URLS = [
- 'http://www.google.com/',
- 'http://www.apple.com/',
- 'http://www.ibm.com/',
- 'http://www.slashdot.org/',
- 'http://www.python.org/',
- 'http://www.bing.com/',
- 'http://www.facebook.com/',
- 'http://www.yahoo.com/',
- 'http://www.youtube.com/',
- 'http://www.blogger.com/',
-]
-
-def load_url(url):
- return url, urlopen(url).read()
-
-def test_crawl():
- with MPIPoolExecutor(10) as executor:
- for url, content in executor.map(load_url, URLS,
- timeout=10, unordered=True):
- print('%-25s: %6.2f KiB' % (url, len(content)/(1 << 10)))
-
-if __name__ == '__main__':
- test_crawl()
diff --git a/demo/futures/run_julia.py b/demo/futures/run_julia.py
deleted file mode 100644
index b4221e3..0000000
--- a/demo/futures/run_julia.py
+++ /dev/null
@@ -1,69 +0,0 @@
-from __future__ import print_function
-from __future__ import division
-import sys
-import time
-
-from mpi4pyve.futures import MPICommExecutor
-
-try:
- range = xrange
-except NameError:
- pass
-
-x0 = -2.0
-x1 = +2.0
-y0 = -1.5
-y1 = +1.5
-
-w = 1600
-h = 1200
-
-dx = (x1 - x0) / w
-dy = (y1 - y0) / h
-
-def julia(x, y):
- c = complex(0, 0.65)
- z = complex(x, y)
- n = 255
- while abs(z) < 3 and n > 1:
- z = z**2 + c
- n -= 1
- return n
-
-def julia_line(k):
- line = bytearray(w)
- y = y1 - k * dy
- for j in range(w):
- x = x0 + j * dx
- line[j] = julia(x, y)
- return line
-
-def plot(image):
- import warnings
- warnings.simplefilter('ignore', UserWarning)
- try:
- from matplotlib import pyplot as plt
- except ImportError:
- return
- plt.figure()
- plt.imshow(image, aspect='equal', cmap='cubehelix')
- plt.axis('off')
- try:
- plt.draw()
- plt.pause(2)
- except:
- pass
-
-def test_julia():
- with MPICommExecutor() as executor:
- if executor is None: return # worker process
- tic = time.time()
- image = list(executor.map(julia_line, range(h), chunksize=10))
- toc = time.time()
-
- print("%s Set %dx%d in %.2f seconds." % ('Julia', w, h, toc-tic))
- if len(sys.argv) > 1 and sys.argv[1] == '-plot':
- plot(image)
-
-if __name__ == '__main__':
- test_julia()
diff --git a/demo/futures/run_mandelbrot.py b/demo/futures/run_mandelbrot.py
deleted file mode 100644
index 9739d07..0000000
--- a/demo/futures/run_mandelbrot.py
+++ /dev/null
@@ -1,69 +0,0 @@
-from __future__ import print_function
-from __future__ import division
-import sys
-import time
-
-from mpi4pyve.futures import MPICommExecutor
-
-try:
- range = xrange
-except NameError:
- pass
-
-x0 = -2.0
-x1 = +1.0
-y0 = -1.0
-y1 = +1.0
-
-w = 750
-h = 500
-
-dx = (x1 - x0) / w
-dy = (y1 - y0) / h
-
-def mandelbrot(x, y, maxit=255):
- c = complex(x, y)
- z = complex(0, 0)
- n = 255
- while abs(z) < 2 and n > 1:
- z = z**2 + c
- n -= 1
- return n
-
-def mandelbrot_line(k):
- line = bytearray(w)
- y = y1 - k * dy
- for j in range(w):
- x = x0 + j * dx
- line[j] = mandelbrot(x, y)
- return line
-
-def plot(image):
- import warnings
- warnings.simplefilter('ignore', UserWarning)
- try:
- from matplotlib import pyplot as plt
- except ImportError:
- return
- plt.figure()
- plt.imshow(image, aspect='equal', cmap='spectral')
- plt.axis('off')
- try:
- plt.draw()
- plt.pause(2)
- except:
- pass
-
-def test_mandelbrot():
- with MPICommExecutor() as executor:
- if executor is None: return # worker process
- tic = time.time()
- image = list(executor.map(mandelbrot_line, range(h), chunksize=10))
- toc = time.time()
-
- print("%s Set %dx%d in %.2f seconds." % ('Mandelbrot', w, h, toc-tic))
- if len(sys.argv) > 1 and sys.argv[1] == '-plot':
- plot(image)
-
-if __name__ == '__main__':
- test_mandelbrot()
diff --git a/demo/futures/run_primes.py b/demo/futures/run_primes.py
deleted file mode 100644
index 845178b..0000000
--- a/demo/futures/run_primes.py
+++ /dev/null
@@ -1,35 +0,0 @@
-from __future__ import print_function
-import math
-try:
- range = xrange
-except NameError:
- range = range
-
-from mpi4pyve.futures import MPIPoolExecutor
-
-PRIMES = [
- 112272535095293,
- 112582705942171,
- 112272535095293,
- 115280095190773,
- 115797848077099,
- 117450548693743,
- 993960000099397,
-]
-
-def is_prime(n):
- if n % 2 == 0:
- return False
- sqrt_n = int(math.floor(math.sqrt(n)))
- for i in range(3, sqrt_n + 1, 2):
- if n % i == 0:
- return False
- return True
-
-def test_primes():
- with MPIPoolExecutor(4) as executor:
- for number, prime in zip(PRIMES, executor.map(is_prime, PRIMES)):
- print('%d is prime: %s' % (number, prime))
-
-if __name__ == '__main__':
- test_primes()
diff --git a/demo/futures/test_futures.py b/demo/futures/test_futures.py
deleted file mode 100644
index 936a94c..0000000
--- a/demo/futures/test_futures.py
+++ /dev/null
@@ -1,1229 +0,0 @@
-import os
-import sys
-import time
-import functools
-import unittest
-
-from mpi4pyve import MPI
-from mpi4pyve import futures
-try:
- from concurrent.futures._base import (
- PENDING, RUNNING, CANCELLED, CANCELLED_AND_NOTIFIED, FINISHED)
-except ImportError:
- from mpi4pyve.futures._base import (
- PENDING, RUNNING, CANCELLED, CANCELLED_AND_NOTIFIED, FINISHED)
-
-
-SHARED_POOL = futures._lib.SharedPool is not None
-WORLD_SIZE = MPI.COMM_WORLD.Get_size()
-
-
-def create_future(state=PENDING, exception=None, result=None):
- f = futures.Future()
- f._state = state
- f._exception = exception
- f._result = result
- return f
-
-
-PENDING_FUTURE = create_future(state=PENDING)
-RUNNING_FUTURE = create_future(state=RUNNING)
-CANCELLED_FUTURE = create_future(state=CANCELLED)
-CANCELLED_AND_NOTIFIED_FUTURE = create_future(state=CANCELLED_AND_NOTIFIED)
-EXCEPTION_FUTURE = create_future(state=FINISHED, exception=OSError())
-SUCCESSFUL_FUTURE = create_future(state=FINISHED, result=42)
-
-
-def mul(x, y):
- return x * y
-
-
-def sleep_and_raise(t):
- time.sleep(t)
- raise Exception('this is an exception')
-
-
-def check_global_var(x):
- return global_var == x
-
-
-def check_run_name(name):
- return __name__ == name
-
-
-class ExecutorMixin:
- worker_count = 2
-
- def setUp(self):
- self.t1 = time.time()
- try:
- self.executor = self.executor_type(max_workers=self.worker_count)
- except NotImplementedError:
- e = sys.exc_info()[1]
- self.skipTest(str(e))
- self._prime_executor()
-
- def tearDown(self):
- self.executor.shutdown(wait=True)
- dt = time.time() - self.t1
- self.assertLess(dt, 60, 'synchronization issue: test lasted too long')
-
- def _prime_executor(self):
- # Make sure that the executor is ready to do work before running the
- # tests. This should reduce the probability of timeouts in the tests.
- futures = [self.executor.submit(time.sleep, 0)
- for _ in range(self.worker_count)]
- for f in futures:
- f.result()
-
-
-class ProcessPoolMixin(ExecutorMixin):
- executor_type = futures.MPIPoolExecutor
-
- if 'coverage' in sys.modules:
- executor_type = functools.partial(
- executor_type,
- python_args='-m coverage run'.split(),
- )
-
-
-class ProcessPoolInitTest(ProcessPoolMixin,
- unittest.TestCase):
-
- def _prime_executor(self):
- pass
-
- def test_init(self):
- self.executor_type()
-
- def test_init_args(self):
- self.executor_type(1)
-
- def test_init_kwargs(self):
- executor = self.executor_type(
- python_exe=sys.executable,
- max_workers=None,
- mpi_info=dict(soft="0:1"),
- globals=None,
- main=False,
- path=[],
- wdir=os.getcwd(),
- env={},
- )
- futures = [executor.submit(time.sleep, 0)
- for _ in range(self.worker_count)]
- for f in futures:
- f.result()
- executor.shutdown()
-
- def test_init_pyargs(self):
- executor_type = futures.MPIPoolExecutor
- executor = executor_type(python_args=['-B', '-Wi'])
- executor.submit(time.sleep, 0).result()
- executor.shutdown()
-
- @unittest.skipIf(SHARED_POOL, 'shared-pool')
- def test_init_globals(self):
- executor = self.executor_type(globals=dict(global_var=42))
- future1 = executor.submit(check_global_var, 42)
- future2 = executor.submit(check_global_var, 24)
- self.assertTrue(future1.result())
- self.assertFalse(future2.result())
- executor.shutdown()
-
- @unittest.skipIf(SHARED_POOL and WORLD_SIZE == 1, 'shared-pool')
- def test_run_name(self):
- executor = self.executor_type()
- run_name = futures._lib.MAIN_RUN_NAME
- future = executor.submit(check_run_name, run_name)
- self.assertTrue(future.result(), run_name)
-
- def test_max_workers_environ(self):
- save = os.environ.get('MPI4PY_MAX_WORKERS')
- os.environ['MPI4PY_MAX_WORKERS'] = '1'
- try:
- executor = self.executor_type()
- executor.submit(time.sleep, 0).result()
- executor.shutdown()
- finally:
- del os.environ['MPI4PY_MAX_WORKERS']
- if save is not None:
- os.environ['MPI4PY_MAX_WORKERS'] = save
-
- def test_max_workers_negative(self):
- for number in (0, -1):
- self.assertRaises(ValueError,
- self.executor_type,
- max_workers=number)
-
-
-class ProcessPoolBootupTest(ProcessPoolMixin,
- unittest.TestCase):
-
- def _prime_executor(self):
- pass
-
- def test_bootup(self):
- executor = self.executor_type(1)
- executor.bootup()
- executor.bootup()
- executor.shutdown()
- self.assertRaises(RuntimeError, executor.bootup)
-
- def test_bootup_wait(self):
- executor = self.executor_type(1)
- executor.bootup(wait=True)
- executor.bootup(wait=True)
- executor.shutdown(wait=True)
- self.assertRaises(RuntimeError, executor.bootup, True)
-
- def test_bootup_nowait(self):
- executor = self.executor_type(1)
- executor.bootup(wait=False)
- executor.bootup(wait=False)
- executor.shutdown(wait=False)
- self.assertRaises(RuntimeError, executor.bootup, False)
- executor.shutdown(wait=True)
-
- def test_bootup_nowait_wait(self):
- executor = self.executor_type(1)
- executor.bootup(wait=False)
- executor.bootup(wait=True)
- executor.shutdown()
- self.assertRaises(RuntimeError, executor.bootup)
-
- def test_bootup_shutdown_nowait(self):
- executor = self.executor_type(1)
- executor.bootup(wait=False)
- executor.shutdown(wait=False)
- worker = executor._pool
- del executor
- worker.join()
-
-
-class ExecutorShutdownTestMixin:
-
- def test_run_after_shutdown(self):
- self.executor.shutdown()
- self.assertRaises(RuntimeError,
- self.executor.submit,
- pow, 2, 5)
-
- def test_hang_issue12364(self):
- fs = [self.executor.submit(time.sleep, 0.01) for _ in range(50)]
- self.executor.shutdown()
- for f in fs:
- f.result()
-
-
-class ProcessPoolShutdownTest(ProcessPoolMixin,
- ExecutorShutdownTestMixin,
- unittest.TestCase):
-
- def _prime_executor(self):
- pass
-
- def test_shutdown(self):
- executor = self.executor_type(max_workers=1)
- self.assertEqual(executor._pool, None)
- self.assertEqual(executor._shutdown, False)
- executor.submit(mul, 21, 2)
- executor.submit(mul, 6, 7)
- executor.submit(mul, 3, 14)
- self.assertNotEqual(executor._pool.thread, None)
- self.assertEqual(executor._shutdown, False)
- executor.shutdown(wait=False)
- self.assertNotEqual(executor._pool.thread, None)
- self.assertEqual(executor._shutdown, True)
- executor.shutdown(wait=True)
- self.assertEqual(executor._pool, None)
- self.assertEqual(executor._shutdown, True)
-
- def test_init_bootup_shutdown(self):
- executor = self.executor_type(max_workers=1)
- self.assertEqual(executor._pool, None)
- self.assertEqual(executor._shutdown, False)
- executor.bootup()
- self.assertTrue(executor._pool.event.is_set())
- self.assertEqual(executor._shutdown, False)
- executor.shutdown()
- self.assertEqual(executor._pool, None)
- self.assertEqual(executor._shutdown, True)
-
- def test_context_manager_shutdown(self):
- with self.executor_type(max_workers=1) as e:
- self.assertEqual(list(e.map(abs, range(-5, 5))),
- [5, 4, 3, 2, 1, 0, 1, 2, 3, 4])
- threads = [e._pool.thread]
- queues = [e._pool.queue]
- events = [e._pool.event]
-
- for t in threads:
- t.join()
- for q in queues:
- self.assertRaises(LookupError, q.pop)
- for e in events:
- self.assertTrue(e.is_set())
-
- def test_del_shutdown(self):
- executor = self.executor_type(max_workers=1)
- list(executor.map(abs, range(-5, 5)))
- threads = [executor._pool.thread]
- queues = [executor._pool.queue]
- events = [executor._pool.event]
- if hasattr(sys, 'pypy_version_info'):
- executor.shutdown(False)
- else:
- del executor
-
- for t in threads:
- t.join()
- for q in queues:
- self.assertRaises(LookupError, q.pop)
- for e in events:
- self.assertTrue(e.is_set())
-
-
-class WaitTestMixin:
-
- def test_first_completed(self):
- future1 = self.executor.submit(mul, 21, 2)
- future2 = self.executor.submit(time.sleep, 0.2)
-
- done, not_done = futures.wait(
- [CANCELLED_FUTURE, future1, future2],
- return_when=futures.FIRST_COMPLETED)
-
- self.assertEqual(set([future1]), done)
- self.assertEqual(set([CANCELLED_FUTURE, future2]), not_done)
-
- def test_first_completed_some_already_completed(self):
- future1 = self.executor.submit(time.sleep, 0.2)
-
- finished, pending = futures.wait(
- [CANCELLED_AND_NOTIFIED_FUTURE, SUCCESSFUL_FUTURE, future1],
- return_when=futures.FIRST_COMPLETED)
-
- self.assertEqual(
- set([CANCELLED_AND_NOTIFIED_FUTURE, SUCCESSFUL_FUTURE]),
- finished)
- self.assertEqual(set([future1]), pending)
-
- def test_first_exception(self):
- future1 = self.executor.submit(mul, 2, 21)
- future2 = self.executor.submit(sleep_and_raise, 0.2)
- future3 = self.executor.submit(time.sleep, 0.4)
-
- finished, pending = futures.wait(
- [future1, future2, future3],
- return_when=futures.FIRST_EXCEPTION)
-
- self.assertEqual(set([future1, future2]), finished)
- self.assertEqual(set([future3]), pending)
-
- def test_first_exception_some_already_complete(self):
- future1 = self.executor.submit(divmod, 21, 0)
- future2 = self.executor.submit(time.sleep, 0.2)
-
- finished, pending = futures.wait(
- [SUCCESSFUL_FUTURE,
- CANCELLED_FUTURE,
- CANCELLED_AND_NOTIFIED_FUTURE,
- future1, future2],
- return_when=futures.FIRST_EXCEPTION)
-
- self.assertEqual(set([SUCCESSFUL_FUTURE,
- CANCELLED_AND_NOTIFIED_FUTURE,
- future1]), finished)
- self.assertEqual(set([CANCELLED_FUTURE, future2]), pending)
-
- def test_first_exception_one_already_failed(self):
- future1 = self.executor.submit(time.sleep, 0.2)
-
- finished, pending = futures.wait(
- [EXCEPTION_FUTURE, future1],
- return_when=futures.FIRST_EXCEPTION)
-
- self.assertEqual(set([EXCEPTION_FUTURE]), finished)
- self.assertEqual(set([future1]), pending)
-
- def test_all_completed(self):
- future1 = self.executor.submit(divmod, 2, 0)
- future2 = self.executor.submit(mul, 2, 21)
-
- finished, pending = futures.wait(
- [SUCCESSFUL_FUTURE,
- CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- future1,
- future2],
- return_when=futures.ALL_COMPLETED)
-
- self.assertEqual(set([SUCCESSFUL_FUTURE,
- CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- future1,
- future2]), finished)
- self.assertEqual(set(), pending)
-
- def test_timeout(self):
- future1 = self.executor.submit(mul, 6, 7)
- future2 = self.executor.submit(time.sleep, 0.5)
-
- finished, pending = futures.wait(
- [CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE,
- future1, future2],
- timeout=0.2,
- return_when=futures.ALL_COMPLETED)
-
- self.assertEqual(set([CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE,
- future1]), finished)
- self.assertEqual(set([future2]), pending)
-
-
-class ProcessPoolWaitTest(ProcessPoolMixin,
- WaitTestMixin,
- unittest.TestCase):
- pass
-
-
-class AsCompletedTestMixin:
-
- def test_no_timeout(self):
- future1 = self.executor.submit(mul, 2, 21)
- future2 = self.executor.submit(mul, 7, 6)
-
- completed = set(futures.as_completed(
- [CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE,
- future1, future2]))
- self.assertEqual(set(
- [CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE,
- future1, future2]),
- completed)
-
- def test_zero_timeout(self):
- future1 = self.executor.submit(time.sleep, 0.2)
- completed_futures = set()
- try:
- for future in futures.as_completed(
- [CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE,
- future1],
- timeout=0):
- completed_futures.add(future)
- except futures.TimeoutError:
- pass
-
- self.assertEqual(set([CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE]),
- completed_futures)
-
- def test_nonzero_timeout(self):
- future1 = self.executor.submit(time.sleep, 0.0)
- future2 = self.executor.submit(time.sleep, 0.2)
- completed_futures = set()
- try:
- for future in futures.as_completed(
- [CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE,
- future1],
- timeout=0.1):
- completed_futures.add(future)
- except futures.TimeoutError:
- pass
-
- self.assertEqual(set([CANCELLED_AND_NOTIFIED_FUTURE,
- EXCEPTION_FUTURE,
- SUCCESSFUL_FUTURE,
- future1]),
- completed_futures)
-
- def test_duplicate_futures(self):
- py_version = sys.version_info[:3]
- if py_version[0] == 3 and py_version < (3, 3, 5): return
- # Issue 20367. Duplicate futures should not raise exceptions or give
- # duplicate responses.
- future1 = self.executor.submit(time.sleep, 0.1)
- completed = [f for f in futures.as_completed([future1, future1])]
- self.assertEqual(len(completed), 1)
-
-
-class ProcessPoolAsCompletedTest(ProcessPoolMixin,
- AsCompletedTestMixin,
- unittest.TestCase):
- pass
-
-
-class ExecutorTestMixin:
-
- def test_submit(self):
- future = self.executor.submit(pow, 2, 8)
- self.assertEqual(256, future.result())
-
- def test_submit_keyword(self):
- future = self.executor.submit(mul, 2, y=8)
- self.assertEqual(16, future.result())
- future = self.executor.submit(mul, x=2, y=8)
- self.assertEqual(16, future.result())
-
- def test_submit_cancel(self):
- future1 = self.executor.submit(time.sleep, 0.25)
- future2 = self.executor.submit(time.sleep, 0)
- future2.cancel()
- self.assertEqual(None, future1.result())
- self.assertEqual(False, future1.cancelled())
- self.assertEqual(True, future2.cancelled())
-
- def test_map(self):
- self.assertEqual(
- list(self.executor.map(pow, range(10), range(10))),
- list(map(pow, range(10), range(10))))
-
- def test_starmap(self):
- sequence = [(a,a) for a in range(10)]
- self.assertEqual(
- list(self.executor.starmap(pow, sequence)),
- list(map(pow, range(10), range(10))))
- self.assertEqual(
- list(self.executor.starmap(pow, iter(sequence))),
- list(map(pow, range(10), range(10))))
-
- def test_map_exception(self):
- i = self.executor.map(divmod, [1, 1, 1, 1], [2, 3, 0, 5])
- self.assertEqual(next(i), (0, 1))
- self.assertEqual(next(i), (0, 1))
- self.assertRaises(ZeroDivisionError, next, i)
-
- def test_map_timeout(self):
- results = []
- try:
- for i in self.executor.map(time.sleep,
- [0, 0, 1],
- timeout=0.25):
- results.append(i)
- except futures.TimeoutError:
- pass
- else:
- self.fail('expected TimeoutError')
-
- self.assertEqual([None, None], results)
-
- def test_map_timeout_one(self):
- results = []
- for i in self.executor.map(time.sleep, [0, 0, 0], timeout=1):
- results.append(i)
- self.assertEqual([None, None, None], results)
-
-
-class ProcessPoolExecutorTest(ProcessPoolMixin,
- ExecutorTestMixin,
- unittest.TestCase):
-
- def test_map_chunksize(self):
- ref = list(map(pow, range(40), range(40)))
- self.assertEqual(
- list(self.executor.map(pow, range(40), range(40), chunksize=6)),
- ref)
- self.assertEqual(
- list(self.executor.map(pow, range(40), range(40), chunksize=50)),
- ref)
- self.assertEqual(
- list(self.executor.map(pow, range(40), range(40), chunksize=40)),
- ref)
-
- def bad():
- list(self.executor.map(pow, range(40), range(40), chunksize=-1))
- self.assertRaises(ValueError, bad)
-
- def test_starmap_chunksize(self):
- ref = list(map(pow, range(40), range(40)))
- sequence = [(a, a) for a in range(40)]
- self.assertEqual(
- list(self.executor.starmap(pow, sequence, chunksize=6)),
- ref)
- self.assertEqual(
- list(self.executor.starmap(pow, sequence, chunksize=50)),
- ref)
- self.assertEqual(
- list(self.executor.starmap(pow, sequence, chunksize=40)),
- ref)
- self.assertEqual(
- list(self.executor.starmap(pow, iter(sequence), chunksize=6)),
- ref)
- self.assertEqual(
- list(self.executor.starmap(pow, iter(sequence), chunksize=50)),
- ref)
- self.assertEqual(
- list(self.executor.starmap(pow, iter(sequence), chunksize=40)),
- ref)
-
- def bad():
- list(self.executor.starmap(pow, sequence, chunksize=-1))
- self.assertRaises(ValueError, bad)
-
- def test_map_unordered(self):
- map_unordered = functools.partial(self.executor.map, unordered=True)
- self.assertEqual(
- set(map_unordered(pow, range(10), range(10))),
- set(map(pow, range(10), range(10))))
-
- def test_map_unordered_timeout(self):
- map_unordered = functools.partial(self.executor.map, unordered=True)
- num_workers = self.executor._pool.size
- results = []
- try:
- args = [0.2] + [0]*(num_workers-1)
- for i in map_unordered(time.sleep, args, timeout=0.1):
- results.append(i)
- except futures.TimeoutError:
- pass
- else:
- self.fail('expected TimeoutError')
-
- self.assertEqual([None]*(num_workers-1), results)
-
- def test_map_unordered_timeout_one(self):
- map_unordered = functools.partial(self.executor.map, unordered=True)
- results = []
- for i in map_unordered(time.sleep, [0, 0, 0], timeout=1):
- results.append(i)
- self.assertEqual([None, None, None], results)
-
- def test_map_unordered_exception(self):
- map_unordered = functools.partial(self.executor.map, unordered=True)
- i = map_unordered(divmod, [1, 1, 1, 1], [2, 3, 0, 5])
- try:
- self.assertEqual(next(i), (0, 1))
- except ZeroDivisionError:
- return
-
- def test_map_unordered_chunksize(self):
- map_unordered = functools.partial(self.executor.map, unordered=True)
- ref = set(map(pow, range(40), range(40)))
- self.assertEqual(
- set(map_unordered(pow, range(40), range(40), chunksize=6)),
- ref)
- self.assertEqual(
- set(map_unordered(pow, range(40), range(40), chunksize=50)),
- ref)
- self.assertEqual(
- set(map_unordered(pow, range(40), range(40), chunksize=40)),
- ref)
-
- def bad():
- set(map_unordered(pow, range(40), range(40), chunksize=-1))
- self.assertRaises(ValueError, bad)
-
-
-class ProcessPoolSubmitTest(unittest.TestCase):
-
- @unittest.skipIf(MPI.get_vendor()[0] == 'Microsoft MPI', 'msmpi')
- def test_multiple_executors(self):
- executor1 = futures.MPIPoolExecutor(1).bootup(wait=True)
- executor2 = futures.MPIPoolExecutor(1).bootup(wait=True)
- executor3 = futures.MPIPoolExecutor(1).bootup(wait=True)
- fs1 = [executor1.submit(abs, i) for i in range(100, 200)]
- fs2 = [executor2.submit(abs, i) for i in range(200, 300)]
- fs3 = [executor3.submit(abs, i) for i in range(300, 400)]
- futures.wait(fs3+fs2+fs1)
- for i, f in enumerate(fs1):
- self.assertEqual(f.result(), i + 100)
- for i, f in enumerate(fs2):
- self.assertEqual(f.result(), i + 200)
- for i, f in enumerate(fs3):
- self.assertEqual(f.result(), i + 300)
- executor1 = executor2 = executor3 = None
-
- def test_mpi_serialized_support(self):
- futures._lib.setup_mpi_threads()
- threading = futures._lib.threading
- serialized = futures._lib.serialized
- lock_save = serialized.lock
- try:
- if lock_save is None:
- serialized.lock = threading.Lock()
- executor = futures.MPIPoolExecutor(1).bootup()
- executor.submit(abs, 0).result()
- executor.shutdown()
- serialized.lock = lock_save
- else:
- serialized.lock = None
- with lock_save:
- executor = futures.MPIPoolExecutor(1).bootup()
- executor.submit(abs, 0).result()
- executor.shutdown()
- serialized.lock = lock_save
- finally:
- serialized.lock = lock_save
-
- def orig_test_mpi_serialized_support(self):
- threading = futures._lib.threading
- serialized = futures._lib.serialized
- lock_save = serialized.lock
- try:
- serialized.lock = threading.Lock()
- executor = futures.MPIPoolExecutor(1).bootup()
- executor.submit(abs, 0).result()
- if lock_save is not None:
- serialized.lock = None
- with lock_save:
- executor.submit(abs, 0).result()
- serialized.lock = lock_save
- executor.submit(abs, 0).result()
- executor.shutdown()
- if lock_save is not None:
- serialized.lock = None
- with lock_save:
- executor = futures.MPIPoolExecutor(1).bootup()
- executor.submit(abs, 0).result()
- executor.shutdown()
- serialized.lock = lock_save
- finally:
- serialized.lock = lock_save
-
- def test_shared_executors(self):
- if not SHARED_POOL: return
- executors = [futures.MPIPoolExecutor() for _ in range(16)]
- fs = []
- for i in range(128):
- fs.extend(e.submit(abs, i*16+j)
- for j, e in enumerate(executors))
- assert sorted(f.result() for f in fs) == list(range(16*128))
- world_size = MPI.COMM_WORLD.Get_size()
- num_workers = max(1, world_size - 1)
- for e in executors:
- self.assertEqual(e._pool.size, num_workers)
- del e, executors
-
-
-def inout(arg):
- return arg
-
-
-class GoodPickle(object):
-
- def __init__(self, value=0):
- self.value = value
- self.pickled = False
- self.unpickled = False
-
- def __getstate__(self):
- self.pickled = True
- return (self.value,)
-
- def __setstate__(self, state):
- self.unpickled = True
- self.value = state[0]
-
-
-class BadPickle(object):
-
- def __init__(self):
- self.pickled = False
-
- def __getstate__(self):
- self.pickled = True
- 1/0
-
- def __setstate__(self, state):
- pass
-
-
-class BadUnpickle(object):
-
- def __init__(self):
- self.pickled = False
-
- def __getstate__(self):
- self.pickled = True
- return (None,)
-
- def __setstate__(self, state):
- if state[0] is not None:
- raise ValueError
- 1/0
-
-
-@unittest.skipIf(SHARED_POOL and WORLD_SIZE == 1, 'shared-pool')
-class ProcessPoolPickleTest(unittest.TestCase):
-
- def setUp(self):
- self.executor = futures.MPIPoolExecutor(1)
-
- def tearDown(self):
- self.executor.shutdown()
-
- def test_good_pickle(self):
- o = GoodPickle(42)
- r = self.executor.submit(inout, o).result()
- self.assertEqual(o.value, r.value)
- self.assertTrue(o.pickled)
- self.assertTrue(r.unpickled)
-
- r = self.executor.submit(GoodPickle, 77).result()
- self.assertEqual(r.value, 77)
- self.assertTrue(r.unpickled)
-
- def test_bad_pickle(self):
- o = BadPickle()
- self.assertFalse(o.pickled)
- f = self.executor.submit(inout, o)
- self.assertRaises(ZeroDivisionError, f.result)
- self.assertTrue(o.pickled)
-
- f = self.executor.submit(BadPickle)
- self.assertRaises(ZeroDivisionError, f.result)
-
- f = self.executor.submit(abs, 42)
- self.assertEqual(f.result(), 42)
-
- def test_bad_unpickle(self):
- o = BadUnpickle()
- self.assertFalse(o.pickled)
- f = self.executor.submit(inout, o)
- self.assertRaises(ZeroDivisionError, f.result)
- self.assertTrue(o.pickled)
-
- f = self.executor.submit(BadUnpickle)
- self.assertRaises(ZeroDivisionError, f.result)
-
- f = self.executor.submit(abs, 42)
- self.assertEqual(f.result(), 42)
-
-
-class MPICommExecutorTest(unittest.TestCase):
-
- MPICommExecutor = futures.MPICommExecutor
-
- def test_default(self):
- with self.MPICommExecutor() as executor:
- if executor is not None:
- executor.bootup()
- future1 = executor.submit(time.sleep, 0)
- future2 = executor.submit(time.sleep, 0)
- executor.shutdown()
- self.assertEqual(None, future1.result())
- self.assertEqual(None, future2.result())
-
- def test_self(self):
- with self.MPICommExecutor(MPI.COMM_SELF) as executor:
- future = executor.submit(time.sleep, 0)
- self.assertEqual(None, future.result())
- self.assertEqual(None, future.exception())
-
- future = executor.submit(sleep_and_raise, 0)
- self.assertRaises(Exception, future.result)
- self.assertEqual(Exception, type(future.exception()))
-
- list(executor.map(time.sleep, [0, 0]))
- list(executor.map(time.sleep, [0, 0], timeout=1))
- iterator = executor.map(time.sleep, [0.1, 0], timeout=0)
- self.assertRaises(futures.TimeoutError, list, iterator)
-
- def test_args(self):
- with self.MPICommExecutor(MPI.COMM_SELF) as executor:
- self.assertTrue(executor is not None)
- with self.MPICommExecutor(MPI.COMM_SELF, 0) as executor:
- self.assertTrue(executor is not None)
-
- def test_kwargs(self):
- with self.MPICommExecutor(comm=MPI.COMM_SELF) as executor:
- self.assertTrue(executor is not None)
- with self.MPICommExecutor(comm=MPI.COMM_SELF, root=0) as executor:
- self.assertTrue(executor is not None)
-
- @unittest.skipIf(SHARED_POOL, 'shared-pool')
- def test_arg_root(self):
- comm = MPI.COMM_WORLD
- rank = comm.Get_rank()
- for root in range(comm.Get_size()):
- with self.MPICommExecutor(comm, root) as executor:
- if rank != root:
- self.assertTrue(executor is None)
- with self.MPICommExecutor(root=root) as executor:
- if rank != root:
- self.assertTrue(executor is None)
-
- def test_arg_root_bad(self):
- size = MPI.COMM_WORLD.Get_size()
- self.assertRaises(ValueError, self.MPICommExecutor, root=-size)
- self.assertRaises(ValueError, self.MPICommExecutor, root=-1)
- self.assertRaises(ValueError, self.MPICommExecutor, root=+size)
-
- @unittest.skipIf(SHARED_POOL, 'shared-pool')
- def test_arg_comm_bad(self):
- if MPI.COMM_WORLD.Get_size() == 1:
- return
- intercomm = futures._lib.comm_split(MPI.COMM_WORLD)
- try:
- self.assertRaises(ValueError, self.MPICommExecutor, intercomm)
- finally:
- intercomm.Free()
-
- def test_with_bad(self):
- mpicommexecutor = self.MPICommExecutor(MPI.COMM_SELF)
- with mpicommexecutor as executor:
- try:
- with mpicommexecutor:
- pass
- except RuntimeError:
- pass
- else:
- self.fail('expected RuntimeError')
-
-
-from mpi4pyve.futures.aplus import ThenableFuture
-
-class ThenTest(unittest.TestCase):
-
- assert_ = unittest.TestCase.assertTrue
-
- def test_not_done(self):
-
- base_f = ThenableFuture()
- new_f = base_f.then()
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f._invoke_callbacks()
- self.assert_(new_f.cancelled())
-
- def test_cancel(self):
-
- base_f = ThenableFuture()
- new_f = base_f.then()
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.cancel()
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(base_f.cancelled())
- self.assert_(new_f.cancelled())
-
- def test_then_multiple(self):
-
- base_f = ThenableFuture()
- new_f1 = base_f.then()
- new_f2 = base_f.then()
- new_f3 = base_f.then()
-
- self.assert_(base_f is not new_f1)
- self.assert_(base_f is not new_f2)
- self.assert_(base_f is not new_f3)
- self.assert_(not base_f.done())
- self.assert_(not new_f1.done())
- self.assert_(not new_f2.done())
- self.assert_(not new_f3.done())
-
- base_f.set_result('done')
- self.assert_(base_f.done())
- self.assert_(new_f1.done())
- self.assert_(new_f2.done())
- self.assert_(new_f3.done())
-
- self.assert_(not new_f1.exception())
- self.assert_(not new_f2.exception())
- self.assert_(not new_f3.exception())
- self.assert_(new_f1.result() == 'done')
- self.assert_(new_f2.result() == 'done')
- self.assert_(new_f3.result() == 'done')
-
- def test_no_callbacks_and_success(self):
-
- base_f = ThenableFuture()
- new_f = base_f.then()
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_result('done')
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(not new_f.exception())
- self.assert_(new_f.result() == 'done')
-
- def test_no_callbacks_and_failure(self):
-
- class MyException(Exception):
- pass
-
- base_f = ThenableFuture()
- new_f = base_f.then()
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_exception(MyException('sad'))
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(new_f.exception())
- with self.assertRaises(MyException) as catcher:
- new_f.result()
- self.assert_(catcher.exception.args[0] == 'sad')
-
- def test_success_callback_and_success(self):
-
- base_f = ThenableFuture()
- new_f = base_f.then(lambda result: result + ' manipulated')
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_result('done')
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(not new_f.exception())
- self.assert_(new_f.result() == 'done manipulated')
-
- def test_err_callback_and_failure_repackage(self):
-
- class MyException(Exception):
- pass
-
- class MyRepackagedException(Exception):
- pass
-
- class NotMatched(Exception):
- pass
-
- def on_failure(ex):
- if isinstance(ex, MyException):
- return MyRepackagedException(ex.args[0] + ' repackaged')
- else:
- return NotMatched('?')
-
- base_f = ThenableFuture()
- new_f = base_f.then(None, on_failure)
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_exception(MyException('sad'))
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(new_f.exception())
- with self.assertRaises(MyRepackagedException) as catcher:
- new_f.result()
- self.assert_(catcher.exception.args[0] == 'sad repackaged')
-
- def test_err_callback_and_failure_raised(self):
-
- class MyException(Exception):
- pass
-
- class MyRepackagedException(Exception):
- pass
-
- def raise_something_else(ex):
- raise MyRepackagedException(ex.args[0] + ' repackaged')
-
- base_f = ThenableFuture()
- new_f = base_f.then(None, raise_something_else)
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_exception(MyException('sad'))
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(new_f.exception())
- with self.assertRaises(MyRepackagedException) as catcher:
- new_f.result()
- self.assert_(catcher.exception.args[0] == 'sad repackaged')
-
- def test_err_callback_convert_to_success(self):
-
- class MyException(Exception):
- pass
-
- class NotMatched(Exception):
- pass
-
- def on_failure(ex):
- if isinstance(ex, MyException):
- return ex.args[0] + ' repackaged'
- else:
- return NotMatched('?')
-
- base_f = ThenableFuture()
- new_f = base_f.catch(on_failure)
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_exception(MyException('sad'))
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(not new_f.exception())
- self.assert_(new_f.result() == 'sad repackaged')
-
- def test_err_catch_ignore(self):
-
- base_f = ThenableFuture()
- new_f = base_f.catch()
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_exception(Exception('sad'))
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(new_f.exception() is None)
- self.assert_(new_f.result() is None)
-
- def test_success_callback_and_failure_raised(self):
-
- class MyException(Exception):
- pass
-
- def raise_something_else(value):
- raise MyException(value + ' repackaged')
-
- base_f = ThenableFuture()
- new_f = base_f.then(raise_something_else)
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_result('sad')
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(new_f.exception())
- with self.assertRaises(MyException) as catcher:
- new_f.result()
- assert catcher.exception.args[0] == 'sad repackaged'
-
- def test_chained_success_callback_and_success(self):
-
- def transform(value):
- f = ThenableFuture()
- if value < 5:
- f.set_result(transform(value+1))
- else:
- f.set_result(value)
- return f
-
- base_f = ThenableFuture()
- new_f = base_f.then(transform)
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_result(1)
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(not new_f.exception())
- self.assert_(new_f.result() == 5)
-
- def test_detect_circular_chains(self):
-
- f1 = ThenableFuture()
- f2 = ThenableFuture()
- chain = [f1, f2, f1]
-
- def transform(a):
- try:
- f = chain.pop(0)
- f.set_result(transform(a))
- return f
- except IndexError:
- return 42
-
- base_f = ThenableFuture()
- new_f = base_f.then(transform)
-
- self.assert_(base_f is not new_f)
- self.assert_(not base_f.done())
- self.assert_(not new_f.done())
-
- base_f.set_result(1)
- self.assert_(base_f.done())
- self.assert_(new_f.done())
-
- self.assert_(new_f.exception())
- with self.assertRaises(RuntimeError) as catcher:
- new_f.result()
- assert 'Circular future chain detected' in catcher.exception.args[0]
-
-
-SKIP_POOL_TEST = False
-name, version = MPI.get_vendor()
-if name == 'Open MPI':
- if version < (3,0,0):
- SKIP_POOL_TEST = True
- if version == (4,0,0):
- SKIP_POOL_TEST = True
-if name == 'MPICH':
- if MPI.COMM_WORLD.Get_attr(MPI.APPNUM) is None:
- SKIP_POOL_TEST = True
-if name == 'MVAPICH2':
- SKIP_POOL_TEST = True
-if name == 'MPICH2':
- if MPI.COMM_WORLD.Get_attr(MPI.APPNUM) is None:
- SKIP_POOL_TEST = True
-if name == 'Microsoft MPI':
- if version < (8,1,0):
- SKIP_POOL_TEST = True
- if MPI.COMM_WORLD.Get_attr(MPI.APPNUM) is None:
- SKIP_POOL_TEST = True
-if name == 'Platform MPI':
- SKIP_POOL_TEST = True
-if MPI.Get_version() < (2,0):
- SKIP_POOL_TEST = True
-
-
-if SHARED_POOL:
- del MPICommExecutorTest.test_arg_root
- del MPICommExecutorTest.test_arg_comm_bad
- del ProcessPoolInitTest.test_init_globals
- if WORLD_SIZE == 1:
- del ProcessPoolInitTest.test_run_name
- del ProcessPoolPickleTest
-elif WORLD_SIZE > 1 or SKIP_POOL_TEST:
- del ProcessPoolInitTest
- del ProcessPoolBootupTest
- del ProcessPoolShutdownTest
- del ProcessPoolWaitTest
- del ProcessPoolAsCompletedTest
- del ProcessPoolExecutorTest
- del ProcessPoolSubmitTest
- del ProcessPoolPickleTest
-
-
-if __name__ == '__main__':
- unittest.main()
diff --git a/demo/futures/test_service.py b/demo/futures/test_service.py
deleted file mode 100644
index fcab708..0000000
--- a/demo/futures/test_service.py
+++ /dev/null
@@ -1,31 +0,0 @@
-import sys
-from mpi4pyve.futures import MPIPoolExecutor
-
-
-def main():
- def getarg(opt, default=None):
- try:
- return sys.argv[sys.argv.index('--'+opt)+1]
- except ValueError:
- return default
-
- options = {}
- if '--host' in sys.argv or '--port' in sys.argv:
- service = (getarg('host'), getarg('port'))
- else:
- service = getarg('service')
- if '--info' in sys.argv:
- info = getarg('info').split(',')
- info = dict(entry.split('=') for entry in info if entry)
- else:
- info = None
-
- with MPIPoolExecutor(service=service, mpi_info=info) as executor:
- fut1 = executor.submit(abs, +42)
- fut2 = executor.submit(abs, -42)
- assert fut1.result(0) == 42
- assert fut2.result(0) == 42
-
-
-if __name__ == '__main__':
- main()
diff --git a/demo/futures/test_service.sh b/demo/futures/test_service.sh
deleted file mode 100755
index c9ce451..0000000
--- a/demo/futures/test_service.sh
+++ /dev/null
@@ -1,34 +0,0 @@
-#!/bin/bash
-
-PYTHON=${1-${PYTHON-python}}
-MPIEXEC=${MPIEXEC-mpiexec}
-testdir=$(dirname "$0")
-
-set -e
-
-if [ $(command -v mpichversion) ]; then
- $MPIEXEC -n 1 $PYTHON -m mpi4pyve.futures.server --xyz > /dev/null 2>&1 || true
- $MPIEXEC -n 2 $PYTHON -m mpi4pyve.futures.server --bind localhost &
- mpi4pyveserver=$!; sleep 0.25;
- $MPIEXEC -n 1 $PYTHON $testdir/test_service.py --host localhost
- wait $mpi4pyveserver
- $MPIEXEC -n 2 $PYTHON -m mpi4pyve.futures.server --port 31414 --info "a=x,b=y" &
- mpi4pyveserver=$!; sleep 0.25;
- $MPIEXEC -n 1 $PYTHON $testdir/test_service.py --port 31414 --info "a=x,b=y"
- wait $mpi4pyveserver
-fi
-
-if [ $(command -v mpichversion) ] && [ $(command -v hydra_nameserver) ]; then
- hydra_nameserver &
- nameserver=$!; sleep 0.25;
- $MPIEXEC -nameserver localhost -n 2 $PYTHON -m mpi4pyve.futures.server &
- mpi4pyveserver=$!; sleep 0.25;
- $MPIEXEC -nameserver localhost -n 1 $PYTHON $testdir/test_service.py
- wait $mpi4pyveserver
- $MPIEXEC -nameserver localhost -n 2 $PYTHON -m mpi4pyve.futures.server --service test-service &
- mpi4pyveserver=$!; sleep 0.25;
- $MPIEXEC -nameserver localhost -n 1 $PYTHON $testdir/test_service.py --service test-service
- wait $mpi4pyveserver
- kill -TERM $nameserver
- wait $nameserver 2>/dev/null || true
-fi
diff --git a/demo/gemm/README.rst b/demo/gemm/README.rst
new file mode 100644
index 0000000..29547ff
--- /dev/null
+++ b/demo/gemm/README.rst
@@ -0,0 +1,27 @@
+Performs the matrix-matrix operations:
+
+ C = A * B
+
+where A, B, and C are n by n matrices.
+
+This example must satisfy the following conditions:
+
+ 1. int(sqrt(nproc)) * int(sqrt(nproc)) == nproc
+ 2. n % int(sqrt(nproc)) == 0
+
+Note that this example is not fully optimized for SX-Aurora TSUBASA;
+it is only a prototype that demonstrates gemm across multiple processes.
+A quick check of the two conditions above is sketched below.
+
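+A minimal check in plain Python (the ``check_launch`` helper is
+hypothetical, not part of this demo)::
+
+    import math
+
+    def check_launch(nproc, n):
+        # condition 1: nproc must be a perfect square
+        order = int(math.sqrt(nproc))
+        if order * order != nproc:
+            return False
+        # condition 2: n must be evenly divisible by sqrt(nproc)
+        return n % order == 0
+
+    assert check_launch(4, 10000)  # 2x2 process grid, 5000x5000 blocks
+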
+Issuing at the command line for VH::
+
+ $ mpirun -veo -np 4 python gemm.py -dev vh -dtype float -n 10000
+ {'dev': 'vh', 'dtype': 'float', 'n': 10000}
+ elapsed: 4.292237043380737 [sec], GFLOPS: 465.9574901820241
+ result OK
+
+Issuing at the command line for VE::
+
+ $ VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 4 python gemm.py -dev ve -dtype float -n 10000
+ {'dev': 've', 'dtype': 'float', 'n': 10000}
+ elapsed: 0.17874383926391602 [sec], TFLOPS: 11.189196831824741
+ result OK
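+
+The reported rate follows from the standard dense matrix-matrix operation
+count used in ``gemm.py``, ``flops = 2 * n**3 / elapsed``; for the VH run
+above this gives 2 * 10000**3 / 4.29, roughly 466 GFLOPS, matching the
+output shown.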
diff --git a/demo/gemm/gemm.py b/demo/gemm/gemm.py
new file mode 100644
index 0000000..5d3107b
--- /dev/null
+++ b/demo/gemm/gemm.py
@@ -0,0 +1,190 @@
+from mpi4pyve import MPI
+from mpi4pyve import util
+import argparse
+import math
+
+DTYPE = 'float32'
+MPI_DTYPE = MPI.FLOAT
+ROOT = 0
+
+class Grid:
+ def __init__(self, nproc, order, cart_comm, row_comm, col_comm, row_pos,
+ col_pos, rank):
+ self.nproc = nproc
+ self.order = order
+ self.cart_comm = cart_comm
+ self.row_comm = row_comm
+ self.col_comm = col_comm
+ self.row_pos = row_pos
+ self.col_pos = col_pos
+ self.rank = rank
+
+ def __str__(self):
+ return "Grid Process ".format(
+ self.row_pos, self.col_pos, self.rank)
+
+
+def create_grid_process(nproc):
+ dims = MPI.Compute_dims(nproc, 2)
+ if dims[0] != dims[1]:
+ raise ValueError('the number of processes is not a perfect square')
+ cart_comm = comm.Create_cart(dims, periods=[True, True], reorder=True)
+ row_comm = cart_comm.Sub([0, 1])
+ col_comm = cart_comm.Sub([1, 0])
+ coords = cart_comm.coords
+ grid = Grid(nproc, dims[0], cart_comm, row_comm, col_comm, coords[0], coords[1],
+ cart_comm.Get_rank())
+ return grid
+
+def create_block_datatype(sizes, subsizes, grid, itemsize):
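+ # Describe one (n_d x n_d) block of the full (n x n) matrix as an MPI
+ # subarray type, then shrink its extent to one block width (n_d elements)
+ # so the integer displacements passed to Scatterv/Gatherv step block-wise.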
+ assert subsizes[0] * grid.order == sizes[0]
+ assert subsizes[1] * grid.order == sizes[1]
+ starts = [0, 0]
+ block_type = MPI.Datatype(MPI_DTYPE).Create_subarray(
+ sizes, subsizes, starts, order=MPI.ORDER_C)
+ resized_type = MPI.Datatype(block_type).Create_resized(
+ 0, subsizes[1] * itemsize)
+ resized_type.Commit()
+ return resized_type
+
+def scatter_matrix(root_mat, local_mat, grid, n_d, block_type):
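+ # One block per process: displacements are counted in units of the resized
+ # block extent (n_d elements), so grid row i starts at i * order * n_d and
+ # each block within that row advances the displacement by 1.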
+ sendcount = [1 for i in range(grid.nproc)]
+ recvcount = local_mat.size
+ displs = []
+ offset = 0
+ for i in range(grid.order):
+ offset = i * grid.order * n_d
+ for j in range(grid.order):
+ displs.append(offset)
+ offset += 1
+ sendbuf = [root_mat, sendcount, displs, block_type]
+ recvbuf = [local_mat, recvcount, MPI_DTYPE]
+ grid.cart_comm.Scatterv(sendbuf, recvbuf, root=ROOT)
+
+def gather_matrix(root_mat, local_mat, grid, n_d, block_type):
+ recvcount = [1 for i in range(grid.nproc)]
+ sendcount = local_mat.size
+ displs = []
+ offset = 0
+ for i in range(grid.order):
+ offset = i * grid.order * n_d
+ for j in range(grid.order):
+ displs.append(offset)
+ offset += 1
+ sendbuf = [local_mat, sendcount, MPI_DTYPE]
+ recvbuf = [root_mat, recvcount, displs, block_type]
+ grid.cart_comm.Gatherv(sendbuf, recvbuf, root=ROOT)
+
+def matmul(local_A, local_B, local_C, grid):
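+ # All-gather style exchange: every process posts nonblocking sends of its
+ # A block along its process row and its B block along its process column,
+ # then accumulates the partial products local_A[k] @ local_B[k] into C.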
+ for i in range(grid.order - 1):
+ peer_send = (grid.col_pos + i + 1) % grid.order
+ grid.row_comm.Isend(local_A[grid.col_pos], peer_send)
+ A_recvreqs = [None for i in range(grid.order)]
+ for i in range(grid.order - 1):
+ peer_recv = (grid.col_pos - i - 1 + grid.order) % grid.order
+ req = grid.row_comm.Irecv(local_A[peer_recv], peer_recv)
+ A_recvreqs[peer_recv] = req
+ for i in range(grid.order - 1):
+ peer_send = (grid.row_pos + i + 1) % grid.order
+ grid.col_comm.Isend(local_B[grid.row_pos], peer_send)
+ B_recvreqs = [None for i in range(grid.order)]
+ for i in range(grid.order - 1):
+ peer_recv = (grid.row_pos - i - 1 + grid.order) % grid.order
+ req = grid.col_comm.Irecv(local_B[peer_recv], peer_recv)
+ B_recvreqs[peer_recv] = req
+ for i in range(grid.order):
+ idx = i
+ if A_recvreqs[idx]: A_recvreqs[idx].wait()
+ if B_recvreqs[idx]: B_recvreqs[idx].wait()
+ local_C += local_A[idx] @ local_B[idx]
+
+def scaling(flops):
+ units = [
+ [1e12, 'TFLOPS'], [1e9, 'GFLOPS'], [1e6, 'MFLOPS'],
+ [1e3, 'KFLOPS'], [1, 'FLOPS']]
+ for scale, unit in units:
+ if flops >= scale:
+ break
+ return unit, flops / scale
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser()
+ parser.add_argument('-dev', type=str, required=True, choices=['vh', 've'],
+ help='Execution device')
+ parser.add_argument('-dtype', type=str, required=True, choices=['float', 'double'],
+ help='Execution data type')
+ parser.add_argument('-n', type=int, required=False, default=10,
+ help='the number of row and col')
+ args = parser.parse_args()
+
+ comm = MPI.COMM_WORLD
+ nproc = comm.Get_size()
+ rank = comm.Get_rank()
+ if rank == 0: print(vars(args))
+
+ # parse arguments
+ if args.dev == 'vh':
+ import numpy as np
+ dev = np
+ elif args.dev == 've':
+ import nlcpy as vp
+ dev = vp
+ if args.dtype == 'float':
+ DTYPE = 'f4'
+ MPI_DTYPE = MPI.FLOAT
+ elif args.dtype == 'double':
+ DTYPE = 'f8'
+ MPI_DTYPE = MPI.DOUBLE
+ else:
+ raise ValueError
+ n = args.n
+
+ # create grid process
+ grid = create_grid_process(nproc)
+
+ # create matrix A, B, C
+ rng = dev.random.default_rng()
+ if grid.rank == ROOT:
+ A = rng.random((n, n), dtype=DTYPE)
+ B = rng.random((n, n), dtype=DTYPE)
+ C = dev.zeros((n, n), dtype=DTYPE)
+ else:
+ A = None
+ B = None
+ C = None
+ if n % grid.order != 0:
+ raise ValueError('n is not evenly divisible by sqrt(nproc)')
+
+ # create local matrix for computation
+ n_d = n // grid.order
+ local_A = [dev.zeros((n_d, n_d), dtype=DTYPE) for _ in range(grid.order)]
+ local_B = [dev.zeros((n_d, n_d), dtype=DTYPE) for _ in range(grid.order)]
+ local_C = dev.zeros((n_d, n_d), dtype=DTYPE)
+ block_type = create_block_datatype(
+ (n, n), (n_d, n_d), grid, dev.dtype(DTYPE).itemsize)
+ scatter_matrix(A, local_A[grid.row_pos], grid, n_d, block_type)
+ scatter_matrix(B, local_B[grid.col_pos], grid, n_d, block_type)
+
+ # execute matmul
+ if dev.__name__ == 'nlcpy':
+ dev.request.flush()
+ grid.cart_comm.Barrier()
+ t0 = MPI.Wtime()
+ matmul(local_A, local_B, local_C, grid)
+ if dev.__name__ == 'nlcpy':
+ dev.request.flush()
+ grid.cart_comm.Barrier()
+ t1 = MPI.Wtime()
+
+ # result check and show perf
+ gather_matrix(C, local_C, grid, n_d, block_type)
+ if grid.rank == 0:
+ elapsed = t1 - t0
+ flops = 2 * n ** 3 / elapsed
+ print("elapsed: {} [sec], {}: {}".format(elapsed, *scaling(flops)))
+ exp = A @ B
+ norm = dev.linalg.norm(C)
+ if dev.all(dev.abs(C - exp) / norm < 1e-4):
+ print("result OK")
+ else:
+ print("result NG")
diff --git a/demo/gemm/makefile b/demo/gemm/makefile
new file mode 100644
index 0000000..4098a7c
--- /dev/null
+++ b/demo/gemm/makefile
@@ -0,0 +1,11 @@
+.PHONY: test
+
+PYTHON=python
+
+N=10000
+DTYPE=float
+
+test:
+ echo "GEMM on VH (4 process)" && mpirun -veo -np 4 ${PYTHON} gemm.py -dev vh -dtype ${DTYPE} -n ${N}
+ echo "GEMM on 1VE (1 process)" && VE_NLCPY_NODELIST=0 mpirun -veo -np 1 ${PYTHON} gemm.py -dev ve -dtype ${DTYPE} -n ${N}
+ echo "GEMM on 4VE (4 process)" && VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 4 ${PYTHON} gemm.py -dev ve -dtype ${DTYPE} -n ${N}
diff --git a/demo/gemv/Makefile b/demo/gemv/Makefile
new file mode 100644
index 0000000..4e9cc5d
--- /dev/null
+++ b/demo/gemv/Makefile
@@ -0,0 +1,8 @@
+.PHONY: test
+
+MPIEXEC=mpiexec -veo -n 1
+PYTHON=python
+
+test:
+ echo 100 | ${MPIEXEC} ${PYTHON} cpi-cco.py
+ echo 100 | ${MPIEXEC} ${PYTHON} cpi-rma.py
diff --git a/demo/gemv/README.rst b/demo/gemv/README.rst
new file mode 100644
index 0000000..7a0d021
--- /dev/null
+++ b/demo/gemv/README.rst
@@ -0,0 +1,19 @@
+Performs the matrix-vector operations:
+
+ y = A * x
+
+where y is an m vector, x is an n vector, and A is an m by n matrix.
+
+Issuing at the command line for VH::
+
+ $ mpiexec -veo -np 4 python gemv.py -dev vh -dtype float -m 10000 -n 10000 -iter 100
+ {'dev': 'vh', 'dtype': 'float', 'm': 10000, 'n': 10000, 'iter': 100}
+ elapsed: 0.7454090118408203 [sec]
+ Result success
+
+Issuing at the command line for VE::
+
+ $ VE_NLCPY_NODELIST=0,1,2,3 mpiexec -veo -np 4 python gemv.py -dev ve -dtype float -m 10000 -n 10000 -iter 100
+ {'dev': 've', 'dtype': 'float', 'm': 10000, 'n': 10000, 'iter': 100}
+ elapsed: 0.012457132339477539 [sec]
+ Result success
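+
+``gemv.py`` splits the rows of A evenly across the MPI processes and derives
+the Scatterv/Gatherv counts from each rank's half-open row range. A minimal
+sketch of that partitioning (the ``row_range`` helper is hypothetical; the
+formula is the one used in ``gemv.py``)::
+
+    def row_range(m, size, rank):
+        # rows [m_s, m_e) owned by this rank, balanced to within one row
+        m_s = m * rank // size
+        m_e = m * (rank + 1) // size
+        return m_s, m_e
+
+    # e.g. m=10, size=4 -> (0, 2), (2, 5), (5, 7), (7, 10)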
diff --git a/demo/gemv/gemv.py b/demo/gemv/gemv.py
new file mode 100644
index 0000000..2fa1a17
--- /dev/null
+++ b/demo/gemv/gemv.py
@@ -0,0 +1,96 @@
+from mpi4pyve import MPI
+import numpy as np
+import nlcpy as vp
+import argparse
+
+parser = argparse.ArgumentParser()
+parser.add_argument('-dev', type=str, required=True, choices=['vh', 've'],
+ help='Execution device')
+parser.add_argument('-dtype', type=str, required=True, choices=['float', 'double'],
+ help='Execution data type')
+parser.add_argument('-m', type=int, required=False, default=10,
+ help='Number of rows of matrix A')
+parser.add_argument('-n', type=int, required=False, default=10,
+ help='Number of cols of matrix A and number of vector x')
+parser.add_argument('-iter', type=int, required=False, default=10000,
+ help='Number of iterations for gemv')
+args = parser.parse_args()
+
+# set module
+if args.dev == 'vh':
+ dev = np
+elif args.dev == 've':
+ dev = vp
+else:
+ raise ValueError
+
+comm = MPI.COMM_WORLD
+size = comm.Get_size()
+rank = comm.Get_rank()
+
+if rank == 0: print(vars(args))
+
+m = args.m
+n = args.n
+if args.dtype == 'float':
+ dtype = 'f4'
+ mpi_dtype = MPI.FLOAT
+elif args.dtype == 'double':
+ dtype = 'f8'
+ mpi_dtype = MPI.DOUBLE
+else:
+ raise ValueError
+
+# estimate send/recv count and displacement
+A_count = [0 for _ in range(size)]
+A_displ = [0 for _ in range(size)]
+y_count = [0 for _ in range(size)]
+y_displ = [0 for _ in range(size)]
+for i in range(size):
+ m_s = m * i // size
+ m_e = m * (i + 1) // size
+ if i == rank:
+ m_d = m_e - m_s
+ A_count[i] = (m_e - m_s) * n
+ A_displ[i] = m_s * n
+ y_count[i] = (m_e - m_s)
+ y_displ[i] = m_s
+
+# create matrix A and vector x
+rng = dev.random.default_rng()
+if rank == 0:
+ A = rng.random((m, n), dtype=dtype)
+ x = rng.random(n, dtype=dtype)
+else:
+ A = None
+ x = dev.zeros(n, dtype=dtype)
+A_local = dev.empty((m_d, n), dtype=dtype)
+comm.Scatterv([A, A_count, A_displ, mpi_dtype], [A_local, A_count[rank], mpi_dtype], root=0) # distribute the row blocks of matrix A to every process.
+comm.Bcast(x, root=0) # all processes share the same vector x.
+
+# execute gemv
+if dev.__name__ == 'nlcpy':
+ dev.request.flush()
+comm.Barrier()
+t0 = MPI.Wtime()
+for _ in range(args.iter):
+ y_local = A_local @ x # local gemv
+if dev.__name__ == 'nlcpy':
+ dev.request.flush()
+comm.Barrier()
+t1 = MPI.Wtime()
+
+# gather local vector y into root process
+if rank == 0:
+ y = dev.empty(m, dtype=dtype)
+else:
+ y = None
+comm.Gatherv([y_local, y_count[rank], mpi_dtype], [y, y_count, y_displ, mpi_dtype], root=0)
+
+if rank == 0:
+ print("elapsed:", t1 - t0, "[sec]")
+
+# result check
+if rank == 0:
+ res = dev.all(dev.abs(y - A @ x) / y < 1e-4)
+ print("Result {}".format("success" if res else "failed"))
diff --git a/demo/gemv/makefile b/demo/gemv/makefile
new file mode 100644
index 0000000..b23a5b8
--- /dev/null
+++ b/demo/gemv/makefile
@@ -0,0 +1,17 @@
+.PHONY: test
+
+PYTHON=python
+
+M=10000
+N=10000
+DTYPE=float
+ITER=100
+
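+# VE_NLCPY_NODELIST selects the VE nodes available to NLCPy; the final
+# case places 8 processes on each of the 4 VEs.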
+test:
+ echo "GEMV on VH ( 1 process)" && mpirun -veo -np 1 ${PYTHON} gemv.py -dev vh -dtype ${DTYPE} -m ${M} -n ${N} -iter ${ITER}
+ echo "GEMV on VH ( 2 process)" && mpirun -veo -np 2 ${PYTHON} gemv.py -dev vh -dtype ${DTYPE} -m ${M} -n ${N} -iter ${ITER}
+ echo "GEMV on VH ( 4 process)" && mpirun -veo -np 4 ${PYTHON} gemv.py -dev vh -dtype ${DTYPE} -m ${M} -n ${N} -iter ${ITER}
+ echo "GEMV on 1VE ( 1 process)" && VE_NLCPY_NODELIST=0 mpirun -veo -np 1 ${PYTHON} gemv.py -dev ve -dtype ${DTYPE} -m ${M} -n ${N} -iter ${ITER}
+ echo "GEMV on 2VE ( 2 process)" && VE_NLCPY_NODELIST=0,1 mpirun -veo -np 2 ${PYTHON} gemv.py -dev ve -dtype ${DTYPE} -m ${M} -n ${N} -iter ${ITER}
+ echo "GEMV on 4VE ( 4 process)" && VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 4 ${PYTHON} gemv.py -dev ve -dtype ${DTYPE} -m ${M} -n ${N} -iter ${ITER}
+ echo "GEMV on 4VE (32 process)" && VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 32 ${PYTHON} gemv.py -dev ve -dtype ${DTYPE} -m ${M} -n ${N} -iter ${ITER}
diff --git a/demo/helloworld.c b/demo/helloworld.c
deleted file mode 100644
index f19ece7..0000000
--- a/demo/helloworld.c
+++ /dev/null
@@ -1,32 +0,0 @@
-#include
-#include
-
-int main(int argc, char *argv[])
-{
- int size, rank, len;
- char name[MPI_MAX_PROCESSOR_NAME];
-
-#if defined(MPI_VERSION) && (MPI_VERSION >= 2)
- int provided;
- MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
-#else
- MPI_Init(&argc, &argv);
-#endif
-
- MPI_Comm_size(MPI_COMM_WORLD, &size);
- MPI_Comm_rank(MPI_COMM_WORLD, &rank);
- MPI_Get_processor_name(name, &len);
-
- printf("Hello, World! I am process %d of %d on %s.\n", rank, size, name);
-
- MPI_Finalize();
- return 0;
-}
-
-/*
- * Local Variables:
- * mode: C
- * c-basic-offset: 2
- * indent-tabs-mode: nil
- * End:
-*/
diff --git a/demo/helloworld.cxx b/demo/helloworld.cxx
deleted file mode 100644
index e18066e..0000000
--- a/demo/helloworld.cxx
+++ /dev/null
@@ -1,32 +0,0 @@
-#include
-#include
-
-int main(int argc, char *argv[])
-{
-#if defined(MPI_VERSION) && (MPI_VERSION >= 2)
- MPI::Init_thread(MPI_THREAD_MULTIPLE);
-#else
- MPI::Init();
-#endif
-
- int size = MPI::COMM_WORLD.Get_size();
- int rank = MPI::COMM_WORLD.Get_rank();
- int len; char name[MPI_MAX_PROCESSOR_NAME];
- MPI::Get_processor_name(name, len);
-
- std::cout <<
- "Hello, World! " <<
- "I am process " << rank <<
- " of " << size <<
- " on " << name <<
- "." << std::endl;
-
- MPI::Finalize();
- return 0;
-}
-
-// Local Variables:
-// mode: C++
-// c-basic-offset: 2
-// indent-tabs-mode: nil
-// End:
diff --git a/demo/helloworld.f08 b/demo/helloworld.f08
deleted file mode 100644
index 2a62542..0000000
--- a/demo/helloworld.f08
+++ /dev/null
@@ -1,23 +0,0 @@
-program main
-
- use mpi_f08
- implicit none
-
- integer :: provided, size, rank, len
- character (len=MPI_MAX_PROCESSOR_NAME) :: name
-
- call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided)
-
- call MPI_Comm_rank(MPI_COMM_WORLD, rank)
- call MPI_Comm_size(MPI_COMM_WORLD, size)
- call MPI_Get_processor_name(name, len)
-
- write(*, '(2A,I2,A,I2,3A)') &
- 'Hello, World! ', &
- 'I am process ', rank, &
- ' of ', size, &
- ' on ', name(1:len), '.'
-
- call MPI_Finalize()
-
-end program main
diff --git a/demo/helloworld.f90 b/demo/helloworld.f90
deleted file mode 100644
index 6c454bf..0000000
--- a/demo/helloworld.f90
+++ /dev/null
@@ -1,23 +0,0 @@
-program main
-
- use mpi
- implicit none
-
- integer :: provided, ierr, size, rank, len
- character (len=MPI_MAX_PROCESSOR_NAME) :: name
-
- call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
-
- call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
- call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)
- call MPI_Get_processor_name(name, len, ierr)
-
- write(*, '(2A,I2,A,I2,3A)') &
- 'Hello, World! ', &
- 'I am process ', rank, &
- ' of ', size, &
- ' on ', name(1:len), '.'
-
- call MPI_Finalize(ierr)
-
-end program main
diff --git a/demo/init-fini/makefile b/demo/init-fini/makefile
deleted file mode 100644
index 5684e34..0000000
--- a/demo/init-fini/makefile
+++ /dev/null
@@ -1,15 +0,0 @@
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=3
-
-PYTHON=python
-
-.PHONY: test
-test:
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_0.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_1.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_2a.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_2b.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_3.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_4.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_5.py
diff --git a/demo/init-fini/runtests.bat b/demo/init-fini/runtests.bat
deleted file mode 100644
index cf929c7..0000000
--- a/demo/init-fini/runtests.bat
+++ /dev/null
@@ -1,13 +0,0 @@
-@echo off
-setlocal ENABLEEXTENSIONS
-
-set PYTHON=python
-
-@echo on
-%PYTHON% test_0.py
-%PYTHON% test_1.py
-%PYTHON% test_2a.py
-%PYTHON% test_2b.py
-%PYTHON% test_3.py
-%PYTHON% test_4.py
-%PYTHON% test_5.py
diff --git a/demo/init-fini/runtests.sh b/demo/init-fini/runtests.sh
deleted file mode 100755
index 6858d75..0000000
--- a/demo/init-fini/runtests.sh
+++ /dev/null
@@ -1,16 +0,0 @@
-#!/bin/sh
-
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=3
-
-PYTHON=python
-
-set -x
-$MPIEXEC $NP_FLAG $NP $PYTHON test_0.py
-$MPIEXEC $NP_FLAG $NP $PYTHON test_1.py
-$MPIEXEC $NP_FLAG $NP $PYTHON test_2a.py
-$MPIEXEC $NP_FLAG $NP $PYTHON test_2b.py
-$MPIEXEC $NP_FLAG $NP $PYTHON test_3.py
-$MPIEXEC $NP_FLAG $NP $PYTHON test_4.py
-$MPIEXEC $NP_FLAG $NP $PYTHON test_5.py
diff --git a/demo/init-fini/test_0.py b/demo/init-fini/test_0.py
deleted file mode 100644
index cb44a5f..0000000
--- a/demo/init-fini/test_0.py
+++ /dev/null
@@ -1,2 +0,0 @@
-from mpi4pyve import rc
-from mpi4pyve import MPI
diff --git a/demo/init-fini/test_1.py b/demo/init-fini/test_1.py
deleted file mode 100644
index b6eb3af..0000000
--- a/demo/init-fini/test_1.py
+++ /dev/null
@@ -1,14 +0,0 @@
-from mpi4pyve import rc
-rc.initialize = False
-
-from mpi4pyve import MPI
-assert not MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-MPI.Init()
-assert MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-MPI.Finalize()
-assert MPI.Is_initialized()
-assert MPI.Is_finalized()
diff --git a/demo/init-fini/test_2a.py b/demo/init-fini/test_2a.py
deleted file mode 100644
index de88275..0000000
--- a/demo/init-fini/test_2a.py
+++ /dev/null
@@ -1,14 +0,0 @@
-from mpi4pyve import rc
-rc.initialize = False
-
-from mpi4pyve import MPI
-assert not MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-MPI.Init_thread(MPI.THREAD_MULTIPLE)
-assert MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-MPI.Finalize()
-assert MPI.Is_initialized()
-assert MPI.Is_finalized()
diff --git a/demo/init-fini/test_2b.py b/demo/init-fini/test_2b.py
deleted file mode 100644
index a7e3548..0000000
--- a/demo/init-fini/test_2b.py
+++ /dev/null
@@ -1,21 +0,0 @@
-from mpi4pyve import rc
-rc.initialize = False
-
-from mpi4pyve import MPI
-assert not MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-MPI.Init_thread()
-assert MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-import sys
-name, _ = MPI.get_vendor()
-if name == 'MPICH':
- assert MPI.Query_thread() == MPI.THREAD_MULTIPLE
-if name == 'MPICH2' and sys.platform[:3] != 'win':
- assert MPI.Query_thread() == MPI.THREAD_MULTIPLE
-
-MPI.Finalize()
-assert MPI.Is_initialized()
-assert MPI.Is_finalized()
diff --git a/demo/init-fini/test_3.py b/demo/init-fini/test_3.py
deleted file mode 100644
index 8a7d4a1..0000000
--- a/demo/init-fini/test_3.py
+++ /dev/null
@@ -1,6 +0,0 @@
-from mpi4pyve import rc
-rc.finalize = False
-
-from mpi4pyve import MPI
-assert MPI.Is_initialized()
-assert not MPI.Is_finalized()
diff --git a/demo/init-fini/test_4.py b/demo/init-fini/test_4.py
deleted file mode 100644
index 03895d8..0000000
--- a/demo/init-fini/test_4.py
+++ /dev/null
@@ -1,10 +0,0 @@
-from mpi4pyve import rc
-rc.finalize = False
-
-from mpi4pyve import MPI
-assert MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-MPI.Finalize()
-assert MPI.Is_initialized()
-assert MPI.Is_finalized()
diff --git a/demo/init-fini/test_5.py b/demo/init-fini/test_5.py
deleted file mode 100644
index 3d8ad9f..0000000
--- a/demo/init-fini/test_5.py
+++ /dev/null
@@ -1,16 +0,0 @@
-from mpi4pyve import rc
-del rc.initialize
-del rc.threads
-del rc.thread_level
-del rc.finalize
-
-from mpi4pyve import MPI
-assert MPI.Is_initialized()
-assert not MPI.Is_finalized()
-
-import sys
-name, _ = MPI.get_vendor()
-if name == 'MPICH':
- assert MPI.Query_thread() == MPI.THREAD_MULTIPLE
-if name == 'MPICH2' and sys.platform[:3] != 'win':
- assert MPI.Query_thread() == MPI.THREAD_MULTIPLE
diff --git a/demo/libmpi-cffi/apigen.py b/demo/libmpi-cffi/apigen.py
deleted file mode 100644
index 5769b7f..0000000
--- a/demo/libmpi-cffi/apigen.py
+++ /dev/null
@@ -1,30 +0,0 @@
-import sys, os.path as p
-wdir = p.abspath(p.dirname(__file__))
-topdir = p.normpath(p.join(wdir, p.pardir, p.pardir))
-srcdir = p.join(topdir, 'src')
-sys.path.insert(0, p.join(topdir, 'conf'))
-
-from mpiscanner import Scanner
-scanner = Scanner()
-libmpi_pxd = p.join(srcdir, 'mpi4pyve', 'libmpi.pxd')
-scanner.parse_file(libmpi_pxd)
-libmpi_h = p.join(wdir, 'libmpi.h')
-scanner.dump_header_h(libmpi_h)
-
-#try:
-# from cStringIO import StringIO
-#except ImportError:
-# from io import StringIO
-#libmpi_h = StringIO()
-#scanner.dump_header_h(libmpi_h)
-#print libmpi_h.read()
-
-libmpi_c = p.join(wdir, 'libmpi.c.in')
-with open(libmpi_c, 'w') as f:
- f.write("""\
-#include
-#include "%(srcdir)s/lib-mpi/config.h"
-#include "%(srcdir)s/lib-mpi/missing.h"
-#include "%(srcdir)s/lib-mpi/fallback.h"
-#include "%(srcdir)s/lib-mpi/compat.h"
-""" % vars())
diff --git a/demo/libmpi-cffi/build.py b/demo/libmpi-cffi/build.py
deleted file mode 100644
index 805f890..0000000
--- a/demo/libmpi-cffi/build.py
+++ /dev/null
@@ -1,54 +0,0 @@
-import os
-import cffi
-
-ffi = cffi.FFI()
-with open("libmpi.c.in") as f:
- ffi.set_source("libmpi", f.read())
-with open("libmpi.h") as f:
- ffi.cdef(f.read())
-
-class mpicompiler(object):
-
- from cffi import ffiplatform
-
- def __init__(self, cc, ld=None):
- self.cc = cc
- self.ld = ld if ld else cc
- self.ffi_compile = self.ffiplatform.compile
-
- def __enter__(self):
- self.ffiplatform.compile = self.compile
-
- def __exit__(self, *args):
- self.ffiplatform.compile = self.ffi_compile
-
- def configure(self, compiler):
- from distutils.util import split_quoted
- from distutils.spawn import find_executable
- def fix_command(command, cmd):
- if not cmd: return
- cmd = split_quoted(cmd)
- exe = find_executable(cmd[0])
- if not exe: return
- command[0] = exe
- command += cmd[1:]
- fix_command(compiler.compiler_so, self.cc)
- fix_command(compiler.linker_so, self.ld)
-
- def compile(self, *args, **kargs):
- from distutils.command import build_ext
- customize_compiler_orig = build_ext.customize_compiler
- def customize_compiler(compiler):
- customize_compiler_orig(compiler)
- self.configure(compiler)
- build_ext.customize_compiler = customize_compiler
- try:
- return self.ffi_compile(*args, **kargs)
- finally:
- build_ext.customize_compiler = customize_compiler_orig
-
-if __name__ == '__main__':
- cc = os.environ.get('MPICC', 'mpicc')
- ld = os.environ.get('MPILD')
- with mpicompiler(cc, ld):
- ffi.compile()
diff --git a/demo/libmpi-cffi/makefile b/demo/libmpi-cffi/makefile
deleted file mode 100644
index 2ff6557..0000000
--- a/demo/libmpi-cffi/makefile
+++ /dev/null
@@ -1,24 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-
-.PHONY: build
-build: libmpi.h libmpi.c.in
- $(PYTHON) build.py
-
-libmpi.h libmpi.c.in:
- $(PYTHON) apigen.py
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-.PHONY: test
-test: build
- $(MPIEXEC) $(NP_FLAG) 5 $(PYTHON) test_helloworld.py
- $(MPIEXEC) $(NP_FLAG) 4 $(PYTHON) test_ringtest.py
- $(MPIEXEC) $(NP_FLAG) 2 $(PYTHON) test_latency.py
-
-.PHONY: clean
-clean:
- $(RM) -r libmpi.*
- $(RM) -r *py[co] __pycache__
diff --git a/demo/libmpi-cffi/test_helloworld.py b/demo/libmpi-cffi/test_helloworld.py
deleted file mode 100644
index 3c3d689..0000000
--- a/demo/libmpi-cffi/test_helloworld.py
+++ /dev/null
@@ -1,23 +0,0 @@
-from libmpi import ffi, lib
-
-NULL = ffi.NULL
-size_p = ffi.new('int*')
-rank_p = ffi.new('int*')
-nlen_p = ffi.new('int*')
-name_p = ffi.new('char[]', lib.MPI_MAX_PROCESSOR_NAME);
-
-lib.MPI_Init(NULL, NULL);
-
-lib.MPI_Comm_size(lib.MPI_COMM_WORLD, size_p)
-lib.MPI_Comm_rank(lib.MPI_COMM_WORLD, rank_p)
-lib.MPI_Get_processor_name(name_p, nlen_p)
-
-size = size_p[0]
-rank = rank_p[0]
-nlen = nlen_p[0]
-name = ffi.string(name_p[0:nlen])
-
-print("Hello, World! I am process %d of %d on %s."
- % (rank, size, name))
-
-lib.MPI_Finalize()
diff --git a/demo/libmpi-cffi/test_latency.py b/demo/libmpi-cffi/test_latency.py
deleted file mode 100644
index 09a00f2..0000000
--- a/demo/libmpi-cffi/test_latency.py
+++ /dev/null
@@ -1,73 +0,0 @@
-# http://mvapich.cse.ohio-state.edu/benchmarks/
-
-from libmpi import ffi, lib
-
-def osu_latency(
- BENCHMARH = "MPI Latency Test",
- skip = 1000,
- loop = 10000,
- skip_large = 10,
- loop_large = 100,
- large_message_size = 8192,
- MAX_MSG_SIZE = 1<<22,
- ):
-
- myid = ffi.new('int*')
- numprocs = ffi.new('int*')
- lib.MPI_Comm_rank(lib.MPI_COMM_WORLD, myid)
- lib.MPI_Comm_size(lib.MPI_COMM_WORLD, numprocs)
- myid = myid[0]
- numprocs = numprocs[0]
-
- if numprocs != 2:
- if myid == 0:
- errmsg = "This test requires exactly two processes"
- else:
- errmsg = None
- raise SystemExit(errmsg)
-
- sbuf = ffi.new('unsigned char[]', MAX_MSG_SIZE)
- rbuf = ffi.new('unsigned char[]', MAX_MSG_SIZE)
- dtype = lib.MPI_BYTE
- tag = 1
- comm = lib.MPI_COMM_WORLD
- status = lib.MPI_STATUS_IGNORE
-
- if myid == 0:
- print ('# %s' % (BENCHMARH,))
- if myid == 0:
- print ('# %-8s%20s' % ("Size [B]", "Latency [us]"))
-
- message_sizes = [0] + [2**i for i in range(30)]
- for size in message_sizes:
- if size > MAX_MSG_SIZE:
- break
- if size > large_message_size:
- skip = skip_large
- loop = loop_large
- iterations = list(range(loop+skip))
- #
- lib.MPI_Barrier(comm)
- if myid == 0:
- for i in iterations:
- if i == skip:
- t_start = lib.MPI_Wtime()
- lib.MPI_Send(sbuf, size, dtype, 1, tag, comm)
- lib.MPI_Recv(rbuf, size, dtype, 1, tag, comm, status)
- t_end = lib.MPI_Wtime()
- elif myid == 1:
- for i in iterations:
- lib.MPI_Recv(rbuf, size, dtype, 0, tag, comm, status)
- lib.MPI_Send(sbuf, size, dtype, 0, tag, comm)
- #
- if myid == 0:
- latency = (t_end - t_start) * 1e6 / (2 * loop)
- print ('%-10d%20.2f' % (size, latency))
-
-def main():
- lib.MPI_Init(ffi.NULL, ffi.NULL)
- osu_latency()
- lib.MPI_Finalize()
-
-if __name__ == '__main__':
- main()
diff --git a/demo/libmpi-cffi/test_ringtest.py b/demo/libmpi-cffi/test_ringtest.py
deleted file mode 100644
index 7b5dc90..0000000
--- a/demo/libmpi-cffi/test_ringtest.py
+++ /dev/null
@@ -1,76 +0,0 @@
-from libmpi import ffi, lib
-
-def ring(comm, count=1, loop=1, skip=0):
-
- size_p = ffi.new('int*')
- rank_p = ffi.new('int*')
- lib.MPI_Comm_size(comm, size_p)
- lib.MPI_Comm_rank(comm, rank_p)
- size = size_p[0]
- rank = rank_p[0]
-
- source = (rank - 1) % size
- dest = (rank + 1) % size
- sbuf = ffi.new('unsigned char[]', [42]*count)
- rbuf = ffi.new('unsigned char[]', [ 0]*count)
-
- iterations = list(range((loop+skip)))
-
- if size == 1:
- for i in iterations:
- if i == skip:
- tic = lib.MPI_Wtime()
- lib.MPI_Sendrecv(sbuf, count, lib.MPI_BYTE, dest, 0,
- rbuf, count, lib.MPI_BYTE, source, 0,
- comm, lib.MPI_STATUS_IGNORE)
- else:
- if rank == 0:
- for i in iterations:
- if i == skip:
- tic = lib.MPI_Wtime()
- lib.MPI_Send(sbuf, count, lib.MPI_BYTE, dest, 0, comm)
- lib.MPI_Recv(rbuf, count, lib.MPI_BYTE, source, 0, comm, lib.MPI_STATUS_IGNORE)
- else:
- sbuf = rbuf
- for i in iterations:
- if i == skip:
- tic = lib.MPI_Wtime()
- lib.MPI_Recv(rbuf, count, lib.MPI_BYTE, source, 0, comm, lib.MPI_STATUS_IGNORE)
- lib.MPI_Send(sbuf, count, lib.MPI_BYTE, dest, 0, comm)
- toc = lib.MPI_Wtime()
- if rank == 0 and ffi.string(sbuf) != ffi.string(rbuf):
- import warnings, traceback
- try:
- warnings.warn("received message does not match!")
- except UserWarning:
- traceback.print_exc()
- lib.MPI_Abort(comm, 2)
- return toc - tic
-
-def ringtest(comm):
-
- size = ( 1 )
- loop = ( 1 )
- skip = ( 0 )
-
- lib.MPI_Barrier(comm)
- elapsed = ring(comm, size, loop, skip)
-
- size_p = ffi.new('int*')
- rank_p = ffi.new('int*')
- lib.MPI_Comm_size(comm, size_p)
- lib.MPI_Comm_rank(comm, rank_p)
- comm_size = size_p[0]
- comm_rank = rank_p[0]
-
- if comm_rank == 0:
- print ("time for %d loops = %g seconds (%d processes, %d bytes)"
- % (loop, elapsed, comm_size, size))
-
-def main():
- lib.MPI_Init(ffi.NULL, ffi.NULL)
- ringtest(lib.MPI_COMM_WORLD)
- lib.MPI_Finalize()
-
-if __name__ == '__main__':
- main()
diff --git a/demo/makefile b/demo/makefile
index cdc77cf..c1c3e34 100644
--- a/demo/makefile
+++ b/demo/makefile
@@ -2,26 +2,7 @@ PYTHON=python
.PHONY: default
default:
${MAKE} PYTHON=${PYTHON} -C compute-pi
- ${MAKE} PYTHON=${PYTHON} -C mandelbrot
- ${MAKE} PYTHON=${PYTHON} -C nxtval
- ${MAKE} PYTHON=${PYTHON} -C reductions
- ${MAKE} PYTHON=${PYTHON} -C sequential
- ${MAKE} PYTHON=${PYTHON} -C spawning
-
- ${MAKE} PYTHON=${PYTHON} -C wrap-c
- ${MAKE} PYTHON=${PYTHON} -C wrap-f2py
- ${MAKE} PYTHON=${PYTHON} -C wrap-swig
- ${MAKE} PYTHON=${PYTHON} -C wrap-boost
- ${MAKE} PYTHON=${PYTHON} -C wrap-cython
- ${MAKE} PYTHON=${PYTHON} -C wrap-ctypes
- ${MAKE} PYTHON=${PYTHON} -C wrap-cffi
-
- ${MAKE} PYTHON=${PYTHON} -C cython
- ${MAKE} PYTHON=${PYTHON} -C embedding
- ${MAKE} PYTHON=${PYTHON} -C libmpi-cffi
-
- ${MAKE} PYTHON=${PYTHON} -C mpi-ref-v1
- ${MAKE} PYTHON=${PYTHON} -C init-fini
- ${MAKE} PYTHON=${PYTHON} -C threads
-
- ${MAKE} PYTHON=${PYTHON} -C futures
+ ${MAKE} PYTHON=${PYTHON} -C gemv
+ ${MAKE} PYTHON=${PYTHON} -C gemm
+ ${MAKE} PYTHON=${PYTHON} -C ping-pong
+ ${MAKE} PYTHON=${PYTHON} -C thermal
diff --git a/demo/mandelbrot/makefile b/demo/mandelbrot/makefile
deleted file mode 100644
index eb1c254..0000000
--- a/demo/mandelbrot/makefile
+++ /dev/null
@@ -1,25 +0,0 @@
-.PHONY: default build test clean
-
-default: build test clean
-
-build: mandelbrot-worker.exe
-
-MPIF90=mpif90
-FFLAGS= -O3
-ifneq (${MPI_FORTRAN_MOD_DIR},)
-FFLAGS += -I${MPI_FORTRAN_MOD_DIR}
-endif
-mandelbrot-worker.exe: mandelbrot-worker.f90
- ${MPIF90} ${FFLAGS} -o $@ $<
-
-PYTHON=python
-MPIEXEC=mpiexec
-NP_FLAG=-n
-
-test: build
- ${MPIEXEC} ${NP_FLAG} 1 ${PYTHON} mandelbrot-master.py
- ${MPIEXEC} ${NP_FLAG} 7 ${PYTHON} mandelbrot.py
- ${PYTHON} mandelbrot-seq.py
-
-clean:
- ${RM} mandelbrot-worker.exe
diff --git a/demo/mandelbrot/mandelbrot-master.py b/demo/mandelbrot/mandelbrot-master.py
deleted file mode 100644
index 913590b..0000000
--- a/demo/mandelbrot/mandelbrot-master.py
+++ /dev/null
@@ -1,66 +0,0 @@
-from mpi4pyve import MPI
-import numpy as np
-
-x1 = -2.0
-x2 = 1.0
-y1 = -1.0
-y2 = 1.0
-
-w = 600
-h = 400
-maxit = 255
-
-import os
-dirname = os.path.abspath(os.path.dirname(__file__))
-executable = os.path.join(dirname, 'mandelbrot-worker.exe')
-
-# spawn worker
-worker = MPI.COMM_SELF.Spawn(executable, maxprocs=7)
-size = worker.Get_remote_size()
-
-# send parameters
-rmsg = np.array([x1, x2, y1, y2], dtype='f')
-imsg = np.array([w, h, maxit], dtype='i')
-worker.Bcast([rmsg, MPI.REAL], root=MPI.ROOT)
-worker.Bcast([imsg, MPI.INTEGER], root=MPI.ROOT)
-
-# gather results
-counts = np.empty(size, dtype='i')
-indices = np.empty(h, dtype='i')
-cdata = np.empty([h, w], dtype='i')
-worker.Gather(sendbuf=None,
- recvbuf=[counts, MPI.INTEGER],
- root=MPI.ROOT)
-worker.Gatherv(sendbuf=None,
- recvbuf=[indices, (counts, None), MPI.INTEGER],
- root=MPI.ROOT)
-worker.Gatherv(sendbuf=None,
- recvbuf=[cdata, (counts * w, None), MPI.INTEGER],
- root=MPI.ROOT)
-
-# disconnect worker
-worker.Disconnect()
-
-# reconstruct full result
-M = np.zeros([h, w], dtype='i')
-M[indices, :] = cdata
-
-# eye candy (requires matplotlib)
-if 1:
- try:
- from matplotlib import pyplot as plt
- plt.imshow(M, aspect='equal')
- try:
- plt.nipy_spectral()
- except AttributeError:
- plt.spectral()
- try:
- import signal
- def action(*args): raise SystemExit
- signal.signal(signal.SIGALRM, action)
- signal.alarm(2)
- except:
- pass
- plt.show()
- except:
- pass
diff --git a/demo/mandelbrot/mandelbrot-seq.py b/demo/mandelbrot/mandelbrot-seq.py
deleted file mode 100644
index 0904898..0000000
--- a/demo/mandelbrot/mandelbrot-seq.py
+++ /dev/null
@@ -1,57 +0,0 @@
-import numpy as np
-import time
-
-tic = time.time()
-
-x1 = -2.0
-x2 = 1.0
-y1 = -1.0
-y2 = 1.0
-
-w = 150
-h = 100
-maxit = 127
-
-def mandelbrot(x, y, maxit):
- c = x + y*1j
- z = 0 + 0j
- it = 0
- while abs(z) < 2 and it < maxit:
- z = z**2 + c
- it += 1
- return it
-
-dx = (x2 - x1) / w
-dy = (y2 - y1) / h
-
-C = np.empty([h, w], dtype='i')
-for k in np.arange(h):
- y = y1 + k * dy
- for j in np.arange(w):
- x = x1 + j * dx
- C[k, j] = mandelbrot(x, y, maxit)
-
-M = C
-
-toc = time.time()
-print('wall clock time: %8.2f seconds' % (toc-tic))
-
-# eye candy (requires matplotlib)
-if 1:
- try:
- from matplotlib import pyplot as plt
- plt.imshow(M, aspect='equal')
- try:
- plt.nipy_spectral()
- except AttributeError:
- plt.spectral()
- try:
- import signal
- def action(*args): raise SystemExit
- signal.signal(signal.SIGALRM, action)
- signal.alarm(2)
- except:
- pass
- plt.show()
- except:
- pass
diff --git a/demo/mandelbrot/mandelbrot-worker.f90 b/demo/mandelbrot/mandelbrot-worker.f90
deleted file mode 100644
index 0bb7d15..0000000
--- a/demo/mandelbrot/mandelbrot-worker.f90
+++ /dev/null
@@ -1,96 +0,0 @@
-! $ mpif90 -o mandelbrot.exe mandelbrot.f90
-
-program main
-
- use MPI
- implicit none
-
- integer master, nprocs, myrank, ierr
-
- real :: rmsg(4), x1, x2, y1, y2
- integer :: imsg(3), w, h, maxit
-
- integer :: N
- integer, allocatable :: I(:)
- integer, allocatable :: C(:,:)
- integer :: j, k
- real :: x, dx, y, dy
-
- call MPI_Init(ierr)
- call MPI_Comm_get_parent(master, ierr)
- if (master == MPI_COMM_NULL) then
- print *, "parent communicator is MPI_COMM_NULL"
- call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
- end if
- call MPI_Comm_size(master, nprocs, ierr)
- call MPI_Comm_rank(master, myrank, ierr)
-
- ! receive parameters and unpack
- call MPI_Bcast(rmsg, 4, MPI_REAL, 0, master, ierr)
- call MPI_Bcast(imsg, 3, MPI_INTEGER, 0, master, ierr)
- x1 = rmsg(1); x2 = rmsg(2)
- y1 = rmsg(3); y2 = rmsg(4)
- w = imsg(1); h = imsg(2); maxit = imsg(3)
- dx = (x2-x1)/real(w)
- dy = (y2-y1)/real(h)
-
- ! number of lines to compute here
- N = h / nprocs
- if (modulo(h, nprocs) > myrank) then
- N = N + 1
- end if
-
- ! indices of lines to compute here
- allocate( I(0:N-1) )
- I = (/ (k, k=myrank, h-1, nprocs) /)
-
- ! compute local lines
- allocate( C(0:w-1, 0:N-1) )
- do k = 0, N-1
- y = y1 + real(I(k)) * dy
- do j = 0, w-1
- x = x1 + real(j) * dx
- C(j, k) = mandelbrot(x, y, maxit)
- end do
- end do
-
- ! send number of lines computed here
- call MPI_Gather(N, 1, MPI_INTEGER, &
- MPI_BOTTOM, 0, MPI_BYTE, &
- 0, master, ierr)
-
- ! send indices of lines computed here
- call MPI_Gatherv(I, N, MPI_INTEGER, &
- MPI_BOTTOM, MPI_BOTTOM, MPI_BOTTOM, MPI_BYTE, &
- 0, master, ierr)
-
- ! send data of lines computed here
- call MPI_Gatherv(C, N*w, MPI_INTEGER, &
- MPI_BOTTOM, MPI_BOTTOM, MPI_BOTTOM, MPI_BYTE, &
- 0, master, ierr)
-
- deallocate(C)
- deallocate(I)
-
- ! we are done
- call MPI_Comm_disconnect(master, ierr)
- call MPI_Finalize(ierr)
-
-contains
-
- function mandelbrot(x, y, maxit) result (it)
- implicit none
- real, intent(in) :: x, y
- integer, intent(in) :: maxit
- integer :: it
- complex :: z, c
- z = cmplx(0, 0)
- c = cmplx(x, y)
- it = 0
- do while (abs(z) < 2.0 .and. it < maxit)
- z = z*z + c
- it = it + 1
- end do
- end function mandelbrot
-
-end program main
diff --git a/demo/mandelbrot/mandelbrot.py b/demo/mandelbrot/mandelbrot.py
deleted file mode 100644
index e558f24..0000000
--- a/demo/mandelbrot/mandelbrot.py
+++ /dev/null
@@ -1,107 +0,0 @@
-from mpi4pyve import MPI
-import numpy as np
-
-tic = MPI.Wtime()
-
-x1 = -2.0
-x2 = 1.0
-y1 = -1.0
-y2 = 1.0
-
-w = 150
-h = 100
-maxit = 127
-
-def mandelbrot(x, y, maxit):
- c = x + y*1j
- z = 0 + 0j
- it = 0
- while abs(z) < 2 and it < maxit:
- z = z**2 + c
- it += 1
- return it
-
-comm = MPI.COMM_WORLD
-size = comm.Get_size()
-rank = comm.Get_rank()
-
-rmsg = np.empty(4, dtype='f')
-imsg = np.empty(3, dtype='i')
-
-if rank == 0:
- rmsg[:] = [x1, x2, y1, y2]
- imsg[:] = [w, h, maxit]
-
-comm.Bcast([rmsg, MPI.FLOAT], root=0)
-comm.Bcast([imsg, MPI.INT], root=0)
-
-x1, x2, y1, y2 = [float(r) for r in rmsg]
-w, h, maxit = [int(i) for i in imsg]
-dx = (x2 - x1) / w
-dy = (y2 - y1) / h
-
-# number of lines to compute here
-N = h // size + (h % size > rank)
-N = np.array(N, dtype='i')
-# indices of lines to compute here
-I = np.arange(rank, h, size, dtype='i')
-# compute local lines
-C = np.empty([N, w], dtype='i')
-for k in np.arange(N):
- y = y1 + I[k] * dy
- for j in np.arange(w):
- x = x1 + j * dx
- C[k, j] = mandelbrot(x, y, maxit)
-# gather results at root
-counts = 0
-indices = None
-cdata = None
-if rank == 0:
- counts = np.empty(size, dtype='i')
- indices = np.empty(h, dtype='i')
- cdata = np.empty([h, w], dtype='i')
-comm.Gather(sendbuf=[N, MPI.INT],
- recvbuf=[counts, MPI.INT],
- root=0)
-comm.Gatherv(sendbuf=[I, MPI.INT],
- recvbuf=[indices, (counts, None), MPI.INT],
- root=0)
-comm.Gatherv(sendbuf=[C, MPI.INT],
- recvbuf=[cdata, (counts*w, None), MPI.INT],
- root=0)
-# reconstruct full result at root
-if rank == 0:
- M = np.zeros([h,w], dtype='i')
- M[indices, :] = cdata
-
-toc = MPI.Wtime()
-wct = comm.gather(toc-tic, root=0)
-if rank == 0:
- for task, time in enumerate(wct):
- print('wall clock time: %8.2f seconds (task %d)' % (time, task))
- def mean(seq): return sum(seq)/len(seq)
- print ('all tasks, mean: %8.2f seconds' % mean(wct))
- print ('all tasks, min: %8.2f seconds' % min(wct))
- print ('all tasks, max: %8.2f seconds' % max(wct))
- print ('all tasks, sum: %8.2f seconds' % sum(wct))
-
-# eye candy (requires matplotlib)
-if rank == 0:
- try:
- from matplotlib import pyplot as plt
- plt.imshow(M, aspect='equal')
- try:
- plt.nipy_spectral()
- except AttributeError:
- plt.spectral()
- try:
- import signal
- def action(*args): raise SystemExit
- signal.signal(signal.SIGALRM, action)
- signal.alarm(2)
- except:
- pass
- plt.show()
- except:
- pass
-MPI.COMM_WORLD.Barrier()
diff --git a/demo/mpe-logging/cpilog.py b/demo/mpe-logging/cpilog.py
deleted file mode 100644
index 001ed94..0000000
--- a/demo/mpe-logging/cpilog.py
+++ /dev/null
@@ -1,53 +0,0 @@
-#!/usr/bin/env python
-
-# If you want MPE to log MPI calls, you have to add the two lines
-# below at the very beginning of your main bootstrap script.
-import mpi4pyve
-mpi4pyve.profile('mpe', logfile='cpilog')
-
-# Import the MPI extension module
-from mpi4pyve import MPI
-if 0: # <- use '1' to disable logging of MPI calls
- MPI.Pcontrol(0)
-
-# Import the 'array' module
-from array import array
-
-# This is just to make the logging
-# output a bit more interesting
-from time import sleep
-
-comm = MPI.COMM_WORLD
-nprocs = comm.Get_size()
-myrank = comm.Get_rank()
-
-n = array('i', [0])
-pi = array('d', [0])
-mypi = array('d', [0])
-
-def comp_pi(n, myrank=0, nprocs=1):
- h = 1.0 / n;
- s = 0.0;
- for i in range(myrank + 1, n + 1, nprocs):
- x = h * (i - 0.5);
- s += 4.0 / (1.0 + x**2);
- return s * h
-
-comm.Barrier()
-
-for N in [10000]*10:
-
- if myrank == 0:
- n[0] = N
-
- comm.Bcast([n, MPI.INT], root=0)
-
- mypi[0] = comp_pi(n[0], myrank, nprocs)
-
- comm.Reduce([mypi, MPI.DOUBLE],
- [pi, MPI.DOUBLE],
- op=MPI.SUM, root=0)
-
- comm.Barrier()
-
- sleep(0.01)
diff --git a/demo/mpe-logging/makefile b/demo/mpe-logging/makefile
deleted file mode 100644
index 31ec83e..0000000
--- a/demo/mpe-logging/makefile
+++ /dev/null
@@ -1,42 +0,0 @@
-MPIEXEC = mpiexec
-PYTHON = python
-N = 8
-
-.PHONY: default
-default: build test clean
-
-
-.PHONY: run-cpilog run-ring run-threads run
-run: run-cpilog run-ring run-threads
-run-cpilog:
- ${MPIEXEC} -n ${N} ${PYTHON} cpilog.py
-run-ring:
- ${MPIEXEC} -n ${N} ${PYTHON} ring.py
-run-threads:
- ${MPIEXEC} -n ${N} ${PYTHON} threads.py
-
-.PHONY: view-cpilog view-ring view-threads view
-view: view-cpilog view-ring view-threads
-view-cpilog: cpilog.slog2
- jumpshot $<
-view-ring: ring.slog2
- jumpshot $<
-view-threads: threads.slog2
- jumpshot $<
-
-cpilog.clog2: run-cpilog
-ring.clog2: run-ring
-threads.clog2: run-threads
-%.slog2: %.clog2
- clog2TOslog2 $<
-
-
-.PHONY: build
-build: run
-
-.PHONY: test
-test:
-
-.PHONY: clean
-clean:
- ${RM} *.[cs]log2
diff --git a/demo/mpe-logging/ring.py b/demo/mpe-logging/ring.py
deleted file mode 100644
index 39eb741..0000000
--- a/demo/mpe-logging/ring.py
+++ /dev/null
@@ -1,36 +0,0 @@
-#!/usr/bin/env python
-import os
-os.environ['MPE_LOGFILE_PREFIX'] = 'ring'
-import mpi4pyve
-mpi4pyve.profile('mpe')
-
-from mpi4pyve import MPI
-from array import array
-
-comm = MPI.COMM_WORLD
-size = comm.Get_size()
-rank = comm.Get_rank()
-
-src = rank-1
-dest = rank+1
-if rank == 0:
- src = size-1
-if rank == size-1:
- dest = 0
-
-try:
- from numpy import zeros
- a1 = zeros(1000000, 'd')
- a2 = zeros(1000000, 'd')
-except ImportError:
- from array import array
- a1 = array('d', [0]*1000); a1 *= 1000
- a2 = array('d', [0]*1000); a2 *= 1000
-
-comm.Sendrecv(sendbuf=a1, recvbuf=a2,
- source=src, dest=dest)
-
-MPI.Request.Waitall([
- comm.Isend(a1, dest=dest),
- comm.Irecv(a2, source=src),
- ])
diff --git a/demo/mpe-logging/threads.py b/demo/mpe-logging/threads.py
deleted file mode 100644
index d4b3587..0000000
--- a/demo/mpe-logging/threads.py
+++ /dev/null
@@ -1,32 +0,0 @@
-import sys
-import mpi4pyve
-mpi4pyve.profile('mpe', logfile='threads')
-
-from mpi4pyve import MPI
-from array import array
-try:
- import threading
-except ImportError:
- sys.stderr.write("threading module not available\n")
- sys.exit(0)
-
-send_msg = array('i', [7]*1000); send_msg *= 1000
-recv_msg = array('i', [0]*1000); recv_msg *= 1000
-
-
-def self_send(comm, rank):
- comm.Send([send_msg, MPI.INT], dest=rank, tag=0)
-
-def self_recv(comm, rank):
- comm.Recv([recv_msg, MPI.INT], source=rank, tag=0)
-
-comm = MPI.COMM_WORLD
-rank = comm.Get_rank()
-send_thread = threading.Thread(target=self_send, args=(comm, rank))
-recv_thread = threading.Thread(target=self_recv, args=(comm, rank))
-
-
-send_thread.start()
-recv_thread.start()
-recv_thread.join()
-send_thread.join()
diff --git a/demo/mpi-ref-v1/README.txt b/demo/mpi-ref-v1/README.txt
deleted file mode 100644
index 689dc1a..0000000
--- a/demo/mpi-ref-v1/README.txt
+++ /dev/null
@@ -1,11 +0,0 @@
-@Book{MPI-Ref-V1,
- title = {{MPI} - The Complete Reference: Volume 1, The {MPI} Core},
- author = {Marc Snir and Steve Otto and Steven Huss-Lederman
- and David Walker and Jack Dongarra},
- edition = {2nd.},
- year = 1998,
- publisher = {MIT Press},
- volume = {1, The MPI Core},
- series = {Scientific and Engineering Computation},
- address = {Cambridge, MA, USA},
-}
diff --git a/demo/mpi-ref-v1/ex-2.01.py b/demo/mpi-ref-v1/ex-2.01.py
deleted file mode 100644
index a42c5d5..0000000
--- a/demo/mpi-ref-v1/ex-2.01.py
+++ /dev/null
@@ -1,39 +0,0 @@
-## mpiexec -n 2 python ex-2.01.py
-
-# Process 0 sends a message to process 1
-
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-import array
-
-if MPI.COMM_WORLD.Get_size() < 2:
- raise SystemExit
-
-# --------------------------------------------------------------------
-
-s = "Hello there"
-
-msg = array.array('c', '\0'*20)
-tag = 99
-status = MPI.Status()
-
-myrank = MPI.COMM_WORLD.Get_rank()
-
-if myrank == 0:
- msg[:len(s)] = array.array('c', s)
- MPI.COMM_WORLD.Send([msg, len(s)+1, MPI.CHAR], 1, tag)
-elif myrank == 1:
- MPI.COMM_WORLD.Recv([msg, 20, MPI.CHAR], 0, tag, status)
-
-# --------------------------------------------------------------------
-
-if myrank == 1:
- assert list(msg[:len(s)]) == list(s)
- assert msg[len(s)] == '\0'
- assert status.source == 0
- assert status.tag == tag
- assert status.error == MPI.SUCCESS
- assert status.Get_count(MPI.CHAR) == len(s)+1
-
-# --------------------------------------------------------------------
diff --git a/demo/mpi-ref-v1/ex-2.08.py b/demo/mpi-ref-v1/ex-2.08.py
deleted file mode 100644
index f7809e0..0000000
--- a/demo/mpi-ref-v1/ex-2.08.py
+++ /dev/null
@@ -1,46 +0,0 @@
-## mpiexec -n 2 python ex-2.08.py
-
-# An exchange of messages
-
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-import array
-
-if MPI.COMM_WORLD.Get_size() < 2:
- raise SystemExit
-
-# --------------------------------------------------------------------
-
-sendbuf = array.array('d', [0]*10)
-recvbuf = array.array('d', [0]*10)
-tag = 0
-status = MPI.Status()
-
-myrank = MPI.COMM_WORLD.Get_rank()
-
-if myrank == 0:
- sendbuf[:] = array.array('d', range(len(sendbuf)))
- MPI.COMM_WORLD.Send([sendbuf, MPI.DOUBLE], 1, tag)
- MPI.COMM_WORLD.Recv([recvbuf, MPI.DOUBLE], 1, tag, status)
-elif myrank == 1:
- MPI.COMM_WORLD.Recv([recvbuf, MPI.DOUBLE], 0, tag, status)
- sendbuf[:] = recvbuf
- MPI.COMM_WORLD.Send([sendbuf, MPI.DOUBLE], 0, tag)
-
-# --------------------------------------------------------------------
-
-if myrank == 0:
- assert status.source == 1
- assert status.tag == tag
- assert status.error == MPI.SUCCESS
- assert status.Get_count(MPI.DOUBLE) == len(recvbuf)
- assert sendbuf == recvbuf
-elif myrank == 1:
- assert status.source == 0
- assert status.tag == tag
- assert status.error == MPI.SUCCESS
- assert status.Get_count(MPI.DOUBLE) == len(recvbuf)
- assert sendbuf == recvbuf
-
-# --------------------------------------------------------------------
diff --git a/demo/mpi-ref-v1/ex-2.16.py b/demo/mpi-ref-v1/ex-2.16.py
deleted file mode 100644
index b1f0255..0000000
--- a/demo/mpi-ref-v1/ex-2.16.py
+++ /dev/null
@@ -1,71 +0,0 @@
-## mpiexec -n 4 python ex-2.16.py
-
-# Jacobi code
-# version of parallel code using sendrecv and null proceses.
-
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-try:
- import numpy
-except ImportError:
- raise SystemExit
-
-# --------------------------------------------------------------------
-
-n = 5 * MPI.COMM_WORLD.Get_size()
-
-# compute number of processes and myrank
-p = MPI.COMM_WORLD.Get_size()
-myrank = MPI.COMM_WORLD.Get_rank()
-
-# compute size of local block
-m = n/p
-if myrank < (n - p * m):
- m = m + 1
-
-#compute neighbors
-if myrank == 0:
- left = MPI.PROC_NULL
-else:
- left = myrank - 1
-if myrank == p - 1:
- right = MPI.PROC_NULL
-else:
- right = myrank + 1
-
-# allocate local arrays
-A = numpy.empty((n+2, m+2), dtype='d', order='fortran')
-B = numpy.empty((n, m), dtype='d', order='fortran')
-
-A.fill(1)
-A[0, :] = A[-1, :] = 0
-A[:, 0] = A[:, -1] = 0
-
-# main loop
-converged = False
-while not converged:
- # compute, B = 0.25 * ( N + S + E + W)
- N, S = A[:-2, 1:-1], A[2:, 1:-1]
- E, W = A[1:-1, :-2], A[1:-1, 2:]
- numpy.add(N, S, B)
- numpy.add(E, B, B)
- numpy.add(W, B, B)
- B *= 0.25
- A[1:-1, 1:-1] = B
- # communicate
- tag = 0
- MPI.COMM_WORLD.Sendrecv([B[:, -1], MPI.DOUBLE], right, tag,
- [A[:, 0], MPI.DOUBLE], left, tag)
- MPI.COMM_WORLD.Sendrecv((B[:, 0], MPI.DOUBLE), left, tag,
- (A[:, -1], MPI.DOUBLE), right, tag)
- # convergence
- myconv = numpy.allclose(B, 0)
- loc_conv = numpy.asarray(myconv, dtype='i')
- glb_conv = numpy.asarray(0, dtype='i')
- MPI.COMM_WORLD.Allreduce([loc_conv, MPI.INT],
- [glb_conv, MPI.INT],
- op=MPI.LAND)
- converged = bool(glb_conv)
-
-# --------------------------------------------------------------------
diff --git a/demo/mpi-ref-v1/ex-2.29.py b/demo/mpi-ref-v1/ex-2.29.py
deleted file mode 100644
index 9209350..0000000
--- a/demo/mpi-ref-v1/ex-2.29.py
+++ /dev/null
@@ -1,44 +0,0 @@
-## mpiexec -n 3 python ex-2.29.py
-
-# Use a blocking probe to wait for an incoming message
-
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-import array
-
-if MPI.COMM_WORLD.Get_size() < 3:
- raise SystemExit
-
-# --------------------------------------------------------------------
-
-comm = MPI.COMM_WORLD
-rank = comm.Get_rank()
-
-if rank == 0:
- i = array.array('i', [7]*5)
- comm.Send([i, MPI.INT], 2, 0)
-elif rank == 1:
- x = array.array('f', [7]*5)
- comm.Send([x, MPI.FLOAT], 2, 0)
-elif rank == 2:
- i = array.array('i', [0]*5)
- x = array.array('f', [0]*5)
- status = MPI.Status()
- for j in range(2):
- comm.Probe(MPI.ANY_SOURCE, 0, status)
- if status.Get_source() == 0:
- comm.Recv([i, MPI.INT], 0, 0, status)
- else:
- comm.Recv([x, MPI.FLOAT], 1, 0, status)
-
-# --------------------------------------------------------------------
-
-if rank == 2:
- for v in i: assert v == 7
- for v in x: assert v == 7
- assert status.source in (0, 1)
- assert status.tag == 0
- assert status.error == 0
-
-# --------------------------------------------------------------------
diff --git a/demo/mpi-ref-v1/ex-2.32.py b/demo/mpi-ref-v1/ex-2.32.py
deleted file mode 100644
index ab7f275..0000000
--- a/demo/mpi-ref-v1/ex-2.32.py
+++ /dev/null
@@ -1,94 +0,0 @@
-# Jacobi computation, using persitent requests
-
-from mpi4pyve import MPI
-try:
- import numpy
-except ImportError:
- raise SystemExit
-
-
-n = 5 * MPI.COMM_WORLD.Get_size()
-
-# compute number of processes and myrank
-p = MPI.COMM_WORLD.Get_size()
-myrank = MPI.COMM_WORLD.Get_rank()
-
-# compute size of local block
-m = n/p
-if myrank < (n - p * m):
- m = m + 1
-
-#compute neighbors
-if myrank == 0:
- left = MPI.PROC_NULL
-else:
- left = myrank - 1
-if myrank == p - 1:
- right = MPI.PROC_NULL
-else:
- right = myrank + 1
-
-# allocate local arrays
-A = numpy.empty((n+2, m+2), dtype=float, order='fortran')
-B = numpy.empty((n, m), dtype=float, order='fortran')
-
-A.fill(1)
-A[0, :] = A[-1, :] = 0
-A[:, 0] = A[:, -1] = 0
-
-# create persintent requests
-tag = 0
-sreq1 = MPI.COMM_WORLD.Send_init((B[:, 0], MPI.DOUBLE), left, tag)
-sreq2 = MPI.COMM_WORLD.Send_init((B[:, -1], MPI.DOUBLE), right, tag)
-rreq1 = MPI.COMM_WORLD.Recv_init((A[:, 0], MPI.DOUBLE), left, tag)
-rreq2 = MPI.COMM_WORLD.Recv_init((A[:, -1], MPI.DOUBLE), right, tag)
-reqlist = [sreq1, sreq2, rreq1, rreq2]
-
-for req in reqlist:
- assert req != MPI.REQUEST_NULL
-
-# main loop
-converged = False
-while not converged:
- # compute boundary columns
- N, S = A[ :-2, 1], A[2:, 1]
- E, W = A[1:-1, 0], A[1:-1, 2]
- C = B[:, 0]
- numpy.add(N, S, C)
- numpy.add(C, E, C)
- numpy.add(C, W, C)
- C *= 0.25
- N, S = A[ :-2, -2], A[2:, -2]
- E, W = A[1:-1, -3], A[1:-1, -1]
- C = B[:, -1]
- numpy.add(N, S, C)
- numpy.add(C, E, C)
- numpy.add(C, W, C)
- C *= 0.25
- # start communication
- #MPI.Prequest.Startall(reqlist)
- for r in reqlist:
- r.Start()
- # compute interior
- N, S = A[ :-2, 2:-2], A[2, 2:-2]
- E, W = A[1:-1, 2:-2], A[1:-1, 2:-2]
- C = B[:, 1:-1]
- numpy.add(N, S, C)
- numpy.add(E, C, C)
- numpy.add(W, C, C)
- C *= 0.25
- A[1:-1, 1:-1] = B
- # complete communication
- MPI.Prequest.Waitall(reqlist)
- # convergence
- myconv = numpy.allclose(B, 0)
- loc_conv = numpy.asarray(myconv, dtype='i')
- glb_conv = numpy.asarray(0, dtype='i')
- MPI.COMM_WORLD.Allreduce([loc_conv, MPI.INT],
- [glb_conv, MPI.INT],
- op=MPI.LAND)
- converged = bool(glb_conv)
-
-# free persintent requests
-for req in reqlist:
- req.Free()
diff --git a/demo/mpi-ref-v1/ex-2.34.py b/demo/mpi-ref-v1/ex-2.34.py
deleted file mode 100644
index e2ee957..0000000
--- a/demo/mpi-ref-v1/ex-2.34.py
+++ /dev/null
@@ -1,47 +0,0 @@
-## mpiexec -n 2 python ex-2.34.py
-
-# Use of ready-mode and synchonous-mode
-
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-try:
- import numpy
-except ImportError:
- raise SystemExit
-
-if MPI.COMM_WORLD.Get_size() < 2:
- raise SystemExit
-
-# --------------------------------------------------------------------
-
-comm = MPI.COMM_WORLD
-
-buff = numpy.empty((1000,2), dtype='f', order='fortran')
-
-rank = comm.Get_rank()
-
-if rank == 0:
- req1 = comm.Irecv([buff[:, 0], MPI.FLOAT], 1, 1)
- req2 = comm.Irecv([buff[:, 1], MPI.FLOAT], 1, 2)
- status = [MPI.Status(), MPI.Status()]
- MPI.Request.Waitall([req1, req2], status)
-elif rank == 1:
- buff[:, 0] = 5
- buff[:, 1] = 7
- comm.Ssend([buff[:, 1], MPI.FLOAT], 0, 2)
- comm.Rsend([buff[:, 0], MPI.FLOAT], 0, 1)
-
-# --------------------------------------------------------------------
-
-all = numpy.all
-
-if rank == 0:
- assert all(buff[:, 0] == 5)
- assert all(buff[:, 1] == 7)
- assert status[0].source == 1
- assert status[0].tag == 1
- assert status[1].source == 1
- assert status[1].tag == 2
-
-# --------------------------------------------------------------------
diff --git a/demo/mpi-ref-v1/ex-2.35.py b/demo/mpi-ref-v1/ex-2.35.py
deleted file mode 100644
index d768528..0000000
--- a/demo/mpi-ref-v1/ex-2.35.py
+++ /dev/null
@@ -1,34 +0,0 @@
-## mpiexec -n 1 python ex-2.35.py
-
-# Calls to attach and detach buffers
-
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-try:
- from numpy import empty
-except ImportError:
- from array import array
- def empty(size, dtype):
- return array(dtype, [0]*size)
-
-# --------------------------------------------------------------------
-
-BUFSISE = 10000 + MPI.BSEND_OVERHEAD
-
-buff = empty(BUFSISE, dtype='b')
-
-MPI.Attach_buffer(buff)
-
-buff2 = MPI.Detach_buffer()
-
-MPI.Attach_buffer(buff2)
-
-MPI.Detach_buffer()
-
-
-# --------------------------------------------------------------------
-
-assert len(buff2) == BUFSISE
-
-# --------------------------------------------------------------------
diff --git a/demo/mpi-ref-v1/ex-3.01.py b/demo/mpi-ref-v1/ex-3.01.py
deleted file mode 100644
index 07a8f88..0000000
--- a/demo/mpi-ref-v1/ex-3.01.py
+++ /dev/null
@@ -1,33 +0,0 @@
-from mpi4pyve import MPI
-try:
- import numpy
-except ImportError:
- raise SystemExit
-
-# send a upper triangular matrix
-
-N = 10
-
-a = numpy.empty((N, N), dtype=float, order='c')
-b = numpy.zeros((N, N), dtype=float, order='c')
-a.flat = numpy.arange(a.size, dtype=float)
-
-# compute start and size of each row
-i = numpy.arange(N)
-blocklen = N - i
-disp = N * i + i
-
-# create datatype for upper triangular part
-upper = MPI.DOUBLE.Create_indexed(blocklen, disp)
-upper.Commit()
-
-# send and recv matrix
-myrank = MPI.COMM_WORLD.Get_rank()
-MPI.COMM_WORLD.Sendrecv((a, 1, upper), myrank, 0,
- [b, 1, upper], myrank, 0)
-
-assert numpy.allclose(numpy.triu(b), numpy.triu(a))
-assert numpy.allclose(numpy.tril(b, -1), numpy.zeros((N,N)))
-
-upper.Free()
-
diff --git a/demo/mpi-ref-v1/ex-3.02.py b/demo/mpi-ref-v1/ex-3.02.py
deleted file mode 100644
index 686d3ac..0000000
--- a/demo/mpi-ref-v1/ex-3.02.py
+++ /dev/null
@@ -1,12 +0,0 @@
-from mpi4pyve import MPI
-
-# Type = { (double, 0), (char, 8) }
-
-blens = (1, 1)
-disps = (0, MPI.DOUBLE.size)
-types = (MPI.DOUBLE, MPI.CHAR)
-
-dtype = MPI.Datatype.Create_struct(blens, disps, types)
-
-if 'ex-3.02' in __file__:
- dtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.03.py b/demo/mpi-ref-v1/ex-3.03.py
deleted file mode 100644
index 75df6c3..0000000
--- a/demo/mpi-ref-v1/ex-3.03.py
+++ /dev/null
@@ -1,6 +0,0 @@
-execfile('ex-3.02.py')
-
-assert dtype.size == MPI.DOUBLE.size + MPI.CHAR.size
-assert dtype.extent >= dtype.size
-
-dtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.04.py b/demo/mpi-ref-v1/ex-3.04.py
deleted file mode 100644
index 11cbcdf..0000000
--- a/demo/mpi-ref-v1/ex-3.04.py
+++ /dev/null
@@ -1,9 +0,0 @@
-execfile('ex-3.02.py')
-
-count = 3
-newtype = dtype.Create_contiguous(count)
-
-assert newtype.extent == dtype.extent * count
-
-dtype.Free()
-newtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.05.py b/demo/mpi-ref-v1/ex-3.05.py
deleted file mode 100644
index e00ca4b..0000000
--- a/demo/mpi-ref-v1/ex-3.05.py
+++ /dev/null
@@ -1,11 +0,0 @@
-execfile('ex-3.02.py')
-
-count = 2
-blklen = 3
-stride = 4
-newtype = dtype.Create_vector(count, blklen, stride)
-
-assert newtype.size == dtype.size * count * blklen
-
-dtype.Free()
-newtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.06.py b/demo/mpi-ref-v1/ex-3.06.py
deleted file mode 100644
index 922bcc4..0000000
--- a/demo/mpi-ref-v1/ex-3.06.py
+++ /dev/null
@@ -1,11 +0,0 @@
-execfile('ex-3.02.py')
-
-count = 3
-blklen = 1
-stride = -2
-newtype = dtype.Create_vector(count, blklen, stride)
-
-assert newtype.size == dtype.size * count * blklen
-
-dtype.Free()
-newtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.07.py b/demo/mpi-ref-v1/ex-3.07.py
deleted file mode 100644
index 49243cf..0000000
--- a/demo/mpi-ref-v1/ex-3.07.py
+++ /dev/null
@@ -1,11 +0,0 @@
-execfile('ex-3.02.py')
-
-count = 2
-blklen = 3
-stride = 4 * dtype.extent
-newtype = dtype.Create_hvector(count, blklen, stride)
-
-assert newtype.size == dtype.size * count * blklen
-
-dtype.Free()
-newtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.08.py b/demo/mpi-ref-v1/ex-3.08.py
deleted file mode 100644
index 0737ba1..0000000
--- a/demo/mpi-ref-v1/ex-3.08.py
+++ /dev/null
@@ -1,34 +0,0 @@
-from mpi4pyve import MPI
-try:
- import numpy
-except ImportError:
- raise SystemExit
-
-# extract the section a[0:6:2, 0:5:2] and store it in e[:,:]
-
-a = numpy.empty((6, 5), dtype=float, order='fortran')
-e = numpy.empty((3, 3), dtype=float, order='fortran')
-a.flat = numpy.arange(a.size, dtype=float)
-
-lb, sizeofdouble = MPI.DOUBLE.Get_extent()
-
-# create datatype for a 1D section
-oneslice = MPI.DOUBLE.Create_vector(3, 1, 2)
-
-# create datatype for a 2D section
-twoslice = oneslice.Create_hvector(3, 1, 12*sizeofdouble)
-twoslice.Commit()
-
-# send and recv on same process
-myrank = MPI.COMM_WORLD.Get_rank()
-status = MPI.Status()
-MPI.COMM_WORLD.Sendrecv([a, 1, twoslice], myrank, 0,
- (e, MPI.DOUBLE), myrank, 0, status)
-
-assert numpy.allclose(a[::2, ::2], e)
-assert status.Get_count(twoslice) == 1
-assert status.Get_count(MPI.DOUBLE) == e.size
-
-oneslice.Free()
-twoslice.Free()
-
diff --git a/demo/mpi-ref-v1/ex-3.09.py b/demo/mpi-ref-v1/ex-3.09.py
deleted file mode 100644
index ceab636..0000000
--- a/demo/mpi-ref-v1/ex-3.09.py
+++ /dev/null
@@ -1,39 +0,0 @@
-from mpi4pyve import MPI
-try:
- import numpy
-except ImportError:
- raise SystemExit
-
-# transpose a matrix a into b
-
-a = numpy.empty((100, 100), dtype=float, order='fortran')
-b = numpy.empty((100, 100), dtype=float, order='fortran')
-a.flat = numpy.arange(a.size, dtype=float)
-
-lb, sizeofdouble = MPI.DOUBLE.Get_extent()
-
-# create datatype dor one row
-# (vector with 100 double entries and stride 100)
-row = MPI.DOUBLE.Create_vector(100, 1, 100)
-
-# create datatype for matrix in row-major order
-
-# (one hundred copies of the row datatype, strided one word
-# apart; the succesive row datatypes are interleaved)
-xpose = row.Create_hvector(100, 1, sizeofdouble)
-xpose.Commit()
-
-# send matrix in row-major order and receive in column major order
-abuf = (a, xpose)
-bbuf = (b, MPI.DOUBLE)
-myrank = MPI.COMM_WORLD.Get_rank()
-status = MPI.Status()
-MPI.COMM_WORLD.Sendrecv(abuf, myrank, 0, bbuf, myrank, 0, status)
-
-assert numpy.allclose(a, b.transpose())
-assert status.Get_count(xpose) == 1
-assert status.Get_count(MPI.DOUBLE) == b.size
-
-row.Free()
-xpose.Free()
-
diff --git a/demo/mpi-ref-v1/ex-3.11.py b/demo/mpi-ref-v1/ex-3.11.py
deleted file mode 100644
index 488acef..0000000
--- a/demo/mpi-ref-v1/ex-3.11.py
+++ /dev/null
@@ -1,8 +0,0 @@
-execfile('ex-3.02.py')
-
-B = (3, 1)
-D = (4, 0)
-newtype = dtype.Create_indexed(B, D)
-
-dtype.Free()
-newtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.12.py b/demo/mpi-ref-v1/ex-3.12.py
deleted file mode 100644
index f75986f..0000000
--- a/demo/mpi-ref-v1/ex-3.12.py
+++ /dev/null
@@ -1,8 +0,0 @@
-execfile('ex-3.02.py')
-
-B = (3, 1)
-D = (4 * dtype.extent, 0)
-newtype = dtype.Create_hindexed(B, D)
-
-dtype.Free()
-newtype.Free()
diff --git a/demo/mpi-ref-v1/ex-3.13.py b/demo/mpi-ref-v1/ex-3.13.py
deleted file mode 100644
index 9173bed..0000000
--- a/demo/mpi-ref-v1/ex-3.13.py
+++ /dev/null
@@ -1,14 +0,0 @@
-from mpi4pyve import MPI
-
-blens = (1, 1)
-disps = (0, MPI.DOUBLE.size)
-types = (MPI.DOUBLE, MPI.CHAR)
-type1 = MPI.Datatype.Create_struct(blens, disps, types)
-
-B = (2, 1, 3)
-D = (0, 16, 26)
-T = (MPI.FLOAT, type1, MPI.CHAR)
-dtype = MPI.Datatype.Create_struct(B, D, T)
-
-type1.Free()
-dtype.Free()
diff --git a/demo/mpi-ref-v1/makefile b/demo/mpi-ref-v1/makefile
deleted file mode 100644
index 727d33a..0000000
--- a/demo/mpi-ref-v1/makefile
+++ /dev/null
@@ -1,23 +0,0 @@
-.PHONY: default build test clean test_seq test_mpi
-
-default: build test clean
-
-build:
-
-PYTHON = python
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 3
-
-test_seq:
- ${MAKE} MPIEXEC= NP_FLAG= NP= test_mpi
-
-test_mpi:
- -@for i in `ls ex-*.py`; do \
- echo ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} $$i; \
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} $$i; \
- done
-
-test: test_seq test_mpi
-
-clean:
diff --git a/demo/mpi-ref-v1/runtests.bat b/demo/mpi-ref-v1/runtests.bat
deleted file mode 100644
index d75b9a8..0000000
--- a/demo/mpi-ref-v1/runtests.bat
+++ /dev/null
@@ -1,37 +0,0 @@
-@echo off
-setlocal ENABLEEXTENSIONS
-
-set MPI=Microsoft MPI
-set PATH="%ProgramFiles%\%MPI%\bin";%PATH%
-
-set MPIEXEC=mpiexec
-set NP_FLAG=-n
-set NP=5
-
-set PYTHON=C:\Python27\python.exe
-set PYTHON=C:\Python36\python.exe
-set PYTHON=python
-
-@echo on
-set MPIEXEC=
-set NP_FLAG=
-set NP=
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-2.01.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-2.08.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-2.16.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-2.29.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-2.32.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-2.34.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-2.35.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.01.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.02.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.03.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.04.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.05.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.06.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.07.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.08.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.09.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.11.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.12.py
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% ex-3.13.py
diff --git a/demo/mpi-ref-v1/runtests.sh b/demo/mpi-ref-v1/runtests.sh
deleted file mode 100755
index 25f8da9..0000000
--- a/demo/mpi-ref-v1/runtests.sh
+++ /dev/null
@@ -1,28 +0,0 @@
-#!/bin/sh
-
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=3
-
-PYTHON=python
-
-set -x
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-2.01.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-2.08.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-2.16.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-2.29.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-2.32.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-2.34.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-2.35.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.01.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.02.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.03.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.04.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.05.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.06.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.07.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.08.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.09.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.11.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.12.py
-$MPIEXEC $NP_FLAG $NP $PYTHON ex-3.13.py
diff --git a/demo/nxtval/makefile b/demo/nxtval/makefile
deleted file mode 100644
index f08b5ca..0000000
--- a/demo/nxtval/makefile
+++ /dev/null
@@ -1,13 +0,0 @@
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=5
-
-PYTHON=python
-
-.PHONY: test
-test:
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} nxtval-threads.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} nxtval-dynproc.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} nxtval-onesided.py
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} nxtval-scalable.py
-# ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} nxtval-mpi3.py
diff --git a/demo/nxtval/nxtval-dynproc.py b/demo/nxtval/nxtval-dynproc.py
deleted file mode 100644
index 015c360..0000000
--- a/demo/nxtval/nxtval-dynproc.py
+++ /dev/null
@@ -1,77 +0,0 @@
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-import sys, os
-
-class Counter(object):
-
- def __init__(self, comm):
- assert not comm.Is_inter()
- self.comm = comm.Dup()
- # start counter process
- script = os.path.abspath(__file__)
- if script[-4:] in ('.pyc', '.pyo'):
- script = script[:-1]
- self.child = self.comm.Spawn(sys.executable,
- [script, '--child'], 1)
-
- def free(self):
- self.comm.Barrier()
- # stop counter process
- rank = self.child.Get_rank()
- if rank == 0:
- self.child.send(None, 0, 1)
- self.child.Disconnect()
- #
- self.comm.Free()
-
- def next(self):
- #
- incr = 1
- self.child.send(incr, 0, 0)
- ival = self.child.recv(None, 0, 0)
- nxtval = ival
- #
- return nxtval
-
-# --------------------------------------------------------------------
-
-def _counter_child():
- parent = MPI.Comm.Get_parent()
- assert parent != MPI.COMM_NULL
- try:
- counter = 0
- status = MPI.Status()
- any_src, any_tag = MPI.ANY_SOURCE, MPI.ANY_TAG
- while True: # server loop
- incr = parent.recv(None, any_src, any_tag, status)
- if status.tag == 1: break
- parent.send(counter, status.source, 0)
- counter += incr
- finally:
- parent.Disconnect()
-
-if __name__ == '__main__':
- if (len(sys.argv) > 1 and
- sys.argv[0] == __file__ and
- sys.argv[1] == '--child'):
- _counter_child()
- sys.exit(0)
-
-# --------------------------------------------------------------------
-
-def test():
- vals = []
- counter = Counter(MPI.COMM_WORLD)
- for i in range(5):
- c = counter.next()
- vals.append(c)
- counter.free()
- #
- vals = MPI.COMM_WORLD.allreduce(vals)
- assert sorted(vals) == list(range(len(vals)))
-
-if __name__ == '__main__':
- test()
-
-# --------------------------------------------------------------------
diff --git a/demo/nxtval/nxtval-mpi3.py b/demo/nxtval/nxtval-mpi3.py
deleted file mode 100644
index 4922117..0000000
--- a/demo/nxtval/nxtval-mpi3.py
+++ /dev/null
@@ -1,85 +0,0 @@
-from mpi4pyve import MPI
-from array import array as _array
-import struct as _struct
-
-# --------------------------------------------------------------------
-
-class Counter(object):
-
- def __init__(self, comm):
- rank = comm.Get_rank()
- itemsize = MPI.INT.Get_size()
- if rank == 0:
- n = 1
- else:
- n = 0
- self.win = MPI.Win.Allocate(n*itemsize, itemsize,
- MPI.INFO_NULL, comm)
- if rank == 0:
- mem = self.win.tomemory()
- mem[:] = _struct.pack('i', 0)
-
- def free(self):
- self.win.Free()
-
- def next(self, increment=1):
- incr = _array('i', [increment])
- nval = _array('i', [0])
- self.win.Lock(0)
- self.win.Get_accumulate([incr, 1, MPI.INT],
- [nval, 1, MPI.INT],
- 0, op=MPI.SUM)
- self.win.Unlock(0)
- return nval[0]
-
-# -----------------------------------------------------------------------------
-
-class Mutex(object):
-
- def __init__(self, comm):
- self.counter = Counter(comm)
-
- def __enter__(self):
- self.lock()
- return self
-
- def __exit__(self, *exc):
- self.unlock()
- return None
-
- def free(self):
- self.counter.free()
-
- def lock(self):
- value = self.counter.next(+1)
- while value != 0:
- value = self.counter.next(-1)
- value = self.counter.next(+1)
-
- def unlock(self):
- self.counter.next(-1)
-
-# -----------------------------------------------------------------------------
-
-def test_counter():
- vals = []
- counter = Counter(MPI.COMM_WORLD)
- for i in range(5):
- c = counter.next()
- vals.append(c)
- counter.free()
-
- vals = MPI.COMM_WORLD.allreduce(vals)
- assert sorted(vals) == list(range(len(vals)))
-
-def test_mutex():
- mutex = Mutex(MPI.COMM_WORLD)
- mutex.lock()
- mutex.unlock()
- mutex.free()
-
-if __name__ == '__main__':
- test_counter()
- test_mutex()
-
-# -----------------------------------------------------------------------------
diff --git a/demo/nxtval/nxtval-onesided.py b/demo/nxtval/nxtval-onesided.py
deleted file mode 100644
index 235c2d0..0000000
--- a/demo/nxtval/nxtval-onesided.py
+++ /dev/null
@@ -1,72 +0,0 @@
-# --------------------------------------------------------------------
-
-from mpi4pyve import MPI
-from array import array as _array
-import struct as _struct
-
-class Counter(object):
-
- def __init__(self, comm):
- #
- size = comm.Get_size()
- rank = comm.Get_rank()
- #
- itemsize = MPI.INT.Get_size()
- if rank == 0:
- mem = MPI.Alloc_mem(itemsize*size, MPI.INFO_NULL)
- mem[:] = _struct.pack('i', 0) * size
- else:
- mem = MPI.BOTTOM
- self.win = MPI.Win.Create(mem, itemsize, MPI.INFO_NULL, comm)
- #
- blens = [rank, size-rank-1]
- disps = [0, rank+1]
- self.dt_get = MPI.INT.Create_indexed(blens, disps).Commit()
- #
- self.myval = 0
-
- def free(self):
- self.dt_get.Free()
- mem = self.win.tomemory()
- self.win.Free()
- if mem: MPI.Free_mem(mem)
-
- def next(self):
- #
- group = self.win.Get_group()
- size = group.Get_size()
- rank = group.Get_rank()
- group.Free()
- #
- incr = _array('i', [1])
- vals = _array('i', [0])*size
- self.win.Lock(0)
- self.win.Accumulate([incr, 1, MPI.INT], 0,
- [rank, 1, MPI.INT], MPI.SUM)
- self.win.Get([vals, 1, self.dt_get], 0,
- [ 0, 1, self.dt_get])
- self.win.Unlock(0)
- #
- vals[rank] = self.myval
- self.myval += 1
- nxtval = sum(vals)
- #
- return nxtval
-
-# --------------------------------------------------------------------
-
-def test():
- vals = []
- counter = Counter(MPI.COMM_WORLD)
- for i in range(5):
- c = counter.next()
- vals.append(c)
- counter.free()
-
- vals = MPI.COMM_WORLD.allreduce(vals)
- assert sorted(vals) == list(range(len(vals)))
-
-if __name__ == '__main__':
- test()
-
-# --------------------------------------------------------------------
diff --git a/demo/nxtval/nxtval-scalable.py b/demo/nxtval/nxtval-scalable.py
deleted file mode 100644
index 675682b..0000000
--- a/demo/nxtval/nxtval-scalable.py
+++ /dev/null
@@ -1,148 +0,0 @@
-from mpi4pyve import MPI
-
-# -----------------------------------------------------------------------------
-
-import struct as _struct
-try:
- from numpy import empty as _empty
- def _array_new(size, typecode, init=0):
- a = _empty(size, typecode)
- a.fill(init)
- return a
- def _array_set(ary, value):
- ary.fill(value)
- def _array_sum(ary):
- return ary.sum()
-except ImportError:
- from array import array as _array
- def _array_new(size, typecode, init=0):
- return _array(typecode, [init]) * size
- def _array_set(ary, value):
- for i, _ in enumerate(ary):
- ary[i] = value
- def _array_sum(ary):
- return sum(ary, 0)
-
-# -----------------------------------------------------------------------------
-
-class Counter(object):
-
- def __init__(self, comm, init=0):
- #
- size = comm.Get_size()
- rank = comm.Get_rank()
- mask = 1
- while mask < size:
- mask <<= 1
- mask >>= 1
- idx = 0
- get_idx = []
- acc_idx = []
- while mask >= 1:
- left = idx + 1
- right = idx + (mask<<1)
- if rank < mask:
- acc_idx.append( left )
- get_idx.append( right )
- idx = left
- else:
- acc_idx.append( right )
- get_idx.append( left )
- idx = right
- rank = rank % mask
- mask >>= 1
- #
- typecode = 'i'
- datatype = MPI.INT
- itemsize = datatype.Get_size()
- #
- root = 0
- rank = comm.Get_rank()
- if rank == root:
- nlevels = len(get_idx) + 1
-            nentries = (1<<nlevels) - 1
+    while remain > 0:
+ part = buf[begin:min(begin + remain, begin + max_count - 1)]
+ if rank == 0:
+ comm.Send([part, MPI.DOUBLE], 1)
+ comm.Recv([part, MPI.DOUBLE], 1)
+ elif rank == 1:
+ comm.Recv([part, MPI.DOUBLE], 0)
+ comm.Send([part, MPI.DOUBLE], 0)
+ begin += part.size
+ remain -= part.size
+
+parser = argparse.ArgumentParser()
+parser.add_argument('-dev1', type=str, required=True, choices=['vh', 've'])
+parser.add_argument('-dev2', type=str, required=True, choices=['vh', 've'])
+parser.add_argument('-n', type=int, required=False, default=20)
+parser.add_argument('-loop_count', type=int, required=False, default=10)
+args = parser.parse_args()
+
+# set module
+if args.dev1 == 'vh':
+ dev1 = np
+elif args.dev1 == 've':
+ dev1 = vp
+else:
+ raise ValueError
+if args.dev2 == 'vh':
+ dev2 = np
+elif args.dev2 == 've':
+ dev2 = vp
+else:
+ raise ValueError
+
+comm = MPI.COMM_WORLD
+size = comm.Get_size()
+rank = comm.Get_rank()
+if rank == 0: print(vars(args))
+if size != 2:
+ raise ValueError
+
+loop_count = args.loop_count
+if rank == 0:
+ print('| Data Size (B) | Avg Transfer Time (s) | Bandwidth (GB/s) |')
+ print('|---------------|-----------------------|------------------|')
+
+for n in range(0, args.n):
+ nelem = 1 << n
+ if rank == 0:
+ buf = dev1.arange(nelem, dtype='f8')
+ else:
+ buf = dev2.empty(nelem, dtype='f8')
+ comm.Barrier()
+ t0 = MPI.Wtime()
+ for i in range(loop_count):
+ send_recv_helper(buf, comm)
+ comm.Barrier()
+ t1 = MPI.Wtime()
+ elapsed_time = t1 - t0
+ avg_transfer_time = elapsed_time / (2.0 * loop_count)
+ bandwidth = buf.nbytes / (1024 ** 3) / avg_transfer_time
+ if rank == 0:
+ print('|{:>15d}|{:23.9f}|{:18.9f}|'.format(buf.nbytes, avg_transfer_time, bandwidth))
+ if not np.array_equal(np.asarray(buf), np.arange(nelem, dtype='f8')):
+ print("Result mismatch (rank = {})".format(rank))
+ MPI.Finalize()
+ exit()
+
+if rank == 0:
+ print('|---------------|-----------------------|------------------|')
+MPI.Finalize()
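
The benchmark above ping-pongs a float64 buffer between exactly two ranks (any other size raises ValueError), starting from one element and doubling each step, then prints the averaged one-way transfer time and bandwidth per size. A typical invocation for measuring VH-to-VE transfers, assuming the script is saved as ``p2p_bandwidth.py`` (an illustrative name, not fixed by this patch)::

    $ mpirun -veo -np 2 python p2p_bandwidth.py -dev1 vh -dev2 ve

Rank 0 allocates its buffer with the ``-dev1`` module and rank 1 with ``-dev2``, so VH/VH, VE/VE, and mixed VH/VE pairs can all be compared.
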
diff --git a/demo/python-config b/demo/python-config
deleted file mode 100755
index 498b414..0000000
--- a/demo/python-config
+++ /dev/null
@@ -1,80 +0,0 @@
-#!/usr/bin/env python
-# -*- python -*-
-
-import sys, os
-import getopt
-try:
- import sysconfig
-except ImportError:
- from distutils import sysconfig
-
-valid_opts = ['help', 'prefix', 'exec-prefix', 'includes', 'libs', 'cflags',
- 'ldflags', 'extension-suffix', 'abiflags', 'configdir']
-
-def exit_with_usage(code=1):
- sys.stderr.write("Usage: %s [%s]\n" % (
- sys.argv[0], '|'.join('--'+opt for opt in valid_opts)))
- sys.exit(code)
-
-try:
- opts, args = getopt.getopt(sys.argv[1:], '', valid_opts)
-except getopt.error:
- exit_with_usage()
-
-if not opts:
- exit_with_usage()
-
-getvar = sysconfig.get_config_var
-pyver = getvar('VERSION')
-try:
- abiflags = sys.abiflags
-except AttributeError:
- abiflags = ''
-
-opt_flags = [flag for (flag, val) in opts]
-
-if '--help' in opt_flags:
- exit_with_usage(code=0)
-
-for opt in opt_flags:
- if opt == '--prefix':
- print(getvar('prefix'))
-
- elif opt == '--exec-prefix':
- print(getvar('exec_prefix'))
-
- elif opt in ('--includes', '--cflags'):
- try:
- include = sysconfig.get_path('include')
- platinclude = sysconfig.get_path('platinclude')
- except AttributeError:
- include = sysconfig.get_python_inc()
- platinclude = sysconfig.get_python_inc(plat_specific=True)
- flags = ['-I' + include]
- if include != platinclude:
- flags.append('-I' + platinclude)
- if opt == '--cflags':
- flags.extend(getvar('CFLAGS').split())
- print(' '.join(flags))
-
- elif opt in ('--libs', '--ldflags'):
- libs = getvar('LIBS').split() + getvar('SYSLIBS').split()
- libs.append('-lpython' + pyver + abiflags)
- if opt == '--ldflags':
- if not getvar('Py_ENABLE_SHARED'):
- libs.insert(0, '-L' + getvar('LIBPL'))
- if not getvar('PYTHONFRAMEWORK'):
- libs.extend(getvar('LINKFORSHARED').split())
- print(' '.join(libs))
-
- elif opt == '--extension-suffix':
- ext_suffix = getvar('EXT_SUFFIX')
- if ext_suffix is None:
- ext_suffix = getvar('SO')
- print(ext_suffix)
-
- elif opt == '--abiflags':
- print(abiflags)
-
- elif opt == '--configdir':
- print(getvar('LIBPL'))
diff --git a/demo/reductions/makefile b/demo/reductions/makefile
deleted file mode 100644
index ef24f38..0000000
--- a/demo/reductions/makefile
+++ /dev/null
@@ -1,10 +0,0 @@
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=5
-
-PYTHON=python
-
-.PHONY: test
-test:
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_reductions.py -q
- ${RM} -r *.py[co] __pycache__
diff --git a/demo/reductions/reductions.py b/demo/reductions/reductions.py
deleted file mode 100644
index cff7ca1..0000000
--- a/demo/reductions/reductions.py
+++ /dev/null
@@ -1,103 +0,0 @@
-from mpi4pyve import MPI
-
-class Intracomm(MPI.Intracomm):
- """
- Intracommunicator class with scalable, point-to-point based
- implementations of global reduction operations.
- """
-
- def __new__(cls, comm=None):
- return super(Intracomm, cls).__new__(cls, comm)
-
- def reduce(self, sendobj=None, recvobj=None, op=MPI.SUM, root=0):
- size = self.size
- rank = self.rank
- assert 0 <= root < size
- tag = MPI.COMM_WORLD.Get_attr(MPI.TAG_UB)-1
-
- recvobj = sendobj
- mask = 1
-
- while mask < size:
- if (mask & rank) != 0:
- target = (rank & ~mask) % size
- self.send(recvobj, dest=target, tag=tag)
- else:
- target = (rank | mask)
- if target < size:
- tmp = self.recv(None, source=target, tag=tag)
- recvobj = op(recvobj, tmp)
- mask <<= 1
-
- if root != 0:
- if rank == 0:
- self.send(recvobj, dest=root, tag=tag)
- elif rank == root:
- recvobj = self.recv(None, source=0, tag=tag)
-
- if rank != root:
- recvobj = None
-
- return recvobj
-
- def allreduce(self, sendobj=None, recvobj=None, op=MPI.SUM):
- recvobj = self.reduce(sendobj, recvobj, op, 0)
- recvobj = self.bcast(recvobj, 0)
- return recvobj
-
- def scan(self, sendobj=None, recvobj=None, op=MPI.SUM):
- size = self.size
- rank = self.rank
- tag = MPI.COMM_WORLD.Get_attr(MPI.TAG_UB)-1
-
- recvobj = sendobj
- partial = sendobj
- mask = 1
-
- while mask < size:
- target = rank ^ mask
- if target < size:
- tmp = self.sendrecv(partial, dest=target, source=target,
- sendtag=tag, recvtag=tag)
- if rank > target:
- partial = op(tmp, partial)
- recvobj = op(tmp, recvobj)
- else:
- tmp = op(partial, tmp)
- partial = tmp
- mask <<= 1
-
- return recvobj
-
- def exscan(self, sendobj=None, recvobj=None, op=MPI.SUM):
- size = self.size
- rank = self.rank
- tag = MPI.COMM_WORLD.Get_attr(MPI.TAG_UB)-1
-
- recvobj = sendobj
- partial = sendobj
- mask = 1
- flag = False
-
- while mask < size:
- target = rank ^ mask
- if target < size:
- tmp = self.sendrecv(partial, dest=target, source=target,
- sendtag=tag, recvtag=tag)
- if rank > target:
- partial = op(tmp, partial)
- if rank != 0:
- if not flag:
- recvobj = tmp
- flag = True
- else:
- recvobj = op(tmp, recvobj)
- else:
- tmp = op(partial, tmp)
- partial = tmp
- mask <<= 1
-
- if rank == 0:
- recvobj = None
-
- return recvobj
diff --git a/demo/reductions/runtests.bat b/demo/reductions/runtests.bat
deleted file mode 100644
index 9fd4b1a..0000000
--- a/demo/reductions/runtests.bat
+++ /dev/null
@@ -1,16 +0,0 @@
-@echo off
-setlocal ENABLEEXTENSIONS
-
-set MPI=Microsoft MPI
-set PATH="%ProgramFiles%\%MPI%\bin";%PATH%
-
-set MPIEXEC=mpiexec
-set NP_FLAG=-n
-set NP=5
-
-set PYTHON=C:\Python27\python.exe
-set PYTHON=C:\Python36\python.exe
-set PYTHON=python
-
-@echo on
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% test_reductions.py -q
diff --git a/demo/reductions/runtests.sh b/demo/reductions/runtests.sh
deleted file mode 100755
index c03dfdd..0000000
--- a/demo/reductions/runtests.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-#!/bin/sh
-
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=5
-
-PYTHON=python
-
-set -x
-$MPIEXEC $NP_FLAG $NP $PYTHON test_reductions.py -q
diff --git a/demo/reductions/test_reductions.py b/demo/reductions/test_reductions.py
deleted file mode 100644
index 3522f1a..0000000
--- a/demo/reductions/test_reductions.py
+++ /dev/null
@@ -1,210 +0,0 @@
-#import mpi4pyve
-#mpi4pyve.profile("mpe")
-from mpi4pyve import MPI
-
-import unittest
-
-import sys, os
-sys.path.insert(0, os.path.dirname(__file__))
-from reductions import Intracomm
-del sys.path[0]
-
-class BaseTest(object):
-
- def test_reduce(self):
- rank = self.comm.rank
- size = self.comm.size
- for root in range(size):
- msg = rank
- res = self.comm.reduce(sendobj=msg, root=root)
- if self.comm.rank == root:
- self.assertEqual(res, sum(range(size)))
- else:
- self.assertEqual(res, None)
-
- def test_reduce_min(self):
- rank = self.comm.rank
- size = self.comm.size
- for root in range(size):
- msg = rank
- res = self.comm.reduce(sendobj=msg, op=MPI.MIN, root=root)
- if self.comm.rank == root:
- self.assertEqual(res, 0)
- else:
- self.assertEqual(res, None)
-
- def test_reduce_max(self):
- rank = self.comm.rank
- size = self.comm.size
- for root in range(size):
- msg = rank
- res = self.comm.reduce(sendobj=msg, op=MPI.MAX, root=root)
- if self.comm.rank == root:
- self.assertEqual(res, size-1)
- else:
- self.assertEqual(res, None)
-
- def test_reduce_minloc(self):
- rank = self.comm.rank
- size = self.comm.size
- for root in range(size):
- msg = rank
- res = self.comm.reduce(sendobj=(msg, rank), op=MPI.MINLOC, root=root)
- if self.comm.rank == root:
- self.assertEqual(res, (0, 0))
- else:
- self.assertEqual(res, None)
-
- def test_reduce_maxloc(self):
- rank = self.comm.rank
- size = self.comm.size
- for root in range(size):
- msg = rank
- res = self.comm.reduce(sendobj=(msg, rank), op=MPI.MAXLOC, root=root)
- if self.comm.rank == root:
- self.assertEqual(res, (size-1, size-1))
- else:
- self.assertEqual(res, None)
-
- def test_allreduce(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.allreduce(sendobj=msg)
- self.assertEqual(res, sum(range(size)))
-
- def test_allreduce_min(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.allreduce(sendobj=msg, op=MPI.MIN)
- self.assertEqual(res, 0)
-
- def test_allreduce_max(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.allreduce(sendobj=msg, op=MPI.MAX)
- self.assertEqual(res, size-1)
-
- def test_allreduce_minloc(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.allreduce(sendobj=(msg, rank), op=MPI.MINLOC)
- self.assertEqual(res, (0, 0))
-
- def test_allreduce_maxloc(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.allreduce(sendobj=(msg, rank), op=MPI.MAXLOC)
- self.assertEqual(res, (size-1, size-1))
-
- def test_scan(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.scan(sendobj=msg)
- self.assertEqual(res, sum(list(range(size))[:rank+1]))
-
- def test_scan_min(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.scan(sendobj=msg, op=MPI.MIN)
- self.assertEqual(res, 0)
-
- def test_scan_max(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.scan(sendobj=msg, op=MPI.MAX)
- self.assertEqual(res, rank)
-
- def test_scan_minloc(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.scan(sendobj=(msg, rank), op=MPI.MINLOC)
- self.assertEqual(res, (0, 0))
-
- def test_scan_maxloc(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.scan(sendobj=(msg, rank), op=MPI.MAXLOC)
- self.assertEqual(res, (rank, rank))
-
- def test_exscan(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.exscan(sendobj=msg)
- if self.comm.rank == 0:
- self.assertEqual(res, None)
- else:
- self.assertEqual(res, sum(list(range(size))[:rank]))
-
- def test_exscan_min(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.exscan(sendobj=msg, op=MPI.MIN)
- if self.comm.rank == 0:
- self.assertEqual(res, None)
- else:
- self.assertEqual(res, 0)
-
- def test_exscan_max(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.exscan(sendobj=msg, op=MPI.MAX)
- if self.comm.rank == 0:
- self.assertEqual(res, None)
- else:
- self.assertEqual(res, rank-1)
-
- def test_exscan_minloc(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.exscan(sendobj=(msg, rank), op=MPI.MINLOC)
- if self.comm.rank == 0:
- self.assertEqual(res, None)
- else:
- self.assertEqual(res, (0, 0))
-
- def test_exscan_maxloc(self):
- rank = self.comm.rank
- size = self.comm.size
- msg = rank
- res = self.comm.exscan(sendobj=(msg, rank), op=MPI.MAXLOC)
- if self.comm.rank == 0:
- self.assertEqual(res, None)
- else:
- self.assertEqual(res, (rank-1, rank-1))
-
-class TestS(BaseTest, unittest.TestCase):
- def setUp(self):
- self.comm = Intracomm(MPI.COMM_SELF)
-
-class TestW(BaseTest, unittest.TestCase):
- def setUp(self):
- self.comm = Intracomm(MPI.COMM_WORLD)
-
-class TestSD(BaseTest, unittest.TestCase):
- def setUp(self):
- self.comm = Intracomm(MPI.COMM_SELF.Dup())
- def tearDown(self):
- self.comm.Free()
-
-class TestWD(BaseTest, unittest.TestCase):
- def setUp(self):
- self.comm = Intracomm(MPI.COMM_WORLD.Dup())
- def tearDown(self):
- self.comm.Free()
-
-if __name__ == "__main__":
- unittest.main()
diff --git a/demo/sequential/makefile b/demo/sequential/makefile
deleted file mode 100644
index ce63ee1..0000000
--- a/demo/sequential/makefile
+++ /dev/null
@@ -1,10 +0,0 @@
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=5
-
-PYTHON=python
-
-.PHONY: test
-test:
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test_seq.py
- ${RM} -r *.py[co] __pycache__
diff --git a/demo/sequential/runtests.bat b/demo/sequential/runtests.bat
deleted file mode 100644
index b219574..0000000
--- a/demo/sequential/runtests.bat
+++ /dev/null
@@ -1,16 +0,0 @@
-@echo off
-setlocal ENABLEEXTENSIONS
-
-set MPI=Microsoft MPI
-set PATH="%ProgramFiles%\%MPI%\bin";%PATH%
-
-set MPIEXEC=mpiexec
-set NP_FLAG=-n
-set NP=5
-
-set PYTHON=C:\Python27\python.exe
-set PYTHON=C:\Python36\python.exe
-set PYTHON=python
-
-@echo on
-%MPIEXEC% %NP_FLAG% %NP% %PYTHON% test_seq.py
diff --git a/demo/sequential/runtests.sh b/demo/sequential/runtests.sh
deleted file mode 100755
index e030e81..0000000
--- a/demo/sequential/runtests.sh
+++ /dev/null
@@ -1,10 +0,0 @@
-#!/bin/sh
-
-MPIEXEC=mpiexec
-NP_FLAG=-n
-NP=5
-
-PYTHON=python
-
-set -x
-$MPIEXEC $NP_FLAG $NP $PYTHON test_seq.py
diff --git a/demo/sequential/seq.py b/demo/sequential/seq.py
deleted file mode 100644
index 1abe451..0000000
--- a/demo/sequential/seq.py
+++ /dev/null
@@ -1,52 +0,0 @@
-class Seq(object):
-
- """
- Sequential execution
- """
-
- def __init__(self, comm, ng=1, tag=0):
- ng = int(ng)
- tag = int(tag)
- assert ng >= 1
- assert ng <= comm.Get_size()
- self.comm = comm
- self.ng = ng
- self.tag = tag
-
- def __enter__(self):
- self.begin()
- return self
-
- def __exit__(self, *exc):
- self.end()
- return None
-
- def begin(self):
- """
- Begin a sequential execution of a section of code
- """
- comm = self.comm
- size = comm.Get_size()
- if size == 1: return
- rank = comm.Get_rank()
- ng = self.ng
- tag = self.tag
- if rank != 0:
- comm.Recv([None, 'B'], rank - 1, tag)
- if rank != (size - 1) and (rank % ng) < (ng - 1):
- comm.Send([None, 'B'], rank + 1, tag)
-
- def end(self):
- """
- End a sequential execution of a section of code
- """
- comm = self.comm
- size = comm.Get_size()
- if size == 1: return
- rank = comm.Get_rank()
- ng = self.ng
- tag = self.tag
- if rank == (size - 1) or (rank % ng) == (ng - 1):
- comm.Send([None, 'B'], (rank + 1) % size, tag)
- if rank == 0:
- comm.Recv([None, 'B'], size - 1, tag)
diff --git a/demo/sequential/test_seq.py b/demo/sequential/test_seq.py
deleted file mode 100644
index 12810d1..0000000
--- a/demo/sequential/test_seq.py
+++ /dev/null
@@ -1,22 +0,0 @@
-#import mpi4pyve
-#mpi4pyve.profile("mpe")
-from mpi4pyve import MPI
-
-import unittest
-
-import sys, os
-sys.path.insert(0, os.path.dirname(__file__))
-from seq import Seq
-del sys.path[0]
-
-def test():
- size = MPI.COMM_WORLD.Get_size()
- rank = MPI.COMM_WORLD.Get_rank()
- name = MPI.Get_processor_name()
- with Seq(MPI.COMM_WORLD, 1, 10):
- print(
- "Hello, World! I am process %d of %d on %s."
- % (rank, size, name))
-
-if __name__ == "__main__":
- test()
diff --git a/demo/spawning/cpi-master.c b/demo/spawning/cpi-master.c
deleted file mode 100644
index 0d84c8c..0000000
--- a/demo/spawning/cpi-master.c
+++ /dev/null
@@ -1,35 +0,0 @@
-#include <mpi.h>
-#include <stdio.h>
-#include <string.h>
-#include <math.h>
-
-int main(int argc, char *argv[])
-{
- char cmd[32] = "./cpi-worker-c.exe";
- MPI_Comm worker;
- int n;
- double pi;
-
- MPI_Init(&argc, &argv);
-
- if (argc > 1) strcpy(cmd, argv[1]);
- printf("%s -> %s\n", argv[0], cmd);
-
- MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 5,
- MPI_INFO_NULL, 0,
- MPI_COMM_SELF, &worker,
- MPI_ERRCODES_IGNORE);
-
- n = 100;
- MPI_Bcast(&n, 1, MPI_INT, MPI_ROOT, worker);
-
- MPI_Reduce(MPI_BOTTOM, &pi, 1, MPI_DOUBLE,
- MPI_SUM, MPI_ROOT, worker);
-
- MPI_Comm_disconnect(&worker);
-
- printf("pi: %.16f, error: %.16f\n", pi, fabs(M_PI-pi));
-
- MPI_Finalize();
- return 0;
-}
diff --git a/demo/spawning/cpi-master.cxx b/demo/spawning/cpi-master.cxx
deleted file mode 100644
index f119e8e..0000000
--- a/demo/spawning/cpi-master.cxx
+++ /dev/null
@@ -1,31 +0,0 @@
-#include <mpi.h>
-#include <cstdio>
-#include <cstring>
-#include <cmath>
-
-int main(int argc, char *argv[])
-{
- MPI::Init();
-
- char cmd[32] = "./cpi-worker-cxx.exe";
- if (argc > 1) std::strcpy(cmd, argv[1]);
- std::printf("%s -> %s\n", argv[0], cmd);
-
- MPI::Intercomm worker;
- worker = MPI::COMM_SELF.Spawn(cmd, MPI::ARGV_NULL, 5,
- MPI::INFO_NULL, 0);
-
- int n = 100;
- worker.Bcast(&n, 1, MPI::INT, MPI::ROOT);
-
- double pi;
- worker.Reduce(MPI::BOTTOM, &pi, 1, MPI::DOUBLE,
- MPI::SUM, MPI::ROOT);
-
- worker.Disconnect();
-
- std::printf("pi: %.16f, error: %.16f\n", pi, std::fabs(M_PI-pi));
-
- MPI::Finalize();
- return 0;
-}
diff --git a/demo/spawning/cpi-master.f90 b/demo/spawning/cpi-master.f90
deleted file mode 100644
index 30451f9..0000000
--- a/demo/spawning/cpi-master.f90
+++ /dev/null
@@ -1,45 +0,0 @@
-PROGRAM main
-
- USE mpi
- implicit none
-
- real (kind=8), parameter :: PI = 3.1415926535897931D0
-
- integer argc
- character(len=32) argv(0:1)
-
- character(len=32) cmd
- integer ierr, n, worker
- real(kind=8) cpi
-
- call MPI_INIT(ierr)
-
- argc = iargc() + 1
- call getarg(0, argv(0))
- call getarg(1, argv(1))
-
- cmd = './cpi-worker-f90.exe'
- if (argc > 1) then
- cmd = argv(1)
- end if
- write(*,'(A,A,A)') trim(argv(0)), ' -> ', trim(cmd)
-
- call MPI_COMM_SPAWN(cmd, MPI_ARGV_NULL, 5, &
- MPI_INFO_NULL, 0, &
- MPI_COMM_SELF, worker, &
- MPI_ERRCODES_IGNORE, ierr)
-
- n = 100
- call MPI_BCAST(n, 1, MPI_INTEGER, &
- MPI_ROOT, worker, ierr)
-
- call MPI_REDUCE(MPI_BOTTOM, cpi, 1, MPI_DOUBLE_PRECISION, &
- MPI_SUM, MPI_ROOT, worker, ierr)
-
- call MPI_COMM_DISCONNECT(worker, ierr)
-
- write(*,'(A,F18.16,A,F18.16)') 'pi: ', cpi, ', error: ', abs(PI-cpi)
-
- call MPI_FINALIZE(ierr)
-
-END PROGRAM main
diff --git a/demo/spawning/cpi-master.py b/demo/spawning/cpi-master.py
deleted file mode 100644
index 133d602..0000000
--- a/demo/spawning/cpi-master.py
+++ /dev/null
@@ -1,23 +0,0 @@
-from mpi4pyve import MPI
-from array import array
-from math import pi as PI
-from sys import argv
-
-cmd = './cpi-worker-py.exe'
-if len(argv) > 1: cmd = argv[1]
-print("%s -> %s" % (argv[0], cmd))
-
-worker = MPI.COMM_SELF.Spawn(cmd, None, 5)
-
-n = array('i', [100])
-worker.Bcast([n,MPI.INT], root=MPI.ROOT)
-
-pi = array('d', [0.0])
-worker.Reduce(sendbuf=None,
- recvbuf=[pi, MPI.DOUBLE],
- op=MPI.SUM, root=MPI.ROOT)
-pi = pi[0]
-
-worker.Disconnect()
-
-print('pi: %.16f, error: %.16f' % (pi, abs(PI-pi)))
diff --git a/demo/spawning/cpi-worker.c b/demo/spawning/cpi-worker.c
deleted file mode 100644
index 7aed62a..0000000
--- a/demo/spawning/cpi-worker.c
+++ /dev/null
@@ -1,33 +0,0 @@
-#include <mpi.h>
-
-int main(int argc, char *argv[])
-{
- int myrank, nprocs;
- int n, i;
- double h, s, pi;
- MPI_Comm master;
-
- MPI_Init(&argc, &argv);
-
- MPI_Comm_get_parent(&master);
- MPI_Comm_size(master, &nprocs);
- MPI_Comm_rank(master, &myrank);
-
- MPI_Bcast(&n, 1, MPI_INT, 0, master);
-
- h = 1.0 / (double) n;
- s = 0.0;
- for (i = myrank+1; i < n+1; i += nprocs) {
- double x = h * (i - 0.5);
- s += 4.0 / (1.0 + x*x);
- }
- pi = s * h;
-
- MPI_Reduce(&pi, MPI_BOTTOM, 1, MPI_DOUBLE,
- MPI_SUM, 0, master);
-
- MPI_Comm_disconnect(&master);
-
- MPI_Finalize();
- return 0;
-}
diff --git a/demo/spawning/cpi-worker.cxx b/demo/spawning/cpi-worker.cxx
deleted file mode 100644
index 8d0eae2..0000000
--- a/demo/spawning/cpi-worker.cxx
+++ /dev/null
@@ -1,29 +0,0 @@
-#include <mpi.h>
-
-int main(int argc, char *argv[])
-{
- MPI::Init();
-
- MPI::Intercomm master = MPI::Comm::Get_parent();
- int nprocs = master.Get_size();
- int myrank = master.Get_rank();
-
- int n;
- master.Bcast(&n, 1, MPI_INT, 0);
-
- double h = 1.0 / (double) n;
- double s = 0.0;
- for (int i = myrank+1; i < n+1; i += nprocs) {
- double x = h * (i - 0.5);
- s += 4.0 / (1.0 + x*x);
- }
- double pi = s * h;
-
- master.Reduce(&pi, MPI_BOTTOM, 1, MPI_DOUBLE,
- MPI_SUM, 0);
-
- master.Disconnect();
-
- MPI::Finalize();
- return 0;
-}
diff --git a/demo/spawning/cpi-worker.f90 b/demo/spawning/cpi-worker.f90
deleted file mode 100644
index 9b20955..0000000
--- a/demo/spawning/cpi-worker.f90
+++ /dev/null
@@ -1,32 +0,0 @@
-PROGRAM main
-
- USE mpi
- implicit none
-
- integer ierr
- integer n, i, master, myrank, nprocs
- real (kind=8) h, s, x, cpi
-
- call MPI_INIT(ierr)
- call MPI_COMM_GET_PARENT(master, ierr)
- call MPI_COMM_SIZE(master, nprocs, ierr)
- call MPI_COMM_RANK(master, myrank, ierr)
-
- call MPI_BCAST(n, 1, MPI_INTEGER, &
- 0, master, ierr)
-
- h = 1 / DFLOAT(n)
- s = 0.0
- DO i=myrank+1,n,nprocs
- x = h * (DFLOAT(i) - 0.5)
- s = s + 4.0 / (1.0 + x*x)
- END DO
- cpi = s * h
-
- call MPI_REDUCE(cpi, MPI_BOTTOM, 1, MPI_DOUBLE_PRECISION, &
- MPI_SUM, 0, master, ierr)
-
- call MPI_COMM_DISCONNECT(master, ierr)
- call MPI_FINALIZE(ierr)
-
-END PROGRAM main
diff --git a/demo/spawning/cpi-worker.py b/demo/spawning/cpi-worker.py
deleted file mode 100644
index 8862d54..0000000
--- a/demo/spawning/cpi-worker.py
+++ /dev/null
@@ -1,24 +0,0 @@
-from mpi4pyve import MPI
-from array import array
-
-master = MPI.Comm.Get_parent()
-nprocs = master.Get_size()
-myrank = master.Get_rank()
-
-n = array('i', [0])
-master.Bcast([n, MPI.INT], root=0)
-n = n[0]
-
-h = 1.0 / n
-s = 0.0
-for i in range(myrank+1, n+1, nprocs):
- x = h * (i - 0.5)
- s += 4.0 / (1.0 + x**2)
-pi = s * h
-
-pi = array('d', [pi])
-master.Reduce(sendbuf=[pi, MPI.DOUBLE],
- recvbuf=None,
- op=MPI.SUM, root=0)
-
-master.Disconnect()
diff --git a/demo/spawning/makefile b/demo/spawning/makefile
deleted file mode 100644
index 2017ba7..0000000
--- a/demo/spawning/makefile
+++ /dev/null
@@ -1,57 +0,0 @@
-.PHONY: default build test clean
-
-MPIEXEC=mpiexec -n 1
-
-default: build test clean
-
-MASTERS = cpi-master-py.exe cpi-master-c.exe cpi-master-cxx.exe cpi-master-f90.exe
-WORKERS = cpi-worker-py.exe cpi-worker-c.exe cpi-worker-cxx.exe cpi-worker-f90.exe
-
-build: ${MASTERS} ${WORKERS}
-
-LANGS=py c cxx f90
-test: build
- @for i in ${LANGS}; do \
- for j in ${LANGS}; do \
- ${MPIEXEC} ./cpi-master-$$i.exe ./cpi-worker-$$j.exe; \
- done; \
- done
-
-clean:
- ${RM} -r ${MASTERS} ${WORKERS}
-
-
-MPICC=mpicc
-MPICXX=mpicxx
-MPIF90=mpif90
-ifneq (${MPI_FORTRAN_MOD_DIR},)
-FFLAGS += -I${MPI_FORTRAN_MOD_DIR}
-endif
-
-# Python
-cpi-master-py.exe: cpi-master.py
- echo '#!'`which python` > $@
- cat $< >> $@
- chmod +x $@
-cpi-worker-py.exe: cpi-worker.py
- echo '#!'`which python` > $@
- cat $< >> $@
- chmod +x $@
-
-# C
-cpi-master-c.exe: cpi-master.c
- ${MPICC} $< -o $@
-cpi-worker-c.exe: cpi-worker.c
- ${MPICC} $< -o $@
-
-# C++
-cpi-master-cxx.exe: cpi-master.cxx
- ${MPICXX} $< -o $@
-cpi-worker-cxx.exe: cpi-worker.cxx
- ${MPICXX} $< -o $@
-
-# Fortran 90
-cpi-master-f90.exe: cpi-master.f90
- ${MPIF90} ${FFLAGS} $< -o $@
-cpi-worker-f90.exe: cpi-worker.f90
- ${MPIF90} ${FFLAGS} $< -o $@
diff --git a/demo/test-run/makefile b/demo/test-run/makefile
deleted file mode 100644
index ef0bd20..0000000
--- a/demo/test-run/makefile
+++ /dev/null
@@ -1,18 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-
-.PHONY: build
-build:
- mkdir -p run-directory/
- cp run-script.py run-directory/__main__.py
- zip -qj run-zipfile.zip run-directory/__main__.py
-
-.PHONY: test
-test:
- ${PYTHON} test_run.py -v
-
-.PHONY: clean
-clean:
- ${RM} -r run-directory run-zipfile.zip
diff --git a/demo/test-run/run-script.py b/demo/test-run/run-script.py
deleted file mode 100644
index 9fc4075..0000000
--- a/demo/test-run/run-script.py
+++ /dev/null
@@ -1,43 +0,0 @@
-from mpi4pyve import MPI
-import sys, os, optparse
-
-assert __name__ == '__main__'
-from os.path import split, splitext, dirname, realpath
-dirname = dirname(__file__)
-assert sys.path[0] == realpath(dirname)
-if split(__file__)[1] == '__main__.py':
- if splitext(dirname)[0] == '.zip':
- assert sys.argv[0] == dirname
- else:
- assert realpath(sys.argv[0]) == realpath(dirname)
-else:
- assert sys.argv[0] == __file__
-
-parser = optparse.OptionParser()
-parser.add_option("--rank", action='store',
- type='int', dest="rank", default=0)
-parser.add_option("--sys-exit", action='store',
- type='int', dest="sys_exit", default=None)
-parser.add_option("--sys-exit-msg", action="store",
- type="string", dest="sys_exit", default=None)
-parser.add_option("--exception", action="store",
- type="string", dest="exception", default=None)
-(options, args) = parser.parse_args()
-assert not args
-
-comm = MPI.COMM_WORLD
-if comm.rank == options.rank:
- if options.sys_exit:
- sys.exit(options.sys_exit)
- if options.exception:
- raise RuntimeError(options.exception)
-
-comm.Barrier()
-if comm.rank > 0:
- comm.Recv([None, 'B'], comm.rank - 1)
-print("Hello, World!")
-if comm.rank < comm.size - 1:
- comm.Send([None, 'B'], comm.rank + 1)
-comm.Barrier()
-
-sys.exit()
diff --git a/demo/test-run/test_run.py b/demo/test-run/test_run.py
deleted file mode 100644
index 0c7df8b..0000000
--- a/demo/test-run/test_run.py
+++ /dev/null
@@ -1,173 +0,0 @@
-import sys, os, shlex
-import subprocess as sp
-import unittest
-import mpi4pyve
-
-def find_executable(exe):
- from distutils.spawn import find_executable as find_exe
- command = shlex.split(exe)
- executable = find_exe(command[0])
- if executable:
- command[0] = executable
- return ' '.join(command)
-
-def find_mpiexec(mpiexec='mpiexec'):
- mpiexec = os.environ.get('MPIEXEC') or mpiexec
- mpiexec = find_executable(mpiexec)
- if not mpiexec and sys.platform.startswith('win'):
- MSMPI_BIN = os.environ.get('MSMPI_BIN', '')
- mpiexec = os.path.join(MSMPI_BIN, mpiexec)
- mpiexec = find_executable(mpiexec)
- if not mpiexec:
- mpiexec = find_executable('mpirun')
- return mpiexec
-
-def launcher(np):
- mpiexec = find_mpiexec()
- python = sys.executable
- if 'coverage' in sys.modules:
- python += ' -m coverage run -p -m'
- module = 'mpi4pyve.run -rc threads=False'
- command = '{mpiexec} -n {np} {python} -m {module}'
- return shlex.split(command.format(**vars()))
-
-def execute(np, command, args=''):
- env = os.environ.copy()
- pypath = os.environ.get('PYTHONPATH', '').split(os.pathsep)
- pypath.insert(0, os.path.abspath(os.path.dirname(mpi4pyve.__path__[0])))
- env['PYTHONPATH'] = os.pathsep.join(pypath)
- if isinstance(command, str):
- command = shlex.split(command)
- if isinstance(args, str):
- args = shlex.split(args)
- cmdline = launcher(np) + command + args
- p = sp.Popen(cmdline, stdout=sp.PIPE, stderr=sp.PIPE, env=env, bufsize=0)
- stdout, stderr = p.communicate()
- return p.returncode, stdout.decode(), stderr.decode()
-
-
-class BaseTestRun(object):
-
- def assertMPIAbort(self, stdout, stderr):
- if not ('MPI_Abort' in stdout or 'MPI_ABORT' in stdout or
- 'MPI_Abort' in stderr or 'MPI_ABORT' in stderr):
- msg = ("expecting MPI_Abort() message in stdout/stderr:\n"
- "[stdout]:\n{0}\n[stderr]:\n{1}\n").format(stdout, stderr)
- raise self.failureException(msg)
-
-
-class TestRunScript(BaseTestRun, unittest.TestCase):
- pyfile = 'run-script.py'
-
- def execute(self, args='', np=1):
- dirname = os.path.abspath(os.path.dirname(__file__))
- script = os.path.join(dirname, self.pyfile)
- return execute(np, script, args)
-
- def testSuccess(self):
- success = 'Hello, World!'
- for np in (1, 2, 3):
- status, stdout, stderr = self.execute(np=np)
- self.assertEqual(status, 0)
- self.assertEqual(stderr, '')
- self.assertEqual(stdout.count(success), np)
-
- def testException(self):
- message = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
- excmess = 'RuntimeError: {0}'.format(message)
- for np in (1, 2, 3):
- for rank in range(0, np):
- args = ['--rank', str(rank), '--exception', message]
- status, stdout, stderr = self.execute(args, np)
- self.assertEqual(status, 1)
- self.assertMPIAbort(stdout, stderr)
- self.assertTrue(excmess in stderr)
-
- def testSysExitCode(self):
- errcode = 7
- for np in (1, 2, 3):
- for r in sorted(set([0, np-1])):
- args = ['--rank', str(r), '--sys-exit', str(errcode)]
- status, stdout, stderr = self.execute(args, np)
- self.assertTrue(status in (errcode, 1))
- self.assertMPIAbort(stdout, stderr)
- self.assertTrue('Traceback' not in stderr)
-
- def testSysExitMess(self):
- exitmsg = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
- for np in (1, 2, 3):
- for r in sorted(set([0, np-1])):
- args = ['--rank', str(r), '--sys-exit-msg', exitmsg]
- status, stdout, stderr = self.execute(args, np)
- self.assertEqual(status, 1)
- self.assertMPIAbort(stdout, stderr)
- self.assertTrue('Traceback' not in stderr)
- self.assertTrue(exitmsg in stderr)
-
-if os.path.exists(os.path.join(os.path.dirname(__file__), 'run-directory')):
- class TestRunDirectory(TestRunScript):
- pyfile = 'run-directory'
-
-if os.path.exists(os.path.join(os.path.dirname(__file__), 'run-zipfile.zip')):
- class TestRunZipFile(TestRunScript):
- pyfile = 'run-zipfile.zip'
-
-
-class TestRunModule(BaseTestRun, unittest.TestCase):
-
- def execute(self, module, np=1):
- return execute(np, '-m', module)
-
- def testSuccess(self):
- module = 'mpi4pyve.bench --no-threads helloworld'
- message = 'Hello, World!'
- for np in (1, 2, 3):
- status, stdout, stderr = self.execute(module, np)
- self.assertEqual(status, 0)
- self.assertEqual(stdout.count(message), np)
- self.assertEqual(stderr, '')
-
-
-class TestRunCommand(BaseTestRun, unittest.TestCase):
-
- def execute(self, command, np=1):
- return execute(np, '-c', command)
-
- def testArgv0(self):
- command = '"import sys; print(sys.argv[0])"'
- status, stdout, stderr = self.execute(command, 1)
- self.assertEqual(status, 0)
- self.assertEqual(stdout, '-c\n')
- self.assertEqual(stderr, '')
-
- def testSuccess(self):
- command = '"from mpi4pyve import MPI"'
- for np in (1, 2, 3):
- status, stdout, stderr = self.execute(command, np)
- self.assertEqual(status, 0)
- self.assertEqual(stdout, '')
- self.assertEqual(stderr, '')
-
- def testException(self):
- command = '"from mpi4pyve import MPI; 1/0 if MPI.COMM_WORLD.Get_rank()==0 else 0;"'
- excmess = 'ZeroDivisionError:'
- for np in (1, 2, 3):
- for rank in range(0, np):
- status, stdout, stderr = self.execute(command, np)
- self.assertEqual(status, 1)
- self.assertMPIAbort(stdout, stderr)
- self.assertTrue(excmess in stderr)
-
-
-if not find_mpiexec():
- del TestRunScript
- try: del TestRunDirectory
- except: pass
- try: del TestRunZipFile
- except: pass
- del TestRunModule
- del TestRunCommand
-
-
-if __name__ == '__main__':
- unittest.main()
diff --git a/demo/thermal/README.rst b/demo/thermal/README.rst
new file mode 100644
index 0000000..4e353b8
--- /dev/null
+++ b/demo/thermal/README.rst
@@ -0,0 +1,53 @@
+Performs a 3D thermal simulation with mpi4py-ve, using NumPy on VH or NLCPy on VE.
+
+
+To run on VH with 8 processes, issue at the command line::
+
+ $ mpirun -veo -np 8 python thermal.py -dev vh
+ loop_cnt: 0, l2_norm: 560.756531
+ loop_cnt: 1000, l2_norm: 3.863911
+ loop_cnt: 2000, l2_norm: 1.941148
+ loop_cnt: 3000, l2_norm: 1.204764
+ loop_cnt: 4000, l2_norm: 0.803408
+ loop_cnt: 5000, l2_norm: 0.552903
+ loop_cnt: 6000, l2_norm: 0.386492
+ loop_cnt: 7000, l2_norm: 0.272347
+ loop_cnt: 8000, l2_norm: 0.192709
+ loop_cnt: 9000, l2_norm: 0.136654
+ loop_cnt: 10000, l2_norm: 0.097005
+ loop_cnt: 11000, l2_norm: 0.068903
+ loop_cnt: 12000, l2_norm: 0.048968
+ loop_cnt: 13000, l2_norm: 0.034811
+ loop_cnt: 14000, l2_norm: 0.024766
+ loop_cnt: 15000, l2_norm: 0.017631
+ loop_cnt: 16000, l2_norm: 0.012578
+ loop_cnt: 17000, l2_norm: 0.009005
+ elapsed: 12.070626020431519
+
+
+To run on VE with 4 processes (one per VE node 0-3), issue at the command line::
+
+ $ VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 4 python thermal.py -dev ve
+ loop_cnt: 0, l2_norm: 560.756592
+ loop_cnt: 1000, l2_norm: 3.863911
+ loop_cnt: 2000, l2_norm: 1.941150
+ loop_cnt: 3000, l2_norm: 1.204767
+ loop_cnt: 4000, l2_norm: 0.803411
+ loop_cnt: 5000, l2_norm: 0.552902
+ loop_cnt: 6000, l2_norm: 0.386493
+ loop_cnt: 7000, l2_norm: 0.272352
+ loop_cnt: 8000, l2_norm: 0.192710
+ loop_cnt: 9000, l2_norm: 0.136653
+ loop_cnt: 10000, l2_norm: 0.097009
+ loop_cnt: 11000, l2_norm: 0.068909
+ loop_cnt: 12000, l2_norm: 0.048973
+ loop_cnt: 13000, l2_norm: 0.034821
+ loop_cnt: 14000, l2_norm: 0.024770
+ loop_cnt: 15000, l2_norm: 0.017647
+ loop_cnt: 16000, l2_norm: 0.012596
+ loop_cnt: 17000, l2_norm: 0.009027
+ elapsed: 3.367392063140869
+
+
+.. image:: ./img/img_thermal_nlcpy.png
+ :alt: img_thermal_simulation
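
For reference, both runs converge after roughly 17,000 time steps, and the 4-VE run finishes in about 3.4 s versus 12.1 s for 8 VH processes, a roughly 3.6x speedup. ``VE_NLCPY_NODELIST=0,1,2,3`` makes VE nodes 0-3 available to NLCPy, so that each of the four ranks can drive its own VE.
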
diff --git a/demo/thermal/img/img_thermal_nlcpy.png b/demo/thermal/img/img_thermal_nlcpy.png
new file mode 100644
index 0000000..15a9e1d
Binary files /dev/null and b/demo/thermal/img/img_thermal_nlcpy.png differ
diff --git a/demo/thermal/img/img_thermal_numpy.png b/demo/thermal/img/img_thermal_numpy.png
new file mode 100644
index 0000000..15a9e1d
Binary files /dev/null and b/demo/thermal/img/img_thermal_numpy.png differ
diff --git a/demo/thermal/makefile b/demo/thermal/makefile
new file mode 100644
index 0000000..675dfde
--- /dev/null
+++ b/demo/thermal/makefile
@@ -0,0 +1,7 @@
+.PHONY: test
+
+PYTHON=python
+
+test:
+ echo "Thermal Simulation on VH (8 process)" && mpirun -veo -np 8 ${PYTHON} thermal.py -dev vh
+ echo "Thermal Simulation on 4VE (4 process)" && VE_NLCPY_NODELIST=0,1,2,3 mpirun -veo -np 4 ${PYTHON} thermal.py -dev ve
diff --git a/demo/thermal/thermal.py b/demo/thermal/thermal.py
new file mode 100644
index 0000000..9239efb
--- /dev/null
+++ b/demo/thermal/thermal.py
@@ -0,0 +1,257 @@
+from mpi4pyve import MPI
+from mpi4pyve import util
+from matplotlib import pyplot as plt
+import argparse
+
+NX = 100 # The number of grid points in X-direction.
+NY = 100 # The number of grid points in Y-direction.
+NZ = 100 # The number of grid points in Z-direction.
+DT = 0.001 # The time step interval.
+CHECK_INTERVAL = 1000 # Check convergence every CHECK_INTERVAL time steps.
+LX = NX * 1e-3
+LY = NY * 1e-3
+LZ = NZ * 1e-3
+T0 = 20.0
+T1 = 60.0
+T2 = 80.0
+HC = 398.0 / (8960.0 * 385.0)
+DTYPE = 'float32'
+TOLERANCE = 1e-2
+
+def initialize(grid, dev):
+ grid.fill(T0)
+ grid[:, :, 0] = T1 * dev.sin(
+ dev.linspace(0, dev.pi, grid.shape[0]))[:, None]
+ grid[:, 0, :] = T2 * dev.sin(
+ dev.linspace(0, dev.pi, grid.shape[0]))[:, None]
+
+def get_count_and_displs(rank, size):
+ offset = 0
+ count = []
+ displs = []
+ for r in range(size):
+ lz_s = NZ * r // size
+ lz_e = NZ * (r + 1) // size
+ count.append(lz_e - lz_s + 2)
+ displs.append(lz_s)
+ return count, displs
+
+def create_stencil_kernel(grid_work1, grid_work2, coef, vp):
+ kernels = []
+ dgrid1, dgrid2 = vp.sca.create_descriptor((grid_work1, grid_work2))
+ # input: grid_work1, output: grid_work2
+ desc = ((dgrid1[0, 0, -1] + dgrid1[0, 0, 1]) * coef[0] +
+ (dgrid1[0, -1, 0] + dgrid1[0, 1, 0]) * coef[1] +
+ (dgrid1[-1, 0, 0] + dgrid1[1, 0, 0]) * coef[2] +
+ dgrid1[0, 0, 0] * coef[3])
+ kernels.append(vp.sca.create_kernel(desc, desc_o=dgrid2[0, 0, 0]))
+ # input: grid_work2, output: grid_work1
+ desc = ((dgrid2[0, 0, -1] + dgrid2[0, 0, 1]) * coef[0] +
+ (dgrid2[0, -1, 0] + dgrid2[0, 1, 0]) * coef[1] +
+ (dgrid2[-1, 0, 0] + dgrid2[1, 0, 0]) * coef[2] +
+ dgrid2[0, 0, 0] * coef[3])
+ kernels.append(vp.sca.create_kernel(desc, desc_o=dgrid1[0, 0, 0]))
+ return kernels
+
+def execute_naive(grid_in, grid_out, coef):
+ grid_out[1:-1, 1:-1, 1:-1] = (
+ (grid_in[1:-1, 1:-1, 0:-2] + grid_in[1:-1, 1:-1, 2:]) * coef[0] +
+ (grid_in[1:-1, 0:-2, 1:-1] + grid_in[1:-1, 2:, 1:-1]) * coef[1] +
+ (grid_in[0:-2, 1:-1, 1:-1] + grid_in[2:, 1:-1, 1:-1]) * coef[2] +
+ grid_in[1:-1, 1:-1, 1:-1] * coef[3])
+ return grid_out
+
+def exchange_data(grid, comm, rank, size):
+    """ Exchange halo boundary data between neighboring processes.
+    Each '-----' row represents an xy plane of the grid.
+ grid_root: -----------
+ * transfer to upper process
+ rank0 : -----
+ ^
+ |
+ rank1 : -----
+ ^
+ |
+ rank2 : -----
+ * transfer to lower process
+ rank0 : -----
+ |
+ v
+ rank1 : -----
+ |
+ v
+ rank2 : -----
+ """
+ if size == 1: return
+ # transfer to upper process
+ if rank == 0:
+ peer_src = rank + 1
+ peer_dst = MPI.PROC_NULL
+ elif rank == size - 1:
+ peer_src = MPI.PROC_NULL
+ peer_dst = rank - 1
+ else:
+ peer_src = rank + 1
+ peer_dst = rank - 1
+ comm.Sendrecv(grid[1], dest=peer_dst, recvbuf=grid[-1], source=peer_src)
+ # transfer to lower process
+ if rank == 0:
+ peer_src = MPI.PROC_NULL
+ peer_dst = rank + 1
+ elif rank == size - 1:
+ peer_src = rank - 1
+ peer_dst = MPI.PROC_NULL
+ else:
+ peer_src = rank - 1
+ peer_dst = rank + 1
+ comm.Sendrecv(grid[-2], dest=peer_dst, recvbuf=grid[0], source=peer_src)
+
+def get_l2_norm(grid_work1, grid_work2, comm, dev):
+ norm_local = dev.power(
+ grid_work1[1:-1, 1:-1, 1:-1] - grid_work2[1:-1, 1:-1, 1:-1], 2).sum()
+ l2_norm = dev.zeros_like(norm_local)
+ comm.Allreduce(norm_local, l2_norm, op=MPI.SUM)
+ l2_norm = dev.sqrt(l2_norm)
+ return float(l2_norm)
+
+def scatter_to_local_grid(grid_root, grid_local, count, displs, comm, rank, size):
+ if size == 1:
+ grid_local[...] = grid_root
+ else:
+ if rank == 0:
+ begin = displs[0]
+ end = begin + count[0]
+ grid_local[...] = grid_root[begin:end]
+ for r in range(1, size):
+ if rank == 0:
+ begin = displs[r]
+ end = begin + count[r]
+ comm.Send(grid_root[begin:end], dest=r)
+ elif rank == r:
+ comm.Recv(grid_local, source=0)
+
+def gather_from_local_grid(grid_root, grid_local, count, displs, comm, rank, size):
+ if size == 1:
+ grid_root[...] = grid_local
+ else:
+ if rank == 0:
+ begin = displs[0]
+ end = begin + count[0]
+ grid_root[begin:end] = grid_local
+ for r in range(1, size):
+ if rank == 0:
+ begin = displs[r]
+ end = begin + count[r]
+ comm.Recv(grid_root[begin:end], source=r)
+ elif rank == r:
+ comm.Send(grid_local, dest=0)
+
+def draw(fig, ax, xx, yy, grid, z, t):
+ ax.set_xlabel("x[m]")
+ ax.set_ylabel("y[m]")
+ ax.set_title("z = {:4.3f} [m], timestep = {:>10d}".format(z, t))
+ c = ax.pcolormesh(xx, yy, grid, cmap='coolwarm', vmin=0, vmax=100)
+ return c
+
+def thermal(dev):
+ comm = MPI.COMM_WORLD
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+ dx = LX / (NX + 1)
+ dy = LY / (NY + 1)
+ dz = LZ / (NZ + 1)
+ coef = [
+ (HC * DT) / (dx * dx),
+ (HC * DT) / (dy * dy),
+ (HC * DT) / (dz * dz),
+ 1.0 - 2.0 * HC * DT * (1 / (dx * dx) + 1 / (dy * dy) + 1 / (dz * dz)),
+ ]
+ mx = NX + 2
+ my = NY + 2
+ mz = NZ + 2
+ # create base grid on root process
+ if rank == 0:
+ x = dev.linspace(0, LX, mx)
+ y = dev.linspace(0, LY, my)
+ z = dev.linspace(0, LZ, mz)
+ zz, yy, xx = dev.meshgrid(z, y, x, indexing='ij')
+ grid_root = dev.empty((mz, my, mx), dtype=DTYPE)
+ initialize(grid_root, dev)
+ else:
+ grid_root = None
+ # draw initial grid
+ if rank == 0:
+ fig, axes = plt.subplots(3, 2, figsize=(9, 9), constrained_layout=True)
+ zstep = dev.linspace(0, mz, 5, dtype=int)[1:-1]
+ for i, ax in enumerate(axes[:, 0]):
+ zind = int(zstep[i])
+ c = draw(fig, ax, xx[zind, :, :], yy[zind, :, :],
+ grid_root[zind, :, :], float(LZ * zind / mz), 0)
+ # create local grid
+ count, displs = get_count_and_displs(rank, size)
+ lz_d = count[rank]
+ grid_work1 = dev.empty((lz_d, my, mx), dtype=DTYPE)
+ grid_work2 = dev.empty((lz_d, my, mx), dtype=DTYPE)
+ scatter_to_local_grid(grid_root, grid_work1, count, displs, comm, rank, size)
+ grid_work2[...] = grid_work1
+
+ if dev.__name__ == 'nlcpy':
+ # create stencil kernels
+ kernels = create_stencil_kernel(grid_work1, grid_work2, coef, dev)
+
+ # execute difference method
+ comm.Barrier()
+ t0 = MPI.Wtime()
+ loop_cnt = 0
+ while True:
+ if dev.__name__ == 'nlcpy':
+ grid = kernels[loop_cnt % 2].execute()
+ else:
+ grid = execute_naive(
+ grid_work1 if loop_cnt % 2 == 0 else grid_work2,
+ grid_work2 if loop_cnt % 2 == 0 else grid_work1,
+ coef)
+ exchange_data(grid, comm, rank, size)
+ if loop_cnt % CHECK_INTERVAL == 0: # check convergence
+ l2_norm = get_l2_norm(grid_work1, grid_work2, comm, dev)
+ if rank == 0: print("loop_cnt: {:>10d}, l2_norm: {:>12.6f}".format(loop_cnt, l2_norm))
+ if l2_norm < TOLERANCE:
+ break
+ loop_cnt += 1
+ comm.Barrier()
+ t1 = MPI.Wtime()
+ if rank == 0:
+ print("elapsed:", t1 - t0)
+
+ gather_from_local_grid(grid_root, grid, count, displs, comm, rank, size)
+ # draw latest grid and save figure
+ if rank == 0:
+ for i, ax in enumerate(axes[:, 1]):
+ zind = int(zstep[i])
+ c = draw(fig, ax, xx[zind, :, :], yy[zind, :, :],
+ grid_root[zind, :, :], float(LZ * zind / mz), loop_cnt)
+        fig.colorbar(c, ax=axes[:, 1], location='bottom', label=r'T[$^{\circ}$C]')
+ plt.savefig('img_thermal_{}.png'.format(dev.__name__))
+
+ if dev.__name__ == 'nlcpy':
+ # destroy stencil kernels
+ for kern in kernels:
+ vp.sca.destroy_kernel(kern)
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument('-dev', type=str, required=True, choices=['vh', 've'],
+ help='Execution device')
+ args = parser.parse_args()
+
+ # set module
+ if args.dev == 'vh':
+ import numpy as np
+ dev = np
+ elif args.dev == 've':
+ import nlcpy as vp
+ dev = vp
+ else:
+ raise ValueError
+
+ thermal(dev)
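
The script above parallelizes the explicit finite-difference update with a 1-D slab decomposition: the z-axis is split into contiguous slabs, one per rank, and each slab carries one extra halo plane on either side that ``exchange_data`` refreshes from its neighbors every step. A minimal standalone sketch of the split (plain Python, no MPI required), mirroring ``get_count_and_displs`` above::

    NZ = 100  # number of grid planes along z, as in the demo

    def count_and_displs(size):
        count, displs = [], []
        for r in range(size):
            lz_s = NZ * r // size        # first z-plane owned by rank r
            lz_e = NZ * (r + 1) // size  # one past the last owned plane
            count.append(lz_e - lz_s + 2)  # +2: one halo plane on each side
            displs.append(lz_s)
        return count, displs

    print(count_and_displs(4))
    # -> ([27, 27, 27, 27], [0, 25, 50, 75]): 25 owned planes + 2 halos each
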
diff --git a/demo/threads/makefile b/demo/threads/makefile
deleted file mode 100644
index 845e13d..0000000
--- a/demo/threads/makefile
+++ /dev/null
@@ -1,12 +0,0 @@
-.PHONY: default build test clean
-
-default: build test clean
-
-PYTHON=python
-
-build:
-
-test:
- ${PYTHON} sendrecv.py
-
-clean:
\ No newline at end of file
diff --git a/demo/threads/sendrecv.py b/demo/threads/sendrecv.py
deleted file mode 100644
index a843b03..0000000
--- a/demo/threads/sendrecv.py
+++ /dev/null
@@ -1,48 +0,0 @@
-from mpi4pyve import MPI
-import sys
-
-if MPI.Query_thread() < MPI.THREAD_MULTIPLE:
- sys.stderr.write("MPI does not provide enough thread support\n")
- sys.exit(0)
-
-try:
- import threading
-except ImportError:
- sys.stderr.write("threading module not available\n")
- sys.exit(0)
-
-try:
- import numpy
-except ImportError:
- sys.stderr.write("NumPy package not available\n")
- sys.exit(0)
-
-send_msg = numpy.arange(1000000, dtype='i')
-recv_msg = numpy.zeros_like(send_msg)
-
-start_event = threading.Event()
-
-def self_send():
- start_event.wait()
- comm = MPI.COMM_WORLD
- rank = comm.Get_rank()
- comm.Send([send_msg, MPI.INT], dest=rank, tag=0)
-
-def self_recv():
- start_event.wait()
- comm = MPI.COMM_WORLD
- rank = comm.Get_rank()
- comm.Recv([recv_msg, MPI.INT], source=rank, tag=0)
-
-send_thread = threading.Thread(target=self_send)
-recv_thread = threading.Thread(target=self_recv)
-
-for t in (recv_thread, send_thread):
- t.start()
-assert not numpy.allclose(send_msg, recv_msg)
-
-start_event.set()
-
-for t in (recv_thread, send_thread):
- t.join()
-assert numpy.allclose(send_msg, recv_msg)
diff --git a/demo/vampirtrace/cpilog.py b/demo/vampirtrace/cpilog.py
deleted file mode 100644
index 3b119e5..0000000
--- a/demo/vampirtrace/cpilog.py
+++ /dev/null
@@ -1,52 +0,0 @@
-#!/usr/bin/env python
-
-# If you want VampirTrace to log MPI calls, you have to add the two
-# lines below at the very beginning of your main bootstrap script.
-import mpi4pyve
-mpi4pyve.rc.threads = False
-mpi4pyve.profile('vt', logfile='cpilog')
-
-# Import the MPI extension module
-from mpi4pyve import MPI
-
-# Import the 'array' module
-from array import array
-
-# This is just to make the logging
-# output a bit more interesting
-from time import sleep
-
-comm = MPI.COMM_WORLD
-nprocs = comm.Get_size()
-myrank = comm.Get_rank()
-
-n = array('i', [0])
-pi = array('d', [0])
-mypi = array('d', [0])
-
-def comp_pi(n, myrank=0, nprocs=1):
- h = 1.0 / n;
- s = 0.0;
- for i in range(myrank + 1, n + 1, nprocs):
- x = h * (i - 0.5);
- s += 4.0 / (1.0 + x**2);
- return s * h
-
-comm.Barrier()
-
-for N in [10000]*10:
-
- if myrank == 0:
- n[0] = N
-
- comm.Bcast([n, MPI.INT], root=0)
-
- mypi[0] = comp_pi(n[0], myrank, nprocs)
-
- comm.Reduce([mypi, MPI.DOUBLE],
- [pi, MPI.DOUBLE],
- op=MPI.SUM, root=0)
-
- comm.Barrier()
-
- sleep(0.01)
diff --git a/demo/vampirtrace/makefile b/demo/vampirtrace/makefile
deleted file mode 100644
index fcafdb5..0000000
--- a/demo/vampirtrace/makefile
+++ /dev/null
@@ -1,37 +0,0 @@
-MPIEXEC = mpiexec
-PYTHON = python
-N = 8
-
-.PHONY: default
-default: build test clean
-
-
-.PHONY: run-cpilog run-ring run-threads run
-run: run-cpilog run-ring run-threads
-run-cpilog:
- ${MPIEXEC} -n ${N} ${PYTHON} cpilog.py
-run-ring:
- ${MPIEXEC} -n ${N} ${PYTHON} ring.py
-run-threads:
- ${MPIEXEC} -n ${N} ${PYTHON} threads.py
-
-.PHONY: view-cpilog view-ring view-threads view
-view: view-cpilog view-ring view-threads
-view-cpilog: cpilog.otf
-view-ring: ring.otf
-view-threads: threads.otf
-
-cpilog.otf: run-cpilog
-ring.otf: run-ring
-threads.otf: run-threads
-
-.PHONY: build
-build:
-
-.PHONY: test
-test: run
-
-.PHONY: clean
-clean:
- ${RM} *.otf *.uctl *.*.def.z *.*.events.z *.*.marker.z
- ${RM} *.thumb *.*.def *.*.events
diff --git a/demo/vampirtrace/ring.py b/demo/vampirtrace/ring.py
deleted file mode 100644
index eee0ea1..0000000
--- a/demo/vampirtrace/ring.py
+++ /dev/null
@@ -1,37 +0,0 @@
-#!/usr/bin/env python
-
-# If you want VampirTrace to log MPI calls, you have to add the two
-# lines below at the very beginning of your main bootstrap script.
-import mpi4pyve
-mpi4pyve.rc.threads = False
-mpi4pyve.profile('vt-mpi', logfile='ring')
-
-from mpi4pyve import MPI
-
-comm = MPI.COMM_WORLD
-size = comm.Get_size()
-rank = comm.Get_rank()
-
-src = rank-1
-dest = rank+1
-if rank == 0:
- src = size-1
-if rank == size-1:
- dest = 0
-
-try:
- from numpy import zeros
- a1 = zeros(1000000, 'd')
- a2 = zeros(1000000, 'd')
-except ImportError:
- from array import array
- a1 = array('d', [0]*1000); a1 *= 1000
- a2 = array('d', [0]*1000); a2 *= 1000
-
-comm.Sendrecv(sendbuf=a1, recvbuf=a2,
- source=src, dest=dest)
-
-MPI.Request.Waitall([
- comm.Isend(a1, dest=dest),
- comm.Irecv(a2, source=src),
- ])
diff --git a/demo/vampirtrace/threads.py b/demo/vampirtrace/threads.py
deleted file mode 100644
index d2f8d32..0000000
--- a/demo/vampirtrace/threads.py
+++ /dev/null
@@ -1,37 +0,0 @@
-#!/usr/bin/env python
-
-import mpi4pyve
-mpi4pyve.rc.threads = True
-mpi4pyve.rc.thread_level = "funneled"
-mpi4pyve.profile('vt-hyb', logfile='threads')
-
-from mpi4pyve import MPI
-from threading import Thread
-
-MPI.COMM_WORLD.Barrier()
-
-# Understanding the Python GIL
-# David Beazley, http://www.dabeaz.com
-# PyCon 2010, Atlanta, Georgia
-# http://www.dabeaz.com/python/UnderstandingGIL.pdf
-
-# Consider this trivial CPU-bound function
-def countdown(n):
- while n > 0:
- n -= 1
-
-# Run it once with a lot of work
-COUNT = 10000000 # 10 millon
-tic = MPI.Wtime()
-countdown(COUNT)
-toc = MPI.Wtime()
-print ("sequential: %f seconds" % (toc-tic))
-
-# Now, subdivide the work across two threads
-t1 = Thread(target=countdown, args=(COUNT//2,))
-t2 = Thread(target=countdown, args=(COUNT//2,))
-tic = MPI.Wtime()
-for t in (t1, t2): t.start()
-for t in (t1, t2): t.join()
-toc = MPI.Wtime()
-print ("threaded: %f seconds" % (toc-tic))
diff --git a/demo/wrap-boost/helloworld.cxx b/demo/wrap-boost/helloworld.cxx
deleted file mode 100644
index f4598d4..0000000
--- a/demo/wrap-boost/helloworld.cxx
+++ /dev/null
@@ -1,49 +0,0 @@
-#include <mpi.h>
-#include <iostream>
-
-static void sayhello(MPI_Comm comm)
-{
- if (comm == MPI_COMM_NULL) {
- std::cout << "You passed MPI_COMM_NULL !!!" << std::endl;
- return;
- }
- int size;
- MPI_Comm_size(comm, &size);
- int rank;
- MPI_Comm_rank(comm, &rank);
- int plen; char pname[MPI_MAX_PROCESSOR_NAME];
- MPI_Get_processor_name(pname, &plen);
- std::cout <<
- "Hello, World! " <<
- "I am process " << rank <<
- " of " << size <<
- " on " << pname <<
- "." << std::endl;
-}
-
-
-#include <boost/python.hpp>
-#include <mpi4pyve/mpi4pyve.h>
-using namespace boost::python;
-
-static void hw_sayhello(object py_comm)
-{
- PyObject* py_obj = py_comm.ptr();
- MPI_Comm *comm_p = PyMPIComm_Get(py_obj);
- if (comm_p == NULL) throw_error_already_set();
- sayhello(*comm_p);
-}
-
-BOOST_PYTHON_MODULE(helloworld)
-{
- if (import_mpi4pyve() < 0) return; /* Python 2.X */
-
- def("sayhello", hw_sayhello);
-}
-
-
-/*
- * Local Variables:
- * mode: C++
- * End:
- */
diff --git a/demo/wrap-boost/makefile b/demo/wrap-boost/makefile
deleted file mode 100644
index 95ad41f..0000000
--- a/demo/wrap-boost/makefile
+++ /dev/null
@@ -1,31 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-PYTHON_CONFIG = ${PYTHON} ../python-config
-MPI4PY_INCLUDE = ${shell ${PYTHON} -c 'import mpi4pyve; print( mpi4pyve.get_include() )'}
-BOOST_INCS =
-BOOST_LIBS = -lboost_python
-
-
-MPICXX = mpicxx
-CXXFLAGS = -fPIC ${shell ${PYTHON_CONFIG} --includes} ${BOOST_INCS}
-LDFLAGS = -shared ${shell ${PYTHON_CONFIG} --libs} ${BOOST_LIBS}
-SO = ${shell ${PYTHON_CONFIG} --extension-suffix}
-.PHONY: build
-build: helloworld${SO}
-helloworld${SO}: helloworld.cxx
- ${MPICXX} ${CXXFLAGS} -I${MPI4PY_INCLUDE} -o $@ $< ${LDFLAGS}
-
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 5
-.PHONY: test
-test: build
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test.py
-
-
-.PHONY: clean
-clean:
- ${RM} helloworld${SO}
diff --git a/demo/wrap-boost/test.py b/demo/wrap-boost/test.py
deleted file mode 100644
index bf1ab30..0000000
--- a/demo/wrap-boost/test.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from mpi4pyve import MPI
-import helloworld as hw
-
-null = MPI.COMM_NULL
-hw.sayhello(null)
-
-comm = MPI.COMM_WORLD
-hw.sayhello(comm)
-
-try:
- hw.sayhello(list())
-except:
- pass
-else:
- assert 0, "exception not raised"
diff --git a/demo/wrap-c/helloworld.c b/demo/wrap-c/helloworld.c
deleted file mode 100644
index f3ac3e6..0000000
--- a/demo/wrap-c/helloworld.c
+++ /dev/null
@@ -1,112 +0,0 @@
-#define MPICH_SKIP_MPICXX 1
-#define OMPI_SKIP_MPICXX 1
-#include <mpi.h>
-#include <mpi4pyve/mpi4pyve.h>
-
-/* -------------------------------------------------------------------------- */
-
-static void
-sayhello(MPI_Comm comm) {
- int size, rank;
- char pname[MPI_MAX_PROCESSOR_NAME]; int len;
- if (comm == MPI_COMM_NULL) {
- printf("You passed MPI_COMM_NULL !!!\n");
- return;
- }
- MPI_Comm_size(comm, &size);
- MPI_Comm_rank(comm, &rank);
- MPI_Get_processor_name(pname, &len);
- pname[len] = 0;
- printf("Hello, World! I am process %d of %d on %s.\n",
- rank, size, pname);
-}
-
-/* -------------------------------------------------------------------------- */
-
-static PyObject *
-hw_sayhello(PyObject *self, PyObject *args)
-{
- PyObject *py_comm = NULL;
- MPI_Comm *comm_p = NULL;
-
- if (!PyArg_ParseTuple(args, "O:sayhello", &py_comm))
- return NULL;
-
- comm_p = PyMPIComm_Get(py_comm);
- if (comm_p == NULL)
- return NULL;
-
- sayhello(*comm_p);
-
- Py_INCREF(Py_None);
- return Py_None;
-}
-
-static struct PyMethodDef hw_methods[] = {
- {"sayhello", (PyCFunction)hw_sayhello, METH_VARARGS, NULL},
- {NULL, NULL, 0, NULL} /* sentinel */
-};
-
-#if PY_MAJOR_VERSION < 3
-/* --- Python 2 --- */
-
-PyMODINIT_FUNC inithelloworld(void)
-{
- PyObject *m = NULL;
-
- /* Initialize mpi4pyve C-API */
- if (import_mpi4pyve() < 0) goto bad;
-
- /* Module initialization */
- m = Py_InitModule("helloworld", hw_methods);
- if (m == NULL) goto bad;
-
- return;
-
- bad:
- return;
-}
-
-#else
-/* --- Python 3 --- */
-
-static struct PyModuleDef hw_module = {
- PyModuleDef_HEAD_INIT,
- "helloworld", /* m_name */
- NULL, /* m_doc */
- -1, /* m_size */
- hw_methods /* m_methods */,
- NULL, /* m_reload */
- NULL, /* m_traverse */
- NULL, /* m_clear */
- NULL /* m_free */
-};
-
-PyMODINIT_FUNC
-PyInit_helloworld(void)
-{
- PyObject *m = NULL;
-
- /* Initialize mpi4pyve's C-API */
- if (import_mpi4pyve() < 0) goto bad;
-
- /* Module initialization */
- m = PyModule_Create(&hw_module);
- if (m == NULL) goto bad;
-
- return m;
-
- bad:
- return NULL;
-}
-
-#endif
-
-/* -------------------------------------------------------------------------- */
-
-/*
- Local variables:
- c-basic-offset: 2
- indent-tabs-mode: nil
- End:
-*/
diff --git a/demo/wrap-c/makefile b/demo/wrap-c/makefile
deleted file mode 100644
index 32adff2..0000000
--- a/demo/wrap-c/makefile
+++ /dev/null
@@ -1,29 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-PYTHON_CONFIG = ${PYTHON} ../python-config
-MPI4PY_INCLUDE = ${shell ${PYTHON} -c 'import mpi4pyve; print( mpi4pyve.get_include() )'}
-
-
-MPICC = mpicc
-CFLAGS = -fPIC ${shell ${PYTHON_CONFIG} --includes}
-LDFLAGS = -shared ${shell ${PYTHON_CONFIG} --libs}
-SO = ${shell ${PYTHON_CONFIG} --extension-suffix}
-.PHONY: build
-build: helloworld${SO}
-helloworld${SO}: helloworld.c
- ${MPICC} ${CFLAGS} -I${MPI4PY_INCLUDE} -o $@ $< ${LDFLAGS}
-
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 5
-.PHONY: test
-test: build
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test.py
-
-
-.PHONY: clean
-clean:
- ${RM} helloworld${SO}
diff --git a/demo/wrap-c/test.py b/demo/wrap-c/test.py
deleted file mode 100644
index bf1ab30..0000000
--- a/demo/wrap-c/test.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from mpi4pyve import MPI
-import helloworld as hw
-
-null = MPI.COMM_NULL
-hw.sayhello(null)
-
-comm = MPI.COMM_WORLD
-hw.sayhello(comm)
-
-try:
- hw.sayhello(list())
-except:
- pass
-else:
- assert 0, "exception not raised"
diff --git a/demo/wrap-cffi/helloworld.c b/demo/wrap-cffi/helloworld.c
deleted file mode 100644
index ba121d4..0000000
--- a/demo/wrap-cffi/helloworld.c
+++ /dev/null
@@ -1,27 +0,0 @@
-#define MPICH_SKIP_MPICXX 1
-#define OMPI_SKIP_MPICXX 1
-#include <mpi.h>
-#include <stdio.h>
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-extern void sayhello(MPI_Comm);
-#ifdef __cplusplus
-}
-#endif
-
-void sayhello(MPI_Comm comm) {
- int size, rank;
- char pname[MPI_MAX_PROCESSOR_NAME]; int len;
- if (comm == MPI_COMM_NULL) {
- printf("You passed MPI_COMM_NULL !!!\n");
- return;
- }
- MPI_Comm_size(comm, &size);
- MPI_Comm_rank(comm, &rank);
- MPI_Get_processor_name(pname, &len);
- pname[len] = 0;
- printf("Hello, World! I am process %d of %d on %s.\n",
- rank, size, pname);
-}
diff --git a/demo/wrap-cffi/helloworld.py b/demo/wrap-cffi/helloworld.py
deleted file mode 100644
index 267fccc..0000000
--- a/demo/wrap-cffi/helloworld.py
+++ /dev/null
@@ -1,21 +0,0 @@
-from mpi4pyve import MPI
-import cffi
-import os
-
-_libdir = os.path.dirname(__file__)
-
-ffi = cffi.FFI()
-if MPI._sizeof(MPI.Comm) == ffi.sizeof('int'):
- _mpi_comm_t = 'int'
-else:
- _mpi_comm_t = 'void*'
-ffi.cdef("""
-typedef %(_mpi_comm_t)s MPI_Comm;
-void sayhello(MPI_Comm);
-""" % vars())
-lib = ffi.dlopen(os.path.join(_libdir, "libhelloworld.so"))
-
-def sayhello(comm):
- comm_ptr = MPI._addressof(comm)
- comm_val = ffi.cast('MPI_Comm*', comm_ptr)[0]
- lib.sayhello(comm_val)
diff --git a/demo/wrap-cffi/makefile b/demo/wrap-cffi/makefile
deleted file mode 100644
index e7eb200..0000000
--- a/demo/wrap-cffi/makefile
+++ /dev/null
@@ -1,24 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-
-MPICC = mpicc
-LIBNAME = libhelloworld.so
-.PHONY: build
-build: ${LIBNAME}
-${LIBNAME}: helloworld.c
- ${MPICC} -shared -o $@ $<
-
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 5
-.PHONY: test
-test: build
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test.py
-
-
-.PHONY: clean
-clean:
- ${RM} -r ${LIBNAME} *.pyc __pycache__
diff --git a/demo/wrap-cffi/test.py b/demo/wrap-cffi/test.py
deleted file mode 100644
index bf1ab30..0000000
--- a/demo/wrap-cffi/test.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from mpi4pyve import MPI
-import helloworld as hw
-
-null = MPI.COMM_NULL
-hw.sayhello(null)
-
-comm = MPI.COMM_WORLD
-hw.sayhello(comm)
-
-try:
- hw.sayhello(list())
-except:
- pass
-else:
- assert 0, "exception not raised"
diff --git a/demo/wrap-ctypes/helloworld.c b/demo/wrap-ctypes/helloworld.c
deleted file mode 100644
index ba121d4..0000000
--- a/demo/wrap-ctypes/helloworld.c
+++ /dev/null
@@ -1,27 +0,0 @@
-#define MPICH_SKIP_MPICXX 1
-#define OMPI_SKIP_MPICXX 1
-#include <mpi.h>
-#include <stdio.h>
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-extern void sayhello(MPI_Comm);
-#ifdef __cplusplus
-}
-#endif
-
-void sayhello(MPI_Comm comm) {
- int size, rank;
- char pname[MPI_MAX_PROCESSOR_NAME]; int len;
- if (comm == MPI_COMM_NULL) {
- printf("You passed MPI_COMM_NULL !!!\n");
- return;
- }
- MPI_Comm_size(comm, &size);
- MPI_Comm_rank(comm, &rank);
- MPI_Get_processor_name(pname, &len);
- pname[len] = 0;
- printf("Hello, World! I am process %d of %d on %s.\n",
- rank, size, pname);
-}
diff --git a/demo/wrap-ctypes/helloworld.py b/demo/wrap-ctypes/helloworld.py
deleted file mode 100644
index 275942a..0000000
--- a/demo/wrap-ctypes/helloworld.py
+++ /dev/null
@@ -1,18 +0,0 @@
-from mpi4pyve import MPI
-import ctypes
-import os
-
-_libdir = os.path.dirname(__file__)
-
-if MPI._sizeof(MPI.Comm) == ctypes.sizeof(ctypes.c_int):
- MPI_Comm = ctypes.c_int
-else:
- MPI_Comm = ctypes.c_void_p
-_lib = ctypes.CDLL(os.path.join(_libdir, "libhelloworld.so"))
-_lib.sayhello.restype = None
-_lib.sayhello.argtypes = [MPI_Comm]
-
-def sayhello(comm):
- comm_ptr = MPI._addressof(comm)
- comm_val = MPI_Comm.from_address(comm_ptr)
- _lib.sayhello(comm_val)
diff --git a/demo/wrap-ctypes/makefile b/demo/wrap-ctypes/makefile
deleted file mode 100644
index e7eb200..0000000
--- a/demo/wrap-ctypes/makefile
+++ /dev/null
@@ -1,24 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-
-MPICC = mpicc
-LIBNAME = libhelloworld.so
-.PHONY: build
-build: ${LIBNAME}
-${LIBNAME}: helloworld.c
- ${MPICC} -shared -o $@ $<
-
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 5
-.PHONY: test
-test: build
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test.py
-
-
-.PHONY: clean
-clean:
- ${RM} -r ${LIBNAME} *.pyc __pycache__
diff --git a/demo/wrap-ctypes/test.py b/demo/wrap-ctypes/test.py
deleted file mode 100644
index bf1ab30..0000000
--- a/demo/wrap-ctypes/test.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from mpi4pyve import MPI
-import helloworld as hw
-
-null = MPI.COMM_NULL
-hw.sayhello(null)
-
-comm = MPI.COMM_WORLD
-hw.sayhello(comm)
-
-try:
- hw.sayhello(list())
-except:
- pass
-else:
- assert 0, "exception not raised"
diff --git a/demo/wrap-cython/helloworld.pyx b/demo/wrap-cython/helloworld.pyx
deleted file mode 100644
index 4ec2671..0000000
--- a/demo/wrap-cython/helloworld.pyx
+++ /dev/null
@@ -1,23 +0,0 @@
-cdef extern from "mpi-compat.h": pass
-
-cimport mpi4pyve.MPI as MPI
-from mpi4pyve.libmpi cimport *
-
-cdef extern from "stdio.h":
- int printf(char*, ...)
-
-cdef void c_sayhello(MPI_Comm comm):
- cdef int size, rank, plen
- cdef char pname[MPI_MAX_PROCESSOR_NAME]
- if comm == MPI_COMM_NULL:
- printf(b"You passed MPI_COMM_NULL !!!%s", b"\n")
- return
- MPI_Comm_size(comm, &size)
- MPI_Comm_rank(comm, &rank)
- MPI_Get_processor_name(pname, &plen)
- printf(b"Hello, World! I am process %d of %d on %s.\n",
- rank, size, pname)
-
-def sayhello(MPI.Comm comm not None ):
- cdef MPI_Comm c_comm = comm.ob_mpi
- c_sayhello(c_comm)
diff --git a/demo/wrap-cython/makefile b/demo/wrap-cython/makefile
deleted file mode 100644
index b8c3ad1..0000000
--- a/demo/wrap-cython/makefile
+++ /dev/null
@@ -1,35 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-PYTHON_CONFIG = ${PYTHON} ../python-config
-
-
-CYTHON = cython
-.PHONY: src
-src: helloworld.c
-helloworld.c: helloworld.pyx
- ${CYTHON} $<
-
-
-MPICC = mpicc
-CFLAGS = -fPIC ${shell ${PYTHON_CONFIG} --includes}
-LDFLAGS = -shared ${shell ${PYTHON_CONFIG} --libs}
-SO = ${shell ${PYTHON_CONFIG} --extension-suffix}
-.PHONY: build
-build: helloworld${SO}
-helloworld${SO}: helloworld.c
- ${MPICC} ${CFLAGS} -o $@ $< ${LDFLAGS}
-
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 5
-.PHONY: test
-test: build
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test.py
-
-
-.PHONY: clean
-clean:
- ${RM} helloworld.c helloworld${SO}
diff --git a/demo/wrap-cython/mpi-compat.h b/demo/wrap-cython/mpi-compat.h
deleted file mode 100644
index 367c58a..0000000
--- a/demo/wrap-cython/mpi-compat.h
+++ /dev/null
@@ -1,14 +0,0 @@
-/* Author: Lisandro Dalcin */
-/* Contact: dalcinl@gmail.com */
-
-#ifndef MPI_COMPAT_H
-#define MPI_COMPAT_H
-
-#include <mpi.h>
-
-#if (MPI_VERSION < 3) && !defined(PyMPI_HAVE_MPI_Message)
-typedef void *PyMPI_MPI_Message;
-#define MPI_Message PyMPI_MPI_Message
-#endif
-
-#endif/*MPI_COMPAT_H*/
diff --git a/demo/wrap-cython/test.py b/demo/wrap-cython/test.py
deleted file mode 100644
index 9d93d76..0000000
--- a/demo/wrap-cython/test.py
+++ /dev/null
@@ -1,22 +0,0 @@
-from mpi4pyve import MPI
-import helloworld as hw
-
-null = MPI.COMM_NULL
-hw.sayhello(null)
-
-comm = MPI.COMM_WORLD
-hw.sayhello(comm)
-
-try:
- hw.sayhello(None)
-except:
- pass
-else:
- assert 0, "exception not raised"
-
-try:
- hw.sayhello(list())
-except:
- pass
-else:
- assert 0, "exception not raised"
diff --git a/demo/wrap-f2py/helloworld.f90 b/demo/wrap-f2py/helloworld.f90
deleted file mode 100644
index 51661c4..0000000
--- a/demo/wrap-f2py/helloworld.f90
+++ /dev/null
@@ -1,31 +0,0 @@
-!
-! $ f2py --f90exec=mpif90 -m helloworld -c helloworld.f90
-!
-
-subroutine sayhello(comm)
- use mpi
- implicit none
- integer :: comm
- integer :: rank, size, nlen, ierr
- character (len=MPI_MAX_PROCESSOR_NAME) :: pname
- if (comm == MPI_COMM_NULL) then
- print *, 'You passed MPI_COMM_NULL !!!'
- return
- end if
- call MPI_Comm_rank(comm, rank, ierr)
- call MPI_Comm_size(comm, size, ierr)
- call MPI_Get_processor_name(pname, nlen, ierr)
- print *, 'Hello, World!', &
- ' I am process ', rank, &
- ' of ', size, &
- ' on ', pname(1:nlen), '.'
-end subroutine sayhello
-
-! program main
-! use mpi
-! implicit none
-! integer ierr
-! call MPI_Init(ierr)
-! call sayhello(MPI_COMM_WORLD)
-! call MPI_Finalize(ierr)
-! end program main
diff --git a/demo/wrap-f2py/makefile b/demo/wrap-f2py/makefile
deleted file mode 100644
index 5de785a..0000000
--- a/demo/wrap-f2py/makefile
+++ /dev/null
@@ -1,30 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-PYTHON_CONFIG = ${PYTHON} ../python-config
-
-
-SO = ${shell ${PYTHON_CONFIG} --extension-suffix}
-MPIF90 = mpif90
-F2PY = f2py
-ifneq (${MPI_FORTRAN_MOD_DIR},)
-F2PYFLAGS += --f90flags=-I${MPI_FORTRAN_MOD_DIR}
-endif
-.PHONY: build
-build: helloworld${SO}
-helloworld${SO}: helloworld.f90
- ${F2PY} ${F2PYFLAGS} --f90exec=${MPIF90} -m helloworld -c $<
-
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 5
-.PHONY: test
-test: build
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test.py
-
-
-.PHONY: clean
-clean:
- ${RM} helloworld${SO}
diff --git a/demo/wrap-f2py/test.py b/demo/wrap-f2py/test.py
deleted file mode 100644
index 304c1c1..0000000
--- a/demo/wrap-f2py/test.py
+++ /dev/null
@@ -1,17 +0,0 @@
-from mpi4pyve import MPI
-import helloworld as hw
-
-null = MPI.COMM_NULL
-fnull = null.py2f()
-hw.sayhello(fnull)
-
-comm = MPI.COMM_WORLD
-fcomm = comm.py2f()
-hw.sayhello(fcomm)
-
-try:
- hw.sayhello(list())
-except:
- pass
-else:
- assert 0, "exception not raised"
diff --git a/demo/wrap-swig/helloworld.i b/demo/wrap-swig/helloworld.i
deleted file mode 100644
index 30c64e6..0000000
--- a/demo/wrap-swig/helloworld.i
+++ /dev/null
@@ -1,38 +0,0 @@
-%module helloworld
-
-%{
-
-#define MPICH_SKIP_MPICXX 1
-#define OMPI_SKIP_MPICXX 1
-
-#include <mpi.h>
-#include <stdio.h>
-
-void sayhello(MPI_Comm comm) {
- int size, rank;
- char pname[MPI_MAX_PROCESSOR_NAME]; int len;
- if (comm == MPI_COMM_NULL) {
- printf("You passed MPI_COMM_NULL !!!\n");
- return;
- }
- MPI_Comm_size(comm, &size);
- MPI_Comm_rank(comm, &rank);
- MPI_Get_processor_name(pname, &len);
- pname[len] = 0;
- printf("Hello, World! I am process %d of %d on %s.\n",
- rank, size, pname);
-}
-
-%}
-
-%include mpi4pyve/mpi4pyve.i
-
-%mpi4pyve_typemap(Comm, MPI_Comm);
-
-void sayhello(MPI_Comm comm);
-
-/*
- * Local Variables:
- * mode: C
- * End:
- */
diff --git a/demo/wrap-swig/makefile b/demo/wrap-swig/makefile
deleted file mode 100644
index 70fd805..0000000
--- a/demo/wrap-swig/makefile
+++ /dev/null
@@ -1,36 +0,0 @@
-.PHONY: default
-default: build test clean
-
-PYTHON = python
-PYTHON_CONFIG = ${PYTHON} ../python-config
-MPI4PY_INCLUDE = ${shell ${PYTHON} -c 'import mpi4pyve; print( mpi4pyve.get_include() )'}
-
-
-SWIG = swig
-SWIG_PY = ${SWIG} -python
-.PHONY: src
-src: helloworld_wrap.c
-helloworld_wrap.c: helloworld.i
- ${SWIG_PY} -I${MPI4PY_INCLUDE} -o $@ $<
-
-MPICC = mpicc
-CFLAGS = -fPIC ${shell ${PYTHON_CONFIG} --includes}
-LDFLAGS = -shared ${shell ${PYTHON_CONFIG} --libs}
-SO = ${shell ${PYTHON_CONFIG} --extension-suffix}
-.PHONY: build
-build: _helloworld${SO}
-_helloworld${SO}: helloworld_wrap.c
- ${MPICC} ${CFLAGS} -I${MPI4PY_INCLUDE} -o $@ $< ${LDFLAGS}
-
-
-MPIEXEC = mpiexec
-NP_FLAG = -n
-NP = 5
-.PHONY: test
-test: build
- ${MPIEXEC} ${NP_FLAG} ${NP} ${PYTHON} test.py
-
-
-.PHONY: clean
-clean:
- ${RM} helloworld_wrap.c helloworld.py* _helloworld${SO}
diff --git a/demo/wrap-swig/test.py b/demo/wrap-swig/test.py
deleted file mode 100644
index bf1ab30..0000000
--- a/demo/wrap-swig/test.py
+++ /dev/null
@@ -1,15 +0,0 @@
-from mpi4pyve import MPI
-import helloworld as hw
-
-null = MPI.COMM_NULL
-hw.sayhello(null)
-
-comm = MPI.COMM_WORLD
-hw.sayhello(comm)
-
-try:
- hw.sayhello(list())
-except:
- pass
-else:
- assert 0, "exception not raised"
diff --git a/docs/index.rst b/docs/index.rst
index cab5270..bdd3741 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -22,7 +22,7 @@ This package supports:
interface* (NLCPy arrays, builtin bytes/string/array objects)
+ point-to-point (blocking/nonblocking/persistent send & receive)
- + collective (broadcast, block/vector scatter & gather, reductions)
+ + collective (broadcast, block scatter & gather, reductions, vector scatter & gather)
* Process groups and communication domains
@@ -35,14 +35,14 @@ This package supports:
+ blocking/nonblocking & collective/noncollective
+ individual/shared file pointers & explicit offset
-This package has NOT supported the following functions yet:
-
* One-sided operations
+ remote memory access (put, get, accumulate)
+ active target synchronization (start/complete & post/wait)
+ passive target synchronization (lock & unlock)
+This package does NOT support the following functions yet:
+
* Dynamic process management
+ spawn & spawn multiple
@@ -50,8 +50,8 @@ This package has NOT supported the following functions yet:
+ name publishing & lookup
-List of Supprted Functions
---------------------------
+List of Supported Functions
+---------------------------
A list of supported functions is shown below.
@@ -67,8 +67,6 @@ Alltoall All to All Scatter/Gather, send data from all to all proce
Alltoallv All to All Scatter/Gather Vector, send data from all to all processes in a group providing different amount of data and displacements.
Alltoallw Generalized All-to-All communication allowing different counts, displacements and datatypes for each partner.
Bcast Broadcast a message from one process to all other processes in a group.
-Bsend Blocking send in buffered mode.
-Bsend_init Persistent request for a send in buffered mode.
Gather Gather together values from a group of processes.
Gatherv Gather Vector, gather data to one process from all other processes in a group providing different amount of data and displacements at the receiving sides.
Iallgather Nonblocking Gather to All.
@@ -78,7 +76,6 @@ Ialltoall Nonblocking All to All Scatter/Gather.
Ialltoallv Nonblocking All to All Scatter/Gather Vector.
Ialltoallw Nonblocking Generalized All-to-All.
Ibcast Nonblocking Broadcast.
-Ibsend Nonblocking send in buffered mode.
Igather Nonblocking Gather.
Igatherv Nonblocking Gather Vector.
Irecv Nonblocking receive.
@@ -109,9 +106,7 @@ allgather Gather to All.
allreduce Reduce to All.
alltoall All to All Scatter/Gather.
bcast Broadcast.
-bsend Send in buffered mode.
gather Gather.
-ibsend Nonblocking send in buffered mode.
irecv Nonblocking receive.
isend Nonblocking send.
issend Nonblocking send in synchronous mode.
@@ -155,14 +150,6 @@ neighbor_allgather Neighbor Gather to All.
neighbor_alltoall Neighbor All to All Scatter/Gather.
===================== ===============================================================================================================================================================
-* MPI (Miscellanea)
-
-===================== ===============================================================================================================================================================
-Name Summary
-===================== ===============================================================================================================================================================
-Attach_buffer Attach a user-provided buffer for sending in buffered mode.
-===================== ===============================================================================================================================================================
-
* MPI.Request Class (Request handle)
===================== ===============================================================================================================================================================
@@ -188,23 +175,13 @@ irecv Nonblocking receive of matched message.
recv Blocking receive of matched message.
===================== ===============================================================================================================================================================
-* MPI.Op Class (Ancillay / Operation object)
-
-===================== ===============================================================================================================================================================
-Name Summary
-===================== ===============================================================================================================================================================
-Reduce_local Apply a reduction operator to local data.
-===================== ===============================================================================================================================================================
-
* MPI.Datatype Class (Ancillary / Datatype object)
===================== ===============================================================================================================================================================
Name Summary
===================== ===============================================================================================================================================================
Pack Pack into contiguous memory according to datatype.
-Pack_external Pack into contiguous memory according to datatype, using a portable data representation (external32).
Unpack Unpack from contiguous memory according to datatype.
-Unpack_external Unpack from contiguous memory according to datatype, using a portable data representation (external32).
===================== ===============================================================================================================================================================
* MPI.File Class (Parallel input/output)
@@ -248,8 +225,52 @@ Write_ordered_end Complete a split collective write using shared file pointe
Write_shared Write using shared file pointer.
===================== ===============================================================================================================================================================
-List of Unsupprted Functions
-----------------------------
+* MPI.Win Class (One-sided operations)
+
+===================== ===============================================================================================================================================================
+Name Summary
+===================== ===============================================================================================================================================================
+Accumulate Accumulate data into the target process.
+Compare_and_swap Perform one-sided atomic compare-and-swap.
+Fetch_and_op Perform one-sided read-modify-write.
+Get Get data from a memory window on a remote process.
+Get_accumulate Fetch-and-accumulate data into the target process.
+Put Put data into a memory window on a remote process.
+Raccumulate Fetch-and-accumulate data into the target process.
+Rget Get data from a memory window on a remote process.
+Rget_accumulate Accumulate data into the target process using remote memory access.
+Rput Put data into a memory window on a remote process.
+===================== ===============================================================================================================================================================
+
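+A minimal sketch of one-sided communication with NLCPy arrays, assuming at least two processes and
+that ``MPI.Win.Create`` accepts an NLCPy buffer in the same way as the other buffer-like objects
+above:
+
+::
+
+    from mpi4pyve import MPI
+    import nlcpy as vp
+
+    comm = MPI.COMM_WORLD
+    rank = comm.Get_rank()
+
+    buf = vp.zeros(3, dtype=int)               # window memory exposed by each process
+    win = MPI.Win.Create(buf, comm=comm)
+    win.Fence()
+    if rank == 0:
+        x = vp.array([1, 2, 3], dtype=int)
+        win.Put([x, MPI.LONG], target_rank=1)  # write into the window of rank 1
+    win.Fence()
+    win.Free()
+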
+List of mpi4py-ve Original Functions
+------------------------------------
+
+* veo (VE Offloading operations)
+
++------------------------------+-----------------------------------------------------------------------------------------------------+
+| Name                         | Summary                                                                                             |
++==============================+=====================================================================================================+
+| alloc_hmem(proc_handle,size) | Allocate a VE memory buffer or a VH memory buffer that users can use as heterogeneous memory.      |
+|                              |                                                                                                     |
+|                              | Parameters:                                                                                         |
+|                              |   proc_handle: pointer                                                                             |
+|                              |     VEO process handle                                                                             |
+|                              |   size: int                                                                                         |
+|                              |     size in bytes                                                                                   |
+|                              |                                                                                                     |
+|                              | Returns:                                                                                            |
+|                              |   addr: int                                                                                         |
+|                              |     VEMVA address with the identifier                                                              |
++------------------------------+-----------------------------------------------------------------------------------------------------+
+| free_hmem(addr)              | Free a VE memory buffer.                                                                            |
+|                              |                                                                                                     |
+|                              | Parameters:                                                                                         |
+|                              |   addr: int                                                                                         |
+|                              |     VEMVA address                                                                                   |
++------------------------------+-----------------------------------------------------------------------------------------------------+
+
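+A minimal usage sketch of these functions; ``proc_handle`` is assumed to be a VEO process handle
+obtained elsewhere (e.g. through ctypes, as in the VE Array Interface example):
+
+::
+
+    from mpi4pyve import veo
+
+    # allocate 1 KiB of heterogeneous memory (usable from both VE and VH)
+    addr = veo.alloc_hmem(proc_handle, 1024)
+    # ... expose addr through an object's __ve_array_interface__ ...
+    veo.free_hmem(addr)  # release the buffer
+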
+List of Unsupported Functions
+-----------------------------
The current version of *mpi4py-ve* does not support the following functions. Please note that ``NotImplementedError`` is raised if your Python script calls them.
@@ -258,6 +279,11 @@ The current version of *mpi4py-ve* does not support the following functions. Ple
===================== ===============================================================================================================================================================
Name Summary
===================== ===============================================================================================================================================================
+Bsend Blocking send in buffered mode.
+Bsend_init Persistent request for a send in buffered mode.
+Ibsend Nonblocking send in buffered mode.
+bsend Send in buffered mode.
+ibsend Nonblocking send in buffered mode.
Accept Accept a request to form a new intercommunicator.
Connect Make a request to form a new intercommunicator.
Close_port Close a port.
@@ -268,21 +294,39 @@ Publish_name Publish a service name.
Unpublish_name Unpublish a service name.
===================== ===============================================================================================================================================================
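+For example, calling any of the functions above raises the error (a minimal sketch using the
+lowercase ``bsend``):
+
+::
+
+    from mpi4pyve import MPI
+
+    comm = MPI.COMM_WORLD
+    try:
+        comm.bsend({'value': 1}, dest=0)  # buffered-mode send is not supported
+    except NotImplementedError:
+        pass  # expected with the current version
+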
-* MPI.Win Class (One-sided operations)
+* MPI (Miscellanea)
===================== ===============================================================================================================================================================
Name Summary
===================== ===============================================================================================================================================================
-Accumulate Accumulate data into the target process.
-Compare_and_swap Perform one-sided atomic compare-and-swap.
-Fetch_and_op Perform one-sided read-modify-write.
-Get Get data from a memory window on a remote process.
-Get_accumulate Fetch-and-accumulate data into the target process.
-Put Put data into a memory window on a remote process.
-Raccumulate Fetch-and-accumulate data into the target process.
-Rget Get data from a memory window on a remote process.
-Rget_accumulate Accumulate data into the target process using remote memory access.
-Rput Put data into a memory window on a remote process.
+Attach_buffer Attach a user-provided buffer for sending in buffered mode.
+===================== ===============================================================================================================================================================
+
+* MPI.Op Class (Ancillary / Operation object)
+
+===================== ===============================================================================================================================================================
+Name Summary
+===================== ===============================================================================================================================================================
+Reduce_local Apply a reduction operator to local data.
+===================== ===============================================================================================================================================================
+
+* MPI.Datatype Class (Ancillary / Datatype object)
+
+===================== ===============================================================================================================================================================
+Name Summary
+===================== ===============================================================================================================================================================
+Pack_external Pack into contiguous memory according to datatype, using a portable data representation (external32).
+Unpack_external Unpack from contiguous memory according to datatype, using a portable data representation (external32).
+===================== ===============================================================================================================================================================
+
+* mpi4pyve.futures package ( MPIPoolExecutor / MPICommExecutor)
+
+===================== ===============================================================================================================================================================
+Name Summary
+===================== ===============================================================================================================================================================
+MPIPoolExecutor The MPIPoolExecutor class uses a pool of MPI processes to execute calls asynchronously.
+MPICommExecutor Context manager for MPIPoolExecutor.
+ This context manager splits an MPI (intra)communicator *comm* into two disjoint sets: a single master process and the remaining worker processes.
===================== ===============================================================================================================================================================
Exception Handling
@@ -306,7 +350,7 @@ Assume this code is stored in a standard Python script file and run with mpirun
::
- $ mpirun -vh -np 2 $(which python) ZeroDivisionError.py
+ $ mpirun -veo -np 2 $(which python) ZeroDivisionError.py
Process 0 raises a **ZeroDivisionError** exception before performing a send call to process 1. As the exception is not handled, the Python interpreter running in process 0 will proceed to exit with a non-zero status. However, as *mpi4py-ve* installed a finalizer hook to call *MPI_Finalize()* before exit, process 0 will block waiting for other processes to also enter the *MPI_Finalize()* call. Meanwhile, process 1 will block waiting for a message to arrive from process 0, thus never reaching *MPI_Finalize()*. The whole MPI execution environment is irremediably in a deadlock state.
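+A minimal sketch of such a script, mirroring the corresponding example in the *mpi4py*
+documentation (the actual script may differ):
+
+::
+
+    from mpi4pyve import MPI
+
+    assert MPI.COMM_WORLD.Get_size() > 1
+    rank = MPI.COMM_WORLD.Get_rank()
+    if rank == 0:
+        1 / 0                                    # raises ZeroDivisionError before the send
+        MPI.COMM_WORLD.send(None, dest=1, tag=42)
+    elif rank == 1:
+        MPI.COMM_WORLD.recv(source=0, tag=42)    # blocks forever: the message never arrives
+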
@@ -314,7 +358,7 @@ To alleviate this issue, *mpi4py-ve* offers a simple, alternative command line e
::
- $ mpirun -vh -np 2 $(which python) -m mpi4pyve ZeroDivisionError.py
+ $ mpirun -veo -np 2 $(which python) -m mpi4pyve ZeroDivisionError.py
This mimics the option **-m mpi4py** described in the `mpi4py manual (mpi4py.run) `_.
diff --git a/docs/vai_spec_example.rst b/docs/vai_spec_example.rst
new file mode 100644
index 0000000..9f567b4
--- /dev/null
+++ b/docs/vai_spec_example.rst
@@ -0,0 +1,231 @@
+###################################################
+Use mpi4py-ve with homebrew classes (without NLCPy)
+###################################################
+
+*mpi4py-ve* allows objects with the *__ve_array_interface__* attribute to be specified as arguments to the communication API.
+
+******************************
+VE Array Interface (Version 1)
+******************************
+The *VE Array Interface* (or VAI) is created for interoperability between different implementations
+of VE array-like objects in various projects. The idea is borrowed from the `NumPy array interface `_
+and `CUDA Array Interface `_.
+
+------------------------------
+Python Interface Specification
+------------------------------
+
+ Note
+
+ Experimental feature. Specification may change.
+
+The ``__ve_array_interface__`` attribute returns a dictionary ( ``dict`` ) that must contain the
+following entries:
+
+* **shape**: ``(integer, ...)``
+ A tuple of ``int`` (or ``long`` ) representing the size of each dimension.
+
+* **typestr**: ``str``
+ The type string. This has the same definition as ``typestr`` in the `numpy array interface `_.
+
+* **data**: ``(integer, boolean)``
+ The data is a 2-tuple. The first element is the data pointer to VEO HMEM (Heterogeneous
+ Memory) as a Python ``int`` (or ``long`` ). For zero-size arrays, use ``0`` here. The second element
+ is the read-only flag as a Python ``bool`` .
+
+* **version**: ``integer``
+ An integer for the version of the interface being exported. The current version is *1*.
+
+The following entries are optional:
+
+* **strides**: ``None`` or ``(integer, ...)``
+ If **strides** is not given, or it is ``None`` , the array is in C-contiguous layout. Otherwise, a tuple
+ of ``int`` (or ``long`` ) is explicitly given for representing the number of bytes to skip to access
+ the next element at each dimension.
+
+* **descr**:
+ This is for describing more complicated types. This follows the same specification as in
+ the `numpy array interface `_.
+
+* **mask**: ``None`` or object exposing the ``__ve_array_interface__``
+ If ``None`` then all values in **data** are valid. All elements of the mask array should be
+ interpreted only as true or not true indicating which elements of this array are valid. This
+ has the same definition as ``mask`` in the `numpy array interface `_.
+
+ Note
+
+ mpi4py-ve does not currently support working with masked VE arrays and will raise
+ an exception if one is passed to a function.
+
+* **veo_ctxt**: ``None`` or ``integer``
+ The pointer of ``veo_thr_ctxt`` as a Python ``int`` (or ``long``).
+
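+A minimal sketch of a conforming dictionary, as it might be returned for a one-dimensional,
+C-contiguous ``float64`` array of three elements (the HMEM address below is a hypothetical
+placeholder):
+
+.. code-block:: python
+
+    vai = {
+        'shape': (3,),                    # three elements in one dimension
+        'typestr': '<f8',                 # little-endian 64-bit float
+        'data': (0x600000000000, False),  # (VEO HMEM pointer, read-only flag)
+        'version': 1,
+        'strides': None,                  # C-contiguous layout
+    }
+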
+*************************************************
+Example code for mpi4py-ve using homebrew classes
+*************************************************
+
+-----------
+source code
+-----------
+* mpi_send_recv.py: Main script that communicates between objects that have the ``__ve_array_interface__`` attribute.
+
+.. code-block:: python
+
+ from mpi4pyve import MPI
+ from mpi4pyve import util
+ import numpy as np
+ import veo_Py_wrapper
+
+ comm = MPI.COMM_WORLD
+ size = comm.Get_size()
+ rank = comm.Get_rank()
+
+ veo = veo_Py_wrapper.Veo(rank) # create VE process
+ x = np.array([123, 456, 789], dtype=int) # create buffer on VH
+ x_hmem = veo.alloc_hmem(x.dtype, x.size) # create buffer on VE
+
+ if rank == 0:
+ x_hmem.set_value(x) # set value into VE buffer
+ comm.Send(x_hmem, dest=1)
+ comm.Recv(x_hmem, source=1)
+ elif rank == 1:
+ comm.Recv(x_hmem, source=0)
+ comm.Send(x_hmem, dest=0)
+ comm.Barrier()
+
+ res = np.all(x == x_hmem.get_value()) # result check
+ print('Result {} (rank={})'.format('Success' if res else 'Failed', rank))
+
+ del x_hmem
+
+* veo_Py_wrapper.py: Sub-script that calls veo C APIs and creates an object that has the ``__ve_array_interface__`` attribute.
+
+.. code-block:: python
+
+ from mpi4pyve import veo
+ import ctypes
+ import numpy as np
+ import atexit
+
+ _veo_proc_destroyed = False
+
+
+ class VEMem(object):
+ """
+ Object that controls VE memory
+ """
+
+ def __init__(self, libveo, proc, ctxt, dtype, nelem):
+ """
+ Allocate VE memory
+ """
+ self.libveo = libveo
+ self.proc = proc
+ self.ctxt = ctxt
+ self.dtype = dtype
+ self.nelem = nelem
+ self.nbytes = dtype.itemsize * nelem
+ self.hmem = veo.alloc_hmem(self.proc, self.nbytes)
+
+ def __del__(self):
+ """
+ Free VE memory
+ """
+ if not _veo_proc_destroyed:
+ veo.free_hmem(self.hmem)
+
+ def set_value(self, val):
+ """
+ Set value into VE memory
+ """
+ val = np.asarray(val, dtype=self.dtype)
+ if val.size != self.nelem:
+ raise ValueError
+ src = ctypes.c_void_p(val.ctypes.data)
+ dst = ctypes.c_void_p(self.hmem)
+ ret = self.libveo.veo_hmemcpy(dst, src, self.nbytes)
+ if ret:
+ raise RuntimeError("ret = %d" % ret)
+
+ def get_value(self):
+ """
+ Retrieve value from VE memory
+ """
+ vhbuf = np.empty(self.nelem, dtype=self.dtype)
+ dst = ctypes.c_void_p(vhbuf.ctypes.data)
+ src = ctypes.c_void_p(self.hmem)
+ ret = self.libveo.veo_hmemcpy(dst, src, self.nbytes)
+ if ret:
+ raise RuntimeError("ret = %d" % ret)
+ return vhbuf
+
+ @property
+ def __ve_array_interface__(self):
+ """
+ VE array interface for interoperating Python VE libraries.
+ """
+ return {
+ 'shape': (self.nelem,),
+ 'typestr': self.dtype.str,
+ 'version': 1,
+ 'strides': None,
+ 'data': (self.hmem, False)}
+
+ class Veo(object):
+
+ def __init__(self, venode, libpath='/opt/nec/ve/veos/lib64/libveo.so.1'):
+ # Load shared object
+ self.libveo = ctypes.cdll.LoadLibrary(libpath)
+
+ #
+ # Register argument types and return type for veo C APIs.
+ #
+ # veo_proc_create
+ self.libveo.veo_proc_create.argtypes = (ctypes.c_int32,)
+ self.libveo.veo_proc_create.restype = ctypes.c_uint64
+ # veo_context_open
+ self.libveo.veo_context_open.argtypes = (ctypes.c_void_p,)
+ self.libveo.veo_context_open.restype = ctypes.c_uint64
+ # veo_hmemcpy
+ self.libveo.veo_hmemcpy.argtypes = (ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t)
+ self.libveo.veo_hmemcpy.restype = ctypes.c_int32
+ # veo_proc_destroy
+ self.libveo.veo_proc_destroy.argtypes = (ctypes.c_void_p,)
+ self.libveo.veo_proc_destroy.restype = ctypes.c_int32
+ # veo_context_close
+ self.libveo.veo_context_close.argtypes = (ctypes.c_void_p,)
+ self.libveo.veo_context_close.restype = ctypes.c_int32
+
+ #
+ # Call veo C APIs for initialization.
+ #
+ self.proc = self.libveo.veo_proc_create(ctypes.c_int32(venode))
+ self.ctxt = self.libveo.veo_context_open(ctypes.c_void_p(self.proc))
+
+ def finalize(libveo, ctxt, proc):
+ # Close veo context and destroy veo process.
+ libveo.veo_context_close(ctypes.c_void_p(ctxt))
+ libveo.veo_proc_destroy(ctypes.c_void_p(proc))
+ global _veo_proc_destroyed
+ _veo_proc_destroyed = True
+
+ # Register function that calls at exit time.
+ atexit.register(finalize, self.libveo, self.ctxt, self.proc)
+
+ def alloc_hmem(self, dtype, nelem):
+ return VEMem(self.libveo, self.proc, self.ctxt, dtype, nelem)
+
+| The above example uses ctypes to call veo C APIs from a Python script, although there are other ways to call them,
+| e.g. Cython, pybind11, or the Python C API.
+
+---------
+Execution
+---------
+
+::
+
+ $ mpirun -veo -np 2 python mpi_send_recv.py
+ Result Success (rank=0)
+ Result Success (rank=1)
+
+
diff --git a/mpi.cfg b/mpi.cfg
index 9885167..455c882 100644
--- a/mpi.cfg
+++ b/mpi.cfg
@@ -188,5 +188,6 @@ include_dirs = misc/mpiuni:$PETSC_DIR/include:$PETSC_DIR/$PETSC_ARCH/include
mpi_dir = $NMPI_ROOT
mpicc = %(mpi_dir)s/bin64/mpincc
mpicxx = %(mpi_dir)s/bin64/mpinc++
-extra_link_args = -vh -L/opt/nec/ve/veos/lib64 -Wl,-rpath=/opt/nec/ve/veos/lib64 -lveo -shared-mpi
-extra_compile_args = -vh
+extra_link_args = -vh -L/opt/nec/ve/veos/lib64 -Wl,-rpath=/opt/nec/ve/veos/lib64 -lveo -shared-mpi -mpiprof
+extra_compile_args = -vh -mpiprof
+define_macros = MPI4PYVE_NEC_MPI=1
diff --git a/nlcpy_test/109_gather.py b/nlcpy_test/109_gather.py
index 7cc8df4..cca96db 100644
--- a/nlcpy_test/109_gather.py
+++ b/nlcpy_test/109_gather.py
@@ -19,9 +19,8 @@
print("x = ",x)
if rank == root:
- print("type(x[0]) = ",type(x[0]))
- print("type(x[1]) = ",type(x[1]))
- print("type(x[2]) = ",type(x[2]))
+ for i in range(size):
+ print("type(x[{}]) = ".format(i),type(x[i]))
import sys
try:
for y in x:
diff --git a/nlcpy_test/111_allgather.py b/nlcpy_test/111_allgather.py
index cbba581..742aefd 100644
--- a/nlcpy_test/111_allgather.py
+++ b/nlcpy_test/111_allgather.py
@@ -17,9 +17,8 @@
print("allgather done")
print("x = ",x)
-print("type(x[0]) = ",type(x[0]))
-print("type(x[1]) = ",type(x[1]))
-print("type(x[2]) = ",type(x[2]))
+for i in range(size):
+ print("type(x[{}]) = ".format(i),type(x[i]))
import sys
try:
for y in x:
diff --git a/nlcpy_test/112_alltoall.py b/nlcpy_test/112_alltoall.py
index 45dc14b..a2c8b18 100644
--- a/nlcpy_test/112_alltoall.py
+++ b/nlcpy_test/112_alltoall.py
@@ -16,9 +16,8 @@
print("allgather done")
print("x = ",x)
-print("type(x[0]) = ",type(x[0]))
-print("type(x[1]) = ",type(x[1]))
-print("type(x[2]) = ",type(x[2]))
+for i in range(size):
+ print("type(x[{}]) = ".format(i),type(x[i]))
import sys
try:
for y in x:
diff --git a/nlcpy_test/121_AttachDetach.py b/nlcpy_test/121_AttachDetach.py
new file mode 100644
index 0000000..eb18b0b
--- /dev/null
+++ b/nlcpy_test/121_AttachDetach.py
@@ -0,0 +1,17 @@
+from mpi4pyve import MPI
+import numpy as np
+import nlcpy as vp
+import sys
+
+comm = MPI.COMM_WORLD
+size = comm.Get_size()
+rank = comm.Get_rank()
+
+if rank == 0:
+ print("rank = ",rank)
+
+ x = vp.arange(200000, dtype=int)
+ print(x.__ve_array_interface__)
+ MPI.Attach_buffer(x)
+ MPI.Detach_buffer()
+
diff --git a/nlcpy_test/200_Send_Recv_Offset.py b/nlcpy_test/200_Send_Recv_Offset.py
new file mode 100644
index 0000000..18b319d
--- /dev/null
+++ b/nlcpy_test/200_Send_Recv_Offset.py
@@ -0,0 +1,32 @@
+from mpi4pyve import MPI
+import numpy as np
+import nlcpy as vp
+
+comm = MPI.COMM_WORLD
+size = comm.Get_size()
+rank = comm.Get_rank()
+
+if rank == 0:
+ print("rank = ",rank)
+
+ #x = vp.arange(10)[::2]
+ x = vp.arange(10)[2:]
+ print("x = ",x)
+ print("type(x) = ",type(x))
+ comm.Send([x, MPI.INT], dest=1)
+
+elif rank == 1:
+ print("rank = ",rank)
+
+ y = vp.empty(10-2, dtype=int)
+ comm.Recv([y, MPI.INT])
+ print("y = ",y)
+ print("type(y) = ",type(y))
+
+ import sys
+ try:
+ y
+ if not isinstance(y, vp.core.core.ndarray):
+ print("NG : ", __file__, file=sys.stderr)
+ except NameError:
+ print("Failure test case : ", __file__, file=sys.stderr)
diff --git a/nlcpy_test/214_Bcast.py b/nlcpy_test/214_Bcast.py
index 06d4168..1f4a2d3 100644
--- a/nlcpy_test/214_Bcast.py
+++ b/nlcpy_test/214_Bcast.py
@@ -9,7 +9,7 @@
if rank == 0:
x = vp.array([1,2,3], dtype=int)
else:
- x = vp.empty(3, dtype=int)
+ x = vp.zeros(3, dtype=int)
print("rank = ",rank)
diff --git a/nlcpy_test/217_Scatter.py b/nlcpy_test/217_Scatter.py
index a42264f..851e5b8 100644
--- a/nlcpy_test/217_Scatter.py
+++ b/nlcpy_test/217_Scatter.py
@@ -15,7 +15,7 @@
else:
x = None
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
print("x = ",x)
diff --git a/nlcpy_test/218_Scatterv.py b/nlcpy_test/218_Scatterv.py
index ce43d45..96edfb9 100644
--- a/nlcpy_test/218_Scatterv.py
+++ b/nlcpy_test/218_Scatterv.py
@@ -21,7 +21,7 @@
else:
x = None
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
diff --git a/nlcpy_test/219_Allgather.py b/nlcpy_test/219_Allgather.py
index e9e617b..001185e 100644
--- a/nlcpy_test/219_Allgather.py
+++ b/nlcpy_test/219_Allgather.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((size, 2), dtype=int)
+y = vp.zeros((size, 2), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/220_Allgatherv.py b/nlcpy_test/220_Allgatherv.py
index 38c6b21..481ea71 100644
--- a/nlcpy_test/220_Allgatherv.py
+++ b/nlcpy_test/220_Allgatherv.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((size, 2), dtype=int)
+y = vp.zeros((size, 2), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/221_Alltoall.py b/nlcpy_test/221_Alltoall.py
index 8c2911a..2b582f1 100644
--- a/nlcpy_test/221_Alltoall.py
+++ b/nlcpy_test/221_Alltoall.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.arange(size**2, dtype=int).reshape(size, size) * (rank + 1)
-y = vp.empty((size, size), dtype=int)
+y = vp.zeros((size, size), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/222_Alltoallv.py b/nlcpy_test/222_Alltoallv.py
index ab96793..92a9245 100644
--- a/nlcpy_test/222_Alltoallv.py
+++ b/nlcpy_test/222_Alltoallv.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.arange(size**2, dtype=int).reshape(size, size) * (rank + 1)
-y = vp.empty((size, size), dtype=int)
+y = vp.zeros((size, size), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/224_Reduce.py b/nlcpy_test/224_Reduce.py
index 41cd186..75d572f 100644
--- a/nlcpy_test/224_Reduce.py
+++ b/nlcpy_test/224_Reduce.py
@@ -10,7 +10,7 @@
print("rank = ",rank)
x = vp.array([(rank+1)**2 , rank], dtype=int)
-y = vp.empty(2, dtype=int)
+y = vp.zeros(2, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/225_Allreduce.py b/nlcpy_test/225_Allreduce.py
index c6d8f75..03a2768 100644
--- a/nlcpy_test/225_Allreduce.py
+++ b/nlcpy_test/225_Allreduce.py
@@ -10,7 +10,7 @@
print("rank = ",rank)
x = vp.array([(rank+1)**2 , rank], dtype=int)
-y = vp.empty(2, dtype=int)
+y = vp.zeros(2, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/226_Reduce_scatter_block.py b/nlcpy_test/226_Reduce_scatter_block.py
index 2e2b030..b5e29da 100644
--- a/nlcpy_test/226_Reduce_scatter_block.py
+++ b/nlcpy_test/226_Reduce_scatter_block.py
@@ -10,7 +10,7 @@
print("rank = ",rank)
x = vp.arange(size, dtype=int) * (rank + 1)
-y = vp.empty(1, dtype=int)
+y = vp.zeros(1, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/227_Reduce_scatter.py b/nlcpy_test/227_Reduce_scatter.py
index 2f8734e..def86b5 100644
--- a/nlcpy_test/227_Reduce_scatter.py
+++ b/nlcpy_test/227_Reduce_scatter.py
@@ -10,7 +10,7 @@
print("rank = ",rank)
x = vp.arange(size, dtype=int) * (rank + 1)
-y = vp.empty(1, dtype=int)
+y = vp.zeros(1, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/233_Iallgather.py b/nlcpy_test/233_Iallgather.py
index 7a5eea1..fe27fc6 100644
--- a/nlcpy_test/233_Iallgather.py
+++ b/nlcpy_test/233_Iallgather.py
@@ -10,7 +10,7 @@
#x = vp.array([(rank+1)**2 ,rank], dtype=int)
x = vp.array([1,2,3], dtype=int)
-y = vp.empty((3,3), dtype=int)
+y = vp.empty((size,3), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/234_Iallgatherv.py b/nlcpy_test/234_Iallgatherv.py
index 15739d2..e584f19 100644
--- a/nlcpy_test/234_Iallgatherv.py
+++ b/nlcpy_test/234_Iallgatherv.py
@@ -10,7 +10,7 @@
#x = vp.array([(rank+1)**2 ,rank], dtype=int)
x = vp.array([1,2,3], dtype=int)
-y = vp.empty((3,3), dtype=int)
+y = vp.empty((size,3), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/240_Ireduce_scatter_block.py b/nlcpy_test/240_Ireduce_scatter_block.py
index 41f2847..673e0d1 100644
--- a/nlcpy_test/240_Ireduce_scatter_block.py
+++ b/nlcpy_test/240_Ireduce_scatter_block.py
@@ -11,7 +11,7 @@
#x = vp.arange(size, dtype=int) * (rank + 1)
x = vp.array([[rank+1 for i in range(size)] for j in range(size) ], dtype=int)
-y = vp.empty(3, dtype=int)
+y = vp.empty(size, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/241_Ireduce_scatter.py b/nlcpy_test/241_Ireduce_scatter.py
index d883480..5d61e98 100644
--- a/nlcpy_test/241_Ireduce_scatter.py
+++ b/nlcpy_test/241_Ireduce_scatter.py
@@ -11,7 +11,7 @@
#x = vp.arange(size, dtype=int) * (rank + 1)
x = vp.array([[rank+1 for i in range(size)] for j in range(size) ], dtype=int)
-y = vp.empty(3, dtype=int)
+y = vp.empty(size, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/300_Scan.py b/nlcpy_test/300_Scan.py
index 6325aa3..9e4c4bd 100644
--- a/nlcpy_test/300_Scan.py
+++ b/nlcpy_test/300_Scan.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.array([1,2,3], dtype=int) * (rank+1)
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/301_Exscan.py b/nlcpy_test/301_Exscan.py
index 65e2605..6184e71 100644
--- a/nlcpy_test/301_Exscan.py
+++ b/nlcpy_test/301_Exscan.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.array([1,2,3], dtype=int) * (rank+1)
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/302_Iscan.py b/nlcpy_test/302_Iscan.py
index 005b22f..cc42da6 100644
--- a/nlcpy_test/302_Iscan.py
+++ b/nlcpy_test/302_Iscan.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.array([1,2,3], dtype=int) * (rank+1)
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/303_Iexscan.py b/nlcpy_test/303_Iexscan.py
index 0b619e9..44691df 100644
--- a/nlcpy_test/303_Iexscan.py
+++ b/nlcpy_test/303_Iexscan.py
@@ -9,7 +9,7 @@
print("rank = ",rank)
x = vp.array([1,2,3], dtype=int) * (rank+1)
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/306_Neighbor_allgather.py b/nlcpy_test/306_Neighbor_allgather.py
index a0b1a2c..71be247 100644
--- a/nlcpy_test/306_Neighbor_allgather.py
+++ b/nlcpy_test/306_Neighbor_allgather.py
@@ -2,14 +2,15 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((2, 2), dtype=int)
+y = vp.zeros((2, 2), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/307_Neighbor_allgatherv.py b/nlcpy_test/307_Neighbor_allgatherv.py
index d648a78..f29e9d7 100644
--- a/nlcpy_test/307_Neighbor_allgatherv.py
+++ b/nlcpy_test/307_Neighbor_allgatherv.py
@@ -2,14 +2,14 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((2, 2), dtype=int)
+y = vp.zeros((2, 2), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/308_Neighbor_alltoall.py b/nlcpy_test/308_Neighbor_alltoall.py
index c734a6a..fa80a1e 100644
--- a/nlcpy_test/308_Neighbor_alltoall.py
+++ b/nlcpy_test/308_Neighbor_alltoall.py
@@ -2,14 +2,14 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((2,), dtype=int)
+y = vp.zeros((2,), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/309_Neighbor_alltoallv.py b/nlcpy_test/309_Neighbor_alltoallv.py
index b62776e..0727184 100644
--- a/nlcpy_test/309_Neighbor_alltoallv.py
+++ b/nlcpy_test/309_Neighbor_alltoallv.py
@@ -2,14 +2,14 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((2,), dtype=int)
+y = vp.zeros((2,), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/310_Neighbor_alltoallw.py b/nlcpy_test/310_Neighbor_alltoallw.py
index e67df45..ec623ca 100644
--- a/nlcpy_test/310_Neighbor_alltoallw.py
+++ b/nlcpy_test/310_Neighbor_alltoallw.py
@@ -3,7 +3,7 @@
import nlcpy as vp
dim = 3
-comm = MPI.COMM_WORLD.Create_cart((dim,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
diff --git a/nlcpy_test/311_Ineighbor_allgather.py b/nlcpy_test/311_Ineighbor_allgather.py
index 5343fb0..238950f 100644
--- a/nlcpy_test/311_Ineighbor_allgather.py
+++ b/nlcpy_test/311_Ineighbor_allgather.py
@@ -2,14 +2,14 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((2, 2), dtype=int)
+y = vp.zeros((2, 2), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/312_Ineighbor_allgatherv.py b/nlcpy_test/312_Ineighbor_allgatherv.py
index f1eca84..f31e94a 100644
--- a/nlcpy_test/312_Ineighbor_allgatherv.py
+++ b/nlcpy_test/312_Ineighbor_allgatherv.py
@@ -2,14 +2,14 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
print("rank = ",rank)
x = vp.array([(rank+1)**2 ,rank], dtype=int)
-y = vp.empty((2, 2), dtype=int)
+y = vp.zeros((2, 2), dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
diff --git a/nlcpy_test/313_Ineighbor_alltoall.py b/nlcpy_test/313_Ineighbor_alltoall.py
index ed32684..995af40 100644
--- a/nlcpy_test/313_Ineighbor_alltoall.py
+++ b/nlcpy_test/313_Ineighbor_alltoall.py
@@ -3,7 +3,7 @@
import nlcpy as vp
dim = 3
-comm = MPI.COMM_WORLD.Create_cart((dim,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
diff --git a/nlcpy_test/314_Ineighbor_alltoallv.py b/nlcpy_test/314_Ineighbor_alltoallv.py
index ff78fb8..63fe412 100644
--- a/nlcpy_test/314_Ineighbor_alltoallv.py
+++ b/nlcpy_test/314_Ineighbor_alltoallv.py
@@ -3,7 +3,7 @@
import nlcpy as vp
dim = 3
-comm = MPI.COMM_WORLD.Create_cart((dim,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
diff --git a/nlcpy_test/315_Ineighbor_alltoallw.py b/nlcpy_test/315_Ineighbor_alltoallw.py
index 2811e42..071bb98 100644
--- a/nlcpy_test/315_Ineighbor_alltoallw.py
+++ b/nlcpy_test/315_Ineighbor_alltoallw.py
@@ -3,7 +3,7 @@
import nlcpy as vp
dim = 3
-comm = MPI.COMM_WORLD.Create_cart((dim,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
diff --git a/nlcpy_test/316_neighbor_allgather.py b/nlcpy_test/316_neighbor_allgather.py
index ebaccf6..1354e78 100644
--- a/nlcpy_test/316_neighbor_allgather.py
+++ b/nlcpy_test/316_neighbor_allgather.py
@@ -2,7 +2,7 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
diff --git a/nlcpy_test/317_neighbor_alltoall.py b/nlcpy_test/317_neighbor_alltoall.py
index 9f110e8..f45212f 100644
--- a/nlcpy_test/317_neighbor_alltoall.py
+++ b/nlcpy_test/317_neighbor_alltoall.py
@@ -2,7 +2,7 @@
import numpy as np
import nlcpy as vp
-comm = MPI.COMM_WORLD.Create_cart((3,))
+comm = MPI.COMM_WORLD.Create_cart((MPI.COMM_WORLD.Get_size(),))
size = comm.Get_size()
rank = comm.Get_rank()
diff --git a/nlcpy_test/500_Reduce_local.py b/nlcpy_test/500_Reduce_local.py
index aaf0b78..1a8f019 100644
--- a/nlcpy_test/500_Reduce_local.py
+++ b/nlcpy_test/500_Reduce_local.py
@@ -12,7 +12,7 @@
op = MPI.SUM
x = vp.array([(rank+1)**2 , rank], dtype=int)
-y = vp.empty(2, dtype=int)
+y = vp.zeros(2, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/600_Pack_Unpack.py b/nlcpy_test/600_Pack_Unpack.py
index b86a1b9..6c4a426 100644
--- a/nlcpy_test/600_Pack_Unpack.py
+++ b/nlcpy_test/600_Pack_Unpack.py
@@ -12,11 +12,11 @@
datatype = MPI.INT
x = vp.array([(rank+1)**2 , rank], dtype=int)
-y = vp.empty(2, dtype=int)
+y = vp.zeros(2, dtype=int)
size1 = datatype.Pack_size(len(x), comm)
size2 = datatype.Pack_size(len(y), comm)
-tmpbuf = vp.empty(size1 + size2 + 1, dtype=int)
+tmpbuf = vp.zeros(size1 + size2 + 1, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/601_Pack_Unpack_external.py b/nlcpy_test/601_Pack_Unpack_external.py
index 52ca71a..578adf2 100644
--- a/nlcpy_test/601_Pack_Unpack_external.py
+++ b/nlcpy_test/601_Pack_Unpack_external.py
@@ -13,11 +13,11 @@
EXT32 = 'external32'
x = vp.array([(rank+1)**2 , rank], dtype=int)
-y = vp.empty(2, dtype=int)
+y = vp.zeros(2, dtype=int)
size1 = datatype.Pack_external_size(EXT32, x.size)
size2 = datatype.Pack_external_size(EXT32, len(y))
-tmpbuf = vp.empty(size1 + size2 + 1, dtype=int)
+tmpbuf = vp.zeros(size1 + size2 + 1, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
diff --git a/nlcpy_test/713_IReadWrite_AtAll.py b/nlcpy_test/713_IReadWrite_AtAll.py
index 9f2e7de..e0f1673 100644
--- a/nlcpy_test/713_IReadWrite_AtAll.py
+++ b/nlcpy_test/713_IReadWrite_AtAll.py
@@ -9,21 +9,21 @@
fh = get_fh()
fh.Set_size(0)
-fh.Set_view(0, MPI.INT)
+fh.Set_view(rank*12, MPI.INT)
x = vp.array([1,2,3], dtype=int)
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
print("type(y) = ",type(y))
-fh.Iwrite_at_all(3, x).Wait()
+fh.Iwrite_at_all(rank*12, x).Wait()
fh.Sync()
comm.Barrier()
fh.Sync()
-fh.Iread_at_all(3, y).Wait()
+fh.Iread_at_all(rank*12, y).Wait()
comm.Barrier()
print("Iwrite_at_all-Iread_at_all done")
diff --git a/nlcpy_test/714_IReadWrite_All.py b/nlcpy_test/714_IReadWrite_All.py
index 647369a..79e2354 100644
--- a/nlcpy_test/714_IReadWrite_All.py
+++ b/nlcpy_test/714_IReadWrite_All.py
@@ -9,23 +9,25 @@
fh = get_fh()
fh.Set_size(0)
-fh.Set_view(0, MPI.INT)
+fh.Set_view(rank*12, MPI.INT)
x = vp.array([1,2,3], dtype=int)
-y = vp.empty(3, dtype=int)
+y = vp.zeros(3, dtype=int)
print("x = ",x)
print("type(x) = ",type(x))
print("y = ",y)
print("type(y) = ",type(y))
-fh.Seek(3, MPI.SEEK_SET)
-fh.Iwrite_all(x).Wait()
+fh.Seek(rank*12, MPI.SEEK_SET)
+req = fh.Iwrite_all(x)
+req.Wait()
fh.Sync()
comm.Barrier()
fh.Sync()
-fh.Seek(3, MPI.SEEK_SET)
-fh.Iread_all(y).Wait()
+fh.Seek(rank*12, MPI.SEEK_SET)
+req = fh.Iread_all(y)
+req.Wait()
comm.Barrier()
print("Iwrite_all-Iread_all done")
diff --git a/nlcpy_test/pingpong_elapse.py b/nlcpy_test/pingpong_elapse.py
new file mode 100644
index 0000000..b5a875c
--- /dev/null
+++ b/nlcpy_test/pingpong_elapse.py
@@ -0,0 +1,46 @@
+from mpi4pyve import MPI
+import nlcpy as vp
+import time
+
+comm = MPI.COMM_WORLD
+size = comm.Get_size()
+rank = comm.Get_rank()
+
+tag1 = 10
+tag2 = 20
+loop_count = 50
+
+comm.barrier()
+status = MPI.Status()
+
+for N in range(0,28):
+ A = 1 << N
+ V = vp.zeros(A, dtype=float)
+
+ # Warm-up
+ for i in range(5):
+ if rank == 0:
+ comm.Send([V, MPI.DOUBLE], 1, tag1)
+ comm.Recv([V, MPI.DOUBLE], 1, tag2, status=status)
+ elif rank == 1:
+ comm.Recv([V, MPI.DOUBLE], 0, tag1, status=status)
+ comm.Send([V, MPI.DOUBLE], 0, tag2)
+
+ comm.barrier()
+
+ t0 = MPI.Wtime()
+ for i in range(loop_count):
+ if rank == 0:
+ comm.Send([V, MPI.DOUBLE], 1, tag1)
+ comm.Recv([V, MPI.DOUBLE], 1, tag2, status=status)
+ elif rank == 1:
+ comm.Recv([V, MPI.DOUBLE], 0, tag1, status=status)
+ comm.Send([V, MPI.DOUBLE], 0, tag2)
+ t1 = MPI.Wtime()
+ elapsed_time = t1 - t0
+ num_B = 8*A
+ B_in_GB = 1 << 30
+ num_GB = num_B / B_in_GB
+ avg_time_per_transfer = elapsed_time / (2.0 * loop_count)
+ if rank == 0:
+ print('Transfer size (B): {:>10d}, Transfer Time (s): {:15.9f}, Bandwidth (GB/s): {:15.9f}'.format(num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer))
diff --git a/rpm/Makefile b/rpm/Makefile
new file mode 100644
index 0000000..f69147c
--- /dev/null
+++ b/rpm/Makefile
@@ -0,0 +1,57 @@
+.PHONY:mpi4py-ve
+
+all:mpi4py-ve
+
+BASEDIR = ..
+
+# Parameters passed into the spec file
+PLAT = el8
+ARCH = x86_64
+VERSION_PYTHON = python3.6
+VERSION_MPI4PYVE = 1.0.0
+VERSION_RPM = 1
+RELEASE = 1
+
+DISTDIR = $(BASEDIR)/dist
+RPMDIR = $(DISTDIR)/rpm/RPMBUILD
+SPECS = $(RPMDIR)/SPECS
+RPMS = $(RPMDIR)/RPMS
+SOURCES = $(RPMDIR)/SOURCES
+
+# convert 'python3.x -> python3x'
+$(eval REQUIRES_PYTHON = $(shell echo $(VERSION_PYTHON) | sed -e s"/\.//"))
+
+PRODUCT = nec-$(REQUIRES_PYTHON)-mpi4py-ve-$(VERSION_MPI4PYVE)
+SRC_MPI4PYVE = nec-$(REQUIRES_PYTHON)-mpi4py-ve-$(VERSION_MPI4PYVE)-$(VERSION_RPM)-$(RELEASE)
+RPM_MPI4PYVE = $(SRC_MPI4PYVE).$(PLAT).$(ARCH).rpm
+
+# Options of rpmbuild command
+$(eval SPECDEF =--define "_topdir `cd ..;pwd`")
+$(eval SPECDEF +=--define "python_version $(VERSION_PYTHON)")
+$(eval SPECDEF +=--define "mpi4py_ve_version $(VERSION_MPI4PYVE)")
+$(eval SPECDEF +=--define "rpm_version $(VERSION_RPM)")
+$(eval SPECDEF +=--define "mpi4py_ve_platform $(PLAT)")
+$(eval SPECDEF +=--define "mpi4py_ve_arch $(ARCH)")
+$(eval SPECDEF +=--define "rpm_release $(RELEASE)")
+$(eval SPECDEF +=--define "product $(PRODUCT)")
+
+mpi4py-ve: $(RPMS)/$(ARCH)/$(RPM_MPI4PYVE)
+
+# Creates Specfiles
+$(SPECS)/$(RPM_MPI4PYVE:%.rpm=%.spec):
+ mkdir -p $(SPECS) $(RPMS) $(SOURCES)
+ cp -p SPECS/nec-python3-mpi4py-ve $@
+
+# Creates RPM files
+$(RPMS)/$(ARCH)/$(RPM_MPI4PYVE): $(SPECS)/$(RPM_MPI4PYVE:%.rpm=%.spec)
+ $(eval SRC = $(SRC_MPI4PYVE))
+ $(eval SRCDIR = $(SOURCES)/$(SRC))
+ rm -rf $(SRCDIR)
+ mkdir -p $(SRCDIR)
+ cp -p $(DISTDIR)/mpi4py_ve*.whl $(SRCDIR)/
+ cd $(SRCDIR) && unzip mpi4py_ve*.whl && rm -f mpi4py_ve*.whl
+ cd $(SOURCES) && tar zcvf $(SRC).tar.gz $(SRC)
+ cd $(SPECS) && rpmbuild --bb $(SPECDEF) $(^F)
+
+clean:
+ rm -rf $(RPMDIR)
diff --git a/rpm/SPECS/nec-python3-mpi4py-ve b/rpm/SPECS/nec-python3-mpi4py-ve
new file mode 100644
index 0000000..2b0557d
--- /dev/null
+++ b/rpm/SPECS/nec-python3-mpi4py-ve
@@ -0,0 +1,113 @@
+%define __os_install_post %{nil}
+%define debug_package %{nil}
+
+%define mpi4py_ve_topdir /opt/nec/ve/mpi4py-ve
+%define mpi4py_ve_libdir %{mpi4py_ve_topdir}/%{mpi4py_ve_version}/lib/%{python_version}
+
+Name: %{product}
+Version: %{rpm_version}
+Release: %{rpm_release}.%{mpi4py_ve_platform}
+Summary: Message Passing Interface Python library for SX-Aurora TSUBASA
+Group: Development/Libraries
+Vendor: NEC Corporation
+License: BSD-2-Clause
+URL: https://github.com/SX-Aurora/mpi4py-ve
+BuildArch: %{mpi4py_ve_arch}
+#
+Source0: %{product}-%{rpm_version}-%{rpm_release}.tar.gz
+#
+BuildRoot: %(mktemp -ud %{_tmppath}/%{product}-XXXXXX)
+
+AutoReqProv: no
+
+%description
+mpi4py-ve is an extension to mpi4py for SX-Aurora TSUBASA systems.
+
+
+##################################################
+##
+%prep
+##
+##################################################
+# install Libraries
+%__rm -rf $RPM_BUILD_ROOT
+
+%setup -q -n %{product}-%{rpm_version}-%{rpm_release}
+
+##################################################
+##
+%build
+##
+##################################################
+
+##################################################
+##
+%install
+##
+##################################################
+# create install directories
+install -d --mode=755 %{buildroot}%{mpi4py_ve_libdir}
+cp -r mpi4pyve %{buildroot}%{mpi4py_ve_libdir}/
+find %{buildroot}%{mpi4py_ve_libdir} -name "*.so" -type f | xargs chmod 755
+find %{buildroot}%{mpi4py_ve_libdir} ! -name "*.so" -type f | xargs chmod 644
+
+##############################################
+##
+%clean
+##
+##############################################
+%__rm -rf %{buildroot}
+
+##############################################
+##
+%files
+##
+##############################################
+%defattr(-,root,root,-)
+
+%{mpi4py_ve_libdir}
+
+##############################################
+##
+%post
+##
+##############################################
+if [ ! -d /usr/lib64/%{python_version}/site-packages ]; then
+ install -m 755 -d /usr/lib64/%{python_version}/site-packages
+fi
+cd /usr/lib64/%{python_version}/site-packages/
+if [ ! -e mpi4py-ve.pth ]; then
+ touch mpi4py-ve.pth
+ chmod 644 mpi4py-ve.pth
+fi
+mpi4py_ve_path=`ls -d /opt/nec/ve/mpi4py-ve/*.*.*/lib/%{python_version} | sort -rV | head -1`
+echo ${mpi4py_ve_path} > mpi4py-ve.pth
+
+##############################################
+##
+%postun
+##
+##############################################
+if [ "$1" -eq 0 ]; then
+ rm -rf %{mpi4py_ve_libdir}/mpi4pyve
+ rmdir --ignore-fail-on-non-empty %{mpi4py_ve_libdir} > /dev/null 2>&1
+ rmdir --ignore-fail-on-non-empty %{mpi4py_ve_topdir}/%{mpi4py_ve_version}/lib > /dev/null 2>&1
+ rmdir --ignore-fail-on-non-empty %{mpi4py_ve_topdir}/%{mpi4py_ve_version} > /dev/null 2>&1
+ rmdir --ignore-fail-on-non-empty %{mpi4py_ve_topdir} > /dev/null 2>&1
+ if ls -d /opt/nec/ve/mpi4py-ve/*.*.*/lib/%{python_version}/mpi4pyve > /dev/null 2>&1; then
+ mpi4py_ve_path=`ls -d /opt/nec/ve/mpi4py-ve/*.*.*/lib/%{python_version} | sort -rV | head -1`
+ echo ${mpi4py_ve_path} > /usr/lib64/%{python_version}/site-packages/mpi4py-ve.pth
+ else
+ rm -f /usr/lib64/%{python_version}/site-packages/mpi4py-ve.pth
+ rmdir --ignore-fail-on-non-empty /usr/lib64/%{python_version}/site-packages > /dev/null 2>&1
+ rmdir --ignore-fail-on-non-empty /usr/lib64/%{python_version} > /dev/null 2>&1
+ fi
+fi
+
+##############################################
+##
+%changelog
+##
+##############################################
+* Tue Oct 25 2022 NEC Corporation
+- First version of this spec file.
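The %post scriptlet above keeps the newest installed version importable by writing its lib directory into ``mpi4py-ve.pth``. The version selection done by ``sort -rV | head -1`` can be mimicked in Python as follows (paths and the Python version are assumptions):

::

    import glob

    # assumed: one directory per installed mpi4py-ve version
    candidates = glob.glob('/opt/nec/ve/mpi4py-ve/*.*.*/lib/python3.6')

    def version_key(path):
        # pull 'X.Y.Z' out of .../mpi4py-ve/X.Y.Z/lib/... and compare numerically,
        # which is what 'sort -rV' does for these paths
        version = path.split('/mpi4py-ve/')[1].split('/')[0]
        return tuple(int(part) for part in version.split('.'))

    if candidates:
        newest = max(candidates, key=version_key)
        with open('/usr/lib64/python3.6/site-packages/mpi4py-ve.pth', 'w') as f:
            f.write(newest + '\n')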
diff --git a/setup.py b/setup.py
index c789571..7604b78 100644
--- a/setup.py
+++ b/setup.py
@@ -94,7 +94,6 @@ def description():
}
metadata['provides'] = ['mpi4py_ve']
-metadata['install_requires'] = ['nlcpy>=2.1.1']
# --------------------------------------------------------------------
# Extension modules
@@ -453,8 +452,10 @@ def run_setup():
if not has_src or has_git or has_hg:
setup_args['setup_requires'] = ['Cython>='+CYTHON]
#
- setup(packages = ['mpi4pyve', 'mpi4pyve.futures'],
- package_dir = {'mpi4pyve' : 'src/mpi4pyve'},
+ setup(packages = ['mpi4pyve', 'mpi4pyve.futures', 'mpi4pyve.util', 'mpi4pyve.veo'],
+ package_dir = {'mpi4pyve' : 'src/mpi4pyve',
+ 'mpi4pyve.util' : 'src/mpi4pyve/util',
+ 'mpi4pyve.veo' : 'src/mpi4pyve/veo'},
package_data = {'mpi4pyve' : ['*.pxd',
'include/mpi4pyve/*.h',
'include/mpi4pyve/*.i',
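The packaging change above registers ``mpi4pyve.util`` and ``mpi4pyve.veo`` as subpackages, so after installation they can be imported alongside the core bindings (a sketch, assuming a default install):

::

    from mpi4pyve import MPI   # core MPI bindings
    import mpi4pyve.util       # subpackage added by this change
    import mpi4pyve.veo        # subpackage added by this change

    print(MPI.Get_library_version())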
diff --git a/src/mpi4pyve/MPI.pxd b/src/mpi4pyve/MPI.pxd
index 6098ce0..544789c 100644
--- a/src/mpi4pyve/MPI.pxd
+++ b/src/mpi4pyve/MPI.pxd
@@ -95,8 +95,6 @@ ctypedef public api class Request [
cdef MPI_Request ob_mpi
cdef unsigned flags
cdef object ob_buf
- cdef object numpy_arr
- cdef object nlcpy_arr
ctypedef public api class Prequest(Request) [
type PyMPIPrequest_Type,
diff --git a/src/mpi4pyve/MPI/Comm.pyx b/src/mpi4pyve/MPI/Comm.pyx
index 1db9525..c39b5f0 100644
--- a/src/mpi4pyve/MPI/Comm.pyx
+++ b/src/mpi4pyve/MPI/Comm.pyx
@@ -53,11 +53,8 @@
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
-import numpy
-import nlcpy
import mpi4pyve
-include "NLCPy.pyx"
-include "Notimpl.pyx"
+
# Communicator Comparisons
# ------------------------
@@ -320,7 +317,6 @@ cdef class Comm:
# Blocking Send and Receive Operations
# ------------------------------------
- @send_for_nlcpy_array
def Send(self, buf, int dest, int tag=0):
"""
Blocking send
@@ -334,7 +330,6 @@ cdef class Comm:
smsg.buf, smsg.count, smsg.dtype,
dest, tag, self.ob_mpi) )
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Recv(self, buf, int source=ANY_SOURCE, int tag=ANY_TAG,
Status status=None):
"""
@@ -350,7 +345,6 @@ cdef class Comm:
# Send-Receive
# ------------
- @sendrecv_buffer_kwarg_for_nlcpy_array
def Sendrecv(self, sendbuf, int dest, int sendtag=0,
recvbuf=None, int source=ANY_SOURCE, int recvtag=ANY_TAG,
Status status=None):
@@ -373,7 +367,6 @@ cdef class Comm:
rmsg.buf, rmsg.count, rmsg.dtype, source, recvtag,
self.ob_mpi, statusp) )
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Sendrecv_replace(self, buf, int dest, int sendtag=0,
int source=ANY_SOURCE, int recvtag=ANY_TAG,
Status status=None):
@@ -401,7 +394,6 @@ cdef class Comm:
# Nonblocking Communications
# --------------------------
- @send_for_nlcpy_array
def Isend(self, buf, int dest, int tag=0):
"""
Nonblocking send
@@ -414,20 +406,12 @@ cdef class Comm:
request.ob_buf = smsg
return request
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def Irecv(self, buf, int source=ANY_SOURCE, int tag=ANY_TAG,
- numpy_arr=None, nlcpy_arr=None):
+ def Irecv(self, buf, int source=ANY_SOURCE, int tag=ANY_TAG):
"""
Nonblocking receive
"""
cdef _p_msg_p2p rmsg = message_p2p_recv(buf, source)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Irecv(
rmsg.buf, rmsg.count, rmsg.dtype,
source, tag, self.ob_mpi, &request.ob_mpi) )
@@ -494,7 +478,6 @@ cdef class Comm:
# Persistent Communication
# ------------------------
- @send_for_nlcpy_array
def Send_init(self, buf, int dest, int tag=0):
"""
Create a persistent request for a standard send
@@ -507,20 +490,12 @@ cdef class Comm:
request.ob_buf = smsg
return request
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def Recv_init(self, buf, int source=ANY_SOURCE, int tag=ANY_TAG,
- numpy_arr=None, nlcpy_arr=None):
+ def Recv_init(self, buf, int source=ANY_SOURCE, int tag=ANY_TAG):
"""
Create a persistent request for a receive
"""
cdef _p_msg_p2p rmsg = message_p2p_recv(buf, source)
- cdef Prequest request
- if numpy_arr is None:
- request = Prequest.__new__(Prequest)
- else:
- request = Prequest.__new__(Prequest,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Prequest request = Prequest.__new__(Prequest)
with nogil: CHKERR( MPI_Recv_init(
rmsg.buf, rmsg.count, rmsg.dtype,
source, tag, self.ob_mpi, &request.ob_mpi) )
@@ -532,7 +507,7 @@ cdef class Comm:
# Blocking calls
- @send_for_nlcpy_array
+ @raise_notimpl_for_vai_buffer
def Bsend(self, buf, int dest, int tag=0):
"""
Blocking send in buffered mode
@@ -542,7 +517,6 @@ cdef class Comm:
smsg.buf, smsg.count, smsg.dtype,
dest, tag, self.ob_mpi) )
- @send_for_nlcpy_array
def Ssend(self, buf, int dest, int tag=0):
"""
Blocking send in synchronous mode
@@ -552,7 +526,6 @@ cdef class Comm:
smsg.buf, smsg.count, smsg.dtype,
dest, tag, self.ob_mpi) )
- @send_for_nlcpy_array
def Rsend(self, buf, int dest, int tag=0):
"""
Blocking send in ready mode
@@ -564,7 +537,7 @@ cdef class Comm:
# Nonblocking calls
- @send_for_nlcpy_array
+ @raise_notimpl_for_vai_buffer
def Ibsend(self, buf, int dest, int tag=0):
"""
Nonblocking send in buffered mode
@@ -577,7 +550,6 @@ cdef class Comm:
request.ob_buf = smsg
return request
- @send_for_nlcpy_array
def Issend(self, buf, int dest, int tag=0):
"""
Nonblocking send in synchronous mode
@@ -590,7 +562,6 @@ cdef class Comm:
request.ob_buf = smsg
return request
- @send_for_nlcpy_array
def Irsend(self, buf, int dest, int tag=0):
"""
Nonblocking send in ready mode
@@ -605,7 +576,7 @@ cdef class Comm:
# Persistent Requests
- @send_for_nlcpy_array
+ @raise_notimpl_for_vai_buffer
def Bsend_init(self, buf, int dest, int tag=0):
"""
Persistent request for a send in buffered mode
@@ -618,7 +589,6 @@ cdef class Comm:
request.ob_buf = smsg
return request
- @send_for_nlcpy_array
def Ssend_init(self, buf, int dest, int tag=0):
"""
Persistent request for a send in synchronous mode
@@ -631,7 +601,6 @@ cdef class Comm:
request.ob_buf = smsg
return request
- @send_for_nlcpy_array
def Rsend_init(self, buf, int dest, int tag=0):
"""
Persistent request for a send in ready mode
@@ -659,7 +628,6 @@ cdef class Comm:
# Global Communication Functions
# ------------------------------
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Bcast(self, buf, int root=0):
"""
Broadcast a message from one process
@@ -671,7 +639,6 @@ cdef class Comm:
m.sbuf, m.scount, m.stype,
root, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Gather(self, sendbuf, recvbuf, int root=0):
"""
Gather together values from a group of processes
@@ -683,7 +650,6 @@ cdef class Comm:
m.rbuf, m.rcount, m.rtype,
root, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Gatherv(self, sendbuf, recvbuf, int root=0):
"""
Gather Vector, gather data to one process from all other
@@ -697,7 +663,6 @@ cdef class Comm:
m.rbuf, m.rcounts, m.rdispls, m.rtype,
root, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Scatter(self, sendbuf, recvbuf, int root=0):
"""
Scatter data from one process
@@ -710,7 +675,6 @@ cdef class Comm:
m.rbuf, m.rcount, m.rtype,
root, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Scatterv(self, sendbuf, recvbuf, int root=0):
"""
Scatter Vector, scatter data from one process to all other
@@ -724,7 +688,6 @@ cdef class Comm:
m.rbuf, m.rcount, m.rtype,
root, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Allgather(self, sendbuf, recvbuf):
"""
Gather to All, gather data from all processes and
@@ -737,7 +700,6 @@ cdef class Comm:
m.rbuf, m.rcount, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Allgatherv(self, sendbuf, recvbuf):
"""
Gather to All Vector, gather data from all processes and
@@ -751,7 +713,6 @@ cdef class Comm:
m.rbuf, m.rcounts, m.rdispls, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Alltoall(self, sendbuf, recvbuf):
"""
All to All Scatter/Gather, send data from all to all
@@ -764,7 +725,6 @@ cdef class Comm:
m.rbuf, m.rcount, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Alltoallv(self, sendbuf, recvbuf):
"""
All to All Scatter/Gather Vector, send data from all to all
@@ -778,7 +738,6 @@ cdef class Comm:
m.rbuf, m.rcounts, m.rdispls, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Alltoallw(self, sendbuf, recvbuf):
"""
Generalized All-to-All communication allowing different
@@ -795,7 +754,6 @@ cdef class Comm:
# Global Reduction Operations
# ---------------------------
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Reduce(self, sendbuf, recvbuf, Op op=SUM, int root=0):
"""
Reduce
@@ -806,7 +764,6 @@ cdef class Comm:
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, root, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Allreduce(self, sendbuf, recvbuf, Op op=SUM):
"""
All Reduce
@@ -817,7 +774,6 @@ cdef class Comm:
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Reduce_scatter_block(self, sendbuf, recvbuf, Op op=SUM):
"""
Reduce-Scatter Block (regular, non-vector version)
@@ -828,7 +784,6 @@ cdef class Comm:
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Reduce_scatter(self, sendbuf, recvbuf, recvcounts=None, Op op=SUM):
"""
Reduce-Scatter (vector version)
@@ -851,44 +806,26 @@ cdef class Comm:
with nogil: CHKERR( MPI_Ibarrier(self.ob_mpi, &request.ob_mpi) )
return request
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def Ibcast(self, buf, int root=0, numpy_arr=None, nlcpy_arr=None):
+ def Ibcast(self, buf, int root=0):
"""
Nonblocking Broadcast
"""
cdef _p_msg_cco m = message_cco()
m.for_bcast(buf, root, self.ob_mpi)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ibcast(
m.sbuf, m.scount, m.stype,
root, self.ob_mpi, &request.ob_mpi) )
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Igather(self, sendbuf, recvbuf, int root=0,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Igather(self, sendbuf, recvbuf, int root=0):
"""
Nonblocking Gather
"""
cdef _p_msg_cco m = message_cco()
m.for_gather(0, sendbuf, recvbuf, root, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Igather(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcount, m.rtype,
@@ -896,24 +833,13 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Igatherv(self, sendbuf, recvbuf, int root=0,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Igatherv(self, sendbuf, recvbuf, int root=0):
"""
Nonblocking Gather Vector
"""
cdef _p_msg_cco m = message_cco()
m.for_gather(1, sendbuf, recvbuf, root, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Igatherv(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcounts, m.rdispls, m.rtype,
@@ -921,24 +847,13 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Iscatter(self, sendbuf, recvbuf, int root=0,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Iscatter(self, sendbuf, recvbuf, int root=0):
"""
Nonblocking Scatter
"""
cdef _p_msg_cco m = message_cco()
m.for_scatter(0, sendbuf, recvbuf, root, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Iscatter(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcount, m.rtype,
@@ -946,24 +861,13 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Iscatterv(self, sendbuf, recvbuf, int root=0,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Iscatterv(self, sendbuf, recvbuf, int root=0):
"""
Nonblocking Scatter Vector
"""
cdef _p_msg_cco m = message_cco()
m.for_scatter(1, sendbuf, recvbuf, root, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Iscatterv(
m.sbuf, m.scounts, m.sdispls, m.stype,
m.rbuf, m.rcount, m.rtype,
@@ -971,24 +875,13 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Iallgather(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Iallgather(self, sendbuf, recvbuf):
"""
Nonblocking Gather to All
"""
cdef _p_msg_cco m = message_cco()
m.for_allgather(0, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Iallgather(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcount, m.rtype,
@@ -996,48 +889,26 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Iallgatherv(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Iallgatherv(self, sendbuf, recvbuf):
"""
Nonblocking Gather to All Vector
"""
cdef _p_msg_cco m = message_cco()
m.for_allgather(1, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Iallgatherv(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcounts, m.rdispls, m.rtype,
self.ob_mpi, &request.ob_mpi) )
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ialltoall(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ialltoall(self, sendbuf, recvbuf):
"""
Nonblocking All to All Scatter/Gather
"""
cdef _p_msg_cco m = message_cco()
m.for_alltoall(0, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ialltoall(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcount, m.rtype,
@@ -1045,24 +916,13 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ialltoallv(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ialltoallv(self, sendbuf, recvbuf):
"""
Nonblocking All to All Scatter/Gather Vector
"""
cdef _p_msg_cco m = message_cco()
m.for_alltoall(1, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ialltoallv(
m.sbuf, m.scounts, m.sdispls, m.stype,
m.rbuf, m.rcounts, m.rdispls, m.rtype,
@@ -1070,24 +930,13 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ialltoallw(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ialltoallw(self, sendbuf, recvbuf):
"""
Nonblocking Generalized All-to-All
"""
cdef _p_msg_ccow m = message_ccow()
m.for_alltoallw(sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ialltoallw(
m.sbuf, m.scounts, m.sdispls, m.stypes,
m.rbuf, m.rcounts, m.rdispls, m.rtypes,
@@ -1095,94 +944,50 @@ cdef class Comm:
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ireduce(self, sendbuf, recvbuf, Op op=SUM, int root=0,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ireduce(self, sendbuf, recvbuf, Op op=SUM, int root=0):
"""
Nonblocking Reduce
"""
cdef _p_msg_cco m = message_cco()
m.for_reduce(sendbuf, recvbuf, root, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ireduce(
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, root, self.ob_mpi, &request.ob_mpi) )
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Iallreduce(self, sendbuf, recvbuf, Op op=SUM,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Iallreduce(self, sendbuf, recvbuf, Op op=SUM):
"""
Nonblocking All Reduce
"""
cdef _p_msg_cco m = message_cco()
m.for_allreduce(sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Iallreduce(
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, self.ob_mpi, &request.ob_mpi) )
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ireduce_scatter_block(self, sendbuf, recvbuf, Op op=SUM,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ireduce_scatter_block(self, sendbuf, recvbuf, Op op=SUM):
"""
Nonblocking Reduce-Scatter Block (regular, non-vector version)
"""
cdef _p_msg_cco m = message_cco()
m.for_reduce_scatter_block(sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ireduce_scatter_block(
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, self.ob_mpi, &request.ob_mpi) )
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ireduce_scatter(self, sendbuf, recvbuf, recvcounts=None, Op op=SUM,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ireduce_scatter(self, sendbuf, recvbuf, recvcounts=None, Op op=SUM):
"""
Nonblocking Reduce-Scatter (vector version)
"""
cdef _p_msg_cco m = message_cco()
m.for_reduce_scatter(sendbuf, recvbuf,
recvcounts, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ireduce_scatter(
m.sbuf, m.rbuf, m.rcounts, m.rtype,
op.ob_mpi, self.ob_mpi, &request.ob_mpi) )
@@ -1405,25 +1210,22 @@ cdef class Comm:
# Python Communication
# --------------------
#
- @send_for_nlcpy_array
def send(self, obj, int dest, int tag=0):
"""Send"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_send(obj, dest, tag, comm)
#
- @send_for_nlcpy_array
+ @raise_notimpl_for_vai_buffer
def bsend(self, obj, int dest, int tag=0):
"""Send in buffered mode"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_bsend(obj, dest, tag, comm)
#
- @send_for_nlcpy_array
def ssend(self, obj, int dest, int tag=0):
"""Send in synchronous mode"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_ssend(obj, dest, tag, comm)
#
- @recv_for_nlcpy_array
def recv(self, buf=None, int source=ANY_SOURCE, int tag=ANY_TAG,
Status status=None):
"""Receive"""
@@ -1431,8 +1233,6 @@ cdef class Comm:
cdef MPI_Status *statusp = arg_Status(status)
return PyMPI_recv(buf, source, tag, comm, statusp)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def sendrecv(self, sendobj, int dest, int sendtag=0,
recvbuf=None, int source=ANY_SOURCE, int recvtag=ANY_TAG,
Status status=None):
@@ -1443,7 +1243,6 @@ cdef class Comm:
recvbuf, source, recvtag,
comm, statusp)
#
- @send_for_nlcpy_array
def isend(self, obj, int dest, int tag=0):
"""Nonblocking send"""
cdef MPI_Comm comm = self.ob_mpi
@@ -1451,7 +1250,7 @@ cdef class Comm:
request.ob_buf = PyMPI_isend(obj, dest, tag, comm, &request.ob_mpi)
return request
#
- @send_for_nlcpy_array
+ @raise_notimpl_for_vai_buffer
def ibsend(self, obj, int dest, int tag=0):
"""Nonblocking send in buffered mode"""
cdef MPI_Comm comm = self.ob_mpi
@@ -1459,7 +1258,6 @@ cdef class Comm:
request.ob_buf = PyMPI_ibsend(obj, dest, tag, comm, &request.ob_mpi)
return request
#
- @send_for_nlcpy_array
def issend(self, obj, int dest, int tag=0):
"""Nonblocking send in synchronous mode"""
cdef MPI_Comm comm = self.ob_mpi
@@ -1467,18 +1265,10 @@ cdef class Comm:
request.ob_buf = PyMPI_issend(obj, dest, tag, comm, &request.ob_mpi)
return request
#
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def irecv(self, buf=None, int source=ANY_SOURCE, int tag=ANY_TAG,
- numpy_arr=None, nlcpy_arr=None):
+ def irecv(self, buf=None, int source=ANY_SOURCE, int tag=ANY_TAG):
"""Nonblocking receive"""
cdef MPI_Comm comm = self.ob_mpi
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
request.ob_buf = PyMPI_irecv(buf, source, tag, comm, &request.ob_mpi)
return request
#
@@ -1523,51 +1313,37 @@ cdef class Comm:
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_barrier(comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def bcast(self, obj, int root=0):
"""Broadcast"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_bcast(obj, root, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def gather(self, sendobj, int root=0):
"""Gather"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_gather(sendobj, root, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def scatter(self, sendobj, int root=0):
"""Scatter"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_scatter(sendobj, root, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def allgather(self, sendobj):
"""Gather to All"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_allgather(sendobj, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def alltoall(self, sendobj):
"""All to All Scatter/Gather"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_alltoall(sendobj, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def reduce(self, sendobj, op=SUM, int root=0):
"""Reduce"""
if op is None: op = SUM
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_reduce(sendobj, op, root, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def allreduce(self, sendobj, op=SUM):
"""Reduce to All"""
if op is None: op = SUM
@@ -1731,7 +1507,6 @@ cdef class Intracomm(Comm):
# Inclusive Scan
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Scan(self, sendbuf, recvbuf, Op op=SUM):
"""
Inclusive Scan
@@ -1744,7 +1519,6 @@ cdef class Intracomm(Comm):
# Exclusive Scan
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Exscan(self, sendbuf, recvbuf, Op op=SUM):
"""
Exclusive Scan
@@ -1757,47 +1531,25 @@ cdef class Intracomm(Comm):
# Nonblocking
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Iscan(self, sendbuf, recvbuf, Op op=SUM,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Iscan(self, sendbuf, recvbuf, Op op=SUM):
"""
Inclusive Scan
"""
cdef _p_msg_cco m = message_cco()
m.for_scan(sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Iscan(
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, self.ob_mpi, &request.ob_mpi) )
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Iexscan(self, sendbuf, recvbuf, Op op=SUM,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Iexscan(self, sendbuf, recvbuf, Op op=SUM):
"""
Exclusive Scan
"""
cdef _p_msg_cco m = message_cco()
m.for_exscan(sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Iexscan(
m.sbuf, m.rbuf, m.rcount, m.rtype,
op.ob_mpi, self.ob_mpi, &request.ob_mpi) )
@@ -1805,16 +1557,12 @@ cdef class Intracomm(Comm):
# Python Communication
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def scan(self, sendobj, op=SUM):
"""Inclusive Scan"""
if op is None: op = SUM
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_scan(sendobj, op, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def exscan(self, sendobj, op=SUM):
"""Exclusive Scan"""
if op is None: op = SUM
@@ -2013,7 +1761,6 @@ cdef class Topocomm(Intracomm):
# Neighborhood Collectives
# ------------------------
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Neighbor_allgather(self, sendbuf, recvbuf):
"""
Neighbor Gather to All
@@ -2025,7 +1772,6 @@ cdef class Topocomm(Intracomm):
m.rbuf, m.rcount, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Neighbor_allgatherv(self, sendbuf, recvbuf):
"""
Neighbor Gather to All Vector
@@ -2037,7 +1783,6 @@ cdef class Topocomm(Intracomm):
m.rbuf, m.rcounts, m.rdispls, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Neighbor_alltoall(self, sendbuf, recvbuf):
"""
Neighbor All-to-All
@@ -2049,7 +1794,6 @@ cdef class Topocomm(Intracomm):
m.rbuf, m.rcount, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Neighbor_alltoallv(self, sendbuf, recvbuf):
"""
Neighbor All-to-All Vector
@@ -2061,7 +1805,6 @@ cdef class Topocomm(Intracomm):
m.rbuf, m.rcounts, m.rdispls, m.rtype,
self.ob_mpi) )
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Neighbor_alltoallw(self, sendbuf, recvbuf):
"""
Neighbor All-to-All Generalized
@@ -2076,24 +1819,13 @@ cdef class Topocomm(Intracomm):
# Nonblocking Neighborhood Collectives
# ------------------------------------
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ineighbor_allgather(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ineighbor_allgather(self, sendbuf, recvbuf):
"""
Nonblocking Neighbor Gather to All
"""
cdef _p_msg_cco m = message_cco()
m.for_neighbor_allgather(0, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ineighbor_allgather(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcount, m.rtype,
@@ -2101,24 +1833,13 @@ cdef class Topocomm(Intracomm):
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ineighbor_allgatherv(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ineighbor_allgatherv(self, sendbuf, recvbuf):
"""
Nonblocking Neighbor Gather to All Vector
"""
cdef _p_msg_cco m = message_cco()
m.for_neighbor_allgather(1, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ineighbor_allgatherv(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcounts, m.rdispls, m.rtype,
@@ -2126,24 +1847,13 @@ cdef class Topocomm(Intracomm):
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ineighbor_alltoall(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ineighbor_alltoall(self, sendbuf, recvbuf):
"""
Nonblocking Neighbor All-to-All
"""
cdef _p_msg_cco m = message_cco()
m.for_neighbor_alltoall(0, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ineighbor_alltoall(
m.sbuf, m.scount, m.stype,
m.rbuf, m.rcount, m.rtype,
@@ -2151,24 +1861,13 @@ cdef class Topocomm(Intracomm):
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ineighbor_alltoallv(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ineighbor_alltoallv(self, sendbuf, recvbuf):
"""
Nonblocking Neighbor All-to-All Vector
"""
cdef _p_msg_cco m = message_cco()
m.for_neighbor_alltoall(1, sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ineighbor_alltoallv(
m.sbuf, m.scounts, m.sdispls, m.stype,
m.rbuf, m.rcounts, m.rdispls, m.rtype,
@@ -2176,24 +1875,13 @@ cdef class Topocomm(Intracomm):
request.ob_buf = m
return request
- @nb_sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
- def Ineighbor_alltoallw(self, sendbuf, recvbuf,
- send_nlcpy_arr=None, send_numpy_arr=None,
- recv_nlcpy_arr=None, recv_numpy_arr=None):
+ def Ineighbor_alltoallw(self, sendbuf, recvbuf):
"""
Nonblocking Neighbor All-to-All Generalized
"""
cdef _p_msg_ccow m = message_ccow()
m.for_neighbor_alltoallw(sendbuf, recvbuf, self.ob_mpi)
- cdef Request request
- if send_nlcpy_arr is None and recv_nlcpy_arr is None:
- request = Request.__new__(Request)
- else:
- numpy_arrays=[send_numpy_arr, recv_numpy_arr]
- nlcpy_arrays=[send_nlcpy_arr, recv_nlcpy_arr]
- request = Request.__new__(Request,
- numpy_arr=numpy_arrays,
- nlcpy_arr=nlcpy_arrays)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Ineighbor_alltoallw(
m.sbuf, m.scounts, m.sdisplsA, m.stypes,
m.rbuf, m.rcounts, m.rdisplsA, m.rtypes,
@@ -2203,15 +1891,11 @@ cdef class Topocomm(Intracomm):
# Python Communication
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def neighbor_allgather(self, sendobj):
"""Neighbor Gather to All"""
cdef MPI_Comm comm = self.ob_mpi
return PyMPI_neighbor_allgather(sendobj, comm)
#
- @recv_for_nlcpy_array
- @send_for_nlcpy_array
def neighbor_alltoall(self, sendobj):
"""Neighbor All to All Scatter/Gather"""
cdef MPI_Comm comm = self.ob_mpi
@@ -2613,8 +2297,7 @@ COMM_WORLD = __COMM_WORLD__ #: World communicator handle
BSEND_OVERHEAD = MPI_BSEND_OVERHEAD
#: Upper bound of memory overhead for sending in buffered mode
-
-@send_for_nlcpy_array
+@raise_notimpl_for_vai_buffer
def Attach_buffer(buf):
"""
Attach a user-provided buffer for sending in buffered mode
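With the ``NLCPy.pyx`` decorators removed, the communication methods above consume NLCPy arrays directly (via the VE Array Interface) instead of staging them through NumPy copies, while buffered-mode sends reject such buffers through ``raise_notimpl_for_vai_buffer``. A minimal sketch of the resulting usage, assuming two MPI processes and NLCPy installed:

::

    from mpi4pyve import MPI
    import nlcpy as vp

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        x = vp.arange(10, dtype='f8')   # array resident on the Vector Engine
        comm.Send(x, dest=1, tag=0)
    elif rank == 1:
        y = vp.empty(10, dtype='f8')
        comm.Recv(y, source=0, tag=0)
        print(y)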
diff --git a/src/mpi4pyve/MPI/Datatype.pyx b/src/mpi4pyve/MPI/Datatype.pyx
index 9bf47fc..99fca08 100644
--- a/src/mpi4pyve/MPI/Datatype.pyx
+++ b/src/mpi4pyve/MPI/Datatype.pyx
@@ -95,7 +95,6 @@ COMBINER_F90_COMPLEX = MPI_COMBINER_F90_COMPLEX
COMBINER_F90_INTEGER = MPI_COMBINER_F90_INTEGER
-include "NLCPy.pyx"
cdef class Datatype:
@@ -689,7 +688,6 @@ cdef class Datatype:
# Pack and Unpack
# ---------------
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
def Pack(self, inbuf, outbuf, int position, Comm comm):
"""
Pack into contiguous memory according to datatype.
@@ -708,7 +706,6 @@ cdef class Datatype:
&position, comm.ob_mpi) )
return position
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 3))
def Unpack(self, inbuf, int position, outbuf, Comm comm):
"""
Unpack from contiguous memory according to datatype.
@@ -740,7 +737,7 @@ cdef class Datatype:
# Canonical Pack and Unpack
# -------------------------
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(2, 3))
+ @raise_notimpl_for_vai_buffer
def Pack_external(self, datarep, inbuf, outbuf, Aint position):
"""
Pack into contiguous memory according to datatype, using a portable data representation (external32)
@@ -762,7 +759,7 @@ cdef class Datatype:
obptr, oblen, &position) )
return position
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(2, 4))
+ @raise_notimpl_for_vai_buffer
def Unpack_external(self, datarep, inbuf, Aint position, outbuf):
"""
Unpack from contiguous memory according to datatype, using a portable data representation (external32)
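Plain ``Pack``/``Unpack`` keep working on host buffers without the removed wrappers; only the external32 variants now raise ``NotImplementedError`` for VE buffers. A sketch of the pack/unpack round trip with NumPy (API signatures as in mpi4py):

::

    from mpi4pyve import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    data = np.arange(4, dtype='i')
    packed = bytearray(MPI.INT.Pack_size(data.size, comm))

    # pack into contiguous memory, then unpack into a fresh array
    position = MPI.INT.Pack(data, packed, 0, comm)
    out = np.empty_like(data)
    MPI.INT.Unpack(packed, 0, out, comm)
    assert (out == data).all()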
diff --git a/src/mpi4pyve/MPI/File.pyx b/src/mpi4pyve/MPI/File.pyx
index bd35df6..36c3060 100644
--- a/src/mpi4pyve/MPI/File.pyx
+++ b/src/mpi4pyve/MPI/File.pyx
@@ -283,7 +283,6 @@ cdef class File:
# Data Access with Explicit Offsets
# ---------------------------------
- @recv_buffer_for_nlcpy_array(arg_idx=2)
def Read_at(self, Offset offset, buf, Status status=None):
"""
Read using explicit offset
@@ -293,7 +292,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_at(
self.ob_mpi, offset, m.buf, m.count, m.dtype, statusp) )
- @recv_buffer_for_nlcpy_array(arg_idx=2)
def Read_at_all(self, Offset offset, buf, Status status=None):
"""
Collective read using explicit offset
@@ -303,7 +301,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_at_all(
self.ob_mpi, offset, m.buf, m.count, m.dtype, statusp) )
- @send_for_nlcpy_array
def Write_at(self, Offset offset, buf, Status status=None):
"""
Write using explicit offset
@@ -313,7 +310,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_write_at(
self.ob_mpi, offset, m.buf, m.count, m.dtype, statusp) )
- @send_for_nlcpy_array
def Write_at_all(self, Offset offset, buf, Status status=None):
"""
Collective write using explicit offset
@@ -323,58 +319,40 @@ cdef class File:
with nogil: CHKERR( MPI_File_write_at_all(
self.ob_mpi, offset, m.buf, m.count, m.dtype, statusp) )
- @nb_recv_for_nlcpy_array(arg_idx=2)
- def Iread_at(self, Offset offset, buf,
- numpy_arr=None, nlcpy_arr=None):
+ def Iread_at(self, Offset offset, buf):
"""
Nonblocking read using explicit offset
"""
cdef _p_msg_io m = message_io_read(buf)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iread_at(
self.ob_mpi, offset, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @nb_recv_for_nlcpy_array(arg_idx=2)
- def Iread_at_all(self, Offset offset, buf,
- numpy_arr=None, nlcpy_arr=None):
+ def Iread_at_all(self, Offset offset, buf):
"""
Nonblocking collective read using explicit offset
"""
cdef _p_msg_io m = message_io_read(buf)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iread_at_all(
self.ob_mpi, offset, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @send_for_nlcpy_array
def Iwrite_at(self, Offset offset, buf):
"""
Nonblocking write using explicit offset
"""
cdef _p_msg_io m = message_io_write(buf)
- cdef Request request = Request.__new__(Request)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iwrite_at(
self.ob_mpi, offset, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @send_for_nlcpy_array
def Iwrite_at_all(self, Offset offset, buf):
"""
Nonblocking collective write using explicit offset
@@ -389,7 +367,6 @@ cdef class File:
# Data Access with Individual File Pointers
# -----------------------------------------
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read(self, buf, Status status=None):
"""
Read using individual file pointer
@@ -399,7 +376,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read(
self.ob_mpi, m.buf, m.count, m.dtype, statusp) )
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_all(self, buf, Status status=None):
"""
Collective read using individual file pointer
@@ -409,7 +385,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_all(
self.ob_mpi, m.buf, m.count, m.dtype, statusp) )
- @send_for_nlcpy_array
def Write(self, buf, Status status=None):
"""
Write using individual file pointer
@@ -419,7 +394,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_write(
self.ob_mpi, m.buf, m.count, m.dtype, statusp) )
- @send_for_nlcpy_array
def Write_all(self, buf, Status status=None):
"""
Collective write using individual file pointer
@@ -429,57 +403,39 @@ cdef class File:
with nogil: CHKERR( MPI_File_write_all(
self.ob_mpi, m.buf, m.count, m.dtype, statusp) )
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def Iread(self, buf,
- numpy_arr=None, nlcpy_arr=None):
+ def Iread(self, buf):
"""
Nonblocking read using individual file pointer
"""
cdef _p_msg_io m = message_io_read(buf)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iread(
self.ob_mpi, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def Iread_all(self, buf,
- numpy_arr=None, nlcpy_arr=None):
+ def Iread_all(self, buf):
"""
Nonblocking collective read using individual file pointer
"""
cdef _p_msg_io m = message_io_read(buf)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iread_all(
self.ob_mpi, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @send_for_nlcpy_array
def Iwrite(self, buf):
"""
Nonblocking write using individual file pointer
"""
cdef _p_msg_io m = message_io_write(buf)
- cdef Request request = Request.__new__(Request)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iwrite(
self.ob_mpi, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @send_for_nlcpy_array
def Iwrite_all(self, buf):
"""
Nonblocking collective write using individual file pointer
@@ -519,7 +475,6 @@ cdef class File:
# Data Access with Shared File Pointers
# -------------------------------------
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_shared(self, buf, Status status=None):
"""
Read using shared file pointer
@@ -529,7 +484,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_shared(
self.ob_mpi, m.buf, m.count, m.dtype, statusp) )
- @send_for_nlcpy_array
def Write_shared(self, buf, Status status=None):
"""
Write using shared file pointer
@@ -539,38 +493,28 @@ cdef class File:
with nogil: CHKERR( MPI_File_write_shared(
self.ob_mpi, m.buf, m.count, m.dtype, statusp) )
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def Iread_shared(self, buf,
- numpy_arr=None, nlcpy_arr=None):
+ def Iread_shared(self, buf):
"""
Nonblocking read using shared file pointer
"""
cdef _p_msg_io m = message_io_read(buf)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iread_shared(
self.ob_mpi, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @send_for_nlcpy_array
def Iwrite_shared(self, buf):
"""
Nonblocking write using shared file pointer
"""
cdef _p_msg_io m = message_io_write(buf)
- cdef Request request = Request.__new__(Request)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_File_iwrite_shared(
self.ob_mpi, m.buf, m.count, m.dtype, &request.ob_mpi) )
request.ob_buf = m
return request
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_ordered(self, buf, Status status=None):
"""
Collective read using shared file pointer
@@ -580,7 +524,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_ordered(
self.ob_mpi, m.buf, m.count, m.dtype, statusp) )
- @send_for_nlcpy_array
def Write_ordered(self, buf, Status status=None):
"""
Collective write using shared file pointer
@@ -612,7 +555,6 @@ cdef class File:
# explicit offset
- @recv_buffer_for_nlcpy_array(arg_idx=2)
def Read_at_all_begin(self, Offset offset, buf):
"""
Start a split collective read using explicit offset
@@ -621,7 +563,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_at_all_begin(
self.ob_mpi, offset, m.buf, m.count, m.dtype) )
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_at_all_end(self, buf, Status status=None):
"""
Complete a split collective read using explicit offset
@@ -631,7 +572,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_at_all_end(
self.ob_mpi, m.buf, statusp) )
- @send_for_nlcpy_array
def Write_at_all_begin(self, Offset offset, buf):
"""
Start a split collective write using explicit offset
@@ -640,7 +580,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_write_at_all_begin(
self.ob_mpi, offset, m.buf, m.count, m.dtype) )
- @send_for_nlcpy_array
def Write_at_all_end(self, buf, Status status=None):
"""
Complete a split collective write using explicit offset
@@ -652,7 +591,6 @@ cdef class File:
# individual file pointer
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_all_begin(self, buf):
"""
Start a split collective read
@@ -662,7 +600,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_all_begin(
self.ob_mpi, m.buf, m.count, m.dtype) )
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_all_end(self, buf, Status status=None):
"""
Complete a split collective read
@@ -673,7 +610,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_all_end(
self.ob_mpi, m.buf, statusp) )
- @send_for_nlcpy_array
def Write_all_begin(self, buf):
"""
Start a split collective write
@@ -683,7 +619,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_write_all_begin(
self.ob_mpi, m.buf, m.count, m.dtype) )
- @send_for_nlcpy_array
def Write_all_end(self, buf, Status status=None):
"""
Complete a split collective write
@@ -696,7 +631,6 @@ cdef class File:
# shared file pointer
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_ordered_begin(self, buf):
"""
Start a split collective read
@@ -706,7 +640,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_ordered_begin(
self.ob_mpi, m.buf, m.count, m.dtype) )
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Read_ordered_end(self, buf, Status status=None):
"""
Complete a split collective read
@@ -717,7 +650,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_read_ordered_end(
self.ob_mpi, m.buf, statusp) )
- @send_for_nlcpy_array
def Write_ordered_begin(self, buf):
"""
Start a split collective write using shared file pointer
@@ -727,7 +659,6 @@ cdef class File:
with nogil: CHKERR( MPI_File_write_ordered_begin(
self.ob_mpi, m.buf, m.count, m.dtype) )
- @send_for_nlcpy_array
def Write_ordered_end(self, buf, Status status=None):
"""
Complete a split collective write
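The file I/O methods follow the same pattern: the receive-side wrappers are gone and reads and writes operate on the buffer protocol directly. A sketch of a collective write at explicit offsets (mpi4py-style API; the file name is an assumption):

::

    from mpi4pyve import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    buf = np.full(4, rank, dtype='i4')
    fh = MPI.File.Open(comm, 'out.dat', MPI.MODE_CREATE | MPI.MODE_WRONLY)
    # each rank writes its block at a disjoint explicit offset
    fh.Write_at_all(rank * buf.nbytes, buf)
    fh.Close()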
diff --git a/src/mpi4pyve/MPI/MPI.pyx b/src/mpi4pyve/MPI/MPI.pyx
index cb80d09..cbc82b9 100644
--- a/src/mpi4pyve/MPI/MPI.pyx
+++ b/src/mpi4pyve/MPI/MPI.pyx
@@ -64,6 +64,7 @@ include "atimport.pxi"
bootstrap()
initialize()
+set_mpi_local_size()
include "asstring.pxi"
include "asbuffer.pxi"
@@ -127,6 +128,7 @@ WIN_MODEL = MPI_WIN_MODEL
include "Exception.pyx"
include "Errhandler.pyx"
+include "Notimpl.pyx"
include "Datatype.pyx"
include "Status.pyx"
include "Request.pyx"
@@ -137,6 +139,8 @@ include "Group.pyx"
include "Comm.pyx"
include "Win.pyx"
include "File.pyx"
+include "Util.pyx"
+include "Veo.pyx"
# Memory Allocation
@@ -168,6 +172,7 @@ def Init():
"""
CHKERR( MPI_Init(NULL, NULL) )
initialize()
+ set_mpi_local_size()
def Finalize():
"""
@@ -191,13 +196,18 @@ THREAD_SERIALIZED = MPI_THREAD_SERIALIZED
THREAD_MULTIPLE = MPI_THREAD_MULTIPLE
#: Multiple threads may call MPI
-def Init_thread(int required=THREAD_MULTIPLE):
+def Init_thread(int required=THREAD_SERIALIZED):
"""
Initialize the MPI execution environment
"""
cdef int provided = MPI_THREAD_SINGLE
+
+ if required == THREAD_MULTIPLE:
+ PyErr_WarnEx(UserWarning, b"MPI_THREAD_MULTIPLE cannot be used with NEC MPI", 1)
+
CHKERR( MPI_Init_thread(NULL, NULL, required, &provided) )
initialize()
+ set_mpi_local_size()
return provided
def Query_thread():
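Since NEC MPI does not support ``MPI_THREAD_MULTIPLE``, the default requested level drops to ``THREAD_SERIALIZED`` and requesting the multiple-threads level only emits a warning. A sketch of verifying the granted level at run time:

::

    from mpi4pyve import MPI

    # MPI is initialized on import; query the thread level actually granted
    provided = MPI.Query_thread()
    if provided < MPI.THREAD_SERIALIZED:
        raise RuntimeError('insufficient MPI thread support: %d' % provided)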
diff --git a/src/mpi4pyve/MPI/Message.pyx b/src/mpi4pyve/MPI/Message.pyx
index a0818b8..fc7169a 100644
--- a/src/mpi4pyve/MPI/Message.pyx
+++ b/src/mpi4pyve/MPI/Message.pyx
@@ -117,7 +117,6 @@ cdef class Message:
# Matched receives
# ----------------
- @recv_buffer_for_nlcpy_array(arg_idx=1)
def Recv(self, buf, Status status=None):
"""
Blocking receive of matched message
@@ -134,8 +133,7 @@ cdef class Message:
if self is not __MESSAGE_NO_PROC__:
self.ob_mpi = message
- @nb_recv_for_nlcpy_array(arg_idx=1)
- def Irecv(self, buf, numpy_arr=None, nlcpy_arr=None):
+ def Irecv(self, buf):
"""
Nonblocking receive of matched message
"""
@@ -144,13 +142,7 @@ cdef class Message:
if message == MPI_MESSAGE_NO_PROC:
source = MPI_PROC_NULL
cdef _p_msg_p2p rmsg = message_p2p_recv(buf, source)
- cdef Request request
- if numpy_arr is None:
- request = Request.__new__(Request)
- else:
- request = Request.__new__(Request,
- numpy_arr=numpy_arr,
- nlcpy_arr=nlcpy_arr)
+ cdef Request request = Request.__new__(Request)
with nogil: CHKERR( MPI_Imrecv(
rmsg.buf, rmsg.count, rmsg.dtype,
&message, &request.ob_mpi) )
@@ -184,7 +176,6 @@ cdef class Message:
if flag == 0: return None
return message
#
- @recv_for_nlcpy_array
def recv(self, Status status=None):
"""Blocking receive of matched message"""
cdef object rmsg = self.ob_buf
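``Message.Irecv`` likewise loses its staging keywords and returns a plain ``Request``. A matched probe/receive sketch (two processes assumed):

::

    from mpi4pyve import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        comm.Send(np.arange(4, dtype='i4'), dest=1, tag=7)
    elif comm.Get_rank() == 1:
        msg = comm.Mprobe(source=0, tag=7)   # match the incoming message first
        buf = np.empty(4, dtype='i4')
        msg.Irecv(buf).Wait()                # nonblocking receive of the matched message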
diff --git a/src/mpi4pyve/MPI/NLCPy.pyx b/src/mpi4pyve/MPI/NLCPy.pyx
deleted file mode 100644
index 25249f0..0000000
--- a/src/mpi4pyve/MPI/NLCPy.pyx
+++ /dev/null
@@ -1,312 +0,0 @@
-### mpi4py-ve License ##
-#
-# Copyright (c) 2022, NEC Corporation. All rights reserved.
-#
-# Redistribution and use in source and binary forms, with or without modification,
-# are permitted provided that the following conditions are met:
-#
-# * Redistributions of source code must retain the above copyright notice, this
-# list of conditions and the following disclaimer.
-# * Redistributions in binary form must reproduce the above copyright notice, this
-# list of conditions and the following disclaimer listed in this license in the
-# documentation and/or other materials provided with the distribution.
-#
-# The copyright holders provide no reassurances that the source code provided does not
-# infringe any patent, copyright, or any other intellectual property rights of third
-# parties. The copyright holders disclaim any liability to any recipient for claims
-# brought against recipient by any third party for infringement of that parties
-# intellectual property rights.
-#
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
-# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
-# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
-# SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
-# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
-# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
-# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANYTHEORY OF LIABILITY, WHETHER IN CONTRACT,
-# STRICT LIABILITY, OR TORT(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
-# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-
-import numpy
-import nlcpy
-import mpi4pyve
-
-
-def _replace_nlcpy_to_numpy(args):
- if args is None:
- return args
- _type = type(args)
- _args = list(args)
- for i, arg in enumerate(_args):
- if isinstance(arg, nlcpy.core.core.ndarray):
- _args[i] = numpy.asarray(arg)
- elif isinstance(arg, (list, tuple)):
- _args[i] = _replace_nlcpy_to_numpy(arg)
- return _type(_args)
-
-
-def _replace_nlcpy_to_numpy_kwargs(kwargs):
- for k in kwargs.keys():
- if isinstance(kwargs[k], nlcpy.core.core.ndarray):
- kwargs[k] = numpy.asarray(kwargs[k])
- elif isinstance(kwargs[k], (list, tuple)):
- kwargs[k] = _replace_nlcpy_to_numpy(kwargs[k])
- return kwargs
-
-
-def _undo_numpy_to_nlcpy(args):
- if args is None:
- return args
- _type = type(args)
- _args = list(args)
- for i, arg in enumerate(_args):
- if isinstance(arg, numpy.ndarray):
- _args[i] = nlcpy.asarray(arg)
- elif isinstance(arg, (list, tuple)):
- _args[i] = _undo_numpy_to_nlcpy(arg)
- return _type(_args)
-
-
-def _undo_numpy_to_nlcpy_kwargs(kwargs):
- for k in kwargs.keys():
- if isinstance(kwargs[k], nlcpy.core.core.ndarray):
- kwargs[k] = nlcpy.asarray(kwargs[k])
- elif isinstance(kwargs[k], (list, tuple)):
- kwargs[k] = _undo_numpy_to_nlcpy(kwargs[k])
- return kwargs
-
-
-def send_for_nlcpy_array(send_func):
- def _get_numpy_array_wrapper(*args, **kwargs):
- args = _replace_nlcpy_to_numpy(args)
- kwargs = _replace_nlcpy_to_numpy_kwargs(kwargs)
- return send_func(*args, **kwargs)
- return _get_numpy_array_wrapper
-
-
-def recv_for_nlcpy_array(recv_func):
- def _get_nlcpy_array_wrapper(*args, **kwargs):
- result = recv_func(*args, **kwargs)
- if isinstance(result, (list, tuple)):
- result = _undo_numpy_to_nlcpy(result)
- elif isinstance(result, numpy.ndarray):
- result = nlcpy.asarray(result)
- return result
- return _get_nlcpy_array_wrapper
-
-
-def nb_recv_for_nlcpy_array(arg_idx):
- def _nb_recv_for_nlcpy_array(recv_func):
- def _get_nlcpy_array_wrapper(*args, **kwargs):
- if len(args) > arg_idx:
- if isinstance(args[arg_idx], (list, tuple)):
- if isinstance(args[arg_idx][0], nlcpy.core.core.ndarray):
- kwargs["nlcpy_arr"] = args[arg_idx][0]
- args = _replace_nlcpy_to_numpy(args)
- kwargs["numpy_arr"] = args[arg_idx][0]
- elif isinstance(args[arg_idx], nlcpy.core.core.ndarray):
- kwargs["nlcpy_arr"] = args[arg_idx]
- args = _replace_nlcpy_to_numpy(args)
- kwargs["numpy_arr"] = args[arg_idx]
- return recv_func(*args, **kwargs)
- return _get_nlcpy_array_wrapper
- return _nb_recv_for_nlcpy_array
-
-
-def recv_buffer_for_nlcpy_array(arg_idx):
- def _recv_buffer_for_nlcpy_array(recv_func):
- def _get_nlcpy_array_wrapper(*args, **kwargs):
- tmp_array, tmp_buf, tmp_buftype = None, None, None
- if len(args) > arg_idx:
- if isinstance(args[arg_idx], (list, tuple)):
- if isinstance(args[arg_idx][0], nlcpy.core.core.ndarray):
- tmp_buftype = type(args[arg_idx])
- tmp_buf = list(args[arg_idx])
- tmp_array = args[arg_idx][0]
- elif isinstance(args[arg_idx], nlcpy.core.core.ndarray):
- tmp_array = args[arg_idx]
- args = _replace_nlcpy_to_numpy(args)
- recv_func(*args, **kwargs)
- if tmp_array is not None:
- args = _undo_numpy_to_nlcpy(args)
- if tmp_buf is not None:
- tmp_array[:len(args[arg_idx][0])] = args[arg_idx][0]
- tmp_buf[0] = tmp_array
- _args = list(args)
- _args[arg_idx] = tmp_buftype(tmp_buf)
- args = tuple(_args)
- else:
- tmp_array[:len(args[arg_idx])] = args[arg_idx]
- _args = list(args)
- _args[arg_idx] = tmp_array
- args = tuple(_args)
- return _get_nlcpy_array_wrapper
- return _recv_buffer_for_nlcpy_array
-
-
-def sendrecv_buffer_for_nlcpy_array(arg_idx):
- def _sendrecv_buffer_for_nlcpy_array(sendrecv_func):
- def _get_nlcpy_array_wrapper(*args, **kwargs):
- tmp_send_array, tmp_send_buf, tmp_send_buftype = None, None, None
- tmp_recv_array, tmp_recv_buf, tmp_recv_buftype = None, None, None
- if len(args) > arg_idx[0]:
- if isinstance(args[arg_idx[0]], (list, tuple)):
- if isinstance(args[arg_idx[0]][0], nlcpy.core.core.ndarray):
- tmp_send_buftype = type(args[arg_idx[0]])
- tmp_send_buf = list(args[arg_idx[0]])
- tmp_send_array = args[arg_idx[0]][0]
- elif isinstance(args[arg_idx[0]], nlcpy.core.core.ndarray):
- tmp_send_array = args[arg_idx[0]]
- if len(args) > arg_idx[1]:
- if isinstance(args[arg_idx[1]], (list, tuple)):
- if isinstance(args[arg_idx[1]][0], nlcpy.core.core.ndarray):
- tmp_recv_buftype = type(args[arg_idx[1]])
- tmp_recv_buf = list(args[arg_idx[1]])
- tmp_recv_array = args[arg_idx[1]][0]
- elif isinstance(args[arg_idx[1]], nlcpy.core.core.ndarray):
- tmp_recv_array = args[arg_idx[1]]
- args = _replace_nlcpy_to_numpy(args)
- result = sendrecv_func(*args, **kwargs)
- if tmp_send_array is not None or tmp_recv_array is not None:
- if tmp_send_array is not None and tmp_send_buf is None:
- tmp_send_array[:len(args[arg_idx[0]])] = args[arg_idx[0]]
- tmp_send_buf = tmp_send_array
- elif tmp_send_array is not None:
- tmp_send_array[:len(args[arg_idx[0]][0])] = args[arg_idx[0]][0]
- tmp_send_buf[0] = tmp_send_array
- tmp_send_buf = tmp_send_buftype(tmp_send_buf)
- if tmp_recv_array is not None and tmp_recv_buf is None:
- tmp_recv_array[:len(args[arg_idx[1]])] = args[arg_idx[1]]
- tmp_recv_buf = tmp_recv_array
- elif tmp_recv_array is not None:
- tmp_recv_array[:len(args[arg_idx[1]][0])] = args[arg_idx[1]][0]
- tmp_recv_buf[0] = tmp_recv_array
- tmp_recv_buf = tmp_recv_buftype(tmp_recv_buf)
- args = tuple((args[0], tmp_send_buf, tmp_recv_buf))
- return result
- return _get_nlcpy_array_wrapper
- return _sendrecv_buffer_for_nlcpy_array
-
-
-def nb_sendrecv_buffer_for_nlcpy_array(arg_idx):
- def _nb_sendrecv_buffer_for_nlcpy_array(recv_func):
- def _get_nlcpy_array_wrapper(*args, **kwargs):
- if (isinstance(args[arg_idx[0]], (list, tuple)) or
- (not isinstance(args[arg_idx[0]],
- (list, tuple, nlcpy.core.core.ndarray)) and
- ((args[arg_idx[0]] is None) or
- (args[arg_idx[0]] == mpi4pyve.MPI.IN_PLACE)))) and\
- (isinstance(args[arg_idx[1]], (list, tuple)) or
- (not isinstance(args[arg_idx[1]],
- (list, tuple, nlcpy.core.core.ndarray)) and
- ((args[arg_idx[1]] is None) or
- (args[arg_idx[1]] == mpi4pyve.MPI.IN_PLACE)))):
- kwargs["send_nlcpy_arr"] = args[arg_idx[0]][0]\
- if (isinstance(args[arg_idx[0]], (list, tuple)) and
- len(args[arg_idx[0]]) > 0 and
- isinstance(args[arg_idx[0]][0],
- nlcpy.core.core.ndarray)) else None
- kwargs["recv_nlcpy_arr"] = args[arg_idx[1]][0]\
- if (isinstance(args[arg_idx[1]], (list, tuple)) and
- len(args[arg_idx[1]]) > 0 and
- isinstance(args[arg_idx[1]][0],
- nlcpy.core.core.ndarray)) else None
- args = _replace_nlcpy_to_numpy(args)
- kwargs["send_numpy_arr"] = args[arg_idx[0]][0]\
- if kwargs["send_nlcpy_arr"] is not None else None
- kwargs["recv_numpy_arr"] = args[arg_idx[1]][0]\
- if kwargs["recv_nlcpy_arr"] is not None else None
-
- elif (isinstance(args[arg_idx[0]], (list, tuple)) and
- isinstance(args[arg_idx[1]], nlcpy.core.core.ndarray)):
- kwargs["send_nlcpy_arr"] = args[arg_idx[0]][0]\
- if isinstance(args[arg_idx[0]][0], nlcpy.core.core.ndarray) else None
- kwargs["recv_nlcpy_arr"] = args[arg_idx[1]]\
- if isinstance(args[arg_idx[1]], nlcpy.core.core.ndarray) else None
- args = _replace_nlcpy_to_numpy(args)
- kwargs["send_numpy_arr"] = args[arg_idx[0]][0]\
- if kwargs["send_nlcpy_arr"] is not None else None
- kwargs["recv_numpy_arr"] = args[arg_idx[1]]\
- if kwargs["recv_nlcpy_arr"] is not None else None
-
- elif (isinstance(args[arg_idx[0]], nlcpy.core.core.ndarray) and
- isinstance(args[arg_idx[1]], (list, tuple))):
- kwargs["send_nlcpy_arr"] = args[arg_idx[0]]\
- if isinstance(args[arg_idx[0]], nlcpy.core.core.ndarray) else None
- kwargs["recv_nlcpy_arr"] = args[arg_idx[1]][0]\
- if isinstance(args[arg_idx[1]][0], nlcpy.core.core.ndarray) else None
- args = _replace_nlcpy_to_numpy(args)
- kwargs["send_numpy_arr"] = args[arg_idx[0]]\
- if kwargs["send_nlcpy_arr"] is not None else None
- kwargs["recv_numpy_arr"] = args[arg_idx[1]][0]\
- if kwargs["recv_nlcpy_arr"] is not None else None
-
- elif (isinstance(args[arg_idx[0]], nlcpy.core.core.ndarray) or
- args[arg_idx[0]] is None or
- args[arg_idx[0]] == mpi4pyve.MPI.IN_PLACE) and\
- (isinstance(args[arg_idx[1]], nlcpy.core.core.ndarray) or
- args[arg_idx[1]] is None or args[arg_idx[1]] == mpi4pyve.MPI.IN_PLACE):
- kwargs["send_nlcpy_arr"] = args[arg_idx[0]]\
- if isinstance(args[arg_idx[0]], nlcpy.core.core.ndarray) else None
- kwargs["recv_nlcpy_arr"] = args[arg_idx[1]]\
- if isinstance(args[arg_idx[1]], nlcpy.core.core.ndarray) else None
- args = _replace_nlcpy_to_numpy(args)
- kwargs["send_numpy_arr"] = args[arg_idx[0]]\
- if kwargs["send_nlcpy_arr"] is not None else None
- kwargs["recv_numpy_arr"] = args[arg_idx[1]]\
- if kwargs["recv_nlcpy_arr"] is not None else None
-
- return recv_func(*args, **kwargs)
- return _get_nlcpy_array_wrapper
- return _nb_sendrecv_buffer_for_nlcpy_array
-
-
-def sendrecv_buffer_kwarg_for_nlcpy_array(recv_func):
- def _get_nlcpy_array_wrapper(*args, **kwargs):
- tmp_send_array = None
- tmp_send_buf = None
- tmp_recv_array = None
- tmp_recv_buf = None
- sendbuf, sendkey = (kwargs['sendbuf'], True)\
- if 'sendbuf' in kwargs else (args[0], False)
- recvbuf, recvkey = (kwargs['recvbuf'], True)\
- if 'recvbuf' in kwargs else (args[4], False)\
- if len(args) > 4 else (None, False)
-
- if isinstance(sendbuf, list):
- if isinstance(sendbuf[0], nlcpy.core.core.ndarray):
- tmp_send_array = sendbuf[0]
- tmp_send_buf = sendbuf
- elif isinstance(sendbuf, nlcpy.core.core.ndarray):
- tmp_send_array = sendbuf
-
- if recvbuf is not None:
- if isinstance(recvbuf, list):
- if isinstance(recvbuf[0], nlcpy.core.core.ndarray):
- tmp_recv_array = recvbuf[0]
- tmp_recv_buf = recvbuf
- elif isinstance(recvbuf, nlcpy.core.core.ndarray):
- tmp_recv_array = recvbuf
- args = _replace_nlcpy_to_numpy(args)
- kwargs = _replace_nlcpy_to_numpy_kwargs(kwargs)
-
- recv_func(*args, **kwargs)
-
- sendfact = kwargs['sendbuf'] if sendkey else args[0]
- recvfact = kwargs['recvbuf'] if recvkey else args[4]\
- if len(args) > 4 else None
- args = _undo_numpy_to_nlcpy(args)
- kwargs = _undo_numpy_to_nlcpy_kwargs(kwargs)
- if tmp_send_array is not None:
- sendfact = tmp_send_array if tmp_send_buf is None else tmp_send_buf # NOQA
- if tmp_recv_array is not None:
- if tmp_recv_buf is not None:
- tmp_recv_array[:len(recvfact[0])] = recvfact[0]
- tmp_recv_buf[0] = tmp_recv_array
- recvfact = tmp_recv_buf
- else:
- tmp_recv_array[:len(recvfact)] = recvfact
- recvfact = tmp_recv_array
- return
- return _get_nlcpy_array_wrapper
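
The module removed above implemented NLCPy support by copying every
nlcpy.ndarray argument into host (NumPy) memory before the underlying MPI
call and copying results back afterwards; the ``__ve_array_interface__``
buffer path added later in this patch (asvaibuf.pxi) makes those round trips
unnecessary. A minimal sketch of the retired convert-and-call pattern,
assuming nlcpy is importable (names are illustrative, not mpi4py-ve API;
top-level positional arguments only, for brevity)::

    import numpy
    import nlcpy

    def send_via_host_copy(send_func):
        def wrapper(*args, **kwargs):
            # Copy VE-resident arrays to host memory before handing to MPI.
            args = tuple(numpy.asarray(a) if isinstance(a, nlcpy.ndarray)
                         else a for a in args)
            return send_func(*args, **kwargs)
        return wrapper
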
diff --git a/src/mpi4pyve/MPI/Notimpl.pyx b/src/mpi4pyve/MPI/Notimpl.pyx
index 90dc254..d11cd74 100644
--- a/src/mpi4pyve/MPI/Notimpl.pyx
+++ b/src/mpi4pyve/MPI/Notimpl.pyx
@@ -28,37 +28,37 @@
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
-import nlcpy
+def _check_vai_buffer(obj):
+ try: return hasattr(obj, '__ve_array_interface__')
+ except: return False
-
-def _find_nlcpy(args):
+def _find_vai_buffer(args):
for arg in args:
- _raise_nlcpy_array(arg)
+ _raise_vai_buffer(arg)
return
-
-def _find_nlcpy_kwargs(kwargs):
+def _find_vai_buffer_kwargs(kwargs):
for k in kwargs.keys():
- _raise_nlcpy_array(kwargs[k])
+ _raise_vai_buffer(kwargs[k])
return
-def _raise_nlcpy_array(arg):
- if isinstance(arg, nlcpy.core.core.ndarray):
- raise NotImplementedError('NLCPy array is not implemented yet.')
+def _raise_vai_buffer(arg):
+ if _check_vai_buffer(arg):
+ raise NotImplementedError('__ve_array_interface__ is not implemented yet.')
elif isinstance(arg, (list, tuple)):
- _find_nlcpy(arg)
+ _find_vai_buffer(arg)
elif isinstance(arg, dict):
- _find_nlcpy_kwargs(arg)
+ _find_vai_buffer_kwargs(arg)
return
-def check_for_nlcpy_array(func):
- def _raise_nlcpy_array_wrapper(*args, **kwargs):
- _find_nlcpy(args)
- _find_nlcpy_kwargs(kwargs)
+def raise_notimpl_for_vai_buffer(func):
+ def _raise_vai_buffer_wrapper(*args, **kwargs):
+ _find_vai_buffer(args)
+ _find_vai_buffer_kwargs(kwargs)
return func(*args, **kwargs)
- return _raise_nlcpy_array_wrapper
+ return _raise_vai_buffer_wrapper
def raise_notimpl_for_necmpi(func):
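
With this change the check is duck-typed: any object exposing
``__ve_array_interface__`` is rejected up front, whether or not NLCPy itself
is installed. A hypothetical sketch of the decorator's effect::

    class FakeVEBuffer:
        # The attribute's mere presence marks the object as a VE buffer.
        __ve_array_interface__ = {}

    @raise_notimpl_for_vai_buffer
    def takes_buffers(*bufs):
        return len(bufs)

    takes_buffers(FakeVEBuffer())  # raises NotImplementedError
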
diff --git a/src/mpi4pyve/MPI/Op.pyx b/src/mpi4pyve/MPI/Op.pyx
index 0ee9983..4e05427 100644
--- a/src/mpi4pyve/MPI/Op.pyx
+++ b/src/mpi4pyve/MPI/Op.pyx
@@ -53,7 +53,6 @@
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
-
cdef class Op:
"""
@@ -138,7 +137,7 @@ cdef class Op:
def __get__(self):
return self.Is_commutative()
- @sendrecv_buffer_for_nlcpy_array(arg_idx=(1, 2))
+ @raise_notimpl_for_vai_buffer
def Reduce_local(self, inbuf, inoutbuf):
"""
Apply a reduction operator to local data
diff --git a/src/mpi4pyve/MPI/Request.pyx b/src/mpi4pyve/MPI/Request.pyx
index e113dcb..ed34ea2 100644
--- a/src/mpi4pyve/MPI/Request.pyx
+++ b/src/mpi4pyve/MPI/Request.pyx
@@ -53,59 +53,13 @@
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
-def _replace_nlcpy_to_numpy(args):
- _type = type(args)
- _args = list(args)
- for i, arg in enumerate(_args):
- if isinstance(arg, nlcpy.core.core.ndarray):
- _args[i] = numpy.asarray(arg)
- elif isinstance(arg, (list, tuple)):
- _args[i] = _replace_nlcpy_to_numpy(arg)
- return _type(_args)
-
-
-def _undo_numpy_to_nlcpy(args):
- _type = type(args)
- _args = list(args)
- for i, arg in enumerate(_args):
- if isinstance(arg, numpy.ndarray):
- _args[i] = nlcpy.asarray(arg)
- elif isinstance(arg, (list, tuple)):
- _args[i] = _undo_numpy_to_nlcpy(arg)
- return _type(_args)
-
-
-cdef wait_for_nlcpy_array(wait_func):
- def _get_nlcpy_array_wrapper(*args, **kwargs):
- result = wait_func(*args, **kwargs)
- if isinstance(result, (list, tuple)):
- result = _undo_numpy_to_nlcpy(result)
- elif isinstance(result, numpy.ndarray):
- result = nlcpy.asarray(result)
- return result
- return _get_nlcpy_array_wrapper
-
-
-def _get_numpy_to_nlcpy(args):
- if isinstance(args, (list, tuple)):
- return _undo_numpy_to_nlcpy(args)
- elif isinstance(args, numpy.ndarray):
- return nlcpy.asarray(args)
- else:
- return args
-
-
cdef class Request:
"""
Request
"""
- def __cinit__(self, Request request=None, numpy_arr=None, nlcpy_arr=None):
- if numpy_arr is not None:
- self.numpy_arr = numpy_arr
- self.nlcpy_arr = nlcpy_arr
-
+ def __cinit__(self, Request request=None):
self.ob_mpi = MPI_REQUEST_NULL
if request is None: return
self.ob_mpi = request.ob_mpi
@@ -130,16 +84,6 @@ cdef class Request:
# Completion Operations
# ---------------------
- def _trans_nlcpy_array(self):
- if self.numpy_arr is not None:
- if not isinstance(self.numpy_arr, list):
- self.nlcpy_arr[:] = self.numpy_arr
- else:
- for i in range(0, len(self.numpy_arr)):
- if not self.numpy_arr[i] is None:
- self.nlcpy_arr[i][:] = self.numpy_arr[i]
- return
-
def Wait(self, Status status=None):
"""
Wait for a send or receive to complete
@@ -150,8 +94,6 @@ cdef class Request:
&self.ob_mpi, statusp) )
if self.ob_mpi == MPI_REQUEST_NULL:
self.ob_buf = None
-
- self._trans_nlcpy_array()
return True
def Test(self, Status status=None):
@@ -164,8 +106,6 @@ cdef class Request:
&self.ob_mpi, &flag, statusp) )
if self.ob_mpi == MPI_REQUEST_NULL:
self.ob_buf = None
-
- self._trans_nlcpy_array()
return flag
def Free(self):
@@ -202,8 +142,6 @@ cdef class Request:
with nogil: CHKERR( MPI_Waitany(
count, irequests, &index, statusp) )
finally:
- for req in requests:
- req._trans_nlcpy_array()
release_rs(requests, None, count, irequests, NULL)
return index
@@ -223,8 +161,6 @@ cdef class Request:
with nogil: CHKERR( MPI_Testany(
count, irequests, &index, &flag, statusp) )
finally:
- for req in requests:
- req._trans_nlcpy_array()
release_rs(requests, None, count, irequests, NULL)
#
return (index, flag)
@@ -244,8 +180,6 @@ cdef class Request:
with nogil: CHKERR( MPI_Waitall(
count, irequests, istatuses) )
finally:
- for req in requests:
- req._trans_nlcpy_array()
release_rs(requests, statuses, count, irequests, istatuses)
return True
@@ -265,8 +199,6 @@ cdef class Request:
with nogil: CHKERR( MPI_Testall(
count, irequests, &flag, istatuses) )
finally:
- for req in requests:
- req._trans_nlcpy_array()
release_rs(requests, statuses, count, irequests, istatuses)
return flag
@@ -287,8 +219,6 @@ cdef class Request:
with nogil: CHKERR( MPI_Waitsome(
incount, irequests, &outcount, iindices, istatuses) )
finally:
- for req in requests:
- req._trans_nlcpy_array()
release_rs(requests, statuses, incount, irequests, istatuses)
#
cdef int i = 0
@@ -314,8 +244,6 @@ cdef class Request:
with nogil: CHKERR( MPI_Testsome(
incount, irequests, &outcount, iindices, istatuses) )
finally:
- for req in requests:
- req._trans_nlcpy_array()
release_rs(requests, statuses, incount, irequests, istatuses)
#
cdef int i = 0
@@ -356,23 +284,19 @@ cdef class Request:
# Python Communication
# --------------------
#
- @wait_for_nlcpy_array
def wait(self, Status status=None):
"""
Wait for a send or receive to complete
"""
cdef msg = PyMPI_wait(self, status)
- self._trans_nlcpy_array()
return msg
#
- @wait_for_nlcpy_array
def test(self, Status status=None):
"""
Test for the completion of a send or receive
"""
cdef int flag = 0
cdef msg = PyMPI_test(self, &flag, status)
- self._trans_nlcpy_array()
return (flag, msg)
#
@classmethod
@@ -382,10 +306,6 @@ cdef class Request:
"""
cdef int index = MPI_UNDEFINED
cdef msg = PyMPI_waitany(requests, &index, status)
-
- for req in requests:
- req._trans_nlcpy_array()
- msg = _get_numpy_to_nlcpy(msg)
return (index, msg)
#
@classmethod
@@ -396,10 +316,6 @@ cdef class Request:
cdef int index = MPI_UNDEFINED
cdef int flag = 0
cdef msg = PyMPI_testany(requests, &index, &flag, status)
-
- for req in requests:
- req._trans_nlcpy_array()
- msg = _get_numpy_to_nlcpy(msg)
return (index, flag, msg)
#
@classmethod
@@ -408,10 +324,6 @@ cdef class Request:
Wait for all previously initiated requests to complete
"""
cdef msg = PyMPI_waitall(requests, statuses)
-
- for req in requests:
- req._trans_nlcpy_array()
- msg = _get_numpy_to_nlcpy(msg)
return msg
#
@classmethod
@@ -421,10 +333,6 @@ cdef class Request:
"""
cdef int flag = 0
cdef msg = PyMPI_testall(requests, &flag, statuses)
-
- for req in requests:
- req._trans_nlcpy_array()
- msg = _get_numpy_to_nlcpy(msg)
return (flag, msg)
@@ -434,11 +342,7 @@ cdef class Prequest(Request):
Persistent request
"""
- def __cinit__(self, Request request=None, numpy_arr=None, nlcpy_arr=None):
- if numpy_arr is not None:
- self.numpy_arr = numpy_arr
- self.nlcpy_arr = nlcpy_arr
-
+ def __cinit__(self, Request request=None):
if self.ob_mpi == MPI_REQUEST_NULL: return
        <Prequest?>(request)
diff --git a/src/mpi4pyve/MPI/Util.pyx b/src/mpi4pyve/MPI/Util.pyx
new file mode 100644
index 0000000..89c3867
--- /dev/null
+++ b/src/mpi4pyve/MPI/Util.pyx
@@ -0,0 +1,25 @@
+cdef extern long nmpi_aveo_dma_count[3]
+cdef extern long nmpi_aveo_dma_size[3]
+cdef extern double nmpi_aveo_dma_time[3]
+
+def _get_dma_count():
+ return [nmpi_aveo_dma_count[0],
+ nmpi_aveo_dma_count[1],
+ nmpi_aveo_dma_count[2],]
+
+def _get_dma_size():
+ return [nmpi_aveo_dma_size[0],
+ nmpi_aveo_dma_size[1],
+ nmpi_aveo_dma_size[2],]
+
+def _get_dma_time():
+ return [nmpi_aveo_dma_time[0],
+ nmpi_aveo_dma_time[1],
+ nmpi_aveo_dma_time[2],]
+
+def _nmpi_aveo_dma_clear():
+ for i in range(3):
+ nmpi_aveo_dma_count[i] = 0
+ nmpi_aveo_dma_size[i] = 0
+ nmpi_aveo_dma_time[i] = 0
+
diff --git a/src/mpi4pyve/MPI/Veo.pyx b/src/mpi4pyve/MPI/Veo.pyx
new file mode 100644
index 0000000..4aca2db
--- /dev/null
+++ b/src/mpi4pyve/MPI/Veo.pyx
@@ -0,0 +1,98 @@
+### mpi4py-ve License ##
+#
+# Copyright (c) 2022, NEC Corporation. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without modification,
+# are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright notice, this
+# list of conditions and the following disclaimer listed in this license in the
+# documentation and/or other materials provided with the distribution.
+#
+# The copyright holders provide no reassurances that the source code provided does not
+# infringe any patent, copyright, or any other intellectual property rights of third
+# parties. The copyright holders disclaim any liability to any recipient for claims
+# brought against recipient by any third party for infringement of that parties
+# intellectual property rights.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+# SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
+
+from libc.stdint cimport *
+
+cdef extern from "" nogil:
+ void *dlopen(const char *, int)
+ char *dlerror()
+ void *dlsym(void *, const char *)
+ int dlclose(void *)
+ int RTLD_LAZY
+ int RTLD_NOW
+ int RTLD_GLOBAL
+ int RTLD_LOCAL
+
+cdef int (*hooked_veo_alloc_hmem)(void *, void **, const size_t)
+cdef int (*hooked_veo_free_hmem)(void *)
+
+
+cdef _get_veo_sym():
+ global hooked_veo_alloc_hmem, hooked_veo_free_hmem
+ cdef void *hdl_veo = NULL
+ cdef void *hdl_mpi = NULL
+ cdef char *err = NULL
+
+ hdl_veo = dlopen('libmpi_veo.so.1', RTLD_NOW)
+ err = dlerror()
+ if err != NULL:
+ raise RuntimeError(err)
+ hooked_veo_alloc_hmem = \
+        <int (*)(void *, void **, const size_t)>dlsym(
+ hdl_veo, 'veo_alloc_hmem')
+ err = dlerror()
+ if err != NULL:
+ raise RuntimeError(err)
+ hooked_veo_free_hmem = \
+        <int (*)(void *)>dlsym(hdl_veo, 'veo_free_hmem')
+ err = dlerror()
+ if err != NULL:
+ raise RuntimeError(err)
+
+cdef int _hooked_alloc_hmem(void* proc, uint64_t* addr, const size_t size):
+ global hooked_veo_alloc_hmem
+ if hooked_veo_alloc_hmem == NULL:
+ _get_veo_sym()
+ cdef void *vemem = NULL
+ cdef int ret = 0
+ ret = hooked_veo_alloc_hmem(proc, &vemem, size)
+    addr[0] = <uint64_t>vemem
+ return ret
+
+cdef int _hooked_free_hmem(uint64_t addr):
+ global hooked_veo_free_hmem
+ if hooked_veo_free_hmem == NULL:
+ _get_veo_sym()
+ cdef int ret = 0
+    ret = hooked_veo_free_hmem(<void *>addr)
+ return ret
+
+
+def _alloc_hmem(uint64_t proc_handle, size_t size):
+ cdef uint64_t hmem_addr = 0
+    if _hooked_alloc_hmem(<void *>proc_handle, &hmem_addr, size):
+ raise MemoryError("Out of memory on VE")
+ return (hmem_addr)
+
+
+def _free_hmem(uint64_t addr):
+ if _hooked_free_hmem(addr):
+ raise RuntimeError("veo_free_hmem failed")
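
The hooks above resolve veo_alloc_hmem/veo_free_hmem from libmpi_veo.so.1 at
runtime instead of linking against VEO directly. A rough ctypes analogue of
that lazy resolution (a sketch, not part of the package)::

    import ctypes

    lib = ctypes.CDLL('libmpi_veo.so.1')   # roughly dlopen()
    veo_free_hmem = lib.veo_free_hmem      # roughly dlsym()
    veo_free_hmem.argtypes = [ctypes.c_void_p]
    veo_free_hmem.restype = ctypes.c_int
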
diff --git a/src/mpi4pyve/MPI/Win.pyx b/src/mpi4pyve/MPI/Win.pyx
index db054d4..dd71a75 100644
--- a/src/mpi4pyve/MPI/Win.pyx
+++ b/src/mpi4pyve/MPI/Win.pyx
@@ -53,6 +53,8 @@
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
+import sys
+
# Create flavors
# --------------
WIN_FLAVOR_CREATE = MPI_WIN_FLAVOR_CREATE
@@ -79,7 +81,13 @@ LOCK_EXCLUSIVE = MPI_LOCK_EXCLUSIVE
LOCK_SHARED = MPI_LOCK_SHARED
-include "Notimpl.pyx"
+def _request_flush():
+ try:
+ if 'nlcpy' in sys.modules:
+ vp = sys.modules['nlcpy']
+ vp.request.flush()
+    except Exception:
+ pass
cdef class Win:
@@ -402,7 +410,6 @@ cdef class Win:
# Communication Operations
# ------------------------
- @check_for_nlcpy_array
def Put(self, origin, int target_rank, target=None):
"""
Put data into a memory window on a remote process.
@@ -415,7 +422,6 @@ cdef class Win:
msg.tdisp, msg.tcount, msg.ttype,
self.ob_mpi) )
- @check_for_nlcpy_array
def Get(self, origin, int target_rank, target=None):
"""
Get data from a memory window on a remote process.
@@ -428,7 +434,6 @@ cdef class Win:
msg.tdisp, msg.tcount, msg.ttype,
self.ob_mpi) )
- @check_for_nlcpy_array
def Accumulate(self, origin, int target_rank,
target=None, Op op=SUM):
"""
@@ -442,7 +447,6 @@ cdef class Win:
msg.tdisp, msg.tcount, msg.ttype,
op.ob_mpi, self.ob_mpi) )
- @check_for_nlcpy_array
def Get_accumulate(self, origin, result, int target_rank,
target=None, Op op=SUM):
"""
@@ -457,7 +461,6 @@ cdef class Win:
msg.tdisp, msg.tcount, msg.ttype,
op.ob_mpi, self.ob_mpi) )
- @check_for_nlcpy_array
def Fetch_and_op(self, origin, result,int target_rank,
Aint target_disp=0, Op op=SUM):
"""
@@ -470,7 +473,6 @@ cdef class Win:
target_rank, target_disp,
op.ob_mpi, self.ob_mpi) )
- @check_for_nlcpy_array
def Compare_and_swap(self, origin, compare, result,
int target_rank, Aint target_disp=0):
"""
@@ -485,7 +487,6 @@ cdef class Win:
# Request-based RMA Communication Operations
# ------------------------------------------
- @check_for_nlcpy_array
def Rput(self, origin, int target_rank, target=None):
"""
Put data into a memory window on a remote process.
@@ -501,7 +502,6 @@ cdef class Win:
request.ob_buf = msg
return request
- @check_for_nlcpy_array
def Rget(self, origin, int target_rank, target=None):
"""
Get data from a memory window on a remote process.
@@ -517,7 +517,6 @@ cdef class Win:
request.ob_buf = msg
return request
- @check_for_nlcpy_array
def Raccumulate(self, origin, int target_rank,
target=None, Op op=SUM):
"""
@@ -534,7 +533,6 @@ cdef class Win:
request.ob_buf = msg
return request
- @check_for_nlcpy_array
def Rget_accumulate(self, origin, result, int target_rank,
target=None, Op op=SUM):
"""
@@ -563,6 +561,7 @@ cdef class Win:
"""
Perform an MPI fence synchronization on a window
"""
+ _request_flush()
with nogil: CHKERR( MPI_Win_fence(assertion, self.ob_mpi) )
# General Active Target Synchronization
@@ -572,6 +571,7 @@ cdef class Win:
"""
Start an RMA access epoch for MPI
"""
+ _request_flush()
with nogil: CHKERR( MPI_Win_start(
group.ob_mpi, assertion, self.ob_mpi) )
@@ -585,6 +585,7 @@ cdef class Win:
"""
Start an RMA exposure epoch
"""
+ _request_flush()
with nogil: CHKERR( MPI_Win_post(
group.ob_mpi, assertion, self.ob_mpi) )
@@ -609,6 +610,7 @@ cdef class Win:
"""
Begin an RMA access epoch at the target process
"""
+ _request_flush()
with nogil: CHKERR( MPI_Win_lock(
lock_type, rank, assertion, self.ob_mpi) )
@@ -622,6 +624,7 @@ cdef class Win:
"""
Begin an RMA access epoch at all processes
"""
+ _request_flush()
with nogil: CHKERR( MPI_Win_lock_all(assertion, self.ob_mpi) )
def Unlock_all(self):
@@ -661,6 +664,7 @@ cdef class Win:
"""
Synchronize public and private copies of the given window
"""
+ _request_flush()
with nogil: CHKERR( MPI_Win_sync(self.ob_mpi) )
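
Each synchronization entry point above now calls _request_flush() first:
NLCPy executes array operations lazily on the VE, so pending requests must be
flushed before an RMA epoch touches the window. The flush it performs, shown
standalone (assuming NLCPy is installed)::

    import nlcpy

    a = nlcpy.arange(10)   # work may still be queued on the VE
    nlcpy.request.flush()  # force pending VE requests to complete,
                           # as _request_flush() does before Win.Fence()
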
diff --git a/src/mpi4pyve/MPI/asbuffer.pxi b/src/mpi4pyve/MPI/asbuffer.pxi
index be7979a..36b70e2 100644
--- a/src/mpi4pyve/MPI/asbuffer.pxi
+++ b/src/mpi4pyve/MPI/asbuffer.pxi
@@ -132,10 +132,18 @@ cdef int Py27_GetBuffer(object obj, Py_buffer *view, int flags) except -1:
#------------------------------------------------------------------------------
+include "asvaibuf.pxi"
+
cdef int PyMPI_GetBuffer(object obj, Py_buffer *view, int flags) except -1:
- if PYPY: return PyPy_GetBuffer(obj, view, flags)
- if PY2: return Py27_GetBuffer(obj, view, flags)
- return PyObject_GetBuffer(obj, view, flags)
+ try:
+ if PYPY: return PyPy_GetBuffer(obj, view, flags)
+ if PY2: return Py27_GetBuffer(obj, view, flags)
+ return PyObject_GetBuffer(obj, view, flags)
+ except BaseException:
+ try: return Py_GetVAIBuffer(obj, view, flags)
+ except NotImplementedError: pass
+ except BaseException: raise
+ raise
#------------------------------------------------------------------------------
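
The fallback order matters: the standard buffer protocol is tried first, and
only on failure does mpi4py-ve probe for a VE array interface. The same logic
in plain Python (a sketch; wrap_vai is a hypothetical stand-in for
Py_GetVAIBuffer)::

    def get_buffer(obj):
        try:
            return memoryview(obj)  # PyObject_GetBuffer path
        except TypeError:
            if hasattr(obj, '__ve_array_interface__'):
                return wrap_vai(obj)  # hypothetical VAI path
            raise
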
diff --git a/src/mpi4pyve/MPI/asvaibuf.pxi b/src/mpi4pyve/MPI/asvaibuf.pxi
new file mode 100644
index 0000000..66170ca
--- /dev/null
+++ b/src/mpi4pyve/MPI/asvaibuf.pxi
@@ -0,0 +1,128 @@
+cdef inline int ve_is_contig(tuple shape,
+ tuple strides,
+ Py_ssize_t itemsize,
+ char order) except -1:
+ cdef Py_ssize_t i, ndim = len(shape)
+ cdef Py_ssize_t start, step, index
+ if order == c'F':
+ start = 0
+ step = 1
+ else:
+ start = ndim - 1
+ step = -1
+ for i from 0 <= i < ndim:
+ index = start + step * i
+ if itemsize != strides[index]:
+ return 0
+ itemsize *= shape[index]
+ return 1
+
+cdef inline char* ve_get_format(char typekind, Py_ssize_t itemsize) nogil:
+ if typekind == c'b':
+ if itemsize == sizeof(char): return b"?"
+        if itemsize == (sizeof(char)*4): return b"?" # NLCPy's bool is 4 bytes.
+ if typekind == c'i':
+ if itemsize == sizeof(char): return b"b"
+ if itemsize == sizeof(short): return b"h"
+ if itemsize == sizeof(int): return b"i"
+ if itemsize == sizeof(long): return b"l"
+ if itemsize == sizeof(long long): return b"q"
+ if typekind == c'u':
+ if itemsize == sizeof(char): return b"B"
+ if itemsize == sizeof(short): return b"H"
+ if itemsize == sizeof(int): return b"I"
+ if itemsize == sizeof(long): return b"L"
+ if itemsize == sizeof(long long): return b"Q"
+ if typekind == c'f':
+ if itemsize == sizeof(float)//2: return b"e"
+ if itemsize == sizeof(float): return b"f"
+ if itemsize == sizeof(double): return b"d"
+ if itemsize == sizeof(long double): return b"g"
+ if typekind == c'c':
+ if itemsize == 2*sizeof(float)//2: return b"Ze"
+ if itemsize == 2*sizeof(float): return b"Zf"
+ if itemsize == 2*sizeof(double): return b"Zd"
+ if itemsize == 2*sizeof(long double): return b"Zg"
+ return BYTE_FMT
+
+#------------------------------------------------------------------------------
+
+cdef int Py_CheckVAIBuffer(object obj):
+ try: return hasattr(obj, '__ve_array_interface__')
+ except: return 0
+
+cdef int Py_GetVAIBuffer(object obj, Py_buffer *view, int flags) except -1:
+ cdef dict ve_array_interface
+ cdef tuple data
+ cdef str typestr
+ cdef tuple shape
+ cdef tuple strides
+ cdef list descr
+ cdef object dev_ptr, mask
+ cdef void *buf = NULL
+ cdef bint readonly = 0
+ cdef Py_ssize_t s, size = 1
+ cdef Py_ssize_t itemsize = 1
+ cdef char typekind = c'u'
+ cdef bint fixnull = 0
+
+ try:
+ ve_array_interface = obj.__ve_array_interface__
+ except AttributeError:
+ raise NotImplementedError("missing VE array interface")
+
+ # mandatory
+ data = ve_array_interface['data']
+ typestr = ve_array_interface['typestr']
+ shape = tuple(ve_array_interface['shape'])
+
+ # optional
+ strides = None if ve_array_interface.get('strides') is None else \
+ tuple(ve_array_interface.get('strides'))
+ descr = ve_array_interface.get('descr')
+ mask = ve_array_interface.get('mask')
+
+ dev_ptr, readonly = data
+ for s in shape: size *= s
+ if dev_ptr is None and size == 0: dev_ptr = 0 # XXX
+ buf = PyLong_AsVoidPtr(dev_ptr)
+ typekind = ord(typestr[1])
+ itemsize = int(typestr[2:])
+
+ if mask is not None:
+ raise BufferError(
+ "__ve_array_interface__: "
+ "cannot handle masked arrays"
+ )
+ if size < 0:
+ raise BufferError(
+ "__ve_array_interface__: "
+ "buffer with negative size (shape:%s, size:%d)"
+ % (shape, size)
+ )
+ if (strides is not None and
+ not ve_is_contig(shape, strides, itemsize, c'C') and
+ not ve_is_contig(shape, strides, itemsize, c'F')):
+ raise BufferError(
+ "__ve_array_interface__: "
+ "buffer is not contiguous (shape:%s, strides:%s, itemsize:%d)"
+ % (shape, strides, itemsize)
+ )
+ if descr is not None and (len(descr) != 1 or descr[0] != ('', typestr)):
+ PyErr_WarnEx(RuntimeWarning,
+ b"__ve_array_interface__: "
+ b"ignoring 'descr' key", 1)
+
+ if PYPY and readonly and ((flags & PyBUF_WRITABLE) == PyBUF_WRITABLE):
+ raise BufferError("Object is not writable")
+
+ fixnull = (buf == NULL and size == 0)
+ if fixnull: buf = &fixnull
+ PyBuffer_FillInfo(view, obj, buf, size*itemsize, readonly, flags)
+ if fixnull: view.buf = NULL
+
+ if (flags & PyBUF_FORMAT) == PyBUF_FORMAT:
+ view.format = ve_get_format(typekind, itemsize)
+ if view.format != BYTE_FMT:
+ view.itemsize = itemsize
+ return 0
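
For reference, the smallest dictionary Py_GetVAIBuffer accepts would look
like the following sketch (the address is a placeholder, not a valid VE
pointer)::

    vai = {
        'data': (0x600000000000, False),  # (VE address, readonly flag)
        'typestr': '<f8',                 # typekind 'f', itemsize 8
        'shape': (4, 4),
        'strides': None,                  # None means contiguous
    }
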
diff --git a/src/mpi4pyve/MPI/atimport.pxi b/src/mpi4pyve/MPI/atimport.pxi
index 089d179..c276c5e 100644
--- a/src/mpi4pyve/MPI/atimport.pxi
+++ b/src/mpi4pyve/MPI/atimport.pxi
@@ -91,7 +91,7 @@ ctypedef struct Options:
cdef Options options
options.initialize = 1
options.threads = 1
-options.thread_level = MPI_THREAD_MULTIPLE
+options.thread_level = MPI_THREAD_SERIALIZED
options.finalize = 1
options.fast_reduce = 1
options.recv_mprobe = 1
@@ -106,7 +106,7 @@ cdef int getOptions(Options* opts) except -1:
cdef object rc
opts.initialize = 1
opts.threads = 1
- opts.thread_level = MPI_THREAD_MULTIPLE
+ opts.thread_level = MPI_THREAD_SERIALIZED
opts.finalize = 1
opts.fast_reduce = 1
opts.recv_mprobe = 1
@@ -116,7 +116,7 @@ cdef int getOptions(Options* opts) except -1:
#
cdef object initialize = True
cdef object threads = True
- cdef object thread_level = 'multiple'
+ cdef object thread_level = 'serialized'
cdef object finalize = None
cdef object fast_reduce = True
cdef object recv_mprobe = True
@@ -273,6 +273,71 @@ def _set_abort_status(object status):
except:
abort_status = 1 if status else 0
+def print_option():
+ print('initialize :', options.initialize)
+ print('threads :', options.threads)
+ print('thread_level :', options.thread_level)
+ print('finalize :', options.finalize)
+ print('fast_reduce :', options.fast_reduce)
+ print('recv_mprobe :', options.recv_mprobe)
+ print('errors :', options.errors)
+
+# -----------------------------------------------------------------------------
+
+# Number of processes assigned to each VH when started with multiple VH.
+import os
+from libc.stdlib cimport malloc, free
+from libc.string cimport strcmp
+
+cdef int get_mpi_local_size_from_nodeid(int nodeid):
+ cdef int local_size = 0
+ cdef int size
+    cdef MPI_Comm comm = MPI_COMM_WORLD
+ MPI_Comm_size(comm, &size)
+    cdef int* nodes_nodeid = <int*>malloc(sizeof(int) * size)
+ MPI_Allgather(&nodeid, 1, MPI_INT, nodes_nodeid, 1, MPI_INT, comm)
+ for rank in range(0, size):
+ if nodeid == nodes_nodeid[rank]:
+ local_size += 1
+ free(nodes_nodeid)
+ return local_size
+
+
+cdef int get_mpi_local_size_from_processname():
+ cdef int local_size = 0
+ cdef char processor_name[MPI_MAX_PROCESSOR_NAME + 1]
+ cdef int resultlen
+ cdef int size
+    cdef MPI_Comm comm = MPI_COMM_WORLD
+ MPI_Comm_size(comm, &size)
+ MPI_Get_processor_name(processor_name, &resultlen)
+    cdef char* nodes_processor_name = <char*>malloc(sizeof(processor_name) * size)
+ MPI_Allgather(processor_name, sizeof(processor_name), MPI_CHAR,
+ nodes_processor_name, sizeof(processor_name) ,MPI_CHAR, comm)
+ for rank in range(0, size):
+ if strcmp( processor_name , &nodes_processor_name[rank * sizeof(processor_name)]) == 0:
+ local_size += 1
+ free(nodes_processor_name)
+ return local_size
+
+cdef void set_mpi_local_size():
+ if not mpi_active(): return
+ os.environ["_MPI4PYVE_MPI_INITIALIZED"] = '1'
+
+ cdef int nodeid = -1
+ try:
+ nodeid = int(os.environ['MPINODEID'])
+ except:
+ pass
+
+ cdef int local_size = 0
+ if nodeid >= 0:
+ local_size = get_mpi_local_size_from_nodeid(nodeid)
+ else:
+ local_size = get_mpi_local_size_from_processname()
+ os.environ["_MPI4PYVE_MPI_LOCAL_SIZE"] = str(local_size)
+
+
# -----------------------------------------------------------------------------
# Vile hack for raising a exception and not contaminate the traceback
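
The helpers above derive the per-VH local size from an Allgather of node IDs
or processor names. The same logic at the Python level, for illustration::

    from mpi4pyve import MPI

    comm = MPI.COMM_WORLD
    name = MPI.Get_processor_name()
    local_size = comm.allgather(name).count(name)  # ranks on this VH
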
diff --git a/src/mpi4pyve/MPI/msgbuffer.pxi b/src/mpi4pyve/MPI/msgbuffer.pxi
index 5ffdef2..02f155e 100644
--- a/src/mpi4pyve/MPI/msgbuffer.pxi
+++ b/src/mpi4pyve/MPI/msgbuffer.pxi
@@ -26,6 +26,9 @@ cdef inline int is_buffer(object ob):
else:
return PyObject_CheckBuffer(ob) or _Py2_IsBuffer(ob)
+cdef inline int is_vai_buffer(object ob):
+ return Py_CheckVAIBuffer(ob)
+
cdef inline int is_datatype(object ob):
if isinstance(ob, Datatype): return 1
if PY3:
@@ -143,6 +146,8 @@ cdef _p_message message_simple(object msg,
(o_buf, o_count, o_displ, o_type) = msg
else:
raise ValueError("message: expecting 2 to 4 items")
+ elif is_vai_buffer(msg):
+ o_buf = msg
elif PYPY:
o_buf = msg
else:
@@ -255,6 +260,8 @@ cdef _p_message message_vector(object msg,
(o_buf, o_counts, o_displs, o_type) = msg
else:
raise ValueError("message: expecting 2 to 4 items")
+ elif is_vai_buffer(msg):
+ o_buf = msg
elif PYPY:
o_buf = msg
else:
diff --git a/src/mpi4pyve/__init__.py b/src/mpi4pyve/__init__.py
index 9267a20..3b022ce 100644
--- a/src/mpi4pyve/__init__.py
+++ b/src/mpi4pyve/__init__.py
@@ -77,7 +77,7 @@
bindings.
"""
-__version__ = '0.1.0b1'
+__version__ = '1.0.0'
__author__ = 'NEC (dev-nlcpy@sxarr.jp.nec.com)'
__credits__ = 'NEC Corporation'
@@ -142,7 +142,7 @@ def rc(**kargs): # pylint: disable=invalid-name
rc.initialize = True
rc.threads = True
-rc.thread_level = 'multiple'
+rc.thread_level = 'serialized'
rc.finalize = None
rc.fast_reduce = True
rc.recv_mprobe = True
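
The default thread support level drops from 'multiple' to 'serialized'. It
can still be overridden through rc before importing MPI; for example
(a sketch)::

    import mpi4pyve
    mpi4pyve.rc(thread_level='serialized')  # the new default, set explicitly
    from mpi4pyve import MPI                # takes effect at import time
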
diff --git a/src/mpi4pyve/futures/pool.py b/src/mpi4pyve/futures/pool.py
index bed9ad5..2e89ec0 100644
--- a/src/mpi4pyve/futures/pool.py
+++ b/src/mpi4pyve/futures/pool.py
@@ -1,5 +1,58 @@
-# Author: Lisandro Dalcin
-# Contact: dalcinl@gmail.com
+### mpi4py-ve License ##
+#
+# Copyright (c) 2022, NEC Corporation. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without modification,
+# are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright notice, this
+# list of conditions and the following disclaimer listed in this license in the
+# documentation and/or other materials provided with the distribution.
+#
+# The copyright holders provide no reassurances that the source code provided does not
+# infringe any patent, copyright, or any other intellectual property rights of third
+# parties. The copyright holders disclaim any liability to any recipient for claims
+# brought against recipient by any third party for infringement of that parties
+# intellectual property rights.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+# SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# NOTE: This code is derived from mpi4py written by Lisandro Dalcin.
+#
+### mpi4py License ##
+#
+# Copyright (c) 2019, Lisandro Dalcin. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without modification,
+# are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+# SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+
"""Implements MPIPoolExecutor."""
import time
@@ -13,12 +66,18 @@
from . import _lib
+def raise_notimpl_for_necmpi(func):
+ def _raise_wrapper(*args, **kwargs):
+ raise NotImplementedError('%s on mpi4py-ve is not implemented yet.'
+ %func)
+ return _raise_wrapper
class MPIPoolExecutor(Executor):
"""MPI-based asynchronous executor."""
Future = Future
+ @raise_notimpl_for_necmpi('MPIPoolExecutor')
def __init__(self, max_workers=None, **kwargs):
"""Initialize a new MPIPoolExecutor instance.
@@ -264,6 +323,7 @@ class MPICommExecutor(object):
# pylint: disable=too-few-public-methods
+ @raise_notimpl_for_necmpi('MPICommExecutor')
def __init__(self, comm=None, root=0, **kwargs):
"""Initialize a new MPICommExecutor instance.
diff --git a/src/mpi4pyve/util/__init__.py b/src/mpi4pyve/util/__init__.py
new file mode 100644
index 0000000..491798d
--- /dev/null
+++ b/src/mpi4pyve/util/__init__.py
@@ -0,0 +1,3 @@
+from mpi4pyve.util.nmpi_aveo_dma import nmpi_aveo_dma
+nmpi_aveo_dma = nmpi_aveo_dma()
+
diff --git a/src/mpi4pyve/util/nmpi_aveo_dma.py b/src/mpi4pyve/util/nmpi_aveo_dma.py
new file mode 100644
index 0000000..6a15efe
--- /dev/null
+++ b/src/mpi4pyve/util/nmpi_aveo_dma.py
@@ -0,0 +1,62 @@
+from mpi4pyve import MPI
+
+class nmpi_aveo_dma(object):
+ def __init__(self):
+ self.__count = None
+ self.__size = None
+ self.__time = None
+
+ @property
+ def count(self):
+ self.__count = MPI._get_dma_count()
+ return self.__count
+
+ @property
+ def size(self):
+ self.__size = MPI._get_dma_size()
+ return self.__size
+
+ @property
+ def time(self):
+ self.__time = MPI._get_dma_time()
+ return self.__time
+
+ def clear(self):
+ MPI._nmpi_aveo_dma_clear()
+
+ def show_stats(self):
+ (count, size, time) = (self.count, self.size, self.time)
+
+ sta = []
+ sta.append(['dma_count', str(count[0]), str(count[1]), str(count[2])])
+ sta.append(['dma_size', str(size[0]), str(size[1]), str(size[2])])
+ sta.append(['dma_time', str(time[0]), str(time[1]), str(time[2])])
+
+ maxname = 0
+ max_ve_ve = 0
+ max_ve_vh = 0
+ max_vh_ve = 0
+ for val in sta:
+ if maxname < len(val[0]):
+ maxname = len(val[0])
+ if max_ve_ve < len(val[1]):
+ max_ve_ve = len(val[1])
+ if max_ve_vh < len(val[2]):
+ max_ve_vh = len(val[2])
+ if max_vh_ve < len(val[3]):
+ max_vh_ve = len(val[3])
+
+ if len(sta) > 0:
+ sp1 = max(10, maxname)
+ sp2 = max(10, max_ve_ve)
+ sp3 = max(10, max_ve_vh)
+ sp4 = max(10, max_vh_ve)
+ prval = "Info %s VE->VE %s VE->VH %s VH->VE %s" % (sp1*' ', sp2*' ', sp3*' ', sp4*' ')
+ print(prval + "\n" + "-"*(sp1+4) + " " + "-"*(sp2+6) + " " + "-"*(sp3+6) + " " + "-"*(sp4+6))
+
+ for val in sta:
+ print("%s %s %s %s %s %s %s %s" % (val[0], ' '*(sp1-len(val[0])+4),
+ val[1], ' '*(sp2-len(val[1])+6),
+ val[2], ' '*(sp3-len(val[2])+6),
+ val[3], ' '*(sp4-len(val[3])+6)))
+
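
Typical use of the statistics helper, through the singleton created in
mpi4pyve/util/__init__.py above (a sketch)::

    from mpi4pyve.util import nmpi_aveo_dma

    nmpi_aveo_dma.show_stats()  # VE->VE / VE->VH / VH->VE count, size, time
    nmpi_aveo_dma.clear()       # reset the AVEO DMA counters
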
diff --git a/src/mpi4pyve/veo/__init__.py b/src/mpi4pyve/veo/__init__.py
new file mode 100644
index 0000000..94ea7fb
--- /dev/null
+++ b/src/mpi4pyve/veo/__init__.py
@@ -0,0 +1,32 @@
+### mpi4py-ve License ##
+#
+# Copyright (c) 2022, NEC Corporation. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without modification,
+# are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright notice, this
+# list of conditions and the following disclaimer listed in this license in the
+# documentation and/or other materials provided with the distribution.
+#
+# The copyright holders provide no reassurances that the source code provided does not
+# infringe any patent, copyright, or any other intellectual property rights of third
+# parties. The copyright holders disclaim any liability to any recipient for claims
+# brought against recipient by any third party for infringement of that parties
+# intellectual property rights.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+# SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# NOTE: This code is derived from mpi4py written by Lisandro Dalcin.
+#
+from mpi4pyve.veo.veo import * # NOQA
diff --git a/src/mpi4pyve/veo/veo.py b/src/mpi4pyve/veo/veo.py
new file mode 100644
index 0000000..90788e7
--- /dev/null
+++ b/src/mpi4pyve/veo/veo.py
@@ -0,0 +1,59 @@
+### mpi4py-ve License ##
+#
+# Copyright (c) 2022, NEC Corporation. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without modification,
+# are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright notice, this
+# list of conditions and the following disclaimer listed in this license in the
+# documentation and/or other materials provided with the distribution.
+#
+# The copyright holders provide no reassurances that the source code provided does not
+# infringe any patent, copyright, or any other intellectual property rights of third
+# parties. The copyright holders disclaim any liability to any recipient for claims
+# brought against recipient by any third party for infringement of that parties
+# intellectual property rights.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+# SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+from mpi4pyve import MPI
+
+
+def alloc_hmem(proc_handle, size):
+ """ Allocate a VE memory buffer or a VH memory buffer which users can use them as heterogeneous memory.
+
+ Parameters
+ ----------
+ proc_handle: pointer
+ VEO process handle
+ size: int
+ size in bytes
+
+ Returns
+ -------
+ addr: int
+ VEMVA address with the identifier
+ """
+ return MPI._alloc_hmem(proc_handle, size)
+
+
+def free_hmem(addr):
+ """ Free a VE memory buffer.
+
+ Parameters
+ ----------
+ addr: int
+ VEMVA address
+ """
+ MPI._free_hmem(addr)
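
Example use of the two wrappers (a sketch: proc_handle must be a real VEO
process handle, e.g. obtained from the VE offloading runtime; the value below
is only a placeholder)::

    from mpi4pyve import veo

    proc_handle = ...  # placeholder for a VEO process handle
    addr = veo.alloc_hmem(proc_handle, 8 * 1024)  # 8 KiB of VE memory
    try:
        pass  # pass addr to MPI or VE kernels as heterogeneous memory
    finally:
        veo.free_hmem(addr)
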
diff --git a/src/pympivendor.h b/src/pympivendor.h
index b14b4db..7bf2dd0 100644
--- a/src/pympivendor.h
+++ b/src/pympivendor.h
@@ -1,5 +1,59 @@
-/* Author: Lisandro Dalcin */
-/* Contact: dalcinl@gmail.com */
+/*
+### mpi4py-ve License ##
+#
+# Copyright (c) 2022, NEC Corporation. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without modification,
+# are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright notice, this
+# list of conditions and the following disclaimer listed in this license in the
+# documentation and/or other materials provided with the distribution.
+#
+# The copyright holders provide no reassurances that the source code provided does not
+# infringe any patent, copyright, or any other intellectual property rights of third
+# parties. The copyright holders disclaim any liability to any recipient for claims
+# brought against recipient by any third party for infringement of that parties
+# intellectual property rights.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+# SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+# NOTE: This code is derived from mpi4py written by Lisandro Dalcin.
+#
+### mpi4py License ##
+#
+# Copyright (c) 2019, Lisandro Dalcin. All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without modification,
+# are permitted provided that the following conditions are met:
+#
+# * Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+# * Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND ANY
+# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
+# SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
+# INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+# TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+# BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+#
+*/
static int PyMPI_Get_vendor(const char **vendor_name,
int *version_major,
@@ -111,6 +165,10 @@ static int PyMPI_Get_vendor(const char **vendor_name,
micro = LAM_RELEASE_VERSION;
#endif
+#elif defined(MPI4PYVE_NEC_MPI) && (MPI4PYVE_NEC_MPI == 1)
+
+ name = "NEC MPI";
+
#endif
if (vendor_name) *vendor_name = name;
diff --git a/test/nlcpy_ndarray_wrapper.py b/test/nlcpy_ndarray_wrapper.py
new file mode 100644
index 0000000..08071cd
--- /dev/null
+++ b/test/nlcpy_ndarray_wrapper.py
@@ -0,0 +1,16 @@
+import nlcpy
+
+class nlcpy_ndarray_wrapper(nlcpy.ndarray):
+ def __init__(self, shape, dtype=float, strides=None, order='C'):
+ super().__init__(shape, dtype, strides, order)
+ self.read_only_flag = False
+
+ def set_read_only_flag(self, read_only_flag):
+ self.read_only_flag = read_only_flag
+
+ @property
+ def __ve_array_interface__(self):
+ vai = super().__ve_array_interface__
+ vai['data'] = (vai['data'][0], self.read_only_flag)
+ return vai
+
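
The wrapper exists so the tests can flip the read-only bit in the
interface's data tuple. Usage sketch::

    arr = nlcpy_ndarray_wrapper((4,), dtype='float64')
    arr.set_read_only_flag(True)
    print(arr.__ve_array_interface__['data'][1])  # True: reported read-only
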
diff --git a/test/test_cco_buf.py b/test/test_cco_buf.py
index f8420a3..468888d 100644
--- a/test/test_cco_buf.py
+++ b/test/test_cco_buf.py
@@ -110,6 +110,9 @@ def testReduce(self):
for root in range(size):
for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
sbuf = array(range(size), typecode)
+ # @unittest.skip('necmpi')
+ if hasattr(sbuf.as_mpi(), '__ve_array_interface__'):
+ continue
rbuf = array(-1, typecode, size)
self.COMM.Reduce(sbuf.as_mpi(),
rbuf.as_mpi(),
@@ -587,6 +590,9 @@ def testReduceLocal(self):
for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
size = 5
sbuf = array(range(1,size+1), typecode)
+ # @unittest.skip('necmpi')
+ if hasattr(sbuf.as_mpi(), '__ve_array_interface__'):
+ continue
rbuf = array(range(0,size+0), typecode)
try:
op.Reduce_local(sbuf.as_mpi(), rbuf.as_mpi())
@@ -605,6 +611,9 @@ def testReduceLocal(self):
for array in arrayimpl.ArrayTypes:
for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
sbuf = array(range(3), "i")
+ # @unittest.skip('necmpi')
+ if hasattr(sbuf.as_mpi(), '__ve_array_interface__'):
+ continue
rbuf = array(range(3), "i")
def f(): op.Reduce_local(sbuf.as_mpi_c(2),
rbuf.as_mpi_c(3))
diff --git a/test/test_io.py b/test/test_io.py
index b255db0..a5c86ae 100644
--- a/test/test_io.py
+++ b/test/test_io.py
@@ -17,7 +17,12 @@ def setUp(self):
fname = None
if comm.Get_rank() == 0:
if not os.path.exists(self.tmpname):
- os.mkdir(self.tmpname)
+ try:
+ os.mkdir(self.tmpname)
+ except OSError as e:
+                    if e.errno != 17:  # re-raise unless errno 17 (File exists)
+ raise
+ pass
fd, fname = tempfile.mkstemp(prefix=self.prefix,dir=self.tmpname)
os.close(fd)
fname = comm.bcast(fname, 0)
@@ -112,7 +117,7 @@ def testReadWrite(self):
self.assertEqual(value, 42)
self.assertEqual(rbuf[-1], -1)
comm.Barrier()
-
+
def testIReadIWrite(self):
comm = self.COMM
size = comm.Get_size()
@@ -228,10 +233,10 @@ def testIReadIWriteAtAll(self):
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
etype = arrayimpl.TypeMap[typecode]
- fh.Set_size(0)
- fh.Set_view(0, etype)
count = 13
wbuf = array(42, typecode, count)
+ fh.Set_size(0)
+ fh.Set_view(rank*wbuf.itemsize*count, etype)
fh.Iwrite_at_all(count*rank, wbuf.as_raw()).Wait()
fh.Sync()
comm.Barrier()
@@ -306,10 +311,10 @@ def testIReadIWriteAll(self):
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
etype = arrayimpl.TypeMap[typecode]
- fh.Set_size(0)
- fh.Set_view(0, etype)
count = 13
wbuf = array(42, typecode, count)
+ fh.Set_size(0)
+ fh.Set_view(rank*count*wbuf.itemsize, etype)
fh.Seek(count*rank, MPI.SEEK_SET)
fh.Iwrite_all(wbuf.as_raw()).Wait()
fh.Sync()
@@ -352,7 +357,7 @@ def testReadWriteAllBeginEnd(self):
self.assertEqual(value, 42)
self.assertEqual(rbuf[-1], -1)
comm.Barrier()
-
+
def testReadWriteOrdered(self):
comm = self.COMM
size = comm.Get_size()
diff --git a/test/test_p2p_buf.py b/test/test_p2p_buf.py
index 75f2ef0..364dd93 100644
--- a/test/test_p2p_buf.py
+++ b/test/test_p2p_buf.py
@@ -35,7 +35,7 @@ def testSendrecvReplace(self):
self.COMM.Sendrecv_replace(buf.as_mpi(), dest, 0, source, 0)
for value in buf:
self.assertEqual(value, source)
-
+
def testSendRecv(self):
size = self.COMM.Get_size()
rank = self.COMM.Get_rank()
@@ -46,36 +46,38 @@ def testSendRecv(self):
sbuf = array( s, typecode, s)
rbuf = array(-1, typecode, s)
mem = array( 0, typecode, 2*(s+MPI.BSEND_OVERHEAD)).as_raw()
- if size == 1:
- MPI.Attach_buffer(mem)
- rbuf = sbuf
- MPI.Detach_buffer()
- elif rank == 0:
- MPI.Attach_buffer(mem)
- self.COMM.Ibsend(sbuf.as_mpi(), 1, 0).Wait()
- self.COMM.Bsend(sbuf.as_mpi(), 1, 0)
- MPI.Detach_buffer()
- self.COMM.Send(sbuf.as_mpi(), 1, 0)
- self.COMM.Ssend(sbuf.as_mpi(), 1, 0)
- self.COMM.Recv(rbuf.as_mpi(), 1, 0)
- self.COMM.Recv(rbuf.as_mpi(), 1, 0)
- self.COMM.Recv(rbuf.as_mpi(), 1, 0)
- self.COMM.Recv(rbuf.as_mpi(), 1, 0)
- elif rank == 1:
- self.COMM.Recv(rbuf.as_mpi(), 0, 0)
- self.COMM.Recv(rbuf.as_mpi(), 0, 0)
- self.COMM.Recv(rbuf.as_mpi(), 0, 0)
- self.COMM.Recv(rbuf.as_mpi(), 0, 0)
- MPI.Attach_buffer(mem)
- self.COMM.Ibsend(sbuf.as_mpi(), 0, 0).Wait()
- self.COMM.Bsend(sbuf.as_mpi(), 0, 0)
- MPI.Detach_buffer()
- self.COMM.Send(sbuf.as_mpi(), 0, 0)
- self.COMM.Ssend(sbuf.as_mpi(), 0, 0)
- else:
- rbuf = sbuf
- for value in rbuf:
- self.assertEqual(value, s)
+ # @unittest.skip('necmpi')
+ if not hasattr(mem, '__ve_array_interface__'):
+ if size == 1:
+ MPI.Attach_buffer(mem)
+ rbuf = sbuf
+ MPI.Detach_buffer()
+ elif rank == 0:
+ MPI.Attach_buffer(mem)
+ self.COMM.Ibsend(sbuf.as_mpi(), 1, 0).Wait()
+ self.COMM.Bsend(sbuf.as_mpi(), 1, 0)
+ MPI.Detach_buffer()
+ self.COMM.Send(sbuf.as_mpi(), 1, 0)
+ self.COMM.Ssend(sbuf.as_mpi(), 1, 0)
+ self.COMM.Recv(rbuf.as_mpi(), 1, 0)
+ self.COMM.Recv(rbuf.as_mpi(), 1, 0)
+ self.COMM.Recv(rbuf.as_mpi(), 1, 0)
+ self.COMM.Recv(rbuf.as_mpi(), 1, 0)
+ elif rank == 1:
+ self.COMM.Recv(rbuf.as_mpi(), 0, 0)
+ self.COMM.Recv(rbuf.as_mpi(), 0, 0)
+ self.COMM.Recv(rbuf.as_mpi(), 0, 0)
+ self.COMM.Recv(rbuf.as_mpi(), 0, 0)
+ MPI.Attach_buffer(mem)
+ self.COMM.Ibsend(sbuf.as_mpi(), 0, 0).Wait()
+ self.COMM.Bsend(sbuf.as_mpi(), 0, 0)
+ MPI.Detach_buffer()
+ self.COMM.Send(sbuf.as_mpi(), 0, 0)
+ self.COMM.Ssend(sbuf.as_mpi(), 0, 0)
+ else:
+ rbuf = sbuf
+ for value in rbuf:
+ self.assertEqual(value, s)
#
rank = self.COMM.Get_rank()
sbuf = array( s, typecode, s)
@@ -386,27 +388,29 @@ def testPersistent(self):
self.assertEqual(value, -1)
#
mem = array( 0, typecode, s+MPI.BSEND_OVERHEAD).as_raw()
- sbuf = array( s, typecode, s)
- rbuf = array(-1, typecode, s+xs)
- MPI.Attach_buffer(mem)
- sendreq = self.COMM.Bsend_init(sbuf.as_mpi(), dest, 0)
- recvreq = self.COMM.Recv_init(rbuf.as_mpi(), source, 0)
- sendreq.Start()
- recvreq.Start()
- sendreq.Wait()
- recvreq.Wait()
- MPI.Detach_buffer()
- self.assertNotEqual(sendreq, MPI.REQUEST_NULL)
- self.assertNotEqual(recvreq, MPI.REQUEST_NULL)
- sendreq.Free()
- recvreq.Free()
- self.assertEqual(sendreq, MPI.REQUEST_NULL)
- self.assertEqual(recvreq, MPI.REQUEST_NULL)
- for value in rbuf[:s]:
- self.assertEqual(value, s)
- for value in rbuf[s:]:
- self.assertEqual(value, -1)
- #
+ # @unittest.skip('necmpi')
+ if not hasattr(mem, '__ve_array_interface__'):
+ sbuf = array( s, typecode, s)
+ rbuf = array(-1, typecode, s+xs)
+ MPI.Attach_buffer(mem)
+ sendreq = self.COMM.Bsend_init(sbuf.as_mpi(), dest, 0)
+ recvreq = self.COMM.Recv_init(rbuf.as_mpi(), source, 0)
+ sendreq.Start()
+ recvreq.Start()
+ sendreq.Wait()
+ recvreq.Wait()
+ MPI.Detach_buffer()
+ self.assertNotEqual(sendreq, MPI.REQUEST_NULL)
+ self.assertNotEqual(recvreq, MPI.REQUEST_NULL)
+ sendreq.Free()
+ recvreq.Free()
+ self.assertEqual(sendreq, MPI.REQUEST_NULL)
+ self.assertEqual(recvreq, MPI.REQUEST_NULL)
+ for value in rbuf[:s]:
+ self.assertEqual(value, s)
+ for value in rbuf[s:]:
+ self.assertEqual(value, -1)
+
rank = self.COMM.Get_rank()
sbuf = array( s, typecode, s)
rbuf = array(-1, typecode, s+xs)
diff --git a/test/test_p2p_buf_s.py b/test/test_p2p_buf_s.py
index d27316e..cc00891 100644
--- a/test/test_p2p_buf_s.py
+++ b/test/test_p2p_buf_s.py
@@ -31,6 +31,9 @@ def test_send_recv(self):
sbuf = array( s, typecode, s)
# rbuf = array(-1, typecode, s)
mem = array( 0, typecode, 2*(s+MPI.BSEND_OVERHEAD)).as_raw()
+ # @unittest.skip('necmpi')
+ if hasattr(mem, '__ve_array_interface__'):
+ continue
if size == 1:
MPI.Attach_buffer(mem)
rbuf = sbuf
diff --git a/test/test_p2p_obj_array.py b/test/test_p2p_obj_array.py
index ac709fb..bc52c99 100644
--- a/test/test_p2p_obj_array.py
+++ b/test/test_p2p_obj_array.py
@@ -375,6 +375,7 @@ def testISSendAndRecv(self):
self.assertFalse(req)
self.assertTrue(np.array_equal(rmess, smess))
+ @unittest.skip('necmpi')
def testIRecvAndBSend(self):
comm = self.COMM
rank = comm.Get_rank()
@@ -385,18 +386,19 @@ def testIRecvAndBSend(self):
src = dst = rank
req1 = comm.irecv(None, src, 1)
req2 = comm.irecv(None, src, 2)
- req3 = comm.irecv(None, src, 3)
- comm.bsend(smess, dst, 3)
+ #req3 = comm.irecv(None, src, 3)
+ #comm.bsend(smess, dst, 3)
comm.bsend(smess, dst, 2)
comm.bsend(smess, dst, 1)
- self.assertTrue(np.array_equal(smess, req3.wait()))
+ #self.assertTrue(np.array_equal(smess, req3.wait()))
self.assertTrue(np.array_equal(smess, req2.wait()))
self.assertTrue(np.array_equal(smess, req1.wait()))
- comm.bsend(smess, MPI.PROC_NULL, 3)
+ #comm.bsend(smess, MPI.PROC_NULL, 3)
finally:
MPI.Detach_buffer()
MPI.Free_mem(buf)
-
+
+ @unittest.skip('necmpi')
def testIRecvAndIBSend(self):
comm = self.COMM
rank = comm.Get_rank()
diff --git a/test/test_pack.py b/test/test_pack.py
index 64cf259..f9221dc 100644
--- a/test/test_pack.py
+++ b/test/test_pack.py
@@ -74,6 +74,9 @@ def testPackUnpackExternal(self):
iarray1 = array(127, typecode1, items).as_raw()
else:
iarray1 = array(255, typecode1, items).as_raw()
+ # @unittest.skip('necmpi')
+ if hasattr(iarray1, '__ve_array_interface__'):
+ continue
iarray2 = array(range(items), typecode2).as_raw()
oarray1 = array(-1, typecode1, items).as_raw()
oarray2 = array(-1, typecode2, items).as_raw()
diff --git a/test/test_rma.py b/test/test_rma.py
index 02204de..b41ec08 100644
--- a/test/test_rma.py
+++ b/test/test_rma.py
@@ -95,6 +95,7 @@ def testAccumulate(self):
group.Free()
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
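+ # skip the complex typecodes 'F', 'D', 'G' (MPI.MAX/MPI.MIN reductions are undefined for complex)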
+ if typecode in 'FDG': continue
for count in range(10):
for rank in range(size):
sbuf = array(range(count), typecode)
@@ -122,6 +123,7 @@ def testGetAccumulate(self):
try:
try:
self.WIN.Get_accumulate([obuf, 0, MPI.BYTE], [rbuf, 0, MPI.BYTE], rank)
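+ # complete the probe's access epoch before the buffers are freed in the finally block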
+ self.WIN.Fence()
finally:
MPI.Free_mem(obuf)
MPI.Free_mem(rbuf)
@@ -130,8 +132,10 @@ def testGetAccumulate(self):
self.WIN.Fence()
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
+ if typecode in 'FDG': continue
+ if rank != 0: continue
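+ # only rank 0 drives the passive-target epochs; 'lrank' below is the target rank,
+ # kept distinct from this process's 'rank'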
for count in range(10):
- for rank in range(size):
+ for lrank in range(size):
ones = array([1]*count, typecode)
sbuf = array(range(count), typecode)
rbuf = array(-1, typecode, count+1)
@@ -139,16 +143,16 @@ def testGetAccumulate(self):
for op in (MPI.SUM, MPI.PROD,
MPI.MAX, MPI.MIN,
MPI.REPLACE, MPI.NO_OP):
- self.WIN.Lock(rank)
- self.WIN.Put(ones.as_mpi(), rank)
- self.WIN.Flush(rank)
+ self.WIN.Lock(lrank)
+ self.WIN.Put(ones.as_mpi(), lrank)
+ self.WIN.Flush(lrank)
self.WIN.Get_accumulate(sbuf.as_mpi(),
rbuf.as_mpi_c(count),
- rank, op=op)
- self.WIN.Flush(rank)
- self.WIN.Get(gbuf.as_mpi_c(count), rank)
- self.WIN.Flush(rank)
- self.WIN.Unlock(rank)
+ lrank, op=op)
+ self.WIN.Flush(lrank)
+ self.WIN.Get(gbuf.as_mpi_c(count), lrank)
+ self.WIN.Flush(lrank)
+ self.WIN.Unlock(lrank)
#
for i in range(count):
self.assertEqual(sbuf[i], i)
@@ -158,39 +162,51 @@ def testGetAccumulate(self):
self.assertEqual(gbuf[-1], -1)
def testFetchAndOp(self):
+ typemap = MPI._typedict
group = self.WIN.Get_group()
size = group.Get_size()
rank = group.Get_rank()
group.Free()
self.WIN.Fence()
- #obuf = MPI.Alloc_mem(1); memzero(obuf)
- #rbuf = MPI.Alloc_mem(1); memzero(rbuf)
- #try:
- # try:
- # self.WIN.Fetch_and_op([obuf, 1, MPI.BYTE], [rbuf, 1, MPI.BYTE], rank)
- # finally:
- # MPI.Free_mem(obuf)
- # MPI.Free_mem(rbuf)
- #except NotImplementedError:
- # self.skipTest('mpi-win-fetch_and_op')
- #self.WIN.Fence()
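+ # probe with one MPI.INT element (the old, commented-out probe used a single MPI.BYTE)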
+ blen = MPI.INT.Get_size()
+ obuf = MPI.Alloc_mem(blen); memzero(obuf)
+ rbuf = MPI.Alloc_mem(blen); memzero(rbuf)
+ try:
+ try:
+ self.WIN.Fetch_and_op(
+ [obuf, 1, MPI.INT],
+ [rbuf, 1, MPI.INT],
+ rank)
+ self.WIN.Fence()
+ finally:
+ MPI.Free_mem(obuf)
+ MPI.Free_mem(rbuf)
+ except NotImplementedError:
+ self.skipTest('mpi-win-fetch_and_op')
+ self.WIN.Fence()
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
+ if typecode in 'FDG': continue
+ if rank != 0: continue
obuf = array(+1, typecode)
rbuf = array(-1, typecode, 2)
+ datatype = typemap[typecode]
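+ # the test window is created with disp_unit=1, so target displacements below are scaled to bytes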
for op in (MPI.SUM, MPI.PROD,
MPI.MAX, MPI.MIN,
MPI.REPLACE, MPI.NO_OP):
- for rank in range(size):
+ for lrank in range(size):
for disp in range(3):
- self.WIN.Lock(rank)
+ self.WIN.Lock(lrank)
self.WIN.Fetch_and_op(obuf.as_mpi(),
rbuf.as_mpi_c(1),
- rank, disp, op=op)
- self.WIN.Unlock(rank)
+ lrank,
+ disp * datatype.size,
+ op=op)
+ self.WIN.Unlock(lrank)
self.assertEqual(rbuf[1], -1)
def testCompareAndSwap(self):
+ typemap = MPI._typedict
group = self.WIN.Get_group()
size = group.Get_size()
rank = group.Get_rank()
@@ -205,6 +221,7 @@ def testCompareAndSwap(self):
[cbuf, 1, MPI.BYTE],
[rbuf, 1, MPI.BYTE],
rank, 0)
+ self.WIN.Fence()
finally:
MPI.Free_mem(obuf)
MPI.Free_mem(cbuf)
@@ -215,17 +232,21 @@ def testCompareAndSwap(self):
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
if typecode in 'fdg': continue
+ if typecode in 'FDG': continue
+ if rank != 0: continue
obuf = array(+1, typecode)
cbuf = array( 0, typecode)
rbuf = array(-1, typecode, 2)
- for rank in range(size):
+ datatype = typemap[typecode]
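+ # as in testFetchAndOp, displacements are scaled to byte offsets via the element size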
+ for lrank in range(size):
for disp in range(3):
- self.WIN.Lock(rank)
+ self.WIN.Lock(lrank)
self.WIN.Compare_and_swap(obuf.as_mpi(),
cbuf.as_mpi(),
rbuf.as_mpi_c(1),
- rank, disp)
- self.WIN.Unlock(rank)
+ lrank,
+ disp * datatype.size)
+ self.WIN.Unlock(lrank)
self.assertEqual(rbuf[1], -1)
def testPutProcNull(self):
@@ -418,7 +439,6 @@ def testFlush(self):
class TestRMASelf(BaseTestRMA, unittest.TestCase):
COMM = MPI.COMM_SELF
-@unittest.skipIf(multihost.IS_MULTI_HOST, 'necmpi-multi-host')
class TestRMAWorld(BaseTestRMA, unittest.TestCase):
COMM = MPI.COMM_WORLD
diff --git a/test/test_rma_nb.py b/test/test_rma_nb.py
index 404ffd7..1f5100a 100644
--- a/test/test_rma_nb.py
+++ b/test/test_rma_nb.py
@@ -45,51 +45,57 @@ def tearDown(self):
def testPutGet(self):
group = self.WIN.Get_group()
size = group.Get_size()
+ rank = group.Get_rank()
group.Free()
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
+ if typecode in 'FDG': continue
+ if rank != 0: continue
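+ # as in test_rma: skip complex typecodes and let only rank 0 drive the lock/unlock epochs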
for count in range(self.COUNT_MIN, 10):
- for rank in range(size):
- sbuf = array([rank]*count, typecode)
+ for lrank in range(size):
+ sbuf = array([lrank]*count, typecode)
rbuf = array(-1, typecode, count+1)
- self.WIN.Fence()
- self.WIN.Lock(rank)
- r = self.WIN.Rput(sbuf.as_mpi(), rank)
+ #self.WIN.Fence()
+ self.WIN.Lock(lrank)
+ r = self.WIN.Rput(sbuf.as_mpi(), lrank)
r.Wait()
- self.WIN.Flush(rank)
- r = self.WIN.Rget(rbuf.as_mpi_c(count), rank)
+ self.WIN.Flush(lrank)
+ r = self.WIN.Rget(rbuf.as_mpi_c(count), lrank)
r.Wait()
- self.WIN.Unlock(rank)
+ self.WIN.Unlock(lrank)
for i in range(count):
- self.assertEqual(sbuf[i], rank)
- self.assertEqual(rbuf[i], rank)
- self.assertEqual(rbuf[-1], -1)
+ self.assertEqual(sbuf[i], lrank)
+ self.assertEqual(rbuf[i], lrank)
+ self.assertEqual(int(rbuf[-1]), -1)
@unittest.skipMPI('openmpi(>=1.10.0,<1.11.0)')
def testAccumulate(self):
group = self.WIN.Get_group()
size = group.Get_size()
+ rank = group.Get_rank()
group.Free()
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
+ if typecode in 'FDG': continue
+ if rank != 0: continue
for count in range(self.COUNT_MIN, 10):
- for rank in range(size):
+ for lrank in range(size):
ones = array([1]*count, typecode)
sbuf = array(range(count), typecode)
rbuf = array(-1, typecode, count+1)
for op in (MPI.SUM, MPI.PROD,
MPI.MAX, MPI.MIN,
MPI.REPLACE):
- self.WIN.Lock(rank)
- self.WIN.Put(ones.as_mpi(), rank)
- self.WIN.Flush(rank)
+ self.WIN.Lock(lrank)
+ self.WIN.Put(ones.as_mpi(), lrank)
+ self.WIN.Flush(lrank)
r = self.WIN.Raccumulate(sbuf.as_mpi(),
- rank, op=op)
+ lrank, op=op)
r.Wait()
- self.WIN.Flush(rank)
- r = self.WIN.Rget(rbuf.as_mpi_c(count), rank)
+ self.WIN.Flush(lrank)
+ r = self.WIN.Rget(rbuf.as_mpi_c(count), lrank)
r.Wait()
- self.WIN.Unlock(rank)
+ self.WIN.Unlock(lrank)
#
for i in range(count):
self.assertEqual(sbuf[i], i)
@@ -100,11 +106,14 @@ def testAccumulate(self):
def testGetAccumulate(self):
group = self.WIN.Get_group()
size = group.Get_size()
+ rank = group.Get_rank()
group.Free()
for array in arrayimpl.ArrayTypes:
for typecode in arrayimpl.TypeMap:
+ if typecode in 'FDG': continue
+ if rank != 0: continue
for count in range(self.COUNT_MIN, 10):
- for rank in range(size):
+ for lrank in range(size):
ones = array([1]*count, typecode)
sbuf = array(range(count), typecode)
rbuf = array(-1, typecode, count+1)
@@ -112,17 +121,17 @@ def testGetAccumulate(self):
for op in (MPI.SUM, MPI.PROD,
MPI.MAX, MPI.MIN,
MPI.REPLACE, MPI.NO_OP):
- self.WIN.Lock(rank)
- self.WIN.Put(ones.as_mpi(), rank)
- self.WIN.Flush(rank)
+ self.WIN.Lock(lrank)
+ self.WIN.Put(ones.as_mpi(), lrank)
+ self.WIN.Flush(lrank)
r = self.WIN.Rget_accumulate(sbuf.as_mpi(),
rbuf.as_mpi_c(count),
- rank, op=op)
+ lrank, op=op)
r.Wait()
- self.WIN.Flush(rank)
- r = self.WIN.Rget(gbuf.as_mpi_c(count), rank)
+ self.WIN.Flush(lrank)
+ r = self.WIN.Rget(gbuf.as_mpi_c(count), lrank)
r.Wait()
- self.WIN.Unlock(rank)
+ self.WIN.Unlock(lrank)
#
for i in range(count):
self.assertEqual(sbuf[i], i)
@@ -175,7 +184,6 @@ class TestRMASelf(BaseTestRMA, unittest.TestCase):
@unittest.skipMPI('MPI(<3.0)')
@unittest.skipMPI('openmpi(<1.8.1)')
@unittest.skipMPI('MPICH2(<1.5.0)')
-@unittest.skipIf(multihost.IS_MULTI_HOST, 'necmpi-multi-host')
class TestRMAWorld(BaseTestRMA, unittest.TestCase):
COMM = MPI.COMM_WORLD
diff --git a/test/test_rma_nb_nlcpy_notimpl.py b/test/test_rma_nb_nlcpy_notimpl.py
deleted file mode 100644
index e83a7c4..0000000
--- a/test/test_rma_nb_nlcpy_notimpl.py
+++ /dev/null
@@ -1,200 +0,0 @@
-from mpi4pyve import MPI
-import mpiunittest as unittest
-import nlcpy_only_arrayimpl as arrayimpl
-import sys
-
-pypy_lt_53 = (hasattr(sys, 'pypy_version_info') and
- sys.pypy_version_info < (5, 3))
-
-def mkzeros(n):
- if pypy_lt_53:
- return b'\0' * n
- return bytearray(n)
-
-def memzero(m):
- try:
- m[:] = 0
- except IndexError: # cffi buffer
- m[0:len(m)] = b'\0'*len(m)
-
-class BaseTestRMA(object):
-
- COMM = MPI.COMM_NULL
- INFO = MPI.INFO_NULL
-
- COUNT_MIN = 0
-
- def setUp(self):
- nbytes = 100*MPI.DOUBLE.size
- try:
- self.mpi_memory = MPI.Alloc_mem(nbytes)
- self.memory = self.mpi_memory
- memzero(self.memory)
- except MPI.Exception:
- import array
- self.mpi_memory = None
- self.memory = array.array('B',[0]*nbytes)
- self.WIN = MPI.Win.Create(self.memory, 1, self.INFO, self.COMM)
-
- def tearDown(self):
- self.WIN.Free()
- if self.mpi_memory:
- MPI.Free_mem(self.mpi_memory)
-
- def testPutGet(self):
- group = self.WIN.Get_group()
- size = group.Get_size()
- group.Free()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- for count in range(self.COUNT_MIN, 10):
- for rank in range(size):
- sbuf = array([rank]*count, typecode)
- rbuf = array(-1, typecode, count+1)
- self.WIN.Fence()
- self.WIN.Lock(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Rput(sbuf.as_mpi(), rank).Wait()
- )
- self.WIN.Flush(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Rget(rbuf.as_mpi_c(count), rank).Wait()
- )
- self.WIN.Unlock(rank)
- #for i in range(count):
- # self.assertEqual(sbuf[i], rank)
- # self.assertEqual(rbuf[i], rank)
- #self.assertEqual(rbuf[-1], -1)
-
- @unittest.skipMPI('openmpi(>=1.10.0,<1.11.0)')
- def testAccumulate(self):
- group = self.WIN.Get_group()
- size = group.Get_size()
- group.Free()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- for count in range(self.COUNT_MIN, 10):
- for rank in range(size):
- ones = array([1]*count, typecode)
- sbuf = array(range(count), typecode)
- rbuf = array(-1, typecode, count+1)
- for op in (MPI.SUM, MPI.PROD,
- MPI.MAX, MPI.MIN,
- MPI.REPLACE):
- self.WIN.Lock(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Put(ones.as_mpi(), rank)
- )
- self.WIN.Flush(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Raccumulate(sbuf.as_mpi(),
- rank, op=op).Wait()
- )
- self.WIN.Flush(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Rget(rbuf.as_mpi_c(count), rank).Wait()
- )
- self.WIN.Unlock(rank)
- #
- #for i in range(count):
- # self.assertEqual(sbuf[i], i)
- # self.assertEqual(rbuf[i], op(1, i))
- #self.assertEqual(rbuf[-1], -1)
-
- @unittest.skipMPI('openmpi(>=1.10,<1.11)')
- def testGetAccumulate(self):
- group = self.WIN.Get_group()
- size = group.Get_size()
- group.Free()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- for count in range(self.COUNT_MIN, 10):
- for rank in range(size):
- ones = array([1]*count, typecode)
- sbuf = array(range(count), typecode)
- rbuf = array(-1, typecode, count+1)
- gbuf = array(-1, typecode, count+1)
- for op in (MPI.SUM, MPI.PROD,
- MPI.MAX, MPI.MIN,
- MPI.REPLACE, MPI.NO_OP):
- self.WIN.Lock(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Put(ones.as_mpi(), rank)
- )
- self.WIN.Flush(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Rget_accumulate(sbuf.as_mpi(),
- rbuf.as_mpi_c(count),
- rank, op=op).Wait()
- )
- self.WIN.Flush(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Rget(gbuf.as_mpi_c(count), rank).Wait()
- )
- self.WIN.Unlock(rank)
- #
- #for i in range(count):
- # self.assertEqual(sbuf[i], i)
- # self.assertEqual(rbuf[i], 1)
- # self.assertEqual(gbuf[i], op(1, i))
- #self.assertEqual(rbuf[-1], -1)
- #self.assertEqual(gbuf[-1], -1)
-
- def testPutProcNull(self):
- rank = self.COMM.Get_rank()
- self.WIN.Lock(rank)
- r = self.WIN.Rput(None, MPI.PROC_NULL, None)
- r.Wait()
- self.WIN.Unlock(rank)
-
- def testGetProcNull(self):
- rank = self.COMM.Get_rank()
- self.WIN.Lock(rank)
- r = self.WIN.Rget(None, MPI.PROC_NULL, None)
- r.Wait()
- self.WIN.Unlock(rank)
-
- def testAccumulateProcNullReplace(self):
- rank = self.COMM.Get_rank()
- zeros = mkzeros(8)
- self.WIN.Lock(rank)
- r = self.WIN.Raccumulate([zeros, MPI.INT], MPI.PROC_NULL, None, MPI.REPLACE)
- r.Wait()
- r = self.WIN.Raccumulate([zeros, MPI.INT], MPI.PROC_NULL, None, MPI.REPLACE)
- r.Wait()
- self.WIN.Unlock(rank)
-
- def testAccumulateProcNullSum(self):
- rank = self.COMM.Get_rank()
- zeros = mkzeros(8)
- self.WIN.Lock(rank)
- r = self.WIN.Raccumulate([zeros, MPI.INT], MPI.PROC_NULL, None, MPI.SUM)
- r.Wait()
- r = self.WIN.Raccumulate([None, MPI.INT], MPI.PROC_NULL, None, MPI.SUM)
- r.Wait()
- self.WIN.Unlock(rank)
-
-
-@unittest.skipMPI('MPI(<3.0)')
-@unittest.skipMPI('openmpi(<1.8.1)')
-@unittest.skipMPI('MPICH2(<1.5.0)')
-class TestRMASelf(BaseTestRMA, unittest.TestCase):
- COMM = MPI.COMM_SELF
-
-#@unittest.skipMPI('MPI(<3.0)')
-#@unittest.skipMPI('openmpi(<1.8.1)')
-#@unittest.skipMPI('MPICH2(<1.5.0)')
-#class TestRMAWorld(BaseTestRMA, unittest.TestCase):
-# COMM = MPI.COMM_WORLD
-
-
-SpectrumMPI = MPI.get_vendor()[0] == 'Spectrum MPI'
-try:
- if SpectrumMPI: raise NotImplementedError
- MPI.Win.Create(None, 1, MPI.INFO_NULL, MPI.COMM_SELF).Free()
-except (NotImplementedError, MPI.Exception):
- unittest.disable(BaseTestRMA, 'mpi-rma-nb')
-
-
-if __name__ == '__main__':
- unittest.main()
diff --git a/test/test_rma_nlcpy_notimpl.py b/test/test_rma_nlcpy_notimpl.py
deleted file mode 100644
index b28d0b7..0000000
--- a/test/test_rma_nlcpy_notimpl.py
+++ /dev/null
@@ -1,459 +0,0 @@
-from mpi4pyve import MPI
-import mpiunittest as unittest
-import nlcpy_only_arrayimpl as arrayimpl
-import sys
-
-pypy_lt_53 = (hasattr(sys, 'pypy_version_info') and
- sys.pypy_version_info < (5, 3))
-
-def mkzeros(n):
- if pypy_lt_53:
- return b'\0' * n
- return bytearray(n)
-
-def memzero(m):
- try:
- m[:] = 0
- except IndexError: # cffi buffer
- m[0:len(m)] = b'\0'*len(m)
-
-class BaseTestRMA(object):
-
- COMM = MPI.COMM_NULL
- INFO = MPI.INFO_NULL
-
- def setUp(self):
- nbytes = 100*MPI.DOUBLE.size
- try:
- self.mpi_memory = MPI.Alloc_mem(nbytes)
- self.memory = self.mpi_memory
- memzero(self.memory)
- except MPI.Exception:
- import array
- self.mpi_memory = None
- self.memory = array.array('B',[0]*nbytes)
- self.WIN = MPI.Win.Create(self.memory, 1, self.INFO, self.COMM)
-
- def tearDown(self):
- self.WIN.Free()
- if self.mpi_memory:
- MPI.Free_mem(self.mpi_memory)
-
- def testPutGet(self):
- typemap = MPI._typedict
- group = self.WIN.Get_group()
- size = group.Get_size()
- group.Free()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- for count in range(10):
- for rank in range(size):
- sbuf = array(range(count), typecode)
- rbuf = array(-1, typecode, count+1)
- #
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Put(sbuf.as_mpi(), rank)
- )
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Get(rbuf.as_mpi_c(count), rank)
- )
- self.WIN.Fence()
- #for i in range(count):
- # self.assertEqual(sbuf[i], i)
- # self.assertNotEqual(rbuf[i], -1)
- #self.assertEqual(rbuf[-1], -1)
- #
- sbuf = array(range(count), typecode)
- rbuf = array(-1, typecode, count+1)
- target = sbuf.itemsize
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Put(sbuf.as_mpi(), rank, target)
- )
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Get(rbuf.as_mpi_c(count), rank, target)
- )
- self.WIN.Fence()
- #for i in range(count):
- # self.assertEqual(sbuf[i], i)
- # self.assertNotEqual(rbuf[i], -1)
- #self.assertEqual(rbuf[-1], -1)
- #
- sbuf = array(range(count), typecode)
- rbuf = array(-1, typecode, count+1)
- datatype = typemap[typecode]
- target = (sbuf.itemsize, count, datatype)
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Put(sbuf.as_mpi(), rank, target)
- )
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Get(rbuf.as_mpi_c(count), rank, target)
- )
- self.WIN.Fence()
- #for i in range(count):
- # self.assertEqual(sbuf[i], i)
- # self.assertNotEqual(rbuf[i], -1)
- #self.assertEqual(rbuf[-1], -1)
-
- def testAccumulate(self):
- group = self.WIN.Get_group()
- size = group.Get_size()
- group.Free()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- for count in range(10):
- for rank in range(size):
- sbuf = array(range(count), typecode)
- rbuf = array(-1, typecode, count+1)
- for op in (MPI.SUM, MPI.PROD, MPI.MAX, MPI.MIN):
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Accumulate(sbuf.as_mpi(), rank, op=op)
- )
- self.WIN.Fence()
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Get(rbuf.as_mpi_c(count), rank)
- )
- self.WIN.Fence()
- #for i in range(count):
- # self.assertEqual(sbuf[i], i)
- # self.assertNotEqual(rbuf[i], -1)
- #self.assertEqual(rbuf[-1], -1)
-
- @unittest.skipMPI('openmpi(>=1.10,<1.11)')
- def testGetAccumulate(self):
- group = self.WIN.Get_group()
- size = group.Get_size()
- rank = group.Get_rank()
- group.Free()
- self.WIN.Fence()
- obuf = MPI.Alloc_mem(1); memzero(obuf)
- rbuf = MPI.Alloc_mem(1); memzero(rbuf)
- try:
- try:
- self.WIN.Get_accumulate([obuf, 0, MPI.BYTE], [rbuf, 0, MPI.BYTE], rank)
- finally:
- MPI.Free_mem(obuf)
- MPI.Free_mem(rbuf)
- except NotImplementedError:
- self.skipTest('mpi-win-get_accumulate')
- self.WIN.Fence()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- for count in range(10):
- for rank in range(size):
- ones = array([1]*count, typecode)
- sbuf = array(range(count), typecode)
- rbuf = array(-1, typecode, count+1)
- gbuf = array(-1, typecode, count+1)
- for op in (MPI.SUM, MPI.PROD,
- MPI.MAX, MPI.MIN,
- MPI.REPLACE, MPI.NO_OP):
- self.WIN.Lock(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Put(ones.as_mpi(), rank)
- )
- self.WIN.Flush(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Get_accumulate(sbuf.as_mpi(),
- rbuf.as_mpi_c(count),
- rank, op=op)
- )
- self.WIN.Flush(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Get(gbuf.as_mpi_c(count), rank)
- )
- self.WIN.Flush(rank)
- self.WIN.Unlock(rank)
- #
- #for i in range(count):
- # self.assertEqual(sbuf[i], i)
- # self.assertEqual(rbuf[i], 1)
- # self.assertEqual(gbuf[i], op(1, i))
- #self.assertEqual(rbuf[-1], -1)
- #self.assertEqual(gbuf[-1], -1)
-
- def testFetchAndOp(self):
- group = self.WIN.Get_group()
- size = group.Get_size()
- rank = group.Get_rank()
- group.Free()
- #self.WIN.Fence()
- #obuf = MPI.Alloc_mem(1); memzero(obuf)
- #rbuf = MPI.Alloc_mem(1); memzero(rbuf)
- #try:
- # try:
- # self.WIN.Fetch_and_op([obuf, 1, MPI.BYTE], [rbuf, 1, MPI.BYTE], rank)
- # finally:
- # MPI.Free_mem(obuf)
- # MPI.Free_mem(rbuf)
- #except NotImplementedError:
- # self.skipTest('mpi-win-fetch_and_op')
- #self.WIN.Fence()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- obuf = array(+1, typecode)
- rbuf = array(-1, typecode, 2)
- for op in (MPI.SUM, MPI.PROD,
- MPI.MAX, MPI.MIN,
- MPI.REPLACE, MPI.NO_OP):
- for rank in range(size):
- for disp in range(3):
- self.WIN.Lock(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Fetch_and_op(obuf.as_mpi(),
- rbuf.as_mpi_c(1),
- rank, disp, op=op)
- )
- self.WIN.Unlock(rank)
- #self.assertEqual(rbuf[1], -1)
-
- def testCompareAndSwap(self):
- group = self.WIN.Get_group()
- size = group.Get_size()
- rank = group.Get_rank()
- group.Free()
- self.WIN.Fence()
- obuf = MPI.Alloc_mem(1); memzero(obuf)
- cbuf = MPI.Alloc_mem(1); memzero(cbuf)
- rbuf = MPI.Alloc_mem(1); memzero(rbuf)
- try:
- try:
- self.WIN.Compare_and_swap([obuf, 1, MPI.BYTE],
- [cbuf, 1, MPI.BYTE],
- [rbuf, 1, MPI.BYTE],
- rank, 0)
- finally:
- MPI.Free_mem(obuf)
- MPI.Free_mem(cbuf)
- MPI.Free_mem(rbuf)
- except NotImplementedError:
- self.skipTest('mpi-win-compare_and_swap')
- self.WIN.Fence()
- for array in arrayimpl.ArrayTypes:
- for typecode in arrayimpl.TypeMap:
- if typecode in 'fdg': continue
- obuf = array(+1, typecode)
- cbuf = array( 0, typecode)
- rbuf = array(-1, typecode, 2)
- for rank in range(size):
- for disp in range(3):
- self.WIN.Lock(rank)
- self.assertRaises(NotImplementedError, lambda:
- self.WIN.Compare_and_swap(obuf.as_mpi(),
- cbuf.as_mpi(),
- rbuf.as_mpi_c(1),
- rank, disp)
- )
- self.WIN.Unlock(rank)
- #self.assertEqual(rbuf[1], -1)
-
- def testPutProcNull(self):
- self.WIN.Fence()
- self.WIN.Put(None, MPI.PROC_NULL, None)
- self.WIN.Fence()
-
- def testGetProcNull(self):
- self.WIN.Fence()
- self.WIN.Get(None, MPI.PROC_NULL, None)
- self.WIN.Fence()
-
- def testAccumulateProcNullReplace(self):
- self.WIN.Fence()
- zeros = mkzeros(8)
- self.WIN.Fence()
- self.WIN.Accumulate([zeros, MPI.INT], MPI.PROC_NULL, None, MPI.REPLACE)
- self.WIN.Fence()
- self.WIN.Accumulate([zeros, MPI.INT], MPI.PROC_NULL, None, MPI.REPLACE)
- self.WIN.Fence()
-
- def testAccumulateProcNullSum(self):
- self.WIN.Fence()
- zeros = mkzeros(8)
- self.WIN.Fence()
- self.WIN.Accumulate([zeros, MPI.INT], MPI.PROC_NULL, None, MPI.SUM)
- self.WIN.Fence()
- self.WIN.Accumulate([None, MPI.INT], MPI.PROC_NULL, None, MPI.SUM)
- self.WIN.Fence()
-
- def testGetAccumulateProcNull(self):
- obuf = [mkzeros(8), 0, MPI.INT]
- rbuf = [mkzeros(8), 0, MPI.INT]
- self.WIN.Fence()
- try:
- self.WIN.Get_accumulate(obuf, rbuf, MPI.PROC_NULL)
- except NotImplementedError:
- self.skipTest('mpi-win-get_accumulate')
- self.WIN.Fence()
-
- ##def testFetchAndOpProcNull(self):
- ## obuf = cbuf = rbuf = None
- ## self.WIN.Fence()
- ## try:
- ## self.WIN.Fetch_and_op(obuf, rbuf, MPI.PROC_NULL, 0)
- ## except NotImplementedError:
- ## self.skipTest('mpi-win-fetch_and_op')
- ## self.WIN.Fence()
-
- ##def testCompareAndSwapProcNull(self):
- ## obuf = cbuf = rbuf = None
- ## self.WIN.Fence()
- ## try:
- ## self.WIN.Compare_and_swap(obuf, cbuf, rbuf, MPI.PROC_NULL, 0)
- ## except NotImplementedError:
- ## self.skipTest('mpi-win-compare_and_swap')
- ## self.WIN.Fence()
-
- def testFence(self):
- win = self.WIN
- LMODE = [0, MPI.MODE_NOSTORE, MPI.MODE_NOPUT,
- MPI.MODE_NOSTORE|MPI.MODE_NOPUT]
- GMODE = [0, MPI.MODE_NOPRECEDE, MPI.MODE_NOSUCCEED]
- win.Fence()
- for lmode in LMODE:
- for gmode in GMODE:
- assertion = lmode | gmode
- win.Fence(assertion)
- win.Fence()
-
- @unittest.skipMPI('openmpi(==1.8.1)')
- def testFenceAll(self):
- win = self.WIN
- assertion = 0
- modes = [0,
- MPI.MODE_NOSTORE,
- MPI.MODE_NOPUT,
- MPI.MODE_NOPRECEDE,
- MPI.MODE_NOSUCCEED]
- win.Fence()
- for mode in modes:
- win.Fence(mode)
- assertion |= mode
- win.Fence(assertion)
- win.Fence()
-
- @unittest.skipMPI('openmpi(==1.8.6)')
- def testStartComplete(self):
- self.WIN.Start(MPI.GROUP_EMPTY)
- self.WIN.Complete()
-
- @unittest.skipMPI('openmpi(==1.8.6)')
- def testPostWait(self):
- self.WIN.Post(MPI.GROUP_EMPTY)
- self.WIN.Wait()
-
- @unittest.skipMPI('openmpi(==1.8.7)')
- @unittest.skipMPI('openmpi(==1.8.6)')
- def testStartCompletePostWait(self):
- win = self.WIN
- wingroup = win.Get_group()
- size = wingroup.Get_size()
- rank = wingroup.Get_rank()
- if size < 2: return wingroup.Free()
- if rank == 0:
- group = wingroup.Excl([0])
- win.Start(group)
- win.Complete()
- win.Post(group)
- win.Wait()
- group.Free()
- else:
- group = wingroup.Incl([0])
- win.Post(group)
- win.Wait()
- win.Start(group)
- win.Complete()
- group.Free()
- wingroup.Free()
-
- @unittest.skipMPI('openmpi(==1.8.7)')
- @unittest.skipMPI('openmpi(==1.8.6)')
- def testStartCompletePostTest(self):
- comm = self.COMM
- win = self.WIN
- wingroup = win.Get_group()
- size = wingroup.Get_size()
- rank = wingroup.Get_rank()
- if size < 2: return wingroup.Free()
- if rank == 0:
- group = wingroup.Excl([0])
- win.Start(group)
- comm.Barrier()
- win.Complete()
- comm.Barrier()
- group.Free()
- else:
- group = wingroup.Incl([0])
- win.Post(group)
- flag = win.Test()
- self.assertFalse(flag)
- comm.Barrier()
- comm.Barrier()
- flag = win.Test()
- self.assertTrue(flag)
- group.Free()
- wingroup.Free()
-
- @unittest.skipMPI('MPI(<3.0)')
- def testSync(self):
- win = self.WIN
- comm = self.COMM
- rank = comm.Get_rank()
- win.Lock(rank)
- win.Sync()
- win.Unlock(rank)
- comm.Barrier()
-
- @unittest.skipMPI('MPI(<3.0)')
- def testFlush(self):
- win = self.WIN
- comm = self.COMM
- size = comm.Get_size()
- rank = comm.Get_rank()
- #
- for i in range(size):
- win.Lock(i)
- win.Flush(i)
- win.Unlock(i)
- comm.Barrier()
- for i in range(size):
- if i == rank:
- win.Lock_all()
- win.Flush_all()
- win.Unlock_all()
- comm.Barrier()
- #
- for i in range(size):
- win.Lock(i)
- win.Flush_local(i)
- win.Unlock(i)
- comm.Barrier()
- for i in range(size):
- if i == rank:
- win.Lock_all()
- win.Flush_local_all()
- win.Unlock_all()
- comm.Barrier()
-
-class TestRMASelf(BaseTestRMA, unittest.TestCase):
- COMM = MPI.COMM_SELF
-
-#class TestRMAWorld(BaseTestRMA, unittest.TestCase):
-# COMM = MPI.COMM_WORLD
-
-
-SpectrumMPI = MPI.get_vendor()[0] == 'Spectrum MPI'
-try:
- if SpectrumMPI: raise NotImplementedError
- MPI.Win.Create(None, 1, MPI.INFO_NULL, MPI.COMM_SELF).Free()
-except (NotImplementedError, MPI.Exception):
- unittest.disable(BaseTestRMA, 'mpi-rma')
-
-
-if __name__ == '__main__':
- unittest.main()
diff --git a/test/test_vai.py b/test/test_vai.py
new file mode 100644
index 0000000..63bd918
--- /dev/null
+++ b/test/test_vai.py
@@ -0,0 +1,103 @@
+from mpi4pyve import MPI
+import mpiunittest as unittest
+import nlcpy as vp
+import numpy as np
+import nlcpy_ndarray_wrapper
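+
+# These tests exchange NLCPy arrays through the VE Array Interface (VAI),
+# i.e. the __ve_array_interface__ attribute exposed by nlcpy.ndarray.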
+
+class BaseTestVAIBuf(object):
+
+ COMM = MPI.COMM_NULL
+
+ def testSendrecv(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ dest = (rank + 1) % size
+ source = (rank - 1) % size
+ if size < 1: return
+
+ if rank == 0:
+ sbuf = vp.arange(10)[2:]
+ rbuf = vp.array(vp.zeros(8), dtype='int')
+ self.COMM.Sendrecv(sbuf, 0, 0,
+ rbuf, 0, 0)
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+
+ def testSendRecv(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ sbuf = vp.arange(10)[2:]
+ rbuf = vp.array(vp.zeros(8), dtype='int')
+ if size < 2: return
+ if rank == 0:
+ self.COMM.Send(sbuf, dest=1)
+ elif rank == 1:
+ self.COMM.Recv(rbuf, source=0)
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+
+ def testPickledSendrecv(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ dest = (rank + 1) % size
+ source = (rank - 1) % size
+ if size < 1: return
+
+ if rank == 0:
+ sbuf = vp.arange(10)[2:]
+ rbuf = vp.array(vp.zeros(8), dtype='int')
+ rbuf = self.COMM.sendrecv(sbuf, dest=0, source=0)
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+
+ def testPickledSendRecv(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ sbuf = vp.arange(10)[2:]
+ rbuf = vp.array(vp.zeros(8), dtype='int')
+ if size < 2: return
+ if rank == 0:
+ self.COMM.send(sbuf, dest=1)
+ elif rank == 1:
+ rbuf = self.COMM.recv()
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+
+ def testVAIReadOnly(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ sbuf = nlcpy_ndarray_wrapper.nlcpy_ndarray_wrapper(shape=(8,))
+ sbuf.fill(0)
+ rbuf = nlcpy_ndarray_wrapper.nlcpy_ndarray_wrapper(shape=(8,))
+ rbuf.fill(0)
+ if size < 2: return
+ if rank == 0:
+ self.COMM.Send(sbuf, dest=1)
+ elif rank == 1:
+ self.COMM.Recv(rbuf, source=0)
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+ sbuf.set_read_only_flag(True)
+ rbuf.set_read_only_flag(True)
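+ # with the read-only flag set, Recv must raise BufferError up front,
+ # so the matching Send stays commented out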
+ if rank == 0:
+ #self.COMM.Send(sbuf, dest=1)
+ pass
+ elif rank == 1:
+ with self.assertRaises(BufferError):
+ self.COMM.Recv(rbuf, source=0)
+ else:
+ pass
+
+class TestVAIBufSelf(BaseTestVAIBuf, unittest.TestCase):
+ COMM = MPI.COMM_SELF
+
+class TestVAIBufWorld(BaseTestVAIBuf, unittest.TestCase):
+ COMM = MPI.COMM_WORLD
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/test/test_vai_bool_array.py b/test/test_vai_bool_array.py
new file mode 100644
index 0000000..7ca15ee
--- /dev/null
+++ b/test/test_vai_bool_array.py
@@ -0,0 +1,60 @@
+from mpi4pyve import MPI
+import mpiunittest as unittest
+import nlcpy as vp
+import numpy as np
+import nlcpy_ndarray_wrapper
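+
+# Bool-dtype VE arrays are sent both with the datatype inferred from the
+# array and with an explicit MPI.BOOL.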
+
+class BaseTestVAIBuf(object):
+
+ COMM = MPI.COMM_NULL
+
+ def testSendRecvbool(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ sbuf = vp.array([True, True, True], dtype='bool')
+ rbuf = vp.array([False, False, False], dtype='bool')
+ if size < 2: return
+ if rank == 0:
+ self.COMM.Send(sbuf, dest=1)
+ elif rank == 1:
+ self.COMM.Recv(rbuf, source=0)
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+
+ def testSendRecvMPIBOOL(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ sbuf = vp.array([True, True, True], dtype='bool')
+ rbuf = vp.array([False, False, False], dtype='bool')
+ if size < 2: return
+ if rank == 0:
+ self.COMM.Send([sbuf, MPI.BOOL], dest=1)
+ elif rank == 1:
+ self.COMM.Recv([rbuf, MPI.BOOL], source=0)
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+
+ def testPickledbool(self):
+ size = self.COMM.Get_size()
+ rank = self.COMM.Get_rank()
+ sbuf = vp.array([True, True, True], dtype='bool')
+ rbuf = vp.array([False, False, False], dtype='bool')
+ if size < 2: return
+ if rank == 0:
+ self.COMM.send(sbuf, dest=1)
+ elif rank == 1:
+ rbuf = self.COMM.recv()
+ self.assertTrue(np.allclose(sbuf, rbuf))
+ else:
+ pass
+
+class TestVAIBufSelf(BaseTestVAIBuf, unittest.TestCase):
+ COMM = MPI.COMM_SELF
+
+class _TestVAIBufWorld(BaseTestVAIBuf, unittest.TestCase):
+ COMM = MPI.COMM_WORLD
+
+if __name__ == '__main__':
+ unittest.main()
diff --git a/test/test_vai_notimpl.py b/test/test_vai_notimpl.py
new file mode 100644
index 0000000..5adc67a
--- /dev/null
+++ b/test/test_vai_notimpl.py
@@ -0,0 +1,46 @@
+from mpi4pyve import MPI
+import mpiunittest as unittest
+import nlcpy
+import numpy
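+
+# VE-array arguments to buffer attach, (i)bsend, Reduce_local, and
+# Pack/Unpack_external are expected to raise NotImplementedError.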
+
+class TestVAI_Notimpl(unittest.TestCase):
+
+ def testRestriction(self):
+ src = nlcpy.arange(10)
+ dst = nlcpy.arange(10)
+ np_src = numpy.arange(10)
+ np_dst = numpy.arange(10)
+
+ with self.assertRaises(NotImplementedError):
+ MPI.Attach_buffer(src)
+
+ comm = MPI.COMM_SELF
+ with self.assertRaises(NotImplementedError):
+ comm.Bsend(src, 0)
+ with self.assertRaises(NotImplementedError):
+ comm.Ibsend(src, 0)
+ with self.assertRaises(NotImplementedError):
+ comm.Bsend_init(src, 0)
+ with self.assertRaises(NotImplementedError):
+ comm.bsend(src, 0)
+ with self.assertRaises(NotImplementedError):
+ comm.ibsend(src, 0)
+
+ op = MPI.SUM
+ with self.assertRaises(NotImplementedError):
+ op.Reduce_local(src, np_src)
+ with self.assertRaises(NotImplementedError):
+ op.Reduce_local(np_src, src)
+
+ data_type = MPI.INT
+ with self.assertRaises(NotImplementedError):
+ data_type.Pack_external('external32', src, np_src, 0)
+ with self.assertRaises(NotImplementedError):
+ data_type.Pack_external('external32', np_src, src, 0)
+ with self.assertRaises(NotImplementedError):
+ data_type.Unpack_external('external32', src, 0, np_src)
+ with self.assertRaises(NotImplementedError):
+ data_type.Unpack_external('external32', np_src, 0, src)
+
+if __name__ == '__main__':
+ unittest.main()