This is a Python package that contains a parallel implementation for a subset of `numpy.einsum` functionality.
`numpy.einsum` is a fantastic function for multiplying numpy arrays. However, `numpy.dot` and `numpy.tensordot` are typically faster, especially if numpy is linked to a parallel implementation of BLAS: then `numpy.dot` and `numpy.tensordot` will take advantage of the multiple CPUs, whereas `numpy.einsum` remains single-threaded.
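For contrast, a plain matrix product is an `einsum` that *can* be written as a `tensordot` (a quick sketch with arbitrary shapes):

```python
import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)

# Contracting the shared axis m: "im,mj->ij" is an ordinary matrix
# product, so tensordot (and hence parallel BLAS) applies.
assert np.allclose(np.einsum("im,mj->ij", A, B),
                   np.tensordot(A, B, axes=([1], [0])))
```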
The trouble is, some `einsum` products are impossible to express as `dot` or `tensordot`. For example, `numpy.einsum("ijm,kmi->ijk", A, B)` returns a tensor $C$ with $C_{ijk} = \sum_m A_{ijm} B_{kmi}$. This cannot be expressed as a `dot` or `tensordot` because the shared axis $i$ is included in the output $C$.
This operation can be expressed in terms of `numpy.matmul`; in particular, this example is equivalent to `numpy.matmul(A, np.transpose(B))`.
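A minimal sketch (with hypothetical shapes) verifying that equivalence:

```python
import numpy as np

A = np.random.rand(2, 3, 4)  # axes (i, j, m)
B = np.random.rand(5, 4, 2)  # axes (k, m, i)

C1 = np.einsum("ijm,kmi->ijk", A, B)
# np.transpose(B) reverses the axes, (k, m, i) -> (i, m, k), so matmul
# contracts the shared m axis while batching over the leading i axis.
C2 = np.matmul(A, np.transpose(B))
assert np.allclose(C1, C2)
```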
However, on my machine, `numpy.matmul` does not appear to take advantage of parallel BLAS with multiple cores. This may change in the future (last year Intel introduced batched GEMM operations), but in the meantime, you can use `einsum2` to parallelize such `einsum` calls.
`einsum2` is also compatible with the autograd package for automatic differentiation.
Prerequisites are a C compiler that supports OpenMP, Cython, numpy, and pip. Assuming you satisfy these requirements, you can install with `pip install .` at the top-level directory of `einsum2`.
If you have problems installing or running, it's probably because of your C compiler. On OSX in particular, the default compiler `clang` does not support OpenMP and thus won't work. I've also occasionally run into issues using Anaconda Python on Linux, apparently because of the old gcc-4 it was compiled with; however, I'm currently unable to reproduce this error. If you are having issues with the C compiler, it is recommended to install and run `einsum2` in a virtual environment with a compatible C compiler.
For example, if you use Anaconda Python, you can do the following:
- Create a new virtual environment named `env_name` with `conda create -n env_name python=3.5 anaconda` (alternatively, you can use `python=2.7 anaconda`).
- Switch to the new environment with `source activate env_name`.
- Do `conda install gcc` to install the Anaconda distribution of `gcc`, which both supports OpenMP and is fully compatible with Anaconda Python. Note this clobbers your system `gcc`, which is why we are using a virtual environment here.
- Finally, install `einsum2` with `CC=gcc pip install .` (the `CC=gcc` is required on OSX; otherwise the installation will try to use the default `clang` compiler, which does not support OpenMP).
To test whether the parallelization is working, try `sh test/profile_einsum2.sh`; you should see a speedup for the function calls using multiple threads (`OMP_NUM_THREADS > 1`).
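You can also time a call directly. Here is a rough benchmark sketch (the shapes and the script name `bench.py` are arbitrary assumptions); run it as, e.g., `OMP_NUM_THREADS=1 python bench.py` versus `OMP_NUM_THREADS=4 python bench.py` and compare:

```python
import time

import numpy as np
import einsum2

# Arbitrary cube-shaped inputs, large enough that threading matters.
A = np.random.rand(100, 100, 100)
B = np.random.rand(100, 100, 100)

start = time.time()
einsum2.einsum2("ijm,kmi->ijk", A, B)
print("einsum2 took {:.3f} seconds".format(time.time() - start))
```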
The `einsum2.einsum2` function computes `einsum` products that can be expressed in terms of `matmul`, `transpose`, and `reshape` operations. It can have either the form

`einsum2.einsum2(a, a_sublist, b, b_sublist, out_sublist)`

or

`einsum2.einsum2(subscripts_str, a, b)`
Here `a` and `b` are tensors. In the first format, `a_sublist` and `b_sublist` label the indices of `a` and `b`, while `out_sublist` gives the indices of the output array. In the second format, `subscripts_str` is a string specifying the subscripts of `a`, `b`, and the output, and looks like `a_subs,b_subs->out_subs`.
The official numpy.einsum documentation provides more details about specifying the subscripts; here is an example. The following two calls are equivalent: `einsum2(A, [0,1,2,3], B, [4,1,0], [0,2,4])` and `einsum2("ijkm,nji->ikn", A, B)`. Both return a tensor $C$ with $C_{ikn} = \sum_{j,m} A_{ijkm} B_{nji}$.
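As a sanity check, here is a short sketch (with hypothetical shapes) showing that both calling conventions agree with `numpy.einsum`:

```python
import numpy as np
import einsum2

A = np.random.rand(2, 3, 4, 5)  # subscripts i, j, k, m
B = np.random.rand(6, 3, 2)     # subscripts n, j, i

C1 = einsum2.einsum2(A, [0, 1, 2, 3], B, [4, 1, 0], [0, 2, 4])
C2 = einsum2.einsum2("ijkm,nji->ikn", A, B)
assert np.allclose(C1, C2)
assert np.allclose(C1, np.einsum("ijkm,nji->ikn", A, B))
```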
Unlike `numpy.einsum`, we allow the subscripts in the first format to be a list of arbitrary hashable keys, so `einsum2(A, ["zero","one","two","three"], B, ["four","one","zero"], ["zero","two","four"])` would also be allowed.
`einsum2` has several limitations compared to `numpy.einsum`: it only operates on two input arrays, does not allow diagonal operations (repeated subscripts on the same array), and requires the output subscripts to always be specified.
Unlike the standard `einsum`, `einsum2` performs its computations in parallel. The number of parallel threads is selected automatically, but you can also control it with the environment variable `OMP_NUM_THREADS`.
To perform the parallel computation, `einsum2` will either use `numpy.dot` (if possible) or else a parallel for loop. The advantage of using `numpy.dot` is that it uses BLAS, which is much faster than a for loop. However, you need to make sure numpy is compiled against a parallel BLAS implementation such as MKL or OpenBLAS. You won't need to worry about this for most packaged, precompiled versions of numpy (e.g. Anaconda Python).
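One way to check which BLAS your numpy is linked against is numpy's built-in config printer (the exact output varies by numpy version and platform):

```python
import numpy as np

# Prints the build configuration; look for MKL or OpenBLAS entries.
np.__config__.show()
```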
`einsum2.batched_dot` is a parallel implementation of `numpy.matmul`. More specifically, for 3-tensors `a` and `b`, `einsum2.batched_dot(a, b)` computes `numpy.matmul(a, b)` in parallel. `batched_dot` is currently only implemented for `a` and `b` that are 3-tensors. If the leading dimension has length 1, then `batched_dot` will use `numpy.dot` to take advantage of BLAS.
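A small sketch of its use, with hypothetical shapes:

```python
import numpy as np
import einsum2

a = np.random.rand(10, 3, 4)
b = np.random.rand(10, 4, 5)

# Matrix-multiplies the 10 pairs of (3, 4) and (4, 5) matrices in parallel.
c = einsum2.batched_dot(a, b)
assert np.allclose(c, np.matmul(a, b))
```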
`einsum2.einsum1` is a convenience function for `einsum` operations on a single array. In particular, `einsum2.einsum1(in_arr, in_sublist, out_sublist)` returns an array `out_arr` that is derived from `in_arr` but with subscripts given by `out_sublist`: all subscripts of `in_sublist` not in `out_sublist` are summed out, and then the axes of `in_arr` are rearranged to match `out_sublist`.
As with `einsum2`, arbitrary hashable keys are allowed to label the subscripts in `einsum1`. Also as with `einsum2`, repeated subscripts (i.e. diagonal operations) are not supported.
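A short sketch (hypothetical shapes) that sums out one subscript and transposes the rest:

```python
import numpy as np
import einsum2

in_arr = np.random.rand(2, 3, 4)  # subscripts "i", "j", "m"

# "m" is absent from out_sublist, so it is summed out; the remaining
# axes are then reordered to ("j", "i").
out_arr = einsum2.einsum1(in_arr, ["i", "j", "m"], ["j", "i"])
assert np.allclose(out_arr, np.einsum("ijm->ji", in_arr))
```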