Skip to content

Commit

Permalink
CUDA support for ImageBufAlgo (experimental and very incomplete)
Browse files Browse the repository at this point in the history
First stab at this, it's experimental, the general organization may change
as we extend it.

* To get these features, you must build with `USE_CUDA=1`, in which case
  it will look for Cuda toolkit. For simplicity, we're setting a version
  floor of Cuda 7.0 and sm_30.

* To enable at runtime (duh, still only if you built with Cuda support
  enabled), you can either set `OIIO::attribute("cuda",1)` or use the
  magic environment variable `OPENIMAGEIO_CUDA=1`. When running oiiotool,
  the command line argument `--cuda` turns the attribut on (or cheat with
  the aforementioned env variable).

* When the attribute is set, ImageBuf of "local" (not ImageCache-backed)
  float (no other data types yet) buffers will allocate and free with
  cudaMallocManaged/cudaFree (other cases will use the usual malloc/free).
  We are thus heavily leveraging Unified Memory, never do any explicit
  copying of data back and forth.

* Certain ImageBufAlgo functions, then, have the options of calling
  Cuda implementations when all the stars align -- Cuda support enabled,
  Cuda turned on, the ImageBufs in question all have local storage that
  was allocated as visible to Cuda, the buffers are all float, and other
  restrctions to just the most common cases (all image inputs have
  identical ROIs, etc.).

* Implemented this for IBA::add() and sub() initially. Will extend to
  other operations in the future and as the need arises.

Results and discussion:

Perf: add and sub operations on 1920x1080 3 channel float images, on my
workstation (16 core Xeon Silver 4110, it's ISA is AVX-512 but I'm only
compiling for SSE4.2 support at the moment) runs in about 20ms single
threaded, ~3.8ms multithreaded. With Cuda enabled (NVIDIA Quadro P5000,
Pascal architecture), I am getting about 12ms (i.e., moderately faster
than single core, quite a bit slower than fully using all the CPU
cores).

Now, this is not an especially good case for GPU -- the compute-to-memory
ratio is very poor, just a single math op for every 12 bytes of transfer
on or off the GPU. When I contrive to do an example with about 10x more
math per pixel, the Cuda times are approximately equal to the CPU times
when I take advantage of all the CPU cores. Maybe it only helps if we
do a bunch of IBA operations in a row before needing the results. Maybe
it's only worth Cuda-accelerating the most expensive operations (resize,
area ops, etc.), but we'll never get gain from something simple like add?

If anybody can point out ways in which I'm being very wasteful, please do
let me know!

Even after we flesh out many more image operations to be
Cuda-accelerated, and even we see an improvement in all cases over CPU,
I don't expect people to see much practical improvement in a typical
oiiotool command line, since disk/network to read input images and write
results are almost certain to dominate runtime, compared to the
math. But if you have a program that's doing a whole bunch of repeated
image math via IBA calls themselves, that's where the bigger payoff is
going to be, I think.

Note that CUDA is extremely finicky about what compilers it can use,
with an especially narrow idea of which "host compiler" is required by
each version of the Cuda Toolkit/nvcc. I'm still working through those
issues, and am considering the merits of compiling the cuda itself with
clang (if available) rather than nvcc, just to ease up on these
requirements. We'll be making the rest of the build issues more robust
over time as well.
  • Loading branch information
lgritz committed Apr 27, 2018
1 parent 6daaefd commit 1997433
Show file tree
Hide file tree
Showing 15 changed files with 521 additions and 19 deletions.
5 changes: 5 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -223,6 +223,10 @@ ifneq (${BUILD_OIIOUTIL_ONLY},)
MY_CMAKE_FLAGS += -DBUILD_OIIOUTIL_ONLY:BOOL=${BUILD_OIIOUTIL_ONLY}
endif

ifneq (${USE_CUDA},)
MY_CMAKE_FLAGS += -DUSE_CUDA:BOOL=${USE_CUDA}
endif

ifdef DEBUG
MY_CMAKE_FLAGS += -DCMAKE_BUILD_TYPE:STRING=Debug
endif
Expand Down Expand Up @@ -484,6 +488,7 @@ help:
@echo " USE_OPENCV=0 Skip anything that needs OpenCV"
@echo " USE_PTEX=0 Skip anything that needs PTex"
@echo " USE_FREETYPE=0 Skip anything that needs Freetype"
@echo " USE_CUDA=1 Build NVIDIA CUDA support (if found)"
@echo " OIIO build-time options:"
@echo " INSTALL_PREFIX=path Set installation prefix (default: ./${INSTALL_PREFIX_BRIEF})"
@echo " NAMESPACE=name Override namespace base name (default: OpenImageIO)"
Expand Down
5 changes: 5 additions & 0 deletions site/spi/Makefile-bits-arnold
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,11 @@ ifeq (${SP_OS}, rhel7)
-DCMAKE_CXX_COMPILER=${LLVM_DIRECTORY}/bin/clang++
endif

# CUDA customizations
MY_CMAKE_FLAGS += \
-DCUDA_TOOLKIT_ROOT_DIR=/net/soft_scratch/apps/arnold/tools/nvidia/cuda9.1 \
-DCUDA_HOST_COMPILER=/shots/spi/home/lib/arnold/rhel7/llvm_4.0_final/bin/clang++

MY_CMAKE_FLAGS += \
-DOPENEXR_CUSTOM_INCLUDE_DIR=/usr/include/OpenEXR2 \
-DOPENEXR_CUSTOM_LIB_DIR=/usr/lib64/OpenEXR2 \
Expand Down
2 changes: 2 additions & 0 deletions src/cmake/compiler.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,8 @@ option (CLANG_TIDY "Enable clang-tidy" OFF)
set (CLANG_TIDY_CHECKS "-*" CACHE STRING "clang-tidy checks to perform")
set (CLANG_TIDY_ARGS "" CACHE STRING "clang-tidy args")
option (CLANG_TIDY_FIX "Have clang-tidy fix source" OFF)
set (USE_CUDA OFF CACHE BOOL "Use CUDA if found")
set (CUDA_TARGET_ARCH "sm_30" CACHE STRING "CUDA GPU architecture (e.g. sm_35)")

# Figure out which compiler we're using
if (CMAKE_COMPILER_IS_GNUCC)
Expand Down
24 changes: 24 additions & 0 deletions src/cmake/externalpackages.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
# When not in VERBOSE mode, try to make things as quiet as possible
if (NOT VERBOSE)
set (Boost_FIND_QUIETLY true)
set (CUDA_FIND_QUIETLY true)
set (DCMTK_FIND_QUIETLY true)
set (FFmpeg_FIND_QUIETLY true)
set (Field3D_FIND_QUIETLY true)
Expand Down Expand Up @@ -557,4 +558,27 @@ endmacro()


###########################################################################
if (USE_CUDA)
if (NOT CUDA_TOOLKIT_ROOT_DIR AND NOT $ENV{CUDA_TOOLKIT_ROOT_DIR} STREQUAL "")
set (CUDA_TOOLKIT_ROOT_DIR $ENV{CUDA_TOOLKIT_ROOT_DIR})
endif ()
if (NOT CUDA_FIND_QUIETLY)
message (STATUS "CUDA_TOOLKIT_ROOT_DIR = ${CUDA_TOOLKIT_ROOT_DIR}")
endif ()
set (CUDA_PROPAGATE_HOST_FLAGS ON)
set (CUDA_VERBOSE_BUILD ${VERBOSE})
find_package (CUDA 7.0 REQUIRED)
list (APPEND CUDA_NVCC_FLAGS ${CSTD_FLAGS} -expt-relaxed-constexpr)
set (CUDA_INCLUDE_DIR ${CUDA_TOOLKIT_ROOT_DIR}/include)
message (STATUS "CUDA version = ${CUDA_VERSION}")
if (NOT CUDA_FIND_QUIETLY)
message (STATUS "CUDA includes = ${CUDA_INCLUDE_DIR}")
message (STATUS "CUDA libraries = ${CUDA_LIBRARIES}")
message (STATUS "CUDA host compiler = ${CUDA_HOST_COMPILER}")
message (STATUS "CUDA nvcc flags = ${CUDA_NVCC_FLAGS}")
endif ()
endif ()

# end Cuda
###########################################################################

3 changes: 3 additions & 0 deletions src/include/OpenImageIO/imagebuf.h
Original file line number Diff line number Diff line change
Expand Up @@ -172,6 +172,9 @@ class OIIO_API ImageBuf {
/// Which type of storage is being used for the pixels?
IBStorage storage () const;

/// Is the pixel memory of this ImageBuf visible to Cuda?
bool cuda_storage () const;

/// Is this ImageBuf object initialized?
bool initialized () const;

Expand Down
2 changes: 2 additions & 0 deletions src/include/OpenImageIO/platform.h
Original file line number Diff line number Diff line change
Expand Up @@ -323,8 +323,10 @@
// OIIO_HOSTDEVICE is used to supply the function decorators needed when
// compiling for CUDA devices.
#ifdef __CUDACC__
# define OIIO_HOST __host__
# define OIIO_HOSTDEVICE __host__ __device__
#else
# define OIIO_HOST
# define OIIO_HOSTDEVICE
#endif

Expand Down
26 changes: 22 additions & 4 deletions src/libOpenImageIO/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,12 @@ file (TO_NATIVE_PATH "${PLUGIN_SEARCH_PATH}" PLUGIN_SEARCH_PATH_NATIVE)
configure_file (imageio_pvt.h.in "${CMAKE_CURRENT_BINARY_DIR}/imageio_pvt.h" @ONLY)
include_directories("${CMAKE_CURRENT_BINARY_DIR}")

if (BUILDSTATIC)
set (OIIO_LIB_TYPE "STATIC")
else ()
set (OIIO_LIB_TYPE "SHARED")
endif ()

file (GLOB libOpenImageIO_hdrs ../include/OpenImageIO/*.h)

if (NOT USE_EXTERNAL_PUGIXML)
Expand Down Expand Up @@ -48,6 +54,7 @@ set (libOpenImageIO_srcs
imageoutput.cpp iptc.cpp xmp.cpp
color_ocio.cpp
maketexture.cpp
imageio_cuda.cpp
../libutil/argparse.cpp
../libutil/benchmark.cpp
../libutil/errorhandler.cpp
Expand All @@ -74,6 +81,13 @@ set (libOpenImageIO_srcs
${libOpenImageIO_hdrs}
)

#set (cuda_using_srcs imagebufalgo_addsub.cpp)

if (USE_CUDA)
file (GLOB gpu_source_files "*.cu")
message (STATUS "Extra cuda files: ${gpu_source_files}")
list (APPEND libOpenImageIO_srcs ${gpu_source_files})
endif ()

# If the 'EMBEDPLUGINS' option is set, we want to compile the source for
# all the plugins into libOpenImageIO.
Expand Down Expand Up @@ -101,10 +115,12 @@ endif ()
source_group ("libutil" REGULAR_EXPRESSION ".+/libutil/.+")
source_group ("libtexture" REGULAR_EXPRESSION ".+/libtexture/.+")

if (BUILDSTATIC)
add_library (OpenImageIO STATIC ${libOpenImageIO_srcs})
if (USE_CUDA)
add_definitions ("-DOIIO_USE_CUDA=1")
cuda_add_library (OpenImageIO ${OIIO_LIB_TYPE} ${libOpenImageIO_srcs}
OPTIONS -arch ${CUDA_TARGET_ARCH})
else ()
add_library (OpenImageIO SHARED ${libOpenImageIO_srcs})
add_library (OpenImageIO ${OIIO_LIB_TYPE} ${libOpenImageIO_srcs})
endif ()

# if (SANITIZE AND ${CMAKE_SYSTEM_NAME} STREQUAL "Linux")
Expand Down Expand Up @@ -207,7 +223,9 @@ if (EXTRA_DSO_LINK_ARGS)
set_target_properties (OpenImageIO PROPERTIES LINK_FLAGS ${EXTRA_DSO_LINK_ARGS})
endif()

oiio_install_targets (OpenImageIO)
oiio_install_targets (OpenImageIO
#${OIIO_CUDA_LIB}
)


# Testing
Expand Down
76 changes: 61 additions & 15 deletions src/libOpenImageIO/imagebuf.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,8 @@
#include <OpenImageIO/fmath.h>
#include <OpenImageIO/thread.h>
#include <OpenImageIO/simd.h>
#include "imageio_pvt.h"


OIIO_NAMESPACE_BEGIN

Expand Down Expand Up @@ -143,6 +145,7 @@ class ImageBufImpl {
void append_error (const std::string& message) const;

ImageBuf::IBStorage storage () const { return m_storage; }
bool cuda_storage () const { return m_cuda_storage; }

TypeDesc pixeltype () const {
validate_spec ();
Expand Down Expand Up @@ -193,6 +196,7 @@ class ImageBufImpl {
m_current_subimage, m_current_miplevel);
}

// Make sure the pixels are ready to read with an iterator.
bool validate_pixels () const {
if (m_pixels_valid)
return true;
Expand Down Expand Up @@ -242,6 +246,21 @@ class ImageBufImpl {
return (z * m_spec.height + y) * m_spec.width + x;
}

void release_pixels () {
IB_local_mem_current -= m_allocated_size;
#if OIIO_USE_CUDA
if (m_cuda_storage) {
OIIO::debug ("IB Cuda free %p\n", (void*)m_pixels.get());
OIIO::pvt::cuda_free (m_pixels.release());
m_cuda_storage = false;
}
#endif
m_pixels.reset();
m_localpixels = nullptr;
m_pixels_valid = false;
m_allocated_size = 0;
}

private:
ImageBuf::IBStorage m_storage; ///< Pixel storage class
ustring m_name; ///< Filename of the image
Expand All @@ -258,6 +277,7 @@ class ImageBufImpl {
mutable spin_mutex m_valid_mutex;
mutable bool m_spec_valid; ///< Is the spec valid
mutable bool m_pixels_valid; ///< Image is valid
mutable bool m_cuda_storage = false; ///< Is the pixel memory visible to Cuda?
bool m_badfile; ///< File not found
float m_pixelaspect; ///< Pixel aspect ratio of the image
size_t m_pixel_bytes;
Expand Down Expand Up @@ -358,23 +378,23 @@ ImageBufImpl::ImageBufImpl (const ImageBufImpl &src)
{
m_spec_valid = src.m_spec_valid;
m_pixels_valid = src.m_pixels_valid;
m_allocated_size = src.m_localpixels ? src.spec().image_bytes() : 0;
m_allocated_size = 0;
IB_local_mem_current += m_allocated_size;
if (src.m_localpixels) {
// Source had the image fully in memory (no cache)
if (m_storage == ImageBuf::APPBUFFER) {
// Source just wrapped the client app's pixels, we do the same
m_allocated_size = src.m_localpixels ? src.spec().image_bytes() : 0;
m_localpixels = src.m_localpixels;
} else {
// We own our pixels -- copy from source
m_pixels.reset (new char [src.m_spec.image_bytes()]);
realloc ();
memcpy (m_pixels.get(), src.m_pixels.get(), m_spec.image_bytes());
m_localpixels = m_pixels.get();
}
} else {
// Source was cache-based or deep
// nothing else to do
m_localpixels = NULL;
m_localpixels = nullptr;
}
if (src.m_configspec)
m_configspec.reset (new ImageSpec(*src.m_configspec));
Expand All @@ -388,7 +408,7 @@ ImageBufImpl::~ImageBufImpl ()
// externally and passed to the ImageBuf ctr or reset() method, or
// else init_spec requested the system-wide shared cache, which
// does not need to be destroyed.
IB_local_mem_current -= m_allocated_size;
release_pixels ();
}


Expand Down Expand Up @@ -513,6 +533,14 @@ ImageBuf::storage () const



bool
ImageBuf::cuda_storage () const
{
return impl()->cuda_storage ();
}



void
ImageBufImpl::clear ()
{
Expand All @@ -524,10 +552,8 @@ ImageBufImpl::clear ()
m_current_miplevel = -1;
m_spec = ImageSpec ();
m_nativespec = ImageSpec ();
m_pixels.reset ();
m_localpixels = NULL;
release_pixels ();
m_spec_valid = false;
m_pixels_valid = false;
m_badfile = false;
m_pixelaspect = 1;
m_pixel_bytes = 0;
Expand Down Expand Up @@ -629,20 +655,41 @@ ImageBuf::reset (const ImageSpec &spec)
void
ImageBufImpl::realloc ()
{
IB_local_mem_current -= m_allocated_size;
release_pixels ();
m_allocated_size = m_spec.deep ? size_t(0) : m_spec.image_bytes ();
IB_local_mem_current += m_allocated_size;
m_pixels.reset (m_allocated_size ? new char [m_allocated_size] : NULL);
m_localpixels = m_pixels.get();
m_storage = m_allocated_size ? ImageBuf::LOCALBUFFER : ImageBuf::UNINITIALIZED;
m_cuda_storage = false;
if (m_allocated_size) {
#ifdef OIIO_USE_CUDA
if (OIIO::get_int_attribute("cuda") && m_spec.format == TypeFloat) {
char *cudaptr = (char *)OIIO::pvt::cuda_malloc (m_allocated_size);
if (cudaptr) {
OIIO::debug ("IB Cuda allocated %p\n", (void*)cudaptr);
m_pixels.reset (cudaptr);
m_cuda_storage = true;
}
else {
OIIO::debug ("Requested cudaMallocManaged of %s FAILED\n",
m_allocated_size);
}
}
#endif
if (! m_pixels) // no cuda, or cuda failed
m_pixels.reset (new char [m_allocated_size]);
m_pixels_valid = true;
m_storage = ImageBuf::LOCALBUFFER;
m_localpixels = m_pixels.get();
} else {
m_pixels_valid = false;
m_storage = ImageBuf::UNINITIALIZED;
m_localpixels = nullptr;
}
m_pixel_bytes = m_spec.pixel_bytes();
m_scanline_bytes = m_spec.scanline_bytes();
m_plane_bytes = clamped_mult64 (m_scanline_bytes, (imagesize_t)m_spec.height);
m_channel_bytes = m_spec.format.size();
m_blackpixel.resize (round_to_multiple (m_pixel_bytes, OIIO_SIMD_MAX_SIZE_BYTES), 0);
// NB make it big enough for SSE
if (m_allocated_size)
m_pixels_valid = true;
if (m_spec.deep) {
m_deepdata.init (m_spec);
m_storage = ImageBuf::LOCALBUFFER;
Expand Down Expand Up @@ -2372,5 +2419,4 @@ ImageBuf::retile (int x, int y, int z, ImageCache::Tile* &tile,
}



OIIO_NAMESPACE_END
18 changes: 18 additions & 0 deletions src/libOpenImageIO/imagebufalgo_addsub.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@
#include <OpenImageIO/deepdata.h>
#include <OpenImageIO/dassert.h>
#include "imageio_pvt.h"
#include "imagebufalgo_cuda.h"



Expand Down Expand Up @@ -119,6 +120,15 @@ ImageBufAlgo::add (ImageBuf &dst, const ImageBuf &A, const ImageBuf &B,
return false;
ROI origroi = roi;
roi.chend = std::min (roi.chend, std::min (A.nchannels(), B.nchannels()));

#ifdef OIIO_USE_CUDA
if (dst.cuda_storage() && A.cuda_storage() && B.cuda_storage() &&
dst.roi() == roi && A.roi() == roi && B.roi() == roi) {
return pvt::add_impl_cuda (dst, A, B, dst.roi());
}
// make >/dev/null && OPENIMAGEIO_LOG_TIMES=2 oiiotool -cuda -frames 1-1 -pattern fill:topleft=0,0,0:topright=0.5,0,0:bottomleft=0,0.5,0:bottomright=0.5,0.5,0.5 1920x1080 3 -pattern fill:topleft=0.5,0,0:topright=0,0.5,0:bottomleft=0.5,0.5,0.5:bottomright=0,.5,.5 1920x1080 3 -add -o out.exr
#endif

bool ok;
OIIO_DISPATCH_COMMON_TYPES3 (ok, "add", add_impl, dst.spec().format,
A.spec().format, B.spec().format,
Expand Down Expand Up @@ -211,6 +221,14 @@ ImageBufAlgo::sub (ImageBuf &dst, const ImageBuf &A, const ImageBuf &B,
return false;
ROI origroi = roi;
roi.chend = std::min (roi.chend, std::min (A.nchannels(), B.nchannels()));

#ifdef OIIO_USE_CUDA
if (dst.cuda_storage() && A.cuda_storage() && B.cuda_storage() &&
dst.roi() == roi && A.roi() == roi && B.roi() == roi) {
return pvt::sub_impl_cuda (dst, A, B, dst.roi());
}
#endif

bool ok;
OIIO_DISPATCH_COMMON_TYPES3 (ok, "sub", sub_impl, dst.spec().format,
A.spec().format, B.spec().format,
Expand Down
Loading

0 comments on commit 1997433

Please sign in to comment.