README


Image Processing in C++ using CUDA

Ridiculously fast morphology and convolutions using an NVIDIA GPU!

Additional:  cudaImageHost<type>    and 
             cudaImageDevice<type> 
   Automate all the "standard" CUDA memory operations needed for any 
   numeric data type.  The can be used to learn how to allocate and 
   move basic memory between host and device, and can be used to 
   soften the learning curve when implementing CUDA programs (memory
   allocation/copy issues are frustrating to track down)

-----
Orig Author:  Alan Reiner
Date:         04 September, 2010
Email:        etotheipi@gmail.com


-----
This a BETA version of a complete morph/convolve image processing library
using CUDA.  CUDA is an NVIDIA programming interface for harnessing your
NVIDIA graphics card to do computations.  For problems that are highly 
parallelizable (like convolution or morphology) you can get speedups of 
many orders of magnitude.  

For instance, to convolve a 4096x4096 mask with a 15x15 point-spread fn,
the CPU would take anywhere from 5 to 20 seconds.  Using this library 
with an NVIDIA GTX 460, the operation takes less than 0.1 seconds!  You
can see nearly identical speed improvements with morphological operations.

To use this library, you will need the CUDA 3.2 toolkit, and you might
have to move a few libraries into your linker path.  And of course, you
need a CUDA-enabled graphics card (Fermi recommended, others will prob
work;  more on that below).

This library is designed to be used for performing a sequence of image
operations on a single image/mask, only copying the data in and out of 
the device once.   Memory copies are expensive, computation is basically
free...  

I envision this library could be useful for GIMP plugins, or ATR
systems.  Just remember, CUDA is NOT open-source, and the licensing
will not be favorable to OSS projects.


***IMPORTANT NOTE***

   I hate makefiles.  And w/ CUDA, they got more complicated, so I am using
   the canned common/common.mk makefile that comes with the CUDA SDK.  

      THIS MAKEFILE ERASES *.cpp FILES WHEN CALLING "make clean"

   I don't know how to fix this.  Perhaps someone else knows.  Just be sure
   to check-in your cpp code before cleaning (though, this project no longer
   contains any .cpp files)
   

-----
Supported Hardware:

   This code was designed for NVIDIA devices of Compute Capability 2.0+
   which is any NVIDIA GTX 4XX series card (Fermi).  The code *should*
   compile and run on 1.X devices, but the code is not optimized for them
   so there may be a huge performance hit.  Maybe not.  (SEE NOTE BELOW
   about compiling for 1.X GPUs)

   I believe NVIDIA 8800 GT is the earliest NVIDIA card that supports
   CUDA, and it would be Compute Capability 1.0. (correction: I believe
   all NVIDIA 8XXX cards support CUDA 1.0)

   CUDA was created by NVIDIA, for NVIDIA.  It will not work ATI cards.
   If you want to take up GPU-programming on ATI, the only two options
   I know are OpenGL and OpenCL.  However, the maturity of those
   programming interfaces are unclear (but at least such programs can 
   be run on both NVIDIA and ATI)


-----
Installing and running:

   This directory should contain everything you need to compile
   the image convolution code, besides the CUDA 3.1 toolkit
   which can be downloaded from the following website:

      http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html

   In addition to installing the toolkit, you might need to add 
   libcutil*.a and libshrutil*.a to your LD path.  In linux,
   it is sufficient to copy them to your /usr/local/cuda/lib[64] 
   directory.  (note these two libraries are not formally part of
   NVIDIA CUDA toolbox, but come with the SDK as a set of tools for 
   simplifying many CUDA tasks, like checking error codes)

   I personally work with space-separated files for images, because 
   they are trivial to read and write from C++.  Additionally, I 
   use MATLAB to read/write these files too, which makes it easy
   create input for the CUDA code, and verify output.

   There is no reason this code won't work on Windows, though I've
   never tried it.

   I strongly urge anyone trying to learn CUDA to download the 
   CUDA SDK.  It contains endless examples of every CUDA feature
   in existence.  Once you know what to do, you can get the
   syntax from there, which more often than not is very ugly.


-----
To work with "older" NVIDIA cards (8XXX GT, 9XXX GT, GTX 2XX):

   Open commom/common.mk and around line 150, uncomment the GENCODE_SM10
   line, which enables dual-compilation for CUDA Compute Capability 1.X
   devices.  

   The reason I commented this out is that 1.X devices don't
   support printf(), which is my main method for debugging.  Therefore,
   leaving this line in the Makefile prevents the code from compiling 
   when I am attempting to debug.


-----
Planned Updates

   - I have basic HOST and DEVICE memory automated with cudaImageHost<type>
     and cudaImageDevice<type>... but texture, constant, and page-locked
     memory still requires manual operation.  I will create similar classes
     that handle the allocation and copying of such data (especially textures
     because there's a lot of fluff to preparing that data)

   - Need to modify the kernel functions (or invocations) to handle masks
     that have arbitrary dimensions.  Currently, it is assumed that the
     input has dimensions that are multiples of 32.

   - In the future, I plan add preprocessors which branch the compiled code
     based on CUDA architecture.  It may be as simple as choosing new block
     sizes, or reworking some of the algorithms.  The point is, the code is
     currently written with Compute Capability 2.0+ in mind, and who knows
     what happens when run on 1.X device. (NOTE:  one major difference between
     1.X and 2.X is that 1.X is optimized for 24-bit integer operations, 2.x
     is optimized for 32-bit integer operations.  Using the wrong ones for
     the given architecture can produce factor of 8-32 slowdown, this most
     definitely needs to be addressed in a preprocessor somewhere)

   - Right now my focus has been on morphology, though regular convolution
     HAS been implemented.  I will implement a more-exhaustive array of 
     image processing functions, such as more filters and FFTs.

   - I'm pretty sure that all operations need separate input/output buffers
     I need to check that and then see if I can avoid it.


-------------------------------------------------------------------------------
----- 
User Instructions - to use this library in your project

   This library is designed for the user to put data on an "ImageWorkbench", 
   which stores the data in device memory, perform thousands of operations, 
   then copy the data back to the host.  Use cudaImageHost to store and move
   data around in HOST memory, and initiate a ImageWorkbench object.

   The user is welcome to create/use cudaImageDevice objects (to store and 
   manage DEVICE memory), but it should be unnecessary, since the workbench 
   handles that for you.  
   

   A typical usage paradigm might be like the following:

      // Allocate and populate HOST memory
      cudaImageHost imageIn, imageOut;
      imageIn.readFromFile("picture.txt", 1024, 1024);

      // Initialize the workbench -- this copies the img to DEVICE memory
      ImageWorkbench theBench(imageIn);

      theBench.Dilate();
      theBench.Erode();
      theBench.PruningSweep();
      theBench.Open();
      theBench.Dilate();
      theBench.Dilate();
      theBench.Median();
      theBench.Dilate();
   
      // Use CountChanged method to determine how many pixels changed in 
      // the previous operation.  Frequently, we may want to apply a sequence
      // of operations until the image is no longer changing
      int nChanged = -1;
      while(nChanged != 0)
      {
         theBench.ThinningSweep();
         nChanged = theBench.CountChanged();
      }

   If the user wishes to use supplied structuring elements, you create a 
   cudaImageHost object using {-1, 0, +1} for {OFF, DONTCARE, ON}, and then
   call:

      int seIndex = ImageWorkbench::addStructElt(seImageHost);

   This permanently stores the SE in a master (static list), and you only need
   to supply the returned index to use it:

      theBench.Dilate(seIndex); 
      theBench.Open(seIndex); 
      theBench.FindAndRemove(seIndex); 

   
   Every operation so far leaves the buffer management to the IWB object.
   However, every method has optional two arguments for specifying source & 
   destination buffers, if desired.   The valid buffers are also accessed 
   by index:  'A', 'B', or N >= 1.  If a number is supplied, the IWB object
   will use the Nth buffer in its extraBuffers_ vector.  If the buffer does
   not exist, it will create it.  Keep in mind that A and B are #defines and
   do not need quotes.  


      // No arguments uses 3x3 optimized morphology, (A --> B)
      theBench.Dilate();
      
      // Supply structuring element, still (A --> B)
      theBench.Erode(seIndex);

      // Default 3x3 median calc on buffer A, result goes to buf 1  (A --> 1)
      theBench.Median(A, 1);   

      // Copy in an external image to buffer 2  (Extern --> 2)
      theBench.copyHostToBuffer(newImageSameSize, 2);   

      // Union buffers 1 and 2, store the result in buffer B (1U2 --> B)
      theBench.Union(1, 2, B);

      // Subtract B from buffer 1, store in 2  (1-B -> 2)
      theBench.Subtract(B, 1, 2);

      // Now morphological-open the result with SE and copy back to A
      theBench.Open(seIndex, 2, A);
      // DONE!
      
      // The last operation could've been replaced with the following two lines
      theBench.Open(seIndex, 2, B);
      theBench.flipBuffers();   // switches buffer ptrs so B is now A and vv


   In order to maintain IWB consistency, any manual buffer management needs
   to end with the result in buffer A, or in buffer B followed by a call to
   flipBuffers().  This is because subsequent operations always assume the 
   last output was in buffer A.

   
-------------------------------------------------------------------------------
----- 
Developer Stuff - to expand this library

   The paradigm for developing this library differs substantially from simply 
   using the library.  There are two main reasons for this:

   1)  The developer needs his own set of buffers so that he has workspace but
       doesn't overwrite the buffers that the user may have created himself
   2)  The developer needs to ensure that every "standard" operations uses
       buffer A as input, buffer B as output AND THAT BOTH BUFFERS DISTINCTLY
       REPRESENT BEFORE-AND-AFTER VERSIONS of the operation.

   Therefore, the developer CANNOT just implement morph-close operation as:
      Dilate(seIndex);
      Erode(seIndex);
      
   because a subsequent call to CountChanged would only give the number of
   pixels that changed by the Erode operation.  This is more important than it
   sounds, because many algorithms are based on observing when a cycle of
   operations reaches steady state, and if the developer doesn't respect this 
   behavior, the user would have to redundantly implement the sequence himself
   to be able to do these kinds of checks.   This is especially true of thinning
   and pruning, where you are performing 8 erosions and subtractions, and need
   to stop when the entire sequence of operations no longer changes any pixels.

   Therefore, the developer should ALWAYS use ZFunctions where possible.  These
   functions always require source and destination, allow access to TEMP buffers
   and do NOT flip the buffers when called.  In other words, the developer uses
   a different set of tools, and does everything manually.
   
   Second thing to note is that the getBufPtrAny command has the following
   range of valid inputs:

      {A=-1000, B=-1001, 1, 2, 3, 4, ..., -1, -2, -3, -4 ...}
      |     PRIMARY     |      EXTRA     |    TEMPORARY     |

   The USER has access to PRIMARY & EXTRA (getBufferPtr won't accept neg vals)
   The DEVELOPER has access to all, but should only use PRIMARY & TEMP
   
   The difference between EXTRA and TEMPORARY is the the developer may create
   a sequence of commands, which themselves may also use temporary buffers. 
   This causes a recursion problem, where the developer has no way to protect
   his own buffers from being overwritten by a method that he is calling.

   Therefore, lock-flags were implemented so that the workbench can find the
   next available TEMP buffer.  This retrieval "locks" the buffer, and it 
   MUST be "released" at the end of the function.  Much like new&delete, there
   needs to be a "release" for every "get".  This example demonstrates how
   to write new methods safely for ImageWorkbench class:

   void A_Complicated_Morph_Operation(int seIndex)
   {
      // Retrieve three temporary buffers from the workbench
      int tmpBufIndex1 = getTempBuffer();
      int tmpBufIndex2 = getTempBuffer();
      int tmpBufIndex3 = getTempBuffer();

      // Always read in from A
      ZErode(A, tmpBufIndex1);
      ZMedian(seIndex, tmpBufIndex1, tmpBufIndex2);
      ZSubtract(tmpBufIndex2, A, tmpBufIndex3);
      ZUnion(A, tmpBufIndex3, tmpBufIndex2);
      ZDifference(tmpBufIndex2, tmpBufIndex1, tmpBufIndex3);

      // Last morph op should always put result in B, and then flip
      ZFindAndRemove( seIndex, tmpBufIndex3, B);
      flipBuffers();

      // Tell the workbench to "unlock" all the TEMP buffers we used
      releaseTempBuffer(tmpBufIndex1);
      releaseTempBuffer(tmpBufIndex2);
      releaseTempBuffer(tmpBufIndex3);
   }
      
      
   As per the descriptions above, this means that all new GPU kernels should
   take as input, pointers to input, output and all temporary memory locations
   and the workbench will supply those pointers when wrapping the GPU kernels.
   
   This also means, that when implementing new methods in IWB, there should
   actually be three new methods created:

      NAME (arg1, arg2, ...);    // calls below with src=A, dst=B, then flip
      NAME (arg1, arg2, ..., int srcBuf, int dstBuf);  // no flip
      ZNAME(arg1, arg2, ..., int srcBuf, int dstBuf);  // no flip, TEMP access


   Some might argue that the ZNAME functions aren't necessary, since the only 
   real difference they have compared to NAME is that they can access TEMP 
   buffers.  There may be a better way to do this, and in fact, it may not be
   necessary to shield the user from the TEMP buffers at all (just don't use them,
   please).  Since most of the ZFunctions are written by #define macros anyway,
   I'll leave them alone for now.