forked from etotheipi/CUDA-Image-Processing
Image Processing in C++ using CUDA
Ridiculously fast morphology and convolutions using an NVIDIA GPU!
Additional: cudaImageHost<type> and
cudaImageDevice<type>
Automate all the "standard" CUDA memory operations needed for any
numeric data type. They can be used to learn how to allocate and
move basic memory between host and device, and can be used to
soften the learning curve when implementing CUDA programs (memory
allocation/copy issues are frustrating to track down).
-----
Orig Author: Alan Reiner
Date: 04 September, 2010
Email: [email protected]
-----
This is a BETA version of a complete morph/convolve image processing library
using CUDA. CUDA is an NVIDIA programming interface for harnessing your
NVIDIA graphics card to do computations. For problems that are highly
parallelizable (like convolution or morphology) you can get speedups of
roughly two orders of magnitude.
For instance, to convolve a 4096x4096 mask with a 15x15 point-spread function,
the CPU would take anywhere from 5 to 20 seconds. Using this library
with an NVIDIA GTX 460, the operation takes less than 0.1 seconds! You
can see nearly identical speed improvements with morphological operations.
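As a rough sanity check on those numbers, the arithmetic can be sketched in a few lines of C++. This assumes a direct convolution (one multiply-add per point-spread tap per output pixel, no separable or FFT tricks); the function names are made up for illustration and are not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Multiply-add count for a direct 2D convolution (assumed: no separable
// or FFT shortcut): one multiply-add per kernel tap per output pixel.
std::uint64_t convolutionMultiplyAdds(std::uint64_t imgRows, std::uint64_t imgCols,
                                      std::uint64_t psfRows, std::uint64_t psfCols)
{
    return imgRows * imgCols * psfRows * psfCols;
}

// Speedup implied by two wall-clock timings, both in seconds.
double speedup(double cpuSeconds, double gpuSeconds)
{
    assert(gpuSeconds > 0.0);
    return cpuSeconds / gpuSeconds;
}
```

For the 4096x4096 mask and 15x15 point-spread function this comes to about 3.8 billion multiply-adds, and the quoted timings (5 to 20 seconds on the CPU against less than 0.1 seconds on the GPU) imply a speedup of at least 50x to 200x.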
To use this library, you will need the CUDA 3.2 toolkit, and you might
have to move a few libraries into your linker path. And of course, you
need a CUDA-enabled graphics card (Fermi recommended, others will probably
work; more on that below).
This library is designed to be used for performing a sequence of image
operations on a single image/mask, only copying the data in and out of
the device once. Memory copies are expensive, computation is basically
free...
I envision this library could be useful for GIMP plugins, or ATR
systems. Just remember, CUDA is NOT open-source, and the licensing
will not be favorable to OSS projects.
***IMPORTANT NOTE***
I hate makefiles. And w/ CUDA, they got more complicated, so I am using
the canned common/common.mk makefile that comes with the CUDA SDK.
THIS MAKEFILE ERASES *.cpp FILES WHEN CALLING "make clean"
I don't know how to fix this. Perhaps someone else knows. Just be sure
to check-in your cpp code before cleaning (though, this project no longer
contains any .cpp files)
-----
Supported Hardware:
This code was designed for NVIDIA devices of Compute Capability 2.0+
which is any NVIDIA GTX 4XX series card (Fermi). The code *should*
compile and run on 1.X devices, but the code is not optimized for them
so there may be a huge performance hit. Maybe not. (SEE NOTE BELOW
about compiling for 1.X GPUs)
I believe NVIDIA 8800 GT is the earliest NVIDIA card that supports
CUDA, and it would be Compute Capability 1.0. (correction: I believe
all NVIDIA 8XXX cards support CUDA 1.0)
CUDA was created by NVIDIA, for NVIDIA. It will not work on ATI cards.
If you want to take up GPU-programming on ATI, the only two options
I know are OpenGL and OpenCL. However, the maturity of those
programming interfaces is unclear (but at least such programs can
be run on both NVIDIA and ATI).
-----
Installing and running:
This directory should contain everything you need to compile
the image convolution code, besides the CUDA 3.2 toolkit
which can be downloaded from the following website:
http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html
In addition to installing the toolkit, you might need to add
libcutil*.a and libshrutil*.a to your LD path. On Linux,
it is sufficient to copy them to your /usr/local/cuda/lib[64]
directory. (Note that these two libraries are not formally part of the
NVIDIA CUDA toolkit, but come with the SDK as a set of tools for
simplifying many CUDA tasks, like checking error codes.)
I personally work with space-separated files for images, because
they are trivial to read and write from C++. Additionally, I
use MATLAB to read/write these files too, which makes it easy to
create input for the CUDA code, and verify output.
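To show how little code such a format needs, here is a minimal sketch of a reader and writer. The helper names and the one-row-per-line, row-major layout are assumptions for illustration; they are not the library's readFromFile implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical helper: read rows*cols whitespace-separated integers,
// row-major. Returns an empty vector on short or malformed input.
std::vector<int> readSpaceSeparated(std::istream& in, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    std::vector<int> pixels(static_cast<std::size_t>(rows) * cols);
    for (std::size_t i = 0; i < pixels.size(); ++i)
    {
        if (!(in >> pixels[i]))
            return std::vector<int>();
    }
    return pixels;
}

// Hypothetical helper: write one image row per text line, values
// separated by single spaces.
void writeSpaceSeparated(std::ostream& out, const std::vector<int>& pixels,
                         int rows, int cols)
{
    assert(pixels.size() == static_cast<std::size_t>(rows) * cols);
    for (int r = 0; r < rows; ++r)
    {
        for (int c = 0; c < cols; ++c)
        {
            out << pixels[static_cast<std::size_t>(r) * cols + c]
                << (c + 1 < cols ? ' ' : '\n');
        }
    }
}
```

Because the helpers take generic streams, they work with std::ifstream and std::ofstream for real files and with std::stringstream for quick round-trip checks.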
There is no reason this code won't work on Windows, though I've
never tried it.
I strongly urge anyone trying to learn CUDA to download the
CUDA SDK. It contains endless examples of every CUDA feature
in existence. Once you know what to do, you can get the
syntax from there, which more often than not is very ugly.
-----
To work with "older" NVIDIA cards (8XXX GT, 9XXX GT, GTX 2XX):
Open common/common.mk and around line 150, uncomment the GENCODE_SM10
line, which enables dual-compilation for CUDA Compute Capability 1.X
devices.
The reason I commented this out is that 1.X devices don't
support printf(), which is my main method for debugging. Therefore,
leaving this line in the Makefile prevents the code from compiling
when I am attempting to debug.
-----
Planned Updates
- I have basic HOST and DEVICE memory automated with cudaImageHost<type>
and cudaImageDevice<type>... but texture, constant, and page-locked
memory still require manual operation. I will create similar classes
that handle the allocation and copying of such data (especially textures
because there's a lot of fluff to preparing that data).
- Need to modify the kernel functions (or invocations) to handle masks
that have arbitrary dimensions. Currently, it is assumed that the
input has dimensions that are multiples of 32.
- In the future, I plan to add preprocessor directives which branch the compiled
code based on CUDA architecture. It may be as simple as choosing new block
sizes, or reworking some of the algorithms. The point is, the code is
currently written with Compute Capability 2.0+ in mind, and who knows
what happens when run on a 1.X device. (NOTE: one major difference between
1.X and 2.X is that 1.X is optimized for 24-bit integer operations, while 2.X
is optimized for 32-bit integer operations. Using the wrong ones for
the given architecture can produce a factor of 8-32 slowdown; this most
definitely needs to be addressed in a preprocessor directive somewhere.)
- Right now my focus has been on morphology, though regular convolution
HAS been implemented. I will implement a more-exhaustive array of
image processing functions, such as more filters and FFTs.
- I'm pretty sure that all operations need separate input/output buffers.
I need to check that and then see if I can avoid it.
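Until arbitrary dimensions are handled, one possible workaround for the multiple-of-32 restriction listed above is to pad the input on the host before handing it to the workbench. The sketch below is an assumption about how that could be done, not something the library provides; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of 32.
int roundUpTo32(int n)
{
    assert(n > 0);
    return ((n + 31) / 32) * 32;
}

// Hypothetical helper: copy a rows x cols row-major image into the top-left
// corner of a zero-filled image whose dimensions are multiples of 32.
std::vector<int> padToMultipleOf32(const std::vector<int>& src, int rows, int cols,
                                   int& paddedRows, int& paddedCols)
{
    assert(src.size() == static_cast<std::size_t>(rows) * cols);
    paddedRows = roundUpTo32(rows);
    paddedCols = roundUpTo32(cols);
    std::vector<int> dst(static_cast<std::size_t>(paddedRows) * paddedCols, 0);
    for (int r = 0; r < rows; ++r)
    {
        for (int c = 0; c < cols; ++c)
        {
            dst[static_cast<std::size_t>(r) * paddedCols + c] =
                src[static_cast<std::size_t>(r) * cols + c];
        }
    }
    return dst;
}
```

Keep in mind that zero padding acts like an OFF border, so results near the original right and bottom edges may differ from what a true arbitrary-size implementation would produce; crop the output back to rows x cols afterwards.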
-------------------------------------------------------------------------------
-----
User Instructions - to use this library in your project
This library is designed for the user to put data on an "ImageWorkbench",
which stores the data in device memory, perform thousands of operations,
then copy the data back to the host. Use cudaImageHost to store and move
data around in HOST memory, and to initialize an ImageWorkbench object.
The user is welcome to create/use cudaImageDevice objects (to store and
manage DEVICE memory), but it should be unnecessary, since the workbench
handles that for you.
A typical usage paradigm might be like the following:
    // Allocate and populate HOST memory (<int> pixel type assumed here)
    cudaImageHost<int> imageIn, imageOut;
    imageIn.readFromFile("picture.txt", 1024, 1024);

    // Initialize the workbench -- this copies the img to DEVICE memory
    ImageWorkbench theBench(imageIn);
    theBench.Dilate();
    theBench.Erode();
    theBench.PruningSweep();
    theBench.Open();
    theBench.Dilate();
    theBench.Dilate();
    theBench.Median();
    theBench.Dilate();

    // Use CountChanged method to determine how many pixels changed in
    // the previous operation. Frequently, we may want to apply a sequence
    // of operations until the image is no longer changing
    int nChanged = -1;
    while(nChanged != 0)
    {
        theBench.ThinningSweep();
        nChanged = theBench.CountChanged();
    }
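The "repeat until nothing changes" loop can be tried without a GPU. The following CPU-only sketch uses a made-up MiniBench class with a 4-neighbour erosion standing in for ThinningSweep(); it only illustrates the before/after comparison behind CountChanged and is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// CPU-only stand-in for the workbench (hypothetical, not the library class).
// Buffer A holds the current image, buffer B receives each result.
struct MiniBench
{
    int rows, cols;
    std::vector<int> A, B;   // binary images, row-major, values 0 or 1

    MiniBench(const std::vector<int>& img, int r, int c)
        : rows(r), cols(c), A(img), B(img)
    {
        assert(img.size() == static_cast<std::size_t>(r) * c);
    }

    // Pixels outside the image are treated as OFF.
    int at(const std::vector<int>& buf, int r, int c) const
    {
        if (r < 0 || r >= rows || c < 0 || c >= cols) return 0;
        return buf[static_cast<std::size_t>(r) * cols + c];
    }

    // 4-neighbour erosion as a stand-in for ThinningSweep(): A --> B, then flip.
    void ErodeStep()
    {
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < cols; ++c)
                B[static_cast<std::size_t>(r) * cols + c] =
                    at(A, r, c) & at(A, r - 1, c) & at(A, r + 1, c)
                                & at(A, r, c - 1) & at(A, r, c + 1);
        std::swap(A, B);     // newest result is now in A, previous image in B
    }

    // Number of pixels that differ between the before (B) and after (A) images.
    int CountChanged() const
    {
        int n = 0;
        for (std::size_t i = 0; i < A.size(); ++i)
            n += (A[i] != B[i]) ? 1 : 0;
        return n;
    }
};

// Apply the step until the image stops changing; returns the number of sweeps.
int runUntilSteady(MiniBench& bench)
{
    int sweeps = 0;
    int nChanged = -1;
    while (nChanged != 0)
    {
        bench.ErodeStep();
        nChanged = bench.CountChanged();
        ++sweeps;
    }
    return sweeps;
}
```

Note that the final sweep is always a "wasted" one: it is the sweep that confirms nothing changed.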
If you wish to supply your own structuring elements, create a
cudaImageHost object using {-1, 0, +1} for {OFF, DONTCARE, ON}, and then
call:
    int seIndex = ImageWorkbench::addStructElt(seImageHost);
This permanently stores the SE in a master (static) list, and you only need
to supply the returned index to use it:
    theBench.Dilate(seIndex);
    theBench.Open(seIndex);
    theBench.FindAndRemove(seIndex);
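To make the {-1, 0, +1} encoding concrete, here is a CPU sketch of testing whether a structuring element fits at one pixel. The matching rule is an assumption (hit-or-miss style: +1 requires ON, -1 requires OFF, 0 is ignored, and pixels outside the image count as OFF); the GPU kernels in this library may treat borders or individual operations differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Does the structuring element fit the binary image at (r, c)?
// Assumed semantics: +1 needs an ON pixel, -1 needs an OFF pixel,
// 0 is DONTCARE. Pixels outside the image count as OFF.
bool seMatchesAt(const std::vector<int>& img, int rows, int cols,
                 const std::vector<int>& se, int seRows, int seCols,
                 int r, int c)
{
    assert(img.size() == static_cast<std::size_t>(rows) * cols);
    assert(se.size() == static_cast<std::size_t>(seRows) * seCols);
    assert(seRows % 2 == 1 && seCols % 2 == 1);   // centred SE assumed
    for (int i = 0; i < seRows; ++i)
    {
        for (int j = 0; j < seCols; ++j)
        {
            const int want = se[static_cast<std::size_t>(i) * seCols + j];
            if (want == 0) continue;              // DONTCARE
            const int rr = r + i - seRows / 2;
            const int cc = c + j - seCols / 2;
            const bool inside = rr >= 0 && rr < rows && cc >= 0 && cc < cols;
            const bool on = inside && img[static_cast<std::size_t>(rr) * cols + cc] != 0;
            if ((want > 0) != on) return false;   // ON/OFF requirement violated
        }
    }
    return true;
}
```

Under this reading, an erosion keeps a pixel only where the SE fits, and FindAndRemove-style operations clear the pixels where it fits.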
Every operation so far leaves the buffer management to the IWB object.
However, every method has two optional arguments for specifying source &
destination buffers, if desired. The valid buffers are also accessed
by index: 'A', 'B', or N >= 1. If a number is supplied, the IWB object
will use the Nth buffer in its extraBuffers_ vector. If the buffer does
not exist, it will create it. Keep in mind that A and B are #defines and
do not need quotes.
    // No arguments uses 3x3 optimized morphology, (A --> B)
    theBench.Dilate();
    // Supply structuring element, still (A --> B)
    theBench.Erode(seIndex);
    // Default 3x3 median calc on buffer A, result goes to buf 1 (A --> 1)
    theBench.Median(A, 1);
    // Copy in an external image to buffer 2 (Extern --> 2)
    theBench.copyHostToBuffer(newImageSameSize, 2);
    // Union buffers 1 and 2, store the result in buffer B (1U2 --> B)
    theBench.Union(1, 2, B);
    // Subtract B from buffer 1, store in 2 (1-B --> 2)
    theBench.Subtract(B, 1, 2);
    // Now morphological-open the result with SE and copy back to A
    theBench.Open(seIndex, 2, A);
    // DONE!

    // The last operation could've been replaced with the following two lines
    theBench.Open(seIndex, 2, B);
    theBench.flipBuffers(); // switches buffer ptrs so B is now A and vice versa
In order to maintain IWB consistency, any manual buffer management needs
to end with the result in buffer A, or in buffer B followed by a call to
flipBuffers(). This is because subsequent operations always assume the
last output was in buffer A.
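The buffer bookkeeping described above (extra buffers created on first use, and a flip that swaps A and B without copying) can be sketched on the host as follows. BufferSet is a hypothetical stand-in, not the IWB class; the values used for A and B are the ones listed in the developer notes further down, written here as constants rather than #defines.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

const int A = -1000;   // stand-ins for the library's #defines
const int B = -1001;

// Hypothetical host-side model of the workbench's buffer bookkeeping.
class BufferSet
{
public:
    explicit BufferSet(std::size_t numPixels)
        : numPixels_(numPixels), bufA_(numPixels, 0), bufB_(numPixels, 0) {}

    // Resolve A, B, or N >= 1. Extra buffer N is created on first use.
    // The returned reference may be invalidated by a later call that
    // creates more extra buffers, so do not hold on to it.
    std::vector<int>& get(int index)
    {
        if (index == A) return bufA_;
        if (index == B) return bufB_;
        assert(index >= 1);
        const std::size_t n = static_cast<std::size_t>(index);
        if (extraBuffers_.size() < n)
            extraBuffers_.resize(n, std::vector<int>(numPixels_, 0));
        return extraBuffers_[n - 1];
    }

    // Swap the roles of A and B without copying any pixel data.
    void flipBuffers() { std::swap(bufA_, bufB_); }

    std::size_t numExtra() const { return extraBuffers_.size(); }

private:
    std::size_t numPixels_;
    std::vector<int> bufA_, bufB_;
    std::vector<std::vector<int> > extraBuffers_;
};
```

After writing a result into B, a single flipBuffers() call makes it the new A, which is the state every subsequent operation expects.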
-------------------------------------------------------------------------------
-----
Developer Stuff - to expand this library
The paradigm for developing this library differs substantially from simply
using the library. There are two main reasons for this:
1) The developer needs his own set of buffers so that he has workspace but
doesn't overwrite the buffers that the user may have created himself
2) The developer needs to ensure that every "standard" operation uses
buffer A as input, buffer B as output AND THAT BOTH BUFFERS DISTINCTLY
REPRESENT BEFORE-AND-AFTER VERSIONS of the operation.
Therefore, the developer CANNOT just implement morph-close operation as:
    Dilate(seIndex);
    Erode(seIndex);
because a subsequent call to CountChanged would only give the number of
pixels that changed by the Erode operation. This is more important than it
sounds, because many algorithms are based on observing when a cycle of
operations reaches steady state, and if the developer doesn't respect this
behavior, the user would have to redundantly implement the sequence himself
to be able to do these kinds of checks. This is especially true of thinning
and pruning, where you are performing 8 erosions and subtractions, and need
to stop when the entire sequence of operations no longer changes any pixels.
Therefore, the developer should ALWAYS use ZFunctions where possible. These
functions always require source and destination, allow access to TEMP buffers
and do NOT flip the buffers when called. In other words, the developer uses
a different set of tools, and does everything manually.
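The CountChanged problem is easy to reproduce on the CPU. The sketch below uses a 1D binary signal and made-up ZDilate/ZErode helpers (3-wide, out-of-range neighbours treated as OFF) to compare a naive close against one that routes the intermediate result through a temp buffer. It is an illustration of the rule, not the library's code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<int> Buf;   // 1D binary "image" keeps the sketch short

// Out-of-range neighbours are treated as OFF.
static int px(const Buf& b, int i)
{
    return (i < 0 || i >= static_cast<int>(b.size())) ? 0 : b[static_cast<std::size_t>(i)];
}

// Hypothetical Z-style primitives: explicit source and destination, no flip.
void ZDilate(const Buf& src, Buf& dst)
{
    assert(&src != &dst && src.size() == dst.size());
    for (int i = 0; i < static_cast<int>(src.size()); ++i)
        dst[static_cast<std::size_t>(i)] = px(src, i - 1) | px(src, i) | px(src, i + 1);
}

void ZErode(const Buf& src, Buf& dst)
{
    assert(&src != &dst && src.size() == dst.size());
    for (int i = 0; i < static_cast<int>(src.size()); ++i)
        dst[static_cast<std::size_t>(i)] = px(src, i - 1) & px(src, i) & px(src, i + 1);
}

struct MiniBench
{
    Buf A, B, temp;
    explicit MiniBench(const Buf& img) : A(img), B(img), temp(img) {}

    void flipBuffers() { std::swap(A, B); }

    int CountChanged() const
    {
        int n = 0;
        for (std::size_t i = 0; i < A.size(); ++i) n += (A[i] != B[i]) ? 1 : 0;
        return n;
    }

    // Naive close: each step flips, so B ends up holding the dilated
    // intermediate instead of the original image.
    void CloseNaive()
    {
        ZDilate(A, B); flipBuffers();
        ZErode(A, B);  flipBuffers();
    }

    // Safer close: the intermediate goes to a temp buffer, and A is left
    // untouched until the single flip at the end.
    void Close()
    {
        ZDilate(A, temp);
        ZErode(temp, B);
        flipBuffers();
    }
};
```

Both versions leave the same image in A, but only Close() leaves the original image in B, so only Close() reports the number of pixels changed by the whole operation.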
Second thing to note is that the getBufPtrAny command has the following
range of valid inputs:
    {A=-1000, B=-1001,  1, 2, 3, 4, ...,  -1, -2, -3, -4, ...}
    |     PRIMARY     |      EXTRA      |     TEMPORARY      |
The USER has access to PRIMARY & EXTRA (getBufferPtr won't accept negative values).
The DEVELOPER has access to all, but should only use PRIMARY & TEMPORARY.
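The index ranges above translate into a simple classification. The sketch below is hypothetical (the enum, the function names, and the slot mapping are invented for illustration), and it assumes that index 0 is invalid and that TEMPORARY indices never reach as far down as -1000, where they would collide with A and B.

```cpp
#include <cassert>

const int A = -1000;   // stand-ins for the library's #defines
const int B = -1001;

enum BufferKind { PRIMARY, EXTRA, TEMPORARY, INVALID };

// Map a buffer index onto the three documented ranges.
BufferKind classifyBuffer(int index)
{
    if (index == A || index == B) return PRIMARY;
    if (index >= 1)               return EXTRA;
    if (index <= -1)              return TEMPORARY;
    return INVALID;               // 0 is not in the documented range
}

// The user-facing accessor rejects negative (TEMPORARY) indices.
bool userMayAccess(int index)
{
    const BufferKind kind = classifyBuffer(index);
    return kind == PRIMARY || kind == EXTRA;
}

// Assumed mapping from a TEMPORARY index to a 0-based slot: -1 -> 0, -2 -> 1.
int tempSlot(int index)
{
    assert(classifyBuffer(index) == TEMPORARY);
    return -index - 1;
}
```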
The difference between EXTRA and TEMPORARY is that the developer may create
a sequence of commands, which themselves may also use temporary buffers.
This causes a recursion problem, where the developer has no way to protect
his own buffers from being overwritten by a method that he is calling.
Therefore, lock-flags were implemented so that the workbench can find the
next available TEMP buffer. This retrieval "locks" the buffer, and it
MUST be "released" at the end of the function. Much like new&delete, there
needs to be a "release" for every "get". This example demonstrates how
to write new methods safely for the ImageWorkbench class:
    void A_Complicated_Morph_Operation(int seIndex)
    {
        // Retrieve three temporary buffers from the workbench
        int tmpBufIndex1 = getTempBuffer();
        int tmpBufIndex2 = getTempBuffer();
        int tmpBufIndex3 = getTempBuffer();

        // Always read in from A
        ZErode(A, tmpBufIndex1);
        ZMedian(seIndex, tmpBufIndex1, tmpBufIndex2);
        ZSubtract(tmpBufIndex2, A, tmpBufIndex3);
        ZUnion(A, tmpBufIndex3, tmpBufIndex2);
        ZDifference(tmpBufIndex2, tmpBufIndex1, tmpBufIndex3);

        // Last morph op should always put result in B, and then flip
        ZFindAndRemove(seIndex, tmpBufIndex3, B);
        flipBuffers();

        // Tell the workbench to "unlock" all the TEMP buffers we used
        releaseTempBuffer(tmpBufIndex1);
        releaseTempBuffer(tmpBufIndex2);
        releaseTempBuffer(tmpBufIndex3);
    }
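Since every "get" needs a matching "release", one way to avoid forgetting a release (for example on an early return) is a small RAII guard. This is only a sketch of an alternative: TempPool and TempBufferGuard are hypothetical classes that mimic the lock-flag idea, and the library itself uses the explicit getTempBuffer/releaseTempBuffer calls shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pool mirroring the lock-flag idea: TEMP indices are -1, -2, ...
class TempPool
{
public:
    // Return the first unlocked TEMP index, growing the pool if all are locked.
    int getTempBuffer()
    {
        for (std::size_t i = 0; i < locked_.size(); ++i)
        {
            if (!locked_[i])
            {
                locked_[i] = true;
                return -static_cast<int>(i) - 1;
            }
        }
        locked_.push_back(true);
        return -static_cast<int>(locked_.size());
    }

    // Unlock a TEMP index previously handed out by getTempBuffer().
    void releaseTempBuffer(int index)
    {
        assert(index < 0);
        const std::size_t slot = static_cast<std::size_t>(-index - 1);
        assert(slot < locked_.size() && locked_[slot]);
        locked_[slot] = false;
    }

    std::size_t numLocked() const
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < locked_.size(); ++i) n += locked_[i] ? 1 : 0;
        return n;
    }

private:
    std::vector<bool> locked_;
};

// RAII guard: locks a TEMP buffer in the constructor and releases it in the
// destructor, so leaving the scope always unlocks it.
class TempBufferGuard
{
public:
    explicit TempBufferGuard(TempPool& pool)
        : pool_(pool), index_(pool.getTempBuffer()) {}
    ~TempBufferGuard() { pool_.releaseTempBuffer(index_); }
    int index() const { return index_; }

private:
    TempBufferGuard(const TempBufferGuard&);             // non-copyable
    TempBufferGuard& operator=(const TempBufferGuard&);
    TempPool& pool_;
    int index_;
};
```

With such a guard, the three releaseTempBuffer calls at the end of a method would become unnecessary, at the cost of one extra class to maintain.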
As per the descriptions above, all new GPU kernels should take pointers to
the input, the output, and all temporary memory locations as arguments,
and the workbench will supply those pointers when wrapping the GPU kernels.
This also means that when implementing new methods in IWB, there should
actually be three new methods created:
    NAME (arg1, arg2, ...);                          // calls below with src=A, dst=B, then flip
    NAME (arg1, arg2, ..., int srcBuf, int dstBuf);  // no flip
    ZNAME(arg1, arg2, ..., int srcBuf, int dstBuf);  // no flip, TEMP access
Some might argue that the ZNAME functions aren't necessary, since the only
real difference they have compared to NAME is that they can access TEMP
buffers. There may be a better way to do this, and in fact, it may not be
necessary to shield the user from the TEMP buffers at all (just don't use them,
please). Since most of the ZFunctions are written by #define macros anyway,
I'll leave them alone for now.
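For reference, the three-method pattern could be laid out as in the skeleton below. Everything here is hypothetical: "Invert" is an invented operation name, the class only records calls instead of launching kernels, and the library generates its ZFunctions with macros rather than writing them by hand.

```cpp
#include <cassert>
#include <vector>

const int A = -1000;   // stand-ins for the library's #defines
const int B = -1001;

// Hypothetical skeleton: it only logs calls so the dispatch is visible.
class SketchBench
{
public:
    struct Call { int src, dst; };
    std::vector<Call> log;
    int flips;

    SketchBench() : flips(0) {}

    // 1) Convenience form: A --> B, then flip
    void Invert()
    {
        ZInvert(A, B);
        flipBuffers();
    }

    // 2) User form: explicit buffers, no flip, TEMP (negative) indices rejected
    void Invert(int srcBuf, int dstBuf)
    {
        assert(isUserBuffer(srcBuf) && isUserBuffer(dstBuf));
        ZInvert(srcBuf, dstBuf);
    }

    // 3) Developer form: explicit buffers, no flip, TEMP indices allowed
    void ZInvert(int srcBuf, int dstBuf)
    {
        Call c = { srcBuf, dstBuf };
        log.push_back(c);   // a real method would launch the GPU kernel here
    }

    void flipBuffers() { ++flips; }

    static bool isUserBuffer(int index)
    {
        return index == A || index == B || index >= 1;
    }
};
```

Written this way, the only difference between forms 2 and 3 is the index check, which is the point made above about the ZNAME functions possibly being unnecessary.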