Fast Box Blur - reflection padding

Header only C++ implementation of a Fast Box Blur in linear time. It's designed to be a full portable, light and faster replacement of cv::blur , thus is accurate on image boundaries, emulating a reflection (mirrored) padding without increasing the memory for it.

The main code is based on FastGaussianBlur of @bfraboni and on a blog post by Ivan Kutskir: blog. Which refers to a presentation by Wojciech Jarosz: slides. The code uses STB_IMAGE and STB_IMAGE_WRITE by stb for image manipulation: stb github.

Algorithm

A 2D box blur is a separable convolution, hence it is most of the time performed using first an horizontal 1D box blur pass and then a vertical 1D box blur pass. Usually the process of N box blur passes should alternate between these horizontal and vertical passes.

However thanks to box blur properties the horizontal and vertical passes can be performed in any order without changing the result.

Hence for performance purposes I came up with the following algorithm:

apply N times horizontal box blur (horizontal passes)
flip the image buffer (transposition)
apply N times horizontal box blur (vertical passes)
flip the image buffer (transposition)

Steps 1. and 3. are performed with the horizontal_blur function, which is a fast 1D box blur pass with a sliding accumulator.

Steps 2. and 4. are performed with the flip_block function, which is a fast image buffer transposition, processed per block such that it better preserves cache coherency.

Note: This is the main difference with @bfraboni repository. The fast box blur algorithm is accurate on image boundaries, it emulates a reflection (mirrored) padding therefore radius has to be < of width in order to read correctly inside the image buffer, I set a maximum.

Example of maximum radius

kernel = 9, radius = 4, width(or height) = 5

e d c b | a b c d e | d c b a

The reflection padding is inside the image buffer!

WIP - Experimental

If defined DOUBLE_ACCUMULATOR an alternative function is available

 template <typename  T, int  C> 
 void  horizontal_blur_kernel_reflect_double(const  T  *in, T  *out, const  int  w, const  int  h, const  int  ksize)

@TJCoding original idea was to avoid the re-iteration of the function for each pass doing 2 ( or ideally N ) accumulations at once.

This has been achieved using:

A "rough and easy" circular buffer implementation, that stores the 1st pass output sums in a deque
A Lookup table used to get the correct index at bounds.

Both implementations have their limits, the first is unefficient, popping and pushing at every iteration, the latter might be good but the modulo operator with % lut.size() slow down the improvement, a possible solution might be to create a bigger lookup table and avoid the modulo at all but the memory usage will increase. So in conclusion it's slower than the original algorithm but I left it for documentation purposes, maybe we can optimize further or have new ideas, I'm open to it.

For further details please refer to:

Implementation

The implementation is defined in the fast_box_blur.h header that contains the fastest templated cache coherent version I could make.

The main exposed function is defined as:

  

template<typename  T>

void  fastboxblur(T  *  in, const  int  w, const  int  h, const  int  channels, const  int  ksize, const  int  passes = 1);

where the arguments are:

in is a reference to the source buffer ptr, inplace transformed
w is the image width,
h is the image height,
channels is the image number of channels,
ksize is the desired box car blur,
passes is the number of box car blur passes to perform.

A SIMD vectorized or a GPU version of this algorithm could be significantly faster (but may be painful for the developper for arbitrary channels number / data sizes).

Usage

Run the program with the following command:

./fastboxblur <input_filename> <output_filename> <ksize> <passes = 1>

input_image_filename should be any of [.jpg, .png, .bmp, .tga, .psd, .gif, .hdr, .pic, .pnm].
output_image_filename should be any of [.png, .jpg, .bmp] (unknown extensions will be saved as .png by default).
ksize is the desired box car blur.
passes is an optional argument that controls the number of box blur passes (should be positive). Default is 1.

Performance

The algorithm is designed to be a portable, easy, and faster replacement of cv::blur , using an i7-10750H, it's around 2x - 3x time faster.

Benchmark with 45 images 3 channels from 1500 x 1000 px to 11400 x 7600 px, box car size = (2 * width - 1)

Acknowledgments

Special thanks to @bfraboni for our insightful discussions and his main repository Fast Gaussian Blur.

Licence

You may use, distribute and modify this code under the terms of the MIT license. For further details please refer to : https://mit-license.org/

Name		Name	Last commit message	Last commit date
Latest commit History 94 Commits
.vscode		.vscode
data		data
py		py
.gitignore		.gitignore
FastBoxBlur.code-workspace		FastBoxBlur.code-workspace
Makefile		Makefile
README.md		README.md
fast_box_blur.h		fast_box_blur.h
fastboxblur.cpp		fastboxblur.cpp
stb_image.h		stb_image.h
stb_image_write.h		stb_image_write.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fast Box Blur - reflection padding

Algorithm

WIP - Experimental

Implementation

Usage

Performance

Acknowledgments

Licence

References

About

Releases

Packages

Contributors 2

Languages

michelerenzullo/FastBoxBlur

Folders and files

Latest commit

History

Repository files navigation

Fast Box Blur - reflection padding

Algorithm

WIP - Experimental

Implementation

Usage

Performance

Acknowledgments

Licence

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages