NEON is a set of single instruction, multiple data (SIMD) instructions for ARM, and it can help in performance optimization. In this recipe, we will learn how to add NEON support to your project, and how to vectorize the code using it.
We will use the Recipe12_ProcessingVideo project as a starting point, trying to minimize the processing time. The source code is available in the Recipe14_OptimizingWithNEON folder in the code bundle that accompanies this book. For this recipe, you can't use the Simulator, as NEON instructions are ARM-specific and are not supported on it; the Simulator runs x86 code.
The following is how we will optimize our video processing application. Let's implement the described steps:

First, we profile the RetroFilter::applyToVideo method, as it is the most time-consuming part of our application. We'll create a copy of this method with the name applyToVideo_optimized, and insert time measurements in it, as we did in the Printing a postcard (Intermediate) recipe. We'll not show the code of the method here, as it differs only in these measurements.

It is generally a good practice to use special profiling tools to find hotspots in an application. But in our case, we only have a few functions, and it is easier to measure their individual times without any tools. Image processing tasks are quite time consuming, so you can easily detect bottlenecks with simple logging, and focus your optimization there.
TIMER_ConvertingToGray: 8.28ms
TIMER_IntensityVariation: 16.23ms
TIMER_AddingScratches: 4.46ms
TIMER_FuzzyBorder: 14.65ms
TIMER_ConvertingToBGR: 2.59ms
2013-05-25 19:04:12.879 Recipe14_OptimizingWithNEON[4503:5203] Processing time = 48.05ms; Running average FPS = 20.1;
Profiling will show that there are two major hotspots in our application: the alphaBlendC1 function and the matrix multiplication by a scalar (intensity variation). Because both functions process individual pixels independently, we can parallelize their execution. We then have several choices, such as multithreading (via libdispatch) or vectorization using the NEON SIMD instruction set. To process an image with several threads, we can split it into several stripes (for example, into four horizontal stripes) and process them as submatrices. This approach is quite easy to implement, and it doesn't require any memory copying.
Implementations of the vectorized functions can be found in the Processing_NEON.cpp file of the CvEffects static library project. They are shown in the following code snippet (note that the type checks are written as separate comparisons; chaining them with == would not do what you expect in C++):

#include "Processing.hpp"
#if defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

#define USE_NEON true
#define USE_FIXED_POINT false

using namespace cv;

void alphaBlendC1_NEON(const Mat& src, Mat& dst, const Mat& alpha)
{
    CV_Assert(src.type() == CV_8UC1 &&
              dst.type() == CV_8UC1 &&
              alpha.type() == CV_8UC1 &&
              src.isContinuous() && dst.isContinuous() &&
              alpha.isContinuous() &&
              (src.cols % 8 == 0) &&
              (src.cols == dst.cols) && (src.cols == alpha.cols));

#if !defined(__ARM_NEON__) || !USE_NEON
    alphaBlendC1(src, dst, alpha);
#else
    uchar* pSrc = src.data;
    uchar* pDst = dst.data;
    uchar* pAlpha = alpha.data;
    for(int i = 0; i < src.total(); i += 8, pSrc += 8, pDst += 8, pAlpha += 8)
    {
        // Load data from memory to NEON registers
        uint8x8_t vsrc = vld1_u8(pSrc);
        uint8x8_t vdst = vld1_u8(pDst);
        uint8x8_t valpha = vld1_u8(pAlpha);
        uint8x8_t v255 = vdup_n_u8(255);

        // Multiply source pixels
        uint16x8_t mult1 = vmull_u8(vsrc, valpha);
        // Multiply destination pixels
        uint8x8_t tmp = vsub_u8(v255, valpha);
        uint16x8_t mult2 = vmull_u8(tmp, vdst);
        // Add them
        uint16x8_t sum = vaddq_u16(mult1, mult2);
        // Take upper bytes (approximates division by 255)
        uint8x8_t out = vshrn_n_u16(sum, 8);
        // Store the result back to the memory
        vst1_u8(pDst, out);
    }
#endif
}

void multiply_NEON(Mat& src, float multiplier)
{
    CV_Assert(src.type() == CV_8UC1 && src.isContinuous() &&
              (src.cols % 8 == 0));

#if !defined(__ARM_NEON__) || !USE_NEON
    src *= multiplier;
#elif USE_FIXED_POINT
    uchar fpMult = uchar((multiplier * 128.f) + 0.5f);
    uchar* ptr = src.data;
    for(int i = 0; i < src.total(); i += 8, ptr += 8)
    {
        uint8x8_t vsrc = vld1_u8(ptr);
        uint8x8_t vmult = vdup_n_u8(fpMult);
        uint16x8_t product = vmull_u8(vsrc, vmult);
        uint8x8_t out = vqshrn_n_u16(product, 7);
        vst1_u8(ptr, out);
    }
#else
    uchar* ptr = src.data;
    for(int i = 0; i < src.total(); i += 8, ptr += 8)
    {
        float32x4_t vmult1 = vdupq_n_f32(multiplier);
        float32x4_t vmult2 = vdupq_n_f32(multiplier);
        // Load
        uint8x8_t in = vld1_u8(ptr);
        // Convert to 16 bit
        uint16x8_t in16bit = vmovl_u8(in);
        // Split the vector
        uint16x4_t in16bit1 = vget_high_u16(in16bit);
        uint16x4_t in16bit2 = vget_low_u16(in16bit);
        // Convert to float
        uint32x4_t in32bit1 = vmovl_u16(in16bit1);
        uint32x4_t in32bit2 = vmovl_u16(in16bit2);
        float32x4_t inFlt1 = vcvtq_f32_u32(in32bit1);
        float32x4_t inFlt2 = vcvtq_f32_u32(in32bit2);
        // Multiplication
        float32x4_t outFlt1 = vmulq_f32(vmult1, inFlt1);
        float32x4_t outFlt2 = vmulq_f32(vmult2, inFlt2);
        // Convert back from float
        uint32x4_t out32bit1 = vcvtq_u32_f32(outFlt1);
        uint32x4_t out32bit2 = vcvtq_u32_f32(outFlt2);
        uint16x4_t out16bit1 = vmovn_u32(out32bit1);
        uint16x4_t out16bit2 = vmovn_u32(out32bit2);
        // Combine back
        uint16x8_t out16bit = vcombine_u16(out16bit2, out16bit1);
        // Convert to 8 bit with saturation
        uint8x8_t out8bit = vqmovn_u16(out16bit);
        // Store to the memory
        vst1_u8(ptr, out8bit);
    }
#endif
}
Finally, we call these optimized functions from the applyToVideo_optimized method.

Nowadays, SIMD instructions are available on many architectures, from desktop CPUs to embedded DSPs. ARM processors provide a rich instruction set called NEON; it is available on all iOS devices starting from the iPhone 3GS.
To start writing NEON code, you have to add the following declaration to your file:
#if defined(__ARM_NEON__)
#include <arm_neon.h>
#endif
Now you can use all the types and functions declared there. Please note that we're going to use so-called intrinsics: C functions that serve as wrappers over NEON assembler instructions. You could write your code in pure assembler instead, but that worsens readability, and the small performance gain it may bring usually isn't worth it.
Let's consider how the alphaBlendC1_NEON function works. This function should use the following formula to calculate the resulting pixel's value:
dst(x, y) = [alpha(x, y) * src(x, y) + (255.0 - alpha(x, y)) * dst(x, y)] / 255.0;
The NEON code does exactly that, except for the very last division, which is approximated by a bit-shift of 8 positions to the right (the vshrn_n_u16 function). This means that we divide by 256 instead of 255, so the result of the vectorized function may differ from the original implementation. We can tolerate that, as we're working on a visual effect and the possible difference is negligibly small. But please note that such approximations may be unacceptable in a numerical pipeline.
You can also see that we process 8 pixels simultaneously. Our alphaBlendC1_NEON function relies heavily on the exact format of the input matrices (that is, a single channel, continuous storage, and a number of columns that is a multiple of 8), but it can be easily generalized for other situations.
The multiply_NEON function performs a simple multiplication by a floating-point coefficient, but with NEON it requires a sequence of conversions between integer and floating-point vectors. Still, because we process 8 pixels simultaneously, the speedup is impressive.
Performance optimization with NEON is a deep and wide subject. Many image processing functions can be optimized for a roughly 3x speedup without affecting accuracy, and you can gain even more if you apply some approximations. In the following sections, we provide some pointers for further study.
ARM Information Center provides extensive documentation on NEON intrinsics, and can be found at http://bit.ly/3848_ARMNEON. You can see that the instruction set is quite rich, and allows you to optimize your code in different situations.
Our multiply_NEON function is a naive translation of the C++ code to NEON intrinsics. Sometimes it is possible to achieve a much better speedup by using an approximation. A very popular method of approximating floating-point calculations is so-called fixed-point arithmetic, where we store real numbers in variables of integer type (http://en.wikipedia.org/wiki/Fixed-point_arithmetic).
In our case, we can convert the value of multiplier into the Q1.7 format, perform the multiplication, and then scale the result back. More about the Qm.n format can be found at http://en.wikipedia.org/wiki/Q_(number_format). The only subtlety is that the actual Q1.7 format requires 9 bits, where the first bit is used for the sign. But because pixel values are positive, we can drop the sign bit and pack the Q1.7 value into the 8 bits of a single byte.
In the following code, we demonstrate the use of the fixed-point arithmetic:
uchar src = 111;
float multiplier = 0.76934f;
uchar dst = 0;

dst = uchar(src * multiplier);
printf("dst floating-point = %d\n", dst);

uchar fpMultiplier = uchar((multiplier * 128.f) + 0.5f);
dst = (src * fpMultiplier) >> 7; // 128 = 2^7
printf("dst fixed-point = %d\n", dst);
The following is the console output for that code. You can see that the approximation is not exact, but again, we can tolerate it in our application. We can also try to use the Qm.n format with a larger value of n, for example, Q1.15:
dst floating-point = 85
dst fixed-point = 84
It can be seen that fixed-point arithmetic uses integer operations instead of floating-point ones, and so it is much more efficient. At the same time, it can be effectively vectorized with NEON, producing even higher speedups.
Please note that you shouldn't expect a further speedup in our example, as the NEON version is already good enough. But if the numerical pipeline is a little more complicated, fixed-point arithmetic may give you an impressive speedup.