NEON is a set of single instruction, multiple data (SIMD) instructions for ARM, and it can help in performance optimization. In this recipe, we will learn how to add NEON support to your project, and how to vectorize the code using it.
We will use the Recipe12_ProcessingVideo project as a starting point, trying to minimize the processing time. The source code is available in the Recipe14_OptimizingWithNEON folder in the code bundle that accompanies this book. For this recipe, you can't use the Simulator, as NEON instructions are ARM-specific and are not supported on it; the Simulator runs x86 code.
The following is how we will optimize our video processing application. Let's implement the described steps:

First, we profile the RetroFilter::applyToVideo method, as it is the most time-consuming part of our application. We'll create a copy of this method with the name applyToVideo_optimized, and insert time measurements in it, as we did in the Printing a postcard (Intermediate) recipe. We'll not show the code of the method here, as it differs only in these measurements.

It is generally a good practice to use special profiling tools to find hotspots in an application. But in our case, we only have a few functions, and it is easier to measure their individual times without any tools. Image processing tasks are quite time consuming, so you can easily detect bottlenecks with simple logging, and focus your optimization there.
TIMER_ConvertingToGray: 8.28ms
TIMER_IntensityVariation: 16.23ms
TIMER_AddingScratches: 4.46ms
TIMER_FuzzyBorder: 14.65ms
TIMER_ConvertingToBGR: 2.59ms
2013-05-25 19:04:12.879 Recipe14_OptimizingWithNEON[4503:5203] Processing time = 48.05ms; Running average FPS = 20.1;
Profiling will show that there are two major hotspots in our application: the alphaBlendC1 function and the matrix multiplication by a scalar (intensity variation). Because both functions process individual pixels independently, we can parallelize their execution. We then have several choices, such as multithreading (via libdispatch) or vectorization using the NEON SIMD instruction set. To process an image with several threads, we can split it into several stripes (for example, into four horizontal stripes) and process them as submatrices. This approach is quite easy to implement, and it doesn't require any memory copying.
Implementations of the vectorized functions can be found in the Processing_NEON.cpp file of the CvEffects static library project. They are shown in the following code snippet (note that the type checks are written as separate comparisons; chaining them with == would not do what you expect in C++):

#include "Processing.hpp"
#if defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

#define USE_NEON true
#define USE_FIXED_POINT false

using namespace cv;

void alphaBlendC1_NEON(const Mat& src, Mat& dst, const Mat& alpha)
{
    CV_Assert(src.type() == CV_8UC1 &&
              dst.type() == CV_8UC1 &&
              alpha.type() == CV_8UC1 &&
              src.isContinuous() && dst.isContinuous() &&
              alpha.isContinuous() &&
              (src.cols % 8 == 0) &&
              (src.cols == dst.cols) && (src.cols == alpha.cols));

#if !defined(__ARM_NEON__) || !USE_NEON
    alphaBlendC1(src, dst, alpha);
#else
    uchar* pSrc = src.data;
    uchar* pDst = dst.data;
    uchar* pAlpha = alpha.data;
    for(int i = 0; i < src.total(); i += 8, pSrc += 8, pDst += 8, pAlpha += 8)
    {
        // Load data from memory to NEON registers
        uint8x8_t vsrc = vld1_u8(pSrc);
        uint8x8_t vdst = vld1_u8(pDst);
        uint8x8_t valpha = vld1_u8(pAlpha);
        uint8x8_t v255 = vdup_n_u8(255);

        // Multiply source pixels
        uint16x8_t mult1 = vmull_u8(vsrc, valpha);
        // Multiply destination pixels
        uint8x8_t tmp = vsub_u8(v255, valpha);
        uint16x8_t mult2 = vmull_u8(tmp, vdst);
        // Add them
        uint16x8_t sum = vaddq_u16(mult1, mult2);
        // Take upper bytes (approximates division by 255)
        uint8x8_t out = vshrn_n_u16(sum, 8);
        // Store the result back to the memory
        vst1_u8(pDst, out);
    }
#endif
}

void multiply_NEON(Mat& src, float multiplier)
{
    CV_Assert(src.type() == CV_8UC1 && src.isContinuous() &&
              (src.cols % 8 == 0));

#if !defined(__ARM_NEON__) || !USE_NEON
    src *= multiplier;
#elif USE_FIXED_POINT
    uchar fpMult = uchar((multiplier * 128.f) + 0.5f);
    uchar* ptr = src.data;
    for(int i = 0; i < src.total(); i += 8, ptr += 8)
    {
        uint8x8_t vsrc = vld1_u8(ptr);
        uint8x8_t vmult = vdup_n_u8(fpMult);
        uint16x8_t product = vmull_u8(vsrc, vmult);
        uint8x8_t out = vqshrn_n_u16(product, 7);
        vst1_u8(ptr, out);
    }
#else
    uchar* ptr = src.data;
    for(int i = 0; i < src.total(); i += 8, ptr += 8)
    {
        float32x4_t vmult1 = vdupq_n_f32(multiplier);
        float32x4_t vmult2 = vdupq_n_f32(multiplier);
        // Load
        uint8x8_t in = vld1_u8(ptr);
        // Convert to 16 bit
        uint16x8_t in16bit = vmovl_u8(in);
        // Split the vector
        uint16x4_t in16bit1 = vget_high_u16(in16bit);
        uint16x4_t in16bit2 = vget_low_u16(in16bit);
        // Convert to float
        uint32x4_t in32bit1 = vmovl_u16(in16bit1);
        uint32x4_t in32bit2 = vmovl_u16(in16bit2);
        float32x4_t inFlt1 = vcvtq_f32_u32(in32bit1);
        float32x4_t inFlt2 = vcvtq_f32_u32(in32bit2);
        // Multiplication
        float32x4_t outFlt1 = vmulq_f32(vmult1, inFlt1);
        float32x4_t outFlt2 = vmulq_f32(vmult2, inFlt2);
        // Convert back from float
        uint32x4_t out32bit1 = vcvtq_u32_f32(outFlt1);
        uint32x4_t out32bit2 = vcvtq_u32_f32(outFlt2);
        uint16x4_t out16bit1 = vmovn_u32(out32bit1);
        uint16x4_t out16bit2 = vmovn_u32(out32bit2);
        // Combine back
        uint16x8_t out16bit = vcombine_u16(out16bit2, out16bit1);
        // Convert to 8 bit with saturation
        uint8x8_t out8bit = vqmovn_u16(out16bit);
        // Store to the memory
        vst1_u8(ptr, out8bit);
    }
#endif
}
Finally, we call these optimized functions from the applyToVideo_optimized method.

Nowadays, SIMD instructions are available on many architectures, from desktop CPUs to embedded DSPs. ARM processors provide a rich instruction set called NEON; it is available on all iOS devices starting from the iPhone 3GS.
To start writing NEON code, you have to add the following declaration to your file:
#if defined(__ARM_NEON__)
#include <arm_neon.h>
#endif
Now you can use all the types and functions declared there. Please note that we're going to use so-called intrinsics: C functions that serve as wrappers over NEON assembler instructions. You could write your code in pure assembler instead, but that worsens readability, and the small performance gain it may bring usually isn't worth it.
Let's consider how the alphaBlendC1_NEON function works. This function should use the following formula to calculate the resulting pixel's value:
dst(x, y) = [alpha(x, y) * src(x, y) + (255.0 - alpha(x, y)) * dst(x, y)] / 255.0;
The NEON code does exactly that, except for the very last division, which is approximated by a bit-shift of 8 positions to the right (the vshrn_n_u16 function). This means that we divide by 256 instead of 255, so the result of the vectorized function may differ from the original implementation. We can tolerate that, as we're working on a visual effect and the possible difference is negligibly small. But please note that such approximations may be unacceptable in a numerical pipeline.
You can also see that we process 8 pixels simultaneously. Our alphaBlendC1_NEON function relies heavily on the exact format of the input matrices (that is, a single channel, continuous storage, and a number of columns that is a multiple of 8), but it can be easily generalized for other situations.
The multiply_NEON function performs a simple multiplication by a floating-point coefficient, but with NEON it requires a sequence of conversions between integer and floating-point vectors. Still, because we process 8 pixels simultaneously, the speedup is impressive.
Performance optimization with NEON is a deep and wide subject. Many image processing functions can be optimized for a roughly 3x speedup without affecting accuracy, and you can gain even more if you apply some approximations. In the following sections, we provide some pointers for further study.
ARM Information Center provides extensive documentation on NEON intrinsics, and can be found at http://bit.ly/3848_ARMNEON. You can see that the instruction set is quite rich, and allows you to optimize your code in different situations.
Our multiply_NEON function is a naive translation of the C++ code to NEON intrinsics. Sometimes it is possible to achieve a much better speedup by using an approximation. A very popular method of approximating floating-point calculations is so-called fixed-point arithmetic, where we store real numbers in variables of integer type (http://en.wikipedia.org/wiki/Fixed-point_arithmetic).
In our case, we can convert the value of multiplier into the Q1.7 format, perform the multiplication, and then scale the result back. More about the Qm.n format can be found at http://en.wikipedia.org/wiki/Q_(number_format). The only subtlety is that the actual Q1.7 format requires 9 bits, where the first bit is used for the sign. But because pixel values are positive, we can drop the sign bit and pack the Q1.7 value into the 8 bits of a single byte.
In the following code, we demonstrate the use of the fixed-point arithmetic:
uchar src = 111;
float multiplier = 0.76934f;
uchar dst = 0;

dst = uchar(src * multiplier);
printf("dst floating-point = %d\n", dst);

uchar fpMultiplier = uchar((multiplier * 128.f) + 0.5f);
dst = (src * fpMultiplier) >> 7; // 128 = 2^7
printf("dst fixed-point = %d\n", dst);
The following is the console output for that code. You can see that the approximation is not exact, but again, we can tolerate it in our application. We can also try to use the Qm.n format with a larger value of n, for example, Q1.15:
dst floating-point = 85
dst fixed-point = 84
It can be seen that fixed-point arithmetic uses integer operations instead of floating-point ones, and so it is much more efficient. At the same time, it can be effectively vectorized with NEON, producing even higher speedups.
Please note that you shouldn't expect a further speedup in our example, as the NEON version is already good enough. But if the numerical pipeline is a little more complicated, fixed-point arithmetic may give you an impressive speedup.