Aparapi (https://github.com/aparapi/aparapi) is a Java library that supports concurrent operations. The API supports code running on GPUs or CPUs. GPU operations are executed using OpenCL, while CPU operations use Java threads. The user can specify which computing resource to use. However, if GPU support is not available, Aparapi will revert to Java threads.
The API converts Java bytecode to OpenCL at runtime. This makes the API largely independent of the graphics card used. The API was initially developed by AMD but has been released as open source. This is reflected in the basic package name, com.amd.aparapi. Aparapi offers a higher level of abstraction than provided by OpenCL.
Aparapi code is located in a class derived from the Kernel class. Its execute method will start the operations. This will result in an internal call to a run method, which needs to be overridden. It is within the run method that concurrent code is placed. The run method is executed multiple times on different processors.
Due to OpenCL limitations, we are unable to use inheritance or method overloading. In addition, println should not be used in the run method, since the code may be running on a GPU. Aparapi only supports one-dimensional arrays. Arrays with two or more dimensions need to be flattened to a one-dimensional array. Support for double values depends on the OpenCL version and GPU configuration.
When a Java thread pool is used, it allocates one thread per CPU core. The kernel containing the Java code is cloned, one copy per thread. This avoids the need to share data across threads. Each thread has access to information, such as a global ID, to assist in the code execution. The kernel will wait for all of the threads to complete.
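The role of the global ID can be imitated in plain Java. The following is only a rough sketch and not how Aparapi is implemented: the class GlobalIdDemo and its squares method are hypothetical, and a standard ExecutorService stands in for Aparapi's internal thread pool. Each work item receives a distinct ID, writes only to its own array slot, and the caller waits until every item has finished.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class GlobalIdDemo {
    // Runs "range" work items on a fixed pool with one thread per core.
    // Each item uses its own global ID as the index it writes to.
    static int[] squares(int range) {
        int[] output = new int[range];
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int id = 0; id < range; id++) {
                final int globalId = id;
                pending.add(pool.submit(() -> {
                    output[globalId] = globalId * globalId;
                }));
            }
            // Wait for every work item, as the kernel waits for its threads.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("work item failed", e);
        } finally {
            pool.shutdown();
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squares(5))); // [0, 1, 4, 9, 16]
    }
}
```

Because no two work items touch the same element, no locking is needed, which is the same property that makes the kernel code safe to run concurrently.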
Aparapi downloads can be found at https://github.com/aparapi/aparapi/releases.
The basic framework for an Aparapi application is shown next. It consists of a Kernel derived class where the run method is overridden. In this example, the run method will perform scalar multiplication. This operation involves multiplying each element of a vector by some value.
The ScalarMultiplicationKernel extends the Kernel class. It possesses two instance variables used to hold the matrices for input and output. The constructor will initialize the matrices. The run method will perform the actual computations, and the displayResult method will show the results of the multiplication:
public class ScalarMultiplicationKernel extends Kernel {
    float[] inputMatrix;
    float outputMatrix [];

    public ScalarMultiplicationKernel(float inputMatrix[]) {
        ...
    }

    @Override
    public void run() {
        ...
    }

    public void displayResult() {
        ...
    }
}
The constructor is shown here:
public ScalarMultiplicationKernel(float inputMatrix[]) {
    this.inputMatrix = inputMatrix;
    outputMatrix = new float[this.inputMatrix.length];
}
In the run method, we use a global ID to index into the matrix. This code is executed on each computation unit, for example, a GPU or thread. A unique global ID is provided to each computational unit, allowing the code to access a specific element of the matrix. In this example, each element of the input matrix is multiplied by 2 and then assigned to the corresponding element of the output matrix:
@Override
public void run() {
    int globalID = this.getGlobalId();
    outputMatrix[globalID] = 2.0f * inputMatrix[globalID];
}
The displayResult method simply displays the contents of the outputMatrix array:
// Assumes: import static java.lang.System.out;
public void displayResult() {
    out.println("Result");
    for (float element : outputMatrix) {
        out.printf("%.4f ", element);
    }
    out.println();
}
To use this kernel, we need to declare variables for the inputMatrix and its size. The size will be used to control how many times the kernel's run method is executed:
float inputMatrix[] = {3, 4, 5, 6, 7, 8, 9};
int size = inputMatrix.length;
The kernel is then created using the input matrix, followed by the invocation of the execute method. This method starts the process and will eventually invoke the Kernel class' run method once for each global ID; the execute method's argument determines how many invocations are made. An overloaded version of execute accepts a second argument specifying the number of passes to make, and the current pass is identified by a pass ID. While not used in this example, we will use it in the next section. When the process is complete, the resulting output matrix is displayed and the dispose method is called to stop the process:
ScalarMultiplicationKernel kernel = new ScalarMultiplicationKernel(inputMatrix);
kernel.execute(size);
kernel.displayResult();
kernel.dispose();
When this application is executed, we will get the following output:
Result
6.0000 8.0000 10.0000 12.0000 14.0000 16.0000 18.0000
We can specify the execution mode using the Kernel class' setExecutionMode method, as shown here:
kernel.setExecutionMode(Kernel.EXECUTION_MODE.GPU);
However, it is best to let Aparapi determine the execution mode. The following table summarizes the execution modes available:
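Execution mode | Meaning |
Kernel.EXECUTION_MODE.NONE | Does not specify mode |
Kernel.EXECUTION_MODE.CPU | Use CPU |
Kernel.EXECUTION_MODE.GPU | Use GPU |
Kernel.EXECUTION_MODE.JTP | Use Java threads |
Kernel.EXECUTION_MODE.SEQ | Use single loop (for debugging purposes) |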
Next, we will demonstrate how we can use Aparapi to perform dot product matrix multiplication.
We will use the matrices as used in the Implementing basic matrix operations section. We start with the declaration of the MatrixMultiplicationKernel class, which contains the vector declarations, a constructor, the run method, and a displayResults method. The vectors for matrices A (4 x 2) and B (2 x 3) have been flattened to one-dimensional arrays by storing the matrices in row-major order, that is, one row after another:
class MatrixMultiplicationKernel extends Kernel {
    float[] vectorA = {
        0.1950f, 0.0311f,
        0.3588f, 0.2203f,
        0.1716f, 0.5931f,
        0.2105f, 0.3242f};
    float[] vectorB = {
        0.0502f, 0.9823f, 0.9472f,
        0.5732f, 0.2694f, 0.916f};
    float[] vectorC;
    int n;
    int m;
    int p;

    @Override
    public void run() {
        ...
    }

    public MatrixMultiplicationKernel(int n, int m, int p) {
        ...
    }

    public void displayResults () {
        ...
    }
}
The MatrixMultiplicationKernel constructor assigns values for the matrices' dimensions and allocates memory for the result stored in vectorC, as shown here:
public MatrixMultiplicationKernel(int n, int m, int p) {
    this.n = n;
    this.p = p;
    this.m = m;
    vectorC = new float[n * p];
}
The run method uses a global ID and a pass ID to perform the matrix multiplication. The number of passes is specified as the second argument of the Kernel class' execute method, as we will see shortly, and the pass ID identifies the current pass. This value allows us to advance the column index for vectorC. The vector indexes map to the corresponding row and column positions of the original matrices:
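@Override
public void run() {
    int i = getGlobalId();      // row of the result
    int j = this.getPassId();   // column of the result
    float value = 0;
    // k runs over the shared dimension m (columns of A, rows of B)
    for (int k = 0; k < m; k++) {
        value += vectorA[k + i * m] * vectorB[k * p + j];
    }
    vectorC[i * p + j] = value;
}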
The displayResults method is shown as follows:
public void displayResults() {
    out.println("Result");
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            out.printf("%.4f ", vectorC[i * p + j]);
        }
        out.println();
    }
}
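The kernel is started in the same way as in the previous section. The execute method is passed the number of kernels that should be created and an integer indicating the number of passes to make. Here, one kernel is created for each of the n rows of the result and one pass is made for each of its p columns, so the pass ID selects the column used in the vectorB and vectorC arrays:
int n = 4; // rows of A and of the result
int m = 2; // columns of A and rows of B
int p = 3; // columns of B and of the result
MatrixMultiplicationKernel kernel = new MatrixMultiplicationKernel(n, m, p);
kernel.execute(n, p);
kernel.displayResults();
kernel.dispose();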
When this example is executed, you will get the following output:
Result
0.0276 0.1999 0.2132
0.1443 0.4118 0.5417
0.3486 0.3283 0.7058
0.1964 0.2941 0.4964
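To check results like these without a GPU, the kernel's index arithmetic can be reproduced in ordinary sequential Java. The sketch below is an assumption-laden stand-in, not Aparapi code: the class MatrixMultiplyReference and its multiply method are hypothetical, and two nested loops play the roles of the pass ID and the global ID. The dimensions n = 4, m = 2, and p = 3 are inferred from the lengths of vectorA and vectorB and the shape of the output.

```java
class MatrixMultiplyReference {
    // Multiplies an n x m matrix a by an m x p matrix b, both stored as
    // flattened row-major arrays, and returns the n x p result flattened.
    static float[] multiply(float[] a, float[] b, int n, int m, int p) {
        float[] c = new float[n * p];
        for (int passId = 0; passId < p; passId++) {           // stands in for getPassId()
            for (int globalId = 0; globalId < n; globalId++) { // stands in for getGlobalId()
                float value = 0;
                for (int k = 0; k < m; k++) {                  // shared dimension
                    value += a[globalId * m + k] * b[k * p + passId];
                }
                c[globalId * p + passId] = value;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        float[] a = {0.1950f, 0.0311f, 0.3588f, 0.2203f,
                     0.1716f, 0.5931f, 0.2105f, 0.3242f};
        float[] b = {0.0502f, 0.9823f, 0.9472f,
                     0.5732f, 0.2694f, 0.916f};
        int n = 4, m = 2, p = 3;
        float[] c = multiply(a, b, n, m, p);
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                System.out.printf("%.4f ", c[i * p + j]);
            }
            System.out.println();
        }
    }
}
```

Comparing this sequential result with the kernel's output is a simple way to confirm that the flattened indexes and the loop bounds are correct.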
Next, we will see how Java 8 additions can contribute to solving math-intensive problems in a parallel manner.