Appendix 1

Presentation of the Benchmarks used in our Experiments

This appendix describes the benchmarks and the data dependence graphs (DDG) that we used in our experiments. The DDGs have been generated by the st200cc compiler from STmicroelectronics, using the option -03. Superblock formation and loop unrolling are enabled, and instruction selection has been performed for the ST231 VLIW processor.

The ST231 processor used for our experiments executes up to four operations per clock cycle with a maximum of one control operation (goto, jump, call, return), one memory operation (load, store, prefetch) and two multiply operations per clock cycle. All arithmetic instructions operate on integer values with operands belonging either to the general register (GR) file (64 × 32 bit) or to the branch register (BR) file (8 × 1 bit). Floating-point computations are emulated by software. In order to eliminate some conditional branches, the ST200 architecture also provides conditional selection. The processing time of any operation is a single clock cycle, while the latencies between operations range from 0 to 3 clock cycles.

Note that we make our DDG public for helping the research community to share their data and to reproduce our performance numbers.

A1.1. Qualitative benchmarks presentation

We consider a representative set of applications for both high performance and embedded benchmarks. We chose to optimize the set of the following collections of well-known applications programmed in C and C++.

1) FFMPEG is the reference application benchmark used by STMicroelectronics for their compilation research and development. It is a representative application for the usage of ST231 (video mpeg encoder/decoder). The application is a set of 119 C files, containing 112,997 lines of C code.
2) MEDIABENCH is a collection of 10 applications for multimedia written in C (encryption, image and video processing, compression, speech recognition, etc.). In its public version, MEDIABENCH is not portable to any platform because some parts are coded in assembly language of some selected workstation targets (excluding very lng instruction word (VLIW) targets). Our used MEDIABENCH collection has first been ported to ST231 VLIW platform. The whole MEDIABENCH application has 1,467 C files, containing 788,261 lines of C code.
3) SPEC2000 is a collection of applications for high-performance computing and desktop market (scientific computing, simulation, compiler, script interpreters, multimedia applications, desktop applications, etc.). It is a group of 12 big applications of representative integer programs and 4 big applications of floating-point programs. The whole collection contains 469 C files and 151 C++ files (656,867 lines of C and C++ code).
4) SPEC CPU2006 is the last collection of applications for scientific computing, intensive computation and desktop market. Compared to SPEC2000, SPEC2006 has larger code size and data sets (2386 C file, 528 C++ files and 3,365,040 C/C++ lines).

Both FFMPEG and MEDIABENCH collections have been successfully compiled, linked and executed on the embedded ST231 platform. For SPEC2000 and SPEC CPU2006, they have been successfully compiled and statically optimized but not executed because of one of the three following reasons:

1) Our target embedded system does not support some required dynamic function libraries by SPEC (the dynamic execution system of an embedded system is not as rich as a desktop workstation).
2) The large code size of SPEC benchmarks does not fit inside small embedded systems based on ST231.
3) The amount of requested dynamic memory (heap) cannot be satisfied at execution time on our embedded platform.

Consequently, our experiments report static performance numbers for all benchmark collections. The dynamic performance numbers (executions) are reported only for FFMPEG and MEDIABENCH applications.

The next section provides some useful quantitative metrics to analyze the complexity of our benchmarks.

A1.2. Quantitative benchmarks presentation

In order to gain a precise idea on problem sizes handled by our register optimization methods, we report six metrics using histograms (the x-axis represents the values and the y-axis represents the number of loops of the given values):

1) The numbers of nodes (loop statements) are depicted in Figure A1.1 for each benchmark collection. The whole median1 is equal to 24 nodes; the maximal value is 847. FFMPEG has the highest median of nodes numbers (29).
2) The numbers of nodes writing inside GRs are depicted in Figure A1.2. The whole median is equal to 15 nodes; the maximal value is 813 nodes. FFMPEG has the highest median (21 nodes).
3) The numbers of nodes writing inside BRs are depicted in Figure A1.3. The whole median is equal to 3 nodes; the maximal value is 35 nodes. Both FFMPEG and MEDIABENCH have a median of 1 node, meaning that half of their loops have a unique branch instruction (the regular loop branch). It can be noted that our model considers loops with multiple branch instructions inside their bodies.
4) The numbers of edges (data dependences) are depicted in Figure A1.4 for each benchmark collection. The whole median is equal to 73 edges; the maximal value is 21,980 edges. The highest median is FFMPEG one (99 edges).
5) the MinII values are depicted in Figure A1.5. We recall that MinII = max(MII, MIIres), where MIIres is the minimal II imposed by the resource constraints of the ST231 processor. The whole median of MinII values is equal to 12 clock cycles; the maximal value is 640 clock cycles. The highest median is that of FFMPEG (20 clock cycles);
6) The numbers of strongly connected components are depicted in Figure A1.6. The whole median is equal to nine strongly connected components, which means that, if needed, half of the loops can be split by loop fission into nine smaller loops; the maximal value is equal to 295. FFMPEG has the smallest median (seven strongly connected components).

Figure A1.1. Histograms on the number of nodes (loop statements): || V ||

image

Figure A1.2. Histograms on the number of statements writing inside general registers ||VR, GR||

image

Figure A1.3. Histograms on the number of statements writing inside branch registers ||VR, BR||

image

These quantitative measures show that the FFMPEG application brings a priori the most difficult and complex DDG instances for code optimization. This analysis is confirmed by our experiments below.

A1.3. Changing the architectural configuration of the processor

The previous section shows a quantitative presentation of our benchmarks when we consider the ST231 VLIW processor with its architectural configuration. In order to emulate more complex architectures, we configured the st200cc compiler to generate DDG for a processor architecture with three register types T = {FP, GR, BR} instead of two. Consequently, the distribution of the number of values per register type becomes the following.2

Figure A1.4. Histograms on the number of data dependences ||E||

image
image

Figure A1.5. Histograms on MinII values

image

We also considered various configurations for the number of architectural registers. We considered three possible configurations named small, medium and large architectures, respectively:

image

Figure A1.6. Histograms on the number of strongly connected components

image

1 We deliberately choose to report the median value instead of the mean value because the histograms show a skewed (biased) distribution [JAI 91].

2 MIN stands for MINimum, FST stands for FirST quantile (25% of the population), MED stands for MEDian (50% of the population), THD stands for THirD quantile (75% of the population) and MAX stands for MAXimum.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset