Conclusion

C.1. Some open problems in backend optimizing compilation

C.1.1. Problem of instruction selection

One of the most interesting problems in backend code optimization is instruction selection, which we did not address in this book because we think that the problem is still open when instruction schedules are not fixed.

Instruction selection allows us to transform a low-level intermediate code into the assembly code of the target machine. We think that the classical model based on pattern matching (rewriting rules and trees [AHO 07]) does not exactly describe the problem of instruction selection. Indeed, such syntactic rules allow us to transform m intermediate instructions into a single assembly instruction, using a static cost model [AHO 07]. We think that a more accurate model must be based on code semantics. That is, the general problem is transforming m low-level intermediate instructions into n assembly instructions computing the same result, and this could be modeled by algorithm recognition as studied in [ALI 05]. The reason is that, with the advance of reconfigurable computing and heterogeneous architectures, some sophisticated assembly instructions (with complex semantics) may be introduced in the instruction set: mathematical functions, vector instructions, domain-specific instructions, etc. Rewriting rules based on pattern matching are fragile techniques that may not detect opportunities for transforming m low-level three address instructions into sophisticated instructions. More advanced code analysis based on program semantics should be used.
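The fragility of syntactic rules can be illustrated with a toy example. The sketch below (all names are ours, purely illustrative) applies one rewriting rule, add(mul(x,y),z) → madd, to expression trees: it fires on the tree whose shape matches exactly, but misses the commuted tree add(z, mul(x,y)) even though both compute the same value.

```python
def select(tree):
    """Naive syntactic instruction selection on expression trees.
    A tree is either a leaf (variable name) or a tuple (op, left, right).
    One rewriting rule: add(mul(x, y), z) -> madd."""
    if isinstance(tree, str):
        return [f"load {tree}"]  # leaf: load the variable
    op, left, right = tree
    if op == "add" and isinstance(left, tuple) and left[0] == "mul":
        # the rule matches this exact shape only
        return select(left[1]) + select(left[2]) + select(right) + ["madd"]
    return select(left) + select(right) + [op]

t1 = ("add", ("mul", "a", "b"), "c")  # matches the rule: madd is emitted
t2 = ("add", "c", ("mul", "a", "b"))  # same value, but the rule misses it
```

Here `select(t2)` falls back to separate `mul` and `add` instructions, although a semantics-based recognizer would select `madd` for both trees.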

Another open question for instruction selection is its phase ordering. Indeed, it is not clear where the instruction selection step is best placed within a backend compiler flow. Usually, instruction selection is applied before register allocation and instruction scheduling. In this case, the cost model used to select one instruction among others is not exact, since the instruction-level parallelism (ILP) extracted afterward may hide or increase a cost: consider a multiply-add, which may be favored by the instruction selection step even though it is not the best candidate for enhancing ILP if only a single functional unit can execute it.
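The multiply-add scenario can be quantified with a toy list-scheduling sketch. All unit counts and latencies below are illustrative assumptions (a single non-pipelined MAC unit versus two pipelined multipliers and two ALUs), not a model of any real processor: selecting two madd instructions serializes them on the lone MAC unit, while the split mul/add selection finishes earlier.

```python
def list_schedule(ops, units):
    """Greedy list scheduling on a simple resource model.
    ops:   name -> {"unit": u, "lat": latency, "deps": [names]}
    units: u -> {"count": instances, "ii": initiation interval}"""
    free = {u: [0] * units[u]["count"] for u in units}  # next-free cycle
    finish, remaining = {}, dict(ops)
    while remaining:
        best = None
        for name, op in remaining.items():
            if any(d in remaining for d in op["deps"]):
                continue  # not ready: a predecessor is still unscheduled
            ready = max((finish[d] for d in op["deps"]), default=0)
            inst = min(range(len(free[op["unit"]])),
                       key=lambda i: free[op["unit"]][i])
            t = max(ready, free[op["unit"]][inst])
            if best is None or t < best[0]:
                best = (t, name, inst)
        t, name, inst = best
        op = remaining.pop(name)
        finish[name] = t + op["lat"]
        free[op["unit"]][inst] = t + units[op["unit"]]["ii"]
    return max(finish.values())  # makespan of the schedule

# Selection A: two independent madd on one non-pipelined MAC unit.
madd_units = {"mac": {"count": 1, "ii": 3}}
madd_ops = {"m1": {"unit": "mac", "lat": 3, "deps": []},
            "m2": {"unit": "mac", "lat": 3, "deps": []}}

# Selection B: separate mul/add on two multipliers and two ALUs.
split_units = {"mul": {"count": 2, "ii": 1}, "alu": {"count": 2, "ii": 1}}
split_ops = {"mul1": {"unit": "mul", "lat": 3, "deps": []},
             "mul2": {"unit": "mul", "lat": 3, "deps": []},
             "add1": {"unit": "alu", "lat": 1, "deps": ["mul1"]},
             "add2": {"unit": "alu", "lat": 1, "deps": ["mul2"]}}
```

Under these assumptions, selection A takes 6 cycles and selection B takes 4, so the locally cheaper madd is the globally worse choice.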

Contrary to instruction scheduling and register allocation, we think that instruction selection suffers from a lack of fundamental knowledge that helps us to design elegant and robust heuristics. For all these reasons, we think that instruction selection remains an interesting problem to study in backend compilation.

C.1.2. Code optimization for multi-core and many-core processors

Putting multiple processors inside the same microchip (including networks on chip) does not fundamentally change the problems of parallel programming. The programming paradigms used for multiprocessors are exactly the same for multi-cores: shared memory (OpenMP), distributed memory (MPI), threads and processes, multi-agent systems and bio-inspired parallelism can all be used to program multi-core processors. Furthermore, parallel algorithms would not change simply because a multiprocessor machine is transformed into a multi-core processor. In addition, the performance of a parallel program always remains limited by its sequential part: contrary to Moore's law, which is reaching its practical limit, Amdahl's law remains valid forever. Consequently, the importance of optimizing sequential codes is not reduced by the multi-core era; it remains a complementary research activity.
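Amdahl's law makes the limit on parallel speedup concrete. The small sketch below computes it for a program whose fraction p of the execution time is parallelizable; the numbers in the usage note are just the direct consequence of the formula.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup on n cores of a program whose
    fraction p of the sequential execution time is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)
```

Even if 90% of a program is perfectly parallel, `amdahl_speedup(0.9, 8)` is only about 4.7, and no number of cores can push the speedup beyond 1/(1 - p) = 10: the sequential 10% dominates.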

We think that the introduction of multi-core processors brings us new application domains for parallelism (not a new paradigm). Indeed, parallelism used to be an expert domain mainly tackled for scientific computing and simulation. Automatic parallelization was initially designed for regular codes (static control programs) such as Fortran programs executing on supercomputers. With multi-cores, parallelism becomes a cheap technology that brings high-performance computing into the home, opening a new market for the semiconductor industry. Consequently, the applications that must be optimized for multi-cores are general-purpose ones, clearly distinct from regular scientific codes. Such applications are executed on desktops and programmed in languages such as Java/C/C++, where the programmer makes extensive use of data pointers, data structures, while-loops, if-then-else constructs, external libraries, indirect array accesses and function pointers. Automatic parallelization in this context becomes very limited in practice.

The future trend in the number of cores inside a microchip is not clear today. The number of cores may increase for many years, it may hit a physical limit quickly, or the technology may focus on heterogeneous architectures, with specialized cores surrounding a subset of general-purpose cores. As usual, the question is how to exploit all these cores by extracting enough parallelism. Our personal view is to design applications in parallel from the beginning when possible, though not all problems can be solved with parallel algorithms. The huge amount of existing irregular sequential code should be parallelized with semi-automatic tools. For this purpose, we may need advanced data flow analysis methods that consider irregular program structures.

Finally, let us consider the aspect of performance instability on multi-cores. We think that the notion of speedup, usually used to summarize the performance of a code with a single number, must be analyzed carefully with a rigorous statistical method, as explained in this book. In addition, the performance of general-purpose codes is sensitive to the input data themselves, whereas regular scientific codes are mainly sensitive to the size of the input data. The chosen input data may favor one execution path among others, so the notion of a single speedup per program becomes questionable.
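A small numerical sketch shows why one aggregate speedup can mislead. The execution times below are hypothetical, invented for illustration: the optimization doubles the speed on three inputs but slows down a fourth, and the single ratio-of-totals number reflects neither behavior.

```python
import statistics

# Hypothetical execution times (seconds) of the same program,
# before and after an optimization, over four distinct input data sets.
before = [10.0, 10.2, 9.8, 30.0]
after = [5.0, 5.1, 4.9, 31.0]

overall = sum(before) / sum(after)                 # one aggregate number
per_input = [b / a for b, a in zip(before, after)]  # speedup per input
median_speedup = statistics.median(per_input)
```

Here `overall` is about 1.30, the median per-input speedup is 2.0, and one input actually slows down: reporting any single one of these numbers alone hides what a rigorous statistical summary would expose.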

C.2. Summary

We present here our impressions after a long-term research effort in backend code optimization. As a research methodology, we favored formal computer science when the objective to optimize was clearly defined at the architectural level. For instance, the number of registers, ILP and code size are clearly defined objectives, which can be observed in the final code. We believe that optimizing for architecturally visible objectives should be formal because: (1) the architecture of a processor does not change quickly, which allows more time for investing in fundamental studies; (2) formal research allows us to make connections with other computer science areas and benefit from their vision (algorithmic theory, complexity, discrete applied mathematics and combinatorial optimization); (3) formal results in code optimization allow us to verify the correctness of the generated code; and (4) investing in a compilation step is a hard and costly effort, so it should rest on a strong foundation.

As for architecturally visible objectives, we showed how to efficiently tackle the phase ordering problem between register optimization and instruction scheduling. We demonstrated that it is better to first satisfy register constraints, guaranteeing the absence of spilling before instruction scheduling. Our processor model is general enough to cover most existing superscalar, very long instruction word (VLIW) and explicitly parallel instruction computing (EPIC) processors. We provided theorems to understand the problem and designed efficient heuristics. Our methods have been implemented and tested as stand-alone tools inside a real compiler, and we demonstrated that our register optimization methods improve the quality of the generated code. We also released our software as a package independent of any existing compiler, allowing its future integration inside code optimization or analysis tools.

Another architecturally visible objective is code size, more precisely the relationship between the loop unrolling factor, the number of allocated registers and the ILP. Our fundamental knowledge of the relationship between these three metrics allowed us to design an optimal (exponential but efficient in practice) algorithm that minimizes the unrolling factor without degrading ILP while guaranteeing the absence of spill code. This research result is primarily devoted to the embedded VLIW area. We then showed that code size and code performance are not necessarily antagonistic optimization objectives, and that a trade-off between code compaction and code performance is not always necessary.

Concerning optimizing compilation in general, we studied the problem of phase ordering. We proved that iterative compilation is not fundamentally better than static compilation. Considering the long compilation time consumed by iterative approaches, we believe that this time could instead be spent on more aggressive static approaches. First, static compilation does not favor a particular program input. Second, in static compilation, we can use abstract performance models that may help to design efficient phase ordering strategies in the future. Third, static code optimization methods can be combined with code verification to certify that the final generated codes are correct.

When optimizing for micro-architectural mechanisms, we favored practical research with a stress on experimental observations. The reason is that micro-architectures are too complex to model and may change quickly, so there is no time to invest in fundamental research. Furthermore, we think that a compiler backend should not be patched with ad hoc optimization methods that focus on a specific micro-architectural problem; such optimizations are more appropriate in separate tools for semiautomatic code optimization. For instance, we demonstrated that the memory disambiguation mechanisms in superscalar processors do not perform full memory address comparisons, and may sequentialize the execution of independent operations. To solve this problem, we designed an ad hoc load/store vectorization strategy. In another research effort devoted to VLIW processors, we showed how to combine data preloading and prefetching to optimize some irregular embedded codes. All these methods are efficient in practice because they optimize the interaction between the ILP and the micro-architecture.

As cache mechanisms have been a constant micro-architectural enhancement for a long time, we think that there is room for abstracting the problem in order to study it from the scheduling theory point of view. Ideally, the scheduling problem must consider variable memory instruction latencies: the latency of a memory instruction depends on the placement of its data inside the cache. Inversely, the placement of the data inside the cache depends on the schedule of the memory instructions. This cyclic relationship defines an interesting open problem for scheduling theory.
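The cyclic relationship can be made concrete with a toy fixed-point sketch. Everything below is an illustrative assumption (hit/miss latencies, line size, addresses, and a deliberately naive "miss-first" scheduler), not a proposed solution: latencies are derived from the schedule, the schedule is rebuilt from the latencies, and the loop iterates until the two agree.

```python
HIT, MISS = 1, 10  # assumed load latencies in cycles
LINE = 64          # assumed cache line size in bytes

def latencies(order, addr):
    """A load hits if an earlier load in the schedule touched its line."""
    seen, lat = set(), {}
    for name in order:
        line = addr[name] // LINE
        lat[name] = HIT if line in seen else MISS
        seen.add(line)
    return lat

def reschedule(loads, lat):
    """Naive heuristic: issue the long-latency (missing) loads first."""
    return sorted(loads, key=lambda n: -lat[n])

addr = {"A": 0, "B": 8, "C": 256}  # A and B share a cache line
order = ["A", "B", "C"]
for _ in range(10):  # iterate schedule <-> latencies to a fixed point
    lat = latencies(order, addr)
    new_order = reschedule(order, lat)
    if new_order == order:
        break
    order = new_order
```

Under these assumptions the iteration converges to the order A, C, B with B a cache hit; in general such an iteration need not converge, which is precisely what makes the scheduling problem interesting.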

The current advance of reconfigurable computing, and the possible future emergence of heterogeneous computing, would introduce complex instruction sets with rich semantics. This would put special stress on instruction selection in backend compilation, a problem that we did not tackle in this book. Indeed, the instruction selection heuristics implemented inside compilers are mainly based on syntactic pattern matching (rewriting rules), which cannot capture all the semantics of low-level instructions. We think that more sophisticated algorithm recognition methods must be used to build efficient automatic instruction selection phases.

Finally, readers must be aware that code optimization using sophisticated compilation methods is only one of the multiple processes needed for improving program performance. In order to observe a real code performance improvement, the user must combine multiple optimizing compilation methods while considering detailed hardware characteristics. In addition, he/she may also need to deal with the operating system and the machine workload (modify the configuration and tune some operating system parameters). Optimizing a code by compilation does not necessarily yield an observed performance improvement, and an observed performance improvement does not necessarily result from an applied code optimization method. Sometimes, applying a code optimization method transforms the program and alters its interaction with the hardware, producing an execution scenario completely different from the one assumed by the compiler.
