Home Page Icon
Home Page
Table of Contents for
Lists of Figures, Tables and Algorithms
Close
Lists of Figures, Tables and Algorithms
by Sid Touati, Benoit de Dinechin
Advanced Backend Optimization
Cover
Contents
Title Page
Copyright
Introduction
I.1. Inside this book
I.2. Other contributors
I.3. Basics on instruction-level parallelism processor architectures
Part 1: Prolog: Optimizing Compilation
1 On the Decidability of Phase Ordering in Optimizing Compilation
1.1. Introduction to the Phase Ordering Problem
1.2. Background on Phase Ordering
1.3. Toward a Theoretical Model for the Phase Ordering Problem
1.4. Examples of Decidable Simplified Cases
1.5. Compiler Optimization Parameter Space Exploration
1.6. Conclusion on Phase Ordering in Optimizing Compilation
Part 2: Instruction Scheduling
2 Instruction Scheduling Problems and Overview
2.1. Vliw Instruction Scheduling Problems
2.2. Software Pipelining
2.3. Instruction Scheduling and Register Allocation
3 Applications of Machine Scheduling to Instruction Scheduling
3.1. Advances in Machine Scheduling
3.2. List Scheduling Algorithms
3.3. Time-Indexed Scheduling Problem Formulations
4 Instruction Scheduling Before Register Allocation
4.1. Instruction Scheduling for an Ilp Processor: Case of a Vliw Architecture
4.2. Large Neighborhood Search for the Resource-Constrained Modulo Scheduling Problem
4.3. Resource-Constrained Modulo Scheduling Problem
4.4. Time-Indexed Integer Programming Formulations
4.5. Large Neighborhood Search Heuristic
4.6. Summary and Conclusions
5 Instruction Scheduling After Register Allocation
5.1. Introduction
5.2. Local Instruction Scheduling
5.3. Global Instruction Scheduling
5.4. Experimental Results
5.5. Conclusions
6 Dealing in Practice With Memory Hierarchy Effects and Instruction Level Parallelism
6.1. The Problem of Hardware Memory Disambiguation at Runtime
6.2. Data Preloading and Prefetching
Part 3: Register Optimization
7 The Register Need of a Fixed Instruction Schedule
7.1. Data Dependence Graph and Processor Model for Register Optimization
7.2. The Acyclic Register Need
7.3. The Periodic Register Need
7.4. Computing the Periodic Register Need
7.5. Some Theoretical Results on the Periodic Register Need
7.6. Conclusion on the Register Requirement
8 The Register Saturation
8.1. Motivations on the Register Saturation Concept
8.2. Computing the Acyclic Register Saturation
8.3. Computing the Periodic Register Saturation
8.4. Conclusion on the Register Saturation
9 Spill Code Reduction
9.1. Introduction to Register Constraints in Software Pipelining
9.2. Related Work in Periodic Register Allocation
9.3. Sira: Schedule Independant Register Allocation
9.4. Siralina: An Efficient Polynomial Heuristic for Sira
9.5. Experimental Results With Sira
9.6. Conclusion on Spill Code Reduction
10 Exploiting the Register Access Delays Before Instruction Scheduling
10.1. Introduction
10.2. Problem Description of Ddg Circuits With Non-Positive Distances
10.3. Necessary and Sufficient Condition to Avoid Non-Positive Circuits
10.4. Application to the Sira Framework
10.5. Experimental Results on Eliminating Non-Positive Circuits
10.6. Conclusion on Non-Positive Circuit Elimination
11 Loop Unrolling Degree Minimization for Periodic Register Allocation
11.1. Introduction
11.2. Background
11.3. Problem Description of Unroll Factor Minimization for Unscheduled Loops
11.4. Algorithmic Solution for Unroll Factor Minimization: Single Register Type
11.5. Unroll Factor Minimization in the Presence of Multiple Register Types
11.6. Unroll Factor Reduction for Already Scheduled Loops
11.7. Experimental Results
11.8. Related Work
11.9. Conclusion on Loop Unroll Degree Minimization
Part 4: Epilog: Performance, Open Problems
12 Statistical Performance Analysis: The Speedup-Test Protocol
12.1. Code Performance Variation
12.2. Background and Notations
12.3. Analyzing the Statistical Significance of the Observed Speedups
12.4. The Speedup-Test Software
12.5. Evaluating the Proportion of Accelerated Benchmarks by a Confidence Interval
12.6. Experiments and Applications
12.7. Related Work
12.8. Discussion and Conclusion on the Speedup-Test Protocol
Conclusion
Appendix 1: Presentation of the Benchmarks Used in Our Experiments
A1.1. Qualitative benchmarks presentation
A1.2. Quantitative benchmarks presentation
A1.3. Changing the architectural configuration of the processor
Appendix 2: Register Saturation Computation on Stand-Alone Ddg
A2.1. The cyclic register saturation
A2.2. The periodic register saturation
Appendix 3: Efficiency Of Sira on the Benchmarks
A3.1. Efficiency of SIRALINA on stand-alone DDG
A3.2. Efficiency of SIRALINA plugged with an industrial compiler
Appendix 4: Efficiency of Non-Positive Circuit Elimination in the Sira Framework
A4.1. Experimental setup
A4.2. Comparison between the heuristics execution times
A4.3. Convergence of the proactive heuristic (iterative SIRALINA)
A4.4. Qualitative analysis of the heuristics
A4.5. Conclusion on non-positive circuit elimination strategy
Appendix 5: Loop Unroll Degree Minimization: Experimental Results
A5.1. Stand-alone experiments with single register types
A5.2. Experiments with multiple register types
Appendix 6: Experimental Efficiency of Software Data Preloading and Prefetching for Embedded Vliw
Appendix 7: Appendix of the Speedup-Test Protocol
A7.1. Why is the observed minimal execution time not necessarily a good statistical estimation of program performances?
A7.2. Hypothesis testing in statistical and probability theory
A7.3. What is a reasonable large sample? Observing the central limit theorem in practice
Bibliography
Lists of Figures, Tables and Algorithms
LIST OF FIGURES
LIST OF TABLES
LIST OF ALGORITHMS
Index
Search in book...
Toggle Font Controls
Playlists
Add To
Create new playlist
Name your new playlist
Playlist description (optional)
Cancel
Create playlist
Sign In
Email address
Password
Forgot Password?
Create account
Login
or
Continue with Facebook
Continue with Google
Sign Up
Full Name
Email address
Confirm Email Address
Password
Login
Create account
or
Continue with Facebook
Continue with Google
Prev
Previous Chapter
Bibliography
Next
Next Chapter
Index
Lists of Figures, Tables and Algorithms
LIST OF FIGURES
INTRODUCTION
I.1. Pipelined vs. simultaneous execution
I.2. Superscalar execution
I.3. Simple superscalar pipelined steps
I.4. VLIW processors
I.5. Block diagram of a TTA
CHAPTER 1
1.1. Classes of phase-ordering problems
1.2. Classes of best-parameters problems
CHAPTER 2
2.1. The ST220 cumulative resource availabilities and resource class requirements
2.2. Source code and the inner loop body code generator representation
2.3. The block scheduled loop body and the block schedule resource table
2.4. The software pipeline local schedule and the software pipeline resource table
CHAPTER 3
3.1. The time-indexed dependence inequalities of Christofides
et al.
[CHR 87]
CHAPTER 4
4.1. Original dependence graph
4.2. Augmented dependence graph
4.3. A reservation table, a regular reservation table and a reservation vector
4.4. Counted loop software pipeline construction with and without preconditioning
4.5. While-loop software pipeline construction without and with modulo expansion
4.6. The
st200cc R3.2
compiler performances on the HP benchmarks
4.7. Definition of the contributions
to the register pressure
4.8. Sample cyclic instruction scheduling problem
CHAPTER 5
5.1. Sample schedules for a two-resource scheduling problem (horizontal time)
5.2. Scoreboard Scheduling within the time window (window_size = 4)
5.3. Scoreboard Scheduling and moving the time window (window_size = 4)
5.4. Benchmark basic blocks and instruction scheduling results
5.5. Time breakdown for cycle scheduling and scoreboard scheduling
CHAPTER 6
6.1. Example of Alpha 21264 processor
6.2. Example of Power 4 processor
6.3. Cache behavior of Itanium 2 processor
6.4. Vectorization on Itanium 2
6.5. Stride patterns classification
CHAPTER 7
7.1. DAG example with acyclic register need
7.2. Periodic register need in software pipelining
7.3. Circular lifetime intervals
7.4. Relationship between the maximal clique and the width of a circular graph
7.5. Examples of DDG with unique possible killer per value
CHAPTER 8
8.1. DAG model
8.2. Valid killing function and bipartite decomposition
8.3. Example of computing the acyclic register saturation
CHAPTER 9
9.1. Example for SIRA and reuse graphs
CHAPTER 10
10.1. Linear program based on shortest paths equations (SPE)
CHAPTER
11.1. Minimal unroll factor computation depending on phase ordering
11.2. Example to highlight the short-comings of the MVE technique
11.3. SWP kernel unrolled with MVE
11.4. Example to explain the optimality of the meeting graph technique
11.5. Example for SIRA and reuse graphs
11.6. Graphical solution for the fixed loop unrolling problem
11.7. How to traverse the lattice
S
11.8. Modifying reuse graphs to minimize loop unrolling factor
11.9. Loop unrolling values in the search space
S
11.10. Example of loop unrolling reduction using meeting graph
11.11. The new search space
S
in the meeting graph
CHAPTER 12
12.1. Observed execution times of some SPEC OMP 2001 applications (compiled with
gcc
)
12.2. The Speedup-Test protocol for analyzing the average execution time
12.3. The Speedup-Test protocol for analyzing the median execution time
APPENDIX 1
A1.1. Histograms on the number of nodes (loop statements): ||
V
||
A1.2. Histograms on the number of statements writing inside general registers ||
V
R,GR
||
A1.3. Histograms on the number of statements writing inside branch registers ||
V
R,BR
||
A1.4. Histograms on the number of data dependences ||
E
||
A1.5. Histograms on MinII values
A1.6. Histograms on the number of strongly connected components
APPENDIX 2
A2.1. Accuracy of the GREEDY-K heuristic versus optimality
A2.2. Error ratios of the GREEDY-K heuristic versus optimality
A2.3. Execution times of the GREEDY-K heuristic
A2.4. Maximal periodic register need vs. initiation interval
A2.5. Periodic register saturation in unrolled loops
APPENDIX 3
A3.1. Percentage of DDG treated successfully by SIRALINA and the impact on the MII
A3.2. Average increase of the MII
A3.3. Boxplots of the execution times of SIRALINA (all DDG)
A3.4. Plugging SIRA into the ST231 compiler toolchain (LAO backend)
A3.5. The impact of SIRA on static code quality
A3.6. Loops where spill code disappears completely
A3.7. Speedups of the whole application using the standard input
A3.8. Performance characterization of some applications
A3.9. Performance characterization of the FFMPEG application
APPENDIX 4
A4.1. Execution times of UAL (in seconds)
A4.2. Execution times of CHECK (in seconds)
A4.3. Execution times of SPE (in seconds)
A4.3. Execution times of SPE (in seconds)
A4.4. Maximum observed number of iterations for SPE
A4.5. Comparison of the heuristics ability to reduce the register pressure (SPEC2000)
A4.6. Comparison of the heuristics ability to reduce the register pressure (MEDIABENCH)
A4.7. Comparison of the heuristics ability to reduce the register pressure (SPEC2006)
A4.8. Comparison of the heuristics ability to reduce the register pressure (FFMPEG)
APPENDIX 5
A5.1. Loop unrolling minimization experiments (random DDG, single register type)
A5.2. Average code compaction ratio (random DDG, single register type).
A5.3. Weighted harmonic mean for minimized loop unrolling degree
A5.4. Initial versus final loop unrolling in each configuration
A5.5. Observations on loop unrolling minimization
A5.6. Final loop unrolling factors after minimization
APPENDIX 6
A6.1. Execution time repartition for Spec benchmarks
A6.2. Efficiency of data prefetching and preloading. Note that prefetching is not applicable to all applications
A6.3. Initial and modified codes sizes
APPENDIX 7
A7.1. The sample minimum is a not necessarily a good estimation of the theoretical minimum
LIST OF TABLES
INTRODUCTION
I.1. Other contributors to the results presented in this book
CHAPTER 3
3.1. Polynomial-time solvable parallel machine scheduling problems
3.2.
NP
-hard parallel machine scheduling problems
3.3. Performance guarantees of the GLSA with arbitrary priority
3.4. Performance guarantees of the GLSA with a specific priority
3.5. Problems solved by the algorithm of Leung, Palem and Pnueli [LEU 01] steps 1 and 2
CHAPTER 6
6.1. Examples of measured performance degradation factors
6.2. Worst-case performance gain on Alpha 21264
6.3. Worst-case performance gain on Itanium 2
6.4. Examples of code and data regularity/irregularity
6.5. Examples of prefetch: simple case and using extra register case
CHAPTER 12
12.1. Number of non-statistically significant speedups in the tested benchmarks
APPENDIX 2
A2.1. Optimal versus approximate PRS
APPENDIX 5
A5.1. Machine with bounded number of registers
A5.2. Machine with bounded registers with option
continue
A5.3. Number of unrolled loops compared to the number of spilled loops resulted (by using meeting graphs)
A5.4. Arithmetic mean of initial loop unrolling, final loop unrolling and ratio
A5.5. Comparison between final loop unrolling factors and MAXLIVE
A5.6. Optimized loop unrolling factors of scheduled versus unscheduled loops
APPENDIX 7
A7.1. Monte Carlo simulation of a Gaussian distribution
A7.2. The two risk levels for hypothesis testing in statistical and probability theory
A7.3. SPEC OMP2001 on low-overhead environment
A7.4. SPECCPU2006 executed on low overhead environment
A7.5. SPEC OMP2001 on high-overhead environment
A7.6. SPECCPU2006 on high-overhead environment
LIST OF ALGORITHMS
CHAPTER 1
1.1. Computing a good compilation sequence in the compilation cost model
1.2. Optimize_Node(n)
CHAPTER 8
8.1. GREEDY-K heuristic
CHAPTER 10
10.1. The algorithm
IterativeSIRALINA
10.2. The function
UpdateReuseDistances
CHAPTER 11
11.1. Fixed loop unrolling problem
11.2. DIV_NEAR
11.3. DIVISORS
11.4. LCM-MIN algorithm
11.5. General fixed loop unrolling problem
Add Highlight
No Comment
..................Content has been hidden....................
You can't read the all page of ebook, please click
here
login for view all page.
Day Mode
Cloud Mode
Night Mode
Reset