
Optimizing Compilers for Modern Architectures
A Dependence-based Approach
Key Features
- Offers a guide to the simple, practical algorithms and approaches that are most effective in real-world, high-performance microprocessor and parallel systems.
- Demonstrates each transformation in worked examples.
- Examines how two case study compilers implement the theories and practices described in each chapter.
- Presents the most complete treatment of memory hierarchy issues of any compiler text.
- Illustrates ordering relationships with dependence graphs throughout the book (a minimal illustrative sketch follows this list).
- Applies the techniques to a variety of languages, including Fortran 77, C, hardware description languages, Fortran 90, and High Performance Fortran.
- Provides extensive references to the most sophisticated algorithms known in research.
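The unifying idea behind the book, data dependence, is easy to glimpse in miniature. The C sketch below is our illustration, not an excerpt from the text; the function names `recurrence` and `independent` are invented for this example. The first loop carries a true (flow) dependence from each iteration to the next, with distance vector (1) and direction vector (<), so it cannot be vectorized or parallelized as written; the second loop carries no dependence, so its iterations may execute in any order.

```c
#include <stdio.h>

/* Loop-carried (true/flow) dependence: iteration i reads a[i-1],
 * which iteration i-1 wrote.  Distance vector (1), direction (<).
 * The loop must run sequentially as written. */
void recurrence(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0;
}

/* No loop-carried dependence: each iteration touches a distinct
 * element of a, so all iterations may run in parallel or be
 * executed as a single vector operation. */
void independent(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + 1.0;
}

int main(void) {
    double a[8] = {0}, b[8] = {0};
    recurrence(a, 8);   /* a becomes 0, 1, 2, ..., 7 */
    independent(a, b, 8);
    printf("%f\n", a[7]);
    return 0;
}
```

Detecting such dependences precisely (Chapter 3) and transforming loops to remove or respect them (Chapters 5 through 9) is the program of the book.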
Table of Contents
Preface
Chapter 1 - Compiler Challenges for High-Performance Architectures
1.1 Overview and Goals
1.2 Pipelining
1.2.1 Pipelined Instruction Units
1.2.2 Pipelined Execution Units
1.2.3 Parallel Functional Units
1.2.4 Compiling for Scalar Pipelines
1.3 Vector Instructions
1.3.1 Vector Hardware Overview
1.3.2 Compiling for Vector Pipelines
1.4 Superscalar and VLIW Processors
1.4.1 Multiple-Issue Instruction Units
1.4.2 Compiling for Multiple-Issue Processors
1.5 Processor Parallelism
1.5.1 Overview of Processor Parallelism
1.5.2 Compiling for Asynchronous Parallelism
1.6 Memory Hierarchy
1.6.1 Overview of Memory Systems
1.6.2 Compiling for Memory Hierarchy
1.7 A Case Study: Matrix Multiplication
1.8 Advanced Compiler Technology
1.8.1 Dependence
1.8.2 Transformations
1.9 Summary
1.10 Case Studies
1.11 Historical Comments and References
Exercises
Chapter 2 - Dependence: Theory and Practice
2.1 Introduction
2.2 Dependence and Its Properties
2.2.1 Load-Store Classification
2.2.2 Dependence in Loops
2.2.3 Dependence and Transformations
2.2.4 Distance and Direction Vectors
2.2.5 Loop-Carried and Loop-Independent Dependences
2.3 Simple Dependence Testing
2.4 Parallelization and Vectorization
2.4.1 Parallelization
2.4.2 Vectorization
2.4.3 An Advanced Vectorization Algorithm
2.5 Summary
2.6 Case Studies
2.7 Historical Comments and References
Exercises
Chapter 3 - Dependence Testing
3.1 Introduction
3.1.1 Background and Terminology
3.2 Dependence Testing Overview
3.2.1 Subscript Partitioning
3.2.2 Merging Direction Vectors
3.3 Single-Subscript Dependence Tests
3.3.1 ZIV Test
3.3.2 SIV Tests
3.3.3 Multiple Induction-Variable Tests
3.4 Testing in Coupled Groups
3.4.1 The Delta Test
3.4.2 More Powerful Multiple-Subscript Tests
3.5 An Empirical Study
3.6 Putting It All Together
3.7 Summary
3.8 Case Studies
3.9 Historical Comments and References
Exercises
Chapter 4 - Preliminary Transformations
4.1 Introduction
4.2 Information Requirements
4.3 Loop Normalization
4.4 Data Flow Analysis
4.4.1 Definition-Use Chains
4.4.2 Dead Code Elimination
4.4.3 Constant Propagation
4.4.4 Static Single-Assignment Form
4.5 Induction-Variable Exposure
4.5.1 Forward Expression Substitution
4.5.2 Induction-Variable Substitution
4.5.3 Driving the Substitution Process
4.6 Summary
4.7 Case Studies
4.8 Historical Comments and References
Exercises
Chapter 5 - Enhancing Fine-Grained Parallelism
5.1 Introduction
5.2 Loop Interchange
5.2.1 Safety of Loop Interchange
5.2.2 Profitability of Loop Interchange
5.2.3 Loop Interchange and Vectorization
5.3 Scalar Expansion
5.4 Scalar and Array Renaming
5.5 Node Splitting
5.6 Recognition of Reductions
5.7 Index-Set Splitting
5.7.1 Threshold Analysis
5.7.2 Loop Peeling
5.7.3 Section-Based Splitting
5.8 Run-Time Symbolic Resolution
5.9 Loop Skewing
5.10 Putting It All Together
5.11 Complications of Real Machines
5.12 Summary
5.13 Case Studies
5.13.1 PFC
5.13.2 Ardent Titan Compiler
5.13.3 Vectorization Performance
5.14 Historical Comments and References
Exercises
Chapter 6 - Creating Coarse-Grained Parallelism
6.1 Introduction
6.2 Single-Loop Methods
6.2.1 Privatization
6.2.2 Loop Distribution
6.2.3 Alignment
6.2.4 Code Replication
6.2.5 Loop Fusion
6.3 Perfect Loop Nests
6.3.1 Loop Interchange for Parallelization
6.3.2 Loop Selection
6.3.3 Loop Reversal
6.3.4 Loop Skewing for Parallelization
6.3.5 Unimodular Transformations
6.3.6 Profitability-Based Parallelization Methods
6.4 Imperfectly Nested Loops
6.4.1 Multilevel Loop Fusion
6.4.2 A Parallel Code Generation Algorithm
6.5 An Extended Example
6.6 Packaging of Parallelism
6.6.1 Strip Mining
6.6.2 Pipeline Parallelism
6.6.3 Scheduling Parallel Work
6.6.4 Guided Self-Scheduling
6.7 Summary
6.8 Case Studies
6.8.1 PFC and ParaScope
6.8.2 Ardent Titan Compiler
6.9 Historical Comments and References
Exercises
Chapter 7 - Handling Control Flow
7.1 Introduction
7.2 If-Conversion
7.2.1 Definition
7.2.2 Branch Classification
7.2.3 Forward Branches
7.2.4 Exit Branches
7.2.5 Backward Branches
7.2.6 Complete Forward Branch Removal
7.2.7 Simplification
7.2.8 Iterative Dependences
7.2.9 If-Reconstruction
7.3 Control Dependence
7.3.1 Constructing Control Dependence
7.3.2 Control Dependence in Loops
7.3.3 An Execution Model for Control Dependences
7.3.4 Application of Control Dependence to Parallelization
7.4 Summary
7.5 Case Studies
7.6 Historical Comments and References
Exercises
Chapter 8 - Improving Register Usage
8.1 Introduction
8.2 Scalar Register Allocation
8.2.1 Data Dependence for Register Reuse
8.2.2 Loop-Carried and Loop-Independent Reuse
8.2.3 A Register Allocation Example
8.3 Scalar Replacement
8.3.1 Pruning the Dependence Graph
8.3.2 Simple Replacement
8.3.3 Handling Loop-Carried Dependences
8.3.4 Dependences Spanning Multiple Iterations
8.3.5 Eliminating Scalar Copies
8.3.6 Moderating Register Pressure
8.3.7 Scalar Replacement Algorithm
8.3.8 Experimental Data
8.4 Unroll-and-Jam
8.4.1 Legality of Unroll-and-Jam
8.4.2 Unroll-and-Jam Algorithm
8.4.3 Effectiveness of Unroll-and-Jam
8.5 Loop Interchange for Register Reuse
8.5.1 Considerations for Loop Interchange
8.5.2 Loop Interchange Algorithm
8.6 Loop Fusion for Register Reuse
8.6.1 Profitable Loop Fusion for Reuse
8.6.2 Loop Alignment for Fusion
8.6.3 Fusion Mechanics
8.6.4 A Weighted Loop Fusion Algorithm
8.6.5 Multilevel Loop Fusion for Register Reuse
8.7 Putting It All Together
8.7.1 Ordering the Transformations
8.7.2 An Example: Matrix Multiplication
8.8 Complex Loop Nests
8.8.1 Loops with If Statements
8.8.2 Trapezoidal Loops
8.9 Summary
8.10 Case Studies
8.11 Historical Comments and References
Exercises
Chapter 9 - Managing Cache
9.1 Introduction
9.2 Loop Interchange for Spatial Locality
9.3 Blocking
9.3.1 Unaligned Data
9.3.2 Legality of Blocking
9.3.3 Profitability of Blocking
9.3.4 A Simple Blocking Algorithm
9.3.5 Blocking with Skewing
9.3.6 Fusion and Alignment
9.3.7 Blocking in Combination with Other Transformations
9.3.8 Effectiveness
9.4 Cache Management in Complex Loop Nests
9.4.1 Triangular Cache Blocking
9.4.2 Special-Purpose Transformations
9.5 Software Prefetching
9.5.1 A Software Prefetching Algorithm
9.5.2 Effectiveness of Software Prefetching
9.6 Summary
9.7 Case Studies
9.8 Historical Comments and References
Exercises
Chapter 10 - Scheduling
10.1 Introduction
10.2 Instruction Scheduling
10.2.1 Machine Model
10.2.2 Straight-Line Graph Scheduling
10.2.3 List Scheduling
10.2.4 Trace Scheduling
10.2.5 Scheduling in Loops
10.3 Vector Unit Scheduling
10.3.1 Chaining
10.3.2 Coprocessors
10.4 Summary
10.5 Case Studies
10.6 Historical Comments and References
Exercises
Chapter 11 - Interprocedural Analysis and Optimization
11.1 Introduction
11.2 Interprocedural Analysis
11.2.1 Interprocedural Problems
11.2.2 Interprocedural Problem Classification
11.2.3 Flow-Insensitive Side Effect Analysis
11.2.4 Flow-Insensitive Alias Analysis
11.2.5 Constant Propagation
11.2.6 Kill Analysis
11.2.7 Symbolic Analysis
11.2.8 Array Section Analysis
11.2.9 Call Graph Construction
11.3 Interprocedural Optimization
11.3.1 Inline Substitution
11.3.2 Procedure Cloning
11.3.3 Hybrid Optimizations
11.4 Managing Whole-Program Compilation
11.5 Summary
11.6 Case Studies
11.7 Historical Comments and References
Exercises
Chapter 12 - Dependence in C and Hardware Design
12.1 Introduction
12.2 Optimizing C
12.2.1 Pointers
12.2.2 Naming and Structures
12.2.3 Loops
12.2.4 Scoping and Statics
12.2.5 Dialect
12.2.6 Miscellaneous
12.3 Hardware Design
12.3.1 Hardware Description Languages
12.3.2 Optimizing Simulation
12.3.3 Synthesis Optimization
12.4 Summary
12.5 Case Studies
12.6 Historical Comments and References
Exercises
Chapter 13 - Compiling Array Assignments
13.1 Introduction
13.2 Simple Scalarization
13.3 Scalarization Transformations
13.3.1 Loop Reversal
13.3.2 Input Prefetching
13.3.3 Loop Splitting
13.4 Multidimensional Scalarization
13.4.1 Simple Scalarization in Multiple Dimensions
13.4.2 Outer Loop Prefetching
13.4.3 Loop Interchange for Scalarization
13.4.4 General Multidimensional Scalarization
13.4.5 A Scalarization Example
13.5 Considerations for Vector Machines
13.6 Postscalarization Interchange and Fusion
13.7 Summary
13.8 Case Studies
13.9 Historical Comments and References
Exercises
Chapter 14 - Compiling High Performance Fortran
14.1 Introduction
14.2 HPF Compiler Overview
14.3 Basic Loop Compilation
14.3.1 Distribution Propagation and Analysis
14.3.2 Iteration Partitioning
14.3.3 Communication Generation
14.4 Optimization
14.4.1 Communication Vectorization
14.4.2 Overlapping Communication and Computation
14.4.3 Alignment and Replication
14.4.4 Pipelining
14.4.5 Identification of Common Recurrences
14.4.6 Storage Management
14.4.7 Handling Multiple Dimensions
14.5 Interprocedural Optimization for HPF
14.6 Summary
14.7 Case Studies
14.8 Historical Comments and References
Exercises
Appendix - Fundamentals of Fortran 90
Introduction
Lexical Properties
Array Assignment
Library Functions
Further Reading
References
Index
Product details
- No. of pages: 790
- Language: English
- Copyright: © Morgan Kaufmann 2001
- Published: September 26, 2001
- Imprint: Morgan Kaufmann
- Hardcover ISBN: 9781558602861
About the Authors
Randy Allen
Randy Allen received his A.B. summa cum laude in chemistry from Harvard University and his M.A. and Ph.D. in mathematical sciences from Rice University. After a research fellowship at Rice, Dr. Allen entered the practical world of industrial compiler construction. His career has spanned research, advanced development, and management at Ardent Computers, Sun Microsystems, Chronologic Simulation, Synopsys, and CynApps. He has authored or coauthored 15 conference and journal papers on compiler optimization, restructuring compilers, and hardware simulation, and has served on program committees for Supercomputing and the Conference on Programming Language Design and Implementation. Dr. Allen is CEO and President of Catalytic Compilers.
Ken Kennedy
Professor Kennedy has published over 150 technical articles and supervised 34 Ph.D. dissertations on programming support software for high-performance computer systems. In recognition of his contributions to software for high-performance computation, he received the 1995 W. Wallace McDowell Award, the highest research award of the IEEE Computer Society. In 1999, he was named the third recipient of the ACM SIGPLAN Programming Languages Achievement Award.