Optimizing Compilers for Modern Architectures
A Dependence-based ApproachBy
- Randy Allen, CEO and President of Catalytic Compilers
- Ken Kennedy, Rice University
Modern computer architectures designed with high-performance microprocessors offer tremendous potential gains in performance over previous designs. Yet their very complexity makes it increasingly difficult to produce efficient code and to realize their full potential. This landmark text from two leaders in the field focuses on the pivotal role that compilers can play in addressing this critical issue.
The basis for all the methods presented in this book is data dependence, a fundamental compiler analysis tool for optimizing programs on high-performance microprocessors and parallel architectures. It enables compiler designers to write compilers that automatically transform simple, sequential programs into forms that can exploit special features of these modern architectures.
The text provides a broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling. The authors demonstrate the importance and wide applicability of dependence-based compiler optimizations and give the compiler writer the basics needed to understand and implement them. They also offer cookbook explanations for transforming applications by hand to computational scientists and engineers who are driven to obtain the best possible performance of their complex applications.The approaches presented are based on research conducted over the past two decades, emphasizing the strategies implemented in research prototypes at Rice University and in several associated commercial systems. Randy Allen and Ken Kennedy have provided an indispensable resource for researchers, practicing professionals, and graduate students engaged in designing and optimizing compilers for modern computer architectures.
Hardbound, 790 Pages
Published: September 2001
Imprint: Morgan Kaufmann
"Compilers are the Queen of Computing Science and Technology. They have long been the bridge from applications to systems, but now they determine which architectural features should be implemented in new hardware, as well as which new language features will be effective for software developers. The authors write from great experience as innovators and developers of the field. This book is a very comprehensive treatment of optimization for cache management, vectorization, parallelization, and more. The title refers to Modern Architectures and indeed the subject matter is applicable from desktop systems to the world's fastest supercomputers. The examples are drawn from Fortran, but the theory applies to many programming languages. I think the book will serve as an excellent textbook as well as a much used reference for software developers." David Kuck, Intel " This book makes an extremely valuable contribution to the field of compilation by presenting the fundamental basics in compiling technology for high performance computing systems. The authors provide careful and thorough descriptions of the analyses, including data and control dependences and interprocedural analysis, and the code transformations that can be applied as a result of the analyses. The book covers a comprehensive range of important topics needed to compile for high performance systems. The organization and structure of the book as well as the clear writing style make it an excellent text book, highly valuable reference book and a useful guide for implementing the techniques." Mary Lou Soffa, University of Pittsburgh "The much awaited book by Randy Allen, a leading practitioner and Ken Kennedy, a pioneer in compiler research provides a skillful encapsulation of the results of more than 30 years of research and development in restructuring compilers - a significant part of which was done by the authors. The combination of staged introduction of each topic with the aid of examples and the detailed algorithmic layout of each optimization make this text an outstanding reference for the expert as well as for new students of the topic. This book constitutes yet the most complete and rich text of compiler optimization fundamentals and algorithms, an invaluable resource for researchers, educators and compiler developer." Constantine Polychronopoulos, University of Illinois Urbana-Champaign "Kennedy and Allen take a unique approach in this book. They focus on how compilation techniques work together to make practical program analysis and optimization algorithms for achieving good performance on parallel machines, whereas previous texts focus on the specific techniques. Every compiler writer should have a copy of this insightful and lively book in their library!" Kathryn S McKinley, University of Texas at Austin "Dependence analysis is at the core of a huge class of program transformations and optimizations, including cache management, exploiting parallelism, and many many others. The authors have provided information that is essential to practicing professionals in the area of high-performance computer architecture. An indispensable reference." Rohit Chandra, NARUS Inc.
- PrefaceChapter 1 - Compiler Challenges for High-Performance Architectures1.1 Overview and Goals 1.2 Pipelining 1.2.1 Pipelined Instruction Units 1.2.2 Pipelined Execution Units 1.2.3 Parallel Functional Units 1.2.4 Compiling for Scalar Pipelines1.3 Vector Instructions 1.3.1 Vector Hardware Overview 1.3.2 Compiling for Vector Pipelines1.4 Superscalar and VLIW Processors 1.4.1 Multiple-Issue Instruction Units 1.4.2 Compiling for Multiple-Issue Processors1.5 Processor Parallelism 1.5.1 Overview of Processor Parallelism 1.5.2 Compiling for Asynchronous Parallelism1.6 Memory Hierarchy 1.6.1 Overview of Memory Systems 1.6.2 Compiling for Memory Hierarchy1.7 A Case Study: Matrix Multiplication 1.8 Advanced Compiler Technology 1.8.1 Dependence 1.8.2 Transformations1.9 Summary 1.10 Case Studies 1.11 Historical Comments and References ExercisesChapter 2 - Dependence: Theory and Practice2.1 Introduction 2.2 Dependence and Its Properties 2.2.1 Load-Store Classification 2.2.2 Dependence in Loops 2.2.3 Dependence and Transformations 2.2.4 Distance and Direction Vectors 2.2.5 Loop-Carried and Loop-Independent Dependences2.3 Simple Dependence Testing 2.4 Parallelization and Vectorization 2.4.1 Parallelization 2.4.2 Vectorization 2.4.3 An Advanced Vectorization Algorithm2.5 Summary 2.6 Case Studies 2.7 Historical Comments and References ExercisesChapter 3 - Dependence Testing 3.1 Introduction 3.1.1 Background and Terminology3.2 Dependence Testing Overview 3.2.1 Subscript Partitioning 3.2.2 Merging Direction Vectors3.3 Single-Subscript Dependence Tests 3.3.1 ZIV Test 3.3.2 SIV Tests 3.3.3 Multiple Induction-Variable Tests3.4 Testing in Coupled Groups 3.4.1 The Delta Test 3.4.2 More Powerful Multiple-Subscript Tests3.5 An Empirical Study 3.6 Putting It All Together 3.7 Summary 3.8 Case Studies 3.9 Historical Comments and References ExercisesChapter 4 - Preliminary Transformations 4.1 Introduction 4.2 Information Requirements 4.3 Loop Normalization 4.4 Data Flow Analysis 4.4.1 Definition-Use Chains 4.4.2 Dead Code Elimination 4.4.3 Constant Propagation 4.4.4 Static Single-Assignment Form4.5 Induction-Variable Exposure 4.5.1 Forward Expression Substitution 4.5.2 Induction-Variable Substitution 4.5.3 Driving the Substitution Process4.6 Summary 4.7 Case Studies 4.8 Historical Comments and References ExercisesChapter 5 - Enhancing Fine-Grained Parallelism 5.1 Introduction 5.2 Loop Interchange 5.2.1 Safety of Loop Interchange 5.2.2 Profitability of Loop Interchange 5.2.3 Loop Interchange and Vectorization5.3 Scalar Expansion 5.4 Scalar and Array Renaming 5.5 Node Splitting 5.6 Recognition of Reductions 5.7 Index-Set Splitting 5.7.1 Threshold Analysis 5.7.2 Loop Peeling 5.7.3 Section-Based Splitting5.8 Run-Time Symbolic Resolution 5.9 Loop Skewing 5.10 Putting It All Together 5.11 Complications of Real Machines 5.12 Summary 5.13 Case Studies 5.13.1 PFC 5.13.2 Ardent Titan Compiler 5.13.3 Vectorization Performance5.14 Historical Comments and References ExercisesChapter 6 - Creating Coarse-Grained Parallelism6.1 Introduction 6.2 Single-Loop Methods 6.2.1 Privatization 6.2.2 Loop Distribution 6.2.3 Alignment 6.2.4 Code Replication 6.2.5 Loop Fusion6.3 Perfect Loop Nests 6.3.1 Loop Interchange for Parallelization 6.3.2 Loop Selection6.3.3 Loop Reversal 6.3.4 Loop Skewing for Parallelization 6.3.5 Unimodular Transformations 6.3.6 Profitability-Based Parallelization Methods6.4 Imperfectly Nested Loops 6.4.1 Multilevel Loop Fusion 6.4.2 A Parallel Code Generation Algorithm6.5 An Extended Example 6.6 Packaging of Parallelism 6.6.1 Strip Mining 6.6.2 Pipeline Parallelism 6.6.3 Scheduling Parallel Work 6.6.4 Guided Self-Scheduling6.7 Summary 6.8 Case Studies 6.8.1 PFC and ParaScope 6.8.2 Ardent Titan Compiler6.9 Historical Comments and References ExercisesChapter 7 - Handling Control Flow 7.1 Introduction 7.2 If-Conversion 7.2.1 Definition 7.2.2 Branch Classification 7.2.3 Forward Branches 7.2.4 Exit Branches 7.2.5 Backward Branches 7.2.6 Complete Forward Branch Removal 7.2.7 Simplification 7.2.8 Iterative Dependences 7.2.9 If-Reconstruction7.3 Control Dependence 7.3.1 Constructing Control Dependence 7.3.2 Control Dependence in Loops 7.3.3 An Execution Model for Control Dependences 7.3.4 Application of Control Dependence to Parallelization7.4 Summary 7.5 Case Studies 7.6 Historical Comments and References ExercisesChapter 8 - Improving Register Usage 8.1 Introduction 8.2 Scalar Register Allocation 8.2.1 Data Dependence for Register Reuse 8.2.2 Loop-Carried and Loop-Independent Reuse 8.2.3 A Register Allocation Example8.3 Scalar Replacement 8.3.1 Pruning the Dependence Graph 8.3.2 Simple Replacement 8.3.3 Handling Loop-Carried Dependences 8.3.4 Dependences Spanning Multiple Iterations 8.3.5 Eliminating Scalar Copies 8.3.6 Moderating Register Pressure 8.3.7 Scalar Replacement Algorithm 8.3.8 Experimental Data8.4 Unroll-and-Jam 8.4.1 Legality of Unroll-and-Jam 8.4.2 Unroll-and-Jam Algorithm 8.4.3 Effectiveness of Unroll-and-Jam8.5 Loop Interchange for Register Reuse 8.5.1 Considerations for Loop Interchange 8.5.2 Loop Interchange Algorithm8.6 Loop Fusion for Register Reuse 8.6.1 Profitable Loop Fusion for Reuse 8.6.2 Loop Alignment for Fusion 8.6.3 Fusion Mechanics 8.6.4 A Weighted Loop Fusion Algorithm 8.6.5 Multilevel Loop Fusion for Register Reuse8.7 Putting It All Together 8.7.1 Ordering the Transformations 8.7.2 An Example: Matrix Multiplication8.8 Complex Loop Nests 8.8.1 Loops with If Statements 8.8.2 Trapezoidal Loops8.9 Summary 8.10 Case Studies 8.11 Historical Comments and References ExercisesChapter 9 - Managing Cache 9.1 Introduction 9.2 Loop Interchange for Spatial Locality 9.3 Blocking 9.3.1 Unaligned Data 9.3.2 Legality of Blocking 9.3.3 Profitability of Blocking 9.3.4 A Simple Blocking Algorithm 9.3.5 Blocking with Skewing 9.3.6 Fusion and Alignment 9.3.7 Blocking in Combination with Other Transformations 9.3.8 Effectiveness9.4 Cache Management in Complex Loop Nests 9.4.1 Triangular Cache Blocking 9.4.2 Special-Purpose Transformations9.5 Software Prefetching 9.5.1 A Software Prefetching Algorithm 9.5.2 Effectiveness of Software Prefetching9.6 Summary 9.7 Case Studies 9.8 Historical Comments and References ExercisesChapter 10 - Scheduling 10.1 Introduction 10.2 Instruction Scheduling 10.2.1 Machine Model 10.2.2 Straight-Line Graph Scheduling 10.2.3 List Scheduling 10.2.4 Trace Scheduling 10.2.5 Scheduling in Loops10.3 Vector Unit Scheduling 10.3.1 Chaining 10.3.2 Coprocessors10.4 Summary 10.5 Case Studies 10.6 Historical Comments and References ExercisesChapter 11 - Interprocedural Analysis and Optimization 11.1 Introduction 11.2 Interprocedural Analysis 11.2.1 Interprocedural Problems 11.2.2 Interprocedural Problem Classification 11.2.3 Flow-Insensitive Side Effect Analysis 11.2.4 Flow-Insensitive Alias Analysis 11.2.5 Constant Propagation 11.2.6 Kill Analysis 11.2.7 Symbolic Analysis 11.2.8 Array Section Analysis 11.2.9 Call Graph Construction11.3 Interprocedural Optimization 11.3.1 Inline Substitution 11.3.2 Procedure Cloning 11.3.3 Hybrid Optimizations11.4 Managing Whole-Program Compilation 11.5 Summary 11.6 Case Studies 11.7 Historical Comments and References ExercisesChapter 12 - Dependence in C and Hardware Design 12.1 Introduction 12.2 Optimizing C 12.2.1 Pointers 12.2.2 Naming and Structures 12.2.3 Loops 12.2.4 Scoping and Statics 12.2.5 Dialect 12.2.6 Miscellaneous12.3 Hardware Design 12.3.1 Hardware Description Languages 12.3.2 Optimizing Simulation 12.3.3 Synthesis Optimization12.4 Summary 12.5 Case Studies 12.6 Historical Comments and References ExercisesChapter 13 - Compiling Array Assignments 13.1 Introduction 13.2 Simple Scalarization 13.3 Scalarization Transformations 13.3.1 Loop Reversal 13.3.2 Input Prefetching 13.3.3 Loop Splitting13.4 Multidimensional Scalarization 13.4.1 Simple Scalarization in Multiple Dimensions 13.4.2 Outer Loop Prefetching 13.4.3 Loop Interchange for Scalarization 13.4.4 General Multidimensional Scalarization 13.4.5 A Scalarization Example13.5 Considerations for Vector Machines 13.6 Postscalarization Interchange and Fusion 13.7 Summary 13.8 Case Studies 13.9 Historical Comments and References ExercisesChapter 14 - Compiling High Performance Fortran 14.1 Introduction 14.2 HPF Compiler Overview 14.3 Basic Loop Compilation 14.3.1 Distribution Propagation and Analysis 14.3.2 Iteration Partitioning 14.3.3 Communication Generation14.4 Optimization 14.4.1 Communication Vectorization 14.4.2 Overlapping Communication and Computation 14.4.3 Alignment and Replication 14.4.4 Pipelining 14.4.5 Identification of Common Recurrences 14.4.6 Storage Management 14.4.7 Handling Multiple Dimensions14.5 Interprocedural Optimization for HPF 14.6 Summary 14.7 Case Studies 14.8 Historical Comments and References ExercisesAppendix - Fundamentals of Fortran 90 Introduction Lexical Properties Array Assignment Library Functions Further ReadingReferences Index