COVID-19 Update: We are currently shipping orders daily. However, due to transit disruptions in some geographies, deliveries may be delayed. To provide all customers with timely access to content, we are offering 50% off Science and Technology Print & eBook bundle options. Terms & conditions.
Optimizing Compilers for Modern Architectures - 1st Edition - ISBN: 9781493303540, 9780080513249

Optimizing Compilers for Modern Architectures

1st Edition

A Dependence-based Approach

Authors: Randy Allen Ken Kennedy
Paperback ISBN: 9781493303540
Hardcover ISBN: 9781558602861
Imprint: Morgan Kaufmann
Published Date: 26th September 2001
Page Count: 790
Sales tax will be calculated at check-out Price includes VAT/GST

Institutional Subscription

Secure Checkout

Personal information is secured with SSL technology.

Free Shipping

Free global shipping
No minimum order.


Modern computer architectures designed with high-performance microprocessors offer tremendous potential gains in performance over previous designs. Yet their very complexity makes it increasingly difficult to produce efficient code and to realize their full potential. This landmark text from two leaders in the field focuses on the pivotal role that compilers can play in addressing this critical issue.

The basis for all the methods presented in this book is data dependence, a fundamental compiler analysis tool for optimizing programs on high-performance microprocessors and parallel architectures. It enables compiler designers to write compilers that automatically transform simple, sequential programs into forms that can exploit special features of these modern architectures.

The text provides a broad introduction to data dependence, to the many transformation strategies it supports, and to its applications to important optimization problems such as parallelization, compiler memory hierarchy management, and instruction scheduling. The authors demonstrate the importance and wide applicability of dependence-based compiler optimizations and give the compiler writer the basics needed to understand and implement them. They also offer cookbook explanations for transforming applications by hand to computational scientists and engineers who are driven to obtain the best possible performance of their complex applications.

The approaches presented are based on research conducted over the past two decades, emphasizing the strategies implemented in research prototypes at Rice University and in several associated commercial systems. Randy Allen and Ken Kennedy have provided an indispensable resource for researchers, practicing professionals, and graduate students engaged in designing and optimizing compilers for modern computer architectures.

Key Features

  • Offers a guide to the simple, practical algorithms and approaches that are most effective in real-world, high-performance microprocessor and parallel systems.
  • Demonstrates each transformation in worked examples.
  • Examines how two case study compilers implement the theories and practices described in each chapter.
  • Presents the most complete treatment of memory hierarchy issues of any compiler text.
  • Illustrates ordering relationships with dependence graphs throughout the book.
  • Applies the techniques to a variety of languages, including Fortran 77, C, hardware definition languages, Fortran 90, and High Performance Fortran.
  • Provides extensive references to the most sophisticated algorithms known in research.


Programmers, designers and developers of conventional and supercomputing. Postgraduate level compiler and parallel processing courses

Table of Contents


Chapter 1 - Compiler Challenges for High-Performance Architectures

1.1 Overview and Goals

1.2 Pipelining

1.2.1 Pipelined Instruction Units

1.2.2 Pipelined Execution Units

1.2.3 Parallel Functional Units

1.2.4 Compiling for Scalar Pipelines

1.3 Vector Instructions

1.3.1 Vector Hardware Overview

1.3.2 Compiling for Vector Pipelines

1.4 Superscalar and VLIW Processors

1.4.1 Multiple-Issue Instruction Units

1.4.2 Compiling for Multiple-Issue Processors

1.5 Processor Parallelism

1.5.1 Overview of Processor Parallelism

1.5.2 Compiling for Asynchronous Parallelism

1.6 Memory Hierarchy

1.6.1 Overview of Memory Systems

1.6.2 Compiling for Memory Hierarchy

1.7 A Case Study: Matrix Multiplication

1.8 Advanced Compiler Technology

1.8.1 Dependence

1.8.2 Transformations

1.9 Summary

1.10 Case Studies

1.11 Historical Comments and References


Chapter 2 - Dependence: Theory and Practice

2.1 Introduction

2.2 Dependence and Its Properties

2.2.1 Load-Store Classification

2.2.2 Dependence in Loops

2.2.3 Dependence and Transformations

2.2.4 Distance and Direction Vectors

2.2.5 Loop-Carried and Loop-Independent Dependences

2.3 Simple Dependence Testing

2.4 Parallelization and Vectorization

2.4.1 Parallelization

2.4.2 Vectorization

2.4.3 An Advanced Vectorization Algorithm

2.5 Summary

2.6 Case Studies

2.7 Historical Comments and References


Chapter 3 - Dependence Testing

3.1 Introduction

3.1.1 Background and Terminology

3.2 Dependence Testing Overview

3.2.1 Subscript Partitioning

3.2.2 Merging Direction Vectors

3.3 Single-Subscript Dependence Tests

3.3.1 ZIV Test

3.3.2 SIV Tests

3.3.3 Multiple Induction-Variable Tests

3.4 Testing in Coupled Groups

3.4.1 The Delta Test

3.4.2 More Powerful Multiple-Subscript Tests

3.5 An Empirical Study

3.6 Putting It All Together

3.7 Summary

3.8 Case Studies

3.9 Historical Comments and References


Chapter 4 - Preliminary Transformations

4.1 Introduction

4.2 Information Requirements

4.3 Loop Normalization

4.4 Data Flow Analysis

4.4.1 Definition-Use Chains

4.4.2 Dead Code Elimination

4.4.3 Constant Propagation

4.4.4 Static Single-Assignment Form

4.5 Induction-Variable Exposure

4.5.1 Forward Expression Substitution

4.5.2 Induction-Variable Substitution

4.5.3 Driving the Substitution Process

4.6 Summary

4.7 Case Studies

4.8 Historical Comments and References


Chapter 5 - Enhancing Fine-Grained Parallelism

5.1 Introduction

5.2 Loop Interchange

5.2.1 Safety of Loop Interchange

5.2.2 Profitability of Loop Interchange

5.2.3 Loop Interchange and Vectorization

5.3 Scalar Expansion

5.4 Scalar and Array Renaming

5.5 Node Splitting

5.6 Recognition of Reductions

5.7 Index-Set Splitting

5.7.1 Threshold Analysis

5.7.2 Loop Peeling

5.7.3 Section-Based Splitting

5.8 Run-Time Symbolic Resolution

5.9 Loop Skewing

5.10 Putting It All Together

5.11 Complications of Real Machines

5.12 Summary

5.13 Case Studies

5.13.1 PFC

5.13.2 Ardent Titan Compiler

5.13.3 Vectorization Performance

5.14 Historical Comments and References


Chapter 6 - Creating Coarse-Grained Parallelism

6.1 Introduction

6.2 Single-Loop Methods

6.2.1 Privatization

6.2.2 Loop Distribution

6.2.3 Alignment

6.2.4 Code Replication

6.2.5 Loop Fusion

6.3 Perfect Loop Nests

6.3.1 Loop Interchange for Parallelization

6.3.2 Loop Selection

6.3.3 Loop Reversal

6.3.4 Loop Skewing for Parallelization

6.3.5 Unimodular Transformations

6.3.6 Profitability-Based Parallelization Methods

6.4 Imperfectly Nested Loops

6.4.1 Multilevel Loop Fusion

6.4.2 A Parallel Code Generation Algorithm

6.5 An Extended Example

6.6 Packaging of Parallelism

6.6.1 Strip Mining

6.6.2 Pipeline Parallelism

6.6.3 Scheduling Parallel Work

6.6.4 Guided Self-Scheduling

6.7 Summary

6.8 Case Studies

6.8.1 PFC and ParaScope

6.8.2 Ardent Titan Compiler

6.9 Historical Comments and References


Chapter 7 - Handling Control Flow

7.1 Introduction

7.2 If-Conversion

7.2.1 Definition

7.2.2 Branch Classification

7.2.3 Forward Branches

7.2.4 Exit Branches

7.2.5 Backward Branches

7.2.6 Complete Forward Branch Removal

7.2.7 Simplification

7.2.8 Iterative Dependences

7.2.9 If-Reconstruction

7.3 Control Dependence

7.3.1 Constructing Control Dependence

7.3.2 Control Dependence in Loops

7.3.3 An Execution Model for Control Dependences

7.3.4 Application of Control Dependence to Parallelization

7.4 Summary

7.5 Case Studies

7.6 Historical Comments and References


Chapter 8 - Improving Register Usage

8.1 Introduction

8.2 Scalar Register Allocation

8.2.1 Data Dependence for Register Reuse

8.2.2 Loop-Carried and Loop-Independent Reuse

8.2.3 A Register Allocation Example

8.3 Scalar Replacement

8.3.1 Pruning the Dependence Graph

8.3.2 Simple Replacement

8.3.3 Handling Loop-Carried Dependences

8.3.4 Dependences Spanning Multiple Iterations

8.3.5 Eliminating Scalar Copies

8.3.6 Moderating Register Pressure

8.3.7 Scalar Replacement Algorithm

8.3.8 Experimental Data

8.4 Unroll-and-Jam

8.4.1 Legality of Unroll-and-Jam

8.4.2 Unroll-and-Jam Algorithm

8.4.3 Effectiveness of Unroll-and-Jam

8.5 Loop Interchange for Register Reuse

8.5.1 Considerations for Loop Interchange

8.5.2 Loop Interchange Algorithm

8.6 Loop Fusion for Register Reuse

8.6.1 Profitable Loop Fusion for Reuse

8.6.2 Loop Alignment for Fusion

8.6.3 Fusion Mechanics

8.6.4 A Weighted Loop Fusion Algorithm

8.6.5 Multilevel Loop Fusion for Register Reuse

8.7 Putting It All Together

8.7.1 Ordering the Transformations

8.7.2 An Example: Matrix Multiplication

8.8 Complex Loop Nests

8.8.1 Loops with If Statements

8.8.2 Trapezoidal Loops

8.9 Summary

8.10 Case Studies

8.11 Historical Comments and References


Chapter 9 - Managing Cache

9.1 Introduction

9.2 Loop Interchange for Spatial Locality

9.3 Blocking

9.3.1 Unaligned Data

9.3.2 Legality of Blocking

9.3.3 Profitability of Blocking

9.3.4 A Simple Blocking Algorithm

9.3.5 Blocking with Skewing

9.3.6 Fusion and Alignment

9.3.7 Blocking in Combination with Other Transformations

9.3.8 Effectiveness

9.4 Cache Management in Complex Loop Nests

9.4.1 Triangular Cache Blocking

9.4.2 Special-Purpose Transformations

9.5 Software Prefetching

9.5.1 A Software Prefetching Algorithm

9.5.2 Effectiveness of Software Prefetching

9.6 Summary

9.7 Case Studies

9.8 Historical Comments and References


Chapter 10 - Scheduling

10.1 Introduction

10.2 Instruction Scheduling

10.2.1 Machine Model

10.2.2 Straight-Line Graph Scheduling

10.2.3 List Scheduling

10.2.4 Trace Scheduling

10.2.5 Scheduling in Loops

10.3 Vector Unit Scheduling

10.3.1 Chaining

10.3.2 Coprocessors

10.4 Summary

10.5 Case Studies

10.6 Historical Comments and References


Chapter 11 - Interprocedural Analysis and Optimization

11.1 Introduction

11.2 Interprocedural Analysis

11.2.1 Interprocedural Problems

11.2.2 Interprocedural Problem Classification

11.2.3 Flow-Insensitive Side Effect Analysis

11.2.4 Flow-Insensitive Alias Analysis

11.2.5 Constant Propagation

11.2.6 Kill Analysis

11.2.7 Symbolic Analysis

11.2.8 Array Section Analysis

11.2.9 Call Graph Construction

11.3 Interprocedural Optimization

11.3.1 Inline Substitution

11.3.2 Procedure Cloning

11.3.3 Hybrid Optimizations

11.4 Managing Whole-Program Compilation

11.5 Summary

11.6 Case Studies

11.7 Historical Comments and References


Chapter 12 - Dependence in C and Hardware Design

12.1 Introduction

12.2 Optimizing C

12.2.1 Pointers

12.2.2 Naming and Structures

12.2.3 Loops

12.2.4 Scoping and Statics

12.2.5 Dialect

12.2.6 Miscellaneous

12.3 Hardware Design

12.3.1 Hardware Description Languages

12.3.2 Optimizing Simulation

12.3.3 Synthesis Optimization

12.4 Summary

12.5 Case Studies

12.6 Historical Comments and References


Chapter 13 - Compiling Array Assignments

13.1 Introduction

13.2 Simple Scalarization

13.3 Scalarization Transformations

13.3.1 Loop Reversal

13.3.2 Input Prefetching

13.3.3 Loop Splitting

13.4 Multidimensional Scalarization

13.4.1 Simple Scalarization in Multiple Dimensions

13.4.2 Outer Loop Prefetching

13.4.3 Loop Interchange for Scalarization

13.4.4 General Multidimensional Scalarization

13.4.5 A Scalarization Example

13.5 Considerations for Vector Machines

13.6 Postscalarization Interchange and Fusion

13.7 Summary

13.8 Case Studies

13.9 Historical Comments and References


Chapter 14 - Compiling High Performance Fortran

14.1 Introduction

14.2 HPF Compiler Overview

14.3 Basic Loop Compilation

14.3.1 Distribution Propagation and Analysis

14.3.2 Iteration Partitioning

14.3.3 Communication Generation

14.4 Optimization

14.4.1 Communication Vectorization

14.4.2 Overlapping Communication and Computation

14.4.3 Alignment and Replication

14.4.4 Pipelining

14.4.5 Identification of Common Recurrences

14.4.6 Storage Management

14.4.7 Handling Multiple Dimensions

14.5 Interprocedural Optimization for HPF

14.6 Summary

14.7 Case Studies

14.8 Historical Comments and References


Appendix - Fundamentals of Fortran 90


Lexical Properties

Array Assignment

Library Functions

Further Reading




No. of pages:
© Morgan Kaufmann 2001
26th September 2001
Morgan Kaufmann
Paperback ISBN:
Hardcover ISBN:

About the Authors

Randy Allen

Randy Allen received his A.B. summa cum laude in chemistry from Harvard University and his M.A. and Ph.D. in mathematical sciences from Rice University. After serving a research fellowship at Rice, Dr. Allen entered the practical world of industrial compiler construction. His career has spanned research, advanced development, and management at Ardent Computers, Sun Microsystems, Chronologic Simulation, Synopsys, and CynApps. He has authored or coauthored 15 conference and journal papers on computer optimization, restructuring compilers, and hardware simulation, and has served on program committees for Supercomputing and the Conference on Programming Language and Design Implementation. Mr. Allen is CEO and President of Catalytic Compilers.

Affiliations and Expertise

CEO and President of Catalytic Compilers

Ken Kennedy

p> Ken Kennedy is the Ann and John Doerr Professor of Computational Engineering and Director of the Center for High Performance Software Research (HiPerSoft) at Rice University. He is a fellow of the Institute of Electrical and Electronics Engineers, the Association for Computing Machinery, and the American Association for the Advancement of Science and has been a member of the National Academy of Engineering since 1990. From 1997 to 1999, he served as cochair of the President's Information Technology Advisory Committee (PITAC). For his leadership in producing the PITAC report on funding of information technology research, he received the Computing Research Association Distinguished Service Award (1999) and the RCI Seymour Cray HPCC Industry Recognition Award (1999).

Professor Kennedy has published over 150 technical articles and supervised 34 Ph.D. dissertations on programming support software for high-performance computer systems. In recognition of his contributions to software for high-performance computation, he received the 1995 W. Wallace McDowell Award, the highest research award of the IEEE Computer Society. In 1999, he was named the third recipient of the ACM SIGPLAN Programming Languages Achievement Award.

Affiliations and Expertise

Rice University


"Compilers are the Queen of Computing Science and Technology. They have long been the bridge from applications to systems, but now they determine which architectural features should be implemented in new hardware, as well as which new language features will be effective for software developers.
The authors write from great experience as innovators and developers of the field. This book is a very comprehensive treatment of optimization for cache management, vectorization, parallelization, and more. The title refers to Modern Architectures and indeed the subject matter is applicable from desktop systems to the world's fastest supercomputers. The examples are drawn from Fortran, but the theory applies to many programming languages. I think the book will serve as an excellent textbook as well as a much used reference for software developers."
—David Kuck, Intel

" This book makes an extremely valuable contribution to the field of compilation by presenting the fundamental basics in compiling technology for high performance computing systems. The authors provide careful and thorough descriptions of the analyses, including data and control dependences and interprocedural analysis, and the code transformations that can be applied as a result of the analyses. The book covers a comprehensive range of important topics needed to compile for high performance systems. The organization and structure of the book as well as the clear writing style make it an excellent text book, highly valuable reference book and a useful guide for implementing the techniques."
—Mary Lou Soffa, University of Pittsburgh

"The much awaited book by Randy Allen, a leading practitioner and Ken Kennedy, a pioneer in compiler research provides a skillful encapsulation of the results of more than 30 years of research and development in restructuring compilers - a significant part of which was done by the authors. The combination of staged introduction of each topic with the aid of examples and the detailed algorithmic layout of each optimization make this text an outstanding reference for the expert as well as for new students of the topic. This book constitutes yet the most complete and rich text of compiler optimization fundamentals and algorithms, an invaluable resource for researchers, educators and compiler developer."
—Constantine Polychronopoulos, University of Illinois Urbana-Champaign

"Kennedy and Allen take a unique approach in this book. They focus on how compilation techniques work together to make practical program analysis and optimization algorithms for achieving good performance on parallel machines, whereas previous texts focus on the specific techniques. Every compiler writer should have a copy of this insightful and lively book in their library!"

—Kathryn S McKinley, University of Texas at Austin

"Dependence analysis is at the core of a huge class of program transformations and optimizations, including cache management, exploiting parallelism, and many many others. The authors have provided information that is essential to practicing professionals in the area of high-performance computer architecture. An indispensable reference."
—Rohit Chandra, NARUS Inc.

Ratings and Reviews