High Performance Parallelism Pearls Volume Two - 1st Edition - ISBN: 9780128038192, 9780128038901

High Performance Parallelism Pearls Volume Two

1st Edition

Multicore and Many-core Programming Approaches

Authors: Jim Jeffers, James Reinders
eBook ISBN: 9780128038901
Paperback ISBN: 9780128038192
Imprint: Morgan Kaufmann
Published Date: 23rd July 2015
Page Count: 592

High Performance Parallelism Pearls Volume Two offers another set of examples that demonstrate how to leverage parallelism. As in Volume One, the techniques show how to program processors and coprocessors with the same programming approach, illustrating the most effective ways to combine Xeon Phi coprocessors with Xeon and other multicore processors. The book includes examples of successful programming efforts drawn from industries and domains such as biomedicine, genetics, finance, manufacturing, imaging, and more. Each chapter in this edited work explains the programming techniques used in detail and shows high-performance results on both Intel Xeon Phi coprocessors and multicore processors. Learn from dozens of new examples and case studies: "success stories" that demonstrate not just the features of Xeon-powered systems, but also how to leverage parallelism across these heterogeneous systems.

Key Features

  • Promotes write-once, run-anywhere coding, showing how to code for high performance on multicore processors and Xeon Phi
  • Examples from multiple vertical domains illustrating real-world use of Xeon Phi coprocessors
  • Source code available for download to facilitate further exploration


Readership

Computer engineers in high-performance computing and system developers in vertical domains hoping to leverage HPC

Table of Contents

  • Acknowledgments
  • Foreword
    • Making a bet on many-core
    • 2013 Stampede: Intel many-core system – a first
    • HPC journey and revelation
    • Stampede users discover: It’s parallel programming
    • This book is timely and important
  • Preface
    • Inspired by 61 cores: A new era in programming
  • Chapter 1: Introduction
    • Abstract
    • Applications and techniques
    • SIMD and vectorization
    • OpenMP and nested parallelism
    • Latency optimizations
    • Python
    • Streams
    • Ray tracing
    • Tuning prefetching
    • MPI shared memory
    • Using every last core
    • OpenCL vs. OpenMP
    • Power analysis for nodes and clusters
    • The future of many-core
    • Downloads
  • Chapter 2: Numerical Weather Prediction Optimization
    • Abstract
    • Numerical weather prediction: Background and motivation
    • WSM6 in the NIM
    • Shared-memory parallelism and controlling horizontal vector length
    • Array alignment
    • Loop restructuring
    • Compile-time constants for loop and array bounds
    • Performance improvements
    • Summary
  • Chapter 3: WRF Goddard Microphysics Scheme Optimization
    • Abstract
    • Acknowledgments
    • The motivation and background
    • WRF Goddard microphysics scheme
    • Summary
  • Chapter 4: Pairwise DNA Sequence Alignment Optimization
    • Abstract
    • Pairwise sequence alignment
    • Parallelization on a single coprocessor
    • Parallelization across multiple coprocessors using MPI
    • Performance results
    • Summary
  • Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery
    • Abstract
    • Parallelism enables proteome-scale structural bioinformatics
    • Overview of eFindSite
    • Benchmarking dataset
    • Code profiling
    • Porting eFindSite for coprocessor offload
    • Parallel version for a multicore processor
    • Task-level scheduling for processor and coprocessor
    • Case study
    • Summary
  • Chapter 6: Amber PME Molecular Dynamics Optimization
    • Abstract
    • Theory of MD
    • Acceleration of neighbor list building using the coprocessor
    • Acceleration of direct space sum using the coprocessor
    • Additional optimizations in coprocessor code
    • Modification of load balance algorithm
    • Compiler optimization flags
    • Results
    • Conclusions
  • Chapter 7: Low-Latency Solutions for Financial Services Applications
    • Abstract
    • Introduction
    • The opportunity
    • Packet processing architecture
    • The symmetric communication interface
    • Optimizing packet processing on the coprocessor
    • Results
    • Conclusions
  • Chapter 8: Parallel Numerical Methods in Finance
    • Abstract
    • Overview
    • Introduction
    • Pricing equation for American option
    • Initial C/C++ implementation
    • Scalar optimization: Your best first step
    • SIMD parallelism—Vectorization
    • Thread parallelization
    • Scale from multicore to many-core
    • Summary
    • For more information
  • Chapter 9: Wilson Dslash Kernel From Lattice QCD Optimization
    • Abstract
    • The Wilson-Dslash kernel
    • First implementation and performance
    • Optimized code: QPhiX and QphiX-Codegen
    • Code generation with QphiX-Codegen
    • Performance results for QPhiX
    • The end of the road?
  • Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism in Practice
    • Abstract
    • Analyzing the CMB with Modal
    • Optimization and modernization
    • Introducing nested parallelism
    • Results
    • Summary
  • Chapter 11: Visual Search Optimization
    • Abstract
    • Image-matching application
    • Image acquisition and processing
    • Keypoint matching
    • Applications
    • A study of parallelism in the visual search application
    • Database (db) level parallelism
    • Flann library parallelism
    • Experimental evaluation
    • Setup
    • Database threads scaling
    • Flann threads scaling
    • KD-tree scaling with dbthreads
    • Summary
  • Chapter 12: Radio Frequency Ray Tracing
    • Abstract
    • Acknowledgments
    • Background
    • StingRay system architecture
    • Optimization examples
    • Summary
  • Chapter 13: Exploring Use of the Reserved Core
    • Abstract
    • Acknowledgments
    • The Uintah computational framework
    • Cross-compiling the UCF
    • Toward demystifying the reserved core
    • Experimental discussion
    • Summary
  • Chapter 14: High Performance Python Offloading
    • Abstract
    • Acknowledgments
    • Background
    • The pyMIC offload module
    • Example: singular value decomposition
    • GPAW
    • PyFR
    • Performance
    • Summary
  • Chapter 15: Fast Matrix Computations on Heterogeneous Streams
    • Abstract
    • The challenge of heterogeneous computing
    • Matrix multiply
    • The hStreams library and framework
    • Cholesky factorization
    • LU factorization
    • Continuing work on hStreams
    • Acknowledgments
    • Recap
    • Summary
    • Tiled hStreams matrix multiplier example source
  • Chapter 16: MPI-3 Shared Memory Programming Introduction
    • Abstract
    • Motivation
    • MPI’s interprocess shared memory extension
    • When to use MPI interprocess shared memory
    • 1-D ring: from MPI messaging to shared memory
    • Modifying MPPTEST halo exchange to include MPI SHM
    • Evaluation environment and results
    • Summary
  • Chapter 17: Coarse-Grained OpenMP for Scalable Hybrid Parallelism
    • Abstract
    • Coarse-grained versus fine-grained parallelism
    • Flesh on the bones: A FORTRAN “stencil-test” example
    • Performance results with the stencil code
    • Parallelism in numerical weather prediction models
    • Summary
  • Chapter 18: Exploiting Multilevel Parallelism in Quantum Simulations
    • Abstract
    • Science: better approximate solutions
    • About the reference application
    • Parallelism in ES applications
    • Multicore and many-core architectures for quantum simulations
    • Setting up experiments
    • User code experiments
    • Summary: try multilevel parallelism in your applications
  • Chapter 19: OpenCL: There and Back Again
    • Abstract
    • Acknowledgments
    • The GPU-HEOM application
    • The Hexciton kernel
    • Optimizing the OpenCL Hexciton kernel
    • Performance portability in OpenCL
    • Porting the OpenCL kernel to OpenMP 4.0
    • Summary
  • Chapter 20: OpenMP Versus OpenCL: Difference in Performance?
    • Abstract
    • Five benchmarks
    • Experimental setup and time measurements
    • HotSpot benchmark optimization
    • Optimization steps for the other four benchmarks
    • Summary
  • Chapter 21: Prefetch Tuning Optimizations
    • Abstract
    • Acknowledgments
    • The importance of prefetching for performance
    • Prefetching on Intel Xeon Phi coprocessors
    • Throughput applications
    • Tuning prefetching
    • Results—Prefetch tuning examples on a coprocessor
    • Results—Tuning hardware prefetching on a processor
    • Summary
  • Chapter 22: SIMD Functions Via OpenMP
    • Abstract
    • SIMD vectorization overview
    • Directive guided vectorization
    • Targeting specific architectures
    • Vector functions in C++
    • Vector functions in Fortran
    • Performance results
    • Summary
  • Chapter 23: Vectorization Advice
    • Abstract
    • The importance of vectorization
    • About DL_MESO LBE
    • Intel vectorization advisor and the underlying technology
    • Analyzing the Lattice Boltzmann code
    • Summary
  • Chapter 24: Portable Explicit Vectorization Intrinsics
    • Abstract
    • Acknowledgments
    • Related work
    • Why vectorization?
    • Portable vectorization with OpenVec
    • Real-world example
    • Performance results
    • Developing toward the future
    • Summary
  • Chapter 25: Power Analysis for Applications and Data Centers
    • Abstract
    • Introduction to measuring and saving power
    • Application: Power measurement and analysis
    • Data center: Interpretation via waterfall power data charts
    • Summary
  • Author Index
  • Subject Index


© Morgan Kaufmann 2016

About the Author

Jim Jeffers

Jim Jeffers was the primary strategic planner and one of the first full-time employees on the program that became Intel® MIC. He served as lead software engineering manager on the program and formed and launched the software development team. As the program evolved, he became the workloads (applications) and software performance team manager. He has some of the deepest insight into the market, architecture, and programming usage of the MIC product line. He has been a developer and development manager for embedded and high-performance systems for close to 30 years.

Affiliations and Expertise

Principal Engineer, Engineering Manager, Technical Computing, Intel Corporation, New Hope, PA, USA

James Reinders

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world’s first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for a number of Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as its chief software evangelist. James has published numerous articles, contributed to several books, and is widely interviewed on parallelism. James has managed software development groups, customer service and consulting teams, and business development and marketing teams. James is sought after to keynote on parallel programming, and is the author or co-author of three books currently in print, including Structured Parallel Programming, published by Morgan Kaufmann in 2012.

Affiliations and Expertise

Director and Programming Model Architect, Intel Corporation