Intel Xeon Phi Processor High Performance Programming

Knights Landing Edition

2nd Edition - May 31, 2016
Authors: James Jeffers, James Reinders, Avinash Sodani
Language: English
Paperback ISBN: 978-0-12-809194-4
eBook ISBN: 978-0-12-809195-1

Intel Xeon Phi Processor High Performance Programming is an all-in-one source of information for programming the Second-Generation Intel Xeon Phi product family, also called Knights Landing. The authors provide timely Knights Landing-specific details, programming advice, and real-world examples. They distill their years of Xeon Phi programming experience, coupled with insights from many expert customers, Intel Field Engineers, Application Engineers, and Technical Consulting Engineers, to create this authoritative book on the essentials of programming for Intel Xeon Phi products.

Intel® Xeon Phi™ Processor High-Performance Programming is useful even before you ever program a system with an Intel Xeon Phi processor. To help ensure that your applications run at maximum efficiency, the authors emphasize key techniques for programming any modern parallel computing system, whether based on Intel Xeon processors, Intel Xeon Phi processors, or other high-performance microprocessors. Applying these techniques will generally increase your program performance on any system and prepare you better for Intel Xeon Phi processors.

Foreword

Extending the Sports Car Analogy to Higher Performance
What Exactly Is The Unfair Advantage?
Peak Performance Versus Drivable/Usable Performance
How Does The Unfair Advantage Relate to This Book?
Closing Comments

Preface

Sports Car Tutorial: Introduction for Many-Core Is Online
Parallelism Pearls: Inspired by Many Cores
Organization
Structured Parallel Programming
What’s New?
lotsofcores.com

Section I: Knights Landing

Introduction

Chapter 1: Introduction

Abstract
Introduction to Many-Core Programming
Trend: More Parallelism
Why Intel® Xeon Phi™ Processors Are Needed
Processors Versus Coprocessor
Measuring Readiness for Highly Parallel Execution
What About GPUs?
Enjoy the Lack of Porting Needed but Still Tune!
Transformation for Performance
Hyper-Threading Versus Multithreading
Programming Models
Why We Could Skip To Section II Now
For More Information

Chapter 2: Knights Landing overview

Abstract
Overview
Instruction Set
Architecture Overview
Motivation: Our Vision and Purpose
Summary
For More Information

Chapter 3: Programming MCDRAM and Cluster modes

Abstract
Programming for Cluster Modes
Programming for Memory Modes
Query Memory Mode and MCDRAM Available
SNC Performance Implications of Allocation and Threading
How to Not Hard Code the NUMA Node Numbers
Approaches to Determining What to Put in MCDRAM
Why Rebooting Is Required to Change Modes
BIOS
Summary
For More Information

Chapter 4: Knights Landing architecture

Abstract
Tile Architecture
Cluster Modes
Memory Interleaving
Memory Modes
Interactions of Cluster and Memory Modes
Summary
For More Information

Chapter 5: Intel Omni-Path Fabric

Abstract
Overview
Performance and Scalability
Transport Layer APIs
Quality of Service
Virtual Fabrics
Unicast Address Resolution
Multicast Address Resolution
Summary
For More Information

Chapter 6: μarch optimization advice

Abstract
Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
Memory Subsystem
μarch Nuances (tile)
Direct Mapped MCDRAM Cache
Advice: Use AVX-512
Summary
For More Information

Section II: Parallel Programming

Introduction

Chapter 7: Programming overview for Knights Landing

Abstract
To Refactor, or Not to Refactor, That Is the Question
Evolutionary Optimization of Applications
Revolutionary Optimization of Applications
Know When to Hold’em and When to Fold’em
For More Information

Chapter 8: Tasks and threads

Abstract
OpenMP
Fortran 2008
Intel TBB
hStreams
Summary
For More Information

Chapter 9: Vectorization

Abstract
Why Vectorize?
How to Vectorize
Three Approaches to Achieving Vectorization
Six-Step Vectorization Methodology
Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
Compiler Tips
Compiler Options
Compiler Directives
Use Array Sections to Encourage Vectorization
Look at What the Compiler Created: Assembly Code Inspection
Numerical Result Variations With Vectorization
Summary
For More Information

Chapter 10: Vectorization advisor

Abstract
Getting Started With Intel Advisor for Knights Landing
Enabling and Improving AVX-512 Code With the Survey Report
Memory Access Pattern Report
AVX-512 Gather/Scatter Profiler
Mask Utilization and FLOPs Profiler
Advisor Roofline Report
Explore AVX-512 Code Characteristics Without AVX-512 Hardware
Example — Analysis of a Computational Chemistry Code
Summary
For More Information

Chapter 11: Vectorization with SDLT

Abstract
What Is SDLT?
Getting Started
SDLT Basics
Example Normalizing 3d Points With SIMD
What Is Wrong With AOS Memory Layout and SIMD?
SIMD Prefers Unit-Stride Memory Accesses
Alpha-Blended Overlay Reference
Alpha-Blended Overlay With SDLT
Additional Features
Summary
For More Information

Chapter 12: Vectorization with AVX-512 intrinsics

Abstract
What Are Intrinsics?
AVX-512 Overview
Migrating From Knights Corner
AVX-512 Detection
Learning AVX-512 Instructions
Learning AVX-512 Intrinsics
Step-by-Step Example Using AVX-512 Intrinsics
Results Using Our Intrinsics Code
For More Information

Chapter 13: Performance libraries

Abstract
Intel Performance Library Overview
Intel Math Kernel Library Overview
Intel Data Analytics Library Overview
Together: MKL and DAAL
Intel Integrated Performance Primitives Library Overview
Intel Performance Libraries and Intel Compilers
Native (Direct) Library Usage
Offloading to Knights Landing While Using a Library
Precision Choices and Variations
Performance Tip for Faster Dynamic Libraries
For More Information

Chapter 14: Profiling and timing

Abstract
Introduction to Knights Landing Tuning
Event-Monitoring Registers
Efficiency Metrics
Potential Performance Issues
Intel VTune Amplifier XE Product
Performance Application Programming Interface
MPI Analysis: ITAC
HPCToolkit
Tuning and Analysis Utilities
Timing
Summary
For More Information

Chapter 15: MPI

Abstract
Internode Parallelism
MPI on Knights Landing
MPI Overview
How to Run MPI Applications
Analyzing MPI Application Runs
Tuning of MPI Applications
Heterogeneous Clusters
Recent Trends in MPI Coding
Putting it All Together
Summary
For More Information

Chapter 16: PGAS programming models

Abstract
To Share or Not to Share
Why use PGAS on Knights Landing?
Programming with PGAS
Performance Evaluation
Beyond PGAS
Summary
For More Information

Chapter 17: Software-defined visualization

Abstract
Motivation for Software-Defined Visualization
Software-Defined Visualization Architecture
OpenSWR: OpenGL Raster-Graphics Software Rendering
Embree: High-performance Ray Tracing Kernel Library
OSPRay: Scalable Ray Tracing Framework
Summary
Image Attributions
For More Information

Chapter 18: Offload to Knights Landing

Abstract
Offload Programming Model—Using With Knights Landing
Processors Versus Coprocessor
Offload Model Considerations
OpenMP Target Directives
Concurrent Host and Target Execution
Offload Over Fabric
Summary
For More Information

Chapter 19: Power analysis

Abstract
Power Demand Gates Exascale
Power 101
Hardware-Based Power Analysis Techniques
Software-Based Knights Landing Power Analyzer
ManyCore Platform Software Package Power Tools
Running Average Power Limit
Performance Profiling on Knights Landing
Intel Remote Management Module
Summary
For More Information

Section III: Pearls

Introduction

Chapter 20: Optimizing classical molecular dynamics in LAMMPS

Abstract
Acknowledgment
Molecular Dynamics
LAMMPS
Knights Landing Processors
LAMMPS Optimizations
Data Alignment
Data Types and Layout
Vectorization
Neighbor List
Long-Range Electrostatics
MPI and OpenMP Parallelization
Performance Results
System, Build, and Run Configurations
Workloads
Organic Photovoltaic Molecules
Hydrocarbon Mixtures
Rhodopsin Protein in Solvated Lipid Bilayer
Coarse Grain Liquid Crystal Simulation
Coarse-Grain Water Simulation
Summary
For More Information

Chapter 21: High performance seismic simulations

Abstract
High-Order Seismic Simulations
Numerical Background
Application Characteristics
Intel Architecture as Compute Engine
Highly-efficient Small Matrix Kernels
Sparse Matrix Kernel Generation and Sparse/Dense Kernel Selection
Dense Matrix Kernel Generation: AVX2
Dense Matrix Kernel Generation: AVX-512
Kernel Performance Benchmarking
Incorporating Knights Landing’s Different Memory Subsystems
Performance Evaluation
Mount Merapi
1992 Landers
Summary and Take-Aways
For More Information

Chapter 22: Weather research and forecasting (WRF)

Abstract
WRF Overview
WRF Execution Profile: Relatively Flat
History of WRF on Intel Many-Core (Intel Xeon Phi Product Line)
Our Early Experiences With WRF on Knights Landing
Compiling WRF for Intel Xeon and Intel Xeon Phi Systems
WRF CONUS12km Benchmark Performance
MCDRAM Bandwidth
Vectorization: Boost of AVX-512 Over AVX2
Core Scaling
Summary
For More Information

Chapter 23: N-Body simulation

Abstract
Parallel Programming for Noncomputer Scientists
Step-by-Step Improvements
N-Body Simulation
Optimization
Initial Implementation (Optimization Step 0)
Thread Parallelism (Optimization Step 1)
Scalar Performance Tuning (Optimization Step 2)
Vectorization With SOA (Optimization Step 3)
Memory Traffic (Optimization Step 4)
Impact of MCDRAM on Performance
Summary
For More Information

Chapter 24: Machine learning

Abstract
Convolutional Neural Networks
OverFeat-FAST Results
For More Information

Chapter 25: Trinity workloads

Abstract
Out of the Box Performance
Optimizing MiniGhost OpenMP Performance
Summary
For More Information

Chapter 26: Quantum chromodynamics

Abstract
LQCD
The QPhiX Library and Code Generator
Wilson-Dslash Operator
Configuring the QPhiX Code Generator
The Experimental Setup
Results
Conclusion
For More Information

