Parallelism Pearls for Multicore and Many-core Programming

Posted by reinders on Tuesday March 24, 2015 at 05:00:00

I had the privilege of giving a talk today in Maryland that covered many topics, including parallelism, Intel Xeon Phi, Intel Parallel Studio XE (tools), and our books. I have posted the slides for the students and anyone else who is interested.

Posted by reinders on Tuesday January 27, 2015 at 11:27:22
Jim Dempsey provided this video related to his chapter: High Performance Parallelism Pearls, Chapter 5, Plesiochronous Phasing Barriers, by Jim Dempsey.

This is a video of the Plesiochronous Phasing Barriers in action. The video is not annotated and has no voice-over; a short explanation is provided below the video.

The left half of the screen represents the optimized tiled version and the right half represents the plesiochronous version. Each half is divided into two parts:

Top) A view of the Y/Z plane with the X dimension into the screen. Each pixel in the top portion of each side changes color upon completion of the computation of a column along X. Color changes indicate the rate of computation; the position of a change indicates where and when in the Y/Z plane the computation occurred.

Bottom) Each thread displays an individual line progressing in time from left to right and wrapping around (raster-like), drawn in two colors: green while the thread is computing, red while it waits in a barrier. (Red “ticks” may appear dark rather than red.)

In the left half (traditional tiled), you can note that the Y/Z columns of X are in at most two colors (time phases) at any one time. The bottom of the left half illustrates that the traditional tiled method runs well until the point where the threads start to complete their designated tile(s) and reach the barrier. It looks like a cascade of cars reaching a traffic jam, which doesn’t clear until all threads reach the barrier.

In the right half (plesiochronous), you can note that the Y/Z columns of X are in at most three colors (time phases) at any one time. The bottom half illustrates that the barrier wait times of the threads are, for the most part, not synchronized. You may notice that four threads appear to be synchronized, and they are. These are the threads of the same core, and the plesiochronous barrier scheme uses core barriers. These threads are not adjacent because of KMP_AFFINITY=scatter. You may also note that each thread computes its X columns along the Y direction, so the thread's tile is essentially not rectangular. You will also notice that the time domain edge is ragged, indicating the time skew between threads. Occasionally you will also notice threads getting delayed, presumably by worst-case memory latencies due to evictions.
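Why the four synchronized threads are not adjacent can be illustrated with a small sketch. It assumes a simplified model of scatter placement in which consecutive thread numbers are dealt round-robin across cores; the real runtime works from the machine topology, and the function names and the 4-core, 16-thread machine here are hypothetical.

```python
from collections import defaultdict


def scatter_core(thread_id, num_cores):
    """Core index for a thread under a simplified round-robin scatter model."""
    return thread_id % num_cores


def threads_by_core(num_threads, num_cores):
    """Group thread ids by the core they land on under the model above."""
    groups = defaultdict(list)
    for t in range(num_threads):
        groups[scatter_core(t, num_cores)].append(t)
    return dict(groups)


# Hypothetical small machine: 4 cores with 4 hardware threads each.
groups = threads_by_core(num_threads=16, num_cores=4)
print(groups[0])  # [0, 4, 8, 12]
```

Under this model, the threads sharing a core are spaced num_cores apart in thread numbering, which is consistent with the synchronized lines in the video being separated rather than neighboring.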

The programs were instrumented to collect time stamp counter (RDTSC) information for each thread as it entered and left a computational region. The time interval between computational regions is the barrier wait time.
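The barrier wait calculation described above can be sketched as follows. This is a minimal illustration, not the instrumentation code from the chapter: the record layout (one `(enter, leave)` pair of counter values per computational region) and the sample numbers are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed record layout: (enter_tsc, leave_tsc) for one computational region.
Region = Tuple[int, int]


def barrier_waits(regions: List[Region]) -> List[int]:
    """Ticks spent between leaving one region and entering the next."""
    waits = []
    for (_, leave), (next_enter, _) in zip(regions, regions[1:]):
        waits.append(next_enter - leave)
    return waits


def compute_fraction(regions: List[Region]) -> float:
    """Share of the thread's recorded time span spent inside compute regions."""
    busy = sum(leave - enter for enter, leave in regions)
    span = regions[-1][1] - regions[0][0]
    return busy / span


# Hypothetical counter values for a single thread (not data from the book).
sample = [(100, 400), (450, 800), (820, 1200)]
print(barrier_waits(sample))  # [50, 20]
print(compute_fraction(sample))
```

In the video's terms, each `(enter, leave)` pair would be drawn as a green segment and each gap returned by `barrier_waits` as a red one.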

You may click on the video to bring it up full size (double the width and height from that shown here).

Posted by reinders on Tuesday November 18, 2014 at 04:34:26

We have all the figures (diagrams, photos, etc.) from the book available to download.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: High Performance Parallelism Pearls by Jim Jeffers and James Reinders, copyright 2015, published by Morgan Kaufmann, ISBN 978-0-128-02118-7.

You can get them all by this one simple download: All Figures, TIFF format (97Mb).

Most of the figures are also available in EPS (but NOT all figures) by this one simple download: Many Figures, EPS format, NOTE: not all figures in this ZIP (130Mb).

Posted by reinders on Saturday November 15, 2014 at 03:14:06

Jim and I got to see the first copies of the new book today - together. They are here in time for SC'14. We have a book signing in the Intel booth on Thursday (Nov 20, 2014) at noon (drop by with your copy and we can sign it! - hopefully some of our coauthors will be there too.) Many thanks to the amazing team at Morgan Kaufmann Publishing, and to the wonderful contributors who worked so hard to share their work.

Link to the PUBLISHER'S WEB SITE: for High Performance Parallelism Pearls

Posted by reinders on Friday November 14, 2014 at 08:01:25

This download has the code (90Mb in size), complete with Makefiles and build instructions, used in our book "High Performance Parallelism Pearls." Call this "version 1." We'll update it if the authors have revisions or if you report (email us!) any issues. Regardless, this is the version that has the code the authors used for the book, with instructions that worked for us! I want to THANK all the contributors for being generous in both writing their chapters and working to share their code. We hope you find it useful. Please drop us a note with any feedback or suggestions! DOWNLOAD CODE - 90Mb ZIP FILE LINK

Posted by reinders on Wednesday November 12, 2014 at 10:32:05

We have created a Powerpoint summary of the Parallelism Pearls book. If you expand on these slides, please share with me! I will be happy to grow and expand (and correct) this Powerpoint deck. I have uploaded a completely open and unlocked PPTX as well as a PDF version. Call this "v1." I'll post updates as appropriate. I will be posting the CODE and FIGURES to this web site soon as well.

Powerpoint: PearlsBook.pptx

PDF: PearlsBook.pdf

Posted by reinders on Wednesday October 22, 2014 at 08:15:27

October 22: update... all editing is done... it heads to the printer now. 548 pages by my count.

We have a publication date: November 17! (and ISBN number: 978-0128021187)
High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches

Where to order:

There are some early reviews/write-ups based on a draft of the book:

Teaching The World About Intel Xeon Phi

The Unabridged Chapter 1 Introduction To High Performance Parallelism Pearls

From ‘Correct’ to ‘Correct & Efficient’: a Hydro2D case study with Godunov’s scheme

(check for even more being posted for other chapters... Xeon Phi articles)

Posted by reinders on Thursday October 16, 2014 at 11:45:17

Colfax Research has just posted the 280-slide deck from their “Parallel Programming and Optimization with Intel Xeon Phi Coprocessors” developer training program.

Posted by reinders on Wednesday August 20, 2014 at 09:51:38

Work on this book is underway, and all chapters have been selected. Feel free to contact us about future projects.


You are invited to contribute to a future volume, High Performance Parallelism Pearls Volume 2 – Multicore and Many-core Programming Approaches (working title), a contribution-based book that will focus on practical techniques for Intel Xeon processor and Intel Xeon Phi coprocessor parallel computing.

Submissions received now will be considered for a volume to be published late summer 2015 (the first volume appeared in November 2014, based on proposals that were submitted in May 2014). The deadline for most Volume Two submissions will range from February 20 to March 14, 2015. We will have a very limited ability to take a few submissions later than that, so if you need an additional 2-3 weeks please request it and we will discuss (in such a case, we'll have some additional requirements to help ensure our book project can stay on schedule).

Please submit your proposal and we'll work with you to refine it as needed. Each Chapter must contain detailed technical information, include real results on processors and coprocessors, explain optimizations that help processors and coprocessors, and include examples with source code (we put the code on the web). We have editors and artists to help authors, so submissions do not need to be perfect (much easier than a conference paper!). However, real world code and real results are required!

If you would like to contribute, please fill out the form below completely and click SUBMIT. You will receive a copy of your submission in e-mail (to the e-mail address you specify).

REQUIRED: You will have results on multicore (Intel Xeon processors or others) and many-core (Intel Xeon Phi).

REQUIRED: You will show the actual code, and discuss code changes, in the Chapter. Snippets of code will be shown in the Chapter, with the complete code available for download.

REQUIRED: You will provide the code that is discussed for download, with instructions/Makefile to build and run.

Each proposal will be considered to be a chapter in the book with each chapter author(s) acknowledged. Jim Jeffers and James Reinders will serve as the chief editors for the book, and will assist in preparing for publication, ensuring reasonable flow and consistency. Chapter submissions will be done in Microsoft Word unless other provisions are required (LaTeX can be accepted as well). Code is required for each chapter, and plays an important part in fully communicating your programming techniques and helping others leverage your expertise. For this reason, preference will be given to submissions with code that can be downloaded, built and run in a straightforward manner. The focus is on best and practical methods to utilize parallelism for performance on both processors and coprocessors. While the math is important, please keep in mind the focus is on the computing gem so others may learn.

We will use to manage all writing submissions, this form is the first step. You may email us at with questions. Please submit sooner rather than later... no obligation, it just starts a conversation with us.

Posted by reinders on Monday June 23, 2014 at 01:45:31

The world's fastest computer, for the third time in a row on the biannual Top500 list, uses Intel Xeon Phi coprocessors to make it possible.
Intel Xeon Phi coprocessors are used in the #1, #7, #15, #39, #50, #51, #65, #92, #101, #102, #103, #134, #157, #186, #235, #251 and #451 systems.
No wonder we are working on another book about programming for highly parallel systems!

Posted by reinders on Thursday May 1, 2014 at 12:28:37

Click on the image to download my presentation

Posted by reinders on Saturday April 5, 2014 at 08:30:11


Posted by reinders on Wednesday March 26, 2014 at 11:00:01

Posted by reinders on Thursday July 18, 2013 at 04:23:41
All the figures, tables, charts and drawings are available for download.
Please use them freely with attribution. You should find all of them to be high quality artwork, suitable for presentations and other uses.
Suggested attribution: (c) 2013 Jim Jeffers and James Reinders, used with permission.
Feel free to mention the book too: "Intel Xeon Phi Coprocessor High Performance Programming."
If you like our book - please let others know! If you have suggestions or feedback, please let us know!

GZipped TAR file: XeonPhiBookFiguresEtc.tar.gz

ZIP file:

Posted by jimjeffers on Tuesday April 16, 2013 at 09:39:14

Check out the download page for the code samples from Chapters 2, 3, and 4...

Posted by reinders on Tuesday April 2, 2013 at 10:42:55

Our book has been reviewed at Dr. Dobb's - online at

Posted by reinders on Saturday February 23, 2013 at 05:32:58

I was excited to get a copy (sent to each author express from the printer) this week. It is available for purchase from many stores including

Posted by reinders on Tuesday January 8, 2013 at 12:26:28

As of today - the book is in final production steps... we have proofreading to do still, but everything is in the production department at Morgan Kaufmann - on track to see books in February 2013.

As a teaser - here is the outline for the book:

Chapter 1 - Introduction
Chapter 2 - High Performance Closed Track Test Drive!
Chapter 3 - A Friendly Country Road Race
Chapter 4 - Driving Around Town: Optimizing A Real-World Code Example
Chapter 5 - Lots of Data (Vectors)
Chapter 6 - Lots of Tasks (not Threads)
Chapter 7 - Offload
Chapter 8 - Coprocessor Architecture
Chapter 9 - Coprocessor System Software
Chapter 10 - Linux on the Coprocessor
Chapter 11 - Math Library
Chapter 12 - MPI
Chapter 13 - Profiling and Timing
Chapter 14 - Summary

We expect that to come out just over 400 pages.

Posted by reinders on Tuesday January 8, 2013 at 12:18:59

This book belongs on the bookshelf of every HPC professional. Not only does it successfully and accessibly teach us how to use and obtain high performance on the Intel MIC architecture, it is about much more than that. It takes us back to the universal fundamentals of high-performance computing including how to think and reason about the performance of algorithms mapped to modern architectures, and it puts into your hands powerful tools that will be useful for years to come.

—Robert J. Harrison
Institute for Advanced Computational Science,
Stony Brook University

(this will be in the Preface to the book)

Posted by reinders on Sunday December 16, 2012 at 10:09:15
Our book Intel Xeon Phi Coprocessor High Performance Programming (ISBN 978-0-124-10414-3) will be available from the publisher Morgan Kaufmann in February 2013, and many book sellers (including
Pushing computing to new heights is among the most exciting human endeavors, both for the thrill of doing it and for the thrill of what it makes possible.
The Intel® Many Integrated Core (MIC) architecture and the first Intel® Xeon Phi™ coprocessor have brought us one of those rare, and very important, new chapters in this quest to push computing to new limits.
Jim and James spent two years helping educate customers about the prototype and pre-production hardware before Intel introduced the first Intel® Xeon Phi™ coprocessor. They have distilled their own experiences coupled with insights from many expert customers, Intel Field Engineers, Application Engineers and Technical Consulting Engineers, to create this authoritative first book on programming for this new architecture and these new products.
This book is useful even before you ever touch a system with an Intel® Xeon Phi™ coprocessor. The key techniques emphasized in this book are essential to programming any modern parallel computing system whether based on Intel Xeon processors, Intel Xeon Phi coprocessors, or other high performance microprocessors. Applying these techniques will generally increase your program performance on any system, and better prepare you for Intel Xeon Phi coprocessors and the Intel MIC architecture.