Reinders' Blogs


Intel Parallel Computing Centers

October 22, 2013 - 4:59am

I'm excited by our announcement today of Intel® Parallel Computing Centers.  The first five centers will be located at CINECA, Purdue University, Texas Advanced Computing Center at the University of Texas (TACC), The University of Tennessee, and Zuse Institut Berlin (ZIB). There are still opportunities to propose becoming a center; see software.intel.com/en-us/articles/intel-parallel-computing-centers. The centers represent investments that I think of as "digging into code" to help make real applications better prepared to use parallel computing.

Parallel computing challenges are about enabling the future of computing, not just tuning for one hardware direction or another.  That's the challenge these centers are taking on.

I frequently hear concerns from other programmers that start with words like these:

  • "My algorithm cannot be done in parallel."
  • "My program cannot scale beyond 20 cores."
  • "I can't make my code vectorize."

The punch line usually is: Can you tell me why?

Every time it happens, I want to dig in myself to work on the program... so much to do, and so little time!

It's a fun challenge: How can we structure an algorithm so that it solves a problem while using the power of parallel computing? Sometimes today's algorithms can move to parallelism well with evolutionary changes. In other cases, previously designed algorithms prove ill-suited for exploiting parallel computing. Regardless, the opportunities for revolutionary approaches are there, waiting for inspiration.

Intel Parallel Computing Centers will help find both.  I relish the debates and discoveries these centers will help create.  We will all benefit. In every sense of the word, this is an effort to help "modernize" applications.

At Intel, we've invested heavily in a vision for parallel computing that is being called neo-heterogeneous computing. Our mission is to deliver the benefits of "heterogeneous" computing without the downsides. With the Intel Many Integrated Core (MIC) architecture, our Intel Xeon Phi coprocessors support the same familiar programming languages, models and tools for highly parallel computing that we already use for parallel computing in general. This leads to "neo-heterogeneous" clusters: "heterogeneous" hardware with "homogeneous" programming. This approach is extraordinarily valuable.

The need for neo-heterogeneous computing is enormous. It combines the promise of heterogeneous computing to deliver better compute density, higher compute performance and lower power consumption with the benefit of homogeneous programming: maintaining programming flexibility, performance and efficiency for developers.

As I mentioned previously, the first five centers will be located at CINECA, Purdue University, Texas Advanced Computing Center at the University of Texas (TACC), The University of Tennessee, and Zuse Institut Berlin (ZIB). Raj has posted more information about each center in his blog.  The one I've been working with most closely recently is the very promising Memory Access Centric Performance Optimization (MACPO) research for TACC's PerfExpert project (read the various papers). I'm excited to see what results we can get in the upcoming year from this fine work.

We are encouraging proposals for more centers from others who relish applying parallel programming skills to existing codes and moving them into a parallel world. More information on the program can be found by visiting software.intel.com/en-us/articles/intel-parallel-computing-centers. The results will help break new ground, which will provide valuable lessons for us all, while yielding practical benefits in important open source applications.

Processor Tracing

September 18, 2013 - 11:44am

Intel® Processor Trace (Intel® PT) is an exciting new feature coming in future processors that can be enormously helpful in debugging, because it exposes an accurate and detailed trace of activity, with triggering and filtering capabilities to help isolate the tracing that matters. We released specifications recently; now a library is available to enable tool development, and a talk was given this week on the work to make these capabilities available in Linux. Tool and operating system developers have the specifications and our library to enable development.

We have released a library, along with sample tools, to enable use of Intel® Processor Trace (Intel® PT); it is called the "Processor Trace Decoder Library" and is available as a free download.  I can tell you a little about this project, and I will also explain Intel PT to motivate the decoding capabilities of the Processor Trace Decoder Library.

The project can support any operating system that is enabled for using Intel PT.  Intel PT is presented as a performance event, so support in an operating system is easy to detect by checking whether that event is available to configure and use. Changes for Linux have been under way; the status of some of that work was presented this week (presentation available at linuxfoundation-PT for LinuxCon.pdf). In time, we expect other operating systems, including Windows and OS X, to include support for Intel PT too, and our Processor Trace Decoder Library is ready for that: it has already been verified to build on Linux, Windows and OS X.
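
To illustrate the detection idea on Linux, here is a minimal sketch; the sysfs path is an assumption based on how perf PMUs are typically exposed, and other operating systems will differ.

#include <fstream>
#include <iostream>

// If the kernel and CPU support Intel PT, the perf subsystem typically
// exposes it as a PMU whose dynamic event type can be read from sysfs.
int main() {
    std::ifstream f("/sys/bus/event_source/devices/intel_pt/type");
    int type = 0;
    if (f >> type)
        std::cout << "Intel PT perf event available (type " << type << ")\n";
    else
        std::cout << "Intel PT perf event not available\n";
    return 0;
}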

The Processor Trace Decoder Library project contains a library for decoding Intel PT, together with sample implementations of simple tools built on top of it that show how to use the library in your own tool.  The following are included in the download:

  • libipt: A packet encoder/decoder library plus a document describing the usage of the decoder library.
  • Optional Contents and Samples:
    • ptdump: Example implementation of a packet dumper.
    • ptxed: Example implementation of a trace disassembler.
    • pttc: A trace test generator.
    • script: A collection of scripts.
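
As a taste of the decoder library, here is a minimal sketch of walking the raw packets of a captured trace; the function and type names follow the libipt documentation as I recall them, so treat the exact signatures as assumptions and consult the document shipped with the library.

#include <intel-pt.h>   // header shipped with the decoder library
#include <cstddef>
#include <cstdint>

// Walk the raw packets of a captured Intel PT buffer (sketch only).
void dump_packets(uint8_t *trace, size_t size) {
    struct pt_config config;
    pt_config_init(&config);
    config.begin = trace;
    config.end   = trace + size;

    struct pt_packet_decoder *decoder = pt_pkt_alloc_decoder(&config);
    if (!decoder)
        return;

    struct pt_packet packet;
    if (pt_pkt_sync_forward(decoder) >= 0) {
        while (pt_pkt_next(decoder, &packet, sizeof(packet)) >= 0) {
            // inspect packet.type here (e.g., TNT, TIP, FUP packets)
        }
    }
    pt_pkt_free_decoder(decoder);
}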

Processor Trace

Intel recently released details about Intel Processor Trace in the latest Intel® Architecture Instruction Set Extensions Programming Reference as Chapter 11. Intel Processor Trace is a low-overhead execution tracing feature that will be supported by some processors in the future.  It works by capturing information about software execution on each hardware thread using dedicated hardware facilities so that after execution completes software can do processing of the captured trace data and reconstruct the exact program flow. Intel PT is not free with respect to execution overhead, but the overhead is low enough that it should work well in production builds for most applications.

The captured information is collected in data packets. The first implementation of Intel PT offers control flow tracing, which includes in these packets timing and program flow information (e.g. branch targets, branch taken/not taken indications) and program-induced mode related information (e.g., Intel TSX state transitions, CR3 changes). These packets may be buffered internally before being sent to the memory subsystem.

Why is this useful?

Intel PT provides the context around all kinds of events. Performance profilers can use PT to discover the root causes of 'response-time' issues -  performance issues which affect the quality of execution, if not the overall runtime.  For example, using PT, video application developers can explore, in very fine detail, the execution of problematic individual frames, something not generally possible with more traditional sampling-based collection.

Furthermore, the complete tracing provided by Intel PT enables a much deeper view into execution than has previously been commonly available; for example, loop behavior, from entry and exit down to specific backedges and loop tripcounts, is easy to extract and report.

Debuggers can use it to reconstruct the code flow that led to the current location, whether this is a crash site, a breakpoint, a watchpoint, or simply the instruction following a function call we just stepped over.  They may even allow navigating the recorded execution history via reverse stepping commands.

Another important use case is debugging stack corruptions.  When the call stack has been corrupted, normal frame unwinding usually fails or may not produce reliable results.  Intel PT can be used to reconstruct the stack back trace based on actual CALL and RET instructions.

Operating systems could include Intel PT data in core files.  This would allow debuggers to not only inspect the program state at the time of the crash, but also to reconstruct the control flow that led to the crash. It is also possible to extend this to the whole system to debug kernel panics and other system hangs. Intel PT can trace globally, so that when an operating system crash occurs the trace can be saved as part of an operating system crash dump mechanism and then used later to reconstruct the failure.

Intel PT can also help to narrow down data races in multi-threaded operating system and user program code. It can log the execution of all threads with a rough time indication.  While it is not precise enough to detect data races automatically, it can give enough information to aid in the analysis.


Trace Buffer Management

The trace data can be collected into operating system provided circular buffers. To simplify memory management and to make it easier for the operating system to find a suitably large piece of memory, the buffer need not be contiguous.

The logical buffer consists of a collection of memory pages and a control structure that describes the page layout.  The operating system may configure Intel PT to generate an interrupt when any of the sections is near full.

This enables a variety of different use cases:

  •   a single circular buffer
  •   a single buffer with copy-out
  •   a single buffer with copy-out section by section

While Intel PT generates too much data to store the execution trace over a long period of time to disk, shorter snippets can be saved.

Intel is enabling Linux to provide support for Intel PT through the perf_event interface (a presentation regarding this work is available at linuxfoundation-PT for LinuxCon.pdf).

Execution flow reconstruction

Intel PT uses a compact format to store the execution trace.  It omits everything that can be deduced directly from the code or from previous trace.

You can compare this with a brief list of instructions for navigating a maze.  As long as the way is obvious, you simply follow the twists and turns of the maze.  When you come to a junction, you need to know whether to turn left or right.  In order to navigate the maze, all you really need is a short list of left or right directions. Similarly, Intel PT uses a single bit to indicate whether a conditional branch has been taken or not taken.  Unconditional jumps and linear code are not represented in the trace at all.

The PT trace consists of a sequence of packets (which come in different types).  To represent a selection of conditional branches, for example, Intel PT uses the TNT packet that comes in two different sizes: 8 bytes and 64 bytes.

For reconstructing execution flow, there are a few more things to consider such as indirect branches, function returns, or interrupts. To model these, Intel PT adds more packets like TIP for indirect branches and function returns, and FUP for asynchronous event locations.  An interrupt will then be represented as a FUP followed by a TIP, giving the source and destination of the asynchronous branch, respectively.

Intel PT also gives information about transactional synchronization. Whenever a transaction is started, committed, or aborted, Intel PT will generate two packets: a MODE.TSX packet giving the new transactional state, and a FUP packet giving the code location at which the new state is effective.  For a transaction abort, an additional TIP packet will be generated giving the location of the corresponding abort handler.
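
As a purely conceptual illustration (a toy model, not the real packet format), the following sketch shows how a decoder can replay control flow from the program image plus a stream of taken/not-taken bits:

#include <iostream>
#include <vector>

// Toy "program": each conditional branch site has a fall-through and a taken target.
struct Branch { int fallthrough; int taken; };

// Reconstruct the sequence of visited branch sites from TNT-style bits.
// Real Intel PT packs such bits into TNT packets; everything that can be
// deduced from the binary (linear code, unconditional jumps) is omitted.
std::vector<int> reconstruct(const std::vector<Branch>& prog,
                             const std::vector<bool>& tnt, int entry) {
    std::vector<int> path;
    int pc = entry;
    for (bool taken : tnt) {
        path.push_back(pc);
        pc = taken ? prog[pc].taken : prog[pc].fallthrough;
    }
    path.push_back(pc);
    return path;
}

int main() {
    std::vector<Branch> prog = { {1, 2}, {2, 0}, {2, 2} };
    std::vector<bool> tnt = { true, false, true };       // T, NT, T
    for (int site : reconstruct(prog, tnt, 0))
        std::cout << site << ' ';                        // prints: 0 2 2 2
    std::cout << '\n';
}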

Please refer to the specification (Chapter 11 of the Intel® Architecture Instruction Set Extensions Programming Reference) for a full list of supported packets.

In order to reconstruct the execution flow, a decoder therefore needs to decode the instructions in the traced executable or library as well as the PT trace packets.  To handle dynamic libraries, the decoder also needs to consider sideband information provided by the operating system.

Intel provides an Open Source reference implementation for decoding PT packets and for reconstructing the execution flow. The Processor Trace Decoder Library (a collection of tools and libraries to enable use of Intel® Processor Trace) is available as a free download. Intel is currently working to help enable GDB, the GNU* debugger. Additional integration with other tools is being considered as well.

Summary

Intel provides a low-overhead tracing feature that allows recording the execution flow and reconstructing it at a later time.  This feature has applications for functional as well as for performance debugging.

Posted notes from "Multithreading and VFX" SIGGRAPH class

July 25, 2013 - 7:11pm

We taught a class on "Multithreading and VFX" on July 24 at SIGGRAPH 2013.

All course notes are now online at http://www.multithreadingandvfx.org/course_notes/ - useful even if you were not there!

Wonderful group of presenters to work with (in order of presentation in our class):

  • James Reinders, Intel
  • George ElKoura, Pixar Animation Studios
  • Martin Watt, Dreamworks Animation
  • Erwin Coumans, AMD
  • Ron Henderson, Dreamworks Animation
  • Jeff Lait, Side Effects Software

The web site http://www.multithreadingandvfx.org is worth a look!

Intel® AVX-512 Instructions

July 23, 2013 - 3:57pm

The latest Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. These instructions represent a significant leap to 512-bit SIMD support. Programs can pack eight double precision or sixteen single precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers within the 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction and four times that of SSE.

Intel AVX-512 instructions are important because they offer higher performance for the most demanding computational tasks. Intel AVX-512 instructions offer the highest degree of compiler support by including an unprecedented level of richness in the design of the instructions. Intel AVX-512 features include 32 vector registers each 512 bits wide, eight dedicated mask registers, 512-bit operations on packed floating point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, new operations, additional gather/scatter support, high speed math instructions, compact representation of large displacement value, and the ability to have optional capabilities beyond the foundational capabilities. It is interesting to note that the 32 ZMM registers represent 2K of register space!
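
As a small illustration of the width and the mask registers, here is a hedged sketch using the AVX-512 foundation intrinsics; it assumes a compiler and target with AVX-512 support (e.g., building with -mavx512f).

#include <immintrin.h>

// One 512-bit ZMM register holds 16 single-precision floats; a dedicated
// mask register (__mmask16) selects which lanes an operation updates.
void masked_add(const float *a, const float *b, float *c, __mmask16 m) {
    __m512 va = _mm512_loadu_ps(a);
    __m512 vb = _mm512_loadu_ps(b);
    __m512 vc = _mm512_loadu_ps(c);
    // Lanes whose mask bit is 0 keep their previous value from vc.
    __m512 vr = _mm512_mask_add_ps(vc, m, va, vb);
    _mm512_storeu_ps(c, vr);
}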

Intel AVX-512 offers a level of compatibility with AVX that is stronger than prior transitions to new widths for SIMD operations. Unlike SSE and AVX that cannot be mixed without performance penalties, the mixing of AVX and Intel AVX-512 instructions is supported without penalty. AVX registers YMM0–YMM15 map into the Intel AVX-512 registers ZMM0–ZMM15, very much like SSE registers map into AVX registers. Therefore, in processors with Intel AVX-512 support, AVX and AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers.

The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.

Intel AVX-512 in Intel products

Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing. Intel AVX-512 brings the capabilities of 512-bit vector operations, first seen in the first Xeon Phi Coprocessors (previously code named Knights Corner), into the official Intel instruction set in a way that can be utilized in processors as well. Intel AVX-512 offers some improvements and refinement over the 512-bit SIMD found on Knights Corner that I've seen bring smiles to compiler writers and application developers alike. This is done in a way that offers source code compatibility for almost all applications with a simple recompile or relinking to libraries with Knights Landing support.

Intel AVX-512 Instruction encodings

Intel AVX instructions use the VEX prefix while Intel AVX-512 instructions use the EVEX prefix which is one byte longer. The EVEX prefix enables the additional functionality of Intel AVX-512. In general, if the extra capabilities of the EVEX prefix are not needed then the AVX2 instructions can be used, coded using the VEX prefix saving a byte in certain cases. Such optimizations can be done in compiler code generators or assemblers automatically.

Emulation for Testing, Prior to Product

In order to help with testing of support, before Knights Landing is available, the Intel® Software Development Emulator (Intel® SDE) has been extended for Intel AVX-512 and is available at http://www.intel.com/software/sde.

Innovation Beyond Intel AVX-512

Intel AVX-512 foundation instructions will be included in all implementations of Intel AVX-512. Products may also include capabilities that extend Intel AVX-512 and have distinct CPUID bits for detection. Knights Landing will support three sets of capabilities to augment the foundation instructions. This is documented in the programmer’s guide; they are known as Intel AVX-512 Conflict Detection Instructions (CDI), Intel AVX-512 Exponential and Reciprocal Instructions (ERI) and Intel AVX-512 Prefetch Instructions (PFI). These capabilities provide efficient conflict detection to allow more loops to be vectorized, exponential and reciprocal operations and new prefetch capabilities, respectively. [2014-Jul-17: I've added a blog in July 2014 on additional AVX-512 instructions.]

Intel AVX-512 support

Release of detailed information on Intel AVX-512 helps enable support in tools and operating systems by the time products appear. Intel is working with both open source projects and tool vendors to help incorporate support. The Intel compilers, libraries, and analysis tools have, or will be updated, to provide first-class Intel AVX-512 support.

Intel AVX-512 documentation

The Intel AVX-512 instructions are documented in the Intel® Architecture Instruction Set Extensions Programming Reference (see the "Getting Started" tab at http://software.intel.com/en-us/intel-isa-extensions).

[2014-Jul-17: I've added a blog in July 2014 on additional AVX-512 instructions.]

For most up-to-date information on Intel AVX-512, please visit intel.com/avx512.

For more AVX development information check out the AVX homepage on IDZ.

AVX-512 instructions

July 23, 2013 - 9:03am

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

The latest Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. These instructions represent a significant leap to 512-bit SIMD support. Programs can pack eight double precision or sixteen single precision floating-point numbers, or eight 64-bit integers, or sixteen 32-bit integers within the 512-bit vectors. This enables processing of twice the number of data elements that AVX/AVX2 can process with a single instruction and four times that of SSE.

Intel AVX-512 instructions are important because they offer higher performance for the most demanding computational tasks. Intel AVX-512 instructions offer the highest degree of compiler support by including an unprecedented level of richness in the design of the instructions. Intel AVX-512 features include 32 vector registers each 512 bits wide, eight dedicated mask registers, 512-bit operations on packed floating point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, new operations, additional gather/scatter support, high speed math instructions, compact representation of large displacement value, and the ability to have optional capabilities beyond the foundational capabilities. It is interesting to note that the 32 ZMM registers represent 2K of register space!

Intel AVX-512 offers a level of compatibility with AVX that is stronger than prior transitions to new widths for SIMD operations. Unlike SSE and AVX that cannot be mixed without performance penalties, the mixing of AVX and Intel AVX-512 instructions is supported without penalty. AVX registers YMM0–YMM15 map into the Intel AVX-512 registers ZMM0–ZMM15, very much like SSE registers map into AVX registers. Therefore, in processors with Intel AVX-512 support, AVX and AVX2 instructions operate on the lower 128 or 256 bits of the first 16 ZMM registers.

The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.

Intel AVX-512 in Intel products

Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing. Intel AVX-512 brings the capabilities of 512-bit vector operations, first seen in the first Xeon Phi Coprocessors (previously code named Knights Corner), into the official Intel instruction set in a way that can be utilized in processors as well. Intel AVX-512 offers some improvements and refinement over the 512-bit SIMD found on Knights Corner that I've seen bring smiles to compiler writers and application developers alike. This is done in a way that offers source code compatibility for almost all applications with a simple recompile or relinking to libraries with Knights Landing support.

Intel AVX-512 Instruction encodings

Intel AVX instructions use the VEX prefix while Intel AVX-512 instructions use the EVEX prefix which is one byte longer. The EVEX prefix enables the additional functionality of Intel AVX-512. In general, if the extra capabilities of the EVEX prefix are not needed then the AVX2 instructions can be used, coded using the VEX prefix saving a byte in certain cases. Such optimizations can be done in compiler code generators or assemblers automatically.

Emulation for Testing, Prior to Product

In order to help with testing of support, before Knights Landing is available, the Intel® Software Development Emulator (Intel® SDE) has been extended for Intel AVX-512 and is available at http://software.intel.com/en-us/articles/intel-software-development-emulator.

Innovation Beyond Intel AVX-512

Intel AVX-512 foundation instructions will be included in all implementations of Intel AVX-512. Products may also include capabilities that extend Intel AVX-512 and have distinct CPUID bits for detection. Knights Landing will support three sets of capabilities to augment the foundation instructions. This is documented in the programmer’s guide; they are known as Intel AVX-512 Conflict Detection Instructions (CDI), Intel AVX-512 Exponential and Reciprocal Instructions (ERI) and Intel AVX-512 Prefetch Instructions (PFI). These capabilities provide efficient conflict detection to allow more loops to be vectorized, exponential and reciprocal operations and new prefetch capabilities, respectively.

Intel AVX-512 support

Release of detailed information on Intel AVX-512 helps enable support in tools and operating systems by the time products appear. We are working with both open source projects and tool vendors to help incorporate support. The Intel compilers, libraries, and analysis tools have, or will be updated, to provide first class Intel AVX-512 support.

Intel AVX-512 documentation

The Intel AVX-512 instructions are documented in the Intel® Architecture Instruction Set Extensions Programming Reference (see the "Getting Started" tab at http://software.intel.com/en-us/intel-isa-extensions). Intel AVX-512 is covered in Chapters 2-7; Chapters 5 and 6 detail the Intel AVX-512 foundation instructions while Chapter 7 details the capabilities that extend Intel AVX-512.

Figures/Tables for presentations from Xeon Phi Book

July 18, 2013 - 5:24pm

The figures, tables, drawings, etc. used in our book can be downloaded from the book's website. We appreciate attribution, but there are no restrictions on use in educational material (presentations)!

Suggested attribution: (c) 2013 Jim Jeffers and James Reinders, used with permission.

Code Examples from Xeon Phi Book

July 18, 2013 - 5:24pm

The code used in examples (Chapters 2-4) in our book can be downloaded from the book's website. We appreciate attribution, but there are no restrictions on use of the code - please use and enjoy! You can use the step-by-step instructions in the book or, if you prefer, we've included a Makefile for each chapter's examples to make life a little easier.

Question about ODI (on-die-interconnect) performance

July 8, 2013 - 2:49pm

Hi all,

I have read the book Intel Xeon Phi High Performance Programming, and in chapter 8 (Architecture) there was something that bothered me. To quote:

Core-to-core transfers are not "always" significantly better than memory latency times. Optimization for better core-to-core transfers has been considered, but because of ring hashing methods and the resulting distribution of addresses around the ring, no software optimization has been found that improves on the excellent built-in hardware optimization. No doubt people will keep looking! The architects for the coprocessor, at Intel, maintain searching for such optimizations through alternate memory mappings will not matter in large part because the performance of the on-die interconnect is so high.

In what situations do core-to-core transfers work poorly, and to what extent can we rely on the ODI? What if the communication is relatively fine grained, or, let's say, 4 different cores tried to communicate over the ODI at the same time? Will this have a major impact on ODI performance because it is ring based? How many simultaneous communications can the ODI handle?

CUDA support , any news ?

July 8, 2013 - 2:49pm

Is there any news or official statement regarding support to compile CUDA code with Intel C++ Composer in the future?

Thanks,

Armando

Free Intel C++ Compilers for Students, and related parallel programming tools.

July 8, 2013 - 2:49pm

I came across this offer - and thought it worth passing along...

Students at degree-granting institutions are eligible for free Intel C++ tools (and discounts on Fortran tools).

Linux, Windows and Mac OS versions available.

These are serious tools for achieving high-performance results with C++ programming through optimization, analysis and support for the latest standards.  They are interesting for advanced course work, or any time parallel programming is being done.  They aren't as interesting for beginning programming unless parallel programming is involved (it should be; it just isn't always).

This includes tools for processors as well as tools for the Intel Xeon Phi Coprocessor.

The impressive package of free tools includes:

  • Intel® C++ Composer XE, which includes:
    • Intel® C++ Compiler (highly optimizing compiler)
    • Intel® Math Kernel Library (high performance math library)
    • Intel® Threading Building Blocks (C++ tasking model, most popular C++ parallelism method)
    • Intel® Integrated Performance Primitives (multimedia primitives library)
  • Intel® Advisor XE (modeling proposed methods to parallelize code)
  • Intel® VTune™ Amplifier XE (non-intrusive performance analysis)
  • Intel® Inspector XE (advanced threading and memory debugging)

Go to the website to get the free student licenses, and then click on the students tab.

More detailed information (1 year length of license, who is eligible, etc.) is in the FAQ.

An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors

July 8, 2013 - 2:49pm


I have written a paper to explain programming for the Intel Xeon Phi coprocessor. The part that may surprise you is this: it's a paper focused on just doing parallel programming. Understanding how to restructure to expose more parallelism is critically important to enable the best performance on any device (processors, GPUs or coprocessors). Advice for successful parallel programming can be summarized as “Program with lots of threads that use vectors with your preferred programming languages and parallelism models.” This restructuring itself will generally yield benefits on most general-purpose computing systems, a bonus due to the emphasis on common programming languages, models, and tools that span these processors and coprocessors. I refer to this bonus as the dual-transforming-tuning advantage - an advantage you would lose by switching to a CUDA or OpenCL based solution.

Intel Xeon Phi coprocessors are designed to extend the reach of applications that have demonstrated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems and fully exploit available processor vector capabilities or memory bandwidth. For such applications, the Intel Xeon Phi coprocessors offer additional power-efficient scaling, vector support, and local memory bandwidth, while maintaining the programmability and support associated with Intel Xeon processors.

In my paper, I work to explain more fully the implications of such high levels of parallelism and the work needed to develop parallelism, while benefiting your application on processors as well.

I hope you find it useful.

[ NOTE: this is a BLOG version of a previous posting so that it shows up at http://tinyurl.com/inteljames ]

Programming for Multicore and Many-core Products including Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors

July 8, 2013 - 2:49pm
Programming for Multicore and Many-core Products including Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors (including language extensions for offloading to Intel® Xeon Phi™ coprocessors)

Abstract

The programming models in use today, used for multicore processors every day, are available for many-core coprocessors as well.  Therefore, explaining how to program both Intel Xeon processors and Intel Xeon Phi coprocessors is best done by explaining the options for parallel programming. This paper provides the foundation for understanding how multicore processors and many-core coprocessors are best programmed using a unified programming approach that is abstracted, intuitive and effective. This approach is simple and natural because it fits today's applications easily and yields strong results. When combined with the common base of Intel® architecture instructions utilized by Intel® multicore processors and Intel® many-core coprocessors, the result is performance for highly parallel computing with substantially less difficulty than with other, less intuitive approaches.

Programs that utilize multicore processors and many-core coprocessors have a wide variety of options to meet varying needs. These options fully utilize existing widely adopted solutions, such as C, C++, Fortran, OpenMP*, MPI and Intel® Threading Building Blocks (Intel® TBB), and are rapidly driving the development of additional emerging standards such as OpenCL* as well as new open entrants such as Intel® Cilk™ Plus.

Introduction

Single core processors are a shrinking minority of all the processors in the world. Multicore processors, offering parallel computing, have displaced single core processors permanently. The future of computing is parallel computing, and the future of programming is parallel programming.

The methods to utilize multicore processors have evolved in recent years, offering more and better choices for programmers than ever. Nothing exemplifies this more than the rapid rise in popularity of Intel TBB or the industry interest and support behind OpenCL.

At the same time that multicore processors and programming methods are becoming common, Intel is introducing many-core processors that will participate in this evolution without sacrificing the benefits of Intel architecture. Additional capabilities that are new with many-core processors are addressed in a natural and intuitive manner: Intel® many-core processors allow use of the same tools, programming languages, programming models, execution models, memory models and behaviors as Intel's multicore processors.

This paper explains the programming methods available for multicore processors and many-core processors with a focus on widely adopted solutions and emerging standards.

Parallel Programming Today

Since the goal of using Intel architecture in both multicore processors and many-core coprocessors is intuitive and common programming methods, it is important to first review where parallel programming for multicore stands today and understand where it is headed. Because of their common Intel architecture foundations, this will also precisely define the basis for parallel programming for many-core processors.

Libraries

Libraries provide an important abstract parallel programming method that should be considered before jumping into programming. Library implementations of algorithms including BLAS, video or audio encoders and decoders, Fast Fourier Transforms (FFT), solvers and sorters are important to consider. Libraries such as the Intel® Math Kernel Library (Intel® MKL) already offer advanced implementations of many algorithms that are highly tuned to utilize Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions (Intel® AVX), multicore processors and many-core coprocessors. A program can start to get these benefits by adding a single call to a routine in Intel MKL, which includes support for industry standard interfaces in both Fortran and C to the Linear Algebra PACKage (LAPACK). Standards, combined with Intel’s pursuit of high performance, make libraries an easy choice as the first preference in parallel programming.
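
For example, a single call can hand a matrix multiply to Intel MKL's tuned, threaded implementation. This is a minimal sketch assuming the standard CBLAS interface and the mkl.h header:

#include <mkl.h>
#include <vector>

int main() {
    const int n = 512;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major layout, no transposes.
    // MKL decides internally how to thread and vectorize the computation.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);
    return 0;
}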

When libraries do not solve specific programming needs, developers turn to programming languages that have been in use for many years.

None of the most popular programming languages were designed for parallel programming. This has brought about many proposals for new programming languages as well as extensions for the pre-existing languages. In the end, these experiences have led to the emergence of a number of widely deployed solutions for parallel programming using C, C++ and Fortran.

The most widely used abstractions for parallel programming are OpenMP (primarily C and Fortran), Intel Threading Building Blocks (primarily C++) and MPI (C, C++ and Fortran). These support a diverse range of processors and operating systems, making them truly versatile and reliable choices for programming.

Additionally, the native threading methods of the operating system are directly available for programmers. These interfaces, including POSIX threads (pthreads) and Windows* threads, offer a low level interface for full control but without the benefits of high level programming abstractions. These interfaces are essentially assembly language programming for parallel computing. Programmers have all but completely moved to higher levels of abstraction and abandoned assembly language programming. Similarly, avoiding direct use of threading models has been a strong trend that has accelerated with the introduction of multicore processors. This shift to program in “tasks” and not “threads” is a fundamental and critical change in programming habits that is well supported by the abstract programming models.

Most deployed parallel programming today is either done with one of the three most popular abstractions for parallelism (OpenMP, MPI or Intel TBB), or done using the raw threading interfaces of the operating system.

These standards continue to evolve and new methods are proposed. Today, the principal technical drivers of these evolutions are highly data parallel hardware and advancing compiler technology. Both of these driving forces are motivated by a strong desire to program at higher levels of abstraction so as to increase programmer productivity, leading to faster time-to-money and reduced development and maintenance costs.

Most Composable Parallel Programming Models

Learn more at http://Intel.com/go/parallel

For reasons explained in this paper, the most composable parallel programming methods are Intel TBB and Intel® Cilk™ Plus. They offer consistent advantages for effective abstract programming that yield performance and preserve programming investments. They provide recombinant components that can be selected and assembled into effective parallel programs. Even though they can be described and studied individually, they are best thought of as a collection of capabilities that are easily utilized both individually and together. This is incredibly important since it offers composability for mixing modules, including libraries. Both have self-composability, which is not the case for raw threading or for OpenMP. Uses of OpenMP and OpenCL have limitations on composability, but are still preferable to the use of raw threads. One of the benefits of Intel architecture multicore processors and many-core coprocessors is the strong support available for all these methods, offering the solution that best fits your current and future programming needs.

OpenMP

Learn more at http://openmp.org/wp/

In 1996, the OpenMP standard was proposed as a way for compilers to assist in the utilization of parallel hardware. Now, after more than a decade, every major compiler for C, C++ and Fortran supports OpenMP. OpenMP is especially well suited for the needs of Fortran programs as well as scientific programs written in C. Intel is a member of the OpenMP work group and a leading vendor of implementations of OpenMP and supporting tools. OpenMP is applicable to both multicore and many-core programming.

OpenMP dot product example, in Fortran:

!$omp do reduction(+: adotb)
do j = 1, n
   adotb = adotb + a(j) * b(j)
end do
!$omp end do

 

OpenMP summation (reduction) example, in C:

#pragma omp parallel for reduction(+: s)
for (int i = 0; i < n; i++)
    s += x[i];

 

In the future, the OpenMP specification will expand to standardize the emerging controls for attached computing often called “offloading” or “accelerating.” Today, Intel offers non-standard extensions to OpenMP called “Language Extensions for Offload” (LEO). The OpenMP committee is reviewing LEO as well as a set of non-OpenMP offload directives for GPUs known as OpenACC, with an eye towards convergence to serve both Intel Xeon Phi coprocessors and GPUs.

Intel TBB

Learn more at http://threadingbuildingblocks.org/

Intel introduced Intel® TBB in 2006 and the open source project for Intel TBB was started in 2007. By 2009, it had grown in popularity to exceed that of OpenMP in terms of number of developers using it (per research from Evans Data Corp: http://www.evansdata.com/research/market_alerts.php, and support in subsequent research reports as well). Intel TBB is especially well suited for the needs of C++ programmers, and since OpenMP is designed to address the needs of C and Fortran developers there is virtually no competition between Intel TBB and OpenMP.  It is worth noting that for C++ programmers, using OpenMP and Intel TBB in the same program is possible as well.

Parallel function invocation example, in C++, using Intel TBB:

parallel_for(0, n,
    [=](int i) { Foo(a[i]); });

 

The emergence of Intel TBB, which does not directly require nor leverage compiler technology, emphasized the value of programming to tasks and led the way for wide acceptance of using task-stealing systems. Compiler technology continues to evolve to help address parallel programming and led to the creation of the Intel Cilk™ Plus project. Increased use of compiler technology is better able to unlock the full potential of parallelism. Intel remains a leading participant and contributor in the Intel TBB open source project as well as a leading supplier of Intel TBB support and supporting tools. Intel TBB is applicable to multicore and many-core programming.

MPI

Learn more at http://Intel.com/go/mpi

For programmers utilizing a cluster, in which processors are connected by the ability to pass messages but not always the ability to share memory, the Message Passing Interface (MPI) is the most common programming method. In a cluster, communication continues to use MPI, as they do today, regardless of whether a node has many-core processors or not.

Today’s MPI based programs move easily to Intel Xeon Phi coprocessor based systems because the Intel coprocessors support ranks that can talk to other coprocessor ranks and multicore (e.g., Intel Xeon® processors) ranks. An Intel Xeon Phi coprocessor, like a multicore processor, may create as many ranks as the programmer desires. Such ranks communicate with other ranks regardless of whether they are on multicore or many-core processors.

Because Intel Xeon Phi coprocessors are general-purpose, MPI jobs run on the coprocessors. This is very powerful because no algorithmic recoding or refactoring is required to get working results from an existing MPI program.  The general capabilities of the coprocessors combined with the power of MPI support on the Intel Xeon Phi coprocessors produce immediate results in a manner that is intuitive for MPI programmers.
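
A minimal sketch of what such an MPI program looks like: the same source can be built for the host processors and for the coprocessor, and each rank simply reports where it runs.

#include <mpi.h>
#include <cstdio>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0, len = 0;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       // this rank's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);       // total ranks in the job
    MPI_Get_processor_name(name, &len);         // host or coprocessor name

    std::printf("rank %d of %d running on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}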

The widely used Intel® MPI library offers both high performance and support for virtually all interconnects. The Intel MPI library supports both multicore and many-core systems, creating ranks on multicore processors and many-core coprocessors in a fashion that is familiar and consistent with MPI programming today.

MPI, on Intel Xeon Phi coprocessors, composes with other thread models (e.g., OpenMP, Intel TBB, Intel® Cilk™ Plus) as has become common on multicore processor based systems.

Intel is a leading vendor of MPI implementations and tools. MPI is applicable to multicore and many-core programming.

Parallel Programming Emerging Standards

For data parallel hardware, the emergence of support for certain extensions to C and C++ offers important options for developers and addresses programmer productivity.

Intel® Cilk™ Plus

Learn more at http://cilk.com

Intel introduced Intel Cilk Plus in late 2010. Built on research from M.I.T. and product experience by industry leader Cilk Arts, Intel implemented support for task stealing in compilers for Linux* and Windows. Intel has published full specifications for Intel Cilk Plus to help enable other implementations, as well as optional usage of the Intel runtime or construction of interchangeable runtimes via API compliance. Intel is actively working with other compiler vendors to offer support in more compilers in the future. Intel is proud to be the leading supporter in industry of Intel Cilk Plus with products and tools.

Intel Cilk Plus provides three new keywords, special support for reduction operations, and data parallel extensions. The keyword cilk_spawn can be applied to a function call, as in x = cilk_spawn fib(n-1), to indicate that the function fib can execute concurrently with the subsequent code. The keyword cilk_sync indicates that execution has to wait until all spawns from the function have returned. The use of the function as a unit of spawn makes the code readable, relies on the baseline C/C++ language to define scoping rules of variables, and allows Intel Cilk Plus programs to be composable.

Parallel spawn in a recursive fibonacci computation, in C, using Intel Cilk Plus:

int fib(int n) {
    if (n < 2) return 1;
    else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return x + y;
    }
}

 

Cilk offers exceptionally intuitive and effective compiler support for C and C++ programmers. Cilk is very easy to learn and poised to be widely adopted. A regular “for” loop, without inter-loop dependencies, can be transformed into a parallel loop by simply changing the keyword “for” into “cilk_for.” This indicates to the compiler that there is no ordering among the iterations of the loop.

Parallel function invocation, in C, using Intel Cilk Plus:

cilk_for (int i = 0; i < n; ++i) {
    Foo(a[i]);
}

 

Cilk programmers still utilize Intel TBB for certain algorithms or features where new compiler keywords or optimizations are not needed, such as the thread aware memory allocator or a sort routine. Intel Cilk Plus is applicable to multicore and many-core programming.

C/C++ data parallel extensions

Learn more at http://cilk.com

Debate about how to extend C (and C++) to directly offer data parallel extensions is on-going. Implementations, experiences and adoption are important steps toward standardization. Intel has implemented extensions for fundamental data parallelism as part of Intel Cilk Plus for Linux, Windows and Mac* OS X systems. Intel is actively working with other compilers to offer support in the future. An intuitive syntactic extension, similar to the array operations of Fortran 90, is provided as a key element of Intel Cilk Plus and allows simple operations on arrays. The C/C++ languages do not provide a way to express operations on arrays. A programmer has to write a loop and express the operation in terms of elements of the arrays, creating unnecessary explicit serial ordering. A better opportunity exists to write a[:] = b[:] + c[:]; to indicate the per element additions but without specifying unnecessary serial ordering. These simplified semantics free up a compiler to always generate vector code instead of generating non-optimal scalar code.
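
A minimal sketch of the array notation described above (Intel Cilk Plus syntax, where a[start:length] denotes an array section):

// Per-element addition over array sections; no serial ordering is implied,
// so the compiler is free to generate vector code.
void vector_add(float *a, const float *b, const float *c, int n) {
    a[0:n] = b[0:n] + c[0:n];
}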

An additional method to avoid unintended serialization, allows a programmer to write a scalar function in standard C/C++ and declare it as an “elemental function.” This will trigger the compiler to generate a short vector version of that function, which instead of operating on a single set of arguments to the function, will operate on short vectors of arguments by utilizing the vector registers and vector instructions. In common cases, where the control flow within the function does not depend on data values, the execution of the short vector function can yield a vector of results in roughly the same time it would take the regular function to produce a single result.

Elemental functions, in C, using Intel Cilk Plus:

__declspec (vector)
void saxpy(float a, float x, float &y) {
    y += a * x;
}
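
A brief sketch of how such an elemental function might be invoked from a loop; the #pragma simd annotation is Intel Cilk Plus syntax asking the compiler to call the short-vector version across iterations.

void saxpy_all(float a, const float *x, float *y, int n) {
    #pragma simd
    for (int i = 0; i < n; ++i)
        saxpy(a, x[i], y[i]);   // compiler may use the short-vector version
}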

 

Intel is supporting these syntactic extensions for C and C++ with products and tools as well as discussions with other compiler vendors for wider support. C, C++ and data parallel extensions are applicable to multicore and many-core programming.

OpenCL*

Learn more at http://Intel.com/go/opencl

OpenCL was first proposed by Apple* and then moved to an industry standards body of which Intel is a participant and supporter. OpenCL offers a “close to the hardware” interface, offering some important abstraction and substantial control coupled with wide industry interest and commitment. OpenCL may require the most refactoring of any of the solutions covered in this whitepaper, specifically refactoring based on advanced knowledge of the underlying hardware. Results from refactoring work may be significant for multicore and many-core performance, and the resulting performance may or may not be possible without such refactoring. A goal of OpenCL is to make an investment in refactoring productive when it is undertaken. Solutions other than OpenCL may offer alternatives to avoid the need for refactoring (which is best done when based on advanced knowledge of the underlying hardware).

Simple per element multiplication using OpenCL:

kernel void dotprod(global const float *a,
                    global const float *b,
                    global float *c)
{
    int myid = get_global_id(0);
    c[myid] = a[myid] * b[myid];
}

 

Intel is a leading participant in the OpenCL standard efforts, and a vendor of solutions and related tools with early implementations available today. OpenCL is applicable to multicore, many-core and GPU programming although the code within an OpenCL program is usually separate or duplicated for each target. Intel currently ships OpenCL support for both Intel multi-core processors (using Intel SSE and Intel AVX instructions) and Intel® HD Graphics (integrated graphics available as part of many Third Generation Intel® Core™ processors).

Composability Using Multiple Models

Composability is an important concept. With multiple programming options to fit differing needs, it is essential that these methods not be mutually exclusive. The abstract programming methods discussed above can be mixed in a single application. By offering newer programming models that support composable programming, programmers are freed from subtle and unproductive limitations on the mixing and matching of programming methods in a single application.

The most composable methods are Intel TBB and Intel Cilk Plus (including the C/C++ data parallel extensions). Use of OpenMP and OpenCL have limitations on composability, but are still preferable to the use of raw threads.

Intel TBB and Intel Cilk Plus provide recombinant components that can be selected and assembled for effective parallel programs. This is incredibly important since it offers composability for mixing modules including libraries. Both have self-composability, which is not the case for threading and for OpenMP or OpenCL.

Harnessing Many-core

Combining the power of both multicore and many-core, and utilizing them together, offers enormous possibilities.

Intel Xeon Phi coprocessors are designed to offer power efficient processing for highly parallel work while remaining highly programmable. Platforms containing both multicore processors and Intel Xeon Phi coprocessors can be referred to as heterogeneous platforms. Such a heterogeneous platform offers the best of both worlds, multicore processors that are very flexible to handle general-purpose serial and parallel workloads as well as more specialized many-core processing capabilities for highly parallel workloads. A heterogeneous platform can be programmed as such and utilize a programming model to manage copying of data and transfer of control.

Applications are still built with a single source base. The versatility of  Intel architecture multicore processors and many-core coprocessors allows for programming that is both intuitive and effective.

Explicit vs. Implicit use of Many-core

Many-core processors may be used implicitly through the use of libraries, like Intel MKL, by provisioning code to detect and utilize many-core processors when present. Explicit controls for Intel libraries are available to the developer, but the simple approach of relying on a library to decide if and when to use the attached many-core coprocessors can be quite effective.

Additional programming opportunities are possible by explicit directions from the programmer in the source code. Writing an application to explicitly utilize many-core is done by writing a heterogeneous program. This program would consist of writing a parallel application and splitting the work between the multicore processors and many-core coprocessors.

Even with explicit control, Intel has designed the extensions to be flexible enough to work if no many-core processors are present and to also be ready for a converged future.  These two benefits are incredibly important. First, a single source program can provide direction to offload to an Intel Xeon Phi coprocessor. However at runtime, if the coprocessor is not present on the system being utilized, the use of  Intel architecture on both the multicore processors and many-core coprocessors means that the code available for offloading to a coprocessor can be executed seamlessly on either type of processor.

Offloading

The reality of today’s hardware is that a heterogeneous platform contains multiple distinct memory spaces, one (or more in a cluster) for the multicore processors and one for each many-core processor. The connection between multicore processors and many-core coprocessors can be a bottleneck that needs some consideration.

There are two approaches to utilizing such a heterogeneous platform. One approach treats the memory spaces as wholly distinct, and uses offload directives to move control and data to and from the multicore processors. Another approach simplifies data concerns by utilizing a software illusion of shared memory called MYO to allow sharing between multicore processors and many-core coprocessors that reside in a single system or a single node on a cluster. MYO is an acronym for “Mine Yours Ours” and refers to the software abstraction that shares memory within a system for determining current access and ownership privileges.

The first approach exposes completely that the multicore processors and many-core coprocessors do not share memory. Compiler support of directives for this execution model frees the programmer from specifying the low-level details of the system, while exposing the fundamental property that the target is heterogeneous and leaving the programmer to devote their time to solving harder problems.

Simple offload, in Fortran:

!dir$ offload target(MIC1)
!$omp parallel do
do i = 1, 10
   A(i) = B(i) * C(i)
enddo
!$omp end parallel do

 

The compiler provides a pragma for offload (#pragma offload) that a programmer can use to indicate that the subsequent language construct may execute on the Intel Xeon Phi coprocessor. The pragma also offers clauses that allow the programmer to specify data items that would need to be copied between processor and coprocessor memories before the offloaded code executes. The clauses also allow the developer to specify data that should be copied back to multicore processor memory afterwards. The offload pragma is available for C, C++ and Fortran.

Simple offload, in C, with data transfer:

float *a, *b;
float *c;

#pragma offload target(MIC1) in(a, b : length(s)) out(c : length(s) alloc_if(0))
for (i = 0; i < s; i++) {
    c[i] = a[i] + b[i];
}

 

An alternate approach is a run time user mode library called MYO. MYO allows synchronization of data between the multicore processors and an Intel Xeon Phi coprocessor and, with compiler support, enables allocation of data at the same virtual addresses. The implication is that data pointers can be shared between the multicore and many-core memory spaces. Copying of pointer based data structures such as trees, linked lists, etc. is supported fully without the need for data marshaling. To use the MYO capability, the programmer will mark data items that are meant to be visible from both sides with the _Cilk_shared keyword, and use offloading to invoke work on the Intel Xeon Phi coprocessor. The statement x = _Offload func(y); means that the function func() is executed on the Intel Xeon Phi coprocessor, and the return value is assigned to the variable x. The function may read and modify shared data as part of its execution, and the modified values will be visible for code executing on all processors.
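
A hypothetical sketch following the keywords described above (the exact compiler syntax for shared data and offloaded functions may differ):

// Data marked _Cilk_shared is visible to code on the processor and on the
// Intel Xeon Phi coprocessor at the same virtual address.
_Cilk_shared float data[1000];

_Cilk_shared float sum_data(int n) {   // function usable on either side
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += data[i];
    return s;
}

// Run sum_data() on the coprocessor; the return value is assigned on the host.
float x = _Offload sum_data(1000);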

The offload approach is very explicit and fits some programs quite well. The MYO approach is more implicit in nature and has several advantages. MYO allows copying of C++ classes without marshaling, which is not supported using offload pragmas. Importantly, MYO does not copy all shared variables upon synchronization. Instead, it only copies the values that have changed between two synchronization points.

These offload programming methods, while designed to allow control to direct work to many-core processors, are applicable to multicore and many-core programming so as to allow code to be highly portable and long lasting even as systems evolve. Source code will not need to differ for systems with and without many-core processors.

Additional Offload Capabilities

Both the keyword and pragma mechanisms copy data when work is invoked on the Intel Xeon Phi coprocessor. Future directive options will allow data copying to be initiated ahead of the computation that needs it, so that other work can be scheduled while the data is being copied.
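As a flavor of what such an option can look like, here is a hedged sketch using the asynchronous transfer directives that later Intel compilers expose; the offload_transfer, signal and wait spellings, and the variable names, are assumptions for illustration rather than the exact directive set described above:

float *in_data;      /* assume allocated and filled elsewhere */
int n;               /* element count, assumed set elsewhere  */
char sentinel;       /* tag that pairs the transfer with the later wait */

void start_copy(void) {
    /* Begin copying in_data to the coprocessor and return immediately,
       leaving the processor free to do other work during the transfer. */
    #pragma offload_transfer target(MIC1) in(in_data : length(n) alloc_if(1) free_if(0)) signal(&sentinel)
}

void run_kernel(void) {
    /* Offload the computation only after the earlier transfer completes. */
    #pragma offload target(MIC1) wait(&sentinel) nocopy(in_data : length(n) alloc_if(0) free_if(1))
    {
        /* ... compute using in_data on the coprocessor ... */
    }
}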

Since systems may be configured with multicore processors and more than one Intel Xeon Phi coprocessor per node, additional language support will allow the programmer to choose between forcing offloading and allowing a run-time determination to be made. This option offers the potential for more dynamic and better-informed decisions depending on the environment.
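A minimal sketch of leaving the decision to the run time might look like the following; the if() specifier and the threshold are illustrative assumptions:

void scale(float *a, int n) {
    /* Offload only when the problem is large enough to amortize the data
       movement; otherwise the loop simply runs on the multicore processor. */
    #pragma offload target(MIC1) if(n > 100000) inout(a : length(n))
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}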

Standards

By utilizing Intel architecture instructions on both multicore and many-core, programming tools and models are best able to serve both. With insight and the right models, a single source base can be constructed, in familiar and current programming languages, that is well equipped to utilize multicore processor systems, heterogeneous systems and future converged systems in an intuitive and effective manner.

Tried-and-true solutions, including C, C++, Fortran, OpenMP, MPI and Intel TBB, apply to these Intel architecture multicore and many-core systems.

Emerging efforts including Intel Cilk Plus, offload extensions and OpenCL are strongly supported by Intel, and are poised for broader adoption and support in the future.

The path to standardization starts with strong products and published specifications, and progresses through adoption by users (customers) and support from additional vendors; viable standards follow. With OpenCL, Intel Cilk Plus and the offload extensions, the product support and specifications exist and customer usage is well under way. It is reasonable to expect that wider support, and standards refined by user experience, will follow.

Summary

By utilizing Intel architecture and industry-standard programming tools, we can offer parallel programming methods that apply across both multicore processors and many-core coprocessors. These methods can employ a single source code base using familiar tools, programming languages and previous source code investments. Current and emerging solutions allow applications to grow into a single code base that best utilizes multicore processors and many-core coprocessors together. The methods available to utilize multicore and many-core parallelism offer performance while preserving investments and remaining intuitive to program.

Standards play an important role in programming methods. Intel has invested heavily to support and implement standard programming models and methods. In addition, Intel has been a leader in the evolution of standards to solve new challenges.

When programming for Intel Xeon Phi coprocessors, developers can harness the power of the coprocessors in a maintainable, performant application that is highly portable, scales to future architectures, and fully supports multicore systems with the same code.

Clusters of multicore processors and many-core coprocessors, organized in nodes, will be able to take advantage of this very rich set of tools and programming models available for Intel architecture in an intuitive, maintainable and effective manner.

The ability to utilize existing developer tools and standards, deliver performance, and offer flexibility puts Intel multicore and many-core solutions in a class of their own.

About the Author

James Reinders, Director, Software Evangelist, Intel Corporation

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the systolic array systems WARP and iWarp, the world's first TeraFLOP/sec supercomputer (ASCI Red), and the world's first TeraFLOP/sec single-chip computing device, known at the time as Knights Corner and now as the first Intel® Xeon Phi™ coprocessor, as well as compilers and architecture work for multiple Intel® processors and parallel systems. James has been a leader in the emergence of Intel as a major provider of software development products, and serves as their chief software evangelist. James is the author of “Intel Threading Building Blocks” from O'Reilly Media, which has been translated into Japanese, Chinese and Korean. James is coauthor of “Structured Parallel Programming,” ©2012, from Morgan Kaufmann Publishing. James has published numerous articles and contributed to several books. James received his B.S.E. in Electrical and Computing Engineering and M.S.E. in Computer Engineering from the University of Michigan.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.Intel.com/design/literature.htm

Intel, the Intel logo, Cilk, VTune, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

 Copyright© 2012 Intel Corporation. All rights reserved.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

 

An Overview of Programming for Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors

July 8, 2013 - 2:49pm


I have written a paper to explain programming for the Intel Xeon Phi coprocessor. The part that may surprise you is this: it's a paper focused on just doing parallel programming. Understanding how to restructure to expose more parallelism is critically important to enable the best performance on any device (processors, GPUs or coprocessors). Advice for successful parallel programming can be summarized as “Program with lots of threads that use vectors with your preferred programming languages and parallelism models.” This restructuring itself will generally yield benefits on most general-purpose computing systems, a bonus due to the emphasis on common programming languages, models, and tools that span these processors and coprocessors. I refer to this bonus as the dual-transforming-tuning advantage - an advantage you would lose by switching to a CUDA or OpenCL based solution.

Intel Xeon Phi coprocessors are designed to extend the reach of applications that have demonstrated the ability to fully utilize the scaling capabilities of Intel Xeon processor-based systems and fully exploit available processor vector capabilities or memory bandwidth. For such applications, the Intel Xeon Phi coprocessors offer additional power-efficient scaling, vector support, and local memory bandwidth, while maintaining the programmability and support associated with Intel Xeon processors.

In the paper, I work to explain more fully the implications of such high levels of parallelism and the effort needed to expose it, while showing how that effort benefits your application on processors as well.

I hope you find it useful.


In addition to this paper, there is a succinct document that explains the same concepts in a "flowchart format" and links to additional resources: http://software.intel.com/en-us/articles/is-the-intel-xeon-phi-coprocessor-right-for-me

A Parallel Programming training opportunity (Xeon processors and Xeon Phi coprocessors)

July 8, 2013 - 2:49pm

SC12 is underway, and the opening gala is tonight.  Drop by our booth at the opening (7pm) and check out our amazing space and see what new things we have to share!

You may also visit Colfax's booth to learn about some work they are doing on Parallel Programming.  I understand they'll have classes next year available, which will cover programming CPUs and Intel Xeon Phi coprocessors.  It's all about PARALLELISM.  I've seen what material they are working on - and it is definitely worth considering!

As I relate in my book, parallel programming is critical to extracting performance from all machines for the supercomputing community. As part of our support for parallelism in programming, and with our broad community of partners helping us out, Colfax International is spearheading training (http://www.colfax-intl.com/DL_Documents/Colfax-XeonPhi-Training-PR.pdf) for the Intel Xeon and Intel Xeon Phi family of products; they'll be at the show in booth 2409. Colfax also has more information online at http://www.colfax-intl.com/DL_Documents/Colfax-XeonPhi-Training-PR.pdf

Salt Lake City... SC12... Join us in Intel booth (#2601) for interesting talks! Ask me about new books...

July 8, 2013 - 2:49pm

I hope you can join us at SC’12 in Intel booth #2601.

I'm looking forward to seeing many of you again at SC'12 this year. We're kicking off the show at the Grand Opening Gala at 7:00 p.m., where we'll have a number of special guests joining us for a show floor presentation on our efforts in parallelization, then and throughout the week. I'll be carrying around a pre-print version of an upcoming book (it is more than half done) for programming the Intel Xeon Phi coprocessor (hint: it's all about parallel programming). For now, I am telling myself that I will work on finishing a few chapters while I'm at the show too.  Well, I can think that way until all the excitement keeps me too busy!

Parallel programming is critical to extracting performance from all machines for the supercomputing community. Naturally, we'll focus on parallelism, and we have a broad community of partners helping us out. But it's not just about technology. It's about enabling us all to effectively utilize the technology. On that front, we've got compelling panels and speakers talking about efforts to get more people into high-performance computing.

Here is a sample of the speakers presenting in our Architecture for Discovery theater during the show:

  • Scott Alexander, BlackSky Computing
  • Bjorn Andersson, Hitachi
  • Jay Boisseau, TACC
  • Barry Bolding, Cray
  • Glenn Brook, National Institute of Computational Sciences
  • Bill Bryce, Univa
  • Ümit V. Çatalyürek, The Ohio State University
  • Jack Dongarra, University of Tennessee
  • Mathieu Dubois, Bull
  • Kai Dupke, SUSE
  • Jeff Falkanger, IBM
  • Michael Feldman hosting a Workforce Development/STEM Panel
  • Shai Fultheim, ScaleMP
  • Wolfgang Gentzsch and Burak Yenier, Uber Cloud project
  • Dr. Goh, SGI
  • Chad Harrington, Adaptive Computing
  • Yutaka Ishikawa, University of Tokyo
  • Vadim Karpusenko, Colfax
  • Rishi Khan, ETI
  • David Lecomber, Allinea
  • Eric Lenquiniou, Altair
  • Tom Leyden, Amplidata
  • Thomas Lippert, DEEP
  • Principal Lazaro Lopez, Wheeling High School 
  • John Malcolm, AccelerEyes
  • Glen Otero, Dell
  • Nikolay Piskun, RogueWave
  • Bert Shen and Akira Sano, Supermicro
  • Addison Snell hosting a Missing Middle Panel

And many more. Just stop by booth #2601 to check on times. I'll be introducing at least a few of these talks personally.

OpenMP 4.0 may offer important solutions for targeting and vectorization

July 8, 2013 - 2:49pm

The upcoming OpenMP 4.0 will be discussed at SC12, and there are a number of additions I'm particularly excited to see coming from OpenMP: the "SIMD extensions" and the "targeting extensions." One helps ensure that a developer's intention to have code vectorized efficiently is actually realized; the other, for the first time in an industry standard, allows code and data to be designated for execution on an attached device. The specification for the "targeting extensions" is available now from OpenMP to encourage comment before full standardization; it is titled OpenMP Technical Report 1 on Directives for Attached Accelerators and will be discussed along with other future OpenMP features at their SC12 BoF.

Both are worthy problems to see solved by the OpenMP standards body, which brings together many vendors and users. OpenMP gathers representatives from across the industry with many points of view to ensure standards that give developers a chance to write code spanning multiple architectures, while giving hardware vendors a chance to have their offerings well supported.

The "SIMD extensions" are more powerful and better specified than the vague but commonly supported "IVDEP" pragma. Many compilers support "IVDEP" but the definition of what that means, and what guarantees it gives vary a lot. This is a perfect place for a standards body to provide a more consistent and guaranteed approach.  Most commonly, "IVDEP" tells a compiler that it can ignore intra-loop assumed dependencies which may be just the trick needed for the compiler to vectorize a loop. This carries two flaws: it does not tell the compiler that it must vectorize, and it does not help vectorize code when assumed loop carried dependencies are not the barrier. The new "SIMD extensions " in OpenMP 4.0 will address these flaws and offer an industry standard approach to telling the compiler that it must vectorize a loop, and to allow some finer grained control on how it is vectorized. “IVDEP” pragmas did not have rich sets of clauses to give control like the “SIMD extensions” will have. We can all learn more with the release of the "SIMD extensions " proposal at SC12.  I say "proposal" because OpenMP will be releasing for public comment the pieces that will probably become OpenMP 4.0. Discussions will occur at the OpenMP BoF at SC12 (information at the end of my blog on that).

The "targeting directives" also fill a strong need for addressing the problem of "offloading" code and data to an attached device. These have been called "accelerator directives" by PGI and others, "offload directives" by Intel, and now "targeting directives" by OpenMP. OpenMP took a general and inclusive approach in order to let code span a great variety of devices. This resembles the commitment we've seen with groups such as Khronos OpenCL efforts to be inclusive. OpenMP has consistently over the years been inclusive with two objectives: (1) be able to have code written that shows off any given hardware to its potential, (2) have code be portable as much as possible to not require rewriting for each piece of hardware.

The "targeting directives" intended for OpenMP 4.0, have the challenge of spanning the likes of NVidia GPUs (SIMT oriented), and Intel Xeon Phi coprocessors (SMP-on-a-chip), and Intel HD Graphics (vector oriented GPUs), other GPUs (like AMD), and other potential attached processing devices and future ones as well. It is not easy to bring multiple companies together to find common ground.  In the interim, NVidia has been happy with OpenACC designed for their GPUs and supported during the prior year, and Intel has had its offload directives for Intel Xeon Phi coprocessors for about two years now. Each demonstrated capabilities, which can inspire and inform an inclusive standard. Now OpenMP will share a specification that OpenMP believes does that, and ask for input and comment from users and implementers alike. No doubt, there is work remaining to be done - but the result is well worth the work. Neither OpenACC nor Intel's offload directives come close to the inclusion a standard requires to be useful.  OpenMP has done well to bring us closer to that goal. OpenMP’s open process has published their current draft for comment as http://www.openmp.org/mp-documents/TR1_167.pdf 

OpenMP has a BoF at SC12 on Tuesday, Nov 13th, 5:30 - 7:00pm in Room 355-A (click this link to check the SC12 official site in case it changes). I will be in another BoF sharing thoughts on directives for accelerators (same time, different room). I'll be talking about "targeting directives" there as part of comments on what we should aspire to in order for developers to be able to truly code with confidence, performance and portability. I look forward to the conversations and interactions we'll have at SC12.

OpenMP is fifteen years old, and apparently that means cake at 12:30pm and beer at 3:30pm on Tuesday and Wednesday at their booth at Supercomputing according to their website.  I hope I can drop by to sample at least some beer.

Intel has been part of OpenMP, along with other founders, for all fifteen years. I was one of a handful of people involved at Intel from the earliest days, even before formation. I’ve been proud to be part of supporting OpenMP over the years with Intel’s leadership compilers. OpenMP 4.0 should be no different. Users have told us how much they want better vectorization, and a standard for targeting devices well. We plan to move quickly to support OpenMP 4.0, as we did for OpenMP 3.0, OpenMP 2.0 and OpenMP 1.0.  It has been an effective and valuable standard for compiler writers and compiler users alike.

Rogue Wave tools support Intel Xeon Phi coprocessors

July 8, 2013 - 2:49pm

Rogue Wave Software recently announced expansion of their support for Intel Xeon Phi coprocessors, which now includes their SourcePro® C++, IMSL® Numerical Libraries, TotalView® debugger, and ThreadSpotter™ cache memory optimizer products. You can check out their press release for details.

Scott Lasica, VP Products and Alliances at Rogue Wave Software, helped me understand this value by sharing with me: "When we started porting TotalView to run on a Xeon Phi coprocessor, we progressed in a week what took us over a year with an accelerator." The common x86 programming model, Linux-based environment and suite of Intel tools allowed them to support the Intel Xeon Phi coprocessor in a few weeks rather than the many months it would have taken without those advantages.

I'm pleased to have Rogue Wave offering solutions for our customers for Intel Xeon Phi coprocessor development. Many of our customers have expressed to me personally how happy they are to have Rogue Wave tools supporting Intel Xeon Phi coprocessors.

As with other tools, a key benefit is that these are NOT new tools or wildly different add-ons to their tools. Instead, they are familiar tools with familiar features. They simply extend their x86 support to include the "many" core aspects of Intel Xeon Phi coprocessors, including the more than 50 cores and the 512-bit-wide SIMD. This is very similar to supporting a 50+ core shared memory machine. That's why application developers and tool developers are finding it straightforward to work with Intel Xeon Phi coprocessors. I call this the dual-tuning/optimization benefit of Intel Xeon Phi coprocessors: the ability of programming languages, models and tools to benefit both processors and Intel Xeon Phi coprocessors. That preserves past investments in coding and training, as well as investment today in both. That's an advantage that makes the tough job of programming just a little easier.

TACC symposium and programming two SMP-on-a-chip devices

April 26, 2012 - 8:28pm
Real results for many-core processors illustrate the power of a familiar configuration (SMP) even when reduced to a single chip. An SMP on a chip can run the same applications with the same tools, offer the same flexibility, and pose familiar challenges that are solved by familiar techniques and skills.



I recently attended a symposium, co-sponsored by TACC and Intel, at the Texas Advanced Computing Center (TACC) in Austin, where the programming of two many-core devices was discussed. One was a research chip designed to push some limits and allow interesting research on a device that lacks many things a product would require; the research chip is known as Intel's Single-Chip Cloud Computer (SCC). The other many-core device was a prototype of our new Intel Many Integrated Core (MIC) Architecture, the Knights Ferry coprocessor. The deadline for papers precluded inclusion of results from pre-production Knights Corner coprocessors, which will be the first Intel MIC coprocessor products. There was a lot of whispering in the hallways about the excitement of starting work with Knights Corner coprocessors.

The papers, and the half-day tutorial, at the “TACC-Intel Highly Parallel Computing Symposium” all had strong elements relating to familiar parallel programming challenges: scaling and vectorization. This is because both devices are built from Intel Pentium processor cores hooked together by a connection fabric on the same piece of silicon.

Simply put, they are both SMP on-a-chip (symmetric multi-processors) devices, with somewhat different design goals.

At Intel, we have been convinced that putting a familiar, generally programmable SMP on a chip is a good idea. Its familiar programmability proves to have many benefits. SCC was built for research into many facets of highly parallel devices. Knights Corner is designed for production usage and is optimized for power and highly parallel workloads. Knights Corner is well suited for HPC applications that already run on SMP systems. Presenter after presenter who talked about using the prototype Knights Ferry mentioned how applications “just worked."

I like to say, “Programming is hard, and so is parallel programming.” It follows that making an SMP or an SMP on-a-chip get maximum performance may not quite be rocket science, but it is no walk in the park. So, there was plenty of room for the papers to discuss the challenges of tuning for any SMP system.

What was really striking was how optimizations for Knights Ferry co-processors were applicable to SMP systems in general. Several authors commented on how their work to get better scaling or better vectorization for Knights Ferry also improved the performance of the same code compiled to run on an Intel Xeon processor based SMP system.  This performance-reuse is very significant, and one presenter exclaimed “Time spent optimizing for MIC is time well spent because it optimizes your code for non-MIC processors at the same time.”

All the papers and presentations (including my keynote) are available on-line now at http://www.tacc.utexas.edu/ti-hpcs12/program

Here are some notes from a few of the talks:

Dr. Robert Harkness gave an engaging talk entitled “Experiences with ENZO on the Intel Many Integrated Core Architecture.” I enjoyed his comment that “we are always programming for the future” because they “never have enough compute power.” He looked at multiple programming models, but had the best results using the “dusty” MPI-based program that he had running on an SMP before Knights Ferry. He did his work on MPICH 1.2.7p1 because Intel did not supply an MPI with the Knights Ferry systems. He said it was obsolete but very easy to build and use. He reported that one person (not a dedicated programmer) was able to build everything (a quarter million lines of code) in a single week without any application source code modifications at all; the week, it seems, was spent hunting down libraries and recompiling them, including MPICH. His results scaled very well.



His conclusions (from slide 30 of his presentation) were: “Intel MIC is the best way forward for large-scale codes which cannot use the existing GPGPU model (even with directives).”



A talk by Theron Voran, with the National Center for Atmospheric Research, looked at using Knights Ferry for climate science. He started by saying "We have large bodies of code laying around. We don't want to rewrite in new languages for restrictive architectures." He had several good introduction slides, including a comparison of accelerators versus multicore and many-core devices.



Here the challenges of vectorization offered opportunities for future work. Compiler hints, loop restructuring and related activities should enhance performance on Xeon-based and MIC-based SMP systems, as should work on improving scalability across more and more cores. Even with these challenges, the authors noted “Relative ease in porting codes” (recompiling) and the belief that the computational capabilities of MIC will be worthwhile.

Ryan Hulguin, with the University of Tennessee, looked at CFD solvers on Knights Ferry. He looked at two methods, one based on Euler equations (for inviscid fluid flows) and another based on the BGK model Boltzmann equation (for rarefied gas flows). Performance results showed OpenMP to be effective on Knights Ferry, and that the SMP programming challenges of vectorization and having good concurrency held true on Knights Ferry as well.



A talk on dense linear algebra factorization from David Hudak at the Ohio Supercomputer Center addressed heterogeneous programming challenges. David is a Wolverine working in a Buckeye world; my heart goes out to him. I really enjoyed his separation of the short-term issues that distract us from the real long-term challenges that will stay with us.



The talk compared a QR factorization implemented in OpenMP with a Cilk Plus implementation. Both performed well. The authors emphasized that the guidance to vectorize and to use lots of tasks proved to work.



I’ve written more than I set out to write, so I’ll stop here. The SCC related papers were very interesting as well, ranging from Tim Mattson’s overview of the program to papers showing research results from investigations using SCC. The other MIC related papers are all worthy as well, including an excellent paper on early experiences with MVAPICH2 doing Intra-MIC MPI communication. Amazing things you can do on an SMP on-a-chip… it runs a real Linux after all!

It is very common for demos to start with an ‘ssh’ (shell) to one of the Knights Ferry processors… and then running the application natively from the command line. SMP on-a-chip, indeed.  Too bad I can’t convince Intel to name it that.  Even if I did, it would probably be chipSMP™ model 8650plus XS. Nevermind, Knights Corner is fine by me.

The papers and talks can be found at http://www.tacc.utexas.edu/ti-hpcs12/program

Intel Software Network at SIGGRAPH 2009

March 13, 2012 - 6:57pm

Quick chat about MIC architecture with Mike Dewar, NAG

November 17, 2011 - 3:37pm

I ran into Mike Dewar at SC11 today as the exhibition draws to a close. Mike is the CTO of NAG Ltd. - a company we've had the good fortune to work with for years.

NAG is one of a handful of companies that have been providing feedback on our Knights Ferry (prototype MIC architecture).

Mike told me: "We found porting existing routines from the NAG Library to the Intel Many Integrated Core Architecture (MIC) to be a relatively quick and painless process. The team was impressed at the way Intel has extended their existing software tools to support the MIC environment, allowing them to work in a familiar and productive environment."

I quizzed Mike on what it took to get it running on Knights Ferry, and he did share the one kind of tuning they had to explore. They use OpenMP, which on processors has generally meant something like 10-20 threads, versus the 120 threads they use on Knights Ferry. I'll have to write more about that later - scaling and vectorization are keys as multicore and many-core grow. No mystery there. The good news is that their use of OpenMP made this a straightforward challenge they understood. It was not a mystery to them. They also make good use of MKL in their library, and of course we support that for the MIC architecture.

It is great to know that NAG users will have the opportunity to continue using NAG software with Knights Corner.

     
