Notes From a Parallel Universe

Sept. 5, 2017

A frequent request I hear is to clarify processor nomenclature on our roadmap. To be fair, it is confusing, with internal code names, marketing names, platform names, and more. Here’s an attempt to lay it out from our perspective as a rugged COTS board vendor.

Note that some forward-looking details are covered by our NDAs with the chip vendors, and those are not included here. Everything here is publicly available, with all that implies (you know you can’t believe everything on the web, right?). Naming is only part of the story: changes in processor microarchitecture and parallelism become ever more important when trying to get the maximum performance out of a compute platform, so we’ll have a look at what is going on there too.

The current state of the art in the Intel world is represented by 7th Gen Kaby Lake and 8th Gen Kaby Lake R(efresh), the latter announced in August 2017. Note that the 8th Gen parts announced so far are system-on-chip devices intended for the laptop market: compelling low power (15W TDP), but no ECC memory, a prerequisite for most mission-critical applications. Expect to see other 8th Gen devices announced this year, possibly some from the Coffee Lake portfolio, blurring the lines between architecture updates and device “generation” and making things yet more complicated.

Intel has formally announced the availability of AVX-512, the next significant speed bump for vectorizable code. It has been available for some time on the Xeon Phi family, is now available on the Xeon Processor Scalable Family, and is widely expected to make its way into embedded processors in the not-too-distant future. Because its 512-bit vector registers are twice as wide as the 256-bit registers of AVX2, it will, all else being equal, double the peak theoretical FLOPS over CPUs with AVX2.
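To make the doubling concrete, here is a minimal sketch of the usual peak-FLOPS arithmetic. The function names are made up for illustration, and the inputs (two fused multiply-add units per core, four cores, a 2.0 GHz clock) are assumptions rather than the specification of any particular part. Sustained clock rates under heavy vector load may also differ from the nominal figure.

```c
#include <assert.h>

/* Peak floating-point operations per clock cycle for ONE core.
 *   vector_bits  - vector register width (256 for AVX2, 512 for AVX-512)
 *   element_bits - width of one element (32 for float, 64 for double)
 *   fma_units    - fused multiply-add units per core (ASSUMPTION: this
 *                  varies by device; check the vendor data sheet)
 * Each FMA counts as two operations (one multiply, one add) per lane. */
int peak_flops_per_cycle(int vector_bits, int element_bits, int fma_units)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits;
    return lanes * 2 * fma_units;
}

/* Peak theoretical GFLOPS for a whole device: per-core, per-cycle rate
 * multiplied by the core count and the clock rate in GHz. */
double peak_gflops(int flops_per_cycle, int cores, double clock_ghz)
{
    return (double)flops_per_cycle * cores * clock_ghz;
}
```

With single-precision data and two FMA units assumed, `peak_flops_per_cycle(256, 32, 2)` gives 32 for AVX2 and `peak_flops_per_cycle(512, 32, 2)` gives 64 for AVX-512, so the ratio is two regardless of core count or clock, provided those stay the same.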

One thing that becomes apparent when looking at the trends is that, to extract maximum performance from newer devices, applications must be written to exploit an increasing number of cores and threads. Another is the growing dependence on the AVX/AVX2/AVX-512 vector engines to compensate for the slowing growth in clock rates. That’s good news for those of us who understand how to vectorize and thread code, but that doesn’t include everyone.

Solutions

How, then, to extract that performance without relying on very specific skill sets? Fortunately, there are some solutions. Firstly, you can call upon math libraries that are written to use the available vector instructions and to launch multiple threads across cores, such as Abaco’s AXISLib or Intel’s Math Kernel Library (MKL). These work great if you are willing to modify source code to manually replace looped algorithms with calls to the library.

If that is still not acceptable, all is not lost. Vectorizing compilers have been around for decades, initially in the realm of supercomputing, but increasingly in embedded and now mainstream computing. I first used them some 20 years ago. The latest ones will take your standard C/C++ code and look for opportunities to automatically replace compute loops with vector code and, in some cases, to spread the work across cores using a threading model such as OpenMP.
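As a rough illustration of the kind of loop a vectorizing compiler handles well, here is a sketch in standard C with OpenMP directives. The function names are invented for this example, and the directives are hints rather than guarantees: a compiler built without OpenMP support simply ignores them and the code still runs correctly as scalar loops.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * Every iteration is independent and memory is accessed with unit
 * stride, which is the easy case for automatic vectorization. The
 * restrict qualifiers promise that x and y do not overlap. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Dot product. The reduction clause tells the compiler that partial
 * sums may be accumulated per thread and per vector lane and then
 * combined at the end, so the loop can be both threaded and vectorized. */
float dot(size_t n, const float *restrict x, const float *restrict y)
{
    assert(x != NULL && y != NULL);
    float sum = 0.0f;
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

With GCC, something like `gcc -O3 -march=native -fopenmp` enables both the vectorizer and the OpenMP directives. Because a reduction changes the order of the floating-point additions, results can differ in the last bits from the strictly sequential loop.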

Of course, life isn’t always simple. Loops often fail to vectorize or thread, either because the compiler detects loop dependencies or other hazards that could produce incorrect results, or because of ugly memory access patterns. A compiler worth its salt will produce a report that identifies such loops and their snags, with suggestions for restructuring the code to allow automatic optimization. It is also worth bearing in mind that the AVX2 execution units belong to the physical core and are shared by its hardware threads rather than duplicated for each one. This means that for highly optimized code, enabling Hyper-Threading, which doubles the number of logical cores, can actually hurt performance.

Equally, if code is not fully optimized for AVX2, then Hyper-Threading may help, as it can fill pipeline stalls in one thread with useful work from the other. Tools like Intel’s Vectorization Advisor can be very helpful in identifying and characterizing performance drains and opportunities for optimization.
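Returning to the loop snags mentioned above, the sketch below shows two common ones in C: a genuine loop-carried dependence, and a possible pointer overlap that the programmer can rule out. The function names are invented for illustration, and what any given compiler actually does with each loop depends on its version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried dependence: iteration i reads a[i - 1], which iteration
 * i - 1 has just written. The iterations cannot simply run side by side
 * in vector lanes, so a vectorization report will typically flag this
 * loop. Fixing it means changing the algorithm, not adding a hint. */
void running_sum(size_t n, float *a)
{
    assert(a != NULL);
    for (size_t i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

/* Possible aliasing: nothing here tells the compiler that dst and src
 * do not overlap, so it must either add a run-time overlap check or
 * fall back to scalar code. */
void scale_may_alias(size_t n, float *dst, const float *src, float k)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

/* Same loop with restrict: the programmer promises that the two arrays
 * never overlap, which removes the obstacle. Breaking that promise is
 * undefined behavior, so it must actually be true at every call site. */
void scale_no_alias(size_t n, float *restrict dst,
                    const float *restrict src, float k)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = k * src[i];
}
```

To see which loops were skipped and why, GCC offers `-fopt-info-vec-missed`, and Intel’s compilers have their own optimization report options.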

Next time, we’ll look at what is going on with some of the other processors of interest for specific niches.

About the Author

Peter Thompson

Sr Bus Dev Mgr

Peter Thompson is senior business development manager for High Performance Embedded Computing. He first started working on High Performance Embedded Computing systems when a 1 MFLOP machine was enough to give him a hernia while carrying it from the parking lot to a customer’s lab. He is now very happy to have 27,000 times more compute power in his phone, which weighs considerably less.
