Yesterday, I gave an invited talk at the Multicore@Siemens 2016 conference in Nürnberg about the performance analysis of parallel applications. At JSC, our high-performance computing center, we deal with large-scale scientific applications running on our world-class, highly scalable HPC systems like JURECA or JUQUEEN, whereas software developers in general work with much smaller systems.
However, everyone has to deal with parallel (multicore) systems now: smartphones, tablets, and laptops typically have two or four compute cores and a graphics accelerator, and the same is true for embedded computers in consumer devices like washing machines or in process automation control systems. Multicore computers are everywhere, so every software developer has to learn and understand parallel programming these days, and quickly finds out that (a) it is complicated to get right and (b) it is even more complicated to make it efficient, meaning that the software really makes use of the compute power provided by all the cores on the chip.
In my talk, I presented some of the results of the RAPID (Runtime Analysis of Parallel applications for Industrial software Development) project, a collaboration between the Corporate Technology Multicore Expert Center of Siemens AG and Jülich Supercomputing Centre.
The goal of this project was to adapt the measurement and analysis tools Score-P and Scalasca, which my team at Jülich has been developing for many years, to the needs of industrial applications. Because industrial applications are parallelized differently from scientific application codes, we had to integrate support for threading models like POSIX threads, Windows threads, Qt threads, and ACE threads into Score-P. In addition, we developed support for task parallelism using MTAPI, the tasking API of the Multicore Association. Besides supporting new programming paradigms, additional work was needed on portability. Although Score-P is already quite portable and runs on all relevant supercomputer architectures, Windows and operating systems for embedded systems had not been targeted so far, even though they are of course very important in an industrial context. On the analysis side, new methods targeting thread-based communication patterns, e.g., a lock contention analysis, were implemented in Scalasca. Since then, the Multicore Expert Center has used our software successfully to understand and optimize important Siemens industry codes.
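To give a rough idea of what a lock contention analysis looks for, here is a small Python sketch. It is not how Score-P or Scalasca work internally; it is only an illustration under my own assumptions. The names `TimedLock`, `worker`, and `run` are made up for this example. The wrapper measures how long each thread waits to acquire a shared lock and sums that time up, which is the basic quantity a contention analysis reports.

```python
import threading
import time


class TimedLock:
    """Hypothetical helper: a lock that records how long callers waited for it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.wait_seconds = 0.0   # total time spent waiting to acquire
        self.acquisitions = 0     # number of successful acquisitions

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        # The lock is held at this point, so updating the counters is safe.
        self.wait_seconds += waited
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def worker(lock, hold_seconds, rounds):
    # Each round holds the lock for a while, forcing the other threads to wait.
    for _ in range(rounds):
        with lock:
            time.sleep(hold_seconds)


def run(num_threads=4, hold_seconds=0.01, rounds=5):
    lock = TimedLock()
    threads = [
        threading.Thread(target=worker, args=(lock, hold_seconds, rounds))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock


if __name__ == "__main__":
    result = run()
    print(f"{result.acquisitions} acquisitions, "
          f"{result.wait_seconds:.3f} s spent waiting for the lock")
```

A large waiting time relative to the total runtime suggests that the threads are serialized on this lock, and that shrinking the critical section or splitting the lock might help.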
At the end of my talk, they gave me a Siemens Multicore Expert Center coffee mug. I am not sure whether they had read my blog article about my coffee mug collection, but anyhow, the mug will get a prominent spot on my bookshelf 😉