Extrae is the instrumentation package that captures information during program execution and generates Paraver (and Dimemas) traces. It can insert its probes through several mechanisms, ranging from static interception of runtime calls by linking with the Extrae library to dynamic instrumentation using Dyninst. The most frequent scenario is to use LD_PRELOAD to intercept production binaries at load time. The information collected by Extrae includes entries to and exits from the programming model runtime, hardware counters (PAPI), call stack references, user functions, periodic samples and user events.
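The LD_PRELOAD scenario can be sketched as follows. This is only an illustration under stated assumptions: the installation prefix `/opt/extrae` and the configuration file name `extrae.xml` are hypothetical, the helper `extrae_preload_env` is not part of Extrae, and the MPI tracing library is assumed to be installed as `lib/libmpitrace.so` under that prefix.

```python
import os


def extrae_preload_env(extrae_home, config_file, base_env=None):
    """Return a copy of the environment with Extrae preloading enabled.

    extrae_home and config_file are illustrative values supplied by the
    caller; this helper is a sketch, not part of Extrae itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumed location of the MPI tracing library inside the installation.
    tracing_lib = f"{extrae_home}/lib/libmpitrace.so"
    existing = env.get("LD_PRELOAD")
    # Put the tracing library first so its wrappers are resolved first.
    env["LD_PRELOAD"] = f"{tracing_lib}:{existing}" if existing else tracing_lib
    # Extrae reads its XML configuration from this variable.
    env["EXTRAE_CONFIG_FILE"] = config_file
    return env


env = extrae_preload_env("/opt/extrae", "extrae.xml", base_env={})
# The unmodified binary would then be launched with this environment, e.g.
# subprocess.run(["mpirun", "-np", "4", "./app"], env=env)
```

Because the library is injected at load time, the production binary does not need to be recompiled or relinked.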
Paraver is a very flexible data browser. In Paraver, metrics are not hardwired into the tool but programmed. Using a filter and a semantic module, the analyst can create timelines, profiles and histograms from trace files to selectively display a huge number of performance metrics. The different views can be easily combined to find correlations among the causes of performance problems. To capture the expert's knowledge, any set of views can be saved as a Paraver configuration file and reused in subsequent analyses. Paraver also features performance analytics tools, such as clustering and folding, that enrich the analysis by giving insight into the overall execution behaviour as well as fine-grained measurements for computation regions. The tool has proven very useful in performance analysis studies, giving much more detail about application behaviour than most performance tools.
Illustration: Paraver Folding Analysis
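The idea of programmed rather than hardwired metrics can be sketched as a filter composed with a semantic function. The record layout and the names `apply_filter` and `time_profile` below are hypothetical and do not reflect Paraver's trace format or configuration files; the sketch only shows how one metric (time per thread in a chosen state) falls out of combining the two steps.

```python
from collections import defaultdict

# Hypothetical state records: (thread, start, end, state)
trace = [
    (0, 0.0, 4.0, "Running"),
    (0, 4.0, 5.0, "MPI"),
    (1, 0.0, 3.0, "Running"),
    (1, 3.0, 5.0, "MPI"),
]


def apply_filter(records, keep):
    """Filter step: keep only the records selected by the predicate."""
    return [r for r in records if keep(r)]


def time_profile(records):
    """Semantic step: accumulate the duration of the records per thread."""
    profile = defaultdict(float)
    for thread, start, end, _state in records:
        profile[thread] += end - start
    return dict(profile)


# A metric is defined by choosing a filter and a semantic function.
mpi_time = time_profile(apply_filter(trace, lambda r: r[3] == "MPI"))
running_time = time_profile(apply_filter(trace, lambda r: r[3] == "Running"))
```

Swapping the predicate or the semantic function yields a different metric from the same trace, without changing the tool.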
MAQAO (Modular Assembly Quality Analyzer and Optimizer) is a performance analysis and optimisation tool suite. Its main goal is to provide application developers with synthetic reports that help them optimise their code. The tool combines dynamic and static analyses, building on its ability to reconstruct high-level structures such as functions and loops from an application binary. Since MAQAO operates at the binary level, it is agnostic to the language used in the source code.
Another key feature of MAQAO is its extensibility. Users can easily write their own modules thanks to an API using the Lua scripting language, allowing fast prototyping of new MAQAO modules.
MAQAO has also been designed to concurrently support multiple architectures. At the moment, the Intel64, Xeon Phi and ARM architectures are implemented.
The main modules of MAQAO are LProf and CQA. LProf is a sampling-based lightweight profiler that reports results at both function and loop levels and can categorise them according to their source. CQA (Code Quality Analyzer) is a static analyser that assesses the quality of the code generated by the compiler and produces a set of reports describing potential issues, estimates of the gain if they are fixed, and hints on how to fix them. Other modules, currently in beta version, support value profiling, decremental analysis, and memory profiling.
Partner: Université de Versailles Saint Quentin (UVSQ)
Codelet Extractor and REplayer (CERE) is an open source framework for code isolation. CERE finds the hotspots of an application, such as loops or OpenMP parallel regions, and extracts them as isolated fragments of code called codelets. Codelets can be modified, compiled, run, and measured independently of the original application. Code isolation reduces simulation and benchmarking costs and allows piecewise autotuning of an application. Unlike previous approaches, CERE isolates code at the compiler Intermediate Representation level. CERE is therefore language agnostic and supports many input languages such as C, C++, Fortran, and D. CERE automatically detects codelet invocations that have the same performance behavior. It then selects a reduced set of representative codelets and invocations that is much faster to replay yet still accurately captures the original application. In addition, CERE supports recompiling and retargeting the extracted codelets, changing the number of threads, their mapping and their scheduling. CERE can therefore be used for cross-architecture performance prediction and for runtime or compiler optimization autotuning.
Partner: Université de Versailles Saint Quentin (UVSQ)
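The selection of representative invocations described for CERE can be illustrated with a deliberately simple sketch. It is not CERE's actual algorithm: it assumes that each invocation is characterised by a single measured time, groups invocations whose times lie within a relative tolerance, and keeps one representative per group together with the group size. The function names and the sample timings are hypothetical.

```python
def select_representatives(invocation_times, tolerance=0.05):
    """Greedily group sorted times; return (representative_time, count) pairs.

    A time joins the current group if it is within `tolerance` (relative)
    of that group's first, i.e. smallest, member.
    """
    clusters = []  # each entry is [representative_time, count]
    for t in sorted(invocation_times):
        if clusters and t <= clusters[-1][0] * (1 + tolerance):
            clusters[-1][1] += 1
        else:
            clusters.append([t, 1])
    return [(rep, count) for rep, count in clusters]


def estimate_total(representatives):
    """Extrapolate the total time by replaying one invocation per group."""
    return sum(rep * count for rep, count in representatives)


# Hypothetical per-invocation times of one codelet, in seconds.
times = [1.00, 1.02, 0.99, 5.00, 5.10, 1.01]
reps = select_representatives(times)
estimated = estimate_total(reps)
measured = sum(times)
```

In this example six invocations collapse to two representatives, so only two replays are needed, at the price of a small gap between `estimated` and `measured`.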
Score-P is a highly scalable measurement infrastructure and easy-to-use tool suite for profiling, event trace recording, and online analysis of HPC applications. Because a single measurement infrastructure serves several analysis tools, users do not have to re-instrument their application for each of them. Currently, it works with Periscope, Scalasca, Vampir, and TAU, and is open to other tools. Score-P comes together with the new Open Trace Format Version 2 (OTF2), the CUBE4 profiling format and the OPARI2 instrumenter.
Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. It has been specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, but is also well-suited for small- and medium-scale HPC platforms. Scalasca integrates runtime summaries with in-depth studies of concurrent behavior via event tracing. A distinctive feature is the ability to identify wait states that occur, for example, as a result of unevenly distributed workloads.
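A wait state caused by an unevenly distributed workload can be illustrated with a small sketch. It does not use Scalasca or its trace format; it only assumes that every process reaches a barrier after a given compute time and must wait there for the slowest one. The function name and the timings are hypothetical.

```python
def barrier_wait_times(compute_times):
    """Return the time each process waits at a barrier for the slowest one."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]


# Hypothetical compute time per process before the barrier, in seconds.
compute = [4.0, 4.0, 7.0, 5.0]
waits = barrier_wait_times(compute)
total_wait = sum(waits)
```

Here the process with 7.0 seconds of work never waits, while the other three accumulate the wait time that a trace analysis would attribute to load imbalance.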
Cube, which is used as performance report explorer for Scalasca, is a generic tool for displaying a multidimensional performance space consisting of the dimensions (i) performance metric, (ii) call path, and (iii) system resource. Each dimension can be represented as a tree, where non-leaf nodes of the tree can be collapsed or expanded to achieve the desired level of granularity. In addition, Cube can display up to three-dimensional Cartesian process topologies.
Illustration: Cube Result Display of Scalasca Parallel Trace Analysis
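The tree representation of a Cube dimension can be sketched as follows. The sketch assumes the convention that a collapsed node shows the value of its whole subtree (inclusive) while an expanded node shows only its own contribution (exclusive). The `Node` class, `displayed_value` and the call-path values are hypothetical and are not Cube's API or file format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    exclusive: float  # value attributed to this node alone
    children: list = field(default_factory=list)

    def inclusive(self):
        """Value of this node plus all of its descendants."""
        return self.exclusive + sum(c.inclusive() for c in self.children)


def displayed_value(node, expanded):
    """Expanded nodes show their own value, collapsed ones their subtree."""
    return node.exclusive if expanded else node.inclusive()


# Hypothetical call tree with time values in seconds.
main = Node("main", 1.0, [
    Node("solve", 2.0, [Node("MPI_Allreduce", 3.5)]),
    Node("write_output", 0.5),
])

collapsed = displayed_value(main, expanded=False)
expanded = displayed_value(main, expanded=True)
```

Collapsing `main` aggregates the whole call tree into one number, whereas expanding it distributes the same total over its children, which is how the level of granularity is chosen.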