Discover the Mont-Blanc performance analysis tools: MAQAO and CERE

Tuesday, May 2, 2017

As part of the Mont-Blanc effort to provide a software ecosystem for ARM-based HPC, Mont-Blanc partners are developing and porting developer tools, such as performance analysis tools. In March 2017, Mont-Blanc set up a training session in Grenoble on the Open Source performance tools developed at BSC and UVSQ. During this one-day training, many Mont-Blanc partners, as well as external attendees from the local research labs, were given the opportunity to actually try the Paraver, Dimemas, EXTRAE, MAQAO and CERE tools, under the guidance of BSC and UVSQ experts. This motivated us to share information on these tools with a broader community.

Today, Hugo Bolloré and Pablo de Oliveira Castro, from University of Versailles Saint Quentin (UVSQ, France), present MAQAO and CERE. Stay tuned for our next instalment on Paraver, Dimemas and EXTRAE by BSC!


MAQAO (Modular Assembly Quality Analyser and Optimizer) has been developed at University of Versailles Saint Quentin (UVSQ) since 2004. At the time, it was becoming obvious that simple metrics were no longer enough to reach the code quality required to get the highest performance out of recent processor generations, and that automatic performance tuning tools were needed. MAQAO has been used and supported through a number of research projects and collaborations, such as Mont-Blanc.

MAQAO is a performance analysis and optimization tool suite. Its main goal is to provide application developers with synthetic reports that help them optimize their code. The tool combines dynamic and static analyses, building on its ability to reconstruct high-level structures such as functions and loops from an application binary. Since MAQAO operates at the binary level, it is agnostic with regard to the language used in the source code.


Figure 1: Overview of the MAQAO framework with an example of a CQA usage case

Another key feature of MAQAO is its extensibility. Users can easily write their own modules thanks to an API using the Lua scripting language, which allows fast prototyping of new MAQAO modules.

MAQAO has also been designed to concurrently support multiple architectures. At the moment, the Intel64, Xeon Phi and ARM architectures are implemented.

MAQAO has two main modules. The first, LProf, is a lightweight sampling-based profiler that reports results at both function and loop levels and categorizes them by source (program, libraries, OpenMP, MPI, etc.). The second, CQA (Code Quality Analyzer), is a static analyzer that assesses the quality of the code generated by the compiler and produces a set of reports describing potential issues, the estimated gain if they are fixed, and hints on how to fix them. Other modules, currently in beta, provide value profiling, decremental analysis, and memory profiling.

More information on


CERE (Codelet Extractor and REplayer) is a tool to facilitate piecewise code analysis and optimization. Finding the best design parameters is a costly and time-consuming iterative process. CERE makes this analysis affordable by breaking an application into a set of standalone codelets. With codelets, one can focus the analysis, simulation and optimization on single regions of code instead of on whole applications. CERE codelets map to the hot loops or the parallel regions in the original application.

Codelets can be modified, compiled, run, and measured independently from the original application. Figure 2 shows a high-level view of the CERE workflow. To extract codelets, CERE builds upon the LLVM compiler to outline and isolate regions of code at the Intermediate Representation (IR) level. By working at the IR level, architecture-specific details can be abstracted, making CERE codelets portable across architectures that share the same memory model.

Accurately replaying codelets requires reloading the machine state from the original application, including state that affects replay performance, such as the cache state or the ownership of NUMA memory pages. The CERE capture library handles the capture and reloading of this state and is cache and NUMA aware.

To accelerate simulation and optimization, CERE automatically detects codelet invocations that have the same performance behaviour. It then produces a reduced set of representative codelets and invocations by applying clustering techniques to their performance signatures. The reduced set of codelets is much faster to replay and optimize, but is still representative of the original application.
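The idea of replacing many invocations with a few representatives can be illustrated with a minimal Python sketch. It is not CERE's actual algorithm: the greedy distance-threshold grouping, the two-value signatures, the threshold and the timings below are all assumptions made up for illustration.

```python
from math import dist

def cluster_invocations(signatures, threshold):
    """Greedy grouping (illustrative only): an invocation joins the first
    cluster whose representative signature lies within `threshold`,
    otherwise it starts a new cluster and becomes its representative."""
    clusters = []  # each cluster: {"rep": index, "members": [indices]}
    for i, sig in enumerate(signatures):
        for cluster in clusters:
            if dist(sig, signatures[cluster["rep"]]) <= threshold:
                cluster["members"].append(i)
                break
        else:
            clusters.append({"rep": i, "members": [i]})
    return clusters

def estimate_total_time(clusters, replayed_time):
    """Extrapolate the total time from the replayed representatives only,
    weighting each one by the number of invocations it stands for."""
    return sum(replayed_time[c["rep"]] * len(c["members"]) for c in clusters)

# Hypothetical signatures: (normalized cycles, cache miss ratio) per invocation.
signatures = [(1.00, 0.10), (1.02, 0.11), (5.00, 0.90), (0.99, 0.10), (5.10, 0.88)]
clusters = cluster_invocations(signatures, threshold=0.2)

# Only the representatives (invocations 0 and 2) are replayed and timed.
replayed_time = {0: 2.0, 2: 10.0}
total = estimate_total_time(clusters, replayed_time)
print([c["members"] for c in clusters], total)
```

In this toy example, five invocations collapse into two clusters, so only two replays are needed to extrapolate the behaviour of all five.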

Figure 2: CERE Workflow

CERE has been successfully used for cross-architecture prediction, scalability prediction, compiler parameter optimization, and thread-count and scheduling optimization. It is released under the LGPL open-source licence. For more information about the tool, detailed publications and results, please visit the project site

MAQAO and CERE in Mont-Blanc

Having a set of native ARM performance analysis tools is essential in the application and kernel profiling and benchmarking efforts in Mont-Blanc.

Two MAQAO modules are particularly useful in Mont-Blanc: the Code Quality Analyzer (CQA) and the lightweight profiler (LProf). CQA decompiles the application and performs detailed performance analysis on its assembly code. Through simulation and performance models of the micro-architecture, CQA is able to estimate the latency of a computation kernel, for example by detecting dependencies between instructions or long-latency operations. LProf uses sampling to profile applications at function and loop levels. In the case of HPC applications, it has the benefit of being runtime-agnostic while remaining runtime-aware, which allows the time spent to be categorized by source. Moreover, LProf is able to combine its timing feature with access to hardware counters, providing as many classifications as there are hardware counter events available on the target processor.
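To give an intuition of how instruction dependencies bound kernel latency, here is a minimal Python sketch that computes the longest dependency chain (critical path) of a straight-line instruction sequence. It is not CQA's model, which relies on detailed micro-architecture descriptions; the instruction names and latencies below are hypothetical and do not describe any real ARM core.

```python
def critical_path_latency(instructions):
    """Return the latency (in cycles) of the longest dependency chain.

    `instructions` is a list of (name, latency, deps) tuples in program
    order, where `deps` holds the indices of earlier instructions whose
    results are needed. Resource limits are deliberately ignored."""
    finish = []  # finish[i]: cycle at which instruction i completes
    for _name, latency, deps in instructions:
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency)
    return max(finish, default=0)

# Hypothetical loop body: two independent loads feeding a multiply-add-store chain.
loop_body = [
    ("load a",  4, []),
    ("load b",  4, []),
    ("mul",     5, [0, 1]),
    ("add",     3, [2]),
    ("store",   1, [3]),
]

latency = critical_path_latency(loop_body)
serial = sum(lat for _name, lat, _deps in loop_body)
print(latency, serial)
```

Here the two loads overlap, so the dependency chain costs 13 cycles while executing every instruction back to back would cost 17. A long-latency operation on the chain (the multiply in this example) directly lengthens the estimate.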

Through code similarity analysis, the CERE (Codelet Extractor and REplayer) tool reduces the number of kernels that need to be studied by selecting a small representative subset that captures the performance features of the whole set. By automatically analyzing and clustering the performance signatures of different kernels, CERE pinpoints regions of interest for simulating new architecture designs. For example, CERE codelets have been used to significantly accelerate GEM5 hardware simulations.