MUSA (MUlti-scale Simulation Approach)
During the Mont-Blanc 3 project we developed MUSA, an end-to-end MUlti-scale Simulation Approach that integrates the architectural simulator TaskSim with the communication network simulator Dimemas. The result is a versatile tool able to model the communication network, microarchitectural details, and system software interactions.
MUSA employs two components: (i) a tracing infrastructure that captures communication,
computation and runtime system events; and (ii) a simulation infrastructure that leverages
these traces for simulation at multiple levels.
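The interplay between the two components can be pictured with a minimal sketch. This is a hypothetical Python illustration of the idea only; the event fields, the timing constants, and the `replay` function are invented and are not the actual MUSA trace format or simulator code:

```python
# Hypothetical sketch: a trace is a stream of timestamped events of three
# kinds (communication, computation, runtime), and a replay loop dispatches
# each kind to the model responsible for that abstraction level.
from dataclasses import dataclass

@dataclass
class Event:
    time: float     # timestamp from the reference run
    kind: str       # "compute", "comm", or "runtime"
    payload: dict   # e.g. cycle count, message size, task id

def replay(events):
    """Accumulate simulated time, handing each event to the matching model."""
    t = 0.0
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "compute":
            t += ev.payload["cycles"] / 2.0e9        # assumed 2 GHz core model
        elif ev.kind == "comm":
            t += 1e-6 + ev.payload["bytes"] / 10e9   # assumed latency + bandwidth
        elif ev.kind == "runtime":
            t += 50e-9                               # assumed fixed runtime overhead
    return t
```

Refining one of the three per-kind models (say, replacing the one-line communication cost with a full network simulator such as Dimemas) is what moves the simulation to a different abstraction level without re-tracing the application.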
For more information, read the Mont-Blanc 3 deliverable D5.6, "Report on Correlated and Fine Tuned Multi-Scale Simulation Infrastructure".
Under Mont-Blanc 2020, MUSA was enhanced:
- Support for tracing Arm SVE binaries: this required a significant redesign of the tracing infrastructure to accommodate emulation in the toolflow, as well as multiple tracing plugins. Arm SVE binaries must be emulated using the Arm Instruction Emulator, and two instrumentation plugins work in parallel to generate the final MUSA traces, which also carry information about the individual SVE instructions, including memory references and gather/scatter information.
- Support for simulating traces with Arm SVE instructions: similarly, the architectural simulator had to be updated to decode SVE instructions and to handle the new memory addressing modes introduced by SVE.
The Mont-Blanc 2020 contributions to MUSA will be integrated into EPI's public release of MUSA.
Our vision: 'multi-scale' simulation, where the full simulation workflow consists of various tools addressing different abstraction levels of the same simulated system.
SVE-enabled gem5 simulator
The gem5 simulator is a cycle-level, full-system simulator available under a permissive BSD license from http://gem5.org. The particular release produced as part of Mont-Blanc 2020 (available from https://gem5.googlesource.com/arm/gem5/+/mb2020/d4.1) adds support for the Arm Scalable Vector Extension (SVE). SVE provides advanced vector instructions that decouple the width of the hardware vector registers and computation units from the ISA level: the same application binary works with hardware vector widths ranging from 128 bits to 2048 bits. In this version, we add the majority of the SVE instructions, especially those generated by vectorising compilers. Current development happens on a branch that tracks the main gem5 repository; we will continue pushing the SVE changes to gem5 into that mainline code base. We are making our changes available to the general public under the same permissive open-source BSD license as gem5 itself.

The work on the gem5-SVE code base started under the Mont-Blanc 3 project (funded under the EU Horizon 2020 programme) and has continued under the successor project Mont-Blanc 2020 (also funded by the EU Horizon 2020 programme). In Mont-Blanc 2020, the resulting model is used in WP3 to help port applications (T3.4) and to derive the right sizes for SVE implementations (T3.2); in WP4 to generate traces of applications (T4.4) and as a baseline for a power-modelling-enabled simulator (D4.2). The activity factors obtained from gem5 simulation are used in T5.8 for reliability modelling, and the traces generated (in T7.2) are used for NoC performance testing in T5.1.
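The vector-length-agnostic property that makes one binary work for any hardware width from 128 to 2048 bits can be illustrated with a small sketch. This is a hedged Python model of the concept only (the function name and lane arithmetic are invented, and real SVE code would use predicated vector instructions such as WHILELT rather than an inner loop):

```python
# Hypothetical illustration of SVE's vector-length-agnostic loop structure:
# the same loop body works for any hardware vector width, because a
# predicate masks off the lanes that run past the end of the array.

def vla_add(a, b, vector_bits):
    """Add two float arrays using `vector_bits`-wide 'hardware' vectors."""
    lanes = vector_bits // 32          # 32-bit elements per vector register
    out = [0.0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: like SVE's WHILELT, only lanes before the array end are active.
        active = min(lanes, len(a) - i)
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out
```

The result is identical whether `vector_bits` is 128 or 2048; only the number of loop iterations changes, which is exactly the behaviour a simulator like gem5-SVE must reproduce when the modelled vector width is varied.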
SVE-enabled gem5 with abstract power equations
In addition to performance simulation of SVE applications, Mont-Blanc 2020 extends the gem5 simulator with support for on-line power modelling. Power is modelled through the notion of power states and associated power equations.

Generally, the power consumption of integrated circuits can be split into static and dynamic power. Static (or leakage) power is the power drawn even in the absence of switching, caused by current that leaks through the transistors and the substrate. As such, it is dependent on area (a larger area means more static power), temperature (hotter chips leak more), substrate characteristics (bulk silicon processes consume more static power than silicon-on-insulator processes), and of course voltage. Dynamic power, on the other hand, is caused by transistors switching and by the charging and discharging of capacitances through the non-zero resistance of wires. Switching activity (toggles per unit time) and transistor/wire characteristics are the key determining factors for dynamic power consumption. Dynamic power is usually approximated as a sum of activities/events multiplied by specific activation factors.

In deliverable D4.2, Mont-Blanc 2020 adds support for gem5 to obtain SVE-related activities from the core and feed them into power equations. Together with design- and process-specific constants (so-called activity factors), the model allows on-line estimation of dynamic power and, through the use of a thermal model, also static power. The thermal model of gem5 can then be used to model temperature based on the dynamic power and provide temperature input for further static power modelling. Both static power modelling and usage of the thermal model are, however, out of scope for Mont-Blanc 2020.
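The "sum of activities multiplied by constants" form of the dynamic power equation can be sketched in a few lines. This is a hedged illustration only: the event names and all numeric constants below are invented, not values from the Mont-Blanc 2020 model:

```python
# Hypothetical sketch of a power equation of the kind described above:
# dynamic power as a sum over event types of (events per second) times
# (a design/process-specific energy constant per event).

def dynamic_power(event_rates, energy_per_event):
    """event_rates: events/second; energy_per_event: joules per event."""
    return sum(rate * energy_per_event[name] for name, rate in event_rates.items())

# Assumed activities a core model might report, and assumed constants:
rates    = {"sve_fma": 1.0e9, "sve_load": 5.0e8, "scalar_op": 2.0e9}
energies = {"sve_fma": 20e-12, "sve_load": 15e-12, "scalar_op": 5e-12}

p_dyn = dynamic_power(rates, energies)   # watts
```

In an on-line setting the simulator would re-evaluate such an equation every sampling interval, feeding the resulting dynamic power into the thermal model, which in turn supplies the temperature needed for static power estimation.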
We make the updated gem5 available at https://gem5.googlesource.com/arm/gem5/+/mb2020/d4.2.
Performance and scalability prediction is key to designing future High-Performance Computing (HPC) systems. System designers aim to find the proper balance between computation/network performance and power, and an adequate multi-scale simulation methodology is needed for fast and accurate design space exploration. In this regard, the Mont-Blanc project has focused on developing a complete simulation methodology at different abstraction levels that allows architectural parameter exploration and scalability analysis.

A popular approach for architectural performance/scalability prediction is trace-driven simulation. It relies on performing a reference simulation and collecting traces of the most relevant phenomena observed during execution. The traces are then re-used as an abstraction for some of the simulation elements (e.g., core behaviour, memory accesses), refocusing the simulation effort on other performance-critical sub-components such as caches, the communication architecture and the memory sub-system.

The ElasticSimMATE tool, developed in the Mont-Blanc project, builds on those foundations. It allows traces to be captured on several cores and subsequently replayed on architectures with different configurations and an arbitrary core count, up to hundreds of cores. ElasticSimMATE is based on two existing tools, Elastic Traces and SimMATE, both developed within the gem5 full-system simulator. These tools have shown that trace-driven simulation reduces simulation times while keeping results accurate with respect to the gem5 framework. However, each applies only to certain configurations or models. For instance, Elastic Traces can only be applied to single-core systems, so multicore architectures and synchronization events are not handled.
On the other hand, SimMATE focuses on analyzing multi-core systems and synchronization mechanisms, but only supports in-order CPU models. ElasticSimMATE is a joint effort to combine the advantages of both Elastic Traces and SimMATE. It enables two categories of exploration:
- Fast system parameter exploration: because trace-driven simulation is fast, the influence of parameters such as cache sizes, coherency policy, and memory speed can be rapidly assessed by replaying the same traces on different system configurations.
- System scalability analysis: analysing how performance scales when the number of cores increases. This requires recording the synchronization semantics and carefully handling them in the trace-replay phase, so that the replay reflects the execution behaviour of the target architecture.
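Why synchronization events must be handled carefully during replay can be shown with a minimal sketch. This is a hypothetical Python illustration, not the actual ElasticSimMATE trace format or replay engine:

```python
# Hypothetical barrier-style replay: between synchronization points each
# core advances by its own recorded compute duration, but every barrier
# forces all cores to wait for the slowest one.

def replay_with_barriers(per_core_segments):
    """per_core_segments[c] = list of compute durations, one per parallel region."""
    n_regions = len(per_core_segments[0])
    t = 0.0
    for r in range(n_regions):
        # The barrier closing region r completes at the latest arrival time.
        t += max(core[r] for core in per_core_segments)
    return t

# Two cores, two regions: total time is set by the slower core in each region.
total = replay_with_barriers([[3.0, 1.0], [2.0, 4.0]])  # 3.0 + 4.0 = 7.0
```

Replaying the same per-core segments without the barrier maximum would under-estimate execution time, which is why the synchronization trace file is recorded alongside the instruction and data dependency traces.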
A fast and accurate gem5 trace-driven simulator for multicore systems
Figure 1: Overview of the ElasticSimMATE methodology

Figure 1 conceptually depicts the ElasticSimMATE workflow, from the OpenMP application source files to the replay on different target architecture configurations. The red-coloured "pragma omp" statements in the source are read by the preprocessor in the usual way and result in the insertion of calls to the OpenMP run-time. In ElasticSimMATE, these calls additionally invoke a tracing function that records the start and end of each parallel region in the trace. The resulting binaries are then executed in a full-system (FS) simulation (trace collection phase) to generate the execution traces. Three traces are created: instruction and data dependency trace files (as in the Elastic Traces approach) and an additional trace file that embeds synchronization information. These three trace files are used in the trace replay phase devoted to architecture exploration.

Experiments have been carried out on sample applications extracted from the Rodinia and Parsec benchmark suites. Preliminary results show that ElasticSimMATE (ESM) results are highly correlated with gem5 full-system simulation results, with a simulation speed-up of 3x. Furthermore, ESM allows fast scalability analysis; experiments have been carried out on applications running on core counts ranging from one to 128. The Mont-Blanc 3 project takes advantage of the capabilities of ElasticSimMATE by performing analysis at the architectural level with faster simulation times. It is part of the multi-scale simulation framework and will interact with tools developed by the consortium partners to provide a holistic approach for fast design space exploration of HPC systems.

Further information: A. Nocua, F. Bruguier, G. Sassatelli and A. Gamatie, "ElasticSimMATE: A fast and accurate gem5 trace-driven simulator for multicore systems," 2017 12th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), Madrid, 2017, pp. 1-8. doi: 10.1109/ReCoSoC.2017.8016146. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8016146&isnumber=8016139
R. Jagtap, S. Diestelhorst, A. Hansson, M. Jung and N. Wehn, "Exploring system performance using elastic traces: Fast, accurate and portable," 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), Agios Konstantinos, 2016, pp. 96-105. doi: 10.1109/SAMOS.2016.7818336. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7818336&isnumber=7818316

A. Butko et al., "A trace-driven approach for fast and accurate simulation of manycore architectures," The 20th Asia and South Pacific Design Automation Conference, Chiba, 2015, pp. 707-712. doi: 10.1109/ASPDAC.2015.7059093. URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7059093&isnumber=7058915

gem5: http://gem5.org/Main_Page

Partners: CNRS and Arm
Dimemas is a performance analysis tool for message-passing programs. The Dimemas simulator reconstructs the time behaviour of a parallel application on a machine modelled by the key factors influencing performance. With a simple model, a network of SMP nodes, Dimemas allows complete parametric studies to be simulated in a very short time frame. As part of its output, Dimemas generates a Paraver trace file, enabling the user to conveniently examine the simulated run. Partner: BSC Web: www.bsc.es/dimemas
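The flavour of the simple network-of-SMP-nodes model can be sketched in a few lines. This is a hedged illustration of the general latency-plus-bandwidth style of such models; the function and every parameter value below are invented, not Dimemas's actual equations or defaults:

```python
# Hypothetical point-to-point transfer cost in a network of SMP nodes:
# latency plus size over bandwidth, with cheaper communication inside a node.

def transfer_time(nbytes, same_node,
                  intra_lat=0.5e-6, intra_bw=20e9,   # assumed intra-node (shared memory) values
                  inter_lat=2.0e-6, inter_bw=5e9):   # assumed inter-node network values
    lat, bw = (intra_lat, intra_bw) if same_node else (inter_lat, inter_bw)
    return lat + nbytes / bw

t_local  = transfer_time(1_000_000, same_node=True)   # 1 MB inside a node
t_remote = transfer_time(1_000_000, same_node=False)  # 1 MB across the network
```

Because each message cost reduces to a closed-form expression, sweeping the latency and bandwidth parameters over a whole parametric study is cheap, which is what makes the simulation of complete studies feasible in a short time frame.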
BOAST is a modular meta-programming framework. It implements a DSL that allows the description and parametrization of computing kernels. Application developers can port their computing kernels to BOAST and implement several optimization techniques. The kernels, with the chosen optimizations, can then be generated in the target language of choice: C, Fortran, OpenCL, CUDA or C with vector instructions. This approach also allows application developers to study application-specific parameters. The generated kernels can then be built and executed inside BOAST to evaluate their performance, so with this framework one can easily find the best-performing version of a kernel on a given architecture. Performance results can also be used to interact with automatic performance analysis tools (ASK, Collective Mind, …) in order to reduce the search space. The generated binary kernels can also be passed to tools like MAQAO for static or dynamic analysis. Partner: CNRS