Deliverables
Discover here the public reports that summarize the activity of the different project phases.
This report details the main scientific and technological results of the project as well as the impact, general findings and lessons learnt.
This report materializes deliverable D2.4, which is the Summer School itself. The Mont-Blanc Summer School, initially planned for the summer of 2020, had to be turned into an online event and held later than initially planned because of delays in the project. To maximize the reach of this event, we chose to create a tutorial that was both:
• a full-day, live online tutorial within an existing event (HiPEAC 2021 in January 2021 was ideal in terms of timing and audience);
• a MOOC (Massive Open Online Course) created out of the materials prepared and recorded for the live tutorial.
The project opened an account (https://mont-blanc-project.moodle.school/) with Moodle Cloud, an Australian MOOC platform popular in schools and academia.
This report summarizes the dissemination activities carried out during the entire Mont-Blanc 2020 project, covering the timeframe from December 2017 to March 2021. The Mont-Blanc 2020 dissemination activities were carried out in continuity with the communication/dissemination activities of the previous Mont-Blanc projects, especially Mont-Blanc 3, which overlapped with Mont-Blanc 2020 until the end of 2018. The decision to continue the Mont-Blanc brand, taken at the time of the project submission, allowed the project to benefit from a reach it would never have achieved if we had had to create a new brand.
This report defines the set of (Mini) Applications selected for project testing. In close cooperation with CEA and JUELICH, BSC identified a set of real-life applications that exhibit the characteristic demands of Big Data-HPC applications. These applications will run on a simulator towards the end of the project, and as they may be too large to be processed in a reasonable time, some representative mini applications (miniapps) were chosen to replace the full applications. Selecting these applications well is essential for the project to deliver a competitive solution.
In order to evaluate the main IP blocks being developed in Mont-Blanc 2020, a comprehensive evaluation environment is necessary. This environment is being deployed in WP7 as part of the final demonstrator, which is an emulation platform that will enable integration and evaluation of the RTL designs developed by different project partners. In order to evaluate the performance of the system, a set of applications was selected in D3.1. These applications will help test our requirements on the demonstrator. We detail here the porting efforts undertaken for the selected applications. These efforts include: (i) generating appropriate binaries for the Arm Instruction Set Architecture (ISA) by using available tools like the Arm Compiler for HPC; (ii) leveraging the Scalable Vector Extension (SVE) to achieve competitive performance; and (iii) ensuring correct execution of ported applications by using available emulation and simulation tools such as ArmIE and gem5. Most of the effort has been devoted to porting the applications to exploit SVE capabilities. This has been achieved in different ways depending on the application, from lower to higher level of effort: (i) auto-vectorisation, (ii) using Arm Performance Libraries, (iii) using Arm C Language Extensions (intrinsics), or (iv) hand-tuned assembly code. Applications based on simple loops that have regular or contiguous memory access patterns can rely on compiler auto-vectorisation, while low-level kernels that require high performance are likely to benefit from hand-tuned assembly implementations. A significant effort has been made to achieve good SVE performance; however, performance fine-tuning has not been a main objective due to the lack of a target architecture to optimize for. Finally, we also provide the details for the gem5-based simulation environment that allows us to validate the correct execution of the ported applications.
This environment has also been used to evaluate application performance under different vector lengths and memory hierarchies.
The gem5 simulator is a cycle-level, full-system simulator available under a permissive BSD license from http://gem5.org. This particular release (available from https://gem5.googlesource.com/arm/gem5/+/mb2020/d4.1) adds support for the Arm Scalable Vector Extension (SVE) ISA extension. SVE provides advanced vector instructions that decouple the width of the hardware vector registers and computation units from the ISA level. The same application binary will work with hardware vector widths ranging from 128 bits to 2048 bits. In this deliverable, we add the majority of the instructions present in SVE, especially those that are generated by vectorising compilers. The current development happens on a branch that tracks the main gem5 repository; we will continue pushing the SVE changes to gem5 into that mainline code base (estimated alpha release is July 2018, and stable support in September 2018). We are making our changes available to the general public under the same permissive open-source BSD license as gem5 itself. The work for the gem5-SVE code base started under the Mont-Blanc 3 project (funded under the EU Horizon 2020 programme) and has continued under the successor project Mont-Blanc 2020 (also funded by the EU Horizon 2020 programme). In Mont-Blanc 2020, the resulting model will be used in WP3 for helping port applications (T3.4), and deriving the right sizes for SVE implementations (T3.2); in WP4 to generate traces of applications (T4.4), and as a baseline for a power-modelling enabled simulator (D4.2). The activity factors obtained from gem5 simulation will be used in T5.8 for reliability modelling; and the traces generated (in T7.2) will be used for NoC performance testing in T5.1.
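The vector-length-agnostic idea described above can be illustrated with a small sketch. This is plain Python, not SVE code or gem5 code: the function name `vla_add` and the 64-bit lane size are assumptions made for illustration. It only shows how one loop, written without a fixed vector width, produces the same result whether the (simulated) hardware vector is 128 or 2048 bits wide.

```python
def vla_add(a, b, vector_bits):
    """Element-wise add of two equal-length lists, processed in chunks
    whose size depends on the (simulated) hardware vector width.

    Hypothetical illustration only: elements are treated as 64-bit lanes.
    """
    # SVE allows vector lengths that are multiples of 128 bits, up to 2048.
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in [128, 2048]")
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    lanes = vector_bits // 64  # number of 64-bit lanes per vector register
    out = [0] * len(a)
    i = 0
    while i < len(a):
        # Predication: lanes that would run past the end of the data are
        # inactive, so no scalar tail loop is needed.
        active = min(lanes, len(a) - i)
        for lane in range(active):
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [10, 20, 30, 40, 50]
    for bits in (128, 256, 512, 2048):
        print(bits, vla_add(x, y, bits))
```

The loop body never mentions a concrete width; only the number of iterations changes with `vector_bits`, which is the property that lets a single binary run on different SVE implementations.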
In addition to performance simulation of SVE applications, we extend the gem5 simulator with support for on-line power modelling. Power is modelled through the notion of power states and associated power equations. Generally, power consumption of integrated circuits can be split into static and dynamic power. Static (or leakage) power is the power draw caused by the finite resistance of the chip overall, i.e., by current that flows through the substrate. As such, it is dependent on area (higher area means more static power), temperature (hotter chips cause more static power), substrate characteristics (bulk silicon processes consume more static power than silicon-on-insulator processes), and of course voltage. Dynamic power, on the other hand, is caused by transistors switching, and charging / discharging, and the non-zero resistance of wires. Switching activity (toggles per unit time) and transistor / wire characteristics are the key determining factors for dynamic power consumption. Dynamic power is usually approximated as a sum of activities / events multiplied by specific activation factors. In this deliverable, we add support for gem5 to obtain SVE-related activities from the core and feed them into power equations. Together with design (and process) specific constants (so-called activity factors), the model allows on-line estimation of dynamic power, and (through the use of a thermal model) also static power. The thermal model of gem5 can then be used to model temperature based on the dynamic power and provide temperature input for further static power modelling. Both static power modelling and usage of the thermal model are, however, out of scope for this report. In the literature, these design-specific constants have been obtained for existing designs [WDH+17, WBD+18], and thus allow fine-grained online power modelling. As there is no publicly available SVE-enabled silicon yet, we use a set of fictitious constants to illustrate the model.
Over time, we expect project partners to get a better understanding of hardware characteristics and improve the numbers for more detailed analysis. This deliverable and report do not make available actual activity factors / design-specific constants of Arm IP. These are tightly bundled to Arm product IP and thus cannot be made available here. Instead, the activity factors in this report and the deliverable code are for illustration only. We make the updated gem5 available at https://gem5.googlesource.com/arm/gem5/+/mb2020/d4.2.
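The "sum of events multiplied by factors" approximation of dynamic power can be written down in a few lines. The sketch below is not the gem5 power model API; the function name, the event names and the per-event energy values are all made up for illustration, in the same spirit as the fictitious constants used in the deliverable.

```python
def dynamic_power(event_counts, energy_per_event, interval_s):
    """Estimate average dynamic power (in watts) over one sampling interval.

    event_counts     -- events observed during the interval, keyed by name
    energy_per_event -- joules attributed to one event of each kind
                        (fictitious, design-specific constants)
    interval_s       -- length of the sampling interval in seconds
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    # Weighted sum of activities: each event kind contributes count * factor.
    energy_j = sum(count * energy_per_event[name]
                   for name, count in event_counts.items())
    return energy_j / interval_s


if __name__ == "__main__":
    # Hypothetical SVE-related activity sampled over one millisecond.
    counts = {"sve_fp_ops": 2_000_000, "sve_loads": 500_000}
    factors = {"sve_fp_ops": 1e-10, "sve_loads": 2e-10}  # illustrative only
    print(f"{dynamic_power(counts, factors, 1e-3):.3f} W")
```

With real, calibrated constants the same structure would give a usable estimate; with the illustrative constants above it only demonstrates how activity counts feed into a power equation.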
This deliverable gives an overview of the main scientific and technological results of the project:
- An industrial prototype (the Dibona platform). This medium-sized test platform (composed of 16 blades, i.e. 48 bi-socket 64-core compute nodes) made it possible to assess our performance extrapolation with the observation of real applications. It provided key support to showcase our holistic work and enhance the impact of our project.
- A significant contribution to the Arm HPC software ecosystem. We identified gaps across the entire software stack, and consequently made improvements at all levels, from compilers and scientific libraries, to runtimes and operating systems.
- Application evaluation on Dibona. We structured the evaluation with a bottom-up approach, executing programs with an increasing level of complexity, starting from the simplest micro-benchmarks, going through the most relevant high-performance computing benchmarks LINPACK and HPCG, and finally performing tests with production scientific applications. This showed results in line with state-of-the-art HPC systems.
- Performance modeling environment. In the Mont-Blanc 3 project, we have pursued the idea of 'multi-scale' simulation where the full simulation workflow consists of various tools to address different abstraction levels of the same simulated system. We have also explored options to reduce simulation time by finding and simulating only a set of representative sections of a particular application. We have demonstrated that these methods can indeed reduce simulation time significantly, achieving 10-100x speed-up in many cases.
- Exploration of design space. We have performed an exhaustive analysis considering several architectural components relevant for the design of next-generation HPC architectures. We investigated hybrid MPI+OpenMP state-of-the-art benchmarks with native input sets. We extracted several conclusions from this campaign, which can be used to drive the design of next-generation HPC systems.
This deliverable presents some results on the integration of Non-Volatile Memory technologies at main memory level in multicore architectures. This follows our initial work (see deliverable D3.2) on a design exploration framework for analyzing these technologies in different levels of the memory hierarchy, e.g., caches and main memory. It improves the design evaluation framework by targeting ARMv8 heterogeneous multicore architectures, while the main memory technology models have been modified to reflect as much as possible the typical compute node architectures under consideration in the Mont-Blanc 3 project. We define a gem5 model of the Exynos 7420 ARMv8 chip, based on which we quantitatively evaluate the possible impacts of various main memory technologies (DDR4, Phase-Change Memory, Resistive RAM) on both performance and energy-to-solution. This is achieved by considering typical benchmarks, e.g., Parsec and a few HPC mini-applications studied in WP6, which aims at studying a set of kernels, mini-apps and applications for co-design and system software assessment. Further design improvement opportunities are addressed, by focusing on memory architecture parameters (e.g., frequency, multi-channel design) and on the implications of programming models, via their runtimes as investigated in WP7, which deals with runtimes for parallel programming.
Simulation is widely used in system design for evaluating different design options. Depending on the abstraction level considered for simulating a given system configuration, there is a tradeoff between the obtained precision and speed. Generally, simulating a detailed system model provides accurate evaluation results at the price of potentially high simulation time. On the other hand, less detailed or more abstract system representations usually provide less accurate evaluation results, but quickly and at low cost. In practice, such representations are defined such that they only capture system features that are most relevant to the problem addressed by a designer. This report summarizes the final results regarding the Mont-Blanc 3 multi-scale simulation infrastructure. The main objective of the Simulation Work package was to enhance scalability of existing simulation tools both in terms of simulation speed and scope to capture all the details necessary while being capable of simulating entire HPC systems. We pursue the idea of 'multi-scale' simulation where the full simulation workflow consists of various tools to address different abstraction levels of the same simulated system. The results obtained with detailed fine-grain simulators (addressing lower abstraction levels) are funneled into coarser-grain tools (addressing higher abstraction levels). We have explored a rich set of existing simulation tools to select a few of them as core components of our multi-scale simulation infrastructure. We explored options to push the simulation limits both in terms of details/accuracy/speed and size (number of compute nodes) of the simulated HPC systems. We demonstrated the feasibility of the multi-scale approach. Our simulation tools are used by design-oriented work packages within the project and also by various external collaborators.
This deliverable describes the analyses performed on the different applications being considered as part of the co-design process in Mont-Blanc 3. We present different types of results for 18 applications. For most of them we performed evaluations on Mont-Blanc platforms and analyses on both ARM- and Intel-based platforms. The different analyses try to identify fundamental issues that limit performance on the currently available platforms, and through predictive studies we identify issues that will be relevant at larger scales. For each application we summarize, at the end of the corresponding section, the fundamental issues and possible co-design alternatives to consider.
This document presents a methodology to select interesting regions within applications that exhibit variety with respect to processor resource demands and are representative of a set of benchmark applications. Such a selection of a benchmark subset allows for piecewise optimization of an application by replaying the selected regions on a simulator instead of executing all the applications, thus reducing simulation time. First we briefly describe the chosen applications from PARSEC and Lulesh, and then present the detailed approach to select the regions. Next, we apply the proposed approach and show how the selected regions can be used for benchmarking novel accelerators and for performance tuning of big and LITTLE processors in heterogeneous architectures.
This deliverable describes the results generated in porting and tuning the applications considered as part of the co-design process in Mont-Blanc 3 for ARM. We present our optimization experiences for 9 applications. For most of them we performed evaluations on Mont-Blanc 3 platforms and analyses on both ARM and Intel-based platforms. Building on Deliverable 6.1, which tried to identify root causes of scaling issues, we continued with analysis and implemented optimizations to overcome these performance problems. Overall, load imbalance was identified as a very important issue. Most applications needed to address this either via improving algorithms and domain decompositions, distributing their resources to fewer MPI ranks and more threads, or via dynamic load balancing (DLB). For the HPCG benchmark, a performance analysis and an initial algorithmic optimization are presented. The work presented is not ARM-specific, but it has been tested on an ARM-based cluster by a team of students. Following the directive in [1], Lulesh has been ported to OmpSs and tested on the Mont-Blanc 3 mini-clusters. In the ARM ecosystem, the ARM Performance Libraries have been evaluated on a widely used scientific suite, QuantumESPRESSO. The results indicate speedups when using ARMPL for linear algebra workloads, and highlight opportunities to improve the FFT functions. In addition, the recently released ARM compiler was compared to GCC. Performance and usability were comparable; further investigation into which compiler is preferable for which type of workload is suggested. For the applications in cardiac modelling and mesh deformation we generally find optimizations stemming from analysis on Intel systems advantageous for ARM systems and vice versa, e.g. work to scale to the high core density ThunderX system proved valuable for performance on many-core x86 systems.
For some of them, power measurements are presented: these numbers will be used as a baseline when comparing performance and power figures in the final Mont-Blanc 3 demonstrator under deployment in WP3.
This report collects two major contributions to the project:
1. In Section 1 we present the porting to OmpSs/OpenMP4.0 of several production applications and mini-apps of the project. We focused our report on explaining new functionalities of the OmpSs programming model that we consider disruptive. We show their actual benefits when executing on large HPC machines.
2. In Section 2 we report the results of the successful porting and test of CERE on ARM architectures. This allows regions of interest of large applications running on ARM platforms, called codelets, to be extracted and "replayed" on architectural simulators, reducing their simulation time, or to be used for the development of mini-applications.
This report collects the contributions of the Mont-Blanc 3 partners in the evaluation of the Dibona test platform. We structured the evaluation with a bottom-up approach, executing programs with an increasing level of complexity. Where appropriate, we complete each section with scalability and energy measurements.
- In Section 1 we briefly introduce the Dibona test platform that has been extensively described in D3.3 Detailed specification of the medium-sized test platform.
- In Section 2 we evaluate the simplest micro-benchmarks, exposing the basic architectural features such as the floating point throughput of the CPU, the structure of the memory subsystem and the bandwidth and the latency of the network.
- In Section 3 we report the results of the most relevant high-performance computing benchmarks LINPACK and HPCG together with the mini-apps Lulesh and commonly used solvers. For HPCG we include the description and the evaluation of a shared memory version of the benchmark developed within the Mont-Blanc 3 consortium.
- In Section 4 we present the tests performed on Dibona with production scientific applications combined with the runtime optimizations introduced in D6.5 Initial report on automatic region of interest extraction and porting to OpenMP4.0-OmpSs.
Beyond the pure technical content, this document is the result of a deep and continuous collaboration effort between WP3, where the Dibona test platform has been developed, and WP6, where the applications have been ported to the Dibona test platform. This process made it possible to implement a solid deployment method, making Dibona a production-ready machine.
This document summarises part of the work done on extensions to the message passing model within Work Package 7 of the Mont-Blanc 3 project. The largest part of the document is taken up by a detailed experimental evaluation of the performance of MPI libraries on the project's test platform Dibona. Particular focus is placed on the memory copy savings potential due to RDMA and direct access capabilities of the platform's hardware. The second part of the document is dedicated to a description of tasking support for MPI communication, though most of this effort has been reported elsewhere more comprehensively. In this document, we add in particular an approach to dedicate an OmpSs thread to MPI communication and describe the problems that were found while doing the evaluation on the test platform Dibona. An experimental evaluation was done as part of Work Package 6 and reported in D6.9.
In this deliverable, we revisit the heterogeneous device specifications (HDS) presented in deliverables D7.1 and D7.10, in order to check for inconsistencies or gaps in the HDS implementation. We continue by proposing a new scheduling policy, based on the Roofline model and HDS, adding it to those presented in deliverables D7.10 and D7.12. We cover the details of the application profile generation and of the scheduling policy applied at application start-up, which decides which device to run on. Results show that the Roofline Model scheduler achieves a high degree of accuracy, which could be even higher when applied to specific regions of code instead of application-wide. We include a brief description of the Activity Monitoring Unit, an optional upcoming architectural feature that can assist with power management decisions.
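As a rough illustration of how a Roofline-based choice of device can work, the sketch below applies the standard Roofline bound, attainable performance = min(peak compute, memory bandwidth × arithmetic intensity), to each candidate device and picks the highest. It is not the scheduler described in the deliverable: the function names, the device table and all numbers are hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound for a kernel with the given arithmetic intensity
    (floating-point operations per byte moved to or from memory)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def pick_device(devices, intensity):
    """Return the name of the device with the highest Roofline bound.

    devices -- mapping of name -> (peak GFLOP/s, memory bandwidth GB/s),
               standing in for a device specification; values are made up.
    """
    return max(devices,
               key=lambda name: attainable_gflops(*devices[name], intensity))


if __name__ == "__main__":
    devices = {
        "cpu": (200.0, 100.0),          # hypothetical host cores
        "accelerator": (2000.0, 40.0),  # hypothetical accelerator
    }
    print(pick_device(devices, 0.5))   # memory-bound profile
    print(pick_device(devices, 50.0))  # compute-bound profile
```

In this toy setup a memory-bound profile lands on the device with more bandwidth and a compute-bound profile on the device with the higher peak, which is the kind of decision an application profile taken at start-up could drive.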
This document presents the work done on the OmpSs programming model to support the compute accelerator (the Arm Scalable Vector Extension (SVE)), and to provide an OmpSs task scheduler oriented to the big.LITTLE architecture. It is based on previous work done on the Intel architecture, now ported to Arm. Experiments on actual ODROID-XU3 hardware, with 4 big and 4 LITTLE cores, show no penalty from the new task scheduler implementation. Task-based scheduling improves static threading performance by 23% on average. Experiments on a Dibona node show the performance obtained in Cholesky, STREAM, Matrix Multiplication, BlackScholes, and FluidAnimate. On Matrix Multiplication, the increase in performance when using the DDAST policy is 8%.
This final report summarises the current state of the art for the following topics:
- The current status and the future development of a new Arm-optimized Fortran compiler.
- The current status of various categorizations of mathematical libraries for the Armv8 architecture software ecosystem.
Overall it is shown how the range of packages available for users is looking very healthy for the deployment of real High Performance Computing (HPC) applications. The report emphasizes the importance of the Fortran programming language for HPC. It also explains that open source compilers ease adaptation of a Fortran code base to new hardware and various operating systems. The goal of the 'PGI Flang' project (and its successor, the 'F18' project) is to provide an LLVM-compliant open source Fortran compiler. Thanks to a Contributor License Agreement (CLA) between PGI and Arm, we were able to actively participate in the development of the Flang compiler and its successor. Since it was made public, the Flang project has gained a lot of attention and was tested for conformance with Fortran standards and performance of generated code. This document summarises our effort in ensuring good standards conformance and compatibility with the AArch64 architecture, including suitability for SVE vectorization. The second half of the report focuses on the latest developments in the support for the different numerical libraries available on Arm systems. This is split into four sections. In the first we focus on the optimizations that have been upstreamed this year for higher performing versions of various transcendental functions and the improvements these make for real HPC applications. Second, we discuss the ability to vectorize loops that call vector versions of mathematical functions. This is followed by an update on the Arm Performance Libraries development that has happened, with a particular focus on FFT performance.
Finally, the ongoing work with community codes, especially as provided through OpenHPC, is outlined.
This document describes TinyMPI, a prototype MPI implementation which follows the non-traditional approach of virtualizing MPI, and presents the results of a research effort which employed TinyMPI as a research vehicle. Traditional MPI implementations run at most one MPI rank per CPU core. TinyMPI runs more than one MPI rank per CPU core, i.e., it oversubscribes the CPU, with the goal of achieving automatic computation-communication overlap: when one MPI rank blocks, TinyMPI switches to another and continues using the CPU. We have used TinyMPI as a tool in the research effort of answering the question, "How many ranks exactly to start on each CPU core?", the results of which - in the form of a model - are presented in this document along with a description of TinyMPI's internals. TinyMPI supports the Arm architecture and is deployed on the Dibona cluster.
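To give a feel for the "how many ranks per core" question, here is a deliberately simple back-of-the-envelope model. It is not the model presented in the TinyMPI document, which is not reproduced on this page. It assumes every rank alternates a fixed compute phase with a fixed blocked (communication) phase, and that switching between ranks is free; both function names are hypothetical.

```python
import math


def core_utilization(ranks, compute_s, blocked_s):
    """Fraction of time the core does useful work when `ranks` ranks,
    each alternating compute_s of work with blocked_s of waiting,
    share one core (idealized: zero switching cost)."""
    return min(1.0, ranks * compute_s / (compute_s + blocked_s))


def ranks_to_saturate(compute_s, blocked_s):
    """Smallest number of ranks that keeps the core busy all the time
    under the same idealized assumptions."""
    return math.ceil((compute_s + blocked_s) / compute_s)


if __name__ == "__main__":
    # A rank that computes for 2 ms and then blocks for 6 ms.
    for k in range(1, 6):
        print(k, core_utilization(k, 2.0, 6.0))
    print("ranks needed:", ranks_to_saturate(2.0, 6.0))
```

A real model would also have to account for context-switch overhead, memory pressure and communication patterns that change with the number of ranks, so the numbers here are only indicative of the overlap effect.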
This report summarizes the dissemination activities carried out by the Mont-Blanc project in the October 2013 – September 2014 period. The dissemination activities are similar on both projects (Mont-Blanc 2011 – 2013 and 2013 – 2016). Specifically, the following pages give a complete list of conferences as well as the presentations made at various events and workshops related to the project. Furthermore, additional coverage of the project by the press and social media is also presented in this document, as well as other dissemination activities such as collaborations with other projects.
This report summarizes the dissemination activities carried out by the Mont-Blanc project in the October 2014 – September 2015 period. This period has been characterized by the promotion of the prototype deployment announcement and by the participation of the Mont-Blanc climbers' team in the ISC Student Cluster Competition.
This document, D3.2 Applications porting and tuning, reports in detail the activities related to T3.1, T3.2 and T3.4 during the first 21 months of the Mont-Blanc 2 project. During the same period, a limited part of the activities related to T3.3 (Application benchmarking) also started, in order to make a preliminary assessment of the code versions ported to the platforms made available to the consortium partners (see next section for platform disambiguation). The support activities related to T3.4 have provided significant help in the porting of Mont-Blanc applications, thus paving the way for their repeated use in T3.1 and T3.2 of Mont-Blanc 2.
This report describes work done in three areas relevant to the performance of the Mont-Blanc prototype system.
In this deliverable we present the extensions to OmpSs regarding the support of clusters. Within OmpSs we have implemented a caching system to deal with the data that must be sent to remote nodes to be processed there. A remote node can be another node in the cluster or an accelerator attached to it. This implementation has been done on the Intel architecture, and evaluated in a cluster with NVIDIA GPUs. Deliverable D4.1 presents the porting to the ARM architecture.
This deliverable report describes some of the work that has been done on optimizing the Linux operating system for running HPC applications on ARM. The first section describes problems found in the interconnect hardware/software stack and potential solutions. The second section describes some profiling infrastructure that will be used when looking at how to implement energy-aware scheduling policies at the kernel level.
The main objective of the Mont-Blanc project is to develop a European Exascale approach based on commodity power-efficient embedded technologies. After having successfully delivered the Mont-Blanc prototype in the first phase of the project, we now complement the efforts undertaken in the first three years by addressing challenges that our system needs to cope with in terms of massive parallelism, system resiliency and employment of future heterogeneous architectures. The latter is discussed in this deliverable, where we present our latest results on the assessment and applicability of heterogeneous architectures, with a particular emphasis on ARM big.LITTLE technology. Specifically, we focus our attention on the evaluation and improvement of task scheduling mechanisms for big.LITTLE platforms and we propose three load balancing algorithms targeting performance improvement of data-parallel applications in heterogeneous systems.
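The three load balancing algorithms are not detailed in this abstract. As a generic illustration of the underlying problem only, the sketch below splits the iterations of a data-parallel loop across big and LITTLE cores in proportion to an assumed relative speed per core, so that fast and slow cores finish at roughly the same time. The function name and the speed ratios are hypothetical and do not come from the deliverable.

```python
def split_iterations(total, core_speeds):
    """Divide `total` loop iterations among cores in proportion to their
    relative speeds (positive integers, e.g. 2 for big, 1 for LITTLE).

    Returns one iteration count per core; the counts sum to `total`.
    """
    if total < 0 or not core_speeds or min(core_speeds) <= 0:
        raise ValueError("need a non-negative total and positive speeds")
    speed_sum = sum(core_speeds)
    # Integer share for each core, rounded down.
    shares = [total * speed // speed_sum for speed in core_speeds]
    # Hand the few leftover iterations to the fastest cores first.
    leftover = total - sum(shares)
    fastest_first = sorted(range(len(core_speeds)),
                           key=lambda i: core_speeds[i], reverse=True)
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    # Four big cores assumed twice as fast as four LITTLE cores.
    print(split_iterations(1000, [2, 2, 2, 2, 1, 1, 1, 1]))
```

A static split like this only works when per-iteration cost is uniform and the speed ratio is known in advance; dynamic schemes adjust the shares at run time instead.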
For the OmpSs extensions part, in this deliverable we present three developments we have done in OmpSs. First, we have incorporated a resource specification in the programming model to allow programmers to tune the use of cores and devices in the execution of OmpSs tasks. As a result, the programmer can better guide the runtime to use more or fewer resources of a specific type and get better performance. Second, we have extended OmpSs to provide the capability to profile the execution of the OpenCL kernels to determine the most suitable kernel configuration. The Mercurium compiler allows the programmer to specify the ranges of values that should be analyzed, and the Nanos++ runtime does the exploration. Finally, we have further evaluated the performance of the OmpSs@cluster programming model, with 4 new benchmarks in the Mont-Blanc prototype.
The objective of this document is to provide a precise specification of this interface. In a second step, the interface will be implemented by the BSC OmpSs compiler and runtime group, and necessary monitoring components using this interface will be created for the WP5 performance tools (Extrae, Score-P) and debugging tools (Temanejo, DDT) developers.
This document describes the work done to integrate basic support for the Open Compute Language (OpenCL) into the unified measurement infrastructure Score-P. After an introduction to OpenCL and Score-P, the current status of the software prototype and preliminary results are presented in detail. The current prototype monitors important OpenCL API functions by intercepting them at link time and collecting the necessary data via library function wrapping. Data is captured on OpenCL functions regarding devices, kernels, memory objects and command queues. The prototype was tested with Intel, AMD and NVIDIA OpenCL implementations.
This deliverable presents the main features of MAQAO on ARM. After a study on the impact of vectorization and vectorization/energy tradeoffs on ARM32 architectures, we present the static analyses used on ARM and, briefly, the currently working instrumentation feature. Then we apply MAQAO to a benchmark in order to describe the hints given by the tool, and to SMMP, a Mont-Blanc application, in order to optimize it. Finally, we describe the ongoing work concerning data layout transformations.
This document describes BOAST, a metaprogramming framework to produce portable and efficient computing kernels for HPC applications. BOAST offers an embedded domain-specific language to describe the kernels and their possible optimizations. BOAST also supplies a complete run-time to compile, run, benchmark, and check the validity of the generated kernels. BOAST is being used in two flagship HPC applications, BigDFT and SPECFEM3D, to improve performance portability of those codes.
DownloadThis document describes the work done to implement the monitoring and control API specified in the previous D5.1 Mont-Blanc 2 deliverable. First, we present the description of the software components that define the API from the programming model perspective (OmpSs, Mercurium, and Nanos++) and the monitoring/debugging tools (Extrae/Paraver, Ayudame/Temanejo, and Score-P/Scalasca). Then, we go into details of the implementations developed in this period, to provide the functionality associated with the API.
DownloadIn this deliverable, we describe our modifications and enhancements to the Score-P instrumentation and measurement infrastructure as well as the Scalasca Tracing Tools package implemented within the Mont-Blanc project towards an integrated analysis of hybrid applications using multiple parallel programming models in combination. In particular, we focus on the support for the OmpSs and OpenCL programming models as well as the challenges introduced by the asynchronous nature of create/wait-type threading and task-based programming. Various examples highlight that Score-P and Scalasca now effectively support the performance analysis of hybrid codes using a single, coherent workflow and a unified result presentation.
DownloadThis document describes the work done to integrate the results and the representation mechanism from the Folding process developed at BSC into the Cube4 visualization tool developed at JSC. The Cube4 tool has been extended to augment its display and analysis capabilities via a plugin mechanism so that third party tools provide not only performance data to Cube but also new ways to represent the performance information within the Cube GUI. BSC has taken advantage of this extension to provide new visualization metaphors of its Folding mechanism to be able to represent in Cube4 the application progression in terms of performance and source-code between delimited code regions. This document also presents an example of the usage of this integration by describing the analysis of the BigDFT application.
DownloadThis deliverable is a preliminary report on state of the art software-based resiliency techniques for high performance computing (HPC). The document overviews the past resiliency challenges and the proposed solutions to address them. It reviews what the future resiliency challenges would be in exascale computing and tries to project research directions to tackle these problems.
DownloadMont-Blanc’s WP6 was set up to address the greater error rates expected in future Exascale systems due to increased component counts, smaller silicon geometries and other factors. D6.6 summarises the results so far from research into new fault-tolerant iterative sparse solvers based on the Conjugate Gradient (CG) method. Solvers of this type are very commonly used in scientific applications, so any improvement in the built-in fault tolerance of sparse iterative CG solvers should have a material impact on Exascale systems and on several of the Mont-Blanc applications. For example, Berlin Quantum Chromodynamics (BQCD) spends ~80% of its execution time in a CG solver, while EUTERPE also spends a significant portion of its run-time in a Jacobi Preconditioned Conjugate Gradient solver.
DownloadThe objective for the first six months of the Mont-Blanc Project can be summarized as having a fully functional framework in place in all work packages. This objective involves setting up the necessary technical infrastructure and an adequate methodology in each of the work packages.
DownloadThis document defines the dissemination objectives for the Mont-Blanc project, as well as the targets of its activities, the dissemination tools, the interaction with similar projects, the activities to be carried out during the project, and the policy used to disseminate the results. The aim of this document is to define the strategy for disseminating the project results, taking into account the significant impact that this project will have on society. This plan intends to raise awareness of and interest in the developed technologies and solutions among target groups such as users, the scientific community, the IT industry and the general public. The strong presence of leading HPC research institutions ensures a wide dissemination potential through scientific channels, while the industrial partners will focus more on exploitation and technology transfer activities. Most of the results will be published via academic and industrial channels by submitting scientific papers, and by holding workshops, courses and tutorials related to the new technologies.
DownloadThe objective of the Initial Press Release Deliverable is to 1) define a general strategy for creating and publishing press releases and 2) report on the outcome of the initial and follow-up press releases for the Mont-Blanc Project. This press release must be sent out by all partners to all their local press contacts, translated into local languages if needed. As stated in the Dissemination Strategy Document (D 2.1), future press releases will be planned during the project. The press release strategy defined should be consistent with the dissemination strategy and its objectives and will be maintained throughout the Mont-Blanc project.
DownloadThis document describes the structure, content and update process of the Mont-Blanc public web site (www.montblanc-project.eu). Web presence is a central element in the dissemination activities of Mont-Blanc, as indicated in the Dissemination Strategy Document (D 2.1). The website became publicly available in October 2011. The Barcelona Supercomputing Center, as coordinator of the project, hosts and maintains the website. This document describes how the website was created and how it will be maintained. It also describes the structure of the website and the functions that are available to the user.
DownloadThis report summarizes the dissemination activities carried out by the Mont-Blanc project in the October 2011 – October 2012 period. Specifically, the following pages provide a complete list of the conferences attended and of the project-related presentations made at various events and workshops. Any additional coverage of the project by the press and online media is also presented in this document. During this first year of Mont-Blanc, the consortium published one technical report and attended 49 conferences, workshops or seminars. Moreover, the consortium organized two successful training sessions, to which other EU-funded projects were invited. The high media impact of the project has raised high expectations among the HPC community. For this reason, the overall dissemination output of Mont-Blanc is an indication of the European excellence and recognition of the project partners.
DownloadThis document reports on the selection of the performance-critical kernels to be ported to the OmpSs [5] programming model during the course of the project. This work, started within WP3-T3.1 and now continuing in WP3-T3.2 and WP3-T3.3, pursues on the one hand increased performance and portability of the kernels themselves, thanks to the paradigm shift from a serial or thread-oriented model to a task-based model supported by an efficient run-time scheduler. On the other hand, it should devise a set of best practices to provide WP4 colleagues with helpful guidelines when porting full applications.
DownloadThis report refers to the activities planned in WP3 under Tasks 3.2 and 3.3. After completion of D3.1 we identified two subsets of application kernels: small-size and medium-size kernels. Following the status update of T3.2 given in D3.2, we describe here the final WP3 porting activities by reporting in detail the progress made since D3.2. As in D3.2, after the porting to ARM, we focused on two major issues affecting the results of the kernels’ development: (i) the porting to OmpSs; (ii) the porting to OpenCL. As already reported in D3.2, some preliminary benchmarking has been carried out on the available Mont-Blanc prototypes, but a full optimization of the most promising kernels (related to T3.3) will be made when the final system is released. The activities expected in T3.2 can be considered concluded, with most of the kernels preliminarily integrated into the full applications from WP4, even if some porting activities will continue in P3. In particular, (i) the development of the small-size kernels has been concluded, and work will continue on the three kernels integrated into the full applications they refer to; (ii) all the medium-size kernels were integrated into the corresponding full applications and passed OmpSs compilation; (iii) the porting of the medium-size kernels to OmpSs is almost complete, while the OpenCL porting is still in progress.
DownloadThis report refers to the activities planned in WP3 under Task 3.2. After completion of the WP3 activities in P2 of the Mont-Blanc workplan, we set up a repository containing the source and supporting files for all the kernels covered by this work package. The repository can be accessed at the URL http://wiki.montblanc-project.eu/index.php5/WP3_Optimized_application_kernels. In this document we report the details of the structure of the repository, with a brief description of its content.
DownloadThis deliverable shows the evaluation of the Mont-Blanc node using two sets of benchmarks: Standard and Mont-Blanc benchmarks.
DownloadThe Mont-Blanc project aims to assess the potential of clusters based on low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications to the different generations of platforms available in order to assess the global programmability and the performance of such systems. The first section introduces the different applications and their characteristics, the second section describes the platforms used by WP4 during the first year, the third section reports the progress of the porting and profiling of each of the 11 applications during the first year, and the last section gives perspectives on WP4 activities.
DownloadThe Mont-Blanc project aims to assess the potential of clusters based on low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications to the different generations of platforms available in order to assess the global programmability and the performance of such systems. Following the first report D4.1 “Preliminary report of progress about the porting of the full-scale scientific applications” [1], this report gives an overview of, and the results from, the final porting of all 11 applications to the different systems made available by the project or by partners.
DownloadThe Mont-Blanc project aims to assess the potential of HPC clusters based on low-power embedded components to address future Exascale HPC needs. The role of work package 4 (WP4, “Exascale applications”) is to port, co-design and optimise up to 11 real exascale-class scientific applications to the different generations of Mont-Blanc hardware platforms available in order to assess the global programmability and the performance of such systems. Following the first report D4.1 “Preliminary report of progress about the porting of the full-scale scientific applications” [1] and the latest report D4.2 “Final report about the porting of the full-scale scientific applications” [2], this report presents the work carried out by WP4 during the last year, based on a selected subset of scientific applications suited to the Mont-Blanc architecture and on specific optimisation and taskification work using OmpSs/OpenCL. The first results related to the optimisation performed on the selected set of applications are detailed in deliverable D4.2.
DownloadDue to the close relationship and rich cross-references between deliverables D4.4 “Report on the profiling, the optimisation and the benchmarking of a subset of application suited for performance and energy”, D4.5 “Report on the efficiency and performance evaluation of the application ported and best practices” and D4.6 “Final list of ported and optimized applications”, the decision has been taken to merge D4.4, D4.5 and D4.6 into a single physical document, in order to avoid redundancy and to provide a better reading and logical sequence.
DownloadIn this Mont-Blanc deliverable we present the current status of the porting to the ARM architecture of OmpSs (Mercurium compiler and Nanos++ runtime system), the Extrae instrumentation library and the Scalasca instrumentation facilities. In addition, we present an initial evaluation of the overhead observed in the OmpSs programming model when using Extrae instrumentation on the Intel architecture.
DownloadNowadays, topmost high performance computing (HPC) clusters use scalable distributed parallel file systems that are able to stripe data over multiple servers to achieve high performance in I/O as well. From our experience in the Storage Systems Research Group and given the requirements of the project, we chose a parallel file system that is very common, open-source and POSIX compliant, Lustre, as the first candidate to provide high performance I/O on our ARM cluster. Given that Lustre is open source, we are able to access its code and adapt it to our Linux kernel (provided by SECO) for the ARM architecture. In the meantime, we focused on the client part, since the server part is not expected to be executed in the ARM cluster. Thus, we started spending our efforts on adapting the code of the Lustre client modules to our specific kernel version. As expected, we got some important compilation errors due to kernel incompatibilities, since the last maintenance release of Lustre is compatible with kernel versions up to 2.6.32 whereas our current version is 2.6.36 (based on an Ubuntu Maverick distribution). However, we recently obtained a first patched version of the Lustre client that can perform almost all of the most common and important POSIX operations. The problem is that, due to circumstances we still do not control, executing some specific deletion operations causes the client to hang. From this deliverable on, we will devote more effort to understanding what is really happening, and whether the issue is related to the architecture or to the changes we made, which still need further review.
DownloadThe Mont-Blanc project will produce the first large-scale supercomputer based on ARM cores. The ARM architecture has been successfully used in the past in embedded and mobile platforms. However, the requirements and constraints of those platforms greatly differ from the needs of a High Performance Computing (HPC) system. One of these major differences is the system software used in each environment. Embedded and mobile computing programmers typically use Operating Systems and libraries customized for their target application (e.g., Android). Moreover, such platforms typically target applications that run in a single MPSoC chip. This is in contrast to a typical HPC environment, where general purpose operating systems (e.g., Linux) and scientific libraries (e.g., BLAS) are used to run applications in hundreds or thousands of compute nodes in parallel. This document describes initial work done to create a functional HPC system based on ARM cores, from the operating system, to the scientific libraries, and parallel execution. Such work involves not only porting system software to the ARM architecture, but also tuning these software components to fully exploit the characteristics of ARM cores. Similarly, the cluster management system also needs to be adapted to the characteristics of ARM-based nodes and to the goal of achieving very high energy efficiency.
DownloadWe aim to create an optimized software stack tailored to an ARM-based HPC system. As a result, we are looking at exploiting OS features that can improve performance. We investigate the effects of using hugepages through Transparent HugePages on a number of benchmarks and sample HPC applications, whilst running on the SoC chosen for Mont-Blanc: Exynos 5 Dual. We present results for both the PandaBoard and the Arndale.
DownloadAs the main goal of the Mont-Blanc project is to produce large-scale HPC clusters based on ARM processor architecture, one of its major challenges is to perform porting and tuning of already-existing system software for ARM-based HPC clusters. Deliverable 5.3 [MBD12a] summarizes our initial work in this regard (until month 12 of the project). In this deliverable, we report on the follow-up work in the second year of the project (month 12 - month 24). In particular, we summarize our efforts related to the parallel programming model and compiler, the development tools, and the scientific and runtime libraries. Furthermore, we report on the installation and customization of the operating system, the cluster monitoring and resource management, the performance monitoring and analysis tools, and the parallel distributed filesystem.
DownloadIn this deliverable, we present the current status of the low-level software components required for gathering information about the performance of HPC applications running on ARM-based systems. This work will enable performance monitoring tools to be ported to the Mont-Blanc prototype.
DownloadIn this deliverable, we present the current status of the prototype versions of the performance analysis tools, considered in the Mont-Blanc project. This includes the community instrumentation and measurement system Score-P, the performance analysis toolset Scalasca with its result browser CUBE, developed by Juelich Supercomputing Centre, and the Barcelona performance tool-suite, containing the instrumentation library Extrae, the analysis tool Paraver and the simulation tool Dimemas. For all of these tools, we describe the current status of the porting to the Mont-Blanc platform as well as the implemented extensions for supporting the OmpSs programming model.
DownloadIn this deliverable we present the power consumption measurement process and data acquisition of the Mont-Blanc prototype.
DownloadIn this deliverable, we present the current status of the prototype versions of the performance analysis tools considered in the Mont-Blanc project. This includes the community instrumentation and measurement system Score-P, the performance analysis toolset Scalasca with its result browser CUBE, developed by Jülich Supercomputing Centre, and the Barcelona performance tool-suite, containing the instrumentation library Extrae, the analysis tool Paraver and the simulation tool Dimemas. For all of these tools, we describe the current status of the porting to the Mont-Blanc platform, in particular the testing on the WP7 prototype, as well as the implemented extensions for supporting the OmpSs programming model.
DownloadThis document describes the status of the system software stack within the Mont-Blanc project. The work of populating a complete software stack for HPC and scientific computing has been performed since the beginning of the Mont-Blanc project (see deliverables D5.3 and D5.5). In this deliverable we report the work related to the third year and the extension of the project. As the project deployed the Mont-Blanc prototype during this period, based on 1080 SoCs each with dual-core CPUs + embedded mobile GPU, the effort has been focused on porting and tuning the Mont-Blanc system software, shown in Figure 1, to our final platform.
DownloadEnergy-efficient high performance computing extends beyond the use of energy-efficient low power processing hardware. With increasing variations in the power consumption depending on the workload of a high performance computing system, modern supercomputers need tighter integration with their surrounding data center infrastructure than ever before, causing new challenges for the design and operation of data centers and systems. The main aspects covered in this document are the power supply chain and the cooling system of the data center and the supercomputer.
DownloadThis deliverable provides the technical description of the final prototype system delivered to BSC for use within the Mont-Blanc project. This system consists of 1080 nodes that are deployed in two separate partitions, a small one for test and development and a large one for running applications. The latter is a separate entity and has its own interconnect and storage subsystems.
Download