Nvidia has made no secret of wanting to be a player in the supercomputer racket on both the GPU and CPU sides of a hybrid system. The company launched “Project Denver” nearly two years ago to create an Nvidia-branded chip, which will see Denver ARM processors timed to market with the future “Maxwell” GPUs two years from now. But maybe you can’t wait that long to get started on using a ceepie-geepie hybrid, and maybe you don’t want to build your own machine from expensive CPU and GPU cards. That’s where E4 Computer Engineering comes in.
Ten-year-old Italian cluster-maker E4 Computer – a big supplier of standard x86 clusters to the CERN lab in Switzerland among others – has partnered with SECO, the Italian firm that makes embedded x86 and ARM boards for various uses, to bring out baby ARM clusters with GPU options.
The Carma Microcluster and full-on Carma Cluster machines are based on Qseven embedded ARM processor boards that SECO creates for embedded customers, and you will recognize the Qseven ARM board if you follow the HPC market or build embedded systems. The ARM boards that are in the E4 Computer servers are similar to the ones that were originally used in the experimental “Mont Blanc” machine, which paired quad-core Tegra3 ARM-based chips with Nvidia mobile GPUs, at the Barcelona Supercomputing Center in Spain.
The Carma Microcluster rack and tower machines, so named because they run the CUDA parallel application development environment on ARM processors, were being shown off at the SC12 supercomputing conference in Salt Lake City last week, as were their Carma Cluster microserver variants.
The Carma cluster can be a 5U racker or a tower box
As E4 Computer correctly puts it, the current generations of quad-core Tegra3 processors from Nvidia and their 32-bit peers from other licensees of the ARM designs are somewhat challenged in the floating point department. But pairing an ARM processor with a GPU – essentially a modern-style, outboard math coprocessor like Intel used to offer in a special socket for x86 CPUs before they were brought on-chip with the 80486DX and Pentium chips – does the trick just nicely, as the Mont Blanc experimental machine demonstrates.
The second prototype system that BSC built was supposed to pair the Qseven card from SECO with a single Tegra3 processor, which has four Cortex-A9 cores running at 1.5GHz plus a fifth baby core for management. (This is the so-called “big.LITTLE” architecture that ARM is espousing to support different-sized workloads with a single chip.) That Tegra3 card has 4GB of memory and a Gigabit Ethernet port. An Nvidia GeForce 520MX GPU for laptops was implemented on a side board and linked to the Tegra3 board. That GPU is of the “Fermi” generation and has 48 cores. A Mont Blanc-2 1U server has eight of these Tegra3-GeForce 520MX combos in the box, and then 32 blades and 10 Gigabit Ethernet switches in a rack to deliver 38 teraflops of floating point oomph in a 5 kilowatt power envelope, for roughly 7.5 gigaflops per watt. That’s almost three times better performance per watt than big CPU or CPU-GPU machines delivered on the latest Top500 supercomputer rankings.
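For those who like to check the sums, here is a minimal Python sketch of the rack arithmetic using only the figures quoted above. It assumes that the "32 blades" in a rack are 32 of the eight-node 1U servers, which the text does not say outright, and the function and constant names are made up for illustration. With these rounded inputs the ratio lands at 7.6 gigaflops per watt, close to the 7.5 figure quoted.

```python
# Figures quoted in the article for the Mont Blanc prototype rack.
COMBOS_PER_SERVER = 8    # Tegra3 + GeForce 520MX pairs per 1U server
SERVERS_PER_RACK = 32    # ASSUMPTION: the "32 blades" are 32 of these servers
RACK_TERAFLOPS = 38.0    # quoted peak for the rack
RACK_KILOWATTS = 5.0     # quoted power envelope


def mont_blanc_rack_summary():
    """Return node count and derived ratios for the quoted rack figures."""
    nodes = COMBOS_PER_SERVER * SERVERS_PER_RACK
    gflops = RACK_TERAFLOPS * 1000.0
    watts = RACK_KILOWATTS * 1000.0
    return {
        "nodes": nodes,
        "gflops_per_node": gflops / nodes,
        "gflops_per_watt": gflops / watts,
    }


if __name__ == "__main__":
    for key, value in mont_blanc_rack_summary().items():
        print(f"{key}: {value:.2f}")
```

The per-node figure that falls out of this is only as good as the rounded rack totals, so treat it as a ballpark number rather than a spec.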
The Mont Blanc project has subsequently chosen an integrated CPU-GPU device for its supercomputer. And that is the Exynos 5 processor from Samsung Electronics, which implements two Cortex-A15 cores on a die using 32 nanometer processes and running at 1.7GHz; the chip also has an ARM Mali-T604 GPU.
The Tegra3-Quadro 1000M ceepie-geepie hybrid board, made by SECO
The Carma machines from E4 Computer use the newer Quadro 1000M mobile graphics cards (PDF) from Nvidia, code-named “Huron River,” which have 96 CUDA cores and which burn 45 watts. The Quadro 1000M delivers 270 gigaflops of single-precision (32-bit) floating point performance, which doesn’t sound like much when you see what the Nvidia Tesla K10 GPU coprocessor can deliver in terms of single-precision oomph, but it is about 80 per cent more SP floating point performance than the GeForce 520MX had – and that is a nice jump.
The Tesla K10 is the single-precision monster at Nvidia, delivering 4.58 teraflops, or 20.3 gigaflops per watt peak on a 225 watt card, compared to 6 gigaflops per watt for the Quadro 1000M. But the Carma Microcluster is a development machine, not a performance beast, so this is about putting a baby cluster in a box that software can be created and tested on, and for a much lower cost than a Cadillac Xeon-Tesla setup might have. The Carma machines also let companies get ahead of the curve on ARM-based iron.
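The card-level comparison is simple division, and a short Python sketch makes it easy to rerun with other cards. The peak gigaflops and wattage figures come straight from the text; the GeForce 520MX figure is not quoted anywhere and is only back-derived here from the "about 80 per cent more" remark, so treat it as an estimate. The helper names are illustrative.

```python
def gflops_per_watt(peak_gflops, watts):
    """Peak single-precision gigaflops delivered per watt of card power."""
    return peak_gflops / watts


# (peak SP gigaflops, watts) as quoted in the article.
CARDS = {
    "Quadro 1000M": (270.0, 45.0),
    "Tesla K10": (4580.0, 225.0),
}

# ASSUMPTION: "about 80 per cent more" is read as exactly 1.8x, which gives
# an implied peak for the GeForce 520MX. The article never states this number.
IMPLIED_520MX_GFLOPS = CARDS["Quadro 1000M"][0] / 1.8

if __name__ == "__main__":
    for name, (gflops, watts) in CARDS.items():
        print(f"{name}: {gflops_per_watt(gflops, watts):.2f} gigaflops/watt")
    print(f"Implied GeForce 520MX peak: {IMPLIED_520MX_GFLOPS:.0f} gigaflops")
```

The Quadro comes out at exactly 6 gigaflops per watt. The Tesla K10 works out to a shade under 20.4, so the 20.3 quoted above looks like a truncated rather than a rounded figure.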
The E4 Carma Microcluster development cluster based on SECO ARM boards
The Carma Microcluster has one SECO Qseven board per blade in a chassis that is 5U in size; that chassis can hold up to eight blades. The chassis can be mounted in a rack or tipped on its side and used as a tower server, perhaps tucking it in a closet or beside your desk. The Microcluster machine, ironically enough, has an internal x86 processor that is used for managing the blades in the chassis as well as for cross-compiling for both ARM and x86 processors from the same machine.
The Microcluster box has an aggregate of 2,160 gigaflops across those eight blades. With the x86 management node included, the whole machine draws 600 watts of power at system level, which works out to 3.6 gigaflops per watt at the system (rather than at the GPU) level.
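To see where the gap between the 6 gigaflops per watt of the bare GPU and the 3.6 gigaflops per watt of the whole box comes from, here is a rough Python sketch of the power budget. It assumes one Quadro 1000M per blade at the 45 watts quoted earlier, and it counts only GPU peak flops, ignoring whatever the ARM cores and the x86 management node contribute. The names are illustrative.

```python
BLADES = 8              # a fully populated Microcluster chassis
GPU_GFLOPS = 270.0      # Quadro 1000M single-precision peak, as quoted
GPU_WATTS = 45.0        # Quadro 1000M power draw, as quoted
SYSTEM_WATTS = 600.0    # whole chassis, including the x86 management node


def microcluster_budget():
    """Split the quoted system power into GPU and non-GPU shares."""
    peak_gflops = BLADES * GPU_GFLOPS
    gpu_watts = BLADES * GPU_WATTS
    return {
        "peak_gflops": peak_gflops,
        "system_gflops_per_watt": peak_gflops / SYSTEM_WATTS,
        "gpu_only_gflops_per_watt": peak_gflops / gpu_watts,
        # Everything that is not a GPU: ARM boards, x86 node, fans, PSU losses.
        "non_gpu_watts": SYSTEM_WATTS - gpu_watts,
    }


if __name__ == "__main__":
    for key, value in microcluster_budget().items():
        print(f"{key}: {value:.1f}")
```

On these assumptions the GPUs account for 360 of the 600 watts, which leaves 240 watts for everything else in the chassis.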
By the way, SECO does not play favorites. It has embedded boards with Freescale ARM processors and is ready to launch one with the latest OMAP ARM variant from Texas Instruments. And the company has been selling Qseven boards based on the Fusion G Series APUs, which have on-die Radeon HD6200 series GPUs, as well as on Intel Atom E600 series processors.
Production Carma boxinis
If you’re thinking of putting the Tegra-Quadro combo into production, E4 Computer has another machine that it thinks is more appropriate, and one that can be deployed either as an ARM-only setup for integer work or for ceepie-geepie jobs. The Carma Cluster is a microserver design that puts a dozen blades and two power supplies into a 3U rack enclosure. Each Carma2 blade has two Tegra3-Quadro 1000M boards on it, for a total of 24 Tegra3 processors and 24 Quadro GPUs per enclosure, yielding 6,480 gigaflops in that 3U enclosure. This machine is estimated to draw around 1,500 watts according to preliminary data from E4 Computer.
The Carma Cluster microserver chassis can do just ARMs, or ARMs plus GPUs
If you want to go CPU-only, to run web servers or do Hadoop Big Data munching, then there are blades known as Darma – possibly short for dual ARM servers – that put 48 Tegra3 chips (192 usable ARM cores) into a single chassis with four SECO cards on each blade. This Darma setup is estimated to draw 400 watts. You can mix and match the Carma2 and Darma microservers inside of the Carma Cluster chassis.
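Since the two blade types can share a chassis, a small Python sketch shows how the chip, core, and GPU counts add up for any mix. The per-blade board counts, the twelve slots, and the 270 gigaflops per Quadro come from the text; the `Blade` class and `chassis_totals` function are hypothetical names used only for illustration, and power is left out because E4 Computer has only given estimates for the two fully populated configurations.

```python
from dataclasses import dataclass

SLOTS = 12                   # blades per 3U Carma Cluster chassis
CORES_PER_TEGRA3 = 4         # usable Cortex-A9 cores per chip
GFLOPS_PER_QUADRO = 270.0    # Quadro 1000M single-precision peak


@dataclass(frozen=True)
class Blade:
    name: str
    tegra3_chips: int
    quadro_gpus: int


CARMA2 = Blade("Carma2", tegra3_chips=2, quadro_gpus=2)
DARMA = Blade("Darma", tegra3_chips=4, quadro_gpus=0)


def chassis_totals(carma2_blades, darma_blades):
    """Total chips, cores, GPUs, and GPU peak flops for a mixed chassis."""
    if carma2_blades < 0 or darma_blades < 0:
        raise ValueError("blade counts must be non-negative")
    if carma2_blades + darma_blades > SLOTS:
        raise ValueError(f"a chassis holds at most {SLOTS} blades")
    chips = (carma2_blades * CARMA2.tegra3_chips
             + darma_blades * DARMA.tegra3_chips)
    gpus = carma2_blades * CARMA2.quadro_gpus
    return {
        "tegra3_chips": chips,
        "arm_cores": chips * CORES_PER_TEGRA3,
        "quadro_gpus": gpus,
        "sp_gflops": gpus * GFLOPS_PER_QUADRO,
    }


if __name__ == "__main__":
    print(chassis_totals(12, 0))   # all Carma2
    print(chassis_totals(0, 12))   # all Darma
    print(chassis_totals(6, 6))    # an even split
```

The two homogeneous cases reproduce the figures quoted above: 24 chips and 6,480 gigaflops for an all-Carma2 chassis, and 48 chips with 192 cores for an all-Darma one.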
Clearly, getting a CPU and GPU on a single die, as Nvidia is planning to do with Project Denver, would be a much better option in terms of thermals and performance. But that is many years away, and machines such as those made by E4 Computer let you get started on the programming now, so that you will be ready then.
Pricing information was not available at press time for the Carma and Darma machines. ®