This page is obsolete. Communication should be done in the background (ToDo)!

MPI-SpeedUp for SpinPack (matrix in memory, updated Mar09)

[Figures: SpinPack MPI-SpeedUp measured (Mar09); MPI-Stress Benchmark (Mar09)]

We can now compare the new SC5832 machine (light green) with an Infiniband cluster with two QuadOpteron nodes (brown curve), which has enough bandwidth to show good scaling. Both scale well, but the SC5832 is better relative to peak GFLOP/s or energy consumption. The MPI-Stress benchmark shows that the Infiniband cluster has much higher latencies for collective communication. The SMP machines have latencies of 1.9us for a 4-socket DualOpteron and 3.1us for an 8-socket QuadOpteron system using OpenMPI-1.2.6. The Altix4700 at the LRZ has 510 usable IA64 processors per NUMAlink partition. The MPI speed depends very strongly on the MPI packet size (vertical lines, 128kB for the middle line).

Update 2018: the main part of the computation is SpMV, which is network-I/O bound with an (ideal) FLOP:Byte ratio of 2F:4B to 8F:8B (0.5 ... 1 FLOP/Byte).
The SiCortex has nodes with 6 MIPS cores, each 2 FLOP/clk * 700MHz, i.e. 8.4 GFLOP/s per node, but a network of 3GB/s/node (only 0.35GB/s/node measured at 100% load). This gives a nominal FLOP:Byte ratio of about 2 to 3 (8.4GF/3GB = 2.8) and a real one of 5GF/0.35GB = 14 FLOP/Byte. Power is 3.26W/core (19.5W/node, 2.3W/GF).
The Altix4700 has 4 FLOP/clk * 1.6GHz = 6.4 GF/s/core and 32*6.4 GB/s (NUMAlink4) for 512 cores = 0.4 GB/s/core, which gives 16 FLOP/Byte and looks excellent. But this holds only within a 512-core partition. Between partitions we have only 0.1 GB/s per blade pair (2 cores) = 50 MB/s/core, which gives 128 FLOP/Byte. This means the Altix4700 cannot benefit from its much better CPU, because during SpMV the CPU waits for inter-partition network data 99.2% of the time. In other words, only 0.8% of RPeak can be used for SpMV, whereas 7% of the SiCortex RPeak can (ideally) be used.
With 17.6W/GF the SpMV speed per watt is about 0.8%*1GF/17.6W = 0.45MF/W for the Altix4700 vs. 7%*1GF/2.3W = 30MF/W for the SiCortex (MIPS). That means the SiCortex is about 67 times more energy efficient in SpMV, assuming the same power consumption for HPL and SpMV computation, or about 8 times better compared to a single Altix4700 partition with 512 cores.
By the way, the HLRBII Altix4700 had 4GB/core and dates from 04/2007: 1.1MW/9728cores = 113W/core, mass = 10.6kg/core, maxJobTime = 48h.
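
The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not part of SpinPack. The function and variable names are made up for illustration, and it assumes the upper end of the SpMV range (1 FLOP/Byte) and the per-unit figures quoted in the text. Because the text rounds intermediate values (0.8%, 7%), the unrounded results differ slightly from the numbers given there.

```python
# Machine balance (FLOP the CPU can execute per byte delivered by the network)
# and the resulting ideal SpMV efficiency. All names here are hypothetical.

def machine_balance(gflops, gbytes_per_s):
    """FLOP per network byte for a given compute rate and network bandwidth."""
    return gflops / gbytes_per_s

def usable_fraction(balance, spmv_flop_per_byte=1.0):
    """Ideal fraction of the compute rate usable when the network is the limit.

    Assumes SpMV needs 1 FLOP per transferred byte (upper end of 0.5 ... 1).
    """
    return min(1.0, spmv_flop_per_byte / balance)

def mflops_per_watt(fraction, watt_per_gflop):
    """SpMV MFLOP/s per watt, given the power per GFLOP/s."""
    return fraction * 1000.0 / watt_per_gflop

# (GFLOP/s, GB/s, W per GFLOP/s) per node or per core, as quoted in the text
systems = {
    "SiCortex (measured)":       (5.0, 0.35, 2.3),
    "Altix4700 intra-partition": (6.4, 0.40, 17.6),
    "Altix4700 inter-partition": (6.4, 0.05, 17.6),
}

def summarize(systems):
    """Return {name: (balance, usable fraction, MFLOP/s per W)}."""
    rows = {}
    for name, (gf, gbs, w_per_gf) in systems.items():
        balance = machine_balance(gf, gbs)
        fraction = usable_fraction(balance)
        rows[name] = (balance, fraction, mflops_per_watt(fraction, w_per_gf))
    return rows

if __name__ == "__main__":
    for name, (balance, fraction, eff) in summarize(systems).items():
        print(f"{name:27s} {balance:6.1f} FLOP/Byte "
              f"{100 * fraction:5.2f}% usable {eff:6.2f} MFLOP/s/W")
```

Dividing the MFLOP/s/W value of the SiCortex by that of the Altix4700 gives the energy-efficiency ratios discussed above. Changing `spmv_flop_per_byte` to 0.5 shows the lower end of the SpMV range.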