Speed test using spinpack

Here you can find information on how to measure the speed of your machine running spinpack.

First you have to download spinpack-2.15.tgz (or a newer version). Uncompress and untar the file, configure the Makefile, compile the sources and run the executable. Here is an example:

       gunzip -c spinpack-2.15.tgz | tar -xf -
       cd spinpack

       # --- small speed test --- (1CPU, MEM=113MB, DISK=840MB, nud=30,10)
       ./configure --nozlib
       make speed_test 
       sh -c "( cd ./exe; time ./spin ) 2>&1 | tee speed_test_small" 

       # --- big speed test --- (16CPUs, MEM=735MB, DISK=6GB, nud=28,12)
       ./configure --mpt --nozlib
       make speed_test; grep -v small exe/daten.i1 >exe/daten.i
       sh -c "( cd ./exe; time ./spin ) 2>&1 | tee speed_test_big"
   

Send me the output files together with the characteristic data of your computer for comparison. Please also add the output of "grep FLAGS= Makefile" and cpu.log if you have them.

Computation time

The next table gives an overview of the computation time for an N=40 site system (used for the speed test); measurements started in 2003. The first column gives the numbers of up and down spins (nud) as set in daten.i. The other columns list the time needed for writing the matrix (SH) and for the first 40 iterations (i=40) as shown by the output; the newer tests list the time for the first 100 iterations (i100) instead. The star (*) marks the default configuration used by make speed_test (see above). The double star (**) marks an example of the big speed test (see above).

new tests:
            t=[[hh:]mm:]ss
   nud      ns       SH     i100    12sij CPUs machine          time=[hh:]mm:ss(+-ss) dflt: v2.26 -O2
  ------+---------------------------------+--+--- E=-8.22686823 SMag= 0.14939304 ----------------------
   32,8     39     1:36       37    16:43   1  Altix330-64GB 8x2-IA-64.m2.r1-1.5GHz v2.26 g++4.1 -O2
   32,8     24       50       22     8:37   2 h_get-inlined.i100=22s
   32,8     15       26       17     4:41   4 h_get-inlined.i100=17s
   32,8     14       16     2:40     2:58   6
   32,8     15       13     3:52     2:13   8  tmp=/dev/shm or DSK cpu 0,1,6,7,10,13,14,15
   32,8     14       14     1:32     2:23   8 # pmshub, linkstat, shubstats -cachetraffic,memdir,linkstats
   32,8     14       12     4:12     2:13   8 # numactl -i all 2m, -l 3m
   32,8     14        9     5:18        -  12  tmp=/dev/shm
   32,8     14       18     5:12     1:34  12
   32,8     14       12     4:32     2:01  16
   30,10  6:39    19:15    12:54        -   1 inlined_h_get
   30,10  4:11     9:54    10:17        -   2 inlined_h_get
   30,10  2:33     4:55     6:20        -   4 inlined_h_get
   30,10  2:29     2:41     4:25        -   8 inlined_h_get
   28,12 42:13  2:11:21  1:48:02 23:49:57   1 inlined_h_get (157m -> 108m)
   28,12 26:15  1:10:32  1:29:55 12:16:00   2 h_get_inlined (292m -> 90m)
   28,12 16:33    36:05  1:19:55  6:12:54   4 h_get_inlined (346m -> 80m)
   28,12 16:07    19:08    38:07  3:10:28   8 h_get_inlined (412m -> 38m)

   32,8     37     2:43     1:51    27:56   1 GS160-24GB 4x4-alpha-731MHz v2.26 cxx-6.5
   32,8     28     1:23     1:20    15:05   2 h_get_inlined (3:09 -> 1:20)
   32,8     16       43     1:09     7:17   4 h_get_inlined (5:10 -> 1:09)
   32,8     10       29       45     3:44   8 h_get-inlined (6:29 -> 0:45)
   32,8      9       14       42     1:54  16 (100i: 5:21 -> 42 (under load))
   30,10  6:24    32:29    38:02      -     1    
   30,10  4:38    16:47    25:30      -     2
   30,10  2:45     8:50    20:55      -     4 h_get inlined
   30,10  1:40     4:37    14:02    47:53   8 h_get inlined (87m -> 14m=8*(6-11m))
   30,10  1:48     3:12    11:10    24:38  16 h_get inlined (73m -> 11m) used
   28,12 11:01    35:19  1:52:40      -     8 h_get inlined (476m->123m, other jobs running)
   28,12 10:59    17:10      -        -    16 h_get inlined (424m->?)
   cxx -g1 -pthread -pg; ./spin; gprof -b -F inc1 ./spin  # gmon.out
   34,6 16CPUs (real=12s user=147s, 67s h_get, 59s hamilton_geth_block)
   34,6  8CPUs (real=13s user= 94s, 43s h_get, 34s hamilton_geth_block)
   cxx -g1 -pthread -p; ./spin; prof ./spin  # mon.out
   ... + hiprof -pthread -run
   cxx -g3 -pthread; pixie -pthread -run ./spin; prof -pixie -threads -all spin *.Counts*
   35,5  1:07     1:35     1:44    - 8  #        
   hiprof -threads -run ./spin; gprof -b -scaled -all spin.hiprof *.?.hiout (-asm|-lines -f h_get )
   32,8  3:47      1:24       48   - 8  # h_get-inlined 410s(8*50s)

   30,10 4:46     26:09    29:37   - 1  # under load! GS1280-128GB-striped 32-alpha-1150MHz v2.26 cxx-6.5
   30,10 2:08      6:21    15:45   - 4  # under load! GS1280-128GB-striped 32-alpha-1150MHz v2.26 cxx-6.5

old tests:
   nud     SH-time  i=40-time CPUs machine          time=[hh:]mm:ss(+-ss) dflt: v2.15 -O2
  -------+---------+---------+--+------------------- E=-8.22686823 ----------------------
   32,8      3:32     9:20    1  Via-C3-1GHz-64k-gcc-3.3 v2.19 -O2 -msse -march=i586 lt=245s          (-T 255MB/s)
   32,8      1:25     3:10    1  Celeron-1GHz-gcc_2.95.3 v2.18 -O2                   lt=59s (rl: cache=256kB disk=26MB/s dskcache=168MB/s)
   32,8      0:53     1:56    1  Centrino-1.4GHz-gcc-3.3 v2.19 -O2 -msse -march=i586 lt=40s   3s/5It  (-T 986MB/s)
   32,8      2:02     4:10    1  Centrino-600MHz-gcc-3.3 v2.19 -O2 -msse -march=i586 lt=94s   4s/5It  (-T 858MB/s) speed-step
   32,8      2:11     4:18    1  Centrino-600MHz-gcc-3.3 v2.19 -O2 -msse -march=i686 lt=95s   4s/5It  (-T 858MB/s) speed-step
   32,8      0:50     2:04    1  Pentium4-2.5GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse B_NL2=4
   32,8      1:22     2:27    1  Pentium4-2.6GHz-gcc-3.3 v2.26p1 -O2  B_NL2=0 L2=512kB fam=15 model=2 stepping=9 (no inline h_get) no matter if h_get/h_put inlined, B_NL2=2
   32,8      1:03     2:21    1  AthlonXP-1.7GHz-gcc-3.2 v2.17 -O2 -march=athlon-xp -m3dnow               (           hda=55MB/s)
   32,8      0:55     2:21    1  AthlonXP-1.7GHz-gcc-3.3 v2.21 -O4 -march=athlon-xp -m3dnow lt=39s 6s/5It (-T 408MB/s hda=48MB/s)  i65=2m32s+18s 66MB/48MB*s*65=89s
   32,8      1:14     4:32    1  Xeon-2GHz-v2.18-gcc-3.2 -O2 -march=i686 -msse 4x4 lt=1:00
   32,8      0:43     1:18    1  Xeon-2660MHz-2M-8GB-v2.25-gcc-4.1.1 -O2 amd?64bit  4x4 model6 stepping4  lt=22s n2=24s 3s/10It 2*DualCore*2HT=8vCPUs bellamy
   32,8      0:43     1:19    1  Xeon-3GHz-12GB-v2.24-gcc-4.1 -O2   64bit      4x4 lt=0:22 model4 stepping10
   32,8      0:49     1:31    1  Xeon-3GHz- 2GB-v2.25-gcc-4.0.2 -O2 32bit      4x4 lt=0:26 model4 stepping3
   32,8      1:21     3:13    1  GS160-Alpha-731MHz-cxx v2.17 -fast -g3 -pg Compaq C++ V6.3-008
   32,8      1:42     3:23    1  GS160-Alpha-731MHz-g++4.1.1 v2.24 -O2                         lt=47s n2=52s 12s/10It
   32,8      1:43     3:23    1  GS160-Alpha-731MHz-g++4.1.1 v2.24 -O2 -mcpu=ev67 -mtune=ev67  lt=47s n2=53s 12s/10It
   32,8      1:39     3:11    1  GS160-Alpha-731MHz-gcc4.1.1 v2.24 -O3 -funroll-loops -fomit-frame-pointer -ffast-math -mcpu=ev67 -mtune=ev67 lt=41s n2=46s 11s/10It
   32,8      0:59     2:16    1  ES45-Alpha-1250MHz-cxx-6.3 -fast v2.18 lt=0:59 2x2
   32,8      3:16     7:04    1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA  B_NL2=2  
   32,8      3:11     6:39    1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA  B_NL2=0  
   32,8      3:13     5:56    1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lpthread
   32,8      1:59     4:54    2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lpthread
   32,8      1:23     4:36    4  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lpthread
   ----------------------------  n1=5.3e6 113MB+840MB E=-13.57780124 ---------
   30,10      23m      76m    1  Pentium-1.7GHz-gcc v2.15 -lgz
   30,10      21m      50m    1  Pentium-1.7GHz-gcc v2.15
   30,10    12:58    44:01    1  AthlonXP-1.7GHz-gcc-3.3 v2.21 -O4 -march=athlon-xp -m3dnow      (lt=6m44s hda=48MB/s, cat  40x800MB=15m, 48%idle) 3m/5It i65:r=60m,u=34m,s=4m (also pthread)
   30,10    15:15    38:29    1  AthlonXP-1.7GHz-gcc-3.2 v2.17 -O2 -march=athlon-xp -m3dnow      (lt=7m29s hda=55MB/s, cat  40x800MB=15m, 40%idle)
   30,10    15:34    45:28    1  AthlonXP-1.7GHz-gcc-3.2 v2.18 -O2 -march=athlon-xp -m3dnow -lgz (lt=7m29s hda=55MB/s, zcat 40x450MB=13m,  1%idle)
   30,10    11:51    26:31    1  Pentium4-2.5GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse B_NL2=4
   30,10    15:59    40:09    1  Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse -g -pg
   30,10    14:26    34:08    1  Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse
   30,10     8:16    25:09    2  Xeon-2GHz-gcc-3.2 v2.17 -O2 -march=i686 -msse (slow nfs-disk)
   30,10    14:40    32:26    1  Xeon-2GHz-gcc-3.2 v2.18 -O2 -march=i686 -msse 4x4 lt=10m34
   30,10     8:44    16:59    1  Xeon-3GHz-12GB-v2.24-gcc-4.1.1 -O2 64bit      4x4 lt=3m31 model4 stepping10 n2=3m55 65s/10It
   30,10     9:57    19:08    1  Xeon-3GHz- 2GB-v2.25-gcc-4.0.2 -O2 32bit      4x4 lt=4m34 model4 stepping3  n2=4m58
   30,10     8:52    17:53    1  Xeon-3GHz- 2GB-v2.24-gcc-4.1.1 -O2 32bit      4x4 lt=4m25 model4 stepping3  n2=4m50
   30,10     8:27    16:48    1  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=3m43 model6 stepping4  n2=4m10 62s/10It 2*DualCore*2HT=8vCPUs bellamy
   30,10     3:19    11:36    4  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=2m10 model6 stepping4  n2=2m37 76s/10It 2*DualCore*2HT=8vCPUs bellamy
   30,10     1:57     7:49    8  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit 4x4 lt=1m07 model6 stepping4  n2=1m35 55s/10It 2*DualCore*2HT=8vCPUs bellamy
   30,10     6:56    15:15    1  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=4m18 n2=04:37 (116s/10It) Knoppix-3.8-32bit
   30,10     4:04    11:12    2  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=2m40 n2=02:58 (63s/10It) cpu5+7
   30,10     2:20     9:05    4  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m39 n2=01:57 (72s/10It) cpu3-6 (2*HT included?)
   30,10     2:47     6:33    4  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=1m23 n2=01:40 (52s/10It) cpu3-6 (2*HT included?)
   30,10     1:15     4:42    8  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m37 n2=01:55 (24s/10It)
   30,10     1:03     4:10  2*8  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m37 n2=01:55 (17s/10It) (4CPUs * 2Cores)
   30,10     1:00     4:33  4*8  4xDualOpteron885-2600MHz-gcc-3.3.5 32bit lt=1m38 n2=01:56 (17s/10It) (4CPUs * 2Cores) ulimit -n 4096
   30,10     2:43     8:57    2  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=1m51 n2=02:10 (46s/10It) kanotix2005-04-64bit   16G-RAM tmpfs=...MB/s
   30,10     2:02     5:34    4  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=1m05 n2=01:23 (32s/10It) kanotix2005-04-64bit   16G-RAM tmpfs=...MB/s
   30,10     1:08     3:16    8  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=0m42 n2=01:00 (17s/10It) kanotix2005-04-64bit   16G-RAM tmpfs
   30,10     5:16    13:19 a2 1  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=2m24 n2=02:41 (i56,1m12s/10It) zizj=15m07 all=18m58s  SLES10-64bit   32G-RAM (8virtCPUs) ltrace: rand=450s fread=223s
   30,10     0:54     3:25 a2 16 4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=0m41 n2=00:58 (i49,2s/It) sisj=13m57 slow!            SLES10-64bit   32G-RAM (8virtCPUs)
   30,10     8:14    29:52    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.17 -fast -lz (sun4u) lt=4:14  4 threads
   30,10    19:31  1:03:28    1  SunFire-880-SparcIII-750MHz-CC-5.3  v2.17 -fast     (sun4u) lt=9:50 16 threads   2048s/40*168e6=0.30us
   30,10    27:28  1:14:28    1  SunFire-880-SparcIII-750MHz-g++2.95 v2.17 -mv8 -O2  (sun4u) lt=22:32 4 threads
   30,10    30:24    59:33    1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -mv9 -O2  32-v8+  lt=12:01 4m00/10It 8 threads
   30,10    28:20    55:14    1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -mv9 -O2  64-v9   lt=9:59  3m54/10It 8 threads
   30,10    29:47    59:49    1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -profile-use 64v9 lt=9:52  4m44/10It 4 threads
   30,10    28:31    56:35    1  SunFire-880-SparcIII-750MHz-g++-4.1 v2.25 -mcpu=ultrasparc3 -mvis -mtune= 64v9 lt=9:19 n2=10m18 4m27/10It 4 threads
   30,10    28:10    55:32    1  SunFire-880-SparcIII-750MHz-gcc-4.1 v2.25 -O3 -mcpu=ultrasparc3 -mtune=   64v9 lt=8:56 n2=9m50  4m22/10It 4 threads
   30,10     9:12    25:51    4  SunFire-880-SparcIII-750MHz-gcc-4.1 v2.25 -O3 -mcpu=ultrasparc3 -mtune=   64v9 lt=3:13 n2=4m06  3m08/10It 4 threads
   30,10     7:52    21:40    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.19 -fast     (sun4u) lt=6:11  4 threads  (55s/5It) vbuf=16M
   30,10     7:24    26:45    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.17 -fast     (sun4u) lt=4:11  4 threads  4*910s/40*168e6=0.54us
   30,10     7:12    26:28    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.17 -fast -O4 (sun4u) lt=4:05  4 threads  4*911s/40*168e6=0.54us
   30,10     3:44    16:58    8  SunFire-880-SparcIII-750MHz-CC-5.3  v2.17 -fast     (sun4u) lt=4:23 16 threads  8*532s/40*168e6=0.63us
   30,10    13:42    25:56    1  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3 4x4    (64bit) v2.25+ lt=4m26 n2=4m53 110s/10It (8virtCPUs,4DualCore) 32GB
   30,10     9:04    20:26    1  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=4m50 n2=5m20 81s/10It (8virtCPUs)
   30,10     9:01    19:30    1  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra3 -xarch=v9 (64bit) v2.25+ lt=4m37 n2=5m06 80s/10It (8virtCPUs)
   30,10     5:26    13:49    2  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=3m16 n2=3m46 69s/10It (8virtCPUs)
   30,10     3:07     8:16    4  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=1m49 n2=2m21 42s/10It (8virtCPUs)
   30,10     1:46     7:43    8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=2m55 n2=3m59 29s/10It (8virtCPUs)
   30,10     1:23     5:05  2*8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=1m34 n2=2m04          (8virtCPUs,4DualCore)
   30,10     2:00     5:39  2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3        (64bit) v2.25+ lt=1m29 n2=1m55 26s/10It (8virtCPUs,4DualCore) 32GB
   30,10     2:04     5:40  2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O2 -mcpu=ultrasparc3        (64bit) v2.25+ lt=1m25 n2=1m53 26s/10It (8virtCPUs,4DualCore) 32GB
   30,10    14:25    26:09    1  ES45-Alpha-1250MHz-gcc-3.2.3 -O2 v2.18 lt=6:09 2x2
   30,10    12:13    22:15    1  ES45-Alpha-1250MHz-cxx-6.3 -fast v2.18 lt=4:14 2x2 (ev56)
   30,10    22:55    53:52    1  GS160-Alpha-731MHz-gcc4.1.1 v2.24 -O3 -funroll-loops -fomit-frame-pointer -ffast-math -mcpu=ev67 -mtune=ev67 lt=6m56s n2=7m46 5m41s/10It 17jobs/16cpus
   30,10    13:20    32:30    4  GS160-Alpha-731MHz-gcc4.1.1 v2.24 -O3 -funroll-loops -fomit-frame-pointer -ffast-math -mcpu=ev67 -mtune=ev67 lt=5m08s n2=5m59 3m21s/10It 18jobs/16cpus
 * 30,10      24m      64m    1  GS160-Alpha-731MHz-cxx-6.3 v2.15
   30,10    19:00    48:14    1  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -g3 -pg (42% geth_block, 27% b_smallest, 16% ifsmallest3)
   30,10    21:12    50:37    1  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast
   30,10    19:36    59:44    1  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
   30,10    12:15    36:16    2  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
   30,10     8:24    24:17    3  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast 16 threads
   30,10     7:40    26:36    3  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -pthread  4 threads
   30,10     7:48    53:00    10 GS160-Alpha-731MHz-cxx-6.3 v2.15
   30,10     3:50    18:23    16 GS160-Alpha-731MHz-cxx-6.3 v2.15 ( 64 threads)
   30,10     3:33    15:19    16 GS160-Alpha-731MHz-cxx-6.3 v2.15 (128 threads) simulates async read
   30,10    21:20    43:55    1  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=06:50 4m/10It
   30,10    19:44    46:16    2  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=05:46 (1m59s..2m54s)/5It (work load, home)
   30,10    12:18    34:11 a2 2  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=05:38 (20s..22s)/a2It (work load, home, a2=53It/34m)
   30,10     5:35    12:41    16 GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=02:51  1m/10It  (640%CPU)
   30,10    12:55    23:15    1  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=04:33 1m26s/10It = 10*840MB/1m26s=98MB/s                  10*hnz/86s/1=20e6eps/cpu 50ns (max.80ns)
   30,10     2:19     5:59    8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=02:11 22s/10It (14%user+5%sys+81%idle (0%dsk) von 32CPUs) 10*hnz/22s/8=10e6eps/cpu
   30,10     1:38     4:11    16 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=01:47 12s/10It
   30,10     1:46     3:48    32 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=01:25  9s/10It
   30,10  1:01:10  4:25:28    1  O2100-IP27-250MHz-CC-7.30 v2.15 -O3 -lz
   30,10    50:06  3:12:22    1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lz
   30,10    30:14  2:00:42    2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA -lz
   30,10    41:50  1:35:44    1  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
   30,10    54:00  1:45:15    1  O2100-IP27-250MHz-CC-7.30 v2.17v3 ssrun -64 -O2 -IPA lt=00:20:33 geth_bl=2200s latency?=2030s/60*168e6=0.20us (XY_NEW+sortH)
   30,10    47:06  1:36:56    1  O2100-IP27-250MHz-CC-7.30 v2.17v3 ssrun -64 -O2 -IPA lt=00:20:40 geth_bl=2090s latency?=1928s/60*168e6=0.19us (XY_NEW)
   30,10    26:52  1:14:28    2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA (HBLen=1024 about same)
   30,10    16:50  1:13:51    8  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
   30,10    19:23  1:16:17    4  O2100-IP27-250MHz-CC-7.30 v2.18 -64 -O2         4x4 hnz+15% lt=00:11:33
   30,10    19:08  0:59:29    4  O2100-IP27-250MHz-CC-7.30 v2.18 -64 -Ofast -IPA 4x4 hnz+15% lt=00:13:00
   30,10    44:22  2:11:25    2  MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA 2x2 lt=23m (8m00s)/5It CFLOAT
   30,10    44:14  2:12:54 a2 2  MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA 2x2 lt=25m (2m05s)/a2It CFLOAT i45=2h23m
   30,10    22:08  2:03:54    4  MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA                   4x4 lt=47m (6m49s)/5It 
   30,10    22:04  2:12:33 a2 4  MIPS--IP25-194MHz-CC-7.21 v2.19 -64 -Ofast -IPA                   4x4 lt=47m (2m09s)/a2It  i51=2h36m
   30,10    23:13  1:45:24    4  MIPS--IP25-194MHz-gcc-323 v2.19 -O2 -mips4 -mabi=64 -mcpu=orion   4x4 lt=21m (7m35s)/5It  read=20k 
   30,10    20:22  1:32:46    4  MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA                   4x4 lt=14m (7m15s)/5It 
   ----------------------------  n1=35e6  -18.11159089 735MB+6GB
   28,12 10:40:39 20:14:07    1  MIPS--IP25-194MHz-CC-7.21 v2.18 -64 -Ofast -IPA 1x1 lt=2h51m 50m/5It (dd_301720*20k=354s dd*5=30m /tmp1 cat=6GB/352s=17MB/s)
   28,12  6:04:55 16:03:42    2  MIPS--IP25-194MHz-CC-7.21 v2.18 -64 -Ofast -IPA 2x2 lt=3h00m 52m/5It (ToDo: check time-diffs It0..It20?)
   28,12  5:40:49 14:05:49    2  MIPS--IP25-194MHz-CC-7.30 v2.18 -64 -Ofast -IPA 2x2 lt=1h49m 49m/5It      FLOAT npri=40 MaxSym=170, write=20480 read=? (2cat=6GB/451s 5*6GB=38m)
   28,12  3:14:01 10:09:10    4  MIPS--IP25-194MHz-CC-7.30 v2.18 -64 -Ofast -IPA 4x4 lt=1h26m 41m/5It      FLOAT npri=40 (was reset?) (4cat=6.6GB/469s)
   28,12  3:25:09 11:04:01    4  MIPS--IP25-194MHz-CC-7.30 v2.18 -64 -Ofast -IPA 4x4 lt=1h32m 45m46s/5It  CFLOAT npri=40
   28,12  3:43:02 13:31:30    4  MIPS--IP25-194MHz-gcc-323 v2.19 -O2 -mips4 -mabi=64 -mcpu=orion   4x4 lt=2h11m (57m)/5It  read=20k 
   28,12  3:42:38 13:07:44 a2 4  MIPS--IP25-194MHz-gcc-323 v2.19 -O2 -mips4 -mabi=64 -mcpu=orion   4x4 lt=2h22m (16m)/a2It  read=20k  i55=17h
   28,12  3:14:22 12:28:31    4  MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA                   4x4 lt=1h25m (59m)/5It 
   28,12  3:15:36 12:00:46 a2 4  MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA                   4x4 lt=1h25m (16m)/a2It            i54=15h51m
   28,12  3:42:27 10:32:52    2  O2100-IP27-250MHz-CC-7.30 v2.17 -64 -Ofast -IPA
   28,12     171m       7h    1  Pentium-1.7GHz-gcc   v2.15
   28,12       5h      10h    1  GS160-Alpha-731MHz-cxx-6.3 v2.15
** 28,12    57:39  5:29:57    16 GS160-Alpha-731MHz-cxx-6.3 v2.15 (16  threads)
   28,12    59:22  2:51:54    16 GS160-Alpha-731MHz-cxx-6.3 v2.15 (128 threads)
   28,12  3:03:00 10:04:03    1  GS160-Alpha-731MHz-cxx-6.3 v2.17pre -fast
   28,12  1:13:27  5:45:12    3  GS160-Alpha-731MHz-cxx-6.3 v2.17 -fast -pthread 16
   28,12  1:49:31  4:29:09    4  GS160-Alpha-731MHz-cxx-6.3 v2.18 -fast lt=25m home 10It/32..77m 7.5GB/635s=12MB/s(392s,254s,81s) tmp3=160s,40s,33s tmp3_parallel=166s,138s
   28,12    52:57  2:17:00    8  GS160-Alpha-731MHz-cxx-6.5 v2.19 -fast lt=24m  13m30s/10It 
   28,12  2:00:56  4:08:31    1  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=53:23  17m17s/10It = 10*6GB/17m17s=58MB/s (3GB_local+3GB_far)          12e6eps/cpu
   28,12  1:12:02  2:18:26    2  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=20:39  11m08s/10It = 10*6GB/11m08s=90MB/s                               9e6eps/cpu
   28,12    40:36  1:21:40    4  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=16:06   6m13s/10It
   28,12    23:20    50:20    8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=13:08   3m26s/10It
   28,12    21:35    53:10    8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=13:13   (2m04s..4m41s)/10It HBlen=409600 10*6GB/2m=492MB/s hnz*10/2m/8=10e6eps/cpu
   28,12    14:01    32:17    16 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=10:46   1m51s/10It
   28,12    13:09    27:50    32 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=08:37   1m29s/10It
3  28,12    15:41    30:57    32 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast lt=08:42   1m24s/10It 70%user+7%sys+23%idle(0%io) 1user                    4e6eps/cpu
   28,12    51:51  1:56:55    2  ES45-Alpha-1250MHz-cxx-6.5   v2.23 -fast lt=16m    11m34s/10It under load
   28,12  3:19:39  7:02:48    1  SunFire-880-SparcIII-750MHz-CC-5.3  v2.18 -fast     (sun4u) lt=1h05m  1 threads (19m40s/5It)
   28,12  1:48:28  4:29:24    2  SunFire-880-SparcIII-750MHz-CC-5.3  v2.18 -fast     (sun4u) lt=47:17  2 threads (14m08s/5It)
   28,12    58:41  2:42:08    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.18 -fast     (sun4u) lt=36:36  4 threads (8m/5It, 4cat=6GB/0.5s)
   28,12  1:00:27  2:44:18    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.18 -fast     (sun4u) lt=35:46  4 threads (8m19s/5It) 2nd try v2.19
   28,12  1:00:45  2:38:41 a2 4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.18 -fast     (sun4u) lt=39:25  4 threads (1m57s/1a2) 2nd try v2.19a2 i51=3h incl. EV
   28,12    59:16  2:37:09    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.18 -fast     (sun4u) lt=38:48  4 threads (7m17s/5It) FLOAT
   28,12    35:59  1:47:38    8  SunFire-880-SparcIII-750MHz-CC-5.3  v2.18 -fast     (sun4u) lt=29:39  8 threads (5m/5It)
   28,12  1:31:19  3:18:24    1  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3        (64bit) v2.25+ lt=27m18 n2=29m52 9m17s/10It (8virtCPUs,4DualCore) 32GB
   28,12  1:03:21  2:30:33    1  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=30m02 n2=32m57 13m32s/10It 8virtCPUs(4DualCore) 32GB
   28,12    38:31  1:47:45    2  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=20m02 n2=22m58 11m32s/10It 8virtCPUs(4DualCore)
   28,12    22:09  1:02:11    4  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m52 n2=13m58 6m30s/10It 8virtCPUs(4DualCore)
2  28,12    12:17    42:57    8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m05 n2=13m25 4m15s/10It 8virtCPUs(4DualCore) 8threads
   28,12    10:14    39:00  2*8  SunFire-V490-Sparc4+-1500MHz-CC-5.3 -fast -xtarget=ultra -xarch=v9 (64bit) v2.25+ lt=10m06 n2=13m02 3m55s/10It 8virtCPUs(4DualCore) 16threads
   28,12    14:31    43:09  2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O3 -mcpu=ultrasparc3        (64bit) v2.25+ lt=9m26  n2=12m00 4m08s/10It (8virtCPUs,4DualCore) 32GB
   28,12    14:44    43:48  2*8  SunFire-V490-Sparc4+-1500MHz-gcc4.1.1 -O2 -mcpu=ultrasparc3        (64bit) v2.25+ lt=9m13  n2=12m00 4m15s/10It (8virtCPUs,4DualCore) 32GB
   28,12    54:19  1:53:43    1  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2    lt=19m  (9m36s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
   28,12    32:42  1:20:55    2  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2    lt=13m  (8m19s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
   28,12    19:40  1:07:17    4  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2    lt=7m   (9m38s)/10It hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)  memory distributed badly?
   28,12  1:11:37  2:14:45    1  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=25m52 n2=27:42 (8m36s/10It) Novel10-32bit  32G-RAM dsk=6MB/s
   28,12    44:31  1:31:31    2  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=16m25 n2=18:18 (6m59s/10It) Novel10-32bit  32G-RAM dsk=6MB/s
   28,12    23:30    51:54    4  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=10m06 n2=12:00 (4m09s/10It) Novel10-32bit  32G-RAM dsk=6MB/s
   # Novel10-32bit: mount -t tmpfs -o size=30g /tmp1 /tmp1 # w=591MB/s 
   28,12    20:17  1:25:03    4  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=9m05 n2=10:47 (13m33s/10It) knoppix-5.0-32bit  4of32G-RAM dsk=60MB/s
   28,12    11:30  1:31:56    8  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=8m53 n2=10:35 (16m54s/10It) knoppix-5.0-32bit  4of32G-RAM dsk=60MB/s
   28,12     9:39  1:44:06  2*8  4xDualOpteron885-2600MHz-gcc-4.0.4 32bit lt=8m56 n2=10:38 (20m29s/10It) knoppix-5.0-32bit  4of32G-RAM dsk=60MB/s
   28,12    47:41  1:48:12    1  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=30m  n2=32:15 (7m03s/10It) kanotix2005-04-64bit   16G-RAM tmpfs
   28,12    30:19  1:20:48    2  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=22m  n2=24:03 (6m36s/10It) kanotix2005-04-64bit   16G-RAM tmpfs
   28,12    16:29    45:49    4  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=6m55 n2=13:27 (3m30s/10It) kanotix2005-04-64bit   16G-RAM tmpfs
1  28,12     9:01    28:13    8  4xDualOpteron885-2600MHz-gcc-4.0.3 64bit lt=8m29 n2=10:18 (2m13s/10It) kanotix2005-04-64bit   16G-RAM tmpfs
   28,12    37:13  1:32:28 a2 1  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=15m03 n2=16:58 (9m19s/10It) SLES10-64bit 32G-RAM
   28,12    22:27  1:07:18 a2 2  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=10m37 n2=12:21 (8m45s/10It) SLES10-64bit 32G-RAM
   28,12    12:38    47:26 a2 4  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=6m01  n2=7:47  (6m37s/10It) SLES10-64bit 32G-RAM
   28,12     7:19    36:50 a2 8  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit lt=4m29  n2=6:24  (5m33s/10It) SLES10-64bit 32G-RAM
   28,12  1:04:36  2:04:00    1  Xeon-3GHz-12GB-v2.24-gcc-4.1    64bit 4x4 lt=22m  (8m36s)/10It model4 stepping10 n2=24m44 xen
   28,12    37:42  1:28:24    2  Xeon-3GHz-12GB-v2.24-gcc-4.1    64bit 2x2 lt=16m  (7m28s)/10It model4 stepping10 n2=18m18 xen
   28,12    29:46  1:17:56    4  Xeon-3GHz-12GB-v2.24-gcc-4.1    64bit 4x4 lt=11m  (8m51s)/10It model4 stepping10 n2=13m15 xen
   28,12  1:13:28  2:42:16    1  Xeon-3GHz- 2GB-v2.25-gcc-4.0.2  32bit 4x4 lt=29m  (13m46s)/10It model4 stepping3 n2=31m34
   28,12  1:06:07  2:30:48    1  Xeon-3GHz- 2GB-v2.24-gcc-4.1.1  32bit 4x4 lt=28m  (13m33s)/10It model4 stepping3 n2=30m16 
   28,12    13:46  1:24:21 a2 8  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit  model6 stepping4  lt=7m56 n2=10m42 13m02/10It 2*DualCore*2HT=8vCPUs bellamy
   28,12    13:42    49:10    8  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit  model6 stepping4  lt=8m43 n2=11m30  7m13/10It 2*DualCore*2HT=8vCPUs bellamy san=w142MB/s,r187MB/s
   28,12    12:40    48:23  2*8  Xeon-2660MHz-2M/8GB-v2.25-gcc-4.1.1 amd?64bit  model6 stepping4  lt=8m08 n2=10m54  6m01/10It 2*DualCore*2HT=8vCPUs bellamy san=w142MB/s,r187MB/s
   28,12    40:13  1:25:12    16 AltixIA64-1500MHz-gcc-3.3    v2.23 -O2    lt=18m  (6m06s)/10It hxy_size 163840/32768=5  14+2CPUs auf 1RAM, numalink=Bottleneck
   28,12    34:35  1:53:10    8  AltixIA64-1500MHz-gcc-3.3    v2.23 -O2    lt=18m  (14m30s)/10It hxy_size 163840/32768=5
   28,12    44:21  2:33:00    4  AltixIA64-1500MHz-gcc-3.3    v2.23 -O2    lt=18m  (21m50s)/10It hxy_size 163840/32768=5
   28,12  1:01:59  3:30:52    2  AltixIA64-1500MHz-gcc-3.3    v2.23 -O2    lt=28m  (29m29s)/10It hxy_size 163840/32768=5
   28,12  1:34:21  3:21:23    1  AltixIA64-1500MHz-gcc-3.3    v2.23 -O2    lt=46m  (14m37s)/10It hxy_size 163840/32768=5
   ----------------------------
   27,13  7:41:55 29:08:50    4  MIPS--IP25-194MHz-CC-7.30 v2.19 -64 -Ofast -IPA                   4x4 lt=3h10m (2h16m)/5It 
   27,13  2:21:06  6:24:27    4  SunFire-880-SparcIII-750MHz-CC-5.3  v2.19 -fast -xtarget=ultra -xarch=v9 -g -xipo -xO5 lt=73:15  (21m14s/5It)
   27,13  1:48:39 10:35:00    8  GS160-Alpha-731MHz-cxx-6.5   v2.19 -fast  lt=45m  56m/5It 
   27,13    54:36 14:23:16    8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=27m    (13m..1h32m)/5It
   27,13    57:59  2:18:59    8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=28m    (6m38s)/5It   HBLen=409600
   27,13    32:15  4:26:40    16 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=22:35  (4m43s..1h38m)/10It
   27,13    46:03  1:43:21    16 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=25:44  (3m50s..3m57s)/5It mfs-disk + vbuf=16MB + sah's
   27,13    29:18  1:01:25    32 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=18:03  (1m43s..1h38m)/5It  (2stripe-Platte=150MB/s) 60%user+5%sys+35%idle(0%disk) 1user
   ---------------------------- n1=145068828=145e6 E0=-21.77715233 ZMag= 0.02928151
   26,14     107h     212h    1  O2100-IP27-250MHz-CC-7.30 v1.4
   26,14  3:30:24 12:09:12    4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17    lt=62m21
   26,14    45:51  1:45:48    16 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=27m07  (4m00s...4m12s)/5It mfs-disk vbuf=16M + spike (optimization after linking)
   26,14    47:18  1:50:45    16 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=27m08  (3m48s...5m16s)/5It mfs-disk          + spike (optimization after linking)
   26,14  1:31:49  3:31:01    16 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=48m34  (7m57s..10m01s)/5It mfs-disk vbuf=16M
   26,14  1:08:45 16:19:00    32 GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=33m17  (30m..5h)/10It HBLen=409600
   26,14  4:23:00 13:58:37    1  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2    lt=76m    (123m)/10It    16blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
   26,14  2:49:53 10:00:26    2  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2    lt=50m    ( 91m)/10It     2blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat) io.r=50MB/s
   26,14  1:32:16  6:53:23    4  2CoreOpteron-2194MHz-gcc-3.4 v2.23 -O2    lt=26m39  ( 72m)/10It     4blocks hxy_size=163840/32768=5 mem=16GB 2*DualCore (loki.nat)
   26,14    32:06  1:42:33    8  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit  lt=18m32  n2=25:18  (10m24s/10It) SLES10-64bit 32G-RAM
   26,14    32:23  2:15:03 a2 8  4xDualOpteron885-2600MHz-gcc-4.1.0 64bit  lt=18m32  n2=25:52  (21m26s/10It) SLES10-64bit 32G-RAM
   26,14  4:56:57 11:36:23    1  Xeon-3GHz-12GB-v2.24-gcc-4.1.1  64bit 4x4 lt=1h28m  ( 75m)/10It model4 stepping10 
   26,14  3:03:22  8:58:02    2  Xeon-3GHz-12GB-v2.24-gcc-4.1.1  64bit 2x2 lt=1h06m  ( 70m)/10It model4 stepping10
   ---------------------------- mem=7e9 hnz=17e9 E0= -24.52538640
   25,15  6:21:08 22:13:42    4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17    lt=1h32m
   25,15  3:03:41 15:54:51    8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=1h24m  (1h18m..1h26m)/5It HBLen=409600
   24,16  4:58:56 25:21:48    8  GS1280-Alpha-1150MHz-cxx-6.5 v2.19 -fast  lt=2h08m  (2h16m)/5It    HBLen=409600
   24,16 10:17:31   :  :      4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.17    lt=02:41:03 99%stored (+4h i40ca54h)
   24,16  8:31:10 31:49:42    8  AltixIA64-1500MHz-gcc-3.3    v2.23 -O2    lt=3h14m  (4h55m)/10It hxy_size 163840/32768=5
   23,17 17:19:51 51:02:31    4  ES45-Alpha-1GHz-CC-6.5 -fast -lz v2.18    lt=04:11:14 latency=29h30m/40*63GB=42ns cat=2229s(28MB/s) zcat=5906s(11MB/s)

The next figure shows the computing time for different older program versions and computers (I update it as soon as I can). The computing time depends nearly linearly on the matrix size n1 (the time is proportional to n1^1.07; n1 is named n in the figure). For example, going from nud=30,10 (n1=5.3e6) to nud=28,12 (n1=35e6) increases the time by a factor of about (35/5.3)^1.07 = 7.5.

4kB png image of computing time

Memory and disk usage

Memory usage depends on the matrix dimension n1. For the N=40 sample, two double vectors and one 5-byte vector are stored in memory, so we need n1*21 bytes, where n1 is approximately (N!/(nu!*nd!))/(4N). Disk usage is mainly the number of nonzero matrix elements hnz times 5 bytes (the disk space for tmp_l1.dat is 5*n1 and is not included here). The number of nonzero matrix elements hnz depends on n1 as hnz=11.5(10)*n1^1.064(4), which was found empirically. Here are some examples (a small program evaluating these formulas follows the table):

  nu,nd     n1   memory     hnz    disk  (zip)  (n1*21=memory, hnz*5=disk)
  -----+---------------+----------------------
  34,6    24e3    432kB   526e3   2.6MB 1.3MB 
  32,8   482e3     11MB    13e6    66MB  34MB
  30,10  5.3e6    113MB   168e6   840MB 444MB   small speed test
  28,12   35e6    735MB   1.2e9     6GB 3.6GB   big speed test
  27,13   75e6    1.4GB   2.8e9    14GB         # n1=75214468
  26,14  145e6    2.6GB   5.5e9    28GB 
  25,15  251e6    5.3GB   9.9e9    50GB
  24,16  393e6    8.3GB  15.8e9    79GB
  23,17  555e6   11.7GB    23e9   115GB  63GB
  22,18  708e6   14.9GB    ...     ...
  20,20  431e6    7.8GB    18e9    90GB 
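
To make these formulas concrete, the following small stand-alone program (a sketch for illustration, not part of spinpack) evaluates n1=binomial(N,nu)/(4N), memory=21*n1 bytes, hnz=11.5*n1^1.064 and disk=5*hnz bytes for the N=40 system. The additional symmetry reduction visible in the 20,20 line is not included.

   /* estimate.c - rough calculator for the formulas above (sketch,
    * not part of spinpack); build: cc -o estimate estimate.c -lm */
   #include <stdio.h>
   #include <math.h>

   int main(void) {
     const int N = 40;
     for (int nu = 34; nu >= 20; nu--) {
       int nd = N - nu;
       double n1 = 1.0;
       for (int k = 1; k <= nd; k++)   /* binomial(N,nd) built stepwise */
         n1 *= (double)(N - nd + k) / k;
       n1 /= 4.0 * N;                  /* symmetry reduction factor 1/(4N) */
       double hnz = 11.5 * pow(n1, 1.064);
       printf("%2d,%-2d  n1=%8.2e  mem=%6.0fMB  hnz=%8.2e  disk=%6.0fMB\n",
              nu, nd, n1, 21 * n1 / 1e6, hnz, 5 * hnz / 1e6);
     }
     return 0;
   }

For nud=30,10 this prints n1=5.3e6, about 111MB of memory and about 820MB of disk, close to the 113MB+840MB used by the small speed test above.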
   

CPU load

A typical CPU load for an N=40 site system looks like this:

4kB png image of cpu-load

Data are generated using the following tiny script:

   #!/bin/sh
   # poll the spin process every 30s until it exits;
   # the final grep strips the repeated ps header lines
   while ps -o pid,pcpu,time,etime,cpu,user,args -p 115877;\
     do sleep 30; done | grep -v CPU
   

115877 is the PID of the process; you have to replace it by your own. Alternatively, you can let daten.i activate such a script (edit it). The machine was used by 5 users, therefore the peak load is only about 12 CPUs. 735MB of memory and 6GB of disk space (or cache) were used. You can see the initialization (20min), the matrix generation (57min) and the first 4 iterations (4x8min). The matrix generation depends mostly on CPU power. The iteration time mainly depends on the disk speed (try: time cat exe/tmp/ht* >/dev/null) and on the speed of random memory access. For example, a GS1280-1GHz needs a disk bandwidth of 60MB/s per CPU to avoid a bottleneck. Reading 5GB in 8min means a sequential data rate of 12MB/s, which is no problem for disks or memory cache. Randomly reading a 280MB vector in 8min means 600kB/s and should also be no problem for the machine. You can improve disk speed by using striped disks or files (AdvFS) and by putting every H-block on a different disk. The maximum number of threads was limited to 16, but this can be changed (see src/config.h).

Why is multi-processor scaling so bad for v2.15?

During the iterations, multi-processor scaling is bad on most machines -- why? My guess is that this is caused by the random read access to the vector a (see the picture below). I thought a shared-memory computer should not have such scaling problems here, but probably I am wrong. I will try to solve the problem in the future.

6kB png image of dataflow
The figure shows the dataflow during the iterations for 2 CPUs.

Random access of Random Access Memory (RAM)?

Version 2.24 was very slow in calculating the expectation values <SiSj>. A gprof analysis showed that most of the time was spent finding the index of a configuration in the configuration table (function b2i of hilbert.c). This was the reason to take a closer look at the speed of memory access. I wrote memspeed.c, which simply reads a big number of integers at different strides. Reading integers one after another (sequential read) gives the best results, on the order of 1-2GB/s. But the worst case, where integers are read at a distance of about 16kB, gives a performance of about 10-40MB/s, which is a factor of 100 smaller. This is random access to the RAM. The OpSiSj function does around n1*(log2(n1)+1) such memory accesses for every SiSj value. I think it should be possible to reduce the randomness of the index calculation by using the Ising energy of each configuration to divide the configurations into blocks. (Sep2006)
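
For illustration, here is a minimal strided-read sketch in the spirit of memspeed.c (this is not the original memspeed.c; the buffer size, stride and output format are arbitrary choices). It reads the same total number of integers once sequentially and once at a 16kB stride:

   /* strided-read sketch in the spirit of memspeed.c (not the original):
    * read the same number of ints sequentially and at a 16kB stride
    * to expose the sequential vs. random-access bandwidth gap */
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <time.h>

   #define MB (1024*1024)
   #define NINT (256*MB/sizeof(int))        /* 256MB buffer */

   static double mbps(const int *buf, size_t n, size_t stride) {
     volatile int sum = 0;                  /* keep the reads alive */
     clock_t t0 = clock();
     for (size_t s = 0; s < stride; s++)    /* each element read exactly once */
       for (size_t i = s; i < n; i += stride)
         sum += buf[i];
     return (double)(n * sizeof(int)) / MB
          / ((double)(clock() - t0) / CLOCKS_PER_SEC);
   }

   int main(void) {
     int *buf = malloc(NINT * sizeof(int));
     if (!buf) return 1;
     memset(buf, 1, NINT * sizeof(int));    /* touch all pages first */
     printf("sequential:  %8.1f MB/s\n", mbps(buf, NINT, 1));
     printf("16kB stride: %8.1f MB/s\n", mbps(buf, NINT, 16*1024/sizeof(int)));
     free(buf);
     return 0;
   }

The 16kB stride defeats the cache lines and the hardware prefetcher, so nearly every read pays the full memory latency; that is where the factor of about 100 comes from.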