SPINPACK
What is it about?
SPINPACK is a large program package for computing the
lowest eigenvalues and eigenstates and various expectation values
(spin correlations etc.) of quantum
spin systems.
These model systems can for example describe magnetic properties of
insulators at very low temperatures (T=0) where the magnetic moments
of the particles form entangled quantum states.
The package generates the symmetrized configuration vector and
the sparse matrix representing the quantum interactions,
computes its eigenvalues and eigenvectors using iterative matrix-vector
multiplications (SpMV) as the compute-intensive core operation,
and finally computes some expectation values for the quantum system.
The first SPINPACK version was based on Nishimori's
TITPACK (Lanczos method, no symmetries), but
it was converted to C/C++ early on and completely rewritten (1994/1995).
Other diagonalization algorithms are implemented too
(Lanczos, 2x2 diagonalization and LAPACK/BLAS for smaller systems).
It is able to handle
Heisenberg,
t-J, and Hubbard systems with up to 64 sites, or more when using
special compiler and CPU features (usually up to 128),
or even more sites in a slower emulation mode (C++/CXX required for int128 emulation).
For instance, we obtained the lowest eigenstates of the
Heisenberg Hamiltonian on a 40-site square lattice on our machines in 2002.
Note that the resources needed for computation grow exponentially with the
number of lattice sites (N=40 means a matrix dimension of 2^N/symfactor).
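The growth can be made concrete with a small estimate. The sketch below is an illustration, not part of SPINPACK: it counts the configurations with Sz=0 (a binomial coefficient, already far smaller than 2^N) and divides by a symmetry factor. The function names and the choice symfactor = 4*N*2 (four point-group operations, N translations, spin inversion) are assumptions for this example.

```c
#include <assert.h>
#include <stdint.h>

/* binomial coefficient C(n,k); exact as long as the intermediate
   products fit into 64 bits */
static uint64_t binomial(unsigned n, unsigned k) {
    assert(k <= n);
    uint64_t c = 1;
    for (unsigned i = 1; i <= k; i++)
        c = c * (n - k + i) / i;   /* the division is exact at every step */
    return c;
}

/* rough dimension of the Sz=0 sector after symmetry reduction
   (hypothetical helper; short orbits make the true value a bit larger) */
static uint64_t estimate_dim(unsigned nsites, unsigned symfactor) {
    assert(nsites % 2 == 0 && symfactor > 0);
    return binomial(nsites, nsites / 2) / symfactor;
}
```

With these assumptions `estimate_dim(40, 4 * 40 * 2)` gives 430770402, which is close to the Hsize = 430909650 quoted for the N=40 square lattice in the News section. Whether that run used exactly this symmetry factor is an assumption here.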
The Hamiltonian matrix can be stored in memory or in files.
If there is not enough storage space, the matrix elements are recomputed
in every iteration (slow).
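Recomputing the matrix elements in every iteration amounts to a matrix-free product y = H*x. A minimal sketch for the S=1/2 Heisenberg model on a small ring, without any symmetry reduction, could look as follows. The names `heisenberg_apply` and `energy`, the ring length and the coupling J=1 are assumptions for illustration and are not taken from the SPINPACK sources.

```c
#include <assert.h>

#define NSITES 4                 /* ring length (example value) */
#define DIM (1u << NSITES)       /* full Hilbert space dimension 2^N */

/* y = H*x for H = sum_i S_i.S_(i+1) on a ring; every matrix element
   is computed on the fly from the bit pattern s, nothing is stored */
static void heisenberg_apply(const double *x, double *y) {
    assert(x != y);              /* in-place application is not supported */
    for (unsigned s = 0; s < DIM; s++) y[s] = 0.0;
    for (unsigned s = 0; s < DIM; s++) {
        for (unsigned i = 0; i < NSITES; i++) {
            unsigned j = (i + 1) % NSITES;
            unsigned bi = (s >> i) & 1u, bj = (s >> j) & 1u;
            if (bi == bj) {
                y[s] += 0.25 * x[s];              /* parallel spins: +1/4 */
            } else {
                unsigned t = s ^ ((1u << i) | (1u << j));
                y[s] -= 0.25 * x[s];              /* antiparallel: -1/4 */
                y[t] += 0.5 * x[s];               /* spin-flip term: 1/2 */
            }
        }
    }
}

/* expectation value <x|H|x> for a normalized vector x */
static double energy(const double *x) {
    double y[DIM], e = 0.0;
    heisenberg_apply(x, y);
    for (unsigned s = 0; s < DIM; s++) e += x[s] * y[s];
    return e;
}
```

For the Neel configuration 0101 (index 5) `energy` returns -1, while the exact ground state energy of the 4-site ring is -2. The cost of this approach is that the bit operations are repeated in every iteration, which is why storing the matrix is faster.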
The package is written mainly in C to get it running on all Unix systems.
C++ is only needed for complex eigenvectors and
twisted boundary conditions if the C compiler has no complex extension (gcc has one).
This way the package is very portable.
Parallelization is possible using MPI and the PTHREAD library.
Mixed mode (hybrid mode) is possible, but not always faster
than pure MPI (2015).
v2.60 shows a slight advantage for hybrid mode on CPUs supporting hyperthreading.
This will hopefully be improved further. MPI scaling is tested to work
up to 6000 cores, PTHREAD scaling up to 510 cores but requires
careful tuning (scaling 2008-10-16).
The program can use all topological symmetries,
S(z) symmetry and spin inversion to reduce the matrix size.
This reduces the needed computing resources by a linear factor.
Since 2015/2016 CPU vector extensions (SIMD, SSE2, AVX2)
are supported to get better performance for
the symmetry operations on bit representations of the quantum spins.
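A symmetry operation on a bit representation is a permutation of the bits of a machine word. The following scalar sketch (no SIMD, and not SPINPACK's code) shows the basic building block of a butterfly-type network, the delta swap, and uses log2(16) = 4 such stages to mirror a chain of 16 spins. The names `delta_swap` and `mirror16` and the chosen permutation are assumptions for this example; real lattice symmetries need other masks.

```c
#include <assert.h>
#include <stdint.h>

/* swap the bits selected by mask with the bits 'shift' positions above them */
static uint64_t delta_swap(uint64_t x, uint64_t mask, unsigned shift) {
    assert(shift < 64 && ((mask << shift) >> shift) == mask);
    uint64_t t = ((x >> shift) ^ x) & mask;
    return x ^ t ^ (t << shift);
}

/* mirror a chain of 16 spins (site i -> site 15-i) in 4 stages */
static uint16_t mirror16(uint16_t s) {
    uint64_t x = s;
    x = delta_swap(x, 0x5555, 1);  /* swap neighbouring bits */
    x = delta_swap(x, 0x3333, 2);  /* swap neighbouring bit pairs */
    x = delta_swap(x, 0x0F0F, 4);  /* swap nibbles */
    x = delta_swap(x, 0x00FF, 8);  /* swap bytes */
    return (uint16_t)x;
}
```

Each stage touches all bits at once with a handful of logical operations, which is the property that makes such networks suitable for vector extensions.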
The results are very reliable because the package has been used
since 1995 in scientific work. A low-latency, high-bandwidth network
and low-latency memory are needed to get the best performance on large-scale
clusters.
News
 Bug 2022-07-04: do not use 64-bit LAPACK libraries.
Spinpack uses 32-bit integers only for the Fortran API, which is unsafe with such libraries.
You may get strange errors in the full diagonalization part.
This may result in segfaults, corrupted memory data or bad results,
depending on the undefined data lying in the upper 32 bits.
Using 32-bit LAPACK libraries is safe. Use spinpack2.59c or later.

Ground state of the S=1/2 Heisenberg AFM on an N=42 kagome lattice: biggest
submatrix computed (Sz=1, k=Pi/7, size=36.7e9, nnz=41.59, v2.56 cplx8,
using partly non-blocking hybrid code on
supermuc.phase1,
10400 cores (650 nodes, 2 tasks/node, 8 cores/task, 2 hyperthreads/core, 4h),
matrix_storage=0.964e6 nz/s/core, SpMV=6.58e6 nz/s/core, Feb 2017)

Ground state of the S=1/2 Heisenberg AFM on an N=42 linear chain computed
(E0/Nw=-0.22180752, Hsize = 3.2e9, v2.38, Jan 2009)
using 900 nodes of a SiCortex SC5832 700MHz 4GB RAM/node (320min).
Update: N=41, Hsize = 6.6e9, E0/Nw=-0.22107343
(16*(16cores+256GB+IB)*32h, matrix stored, v2.41, Oct 2011).

Ground state of the S=1/2 Heisenberg AFM on an N=42 square lattice computed
(E0 = -28.43433834, Hsize = 1602437797, ((7,3),(0,6)), v2.34, Apr 2008)
using 23 nodes, each with 2*DualOpteron 2.2GHz and 4GB RAM, via 1Gb Ethernet
(92 cores, usage=80%, ca. 60GB RAM, 80MB/s BW, 250h/100It).

The program is ready for clusters (MPI and Pthreads can be used at the same
time, see the performance graphic)
and can again use memory as storage medium for performance measurements
(Dec 2007).
 Ground state of the S=1/2 Heisenberg AFM on an N=40 square lattice
computed (E0 = -27.09485025, Hsize = 430909650, v1.9.3, Jan 2002).
 Ground state of the S=1/2 J1-J2 Heisenberg AFM on an N=40 square lattice,
J2=0.5, zero-momentum space:
E0 = -19.96304839, Hsize = 430909650
(15GB memory, 185GB disk, v2.23, 60 iterations,
210h, Altix330 IA64 1.5GHz, 2 CPUs, GCC 3.3, Jan 2006)
 Ground state of the S=1/2 Heisenberg AFM on an N=39 triangular lattice
computed (E0 = -21.7060606, Hsize = 589088346, v2.19, Jan 2004).
 Largest complex matrix: Hsize=1.2e9 (26GB memory, 288GB disk, v2.19, Jul 2003),
90 iterations: 374h alpha 1GHz (with limited disk data rate, 4 CPUs, til4_36)
 Largest real matrix: Hsize=1.3e9 (18GB memory, 259GB disk, v2.21, Apr 2004),
90 iterations: real=40h cpu=127h sys=9% alpha 1.15GHz (8 CPUs, til9_42z7)
Download
Verify download using:
gpg --verify spinpack.tgz.asc spinpack.tgz
 spinpack.tgz experimental developer version (may have bug fixes, new features or speed improvements, see doc/history.html)
 
 spinpack2.59d.tgz[.asc] improved usability for bigger systems, see doc/history (2022.07)
 spinpack2.59c.tgz[.asc] fix for the 64-bit LAPACK bug, see doc/history (2022.07)
 spinpack2.58a.tgz simpler block matrix handling + more, see doc/history (big NN speedup, matrix compression disabled, SuperMUC adaptions, fix multirun 2019.07)
 spinpack2.57.tgz simpler block matrix handling + more, see doc/history
 spinpack2.56c.tgz 2.57 backport fixes, above 2048*16 threads, FTLM random fix, see doc/history
 spinpack2.56.tgz (better hybrid MPI scaling above 1000 tasks, tested on kagome42_sym14_sz13..6, pgp-signed, updated 2017-02-23, see doc/history, still blocking MPI only)
 spinpack2.55.tgz (better MPI scaling above 1000 tasks, tested on kagome42_sym14_sz13..8..1, pgp-signed, updated 2017-02-21, see doc/history)
 spinpack2.52.tgz (OpenMP support (implemented as pthread emulation), but weak mixed-code speed, pgp-signed, Dec 2016)
 spinpack2.51.tgz g++6 adaptions (gcc6.2 compile errors/warnings fixed, pgp-signed, Sep 2016)
 spinpack2.50d.tgz SIMD support (SSE2, AVX2), lots of bug fixes (Jan16 +fix Feb16 +fix Mar16b+c +fix Apr16d)
 spinpack2.49.tgz mostly bug fixes (Mar15) (updated Mar15,12, buggy bflybench, NN>32 32bitcompileerror.patch, see experimental version above)
 spinpack2.48.tgz test version (v2.48pre Feb14 new features, +tU fix May14 +chkpt fix Dec14 +2nd-run fix Jan15)
 spinpack2.47.tgz bug fixes (see doc/history.html, bug fixes of 2.45-2.46)
(version 2014/02/14, 1MB, gpg signature)
 spinpack2.44.tgz (see doc/history.html, known bugs)
(version 2013/01/23 + fix May13,May14 2.44c, 1MB, gpg signature)
 spinpack2.43.tgz +checkpointing (see doc/history.html)
(version 2012/05/23, 1MB, gpg signature)
 spinpack2.42.tgz ns.mpispeed++ (see doc/history.html)
(version 2012/05/07, 1MB, gpg signature)
 spinpack2.41.tgz mpispeed++,doc++ (see doc/history.html)
(version 2011/10/24 + backport fix 2015-09-23, 1MB, gpg signature)
 spinpack2.40.tgz bug fixes (see doc/history.html)
(version 2009/11/26, 890kB, gpg signature)
 spinpack2.39.tgz new option m, new lattice (doc/history.html)
(version 2009/04/20, 849kB, gpg signature)
 spinpack2.38.tgz MPI fixes (doc/history.html)
(version 2009/02/11, 849kB, gpg signature)
 spinpack2.36.tgz MPI tuned (doc/history.html)
(version 2008/08/04, 802kB, gpg signature)
 spinpack2.35.tgz IA64 tuned (doc/history.html)
(version 2008/07/21, 796kB, gpg signature)
 spinpack2.34.tgz bugs fixed for MPI (doc/history.html)
(version 2008/04/23, 770kB, gpg signature)
 spinpack2.33.tgz bugs fixed for MPI (doc/history.html)
(version 2008/03/16, 620kB, gpg signature)
 spinpack2.32.tgz bug fixed (doc/history.html)
(version 2008/02/19, 544kB, gpg signature)
 spinpack2.31.tgz MPI works and scales
(version 2007/12/14, 544kB, gpg signature)
 spinpack2.26.tgz code simplified
and partly sped up, prepared for FPGA and MPI
(version 2007/02/27, gpg signature)
 spinpack2.15.tgz see doc/history.tex (updated 2003/01/20)
Installation
 gunzip -c spinpackxxx.tgz | tar -xf -  # xxx is the version number
 cd spinpack; ./configure --mpt
 make test # to test the package and create exe path
 # edit src/config.h exe/daten.def for your needs (see models/*.c)
 make
 cd exe; ./spin
Documentation
The documentation is available in the doc path.
Most parts of the documentation have been rewritten in English. If you
still find parts written in German or out-of-date documentation,
send me an email with a short hint where to find them, and
I will rewrite them as soon as I can.
Please see doc/history.html for the latest changes.
You can find documentation about speed in the package, or an older version
on the
spinpack speed page.
ToDo
1) Efficiency (energy consumption) of the code is not optimal (2019-04).
Storing the matrix in memory is the
most efficient method (much less energy and time).
Computing the matrix is CPU-bound (approx. nzx*2000 Ops / 8 Byte)
and well optimized for Heisenberg systems.
A butterfly network, O(2logN), is used for bit permutation
to compute lattice symmetries (typically 4*N).
There is some room for hardware/software acceleration there.
But the core routine during iteration (SpMV) is network-I/O-bound.
Ideally we have a FLOP-to-transfer ratio between 2 FLOP per 8 Byte (double)
as worst case and 8 FLOP per 8 Byte (single-precision complex)
as best case, which is
low compared to today's HPC systems with 150 FLOP per Byte.
So only about 1 percent of peak performance can be used on
QDR-InfiniBand clusters.
But at the moment (2019-04) 4-byte index data and matrix size data
(latency) are transferred too, which costs about 50%
more data. This must be changed before making the code use
fully overlapping communication and the remaining CPU power for
data compression to get further acceleration on HPC clusters.
2) Parallel computation of two or more datasets (vectors) at the same time
will increase memory consumption by a factor of (nnz+(m*2))/(nnz+2),
but turns the memory-bound SpMV core into a more efficient SpMM core
(m times the FLOPs for a memory bandwidth increased only slightly, by the factor above).
This is useful together with improved overlapping of computation and
communication.
3) The most time-consuming function is b_smallest
in hilbert.c,
used for matrix generation.
This function computes the representative
of a set of symmetric spin configurations (bit patterns) from a member of
this set. It also returns a phase factor and the orbit length.
It would be great progress
if the performance of that function could be improved. Ideas are welcome.
One of my ideas is to use FPGAs, but my impression
in 2009 was that the FPGA/VHDL compiler and Xilinx tools are so slow,
badly scaling and buggy that code generation and debugging is really no
fun and a much better FPGA toolchain is needed for HPC.
In 2015-05 I added a software Benes network to benefit from AVX2, but it looks like
this is still not the maximum available speed (HT shows a factor of nearly 2;
does the bitmask fall out of the L1 cache?).
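To make the task of b_smallest in item 3 concrete, here is a much simplified sketch, not the routine from hilbert.c: it handles only the cyclic translations of an n-site ring, tries them one by one, and returns the smallest bit pattern of the orbit together with the number of translations leading to it and the orbit length. The type `orbit_info` and the functions `translate` and `ring_representative` are hypothetical names for this example. A phase factor could be derived from the returned shift once a momentum is chosen.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t rep;      /* smallest bit pattern in the orbit */
    unsigned shift;    /* number of translations mapping the input to rep */
    unsigned orbit;    /* number of distinct patterns in the orbit */
} orbit_info;

/* cyclic translation of an n-site ring by one site */
static uint64_t translate(uint64_t s, unsigned n) {
    uint64_t mask = (n == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
    return ((s << 1) | (s >> (n - 1))) & mask;
}

/* representative of s under the n translations of the ring */
static orbit_info ring_representative(uint64_t s, unsigned n) {
    assert(n >= 1 && n <= 64);
    assert(n == 64 || (s >> n) == 0);   /* s must fit into n bits */
    orbit_info r = { s, 0, n };
    uint64_t t = s;
    for (unsigned k = 1; k < n; k++) {
        t = translate(t, n);
        if (t == s) { r.orbit = k; break; }   /* orbit closed early */
        if (t < r.rep) { r.rep = t; r.shift = k; }
    }
    return r;
}
```

For n=4 the configuration 0110 has the representative 0011 after 3 translations and a full orbit of length 4, while the Neel configuration 0101 is its own representative with orbit length 2. This linear search costs O(n) word operations per symmetry group, which hints at why the real function dominates the run time when about 4*N symmetries are involved.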
Please use these data for your work or verify my data.
Questions and corrections are welcome. If you miss data or explanations
here, please send a note to me.
Frequently asked questions (FAQ)

Q: I try to diagonalize a 4-spin system, but I do not get the full spectrum. Why?

A: Spinpack is designed to handle big systems. Therefore it uses as many
symmetries as it can. The very small 4-spin system has a very special
symmetry which makes it equivalent to a 2-spin system built from two s=1 spins.
Spinpack uses this symmetry automatically to give you the possibility
to emulate s=1 (or s=3/2, etc.) spin systems by pairs of s=1/2 spins.
If you want to switch this off, edit src/config.h and change
CONFIG_S1SYM to CONFIG_NOS1SYM.

Q: What is the most suitable cluster hardware?

A: Use SpMV benchmarks (HPCG) to find the best system if you cannot use spinpack
itself. 1st: network bandwidth is important, 2nd: network bandwidth again,
3rd: memory channels, 4th: CPU cache size, 5th: CPU integer(!) units.
Don't forget the reliability of network and RAM.
This means: do not buy today's top CPUs for best overall performance;
better look at top networks, high memory bandwidth
and cheap multi-core/multi-channel/high-bandwidth CPUs (check performance/price).
Calculate performance/price including the price of 5 years of power
consumption (90% SpMV + 10% HPL), including cooling.
Most CPUs have a similar HPL-Perf/(price+5y-energy).
The 5y-HPL energy price can range from 100% to 300% of the CPU price!
Use the power consumption value of an HPCG test over at least 2 nodes.
This will likely be much lower than the peak at an HPL test.
Feel free to improve my suggestions.
This picture is showing a small sample of a possible Hilbert matrix.
The nonzero elements are shown as black pixels (v2.33 Feb2008 kago36z14j2).
This picture is showing a small sample of a possible Hilbert matrix.
The nonzero elements are shown as black (J1) and gray (J2) pixels
(v2.42 Nov 2011, j1j2-chain N=18 Sz=0 k=0). The configuration space is sorted by
J1-Ising-model energy to show structures of the matrix.
Ising energy ranges are shown as slightly grayed areas.
Ground state energy scaling for finite-size spin=1/2 AFM chains N=4..40
using up to 300GB memory to store the N=39 sparse matrix and 245 CPU hours
(2011, src=lc.gpl).
Author: Joerg Schulenburg, Uni-Magdeburg, 2008-2016