SPINPACK

What is it about?

SPINPACK is a large program package for computing the lowest eigenvalues and eigenstates, and various expectation values (spin correlations etc.), of quantum spin systems. These model systems can, for example, describe the magnetic properties of insulators at very low temperatures (T=0), where the magnetic moments of the particles form entangled quantum states. The package generates the symmetrized configuration vector and the sparse matrix representing the quantum interactions, computes its eigenvalues and eigenvectors using iterative sparse matrix-vector multiplications (SpMV) as the compute-intensive core operation, and finally computes some expectation values for the quantum system. The first SPINPACK version was based on Nishimori's TITPACK (Lanczos method, no symmetries), but it was soon converted to C/C++ and completely rewritten (1994/1995). Other diagonalization algorithms are implemented too (Lanczos, 2x2-diagonalization, and LAPACK/BLAS for smaller systems). It can handle Heisenberg, t-J, and Hubbard systems with up to 64 sites, or more (usually up to 128) using special compiler and CPU features, or even more sites in a slower emulation mode (C++/CXX required for int128 emulation). For instance, we obtained the lowest eigenstates of the Heisenberg Hamiltonian on a 40-site square lattice on our machines in 2002. Note that the resources needed for the computation grow exponentially with the number of lattice sites (N=40 means a matrix dimension of 2^N/symfactor).
The Hamiltonian matrix can be stored in memory or in files. If no storage space is available, the matrix elements are recomputed in every iteration (slow).
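To illustrate the SpMV core operation on a stored matrix, here is a minimal sketch in C. It assumes a generic compressed-sparse-row (CSR) layout; the names spmv_csr, row_ptr, col_idx and val are hypothetical and not taken from the SPINPACK sources, whose own storage format is not described on this page.

```c
#include <stddef.h>

/* y = A * x for a sparse matrix A in CSR form (assumed layout, not SPINPACK's):
 * row_ptr[i] .. row_ptr[i+1]-1 index the non-zeros of row i,
 * col_idx[k] is the column and val[k] the value of non-zero k. */
void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            sum += val[k] * x[col_idx[k]];   /* 2 FLOPs per stored non-zero */
        }
        y[i] = sum;
    }
}
```

An iterative eigensolver calls such a routine once per iteration. If the matrix is not stored, the val/col_idx entries have to be regenerated inside the loop each time, which is why that mode is slow.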
The package is written mainly in C so that it runs on all Unix systems. C++ is only needed for complex eigenvectors and twisted boundary conditions when the C compiler has no complex extension (gcc has one). This makes the package very portable.
Parallelization can be done using the MPI and PTHREAD libraries. Mixed mode (hybrid mode) is possible, but not always faster than pure MPI (2015). In v2.60, hybrid mode has a slight advantage on CPUs supporting hyper-threading. This will hopefully be improved further. MPI scaling is tested to work up to 6000 cores, and PTHREAD scaling up to 510 cores, but this requires careful tuning (scaling tests 2008-2016).
The program can use all topological symmetries, S(z) symmetry, and spin inversion to reduce the matrix size. This reduces the required computing resources by a linear factor.
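The size reduction can be estimated with a few lines of C. The sketch below is an illustration only: the function names are hypothetical, and the symmetry factor passed in is an assumption made by the caller. Fixing S(z) restricts the 2^N configurations to C(N, n_up) of them; the remaining symmetries divide that number by roughly the size of the symmetry group.

```c
#include <stdint.h>

/* Exact binomial coefficient C(n, k). No overflow check is done;
 * the intermediate products fit into 64 bits for n up to about 60. */
uint64_t binomial(unsigned n, unsigned k)
{
    if (k > n) return 0;
    if (k > n - k) k = n - k;          /* use the smaller of k and n-k */
    uint64_t r = 1;
    for (unsigned i = 1; i <= k; i++) {
        r = r * (n - k + i) / i;       /* division is exact in every step */
    }
    return r;
}

/* Rough estimate of the symmetrized matrix dimension in a fixed-Sz sector.
 * symfactor is the assumed number of symmetry operations. */
uint64_t estimate_hsize(unsigned n_sites, unsigned n_up, uint64_t symfactor)
{
    return binomial(n_sites, n_up) / symfactor;
}
```

For the N=40 square lattice with 20 up-spins and an assumed factor of 320 (4*N lattice symmetries times 2 for spin inversion), estimate_hsize(40, 20, 320) returns 430770402, close to the Hsize = 430909650 listed in the News section. The exact count differs slightly because not every configuration has a full-length symmetry orbit.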
Since 2015/2016, CPU vector extensions (SIMD: SSE2, AVX2) are supported to improve the performance of the symmetry operations on the bit representations of the quantum spins. The results are very reliable because the package has been used in scientific work since 1995. A low-latency, high-bandwidth network and low-latency memory are needed for best performance on large-scale clusters.

News

Bug 2022-07-04: Do not use 64-bit LAPACK libraries. Spinpack uses only 32-bit integers for the Fortran API, which is unsafe with such libraries: you may get strange errors in the full-diagonalization part, including segfaults, corrupted memory, or wrong results, depending on the undefined data in the upper 32 bits. Using 32-bit LAPACK libraries is safe. Use spinpack-2.59c or later.
Groundstate of the S=1/2 Heisenberg AFM on a N=42 kagome lattice: biggest sub-matrix computed (Sz=1 k=Pi/7 size=36.7e9, nnz=41.59, v2.56 cplx8, using partly non-blocking hybrid code on supermuc.phase1, 10400 cores (650 nodes, 2 tasks/node, 8 cores/task, 2 hyperthreads/core, 4h), matrix_storage=0.964e6nz/s/core SpMV=6.58e6nz/s/core, Feb2017)
Groundstate of the S=1/2 Heisenberg AFM on a N=42 linear chain computed (E0/Nw=-0.22180752, Hsize = 3.2e9, v2.38, Jan2009) using 900 Nodes of a SiCortex SC5832 700MHz 4GB RAM/Node (320min).
Update: N=41 (Hsize = 6.6e9, E0/Nw=-0.22107343, 16*(16cores+256GB+IB)*32h, matrix stored, v2.41, Oct2011).
Groundstate of the S=1/2 Heisenberg AFM on a N=42 square lattice computed (E0 = -28.43433834, Hsize = 1602437797, ((7,3),(0,6)), v2.34, Apr2008) using 23 nodes, each with 2*DualOpteron-2.2GHz and 4GB RAM, connected via 1Gb-eth (92 cores, usage=80%, ca. 60GB RAM, 80MB/s BW, 250h/100It).
The program is ready for clusters (MPI and Pthread can be used at the same time, see the performance graphic) and can again use memory as the storage medium for performance measurements (Dec07).
Groundstate of the S=1/2 Heisenberg AFM on a N=40 square lattice computed (E0 = -27.09485025, Hsize = 430909650, v1.9.3, Jan2002).
Groundstate of the S=1/2 J1-J2-Heisenberg AFM on a N=40 square lattice J2=0.5, zero-momentum space: E0= -19.96304839, Hsize = 430909650 (15GB memory, 185GB disk, v2.23, 60 iterations, 210h, Altix-330 IA64-1.5GHz, 2 CPUs, GCC-3.3, Jan06)
Groundstate of the S=1/2 Heisenberg AFM on a N=39 triangular lattice computed (E0 = -21.7060606, Hsize = 589088346, v2.19, Jan2004).
Largest complex Matrix: Hsize=1.2e9 (26GB memory, 288GB disk, v2.19 Jul2003), 90 iterations: 374h alpha-1GHz (with limited disk data rate, 4 CPUs, til4_36)
Largest real Matrix: Hsize=1.3e9 (18GB memory, 259GB disk, v2.21 Apr2004), 90 iterations: real=40h cpu=127h sys=9% alpha-1.15GHz (8 CPUs, til9_42z7)

Download

Verify download using: gpg --verify spinpack.tgz.asc spinpack.tgz

spinpack.tgz experimental developer version (may have bug fixes, new features or speed improvements, see doc/history.html)
---
spinpack-2.59d.tgz[.asc] improved usability at bigger systems, see doc/history (2022.07)
spinpack-2.59c.tgz[.asc] fix lapack-64bit-bug, see doc/history (2022.07)
spinpack-2.58a.tgz simpler block matrix handling + more, see doc/history (big NN speedup, Matrix compression disabled, SuperMuc-Adaptions, fix multirun 2019.07)
spinpack-2.57.tgz simpler block matrix handling + more, see doc/history
spinpack-2.56c.tgz 2.57 backport fixes, above 2048*16threads, FTLM-random-fix, see doc/history
spinpack-2.56.tgz better hybrid MPI-scaling above 1000 tasks (tested on kagome42_sym14_sz13..6, pgp-sign, updated 2017-02-23, see doc/history, still blocking MPI only)
spinpack-2.55.tgz better MPI-scaling above 1000 tasks (tested on kagome42_sym14_sz13..8..1, pgp-sign, updated 2017-02-21, see doc/history)
spinpack-2.52.tgz OpenMP-support (implemented as pthread-emulation), but weak mixed code speed (pgp-sign, Dec16)
spinpack-2.51.tgz g++6-adaptions (gcc6.2 compile-errors/warnings fixed, pgp-sign, Sep16)
spinpack-2.50d.tgz SIMD-support (SSE2,AVX2), lots of bug-fixes (Jan16+fixFeb16+fixMar16b+c+fixApr16d)
spinpack-2.49.tgz mostly bug-fixes (Mar15) (updated Mar15,12, buggy bfly-bench, NN>32 32bit-compile-error.patch, see experimental version above)
spinpack-2.48.tgz test-version (v2.48pre Feb14 new features, +tUfixMay14 +chkptFixDez14 +2ndrunFixJan15)
spinpack-2.47.tgz bug fixes (see doc/history.html, bug fixes of 2.45-2.46) (version 2014/02/14, 1MB, gpg-signature)
spinpack-2.44.tgz (see doc/history.html, known bugs) (version 2013/01/23 + fix May13,May14 2.44c, 1MB, gpg-signature)
spinpack-2.43.tgz +checkpointing (see doc/history.html) (version 2012/05/23, 1MB, gpg-signature)
spinpack-2.42.tgz ns.mpi-speed++ (see doc/history.html) (version 2012/05/07, 1MB, gpg-signature)
spinpack-2.41.tgz mpi-speed++,doc++ (see doc/history.html) (version 2011/10/24 + backport-fix 2015-09-23, 1MB, gpg-signature)
spinpack-2.40.tgz bug fixes (see doc/history.html) (version 2009/11/26, 890kB, gpg-signature)
spinpack-2.39.tgz new option -m, new lattice (doc/history.html) (version 2009/04/20, 849kB, gpg-signature)
spinpack-2.38.tgz MPI-fixes (doc/history.html) (version 2009/02/11, 849kB, gpg-signature)
spinpack-2.36.tgz MPI-tuned (doc/history.html) (version 2008/08/04, 802kB, gpg-signature)
spinpack-2.35.tgz IA64-tuned (doc/history.html) (version 2008/07/21, 796kB, gpg-signature)
spinpack-2.34.tgz bugs fixed for MPI (doc/history.html) (version 2008/04/23, 770kB, gpg-signature)
spinpack-2.33.tgz bugs fixed for MPI (doc/history.html) (version 2008/03/16, 620kB, gpg-signature)
spinpack-2.32.tgz bug fixed (doc/history.html) (version 2008/02/19, 544kB, gpg-signature)
spinpack-2.31.tgz MPI works and scales (version 2007/12/14, 544kB, gpg-signature)
spinpack-2.26.tgz code simplified and partly speedup, prepare for FPGA and MPI (version 07/02/27, gpg-signature)
spinpack-2.15.tgz see doc/history.tex (updated 2003/01/20)

Installation

gunzip -c spinpack-xxx.tgz | tar -xf - # xxx is the version number
cd spinpack; ./configure --mpt
make test # to test the package and create exe path
# edit src/config.h exe/daten.def for your needs (see models/*.c)
make
cd exe; ./spin

Documentation

The documentation is available in the doc path. Most parts of the documentation have now been rewritten in English. If you still find parts written in German or out-of-date documentation, send me an email with a short hint about where to find the part, and I will rewrite it as soon as I can.
Please see doc/history.html for the latest changes. Documentation about speed can be found in the package, or for an older version on this spinpack-speed-page.

ToDo

1) The efficiency (energy consumption) of the code is not optimal (2019-04). Storing the matrix in memory is the most efficient method (much less energy and time). Computing the matrix is CPU-bound (approx. nzx*2000 Ops / 8 Byte) and well optimized for Heisenberg systems. A butterfly network, O(2logN), is used for the bit permutations that compute the lattice symmetries (typically 4*N). There is some room for hardware/software acceleration there.
But the core routine during the iteration (SpMV) is network-I/O-bound. Ideally we have a FLOP-to-transfer ratio of 2 FLOP per 8 Byte (double) in the worst case up to 8 FLOP per 8 Byte (single-precision complex) in the best case, which is low compared to today's HPC systems with 150 FLOP per Byte. So only about 1 percent of peak performance can be used on QDR-Infiniband clusters. But at the moment (2019-04) 4-Byte index data and matrix size data (latency) are transferred too, which costs about 50% more data. This must be changed before making the code use fully overlapping communication and use the remaining CPU power for data compression to get further acceleration on HPC clusters.
2) Parallel computation of two or more datasets (vectors) at the same time will increase memory consumption by a factor of (nnz+(m*2))/(nnz+2), but turns the memory-bound SpMV core into a more efficient SpMM core (m times the FLOPs for a memory bandwidth slightly increased by the factor above). This is useful together with improved overlapping of computation and communication.
3) The most time-consuming important function is b_smallest in hilbert.c, used for matrix generation. This function computes the representative of a set of symmetric spin configurations (bit patterns) from a member of this set. It also returns a phase factor and the orbit length. It would be great progress if the performance of that function could be improved. Ideas are welcome. One of my ideas is to use FPGAs, but my impression in 2009 was that the FPGA/VHDL compiler and Xilinx tools are so slow, badly scaling, and buggy that code generation and debugging is really no fun, and a much better FPGA toolchain is needed for HPC. In 2015-05 I added a software Benes network to benefit from AVX2, but it looks like it is still not at the maximum available speed (HT shows nearly a factor of 2; bitmask falls out of L1 cache?).
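To make the task of item 3 concrete, here is a strongly simplified sketch in C. It is not the b_smallest routine from hilbert.c: it assumes a one-dimensional ring whose only symmetries are cyclic translations, it returns no phase factor, and it applies the symmetry operations one by one instead of using a butterfly or Benes network. The names rotl_n and chain_representative are hypothetical.

```c
#include <stdint.h>

/* Translate a configuration of n spins (lowest n bits of x, 1 <= n <= 64)
 * by one site on a ring: a left rotation of the n-bit pattern. */
static uint64_t rotl_n(uint64_t x, unsigned n)
{
    uint64_t mask = (n == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
    return ((x << 1) | (x >> (n - 1))) & mask;
}

/* Return the smallest bit pattern among all translations of cfg
 * (the representative) and store the orbit length in *orbit_len.
 * cfg must not have bits set above bit n-1. */
uint64_t chain_representative(uint64_t cfg, unsigned n, unsigned *orbit_len)
{
    uint64_t best = cfg;
    uint64_t cur = cfg;
    unsigned len = 0;
    do {
        cur = rotl_n(cur, n);          /* apply the next symmetry operation */
        len++;
        if (cur < best) best = cur;    /* keep the smallest member seen */
    } while (cur != cfg);              /* stop when the orbit closes */
    *orbit_len = len;
    return best;
}
```

For n=4 the pattern 0110 has the orbit 0110, 1100, 1001, 0011, so the representative is 0011 and the orbit length is 4, while 0101 has an orbit of length 2. On a real lattice the loop runs over all symmetry operations (typically 4*N bit permutations), which is why this step dominates matrix generation.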

Examples for open access

Please use these data for your work or verify my data. Questions and corrections are welcome. If you miss data or explanations here, please send a note to me.

s=1/2 Heisenberg model square lattice (finite size extrapolation: gnuplot data, gnuplot script)
s=1/2 Heisenberg model triangular lattice (finite size extrapolation: gnuplot data, gnuplot script)
s=1/2 Heisenberg model kagome lattice (finite size extrapolation: gnuplot script, data included)

Frequently asked questions (FAQ)

Q: I try to diagonalize a 4-spin system, but I do not get the full spectrum. Why?
A: Spinpack is designed to handle big systems. Therefore it uses as many symmetries as it can. The very small 4-spin system has a very special symmetry which makes it equivalent to a 2-spin system built from two s=1 spins. Spinpack uses this symmetry automatically to let you emulate s=1 (or s=3/2, etc.) spin systems by pairs of s=1/2 spins. If you want to switch this off, edit src/config.h and change CONFIG_S1SYM to CONFIG_NOS1SYM.
Q: What is the best suitable cluster-hardware?
A: Use SpMV benchmarks (HPCG) to find the best system if you are not using spinpack itself. 1st, network bandwidth is important; 2nd, network bandwidth again; 3rd, memory channels; 4th, CPU cache size; 5th, CPU integer(!) units. Don't forget the reliability of network and RAM. This means: do not buy today's top CPUs for best overall performance; better look at top networks, high memory bandwidth, and cheap multicore/multichannel/high-bandwidth CPUs (check performance/price). Calculate performance/price including the 5-year power consumption cost (90% SpMV + 10% HPL), including cooling. Most CPU prices have a similar HPL-Perf/(price+5yEnergy). The 5-year HPL energy cost can range from 100% to 300% of the CPU price! Use the power consumption value of an HPCG test over at least 2 nodes. This will likely be much lower than the peak during an HPL test. Feel free to improve my suggestions.

Hilbert matrix N=36 s=1/2 kago lattice: This picture shows a small sample of a possible Hilbert matrix. The non-zero elements are shown as black pixels (v2.33 Feb2008 kago36z14j2).

Hilbert matrix for N=18 s=1/2 quantum chain: This picture shows a small sample of a possible Hilbert matrix. The non-zero elements are shown as black (J1) and gray (J2) pixels (v2.42 Nov2011 j1j2-chain N=18 Sz=0 k=0). The configuration space is sorted by J1-Ising-model energy to show structures of the matrix. Ising energy ranges are shown as slightly grayed areas.


ground state s=1/2-AFM-LC: Ground state energy scaling for finite-size spin=1/2 AFM chains N=4..40, using up to 300GB memory to store the N=39 sparse matrix and 245 CPU hours (2011, src=lc.gpl).

Author: Joerg Schulenburg, Uni-Magdeburg, 2008-2016