URZ HPC-Cluster Neumann (disassembled 2021)


Neumann - 100 Teraflop Infiniband-Cluster

News:

  • Mar 2016 - start of regular operation, remaining problems being fixed
  • ToDo: background jobs, application checkpointing (failed)
  • May 2016 - Please don't use your home directory for big data! This causes problems on the login node. Always use /scratch/tmp/${USER}/ for big data and remove files after use (see the example below this list). The disk quota on /home was set to 2 GB in 2021.
  • Sep 2021 - switch-off plan below (fixed on Nov 25th)
    • 25.11.2021 - new HPC system available for users (10 days later)
    • 30.11.2021 - removing user accounts and data
    • 07.12.2021 - disassembling (6 days later)
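
A minimal sketch of the recommended /scratch workflow ("myproject" and the copy target are placeholders):

  # create your own scratch directory and work there instead of $HOME
  mkdir -p /scratch/tmp/${USER}/myproject
  cd /scratch/tmp/${USER}/myproject
  # ... run your jobs so that large output is written here ...
  # copy the results you want to keep to your own machine, then clean up
  rsync -av results/ user@my.workstation:neumann-backup/
  rm -rf /scratch/tmp/${USER}/myproject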

Short system description

With the installation of the Haswell system from Clustervision in November/December 2015, a compute cluster of Infiniband-connected multiprocessor nodes running Linux with the Slurm job scheduler became available to the users of our university. The Neumann cluster is intended for specially parallelized applications with high demands on compute performance. It replaces the older HPC systems. Thanks to the high network bandwidth and sufficient main memory, very large problems in particular can be computed more efficiently on this cluster.
Images: rear view with cooling door, rear view detail, rear view cabling, board inlet, board Z10PH-D16

Hardware

Architecture: uniform distributed memory, 172 Infiniband-connected 16-core ccNUMA nodes with 256 GB/node, 2 GPU nodes with 4 cards/node
Processor (CPU): 2 x Xeon E5-2630v3 (Haswell) 2.4 GHz, L2=8x256KB, L3=20MB, Boost_AVX=2.6GHz, Boost_single=3.2GHz, 256-bit vector support (AVX2), 16 FLOP/clock, 610 GFLOP/node (64 bit), 4 memory channels/CPU at 14.9 GB/s each, 59 GB/s/CPU, 85 W/CPU
Coprocessor (GPU): GeForce GTX 980, ca. 4 TFLOP (32 bit), 156 GFLOP (64 bit), 4 GB onboard RAM, 224 GB/s, 180 W
4 cards/node, 2 Linux + 1 Windows nodes (versus 1.2 TFLOP (32 bit)/node, 119 GB/s and 170 W per 2 CPUs of the Haswell nodes)
Board: ASUSTeK RS720Q-E8-RS12 1.xx (4 boards per 2U chassis, PCIe 3.0 x16, 128 Gb/s)
Main memory (RAM): 256 GB, 16 x 16 GB DDR4 ECC Micron, DDR4-1866 MHz = 14.9 GB/s, 4 channels/CPU, 2 DIMMs/channel, memory bandwidth 119 GB/s/node
Disks (HD): diskless nodes, BeeGFS on 4 storage nodes, each 2*(10+2 RAID6) * 4 TB, ca. 290 TB, 8 GB/s, 80*105 IOPS * 4 KB, metadata: 4*32 KIOPS * 4 KB
Network: Gigabit Ethernet (management), QDR Infiniband (40 Gb/s, peak = 4 GB/s)
Power consumption: 58 kW (idle: 16 kW)
Performance data: MemStream: 119 GB/s/node, 20.5 TB/s total
MPI: 3.15 GB/s/node (alltoall uniform, but see problems)
Peak = 103 TFLOPs (40 FLOP/Word, 1.7 GF/W)
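A rough sanity check of the quoted peak, assuming 2 CPUs/node, 8 cores/CPU, 2.4 GHz and 16 double-precision FLOP/cycle (AVX2 FMA); the small gap to the 103 TFLOPs above likely comes from the exact node count or clock used:

  # peak per node and for 172 nodes
  awk 'BEGIN{ fn=2*8*2.4e9*16; printf "%.0f GFLOP/s per node, %.1f TFLOP/s for 172 nodes\n", fn/1e9, 172*fn/1e12 }'
  # prints: 614 GFLOP/s per node, 105.7 TFLOP/s for 172 nodes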

System software, application software

Access / contact

Access from within the university domain is via ssh neumann.urz.uni-magdeburg.de (141.44.8.59) with your university account issued by the URZ. Login is password-less using ssh public keys. Please send me your ssh public key together with a short description of your project and the estimated memory, node count and runtime requirements of your jobs by e-mail. Students need an informal confirmation from their university supervisor that they are allowed to use central HPC resources for research purposes. If you use Windows and Exceed for (graphical) access, please follow the configuration notes of the URZ. Please note that our compute servers are not intended for long-term data storage. The disk systems therefore have only partial redundancy, and backups are omitted in favour of performance and stability. Please save your results promptly yourself and remove the files you created, so that other users have enough storage for their computations. Thank you! For questions and problems please contact mailto:Joerg.Schulenburg(at)URZ.Uni-Magdeburg.DE?subject=WWW-t100 or phone 58408.
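
A minimal sketch of the key-based login described above (OpenSSH defaults; key file and account name are placeholders):

  # generate a key pair on your workstation if you do not have one yet
  ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
  # e-mail the public part (~/.ssh/id_ed25519.pub, never the private key)
  # together with your project description; once the account is enabled:
  ssh <uni-account>@neumann.urz.uni-magdeburg.de
  # add -X for X11 forwarding if you need graphical output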

Dates/Info/Planning:

29.10.15 - delivery of server racks
09.11.15 - delivery of nodes and start of hardware assembly
16.11.15 - start of software installation, replacement of defective nodes
25.11.15 - bare-metal Linpack/HPL benchmarks and hardware tests (58 kW)
xx.12.15 - storage test 8-9GB/s beegfs
16.12.15 - delivery + installation of rear-door cooling
22.12.15 - beegfs and 16 cores/node running (after storage problems and a limit of max. 12 MPI tasks/node)
22.12.15 - IB connections swapped to reach maximum IB performance (see "Problems" below)
13.01.16 - test operation for users (interruptions and changes possible)
26.01.16 - cooling-water connection finished, operation with cooling
27.01.16 - start of regular operation (production mode), restructuring of the partitions for different user needs
03.02.16 - performance losses in partition "big" (error analysis)
03.02.16 - node152 configured out for tests, lot of ECC-errors + slow
10.02.16 - special project-specific partition isut_20d_4GB with 72 nodes created (see Projects)
11.02.16 - set partition big to 2h (ca. 2 days) to allow isut_20d_4GB start
26.02.16 - renamed to Neumann in honour of John_von_Neumann
08.03.16 - fix bad modulefiles openmpi/gcc/64/1.8.4 and openfoam/2.3.1
09.03.16 - add modulefiles ansys/17.0/fluent and starCCM/10.06
23.03.16 - Partition=sw01_short limited to 3h for practising/debugging/troubleshooting
24.03.16 - fix missing libnuma.so on nodes for modulefile openmpi/gcc/64/1.8.4
30.03.16 - install strace-4.11 (used for debugging, ToDo: rename to debugging)
01.04.16 - X11-forwarding not working (ssh -X neumann), fixed by CV
13.04.16 - fix module openblas/gcc/64/0.2.14+15 from CV (syntax err + missing LIBRARY_PATH + need gcc/4.8.2 on nodes, 0.2.15 only ~parallel(?))
18.04.16 - Partition=sw01_short limited to 1h during the day (from 09:00, 2h from 14:00), at night 6h from 21:00, 4h from 03:00, 2h from 05:00
21.04.16 - Partition=sw01_short changed: at night 6h from 18:00 (user demand) (ToDo: UTC?)
02.05.16 - repeated early MPI exits because of 2GB of old /dev/shm/psm_shm.* files (please send ideas how to fix this)
09.05.16 - ca. 17:00 problem with login node (LDAP-database corrupt)
10.05.16 - service-call, database repair + reboot login-node, all jobs lost
10.05.16 11:00 system up, possible problem cause: crashing graphical app on the login node, please avoid running apps on the login node
         please contact the admin if you think your app caused the crash, so a fix can be found
17.05.16 - compile + install openblas_haswellp-r0.2.15 +cpu-affinity +warmup (only single-thread was installed)
17.05.16 - compile + install openblas_haswellp-r0.2.14 (default installation was 1..2x slower, without +cpu-affinity +warmup)
18.05.16 17:40 kill running zombies from Feb25,May13,May14 on 26 nodes (detected by health-script, reason 4GB-ulimit on login-node?)
19.05.16 - /cluster/apps/ /cluster/modulefiles/ grp-write-permissions removed (please ask the admin to get write permission)
24.05.16 - /home at 85%, please do not use /home for big data, use /scratch/tmp/${USER}/ instead
27.05.16 - install apps/libc/glibc-2.17 + add linux+asm-headers, gcc/4.8.2 works on nodes
08.06.16 - HPC user meeting, main topic: fair job scheduling needed
08.06.16 - queue urgent reactivated (7 nodes, prioritized short jobs and matching filler jobs, on request, 1% IMST 2.5% ILM 30% ISUT)
09.06.16 - university HPC wiki for users created
23.06.16 - system maintenance needed (shifted to 19.07.16, other problems)
23.06.16 - slurm.Nodes.RealMemory corrected from the default of 1MB to 254000MB, which allows use of option --mem (please use it!)
24.06.16 - please use increasing nice values when starting multiple jobs (see the sketch below), partitions modified
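           A hedged example of both hints (script names and memory value are placeholders):
             sbatch --mem=16000 --nice=0  job1.sh   # request memory explicitly (MB)
             sbatch --mem=16000 --nice=10 job2.sh   # later jobs get increasing nice values
             sbatch --mem=16000 --nice=20 job3.sh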
10.07.16 - 4TB-disk failed (of 8*(10+2 RAID6) /scratch), speed -6.25% MTBF1=0.56e6h vs. MTBF=1.2e6h
19.07.16 - system maintenance needed (shifted to August, scontrol show reservation)
22.07.16 - 4TB-disk replaced, rebuild started 12:00 - ca. 17:00, speed -6.25%
22.07.16 - 4TB-disk failed (of 8*(10+2 RAID6) /scratch), speed -6.25% MTBF1=0.23e6h vs. MTBF=1.2e6h
12.08.16 - some nodes show errors ... more info later
26.08.16 - slow system I/O, controller node out of memory caused by console log + 90% swap ... console log disabled
05.09.16 - c034 docker crashed by IB-I/O-error, reboot node034
11.09.16 - /home disk full, please use scratch
18.11.16 - node501 (win) after update failed boot POST=b9 (DIMM replaced 15.12.)
15.12.16 - node152 (DIMM replaced), lot of ECC-errors, failed boot + POST=b7
08.02.17 - activation of slurm job + account logging, sacct + sstat work now (needed for prioritization)
13.03.17 - testing new fair-share script (fvst_isut=30%,fmb_ilm=2.5%,imst=1%)
13.04.17 - /home 100% used, 64% after a warning via mailing list (6% /scratch)
12.05.17 - fix missing /scratch (beegfs) after reboot
           /install/postscripts/cv_install_beegfs uses yum with default repos,
           which fails if a repo is gone (fixed by option --disablerepo=\*)
31.05.17 - /home 100% used (571 * 750MB core files removed, /home = 85% used;
           top home user = 168GB, 10e6 files; please use /scratch/tmp/$USER instead of $HOME)
03.06.17 - 76 nodes not responding, problem with boot postscripts (yum
           external repositories failing) + ldap/nfs database corruption (?)
           expected downtimes until Tue 06.06.2017
08.06.17 - 38 nodes without /scratch, expected downtime until 15.06.
           boot process failed on external repos, fixed
19.06.17 - Infiniband problem examination with commercial mpi-libs within docker
           number of tasks/node limited to below 32 when using more than 34 nodes
           number of tasks/node limited to below 16 when using more than 75 nodes
22.06.17 - 8 nodes with bad DIMMs replaced (see sinfo -R)
           10 further bad DIMMs will be replaced at end of July
26.07.17 - Infiniband problem examination finished, start updating system
28.07.17 - 4TB-disk failed (of 8*(10+2 RAID6) /scratch), MTBF2=0.8e6h vs. MTBF=1.2e6h
03.08.17 - 4TB-disk replaced + rebuild
01.01.18 - 62 users, +17 in 2017, still growing, about 20 million CPUh/year
12.01.18 - 3 min clock skew corrected, ntp source was not configured
14.01.18 - 82 nodes hanging (nfs4 problem?  nfs_revalidate_inode: ... getattr failed, reboot)
22.01.18 - local time zone changed from UTC/Amsterdam to Berlin
31.01.18 - add openblas-0.2.15_no_affinity (openblas-affinity conflicts with
           MPI-affinity and may reduce performance dramatically by
           pinning multiple MPI-tasks on the same core of a node)
02.02.18 - Fri ca. 20:00 system in bad state, DoS on NFS, /home was at 100%
05.02.18 - reboot bad nodes (ca. 30), disable sw04_longrun + sw01_short,
           please use longrun + short instead (shorter names)
01.03.18 - switching to cpuh-based prioritization, low-cpuh users prioritized
          "squeue_prio_cpuh 2018-03" shows monthly cpuh per institution
10.03.18 23:49 slurm partitions were set down to prevent full /home and crash
13.03.18 - partition node ordering simplified (more IB fragmentation, better mgmt)
           add %R to squeue_all -l, adding node info (shorter than before)
           slurm.conf CPUS=16 removed, hardware shows 32, reducing error logs
           -- please use at most 16 MPI tasks per node for best performance (sketch below) --
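           A sketch of a job script keeping to that limit (node count and program name are placeholders):
             #!/bin/bash
             # 32 logical CPUs are reported (hyper-threading), but best MPI
             # performance is reached with 16 tasks/node, one per physical core
             #SBATCH --nodes=4
             #SBATCH --ntasks-per-node=16
             srun ./my_mpi_app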
14.03.18 - speedup of ssh connections by using /dev/urandom instead of the blocking
           /dev/random (blocking lasted up to minutes after some node reboots)
           /etc/sysconfig/sshd  SSH_USE_STRONG_RNG=0 (crypto is weaker now)
05.04.18 - remount / xfs with noatime,nodiratime should improve responsiveness
12.04.18 - set include/lib-paths in module gcc/4.8.3 to glibc-2.17
           (fixing compile errors on nodes)
18.04.18 15:00 cooling unit down, shutdown until 23.04.18
24.04.18 libmng+libpng12 installed on login-node, fixes fslview of the fsl toolbox
28.05.18 libXtst installed on login-node, for some Matlab functionality
12.06.18 modulefiles/python/2.7.12_Keras extended by PYTHON_CONFIGURE_OPTS
19.06.18 modulefiles/matlab renamed, R2018a added
21.06.18 GPU-kernel-module + module nvidia_driver/390 installed (was missing)
23.06.18 15:51 home 80% used, jobs suspended
26.06.18 epilog-script added to kill hanging processes and clean tmp-dir
30.06.18 15:22 home 90% used, all jobs auto-suspended, 01.07. resumed
08.07.18 cmake-3.8.2 installed
15.07.18 15:47 home 90% used, auto-suspend failed, 23:08 disk ok
12.09.18 repair of the air conditioning, load reduction via suspend/cancel if necessary
23.10.18 install cmake-3.4.3 + cmake-3.12.3; use: module avail cmake
25.10.18 gcc-4.9.4 installed, needed to compile glibc-2.28, uses glibc-2.17
26.10.18 nvidia-smi usable now, but 1 of 4 GPUs off for unknown reason
30.10.18 gcc-5.5.0 (c,c++ only) installed (build-problems on docker)
13.11.18 gcc-8.2.0 installed (ompi-1.8.4 problem on nodes? ok on login)
11.01.19 bad DIMM identified on node154 (after disabling channel interleave,
      some trial DIMM replacements and Linpack test runs with ECC errors)
18.01.19 re-enable node154, after replacing the bad DIMM and testing
21.01.19 re-enable node158, after identifying and replacing a bad DIMM using EDAC
22.01.19 re-enable node029, after BIOS boot POST=b7, halt, bad DIMM replaced
22.01.19 re-enable node168, after identifying and replacing 2 bad DIMMs using EDAC
24.01.19 re-enable node146, no EDAC error reproduced, please report problems
24.01.19 61 working nodes (36%) show some EDAC errors (correctable ECC memory errors, over 9 months)
         21 working nodes (12%) show more than 250000 EDAC errors in total (nodes to be checked)
         after about 16 bad DIMMs (0.6% of 172*16, on 15 nodes) replaced in the meantime,
         only 56% of the 172 nodes are unaffected by memory ECC errors
         25 nodes have 1-4 errors on a single CPU (soft errors?)
         DIMMs are mounted vertically (lower sky-radiation exposure), CPUs horizontally
27.02.19 please do not flood partitions with jobs (rule: max 10 jobs/user)
05.03.19 node013 job+node hanging (nfs overload?), reboot
05.03.19 node502 non-killable GPU-process since 10.02., noGPU.No4, reboot
05.03.19 node503 GPU.No4=0x82 not available, draining for reboot 
06.03.19 node503 reboot, get back GPU.No4=0x82
26.03.19 reduce gpu MaxTime from 44h to 4h (it is mainly for GPU testing)
02.06.19 slurm queues stopped, cooling failure on 01.06. 16:30, until 03.06.2019
06.06.19 defective 4TB SAS disk (R3dsk19) (of 8*(10+2 RAID6)) replaced, 12h rebuild
  statistics: 4 failing disks of 96 disks / 3.5 years (MTBF=0.74e6h, datasheet=1.2e6h)
16.06.19 shutdown for maintenance (power supply) until 17.06.19
17.06.19 problem getting virtual layer (docker/openstack) running, no login possible
21.06.19 openstack/docker disabled, cluster reconfigured, testing mode, new login
24.06.19 nodes enabled, please report problems, new ssh-fingerprints: 
 ECDSA   key fingerprint  SHA256:pTxYsStE8JI3VGXVfXn6Bs1c3agnmkpM8DbgDHjGMUw.
 ECDSA   key fingerprint  MD5:0e:26:b4:3b:59:c3:4b:43:8a:54:73:7d:fa:96:e4:78.
 ED25519 key fingerprint  MD5:e6:b7:df:5f:ab:b9:a2:a3:72:59:b4:63:78:14:f3:1e.
 RSA     key fingerprint  MD5:ed:5f:c5:6c:5e:51:92:08:d2:f7:f3:f0:16:a4:7b:31.
 DSA     key fingerprint  MD5:bf:ad:a2:57:a9:c2:35:13:98:ef:d8:20:c4:07:85:50.
03.07.-23.07. no support (holidays), use EMAIL if urgent
29.07.19 fix prioritization (was not working, missing option since 24.06.19)
21.09.19 python-devel-2.7 installed, python-2.7 + openssl updated
18.09.19 PriorityType=priority/multifactor (old:basic) to fix fairness
  problems for external scheduler, new jobs have lowest priority (1) initially
22.09.19 full home-disk and failing disk-monitor (cluster reconfigured)
  some outputs lost, please check your running jobs
06.12.19 modified job fairness (new: fairness for users within groups)
13.12.19 mirrored metadata SSD disk failed on storage02, 25% of beegfs scratch data inaccessible
16.12.19 defective INTEL SSDSC2BW480H6 (SandForce{200026BB}, 3.8y, 4TBW) removed from DMRAID1, reboot ok
14.02.20 GPU-node502 dead, hard reset
14.02.20 out-of-memory on login node because of bad user processes
15.02.20 GPU-node503 4th GPU=0x82 dead, node remotely reset
18.02.20 mirrored metadata SSD disk replaced on storage02,
         rebuild (dmraid -R) segfaults = redundancy problem
19.02.20 fuse-sshfs installed (data access)
16.03.20 storage03 beegfs-metadata-SSD-raid degraded (scratch-dir)
         defective INTEL SSDSC2BW480H6 (535 series) sdd=480GB to sde=32KB
         dmsetup status: 0 890863616 mirror 2 8:32 8:48 6560/6797 1 AD 1 core
         SMART 9=34721h=4y 232.Reservd=097 233.Wearout=083 241.W32MiB=107542
         reboot storage03 to initialize rebuild with new SSD 
         storage02 also booted for rebuild on replacement, dmraid segfaults
26.03.20 07:11 login node crashed because of a multi-bit ECC error on DIMMB1,
         followed by nfs errors, will not be resumed before 27.03.20 ca. 14:00
         most nodes need reboots on Monday, hanging nfs
30.03.20 reboot nodes to fix hanging nfs+apps, 
         switch nfs from hard to soft + timeo=3000 (5min)
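         for reference, a mount with these options would look roughly like this
         (server and export path are placeholders, exact node options may differ):
           # soft: return an I/O error after the timeout instead of hanging forever;
           # timeo is in tenths of a second, so timeo=3000 = 300 s = 5 min
           mount -t nfs -o soft,timeo=3000 nfsserver:/export/home /home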
31.03.20 10:20 room high temperature alarm 33C, automatically handled
         nodes switched from 2.4GHz (35.3kW) to 1.2GHz (23.0kW) for 28min
18.06.20 fix bad DIMMs and boot hangs (press F1 to proceed ...) on 2 nodes
22.06.20 fix bad DIMM node044, node150 (multiple ECC-errors)
24.06.20 fix bad DIMM node032 (850 ECC-errors/month + Memory_Train_ERR on boot)
07.07.20 fix bad DIMM node041,node093 (10000-18000 ECC-Errors/month or memtest)
13.07.20 fix gcc-5.5.0 openmp (by using gcc-4.9.4 openmp files via links)
         fix gcc-4.8.3 on nodes (5.5+8.2 have header-file problems on nodes)
23.07.20 tentatively reduced the priority of queue "short" to that of "night"
         to give both the same chance at night
08.03.21 00:48 /scratch is down, beegfs-mgmtd stopped for unknown reason
08.03.21 11:03 problem identified, beegfs-mgmtd restarted, /scratch is back
11.03.21 OpenMPI-4.1 installed (module load openmpi/gcc/64/4.1)
15.04.21 OpenFOAM-8  installed (module load openmpi/gcc/64/1.8.4 openfoam/8)
18.05.21 12:30 system maintenance, beegfs SSD failure and full root disk
19.05.21 /home moved to separate disk + quota activated, st01 SSD replaced
06.08.21 crashed RAID-controller, errors on /scratch, maintenance mode
20.10.21 beegfs-meta storage01 timeout problems, /scratch down
         2nd RAID-1 (Intel SSD) disk of beegfs-meta switched to a 32KB FW mode,
         RAID1 degraded and hanging,
         replacement SSD ignored for rebuild for unknown reason
01.11.21 storage04 power unit failure detected, no power redundancy

25.11.21 new system ready, about one week for users to transfer data
29.11.21 18:57 beegfs-mgmt-failure "unrecoverable error", fixed by restart,
         storage02 AVAGO MegaRAID SAS 9361-8i Cachevault CVPM02 failed,
         instead of 0.8 - 1.5 GB/s write per 12 disk-RAID6, only 80 MB/s/RAID6 
30.11.21 usage disabled, removing user accounts and data, uptime 614 days
01.12.21 fix slow storage RAID by setting wrcache=AWB (always write back)
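         on the AVAGO MegaRAID controllers mentioned above such a setting is
         typically applied with storcli; a sketch only, controller/volume selectors are placeholders:
           # AWB = always write back: keep write-back caching even without a healthy
           # CacheVault/BBU (trades safety on power loss for write performance)
           storcli64 /c0/vall set wrcache=AWB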
07.12.21 disassembling
10.01.22 during disassembly some CPUs were found with unusual damage on the surface

Projects:

This is an incomplete list of projects on this cluster, to give you an impression of what the cluster is used for.

Questions and Answers:

Problems:

The aim of this list is to share knowledge about problems in cluster administration. If you found this list via a web search and have similar experiences or solutions, please do not hesitate to share your knowledge with us. Thanks.

...

Other HPC systems:


Further information on central compute servers in the CMS or in the fall-back OvGU-HPC overview


Author: Joerg Schulenburg, Uni-Magdeburg URZ, Tel. 58408 (2015-2020)