experiences with computers and software (under construction)

Because we manage most of our computers ourselves, we have some experience with PCs running Linux and with more powerful multiprocessor machines (SUN, HP, SGI, IBM) with gigabytes of RAM running various Unix variants. I used them for heavy calculations (exact diagonalization of BIG matrices), which need a lot of CPU power and shared memory. Maybe you will find some useful hints here.

Linux

SuSE 9.1

SuSE 8.0

We are now running SuSE 8.0 on three of our PCs. The only problem after installation was that sendmail was not working (connection refused on port 25). I changed the arguments in /etc/rc.d/sendmail to "-bd", and now it runs without problems. I also recommend using ext2 or ext3 instead of ReiserFS. ReiserFS is probably not stable and can cause problems after power failures (see "Linux crashes").

Red Hat 7.3 (first contact with Red Hat)

My first impression is that SuSE is more comfortable to install. I missed a description for each software package. Also, tuning X to more than 85Hz was not possible with Xconfigurator, so I had to edit the Modelines in /etc/X11/XF86Config-4 by hand.
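When editing Modelines by hand it helps to check the resulting refresh rate first: it is the pixel clock divided by the total number of pixels per frame (horizontal total times vertical total). The following is only a small sketch; the function name is made up for illustration, and the example timing is the standard VESA 1024x768 mode, not a Modeline taken from the machine described here.

```shell
# Hypothetical helper: vertical refresh rate (Hz) of an XFree86 Modeline.
# Arguments: pixel clock in MHz, horizontal total, vertical total
# (the 4th horizontal and the 4th vertical timing value of the Modeline).
modeline_refresh_hz() {
    LC_ALL=C awk -v clk="$1" -v ht="$2" -v vt="$3" \
        'BEGIN { printf "%.1f\n", clk * 1000000 / (ht * vt) }'
}

# Standard VESA timing, used here only as an example:
# Modeline "1024x768" 94.5 1024 1072 1168 1376 768 769 772 808
modeline_refresh_hz 94.5 1376 808    # prints 85.0
```

To get above 85Hz one raises the pixel clock or shrinks the blanking intervals, within what the monitor and graphics card allow.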

Mozilla 1.1

Since August 2002 I have been using Mozilla 1.x instead of Netscape. I can only recommend it.

Qemu-0.6.1 - the PC emulation (Jan05)

Very impressive! Faster than Bochs. It handles VMware disks (except files split into 2GB pieces and SCSI images). WinXP can be installed, but it does not boot after installation. I checked the partition entry and found some strange CHS values (64 heads, 63 sectors), which probably confuse the BIOS. I tried different -hdachs options, and the behavior of the boot process changes, but I found no way to get it running. I had the same problem with WinNT4. If you know what is going wrong, tell me. The Knoppix console locks up irregularly for unknown reasons.
I used my spinpack package for speed tests and found a slowdown factor of 6 to 10 for numerical applications (a 2600MHz PC emulates roughly a 300MHz machine).

Qemu-0.7 - the PC emulation (Jul05)

WinXP-Prof-DE runs. After installation it does not work in normal mode and shows error messages that the license cannot be verified. Use safe mode ("Abgesicherter Modus", press F8 very early) and install SP2 from a CD image (also with some errors, but it works). After that XP runs slowly but normally. Use CD images instead of /dev/cdrom, which is terribly slow (5 hours instead of one for the installation).

server administration

I have administered the following servers (excerpt):

Unfortunately the list is incomplete, but it gives you an overview of the kinds of computers I have administered.

HP 9000 Model 819/K210, system: HP-UX B.10.01

We bought this system in 1995 with one HP-PA7200-120 processor and 640MB RAM. The system ran very stably, and the machine was easy to take apart. If I remember right, we had to build a new kernel to use more than 128MB RAM, but that was no problem. Unfortunately we had no air-conditioned room for the machine, and in summer we had to shut it down several times to avoid risking damage.

Disadvantages were the noisy fan and, as we noticed later, that the three 2GB hard disks (SCSI-LVD) ran very hot, around the maximum of the disk specification. That was probably the reason the first disk died in 1998. The second followed in 1999. We tried to add a new (non-HP) SCSI disk to the machine, but it was not detected. So we asked for original HP disks and were quoted 6 TDM (6,000 DM) for 9GB, unbelievable! We could not find other sources, and replacing all disks with other SCSI (SE or LVD) disks or using a SCSI-IDE adapter failed too, so we decided to give the machine away in 2001. A second machine bought at the same time by colleagues had the same problems.

Another problem was the external MOD drive. After running for 6 months it produced only errors. We suspected dust and asked for an exchange or cleaning. Months later I got a call from the Netherlands telling me that we were using wrongly formatted disks (1024-byte sectors instead of 512-byte sectors) and that this would cause an error in the system after writing a number of disks. I could not believe this: first, there is no logic behind it; second, we got these disks together with the other equipment; and third, another MOD drive of the same type (probably not as dusty) was running well on the same machine. That's service! If there had been no warranty, I would have cleaned it myself. In the end we did not use the MOD drive anymore, and I would never again buy expensive state-of-the-art high-tech drives that are not standard technology. A normal CD writer would have been the better choice.

IBM model 580

Bought in '90 (?), 640MB RAM, 66MHz processors, running AIX 4.3. The support was not the best. When we bought a new disk for this machine, we did not get the right screws and adapters, but the disk worked well for years lying on the dusty bottom (good disks). Updating to a new AIX version was always an adventure. "Never change a running ...". In 1997 a graphics card failed after a power failure. The cheapest replacement was a Gt4xi graphics card at the price of a new PC. In 2001 we threw the machine away.

Alpha-PCs

Bought in 1998: two identical machines with 1.5GB memory, 164UX boards at 67MHz, DEC Alpha 21164A CPUs at 533MHz and 9GB SCSI disks, running Linux. Very nice machines and the cheapest available in this configuration. The only problem was that the IDE adapter is really slow; I do not know why. An additional PCI-IDE card (no-name CMD-PCI646U2) could be used only in slow modes because of missing drivers (?). But using IDE disks via SCSI-IDE adapters on the SCSI bus was no problem. The inside of the tower gets very warm, but the machines are now in an air-conditioned room and no problems are expected. With older Linux 2.2.x kernels there seemed to be a problem with applications using more than 1GB RAM, but since switching to Linux 2.4.x the machines have been 100% stable. After we moved to another building we had problems with auto-sensing of the 10Mb/100Mb full-duplex/half-duplex network; only 10Mb/half-duplex worked well. Luckily we do not need a high-speed network at the moment. Soft-RAID0 increased the disk speed from 19MB/s (one disk) to 28MB/s (two disks). All in all, it was a good deal.

SGI Power Challenge (not administered by our work group)

Bought in 1999: 8 MIPS R10000 CPUs at 250MHz, 8GB shared memory, running IRIX64 v6.4, very fast and stable. If I remember right, we had to replace one defective board during the last two years (within the warranty period). Using MP pragmas for parallel processing worked badly for complex programs. Sometimes the program was 3 times slower on 8 processors than on one. I could not find out why. With pthreads I got a speedup of a factor of 8 with the same algorithm. So I do not trust the very expensive parallel C compilers and use the more primitive standard libraries for multiprocessing instead.

Tru64-V5.1 on Alpha GS160, GS1280 and ES45 (Dec04)

I administer three Alpha systems; two of them are big machines (one has 128GB memory and 32 EV7 processors, the other 24GB and 16 CPUs) and very fast! Unfortunately the system is not very stable; there are two to four crashes a year. The hardware support is OK, but the software support is bad. You get updates regularly, but don't try to ask HP questions about misbehavior of the Tru64 system. The hotline does not think about forwarding your question or report to the programmers; they only ask for money for tuning support (@HP: I don't want to buy tuning support, I want the bugs I found in your system fixed without paying additional money for it!). Probably they cannot reproduce our problems on their test machines, and the effort of analyzing a problem on a 128GB machine is high, but simply passing the ball back is not the right way to treat customers, is it? So I don't ask anymore and try to solve the problems myself. That is not easy without kernel sources and a tracing tool like strace on Linux. Here is a list of things which cause problems on Tru64 V5.1; maybe it is useful for you to know:

Sun Ultra 1 SBus, UltraSPARC 143MHz, 512MB RAM

We bought two machines in 1997 (one with only 64MB RAM). They are quite OK. Easy to open and look inside. After one year we changed from Solaris to Linux because it is easier to manage if you already have a lot of experience with Linux but less with Solaris. One bad point is the SunTurboGX graphics card. It could only be used with 1152x900x76colors at 72kHz/76Hz. With a 20-inch high-tech SUN monitor you easily get a headache at less than 80Hz. It is really a badly matched hardware combination. Another point is the CPU fan. Two of them have died within the last two years. We also had one disk crash, and you need special (expensive) SUN disks. The machines are now used only as number crunchers and for guests. The funny thing is that you get a lot of software you never need with every machine, but only one boot disk for about 16 machines. Since September 2002 we have been testing SuSE 7.3 for SPARC, without problems so far.

Pentium dual board 440LX

Bought in 1998 with two 300MHz PII CPUs, 512MB RAM and SCSI. Big mistake! At that time every PC magazine claimed that one needs SCSI for burning CDs, but this was probably only true for WinXX. SCSI was not necessary for the CD writer to avoid buffer underruns. After using IDE CD writers under Linux on other, older PCs without any trouble, I prefer the cheaper IDE versions today. This was not the real problem, though; the chosen CPUs were bad. The CPUs become so hot that the board beeps if both CPUs are 100% used. After a few weeks with lots of crashes the heat sinks lost contact (a heat sink was bent by the heat). The new heat sinks we got did not completely solve the problem. The seller could not really solve it either and switched off the BIOS heat warnings, with moderate success. Nowadays I know that this CPU version was the hottest one. The voltage was increased to 5V to get the CPUs running at 300MHz. Surprisingly, the CPUs have not burned out after 3 years of use. On hot days with both CPUs in use the board still does its quiet beeping. After such experiences we went back to Celerons with moderate clock rates and 128MB RAM, which are silent, and we are still waiting for PCs with stable boards and more than 1GB RAM to become widely available so that we can build small PC clusters as a better solution. Not buying the newest product seems to be a good tactic nowadays. If you need more performance, tune your code!

Linux crashes

In January 2002 we had trouble with two machines running SuSE 6.4 with reiserfs-2.x on root. Both machines showed inconsistencies in the ReiserFS filesystem (all actions took a lot of time) after about 19 months. A new installation was necessary.

Remark: If any entry appears twice in /etc/hosts, sendmail fails to start. It took us more than two hours to find and fix this problem.
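A quick check for such duplicates could look like the sketch below. The function name is made up, and since the remark does not say which kind of duplicate upset sendmail, the sketch simply reports both repeated addresses and repeated names as candidates. Note that a name like localhost listed for both 127.0.0.1 and ::1 will also be reported, although that is usually intentional.

```shell
# Hypothetical helper: print every address or host name that occurs more
# than once in a hosts-style file (default: /etc/hosts).
hosts_duplicates() {
    awk '
        { sub(/#.*/, "") }                  # strip comments
        NF >= 2 {
            if (++addr[$1] == 2) print "duplicate address: " $1
            for (i = 2; i <= NF; i++)
                if (++name[$i] == 2) print "duplicate name: " $i
        }
    ' "${1:-/etc/hosts}"
}

# Usage:
#   hosts_duplicates              # checks /etc/hosts
#   hosts_duplicates ./hosts.new  # checks another file
```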

On September 23rd, 2002 we had a power failure. After that, a PC with SuSE 7.3 and reiserfs-3.x.0k-pre9 showed non-reproducible errors during numerical calculations (wrong results, unexpected aborts). We ran a reiserfs check, but it claimed that everything was OK. After a reboot and further tests the gcc compiler also gave non-reproducible "internal compiler errors", which appeared more and more frequently and finally led to a kernel panic. We installed SuSE 8.0 with ext2, and again strange things happened. So we opened the PC and made a visual check of the hardware. Only the CPU fan (2 years old) was not in its best state; in some cases it did not start rotating again after being stopped by hand. Had the fan failed to start after the power failure? This would cause too high temperatures and would explain the strange behaviour. Indeed, the PC ran without problems after we checked the CPU fan, so we are now sure we have located the problem.

Speed tests

Network speed tests

Have you ever tried to measure the speed of your network? A simple command to do that is:

  time dd if=/dev/zero bs=1024k count=1000 | rsh remotehost "cat >/dev/null"
The result is 89s for a 100Mbps Ethernet card (1000MB/89s = 11.2MB/s = 90Mbps). Pretty accurate! For a 1000Mbps card the test failed because rsh took 100% of the CPU time. In a second test I started the above command 6 times in parallel and got 55MB/s = 440Mbps on an 8-CPU machine, which shows that a better speed test is needed here. Do you have a simple one?
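The conversion used for the numbers above can be written down as a small helper. This is only a sketch: the function name is made up, and it ignores protocol overhead.

```shell
# Hypothetical helper: convert "megabytes transferred" and "seconds elapsed"
# into MB/s and Mbps (1 byte = 8 bits; protocol overhead is ignored).
throughput() {
    LC_ALL=C awk -v mb="$1" -v s="$2" \
        'BEGIN { printf "%.1f MB/s = %.0f Mbps\n", mb / s, mb / s * 8 }'
}

throughput 1000 89    # prints: 11.2 MB/s = 90 Mbps
```

Strictly speaking, bs=1024k means blocks of 1048576 bytes, so 1000 blocks are about 1048.6 decimal megabytes; with that figure the same run comes out at about 94 Mbps, even closer to the nominal rate of the card.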

Security

We had only one compromised system: an old 486 PC used as a printer and floppy disk server (for the SUNs without floppy drives). This PC was used to upload and download files. We noticed it because the PC crashed after its 500MB disk was full. Other problems were connected with sendmail and relaying. Our machines were used three times to send spam around the world. We noticed it because the machines could not do anything else and there was a lot of disk activity. Sorry to all victims of the spam. We have now configured all our machines not to relay email. Since our machines are configured more securely and we use SSH logins, we rarely notice port scans and other attacks.

If you use Windows on your client PC and want to log in to a Unix box, you can use Exceed + ssh (commercial), the Cygwin package or Xming to work with graphical applications.

Using pine for imap-server via SSL:

 pine -f {IMAP-Server/imap/ssl/user=userid}inbox   # OR
 .pinerc  inbox-path=\
    {sunny.urz.uni-magdeburg.de/imap/ssl/user="username"}inbox
   # inbox-path={imap.web.de/novalidate-cert/user="username"}inbox
 instead of novalidate-cert do:
   - download OvGUMssl.pem to /etc/ssl/certs
   - openssl x509 -noout -fingerprint -in OvGUMssl.pem  # better use SHA1
     # MD5 Fingerprint=72:A0:34:4C:64:18:57:6A:80:9A:89:72:48:92:7F:83
   - openssl x509 -noout -hash -in OvGUMssl.pem # 6cc6a28b
   - ln -s OvGUMssl.pem $(openssl x509 -noout -hash -in OvGUMssl.pem).0
   - openssl verify -CApath /etc/ssl/certs OvGUMssl.pem
   - same with dfn-cert.pem
   - ToDo: check CRL = certificate revocation list
 # check the connection: netstat -atn # to hostip:993 ESTABLISHED (imap via SSL)

out-of-memory/out-of-swap (May06)

This happens sometimes on our compute servers, mostly when users do not estimate the memory needs of their programs. Most operating systems slow down but keep working. Tru64 5.1B does the worst thing: it kills arbitrary processes (including old daemons running as root), which often results in a crash. IRIX64 6.5 killed the user process in all test situations, and everything else continued fine. ulimit -v can help, but it is not practical for MPI processes with asymmetric memory consumption on shared-memory machines.
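For a single, non-MPI job the ulimit -v approach can be wrapped as in the following sketch. The wrapper name and the 2GB figure are only examples, and it assumes a shell whose ulimit supports -v (bash and dash on Linux do).

```shell
# Hypothetical wrapper: run a command with a cap on its virtual memory.
# $1 is the limit in kilobytes (the unit used by "ulimit -v");
# the remaining arguments are the command to run.
run_with_vmem_limit() {
    (
        ulimit -v "$1" || exit 1   # affects this subshell and its children only
        shift
        exec "$@"
    )
}

# Example: a child shell reports the limit it inherited (about 2GB here).
run_with_vmem_limit 2000000 sh -c 'ulimit -v'
```

Because the limit is set in a subshell, the calling shell keeps its own limits. A program that tries to allocate more than the cap gets a failed allocation instead of dragging the whole machine into swap.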

Ethernet cluster (Oct2008)

Found a firmware bug in the IPMI software of the BMC of a DELL PowerEdge 1950 server which causes a Linux system crash under high load. See the end of that webpage for more information.

SiCortex SC5832 + 256GB 32Core SMP (Jan2009)

SC5832, 256GB SMP,

problem: OOM killer kills init, sshd etc. (Mar09)

This happens if a long-running process (days or weeks) eats all the memory. The OOM killer does not kill this process because long-running processes are considered important. As a solution, create /etc/skel/.ssh/rc with "ps -o pid --no-heading | xargs renice 10 >/dev/null 2>&1" and copy it to existing users' home directories. You could also use /etc/ssh/sshrc (xauth add .. must be added to the sshrc files, because sshd does not call xauth if an rc file exists, and X11 tunneling will fail). A higher nice value makes the killing of system processes less likely. Also set /proc/sys/vm/overcommit_memory to 2 and /proc/sys/vm/overcommit_ratio to 90 or higher. Also disable swap; it makes no sense for HPC and only causes a long slowdown before the OOM happens. Programs which allocate all of the memory and more are bad.
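With overcommit_memory set to 2 the kernel refuses allocations beyond a commit limit of roughly swap plus RAM times overcommit_ratio/100. The sketch below only does that arithmetic, to show what a ratio of 90 means on a 256GB machine without swap; the function name is made up and huge pages are ignored.

```shell
# Hypothetical helper: approximate commit limit in MB when
# vm.overcommit_memory=2:  limit = swap + RAM * overcommit_ratio / 100
# (huge pages are ignored in this sketch).
commit_limit_mb() {
    ram_mb=$1 swap_mb=$2 ratio=$3
    echo $(( swap_mb + ram_mb * ratio / 100 ))
}

commit_limit_mb 262144 0 90    # 256GB RAM, no swap: prints 235929

# As root, the settings named in the text can be applied with:
#   sysctl -w vm.overcommit_memory=2
#   sysctl -w vm.overcommit_ratio=90
```

On a running system the effective value can be compared with the CommitLimit line in /proc/meminfo.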

problem: automatic power-down of an SGI Altix 330 server (Apr09)

The problem is to power off the Altix 330 server after shutdown in case of high room temperatures or power failures (to save UPS power for other machines). shutdown -p -h does not work; the machine stays on and still consumes power. The only way is to connect to the service processor from the service net by telnet and power it off. This can be done automatically with:

(echo "pwr down";sleep 9;echo -e "\x1dquit";sleep 1) | telnet 10.0.0.1
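The same sequence can be put into a function so that it can be checked in a dry run before it is pointed at a real service processor. This is just a sketch; the function name and the optional delay argument are made up, while the commands sent are the ones from the one-liner above.

```shell
# Hypothetical helper: emit the command sequence for the service processor.
# $1 = seconds to wait after "pwr down" before leaving telnet (default 9).
sp_power_down_cmds() {
    echo "pwr down"
    sleep "${1:-9}"
    printf '\035quit\n'    # 0x1d is Ctrl-], the telnet escape character
    sleep 1
}

# Dry run: show the bytes that would be sent, without contacting anything.
sp_power_down_cmds 0 | od -c

# Real use (address as in the one-liner above):
#   sp_power_down_cmds | telnet 10.0.0.1
```

The sleeps matter: telnet needs time to connect and the service processor needs time to act before the escape character closes the session.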

The same technique can be used for the Alpha servers above.

ssh - attacker dt_ssh5 18.01.2010

In January 2010 one of our machines launched ssh attacks against other servers around the world. It was the 141.44.40.29-linux machine. netstat -atn | wc -l showed about 2500 ssh connections. The ps auxw output looked like this (user name changed):

matze   18773  0.5  0.0   1736   308 ?        S    Jan17  10:42 ./dt_ssh5 200 2 17.79.182.153 2
root    18492  0.1  0.1   8252  2396 ?        Ss   11:10   0:00 sshd: root@pts/0
root    18890  0.0  0.0   4120  1844 pts/0    Ss   11:11   0:00 -bash
matze   25993  0.0  0.0   1736   468 ?        S    11:14   0:00 ./dt_ssh5 200 2 17.79.182.153 2
matze   26023  0.0  0.0   1736   468 ?        S    11:14   0:00 ./dt_ssh5 200 2 17.79.182.153 2
matze   26024  0.0  0.0   1736   468 ?        S    11:14   0:00 ./dt_ssh5 200 2 17.79.182.153 2
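Counting lines with wc -l also counts the netstat header and all unrelated sockets. A slightly more targeted count of outgoing ssh connections could look like the following sketch; the function name is made up, and it assumes the usual Linux column layout of netstat -atn (foreign address in column 5, state in column 6).

```shell
# Hypothetical helper: read "netstat -atn" output on stdin and count
# established TCP connections whose remote port is 22 (ssh).
count_outgoing_ssh() {
    awk '$1 ~ /^tcp/ && $5 ~ /:22$/ && $6 == "ESTABLISHED" { n++ }
         END { print n + 0 }'
}

# Usage on a live system:
#   netstat -atn | count_outgoing_ssh
```

A scanning host will also show many connections in state SYN_SENT; dropping the state test from the awk condition counts those as well.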
The user had chosen a too simple password (183000 Google hits for it). The binary was located in /tmp and showed these properties:
ls -l
-rwxr-xr-x   1    1001 users 1379632 2010-01-17 05:05 dt_ssh5
file tmp/dt_ssh5
tmp/dt_ssh5: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.6.4,\
 bad note description size 0x83e58955, bad note name size 0xe8000001, bad note name size 0xc2815a00,\
 bad note description size 0xc7397500, bad note name size 0x89589c00, bad note name size 0xc831589c,\
 statically linked, stripped
md5sum tmp/dt_ssh5
 f0b5fc67c41d567c1f306e88363f139a  tmp/dt_ssh5 
strings -9 dt_ssh5 showed strings belonging to the ssh and openssl libraries. Two successful logins in /var/log/messages (name changed):
Dec 21 21:55:41 fermion sshd[23143]: Accepted keyboard-interactive/pam for matze from 58.247.222.163 port 40039 ssh2
Jan 17 05:05:10 fermion sshd[18758]: Accepted keyboard-interactive/pam for matze from 217.79.182.153 port 45300 ssh2
Jan 17 05:05:10 fermion sshd[18761]: subsystem request for sftp
Jan 17 05:05:10 fermion sshd[18761]: channel 0: rcvd big packet 131030, maxpack 32768
Jan 17 05:05:10 fermion sshd[18761]: channel 0: rcvd big packet 112867, maxpack 32768
Jan 17 05:05:10 fermion sshd[18761]: channel 0: rcvd big packet 112838, maxpack 32768
Jan 17 05:05:10 fermion sshd[18761]: channel 0: rcvd big packet 112809, maxpack 32768
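To spot such logins earlier, the accepted ssh logins can be summarized per user and source address. The following is a minimal sketch, assuming the sshd log format shown above; the function name is made up.

```shell
# Hypothetical helper: read syslog lines on stdin and print, per user and
# source address, how many ssh logins were accepted.
accepted_logins() {
    awk '/sshd\[[0-9]+\]: Accepted / {
             for (i = 1; i < NF; i++) {
                 if ($i == "for")  user = $(i + 1)
                 if ($i == "from") addr = $(i + 1)
             }
             count[user " " addr]++
         }
         END { for (k in count) print count[k], k }' | sort
}

# Usage:
#   accepted_logins < /var/log/messages
```

An address that shows up for a user who normally logs in only from the local network is worth a closer look.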