Nanos gigantum humeris insidentes

This morning I really feel as if I’ve joined the big time, almost able to rub shoulders with people such as Mario Wolczko 😉

Due to its heavily customisable design, I use Gentoo Linux on all of my home Linux servers – and this generally works well. Gentoo developers do seem to have the unerring ability to semi-regularly throw absolute clangers, though – and this is one of those occasions…

Previously the Linux e2fs tools included two libraries helpfully named ‘ss’ and ‘com_err’, and just last night these were merged into a single ‘e2fsprogs-libs’ package (the awkwardness of the name perhaps foreshadowing the problems this would cause). However, as it turns out, several other packages rely on libss, and a significant minority of the system utilities actually require libcom_err. This list includes ‘mount’, ‘sshd’, ‘wget’, ‘curl’ … and probably many others.

Completely unaware of how far-reaching the dependency chain could be for a library that is, after all, part of a filesystem tools package, I went ahead and uninstalled the ss and com_err packages – a prerequisite to installing the newer package, which they were marked as blocking.

Everything seemed to be going well until the packages were removed and the package manager came to download the new package to install – which failed, as ‘wget’ was no longer able to run. Intrigued, I tried to fetch the package manually, which also failed. And the same for ‘curl’. I wasn’t really worried until I then tried to NFS-mount a directory from my storage server to take the file from, and this also failed.

This, I realised, would be tricky: since the server runs headless I didn’t have convenient physical access – and that probably wouldn’t have helped in any case. The system is housed in a Mini-ITX chassis which is practically bomb-proof, very hard to prise open, and lacks any form of optical drive – so rebooting to a LiveCD and fixing things from there was also out. I consoled myself with the thought that I could drop the necessary package onto a USB drive and recover from there – until it occurred to me that this would require a working mount, which the system now lacked. ssh was also affected, so I was unable to connect remotely – after key negotiation the connection was dropped, presumably as the running ssh daemon tried to exec() ssh in order to handle the connection.

So – we have a system that effectively has a single working console and cannot safely be rebooted, without any working tool with which to fetch a package which might be able to fix things… and which hasn’t been built yet.

On the storage server I was able to build a binary package for the updated consolidated libraries and also install netcat – which unfortunately didn’t exist on the broken server. After a little thought, I wrote a minimal Perl script to open a socket, listen on a specified port and output the resulting data. With this redirecting to a file and netcat on the working machine echoing data to it, I was able to transfer the package and install it.
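For the curious, the receiving end of such a transfer can be sketched in a few lines. This is not the Perl script I actually used, but a rough Python equivalent of the same idea, and it assumes the broken machine still has a working Python interpreter. The function name, the port and the file name in the note below are all made up for illustration, and this version writes straight to a file rather than to standard output.

```python
import socket


def receive_to_file(port, out_path, host="0.0.0.0", chunk_size=65536):
    """Accept a single TCP connection and save everything it sends.

    Hypothetical helper for illustration only. Returns the number of
    bytes written to out_path.
    """
    total = 0
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        # Allow the port to be reused immediately if the script is re-run.
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        conn, _addr = server.accept()
        with conn, open(out_path, "wb") as out:
            while True:
                chunk = conn.recv(chunk_size)
                if not chunk:
                    # An empty read means the sender closed the connection.
                    break
                out.write(chunk)
                total += len(chunk)
    return total
```

On the broken machine one would call something like receive_to_file(9999, "package.tbz2"), and on the working machine feed the file to netcat with something along the lines of "nc broken-host 9999 < package.tbz2". Depending on the netcat variant, an extra option may be needed to make it close the connection once the input is exhausted. There is no integrity check here, so comparing checksums on both sides before installing anything would be prudent.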

Luckily there had been no major API or version changes, and so this was sufficient to get the system up and running again.

This is why I like UNIX systems – try recovering a Windows system in such a way if a system update goes awry.

Having solved the problem, the irksome aspect is that this problem was reported numerous times whilst the updated package was being tested, and Gentoo developers offered various hacks to address the problem which were both messy and not for the faint-hearted. But the problem was removed and so nothing else happened… until the package was marked as stable and rolled out to all users, problematic install and all. There are still several packages which, if compiled with Kerberos support, cannot currently be installed, as they require the discrete com_err and ss and haven’t been updated for the new package. Someone has really dropped the ball here – and it’s almost worse that, rather than this being an issue which materialised out of the blue, it was known and thoroughly discussed, and absolutely nothing useful was done about it.

P.S. Reference