In case you haven't yet noticed, the machine is back up. Mail and
Web services are proceeding apace.
It was a real pain. We first got a hint that there was a problem
on Wednesday, June 15, when the machine crashed for the first time,
at about 3AM PDT. I woke up and called MFN (our colo provider)
to reboot the machine. It took six hours for the full fsck to check
the disks and come back up. (It was the first time the disks were
checked since the machine went up in 2003. The machine only went
down when we changed racks last August.)
It happened again at 2:20AM on Thursday, June 16. The fsck was a
lot faster, only an hour from the call to service resuming.
I stayed up late Thursday night, in the hopes of catching the
problem. Needless to say, it stayed up that night.
Saturday morning, it crashed. I had it rebooted. It froze.
I knew we were down for the count.
Now, back in March and April, DBR had a fundraiser where we were
asking for money for new hardware. I had sketched out an upgrade
path to move from a single machine to multiple machines, aimed at
balancing load. (We were experiencing load problems, particularly
during weekend and month-end log file processing. The impetus for
this strategy was to move that work onto a less visible machine.)
We had planned to start this upgrade path in July.
So, at least we had an idea of what we wanted to do. I started
to set this in motion last Saturday, when I called my sales
representative at ASA Computers to let him know we were in a crisis
and needed a machine, fast. I gave him our preferred specs.
On Monday morning, he called me to let me know they had a machine,
already in production for another customer, that they could re-route
to DBR that came close to our desired specs. It had a bit less
memory, but apart from that it was a good match. I agreed we'd
purchase it when the burn-in finished.
Suffice it to say, from Monday through Wednesday we had some burn-in issues.
Wednesday evening, I picked up the machine and took it to our
colocation facility to start the data migration. I arrived at 4PM.
Now, the new machine runs Fedora Core 3 on a 64-bit processor.
The old machine ran RedHat 9 on a 32-bit processor. I expected
some upgrade issues.
If we had been able to follow the upgrade plan without the crash,
I'd have installed the machine, and fully tested it before taking
it live. Since we were down, I had to test-on-the-fly with the live
site. I am sure there are still many, many issues to be sorted out.
First, the network configuration changed. They now have a nice
GUI tool for configuring interfaces, which is fine. I set the IP
address to one that was not externally visible, so I could port data.
I then rebooted the old machine, and started to port. The machine
stayed up for 30 minutes before crashing again. The reboots took
30-40 minutes with fsck. Aha: The 100 Gbyte file system is ext2,
not ext3. That makes the fsck fscking long. But, after this
happened a few times, the machine suffered a more serious crash that required
a manual fsck. This time, I was able to see paths and the like,
and this told me exactly what the problem with the old machine was:
One of the disk sectors was bad, and we were hitting it.
So, I removed some files that can be recreated, if needed, and that
saved us 46 Gbytes. Wow. The machine has since stayed up.
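As an aside, those long checks are exactly what ext3's journaling is
meant to avoid, and an ext2 filesystem can be converted in place.
A sketch, with a placeholder device name:

```shell
# Add a journal to an existing ext2 filesystem, turning it into ext3.
# /dev/hda2 is a placeholder; substitute the real device.
tune2fs -j /dev/hda2

# Then change the filesystem type in /etc/fstab from ext2 to ext3,
# so the journal is actually used at mount time, e.g.:
#   /dev/hda2   /home   ext3   defaults   1 2
```

With the journal in place, an unclean shutdown means a journal replay
taking seconds, instead of a full fsck over the whole 100 Gbytes.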
The first thing I copied was the basics of the web pages, and I tried
to start the web server. Turns out there are enough differences that
the server did not start. The migration from apache 1 to apache 2
involved about an hour's worth of time changing configuration files.
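For anyone facing the same migration, the flavor of those changes was
roughly this (an illustrative sketch, not our actual httpd.conf):

```apache
# Apache 1.3 -> 2.0: directives that were removed or changed form.

# ServerType standalone     <- removed in 2.0; delete the line
# Port 80                   <- replaced by:
Listen 80

# AddModule mod_rewrite.c   <- AddModule/ClearModuleList are gone;
# a single LoadModule per module now suffices:
LoadModule rewrite_module modules/mod_rewrite.so
```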
So, I copied, and copied, and copied. There were still about 20
Gbytes of web site data, images, databases, etc, that had to go over.
At about 1AM, I decided I could do the rest remotely, so I changed
the IP address to be externally visible, and disabled web, mail,
and all services.
Thump. When I changed the IP address, the default route was not
regenerated, so the machine had no visibility out to the world at all.
The last network engineer on site had also gone home. I was stuck,
so I decided I'd address it in the morning.
My car battery died. Just to add to a perfect day. 30 minutes
later, I had a AAA jump start.
Thursday morning, we figured out the default route (the route
command has a minor change that is not well documented, or perhaps,
after 17 hours, I just couldn't find it). We were back in business.
Sort of. There's a new feature in Linux called SELinux. It provides
an extra layer of security and is, in principle, a good thing. The
only problem is that the data migration didn't take it into account.
So, when I tried to bring up the web server, I got a very cryptic
error message that had nothing to do with the actual problem.
It took a fair amount of digging to work this out, but I found and
implemented the solution. It suggested a reboot, so I rebooted.
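I won't swear this is the exact fix applied here, but the usual
remedies for data copied onto an SELinux system look like this (the
web root path assumes Fedora's default):

```shell
# Files copied over from a non-SELinux system carry missing or wrong
# security contexts, so httpd is denied access to them.

# Relabel just the web tree with its default contexts:
restorecon -R /var/www/html

# Or schedule a full filesystem relabel on the next boot;
# this is the variant that requires a reboot:
touch /.autorelabel
```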
The default route did not come back.
So, at the end of the work day yesterday, it was back to the
colocation facility (they would have charged for support for this) to
experiment with tricks to make the route persistent across reboots.
I finally worked that out, and went home.
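For the record, "persistent" comes down to putting the route in the
network config files rather than installing it by hand (the gateway
address below is a placeholder):

```shell
# One-off: install the default route by hand; lost on reboot.
route add default gw 203.0.113.1

# Persistent on Red Hat / Fedora: record the gateway where the init
# scripts will find it and re-create the route at every boot.
echo 'GATEWAY=203.0.113.1' >> /etc/sysconfig/network
```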
Last night, remotely, I debugged an issue with mail, and made
some changes to the mail configuration files, so mail came up.
I tried to bring up the web, and while the server came up, a lot
of pages didn't. It turns out that the 32-bit shared libraries
did not work properly on the 64-bit system, so I ended up needing
to rebuild several programs before things were working.
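A quick way to spot this class of problem, for anyone in the same
boat (the library name below is hypothetical):

```shell
# file(1) reports the ELF class of a binary or shared library:
file /usr/lib/libexample.so     # typically "ELF 32-bit LSB shared object ..."
file /usr/lib64/libexample.so   # typically "ELF 64-bit LSB shared object ..."

# On x86_64 Fedora, 64-bit libraries live in /usr/lib64 and 32-bit
# ones in /usr/lib; a 64-bit binary cannot load a 32-bit .so, hence
# the rebuilds.
```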
Now, I'm in a more watchful state to see if anything else breaks.
I'd love to have a nice, restful weekend, but I've got to catch up
on work. Maybe in 2006?