The main hard drive in one of our servers died a week before Christmas. It was a total crisis, and also a test of our backup routines. The crisis was twofold. First, the server was unavailable until the datacenter where it is co-located could replace the hard drive. Second, the backups were stored in a place with high bandwidth in (so backups could be transferred to it quickly) and almost no bandwidth out (so it took quite a while to restore them). All in all, this caused quite a bit of downtime, because the backups had to be slowly transferred to other servers that temporarily did the job of the server with the defective hard drive. It also put a larger load on those servers, because they suddenly got an unforeseen load increase.
By unforeseen load increase I mean this: doing backups regularly was, luckily, covered, but what to do with them, which server to restore them to, and how, had not even been considered before the crisis hit.
So. If you are a CEO or just a normal IT guy in a corporation, you may want to consider:
- Do you have backup routines that cover all important data on all servers (that is, all data that is not part of the OS or the programs shipped with it)?
- Do you have routines for where this data should be restored to in order to bring backed-up services up immediately using alternative server(s)?
- Do you have alternative servers ready to run affected services if one or more servers go down?
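The three questions above can be turned into a simple self-audit. The following is only a minimal sketch, not a description of how our servers are actually managed: the `Service` record, its fields, the example service names, and all sizes and bandwidth figures are hypothetical. It also includes a rough restore-time estimate, since slow outbound bandwidth from the backup location was exactly what hurt us.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Service:
    """Hypothetical inventory record for one service (all fields are assumptions)."""
    name: str
    has_backup: bool                # is all important data backed up?
    restore_target: Optional[str]   # server the backup would be restored to
    standby_ready: bool             # is that server ready to take over?


def readiness_gaps(service: Service) -> List[str]:
    """Return which of the three checklist questions this service fails."""
    gaps = []
    if not service.has_backup:
        gaps.append("no backup routine")
    if service.restore_target is None:
        gaps.append("no documented restore target")
    if not service.standby_ready:
        gaps.append("no standby server ready")
    return gaps


def restore_hours(backup_gb: float, outbound_mbit_per_s: float) -> float:
    """Rough transfer time for a restore, ignoring protocol overhead."""
    megabits = backup_gb * 8 * 1000  # 1 GB = 8000 megabits (decimal units)
    return megabits / outbound_mbit_per_s / 3600


if __name__ == "__main__":
    # Made-up example inventory; replace with your own services.
    services = [
        Service("web", True, None, False),
        Service("search-index", False, None, False),
    ]
    for s in services:
        print(s.name, "->", readiness_gaps(s) or "ok")
    # Made-up numbers: a 100 GB backup over a 10 Mbit/s outbound link.
    print(f"estimated restore time: {restore_hours(100, 10):.1f} hours")
```

Running something like this against a real inventory before a crisis makes the gaps visible while there is still time to fix them.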
The reason these questions come to light today is this:
“Date : 01/11/2007
Reboot is failing, BIOS is not detecting your disk but is waiting forever.
At this point I can only offer a reinstall on a new drive of the linux of your choice (I recommend CentOS 4.4 as it’s the quickest install) and slaving your old drive in hopes of some data recovery. This assumes the drive can even be hooked to the IDE bus as a slave without preventing system boot.
Please update this ticket to tell us how you wish to proceed.”
Yes. Another server stopped responding. The datacenter was asked to reboot it. They did, and this is what they had to say about it. Great. Another dead hard drive. The more servers you have, the more trouble you have.
Luckily, it did not really matter much that the server died. It was one of the servers used to run the YacySearch search engine. An index of about a million URLs and their keywords was “lost”. Well, “lost” in the sense that it does not matter enough to back up, or to attempt to recover from the old hard drive, since it is only a matter of re-crawling and re-indexing those URLs. Still, data was lost because there were no backup routines for it.
The result, in this case, was that YacySearch is temporarily slightly slower and temporarily shows a few fewer search results for a few keywords. But service downtime and losses could have been greater. So a word of advice: check your backup routines, and the routines for keeping your services running if a server fails.
This also applies to personal computers. Imagine this: the computer you are currently using to read this just died. Its hard drive is defective. Everything on it is gone forever. Does it bother you? If it does, that means your backup routines are not good enough.