Pre-before: there are Cliff's Notes at the bottom for A.D.D./A.D.H.D. accessibility.
Before I begin, let me state that I have limited experience with server administration and rely on the extremely helpful information I can gather from the experienced administrators who share it in conversation on this forum. I also encourage any form of flaming of my logic, for any reason whatsoever, so that I can iron out my thinking and point it in the right direction for this problem. If I am over-thinking this, feel free to let me know and tell me the most pragmatic and refined way to solve my problems.
With the disclaimer out of the way:
I currently administer a dedicated server running Fedora Core 4 with the Ensim Appliance, which hosts 100 domains.
Normally there are no problems, but on occasion our domain registration checker, a Perl script that runs through CGI and queries Tucows, our domain registration provider, has caused Apache to fail. When that happens, every single site on our server becomes inaccessible. Obviously I need to correct this issue. (Our domain name checker runs on the IP of our server rather than on our individual domain, and I believe some odd permission error may be the cause; I am working to rectify this.) Regardless, with only 100 domains under my belt, I can only expect more daunting and problematic issues in the future. There is plenty of talk of server clustering, load balancing and DNS failover, but none of these would have prevented the incident above; they would just have replicated it across the entire cluster, which is pointless.
My real question here is what would be the best way to prevent a situation like this from occurring again. Am I tying myself down by looking at it from the appliance’s point of view?
One way I have looked at tackling this is to replicate only the content of the server and mirror it onto an independent, appliance-free web server used solely for redundancy. With DNS failover it should kick in and work perfectly, except that the content could be up to a day old, since I would run the mirror as a cron job much like my Ensim appliance backups, which occur nightly at a specific time. We could then use another backup to restore our primary server to a healthy state, or correct the original cause of the failure.
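To make the mirroring idea concrete, here is a rough Python sketch of the nightly job I have in mind. It assumes rsync over ssh is available on both machines and that key-based login is already set up. The function names, the path /home/virtual and the host standby.example.com are placeholders I made up for illustration, not something I have running.

```python
import subprocess


def build_rsync_command(src_dir, dest_host, dest_dir):
    """Return the rsync argument list that mirrors src_dir to dest_host:dest_dir."""
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    src = src_dir.rstrip("/") + "/"
    return [
        "rsync",
        "-az",        # archive mode (permissions, times, symlinks) plus compression
        "--delete",   # remove files on the mirror that no longer exist at the source
        "-e", "ssh",  # use ssh as the transport
        src,
        f"{dest_host}:{dest_dir}",
    ]


def mirror(src_dir, dest_host, dest_dir):
    """Run the mirror once; return True when rsync exits with status 0."""
    result = subprocess.run(build_rsync_command(src_dir, dest_host, dest_dir))
    return result.returncode == 0
```

Cron would call something like mirror("/home/virtual", "standby.example.com", "/home/virtual") once a night. Note that --delete makes the standby an exact copy, so a bad change on the primary would be copied across on the next run; that is the trade-off of mirroring compared with keeping dated backups.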
Another method I have pondered is to replicate the entire server bit for bit onto an identical machine, again with a one-day lag. To me this seems costly and more prone to failure; in addition, I would be backing up a lot of extraneous system files, wasting bandwidth and slowing down our servers. However, assuming I did it correctly, the replica combined with DNS failover would be fully functional and seamless.
Keep in mind that this is a startup with limited funding for the project (although, if necessary, price is not the deciding factor), and that I would be moving high-traffic sites onto their own independent servers with clustering to cope with a slashdotting or another severe bandwidth crisis. Given that, what would be the best way to guard against software misconfiguration and daemon failures, and to ensure that even if there is a catastrophic failure of Apache, BIND or MySQL, our hosting fails over to its last good state?
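For the daemon-failure part, this is roughly the kind of check I imagine a failover setup would need: something that notices when Apache, BIND or MySQL stops answering. Below is a minimal Python sketch that only tests whether each daemon accepts a TCP connection. The port numbers are the usual defaults and are an assumption about my setup, and the function names are my own. An open port does not prove the service is healthy, so treat this as a starting point only.

```python
import socket

# Assumed default ports; adjust if the daemons listen elsewhere.
SERVICES = {"apache": 80, "bind": 53, "mysql": 3306}


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def failed_services(host, services=SERVICES, probe=port_is_open):
    """Return the sorted names of services whose port did not accept a connection."""
    return sorted(name for name, port in services.items() if not probe(host, port))
```

Run from a second machine, a non-empty result from failed_services would be the trigger for switching DNS to the standby. If MySQL only listens on localhost or a Unix socket, that entry would have to be checked on the server itself instead.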
Our server crashed because of a software misconfiguration.