IIS problem?? [expletive deleted] Windows server crashing repeatedly ...
After 17 months of ok service (crashing 2-3 times per month), my Windows 2000 server has started crashing 4+ times per day recently.
My server logs around a crash event look as follows:
2004-06-01 18:11:20 188.8.131.52 80 GET /images/goto/gospiral.gif - 200 0 Mozilla/4.0+(compatible;+MSIE+5.23;+Mac_PowerPC) ASPSESSIONIDSARRCAAD=BKALCBFCJODCPBFDIKMGMHLM http://www.worldofboxes.com/animals/animal-prints.htm
2004-06-01 18:11:20 184.108.40.206 80 GET /images/goto/gocelt.gif - 200 585 Mozilla/4.0+(compatible;+MSIE+5.23;+Mac_PowerPC) ASPSESSIONIDSARRCAAD=BKALCBFCJODCPBFDIKMGMHLM http://www.worldofboxes.com/animals/animal-prints.htm
2004-06-01 18:11:2#Software: Microsoft Internet Information Services 5.0
#Date: 2004-06-01 18:25:02
#Fields: date time c-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs(User-Agent) cs(Cookie) cs(Referer)
2004-06-01 18:25:02 220.127.116.11 80 GET /index.htm - 200 0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) - -
2004-06-01 18:25:06 18.104.22.168 80 GET /mainstyle.css - 200 3668 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) ASPSESSIONIDQSTCSDBB=CJBPPEGCHLAILAIADBBJPHGF http://www.worldofboxes.com
2004-06-01 18:25:12 22.214.171.124 80 GET /common/ha.asp - 200 0 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) ASPSESSIONIDQSTCSDBB=CJBPPEGCHLAILAIADBBJPHGF http://www.worldofboxes.com
In this case, the system event log recorded a crash at 18:19:51, about eight minutes after the last, truncated entry in the server log. This is typical -- the server logs appear to stop mid-line anywhere from two to 45 minutes before an operating system crash, with eight minutes being the most common interval. (The site is busy by day, so those gaps are not due to lack of activity.)
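For anyone who wants to measure these gaps across many crashes rather than by eye, here is a minimal Python sketch. It assumes the log format shown above (date and time as the first two space-separated fields) and that it is given only the lines written before the restart. The function names and the shortened sample lines are hypothetical illustrations, not part of any IIS tool.

```python
import re
from datetime import datetime

# A complete entry starts with "YYYY-MM-DD HH:MM:SS " (note the trailing space).
# Lines cut off mid-timestamp, such as "2004-06-01 18:11:2#Software: ...", do not match.
ENTRY_PREFIX = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")

def last_complete_entry_time(log_lines):
    """Return the timestamp of the last complete log entry, or None if there is none."""
    last_seen = None
    for line in log_lines:
        if line.startswith("#"):
            continue  # directive lines such as #Software, #Date, #Fields
        match = ENTRY_PREFIX.match(line)
        if match is None:
            continue  # truncated or malformed entry
        last_seen = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return last_seen

def gap_before_crash(log_lines, crash_time):
    """Return the time between the last complete entry and the crash, or None."""
    last_seen = last_complete_entry_time(log_lines)
    if last_seen is None:
        return None
    return crash_time - last_seen

# Shortened copies of the entries above; the crash time is the one from the event log.
sample_lines = [
    "2004-06-01 18:11:20 188.8.131.52 80 GET /images/goto/gospiral.gif - 200 0",
    "2004-06-01 18:11:20 184.108.40.206 80 GET /images/goto/gocelt.gif - 200 585",
    "2004-06-01 18:11:2#Software: Microsoft Internet Information Services 5.0",
]
print(gap_before_crash(sample_lines, datetime(2004, 6, 1, 18, 19, 51)))  # 0:08:31
```

Running this over the log lines preceding each crash would show whether the eight-minute gap really is the most common one.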
I can think of two explanations:
1) On restarting after the crash, IIS chews off a big chunk of the current log before it starts writing again, creating the appearance of a gap.
2) IIS has stopped logging several minutes before the system crash.
In case (2), I think IIS is still serving pages even though it is not logging the activity, because the web service is monitored every 20 minutes by Alertra, and I have not received a single alert through dozens of crashes. This is possible considering that the site must be down 2.5 minutes starting from Alertra's first try before an alert is sent, and the reboots take two to five minutes, so perhaps by chance no crash has triggered an alert. But if IIS had stopped when the logs stopped, many alerts would have been inevitable. (Or else Alertra is a scam after all, but I don't think so.)
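To put rough numbers on that reasoning, here is a small Python sketch. It rests on assumptions of mine, not on anything Alertra documents: checks arrive every 20 minutes at a random phase relative to the crash, an alert fires only if a check lands inside the outage with at least 2.5 minutes of downtime still remaining, and the outage is one unbroken stretch. The function names and the figure of 24 crashes are hypothetical.

```python
def alert_probability(outage_min, check_interval_min=20.0, confirm_window_min=2.5):
    """Chance that a single outage of the given length triggers an alert.

    A check must land early enough in the outage that the site is still
    down confirm_window_min minutes later.
    """
    detectable_min = max(0.0, outage_min - confirm_window_min)
    return min(1.0, detectable_min / check_interval_min)

def probability_of_no_alerts(outage_min, crashes, **kwargs):
    """Chance that none of several independent, equal-length outages triggers an alert."""
    return (1.0 - alert_probability(outage_min, **kwargs)) ** crashes

# Reboot-only outages of two to five minutes: alerts are unlikely.
print(alert_probability(2.0))   # 0.0
print(alert_probability(5.0))   # 0.125

# If IIS had also been down for the eight minutes before a crash,
# the outage would be at least ten minutes long.
print(alert_probability(10.0))  # 0.375
print(probability_of_no_alerts(10.0, crashes=24) < 0.001)  # True
```

Under these assumptions a reboot alone rarely triggers an alert, while ten-minute outages repeated a couple of dozen times would almost certainly have triggered at least one.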
Assuming case (2), I can speculate on two further possible conclusions:
2a.) IIS is starting to get unstable at the time it stops logging, and it brings the whole system down two to 45 minutes later.
2b.) Whatever is crashing the system affects IIS first, causing it to stop logging.
Has anyone ever seen or heard of anything like this? Does IIS look like the problem, or a victim?
Removing and reinstalling IIS, and restoring the metabase from a backup, did not help even slightly.
As for the crashes themselves, the system stop message is fairly generic and unhelpful:
0x000000d1 (0x00000000, 0x00000002, 0x00000000, 0x00000000)
The exception address is always the same and seems to fall in the range of ntoskrnl.exe.
Also, no new hardware or software was added to the system for a very long time.
Regarding software, you are right of course that Windows is frequently patched for security -- though nothing was applied for three weeks prior to the onset of the crashing problem. I also upgraded to SP4 after the problem started in hopes of a cure, with no luck.
Unfortunately, the event logs contain almost nothing except the reboot sequence over and over. The first crash that seemed to mark the beginning of the problem was the first event in five days. There is no common red flag for a hardware problem (though of course that is possible). Twice during the six-week frequent crash period, there was a series of inscrutable informational (not error or warning) messages regarding the NIC, so maybe that is a suspect.
The only other events that are routinely noted are the expiration of a self-issued SSL certificate that I use for an admin application, which I was too lazy to reissue, and the bi-monthly charging of a battery.
The most likely scenario is that you have a script or component on the machine somewhere which is leaking memory or not releasing resources after use.
Over time the problem builds until IIS falls over showing the exact symptoms you've described.
You've not said whether this is a private server or hosting public sites. If the latter, then most likely a customer has uploaded a script that kills IIS after a few runs (usually caused by calling recordsets from a very large or very badly designed Access database). If the former, look for any changes you made yourself before this started.
How often do you restart IIS? On Windows 2000 you should really restart it every 24 hours or so, just to ensure it gets a good clear-out. Check out the iis5recycle tool:
This server is rather over-powered for its current workload, with 1GB of memory, of which 80% is always free. I checked for a memory leak, recording memory stats every 20 minutes with the system monitor, but the amount of free memory does not change much.
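A simple way to turn those 20-minute readings into a yes-or-no answer is to fit a straight line through them and look at the slope. The Python sketch below does that with an ordinary least-squares fit. The function name and the sample readings are made up for illustration; they are not my actual measurements.

```python
def free_memory_slope(samples):
    """Least-squares slope of free memory over time, in MB per minute.

    samples is a list of (minutes_elapsed, free_mb) pairs.
    A clearly negative slope that persists would suggest a leak.
    """
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples")
    mean_time = sum(t for t, _ in samples) / count
    mean_free = sum(m for _, m in samples) / count
    time_variance = sum((t - mean_time) ** 2 for t, _ in samples)
    if time_variance == 0:
        raise ValueError("samples must cover more than one point in time")
    covariance = sum((t - mean_time) * (m - mean_free) for t, m in samples)
    return covariance / time_variance

# Hypothetical readings taken every 20 minutes.
steady = [(0, 800), (20, 800), (40, 800)]
leaking = [(0, 800), (20, 790), (40, 780)]
print(free_memory_slope(steady))   # 0.0
print(free_memory_slope(leaking))  # -0.5
```

A slope near zero over several hours, as in the first case, is what "does not change much" looks like in numbers.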
Nor were the everyday web applications changed in the six months leading up to the problem.
Is it possible IIS is running out of something other than memory?
I spoke with Dell, the manufacturer of this formerly happy machine, and they provided a tool called DSET which hunts for all possible evidence of hardware problems. It found nothing. Of course, it's in their interest to not find hardware problems, since the server is under warranty.
Windows is nice when it works, but very frustrating when it doesn't.
This is possible considering that the site must be down 2.5 minutes starting from Alertra's first try before an alert is sent, and the reboots take two to five minutes, so perhaps by chance no crash has triggered an alert.
Alertra allows you to configure the retry interval for the 3 attempts it makes to contact your site. The default is 30 seconds, but you can set it as low as 3. Look for "Time to wait before retry after error" on the "Update Device" screen.
In the end, IIS had nothing to do with it. On my system at least, IIS does chew off a chunk of the existing log when it starts writing again after a crash. So the gaps I noticed in the logs were an illusion.
The problem turned out to be Kerio Personal Firewall v2.1x, which continues to crash the system six times per day even after I reinstalled it. Other people have noted this problem on widely scattered message boards, and it seems to be unfixable (even if the product were still supported, which it is not).
Strange, since it worked for more than a year before going bad, and I use Kerio on my home machine without problems.