
server freezing (almost nightly)
AlaskanWolf 01-26-2002, 10:47 PM Server running cPanel
One of my servers is constantly freezing on a nightly basis. Day and night, loads are anywhere from 0.5 to 1.0 during non-peak times and 5-6 during peak times.
The high loads at night are from the backup script, which I have cut back from every night to every other night.
We have been running some scripts that pull hit counts (page views, script calls, hits) from the domlogs and have found some customers that might have been causing some issues. Going off our stats, I figured any site getting over 3,000 page views in 24 hours would be deemed high resource and moved to another server.
Does anyone else have any other tricks up their sleeves for finding which scripts or processes are responsible, or for what I can do to Apache (Apache always seems to use up quite a few resources)?
How about suexec? I installed it last week, but we got a flood of support requests saying it broke customers' scripts and FrontPage, so rather than mess with a few hundred customers, we removed suexec and left it for another day.
Exim seems to be a high hitter in top as well.
Anyway, regards, and I hope someone can help. I have been having many, many sleepless nights having to call the NOC to reboot the server.
Matt Lightner 01-26-2002, 11:07 PM If it's "freezing" nightly, then there is a very good chance that the problem is related to hardware. If you think that it always happens around the time that your backups are run, there could be some kind of conflict with your backup drive that is causing the lockups. You should have one of the on-site technicians plug in a monitor next time it happens and read what it says on the console. If there are unexpected errors there, send those to your host and ask them to take a look at the server for you.
alchiba 01-26-2002, 11:22 PM Hmm. . . but if the backup has been cut back to every other night yet the lock-ups occur every night, then I'd tend to rule out the drive as a prime suspect. As a quick-and-dirty test, you could cp a massive tarball back and forth between the main and the backup drive and see what happens. Are these SCSI or IDE? If IDE, is the backup a slave or on its own channel?
Exim has chewed up one of my boxes on several occasions, but it's never locked it up. What happens is it gets constipated with bad mail and it drags the server, but that's easily cleared up and probably a routine problem anyway.
One suggestion I'd make is to write a script that runs top and saves a snapshot of it to a file. Run it as a cron job every couple of minutes. The last snapshot before a freeze should give you an idea of what was running.
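A minimal sketch of that snapshot idea, assuming a procps-style `top` that accepts `-b -n 1` (batch mode, one iteration). The function name `snapshot_to` and the log path passed on the command line are placeholders for illustration, not anything from this thread.

```shell
#!/bin/sh
# Sketch: append a timestamped snapshot of a command's output to a log file.

snapshot_to() {
    # $1: log file to append to; remaining args: the command to capture
    log="$1"
    shift
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        "$@"
        echo
    } >> "$log" 2>&1
}

# When called with a log path, capture one batch-mode iteration of top.
if [ "$#" -ge 1 ]; then
    snapshot_to "$1" top -b -n 1
fi
```

From cron, something like `*/2 * * * * /path/to/top-snapshot.sh /path/to/logs/top-snapshots.log` would take a snapshot every two minutes; both paths are hypothetical. Since the file only ever grows, the tail of the log after a reboot shows the last state captured before the freeze.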
AlaskanWolf 01-26-2002, 11:41 PM I meant to say, it's freezing "pretty much" nightly, and I ruled out hardware since, after finding some of those high hitters, the server has remained online during the night. For the past 2-3 days, after removing some customers, it has remained online, but just so I could sleep at night, I rebooted the server for the heck of it before going to bed.
When we were able to get into it to peek, the loads were at times as high as 400+.
PS: I only set the backup to run every other night as of last night, so I am waiting to see what it will do tonight. I will be sure not to reboot it, etc.
xxxxxxxxxxxxxxxxxxxxxx
Matt,
We were running a script that was taking top every X minutes and couldn't find anything out of the ordinary; a few times the loads were under 1.0.
I gotta say, never get Amax servers again. I've been learning my lessons with these crappy servers from day one.
I will reinstall the top script and get the NOC to hook up a monitor next time before they reboot, to see if anything is on the screen.
Checking the messages log during these freezes, there was nothing in it that was of any help.
WOOOO HOOOOO 800th message
AlaskanWolf 01-26-2002, 11:47 PM PPS: unrelated, but that other server I had posted about, the one crashing at any time: it was the mobo (Intel). We ended up replacing it with another type, and the server has been running great since.
Seems a lot of people run the Intel NIC (eepro100), and it has a very large share of sporadic outages.
dektong 01-26-2002, 11:53 PM Originally posted by AlaskanWolf
since after finding some of those high hitters, the server has remained online during the night.
One of my servers has daily average of around 150000 page views. WebPanel (i.e. CPanel 3) is installed on the server. The server has never locked up.
Now ... I am guessing ... some onboard NICs have been known to cause lockups. Could that be the case? What m/b and what OS version are you running?
When we would be able to get into it to peek, at times the loads were even as high as 400+
Can you find a pattern in when this happens? You can create a cron job on a one-minute interval that just appends the output of "uptime" to a file, to see at what time you get this problem. If there is a pattern, you can log into the server before it goes under this load and try to find out what's causing it.
WOOOO HOOOOO 800th message
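A rough sketch of that one-minute uptime log, plus a helper for finding the pattern afterwards. It assumes Linux-style `uptime` output containing the text `load average: `; the function names `log_uptime` and `busy_minutes` and any log path are made up for illustration.

```shell
#!/bin/sh
# Sketch: log uptime once a minute, then list the minutes where load was high.

log_uptime() {
    # $1: log file; each line is "<date> <uptime output>"
    printf '%s %s\n' "$(date '+%Y-%m-%d')" "$(uptime)" >> "$1"
}

busy_minutes() {
    # $1: log file, $2: 1-minute load threshold
    # Splits each line at "load average: " and compares the first figure.
    awk -F'load average: ' -v limit="$2" '
        NF > 1 { split($2, avg, ", "); if (avg[1] + 0 > limit + 0) print }
    ' "$1"
}

# When called with a log path (e.g. from cron), append one sample.
if [ "$#" -ge 1 ]; then
    log_uptime "$1"
fi
```

Run from a `* * * * *` cron entry with a log path as its argument, this adds one line per minute. Calling `busy_minutes` on that log with a threshold such as 5 prints only the samples above it, which makes a recurring time of day easy to spot.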
Congrats! ;)
cheers,
:beer:
allan 01-27-2002, 12:05 AM Originally posted by AlaskanWolf
Seems a lot of people run the Intel NIC (eepro100), and it has a very large share of sporadic outages.
I have a server from Penguin Computers that was having problems with an eepro100 NIC. We replaced the Red Hat drivers with the eepro drivers from Scyld, and the performance was much better:
http://www.scyld.com/network/
AlaskanWolf 01-27-2002, 03:23 AM Looks like I have the eepro100 NIC, but no clue how to update the drivers (tried on the one where we changed out the mobo, and failed miserably).
From our server info page: Intel Corp. 82557 [Ethernet Pro 100]
I think this may be causing those "no load" freezes, since it has the same ingredients as that Intel mobo problem: no load, no transfers, sporadic.
Found the driver. Looks like it will solve my problems (part of them, anyway), but no clue how to install it or compile it into the kernel.
Does anyone have instructions that could help me?
driver: ftp://ftp.scyld.com/pub/network/eepro100.c
Gernot 01-27-2002, 09:40 AM Hi,
Forget about eepro100. This driver is pretty faulty and still causes sporadic outages as you have noticed.
We've been running quite a lot of servers with onboard Intel NICs without any problems, but we use the e100 driver that's from Intel themselves. I suggest you switch over to this driver.
A few easy instructions:
1.) Download this file (ftp://aiedownload.intel.com/df-support/2896/eng/e100-1.6.29.tar.gz) to your server.
2.) Untar it.
3.) Go to the directory e100-1.6.29/src
4.) Type: make install (if you are running Red Hat, you need the proper kernel sources installed; if you have a custom kernel, you still need its dependency files, meaning you must not have done a 'make clean', or this will fail)
5.) Edit /etc/modules.conf and change all references to eepro100 to e100
6.) Reboot
This should solve your problems if it's related to the NIC.
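Step 5 is the one most easily mistyped by hand, so here is a small sketch of it. It assumes the driver is referenced in the file by the literal name `eepro100`; the function name `switch_nic_driver` and the `.bak` backup suffix are illustrative choices only.

```shell
#!/bin/sh
# Sketch of step 5: point module aliases at e100 instead of eepro100.

switch_nic_driver() {
    # $1: path to the modules configuration file
    conf="$1"
    cp -p "$conf" "$conf.bak" || return 1          # keep a backup first
    sed 's/eepro100/e100/g' "$conf.bak" > "$conf"  # rewrite every reference
}

# Run only when a file is named explicitly, e.g. /etc/modules.conf
if [ "$#" -ge 1 ]; then
    switch_nic_driver "$1" && grep -n 'e100' "$1"
fi
```

The backup copy makes it easy to put the old driver back if the box does not come up on the network after the reboot in step 6.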
Thanks,
Gernot
magnafix 01-27-2002, 12:11 PM An addition to the 'uptime' cron script suggestion --
Our monitoring system (home grown) checks load by cat'ing /proc/loadavg and if load is over 3 (or whatever we specify), it beeps us and dumps the output of '/usr/bin/top -b -n 1' to a file for later review.
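A shell sketch of the same check, for anyone who wants it without a full monitoring system. It assumes the first field of `/proc/loadavg` is the 1-minute load average and that `top` accepts `-b -n 1`; the name `load_exceeds` and the `TOP_LOG` default are hypothetical.

```shell
#!/bin/sh
# Sketch: dump top only when the 1-minute load average is above a threshold.

load_exceeds() {
    # $1: threshold, $2: loadavg file (defaults to /proc/loadavg)
    # The first field of the file is the 1-minute load average.
    awk -v limit="$1" '
        NR == 1 { high = ($1 + 0 > limit + 0) }
        END     { exit !high }
    ' "${2:-/proc/loadavg}"
}

# TOP_LOG is a hypothetical location for the captured output.
TOP_LOG="${TOP_LOG:-/tmp/top-highload.log}"

# When called with a threshold (e.g. "3"), check the load and capture if needed.
if [ "$#" -ge 1 ] && load_exceeds "$1"; then
    { date; top -b -n 1; } >> "$TOP_LOG" 2>&1
fi
```

The comparison is done in awk because plain `sh` arithmetic is integer-only and load averages are fractional. Reading `/proc/loadavg` is cheap, so the check can run often and `top` only runs when there is something worth recording.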
AlaskanWolf 01-27-2002, 07:12 PM Here's what I get when I run make install:
root@wolf [/home/admin/e100-1.6.29/src]# make install
gcc -Wall -DLINUX -D__KERNEL__ -DMODULE -DEXPORT_SYMTAB -D__NO_VERSION__ -O2 -pipe -I/lib/modules/2.4.14/build/include -I. -Wstrict-prototypes -fomit-frame-pointer -DMODVERSIONS -include /lib/modules/2.4.14/build/include/linux/modversions.h -DIANS -DIANS_BASE_ADAPTER_TEAMING -DIANS_BASE_VLAN_TAGGING -DIANS_BASE_VLAN_ID -DIDIAG_PRO_SUPPORT -DE100_ETHTOOL_IOCTL -c -o e100_main.o e100_main.c
In file included from /lib/modules/2.4.14/build/include/linux/fcntl.h:4,
from /lib/modules/2.4.14/build/include/linux/fs.h:600,
from /lib/modules/2.4.14/build/include/linux/capability.h:17,
from /lib/modules/2.4.14/build/include/linux/binfmts.h:5,
from /lib/modules/2.4.14/build/include/linux/sched.h:9,
from /lib/modules/2.4.14/build/include/linux/mm.h:4,
from e100.h:76,
from e100_main.c:104:
/lib/modules/2.4.14/build/include/asm/fcntl.h:75: Internal error: Segmentation fault.
Please submit a full bug report.
See <URL:http://bugzilla.redhat.com/bugzilla/> for instructions.
make: *** [e100_main.o] Error 2
Suggestions?
Gernot 01-27-2002, 07:25 PM Uhm, you aren't by chance using the old, obsolete gcc package from Red Hat 7.1? I remember that it had this kind of problem with segmentation faults.
Did you upgrade it to at least 2.96-85? If not, the upgrade can be downloaded from www.redhat.com.
Gernot
AlaskanWolf 01-27-2002, 08:56 PM Installed successfully and in the process of rebooting.
Thanks!
magnafix - is it possible to ask a favor and have you send me your top script?
We had a script running every 5 minutes that ran top, and of course the load skyrocketed. It would be nice to have it run only when the load is at XX instead of every 5 minutes :)
magnafix 01-28-2002, 12:33 PM Here's a fragment of the script we run every two minutes on numerous machines.
#!/usr/bin/perl
use strict;
use warnings;

# Short hostname, used to name the per-machine log files
my $machine = `/bin/hostname -s`;
chomp($machine);
my $logfile = "/path/to/logs/loadtest.$machine";

# The first field of /proc/loadavg is the 1-minute load average
my $loadavg = `/bin/cat /proc/loadavg`;
my @loadarr = split ' ', $loadavg;
my $nowload = $loadarr[0];
my $msg = "$machine load: $nowload\n";

open(LOG, ">>$logfile") or die("Cannot open logfile!");
# scalar(localtime) yields a timestamp string, so the timestamp and $msg are printed together
print LOG scalar(localtime) . " $msg";
if ($nowload > 4)
{
    # maybe you want to send mail/beep to sysadmin now
    system("/bin/date >> /path/to/logs/top.$machine");
    system("/usr/bin/top -b -n 1 >> /path/to/logs/top.$machine");
    # if ($nowload > 8)
    # {
    #     Maybe you want something more significant to happen
    #     when load is 8+, like restarting apache or something
    # }
}
close(LOG);
exit;