Web Hosting Talk

How to find out why server mysteriously down?


ckpeter
03-25-2002, 12:53 PM
Recently my server has been going down mysteriously. About 7-8 hours after it comes up, usually when I am asleep, the server goes down.

There shouldn't be a huge load: I only have a few FTP connections and a (light-weight) virtual server running. Unless someone launched a DoS while I was asleep.

What can I do to find out the cause? I only have 128MB of RAM, and tech support suggested that may be the problem. Is there any way I can have the server record its status when it crashes?

Thanks,

Peter

MotleyFool
03-26-2002, 03:04 AM
It could be a problem with the hardware: either a bad motherboard, bad RAM chips, or overheating.

I don't think low RAM can cause the server to go down... what's your current load?

ckpeter
03-26-2002, 12:23 PM
It is mostly 0s. Even the RAM looks quite free. In fact, yesterday, while I was logged in over SSH, the server just closed the connection and went down.

Is there any way to have the system log a message if it crashes because of low RAM?

Thanks,

Peter

stlouislouis
03-26-2002, 12:38 PM
Hi ckpeter,

Can you give a little more info to folks about your server?

What version of what OS?

What kind of hardware? How old? Did you buy or build the system? What motherboard, cpu, type of RAM and hard drives?

Regarding hard drives, IDE or SCSI? Connected to motherboard or add in card? Experienced any data loss or corruption?

What kind of power do you have going to the box? Do you have the system plugged into anything like a UPS to guarantee steady power? How big is your power supply? How old is it? Flaky power is the cause of lots of weird problems. Along this line, have you had the machine protected from surges and brownouts? Had any electrical storms, brownouts or other power problems in the last few months?

Have you applied all the software updates and patches? How about BIOS or hardware/firmware upgrades? Are you running X-windows on the server?

How long has this behavior been going on?

What was/were the last update(s) you did to the system before this started happening?

Is the server in your home, or a data center?

What kind of network connection? Datacenter, cable/dsl line, what?

What services do you have running?

What does top show?

Any configuration/compile options on, for instance, your web server that might cause you problems with RAM?

Do you have a swap partition? How big?

Do you have a firewall running? Any details?

When it crashes, what if any messages are being displayed?

What kinds of user accounts are set up on the machine -- and what kinds of folks have them? Anyone besides you in the wheel group who can su to root?

I'm just learning *nix; so I'm no expert. However, I do suspect these types of details might tip off someone to what the problem is.

Hope the problem gets resolved soon,

Louis

ckpeter
03-26-2002, 02:13 PM
Louis,

My server is running Red Hat 7.1.

This is a dedicated server in a datacenter. It is a Celeron 800 with 128MB of RAM and a 30GB hard drive. Since I didn't build the machine, I am not sure of the details. There seems to be no data loss, however.

I have applied most of the updates (if not all). I am not running an X server.

This actually started a few days ago. The server runs fine for a few hours (7-11 hours) and then goes down. Usually a reboot will bring it back up.

I should mention that I am running a custom kernel which allows me to create virtual servers. (This may be a problem.)

I have shut off most services in the main server. In a virtual server I am running ftp with a few connections.

Output of Top: (note that the virtual server list is separate and not shown)
---------
38 processes: 37 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  0.1% user,  0.0% system,  0.0% nice, 99.8% idle
Mem:   110480K av,  100788K used,    9692K free,       0K shrd,    3268K buff
Swap:  200772K av,    5976K used,  194796K free                   71728K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1297 root      13   0  1032 1032   836 R     0.1  0.9   0:00 top
    1 root       0   0   524  480   452 S     0.0  0.4   0:20 init
    2 root       9   0     0    0     0 SW    0.0  0.0   0:00 keventd
    3 root      19  19     0    0     0 SWN   0.0  0.0   2:38 ksoftirqd_CPU0
    4 root       9   0     0    0     0 SW    0.0  0.0   0:54 kswapd
    5 root       9   0     0    0     0 SW    0.0  0.0   2:40 bdflush
    6 root       9   0     0    0     0 SW    0.0  0.0   1:58 kupdated
    7 root      -1 -20     0    0     0 SW<   0.0  0.0   0:00 mdrecoveryd
   72 root       9   0     0    0     0 SW    0.0  0.0   0:00 khubd
  561 root       9   0   604  604   504 S     0.0  0.5   0:02 syslogd
  566 root       9   0  1200 1196   464 S     0.0  1.0   0:00 klogd
  580 rpc        9   0   580  576   496 S     0.0  0.5   0:00 portmap
  671 root       9   0   652  652   548 S     0.0  0.5   0:03 automount
  683 root       9   0   876  816   700 S     0.0  0.7   0:00 sshd
  706 root       9   0   924  888   732 S     0.0  0.8   0:00 xinetd
  718 root       0   0   648  648   552 S     0.0  0.5   0:01 crond
  742 daemon     9   0   488  468   412 S     0.0  0.4   0:00 atd
  755 root       9   0   528  508   448 S     0.0  0.4   0:00 rhnsd
  781 named      9   0  3120 2528  1464 S     0.0  2.2   0:00 named
  793 named      9   0  2296 1164   848 S     0.0  1.0   0:00 named
  794 named      9   0  3120 2528  1464 S     0.0  2.2   0:29 named
  795 named      9   0  3120 2528  1464 S     0.0  2.2   0:01 named
  796 named      9   0  3120 2528  1464 S     0.0  2.2   0:15 named
  797 named      9   0  3120 2528  1464 S     0.0  2.2   0:00 named
  798 named      9   0  2296 1164   848 S     0.0  1.0   0:30 named
  802 named      9   0  2296 1164   848 S     0.0  1.0   0:02 named
  803 named      9   0  2296 1164   848 S     0.0  1.0   0:15 named

---------
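
A note on reading the Mem line in the top output: on Linux, the "used" figure includes buffers and page cache, which the kernel can usually give back when applications need memory. Here is a rough sketch of that arithmetic in Python, using the numbers from the output. The function name is made up for illustration, and treating all of buff and cached as reclaimable is a simplifying assumption.

```python
def effective_memory_kb(used, free, buff, cached):
    """Return (used by applications, effectively available) in KB.

    Assumes buffers and page cache are fully reclaimable, which is
    only approximately true in practice.
    """
    app_used = used - buff - cached
    available = free + buff + cached
    return app_used, available


# Figures copied from the Mem and Swap lines of the top output.
app_used, available = effective_memory_kb(
    used=100788, free=9692, buff=3268, cached=71728
)
print("used by applications: %dK" % app_used)    # 25792K
print("effectively available: %dK" % available)  # 84688K
```

By this estimate most of the "used" memory at that moment was cache, which fits the earlier remark that the RAM looked quite free. It says nothing about what memory looked like just before a crash.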

I don't custom compile software. I install it through rpm.

I don't have a firewall running.

Part of the confusion is the fact that my server got hacked a few weeks ago. So I am not sure whether this is a security-related problem, a problem with the custom kernel (the mailing list reported no unusual crashes), or a hardware problem.

What would help most right now is a way to log the system status just before it crashes.
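
One possible way to do this is a small script that appends a snapshot of load and memory to a file every minute and forces it to disk, so the last entries survive a hard crash. The following is only a sketch: it assumes a Linux /proc filesystem and a Python 3 interpreter on the box, and the log path, the interval and all function names are made up for illustration.

```python
import os
import time


def parse_meminfo(text):
    """Parse 'Key: value kB' lines of /proc/meminfo text into a dict of ints."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip lines that are not "Name: <number> ..." (e.g. column headers).
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def snapshot():
    """Build one log line with a timestamp, load averages and free memory."""
    with open("/proc/loadavg") as f:
        load = f.read().split()[:3]
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return "%s load=%s MemFree=%s SwapFree=%s" % (
        stamp, "/".join(load), mem.get("MemFree"), mem.get("SwapFree"))


def run(path="/var/log/status-snapshots.log", interval=60):
    """Append a snapshot every `interval` seconds until the process dies."""
    with open(path, "a") as log:
        while True:
            log.write(snapshot() + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before sleeping
            time.sleep(interval)
```

Calling run() as root (for example from an init script) starts the logging. After the next crash, the tail of the file shows whether load or free memory was trending badly in the minutes beforehand. If the entries look normal right up to the end, that points more toward hardware or a kernel problem than toward running out of RAM.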

Thanks,

Peter

Jedito
03-26-2002, 02:37 PM
Check /var/log/messages and look at what it says just before the server crashes or reboots.
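
As a rough sketch of that check, the Python below pulls out the few lines logged just before each boot. It assumes that syslogd writes a line containing "restart" when it starts up (so each such line marks a boot) and that the log is at /var/log/messages. The function names and the context size are made up for illustration.

```python
from collections import deque


def lines_before_restart(lines, marker="syslogd", keyword="restart", context=5):
    """Return a list of (restart_line, preceding_lines) pairs.

    A line counts as a restart entry when it contains both `marker` and
    `keyword`; `preceding_lines` holds up to `context` lines logged before it.
    """
    recent = deque(maxlen=context)
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if marker in line and keyword in line:
            results.append((line, list(recent)))
        recent.append(line)
    return results


def report(path="/var/log/messages", context=5):
    """Print the lines that were logged just before each syslogd restart."""
    with open(path) as f:
        for restart, before in lines_before_restart(f, context=context):
            print("--- before:", restart)
            for line in before:
                print("   ", line)
```

Running report() as root prints one group per boot. If the last lines before a restart are routine cron or named messages with nothing from the kernel, the machine probably died without getting a chance to log anything, which is itself a hint toward hardware.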

MCHost-Marc
03-26-2002, 03:27 PM
Probably bad RAM ..

zupanm
03-26-2002, 04:54 PM
Random reboots are usually bad hardware. My guess is RAM.

dektong
03-26-2002, 05:01 PM
I am pretty sure your m/b chipset is i815 and you are using the onboard NIC. Some i815 chipset motherboards with a built-in NIC have been reported to cause server lockups at random times. One possible fix is to update the NIC driver; you may want to search for this in google. I think one WHT member, AlaskanWolf, might have had a similar problem that he posted about here in the past.

Good Luck ...

cheers,
:beer:

ckpeter
03-26-2002, 08:05 PM
<edit>tech support notified me that the NIC isn't that, but thanks nonetheless.</edit>

My server is currently down and a remote reboot failed to bring it up. My guess is that someone is working on it.

The server doesn't reboot by itself, it just goes down. I have to remote reboot to bring it up, usually.

I am working with tech support right now. They are still suggesting low RAM as a possible cause. Can low RAM crash/down a server?

Also, is there any program for Linux that would check for bad RAM?

Thanks,

Peter

stlouislouis
03-26-2002, 09:36 PM
Hi ckpeter,


Having your machine in a data center makes it much harder to diagnose and deal with this, I know.

You asked about memory test for Linux. I did a google search on this. I found this link:

http://www.teresaudio.com/memtest86/

There are other links as well from this search:

http://www.google.com/search?hl=en&q=RAM+testing+program+for+LINUX


At arstechnica.com, there are hardware and Linux forums with very good, talented people who freely offer helpful advice to folks. Go to the open forums section of the site. I linked to the Linux forum below for you. I've heard folks speak highly of memtest86 for Windows -- but suspect the Linux version is worth checking out.

Along with the other posters above who mentioned it, I too would suspect bad RAM. However, I can't outright suggest you spend your money on new RAM without having any confirmation that it is indeed bad RAM. Besides, unless you could specify a quality type of RAM for your server, you may get some crummy brand of RAM from whomever you are renting the server from. RAM is sensitive to static; who knows how careful the techs who install it are to safeguard against static electric discharge?

If you do go for more RAM, may I suggest in addition to adding more, you consider swapping out the RAM that's in your machine with new RAM as well.

It could be the reason the tech is advising you to get more RAM is because the way they configure their systems -- possibly unknowingly -- the software needs more RAM to function as it should without crashing over time. You've got to figure they deal with many similarly configured machines; if they notice more RAM in the machines *they set up -- the way they set them up* typically solves the problem, then it might be worthwhile to try more RAM.

Two other things you mention stand out to me. The first is that you got hacked a few weeks ago. What happened? How do you know you were hacked? What damage did the "hack" do? How do you know there isn't a rootkit or some such thing installed on your box? What was done to eliminate any damage the intrusion did? Did they do a clean install? What?

Secondly, I have never had a box in a data center before. So I don't know if it's standard practice or not for a firewall NOT to be operating on a server box the way they have yours set up. They may have a firewall on some sort of "front end" device, I don't know. But I would want my server to have an installed, properly configured and operating firewall.

I can see how one might lock oneself out of one's own box if one were not careful in how one sets up their firewall -- so your provider may frown on their use. But to me, they are essential.

I wish I could be of more help than what has already been mentioned. I feel bad for your current plight. But with the box in a data center, and your having limited access, I'm not sure what to suggest besides running diagnostics and analyzing your logs for clues. It must be VERY frustrating.

I would strongly recommend you go to arstechnica open forum and post your question in the Linux forum here:

http://arstechnica.infopop.net/OpenTopic/page?a=frm&s=50009562&f=96509133

Let us know how things go. I sure do wish you the best. I feel for anyone whose server is down and far away....

Louis

bitserve
03-27-2002, 12:30 AM
My first guess is a bad power supply.

Will you let us know who was right? :)

Good luck, too.

ckpeter
03-27-2002, 12:40 AM
Thank you all for your kind and thorough help.

Louis, thanks for the links. I did a search myself with "check RAM" and came up with nothing. However, the program you suggested requires boot access, which is something a server in a datacenter can't provide. (I can use it to test my home PC, so it's still helpful.)

Tech support has now installed 256 MB of RAM (presumably all new); we will see what happens.

I guess I should be installing a firewall. I am not exactly sure how I got hacked last time. The server seemed to be running slow, and then one day I couldn't log in over ssh. Tech support found that some kind of security software had been installed. I ended up doing a complete reinstall. I am now searching for information on firewalls.

Thanks,

Peter

ckpeter
03-28-2002, 02:41 AM
Just an update for anyone interested: The server seems to have been running fine since the RAM upgrade, although I can't be sure exactly what tech support did. There seems to be more than 256 MB of RAM in my system (360 to be exact). The current uptime is more than 1 day.

I will see if the server can stay up for a few more days to decide if this is solved.

Thanks again for all your kind help,

Peter

bacid
03-29-2002, 01:11 AM
i had a similar problem with an older box of mine.. it turned out to be bios related...

the bios by default was set to "performance" mode which overclocked the ram/pci bus, etc...

this would cause the box to just restart itself randomly or just die and require a forced reboot..


setting it to "stability" mode which used conservative settings fixed the issue..

if you continue to have problems even after the ram upgrade, you might wanna take a look at the bios :)

good luck

ckpeter
03-29-2002, 11:02 AM
Thanks for the suggestion. Currently the box has 2+ days of uptime, so it should have stabilized by now. We will see how it goes for the next few days.

Thanks again for the suggestion,

Peter