jbourke
05-17-2002, 12:20 PM
I built a server myself and I suspect that I have bad RAM. It keeps becoming unreachable. I cycle the power through my APC Masterswitch and it will boot up fine but then go down again after a day or so.
What can I do from the UNIX command line to test out the system and find out where the problem is?
I'm running Redhat 7.2. The system is using an AMD 1800+ MP processor in a Tyan S2460 motherboard.
Jim
Noldar
05-17-2002, 01:42 PM
I've never tried it, but I ran across this once when searching for info on how Linux uses memory.
http://people.redhat.com/dledford/memtest.html
Richard
insiderhosting
05-17-2002, 05:40 PM
Why don't you read the logs to see why the server keeps crashing?
-Steven
jbourke
05-17-2002, 05:47 PM
Originally posted by insiderhosting
Why don't you read the logs to see why the server keeps crashing?
I have but I haven't found anything. The machine apparently stops logging and locks up.
Which log file should I look in for clues?
Jim
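If the box stops logging before it locks up, one rough way to pin down when each lockup happened is to look for long silences in the log. The sketch below assumes classic syslog lines that begin with a timestamp like "May 17 12:20:01" (no year); the function name find_log_gaps, the 30-minute threshold, and the year parameter are all made up for illustration, not anything from a standard tool.

```python
from datetime import datetime, timedelta

def find_log_gaps(lines, min_gap=timedelta(minutes=30), year=2002):
    """Return (line_before_gap, line_after_gap, gap_length) tuples.

    Assumes each line starts with a 15-character syslog timestamp such as
    "May 17 12:20:01". The year is not in the log, so it is passed in.
    """
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        line = line.rstrip("\n")
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if prev_time is not None and stamp - prev_time >= min_gap:
            gaps.append((prev_line, line, stamp - prev_time))
        prev_time, prev_line = stamp, line
    return gaps
```

You could feed it an open file, e.g. open("/var/log/messages"), and print each tuple. The last line before each gap is the place to start reading backwards. A quiet server can produce gaps on its own, so treat the results as hints only.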
insiderhosting
05-17-2002, 05:53 PM
Well, I would start with /var/log/messages and look for anything in there. More specifically, look for any kernel errors.
-Steven
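For anyone who wants to automate the "look for kernel errors" step, here is a minimal sketch. It assumes kernel messages in the log carry a "kernel:" tag, and the pattern list is my own guess at typical trouble signs, not a complete or official list; the names SUSPECT and kernel_errors are hypothetical.

```python
import re

# Assumed, non-exhaustive patterns; adjust to what your kernel actually logs.
SUSPECT = re.compile(
    r"oops|panic|machine check|out of memory|unable to handle kernel",
    re.IGNORECASE,
)

def kernel_errors(lines):
    """Yield lines tagged 'kernel:' that match one of the suspect patterns."""
    for line in lines:
        if "kernel:" in line and SUSPECT.search(line):
            yield line.rstrip("\n")
```

Pass it an open log file, e.g. open("/var/log/messages"), and print what comes back. An empty result does not clear the hardware: a hard lockup may never get a chance to write anything.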
Skeptical
05-18-2002, 05:07 AM
I've had to deal with such crappy uptimes on 3 of my servers. In the end I gave up and replaced the motherboard and CPU and voilà, the problem went away. Looks like it was a bad batch of mb/cpu.
Softicom.NET
05-18-2002, 07:38 AM
What kind of server do you have? (Specs)
jbourke
05-21-2002, 10:41 AM
Sorry, I don't have all the specs handy and I'm 1700 miles away from the server this week. It's a Tyan S2460 MB and an Athlon 1800+ MP CPU.
I am stumped. I'm going to swap out the motherboard, CPU, and RAM this weekend.
Jim
Interesting story.
We colocated a couple of servers for my place of employment - offsite dns, backup mail, and data storage. The colocation facility is located in our ISP's main NOC facility in a larger town about an hour away. We built the machines - stress tested them in house, and then finally set everything up and let them run for about another week just to make sure there were no problems.
We shipped them off to the isp to have them racked - the noc staff set them up on the racks and everything seemed to be good... until they started crashing. About the 2nd day there they all locked up and continued to do so about every 2 days. After many many calls to the noc and swapped parts I finally went up there myself to find out what the problem was.
The minute I got in there I knew *exactly* what was wrong - it had to have been 110 F. Although it was late fall, they had no a/c on whatsoever. For the interim I went to a hardware store up there and bought two box fans and stuck them in the rack (we had leased an entire rack). We didn't have any troubles after that, but we did move the servers somewhere else quickly.
Long story short... have you been to the data center? Is it really hot in there? Besides OS problems (especially if the logs aren't really showing anything), maybe the server's hot, maybe the rack isn't ventilated correctly, or maybe your CPU fan isn't working properly?
batcavenet
05-21-2002, 12:35 PM
well - in most situations I have seen personally - this is a bad motherboard or cpu - most likely motherboard.
JDT
Starhost
05-21-2002, 01:16 PM
Double check your memory. Bad memory is one of the most common problems you see.
hey.. can't a guy tell a story! :D :D :D