Thread: Debugging random server reboot.
-
07-07-2009, 03:00 PM #1Newbie
- Join Date
- Jul 2009
- Posts
- 21
Debugging random server reboot.
I was searching Google when I found this thread => webhostingtalk.com/showthread.php?t=873989 and decided to sign up and post my issue.
Over the past few weeks I've noticed that my server keeps rebooting at random.
Code:
# last reboot
reboot system boot 2.6.18-128.1.14. Tue Jul 7 11:21 (03:27)
reboot system boot 2.6.18-128.1.14. Mon Jul 6 18:58 (19:50)
reboot system boot 2.6.18-128.1.14. Mon Jul 6 15:04 (23:44)
reboot system boot 2.6.18-128.1.14. Mon Jul 6 10:54 (1+03:54)
reboot system boot 2.6.18-128.1.14. Mon Jul 6 03:33 (1+11:16)
reboot system boot 2.6.18-128.1.14. Sun Jul 5 14:48 (2+00:01)
reboot system boot 2.6.18-128.1.14. Sat Jul 4 15:59 (2+22:49)
reboot system boot 2.6.18-128.1.14. Sat Jul 4 05:58 (3+08:51)
reboot system boot 2.6.18-128.1.14. Fri Jul 3 08:12 (4+06:36)
reboot system boot 2.6.18-128.1.14. Thu Jul 2 06:52 (5+07:56)
reboot system boot 2.6.18-128.1.14. Thu Jul 2 04:12 (5+10:36)
reboot system boot 2.6.18-128.1.14. Wed Jul 1 08:42 (6+06:07)
reboot system boot 2.6.18-128.1.14. Wed Jul 1 06:36 (6+08:12)
reboot system boot 2.6.18-128.1.14. Mon Jun 29 05:43 (8+09:05)
reboot system boot 2.6.18-128.1.14. Sun Jun 28 21:39 (8+17:09)
reboot system boot 2.6.18-128.1.14. Sat Jun 27 10:33 (10+04:15)
reboot system boot 2.6.27.10-grsec Thu Jun 25 14:07 (1+20:23)
reboot system boot 2.6.18-128.el5PA Thu Jun 25 09:47 (04:17)
reboot system boot 2.6.18-128.el5PA Thu Jun 25 05:26 (08:38)
reboot system boot 2.6.18-128.el5PA Thu Jun 25 04:46 (09:19)
Is there a way I can monitor the server and find out what might be causing this? Would removing the halt and shutdown binaries help?
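One way to see whether the reboots cluster on particular days is to tally the `last reboot` output. This is only a minimal sketch: `reboots_per_day` is a hypothetical helper name, and it assumes the column layout shown in the listing (field 6 is the month, field 7 the day of the month).

```shell
# Count reboots per calendar day from `last reboot` output read on stdin.
# Assumes the layout in the listing: $1="reboot", $6=month, $7=day of month.
# Days are printed in the order they first appear (newest first for `last`).
reboots_per_day() {
    awk '$1 == "reboot" {
             key = $6 " " $7
             if (!(key in n)) order[++k] = key
             n[key]++
         }
         END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }'
}
```

Usage would be `last reboot | reboots_per_day`. The trailing "wtmp begins" line that `last` prints is skipped because its first field is not `reboot`.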
-
07-07-2009, 03:52 PM #2Junior Guru Wannabe
- Join Date
- Aug 2007
- Location
- Brighton, UK
- Posts
- 66
I wouldn't recommend removing the shutdown command. Check your syslog immediately prior to each reboot and see whether the system logged anything that might give you a clue as to the cause. If you're unsure what the syslog is telling you, post some extracts here for us to look at.
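As a rough sketch of that check: on a CentOS box the syslog is usually /var/log/messages, and each boot leaves a "syslogd ...: restart" line, so the lines just before that marker are the last things logged before the machine went down. `lines_before_restart` is a hypothetical helper name, and the default path and line count are assumptions.

```shell
# Print the N lines logged before each syslogd restart marker, plus the
# marker itself. grep separates non-adjacent groups with a "--" line.
# Usage: lines_before_restart [logfile] [count]
lines_before_restart() {
    logfile=${1:-/var/log/messages}
    count=${2:-20}
    grep -B "$count" 'syslogd .*: restart' "$logfile"
}
```

If nothing suspicious shows up before the marker, the machine most likely went down without getting a chance to write to disk.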
-
07-07-2009, 04:19 PM #3Web Hosting Guru
- Join Date
- May 2008
- Posts
- 340
You can install various tools. One such is sysstat (sar), which records server load, swap, disk I/O, etc., so you can check the metrics for any earlier point in time. This can tell you whether the server is going down due to a load spike.
Also, what do dmesg and boot.log say?
-
07-07-2009, 04:34 PM #4Newbie
- Join Date
- Jul 2009
- Posts
- 21
/var/log/messages: just shows "syslogd 1.4.1: restart" nothing before to indicate a problem.
/var/log/boot.log: empty
/var/log/dmesg: pastebin.com/m2a3ed9e8
How do I set up/configure sysstat (sar)? (I've never heard of it.)
-
07-07-2009, 04:36 PM #5Web Hosting Guru
- Join Date
- May 2008
- Posts
- 340
You can set up sar by installing the sysstat package via yum if you're using CentOS (or via apt-get on Debian/Ubuntu),
or compile it from source by downloading it from,
Twitter : http://twitter.com/eth1networks
Contact Us : support[at]eth1.in
-
07-07-2009, 04:48 PM #6Newbie
- Join Date
- Jul 2009
- Posts
- 21
So just "yum install sysstat"? Is there anything I need to configure? And what should I look for in the logs?
-
07-07-2009, 04:49 PM #7Web Hosting Guru
- Join Date
- May 2008
- Posts
- 340
Yes, once installed you can check all statistics using the command,
sar -A
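Since the crashes happened on earlier days too, it can help to read a previous day's data file instead of today's. A small sketch, assuming the CentOS default location /var/log/sa/saDD (DD is the day of the month); `sa_file_for_day` is a hypothetical helper name, and how many days are kept depends on the sysstat configuration.

```shell
# Print the path of the sysstat data file for a given day of the month.
# A leading zero is stripped first so that "08" is not read as octal.
sa_file_for_day() {
    day=${1#0}
    printf '/var/log/sa/sa%02d\n' "$day"
}
```

For example, `sar -q -f "$(sa_file_for_day 6)"` would show the load averages recorded on the 6th, if that file still exists.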
-
07-07-2009, 04:53 PM #8Web Hosting Master
- Join Date
- Mar 2009
- Location
- Austin, TX
- Posts
- 935
Are your physical hardware components OK? E.g., could the RAM be bad?
SysAdmin.xyz
Having servers with customer data on them without proper monitoring is like having a one-night stand without using protection - eventually, there will be an 'oh s**t!' moment.
-
07-07-2009, 05:06 PM #9Newbie
- Join Date
- Jul 2009
- Posts
- 21
Just noticed it's already been installed.
Anyway, here is my log (sar -A):
errorlogs.netii.net/sar.html
I'm just not sure what I'm trying to find.
-
07-07-2009, 05:26 PM #10Web Hosting Guru
- Join Date
- May 2008
- Posts
- 340
Server loads look stable.
11:30:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
11:40:01 AM 2 252 1.89 1.96 1.99
11:50:01 AM 2 252 1.61 1.94 1.95
12:00:01 PM 2 262 1.68 1.94 1.91
12:10:01 PM 5 256 1.44 1.93 2.03
12:20:01 PM 7 256 3.33 2.44 2.14
12:30:01 PM 6 263 1.40 1.88 2.03
12:40:01 PM 6 257 4.17 2.53 2.11
12:50:01 PM 2 248 1.37 1.45 1.70
01:00:01 PM 10 262 2.54 1.97 1.77
01:10:01 PM 5 268 2.20 1.98 1.90
01:20:01 PM 5 258 1.72 2.11 2.04
01:30:01 PM 1 251 1.70 1.79 1.92
01:40:01 PM 4 259 2.16 1.61 1.68
01:50:01 PM 5 272 3.42 2.84 2.39
02:00:01 PM 8 257 0.84 1.29 1.80
02:10:01 PM 4 253 0.75 0.82 1.29
02:20:01 PM 4 252 0.82 1.03 1.17
02:30:01 PM 2 260 1.74 2.17 1.65
02:40:01 PM 4 249 1.65 1.44 1.48
02:50:01 PM 6 263 2.16 2.25 1.88
03:00:01 PM 2 263 2.79 2.36 2.12
03:10:01 PM 4 256 1.52 2.05 2.09
03:20:01 PM 1 255 1.08 1.29 1.64
03:30:01 PM 5 260 0.37 0.74 1.19
03:40:01 PM 1 258 0.47 0.64 0.92
03:50:01 PM 0 252 2.95 2.21 1.54
04:00:01 PM 5 268 0.79 1.27 1.35
04:10:01 PM 2 255 0.54 0.85 1.10
04:20:01 PM 4 266 1.57 1.26 1.15
04:30:01 PM 0 249 1.22 1.04 1.08
04:40:01 PM 4 263 3.42 1.60 1.22
04:50:01 PM 4 254 1.03 1.08 1.16
Average: 4 258 1.76 1.68 1.67
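To pick out the busier samples from a long listing like this, a filter along these lines may help. It is a sketch that assumes the 12-hour `sar -q` layout shown above (field 2 is AM/PM, field 5 is ldavg-1); `load_spikes` is a hypothetical helper name and the default threshold of 3 is arbitrary.

```shell
# Print time and 1-minute load average for every `sar -q` sample on stdin
# whose ldavg-1 exceeds the threshold. Header and "Average:" lines are
# skipped because their second field is not AM or PM.
# Usage: sar -q | load_spikes [threshold]
load_spikes() {
    threshold=${1:-3}
    awk -v t="$threshold" '
        ($2 == "AM" || $2 == "PM") && $5 ~ /^[0-9.]+$/ && $5 + 0 > t + 0 {
            print $1, $2, $5
        }'
}
```

A sar run with a 24-hour clock has no AM/PM column, so the field numbers would need shifting by one.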
-
07-07-2009, 07:23 PM #11Newbie
- Join Date
- Jul 2009
- Posts
- 21
Looking at the last reboot list, it's 100% random. The only thing I can think of is that it's either LiteSpeed or MySQL. If I block all traffic, the server is fine and does not crash, so I'm pretty sure it's something in a running service that receives traffic.
I've edited my.cnf and dropped the values very low, though I always have enough RAM. MySQL seems stable, with nothing in its logs about crashing. I've set the LiteSpeed error logs to HIGH/DEBUG, so I'll see if that shows anything.
But honestly... I'm stumped and out of ideas. :/
-
07-07-2009, 10:34 PM #12WHT Addict
- Join Date
- Jan 2006
- Location
- Dallas, TX
- Posts
- 106
This smells like hardware to me. It could be RAM gone bad, or a failing mobo or proc. That's the sort of thing that causes "random" issues like this. The fact that the server runs fine with all traffic blocked doesn't point away from hardware, because (a) you're reducing the load going through the server (and the RAM and the mobo and the proc) and (b) it's really hard to prove a negative! :0
Chris Gebhardt
VIRTBIZ Internet Services
Web Hosting, Dallas Colocation, Dedicated Server
virtbiz.com | ph (972)485-4125 | toll-free (866)485-4125
-
07-07-2009, 11:41 PM #13Newbie
- Join Date
- Jul 2009
- Posts
- 21
I would consider hardware, but my DC has already confirmed it's not and they even moved me to another server.
-
07-08-2009, 06:23 PM #14Newbie
- Join Date
- Jul 2009
- Posts
- 21
Made some progress... I guess you could call it that...
Basically, the server crashed today and the data center said something about the screen showing a stack overflow... not sure what that means.
-
07-08-2009, 06:53 PM #15Corporate Member
- Join Date
- Feb 2008
- Location
- Houston, Texas, USA
- Posts
- 3,262
Hi,
The fact that you had some lingering orphan inodes means that the shutdown/reboot wasn't graceful. It's a crash. By "stack overflow", your DC probably means a kernel stack overflow, which leads to a kernel panic. That explains the ext3 orphan cleanup after restart.
If indeed your chassis was swapped and hard drives replaced, have you installed any kernel module lately (or third-party software that might have installed one)?
Regards
UNIXy - Fully Managed Servers and Clusters - Established in 2006
Server Management - Unlimited Servers. Unlimited Requests. One Plan!
cPanel Varnish Plugin -- Seamless SSL Caching (Let's Encrypt, AutoSSL, etc)
Slow Site or Server? Unable to handle traffic? Same day performance fix: joe@unixy
-
07-08-2009, 08:56 PM #16Newbie
- Join Date
- Jul 2009
- Posts
- 21
It's your typical cPanel (CURRENT build) with MySQL (5.0.81) and LiteSpeed 4.0.5 running. As for kernel modules... not that I know of. The kernel running now is the default RH one:
Linux hostname 2.6.18-128.1.14.el5PAE #1 SMP Wed Jun 17 07:15:54 EDT 2009 i686 i686 i386 GNU/Linux
I've pretty much ruled out LiteSpeed as the cause (I ran high debug and noticed nothing), so I've edited my.cnf in case the cache values were too high and causing an overflow.
-
07-08-2009, 11:39 PM #17Newbie
- Join Date
- Jul 2009
- Posts
- 21
Nope...
I'm just pretty much lost unless anyone has an idea, no matter how crazy...
Server 1 (original) runs stable/fine now that there is no traffic on it.
Server 2 (new/this one) runs fine, then randomly crashes.
So is it safe to say that traffic or a script is somehow causing a stack overflow? Is it remotely possible that a query or some PHP script could cause the server to randomly crash?
I'm not sure if this might help, but would dropping the cPanel build down to STABLE fix the problem?
-
07-09-2009, 12:23 AM #18Junior Guru Wannabe
- Join Date
- Jun 2009
- Location
- Houston,Tx
- Posts
- 46
Look for core dumps.
When software causes a crash, it may (depending on the software) leave a core dump.
Also, since you said the old server has no traffic and is working fine, what type of hardware does your DC have you on?
Two things come to mind:
1 The hardware-to-driver version match is not right. What is the kernel version?
2 Apache/LiteSpeed, which can cause memory overflows.
The easiest to check is LiteSpeed. Switch to Apache and see if that resolves the issue.
If so, check your RPMs and memory kernel modules. LiteSpeed is extremely memory dependent. Upgrade your kernel and other utilities. Then REINSTALL LiteSpeed. Upgrading a broken installation may not work; in my experience, a reinstall is best.
If this does not resolve the issue, try updating the kernel and all modules, look for core dumps, and run gdb on them to see what is causing it.
Finally, if all else fails, please contact me. I'd be more than willing to take a look.
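A quick way to look for core dumps, sketched under the assumption that they use the common Linux names `core` or `core.<pid>` (the real name depends on the kernel.core_pattern setting). `find_core_dumps` is a hypothetical helper name.

```shell
# List regular files named "core" or "core.<pid>" under a directory,
# staying on one filesystem (-xdev) so /proc and network mounts are skipped.
# Usage: find_core_dumps [directory]
find_core_dumps() {
    dir=${1:-/}
    find "$dir" -xdev -type f \( -name core -o -name 'core.[0-9]*' \) 2>/dev/null
}
```

`ulimit -c` shows whether the current shell allows core files at all (0 means they are disabled). A core file from a user process can then be opened with gdb, as suggested above.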
-
07-09-2009, 01:19 AM #19Corporate Member
- Join Date
- Feb 2008
- Location
- Houston, Texas, USA
- Posts
- 3,262
A user program does not crash a stable Linux kernel. The kernel crashes when:
- It trips on a hardware defect
- It hits a bug in its own code
- It hits a bug in third party code (loadable modules)
A PHP or apache program will not crash a stable kernel. If I had to bet, it would be a hardware defect. That's why I always recommend running business critical tasks on server hardware (Xeon, ECC RAM, redundant power supply, RAID mirroring, etc).
Regards
-
07-09-2009, 11:20 AM #20Junior Guru Wannabe
- Join Date
- Jun 2009
- Location
- Houston,Tx
- Posts
- 46
I agree, third-party software does not cause the crash.
Memory usage does. LiteSpeed is memory intensive and can grab too much memory, and instead of the system reallocating the memory elsewhere, LiteSpeed keeps grabbing more.
Until there is no memory left, and now you have other services fighting to get their memory.
It's not the direct cause, which is why I stated that if switching to Apache works, you should update the kernel and all other kernel modules, since the actual cause would be in those rather than in LiteSpeed itself. MySQL is the most memory-intensive software on the server (normally, on a standard cPanel LAMP server). MySQL cannot crash the server directly, but let it eat up all the memory, and the server will crash. I've seen it way too many times to disregard anything that can take memory and not give it up.
You should add a 4th reason to your list:
It cannot manage memory properly due to the software on the server and the server configuration.
Don't get me wrong, if you tweak the options a bit, the kernel can manage the memory FAR better, and you'd never run into a server crash. The services would flat out stop, and then you'd have a server that was powered on but not truly functional.
I definitely agree with you, and I only mean to explain my pattern of thought here.
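If memory exhaustion were involved, the kernel's OOM killer would normally log it before anything else went wrong. A minimal sketch for checking that, assuming the syslog lives in /var/log/messages; `oom_events` is a hypothetical helper name, and the two patterns are the phrases the kernel commonly uses.

```shell
# Print kernel OOM-killer messages from a syslog file, if any.
# Returns non-zero (grep's exit status) when nothing matches.
# Usage: oom_events [logfile]
oom_events() {
    logfile=${1:-/var/log/messages}
    grep -iE 'out of memory|oom-killer' "$logfile"
}
```

An empty result does not prove memory was fine, since a hard crash can lose the last messages, but repeated hits shortly before each reboot would support the memory theory.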
-
07-09-2009, 02:15 PM #21Newbie
- Join Date
- Jul 2009
- Posts
- 21
I would consider it a memory issue... but it crashed with 800k Free...
As for the LiteSpeed Core Dump, I have it enabled under Web Console -> General -> Enable Core Dump: Yes. Where would I find the dump(s)?
As for the kernel, what version should I consider upgrading to and how do I upgrade the modules? I would have thought that the RH Default would be fine.
-
07-09-2009, 03:17 PM #22Corporate Member
- Join Date
- Feb 2008
- Location
- Houston, Texas, USA
- Posts
- 3,262
Frankly, only your provider can troubleshoot this issue, especially when it causes the machine to crash. All we are doing as responders in this forum is shooting in the dark. A server crash can result from one or more causes, and there are hundreds of possible causes! A server crash has to be meticulously investigated and the results interpreted by an experienced systems person. I have a feeling you're wasting your time implementing fixes that may have nothing to do with the root cause.
I wish there were something I (or others) could do but help coming from this forum is very limited. Right now, your provider should be all hands on deck exploring all clues and possibilities. They gave you an important clue with the on-console kernel stack overflow and I highly recommend you focus on that report. Ask your provider to prove the kernel crash is not hardware related.
Good luck
-
07-09-2009, 03:29 PM #23Junior Guru Wannabe
- Join Date
- Jun 2009
- Location
- Houston,Tx
- Posts
- 46
Well, to answer your questions, a few things must be considered.
1, what is the kernel version now? (cmd: uname -a , this will show us that and more, star out the hostname)
2, What version of redhat are you using? ( cmd: cat /etc/redhat-release)
3, What is currently installed?(cmd: lsmod).
4, What is your system telling you? (cmd: dmesg)
5, What are the last 30 lines of /var/log/messages? (cmd: tail -30 /var/log/messages)
Once we know this, we should be able to assist you with upgrading.
I would really see if you could hire or obtain a system administrator.
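The five commands above can be gathered into one report for posting. This is just a sketch: `collect_diagnostics` is a hypothetical helper name, and a command that is missing or fails is noted rather than stopping the run.

```shell
# Run each diagnostic command from the list above and print a labelled
# section per command. Failures are reported inline instead of aborting.
collect_diagnostics() {
    for cmd in 'uname -a' 'cat /etc/redhat-release' 'lsmod' 'dmesg' \
               'tail -30 /var/log/messages'; do
        echo "== $cmd =="
        sh -c "$cmd" 2>&1 || echo "(command failed or is unavailable)"
        echo
    done
}
```

Something like `collect_diagnostics > report.txt` saves the output to a file; remember to star out the hostname before posting it.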
-
07-09-2009, 04:05 PM #24Newbie
- Join Date
- Jul 2009
- Posts
- 21
1, what is the kernel version now? (cmd: uname -a , this will show us that and more, star out the hostname)
# uname -a
Linux hostname 2.6.18-128.1.14.el5PAE #1 SMP Wed Jun 17 07:15:54 EDT 2009 i686 i686 i386 GNU/Linux
2, What version of redhat are you using? ( cmd: cat /etc/redhat-release)
# cat /etc/redhat-release
CentOS release 5.3 (Final)
3, What is currently installed?(cmd: lsmod).
http://pastebin.com/m7a28c465
4, What is your system telling you? (cmd: dmesg)
http://pastebin.com/m16d2070e (from the log file /var/log/dmesg)
5, What are the last 30 lines of /var/log/messages? (cmd: tail -30 /var/log/messages)
Just a bunch of firewall logs about connections being dropped (no DDoS floods though).
-
07-09-2009, 08:15 PM #25Junior Guru Wannabe
- Join Date
- Jun 2009
- Location
- Houston,Tx
- Posts
- 46
Two things.
1, It looks like your kernel could use an update (yum update kernel, if it's not custom; a custom kernel is entirely possible and would have been installed by your host).
If it is custom, ask your host to update it. Updating is recommended even if it's not custom.
From the output of dmesg, it would appear that the kernel may be custom.
2, It looks like you have SELinux turned on. Run getenforce to find out, and turn it off (edit /etc/selinux/config). Permissive is not a good idea; I'd recommend disabling it completely.
This is only one possible problem. Everything I'm telling you will not hurt the system and is safe to do.
DO NOT UPGRADE YOUR KERNEL UNLESS YOU KNOW WHAT YOU ARE DOING. It must have been installed by yum/rpm to be updated this way. Contact your host if you are unsure how your kernel was installed.
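`getenforce` reports the mode SELinux is running in right now; the mode used at the next boot comes from the SELINUX= line in /etc/selinux/config on CentOS 5. A small sketch for reading that setting; `selinux_configured_mode` is a hypothetical helper name.

```shell
# Print the SELinux mode configured for the next boot (enforcing,
# permissive, or disabled) from the SELINUX= line of the config file.
# Comment lines and the SELINUXTYPE= line do not match the pattern.
# Usage: selinux_configured_mode [config-file]
selinux_configured_mode() {
    config=${1:-/etc/selinux/config}
    sed -n 's/^SELINUX=\([a-z]*\).*/\1/p' "$config"
}
```

Changing the value in that file only takes effect after a reboot.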