  1. #1

    Debugging random server reboot

    I was searching Google when I saw this => webhostingtalk.com/showthread.php?t=873989 and decided to sign up and post my issue...

    For the past few weeks I've been plagued by a server that keeps rebooting randomly.

    Code:
    # last reboot
    reboot   system boot  2.6.18-128.1.14. Tue Jul  7 11:21          (03:27)
    reboot   system boot  2.6.18-128.1.14. Mon Jul  6 18:58          (19:50)
    reboot   system boot  2.6.18-128.1.14. Mon Jul  6 15:04          (23:44)
    reboot   system boot  2.6.18-128.1.14. Mon Jul  6 10:54         (1+03:54)
    reboot   system boot  2.6.18-128.1.14. Mon Jul  6 03:33         (1+11:16)
    reboot   system boot  2.6.18-128.1.14. Sun Jul  5 14:48         (2+00:01)
    reboot   system boot  2.6.18-128.1.14. Sat Jul  4 15:59         (2+22:49)
    reboot   system boot  2.6.18-128.1.14. Sat Jul  4 05:58         (3+08:51)
    reboot   system boot  2.6.18-128.1.14. Fri Jul  3 08:12         (4+06:36)
    reboot   system boot  2.6.18-128.1.14. Thu Jul  2 06:52         (5+07:56)
    reboot   system boot  2.6.18-128.1.14. Thu Jul  2 04:12         (5+10:36)
    reboot   system boot  2.6.18-128.1.14. Wed Jul  1 08:42         (6+06:07)
    reboot   system boot  2.6.18-128.1.14. Wed Jul  1 06:36         (6+08:12)
    reboot   system boot  2.6.18-128.1.14. Mon Jun 29 05:43         (8+09:05)
    reboot   system boot  2.6.18-128.1.14. Sun Jun 28 21:39         (8+17:09)
    reboot   system boot  2.6.18-128.1.14. Sat Jun 27 10:33         (10+04:15)
    reboot   system boot  2.6.27.10-grsec  Thu Jun 25 14:07         (1+20:23)
    reboot   system boot  2.6.18-128.el5PA Thu Jun 25 09:47          (04:17)
    reboot   system boot  2.6.18-128.el5PA Thu Jun 25 05:26          (08:38)
    reboot   system boot  2.6.18-128.el5PA Thu Jun 25 04:46          (09:19)
    No one has server access but me. I've got SSH IP whitelisted, port changed, auth keys only, private/hidden IP, and the root account disabled.

    Is there a way I can monitor this and find out what might be causing it? If I remove the halt and shutdown binaries, might that help?

  2. #2
    Join Date
    Aug 2007
    Location
    Brighton, UK
    Posts
    66
    I wouldn't recommend removing the shutdown command. Check your syslog immediately prior to each reboot and see if the system is logging anything that might give you a clue as to what's causing it. If you're unsure what the syslog is telling you, post some extracts from it here for us to have a look at.
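
    A minimal Python sketch of that check, in case it helps. It assumes each boot shows up in the log as a line containing "syslogd" and "restart" (an assumption about the syslog daemon in use), and it returns the few entries logged just before each such line. The function name, the context size, and the placeholder log lines are all made up for illustration.

```python
from collections import deque


def lines_before_restarts(log_lines, context=5, marker="restart"):
    """For each boot marker line, collect up to `context` lines logged before it."""
    recent = deque(maxlen=context)  # rolling window of the latest lines
    results = []
    for line in log_lines:
        line = line.rstrip("\n")
        if "syslogd" in line and marker in line:
            results.append((line, list(recent)))
            recent.clear()  # start a fresh window for the next boot
        else:
            recent.append(line)
    return results


# Placeholder lines, not real log output.
sample = [
    "entry A",
    "entry B",
    "entry C",
    "host syslogd 1.4.1: restart.",
    "entry D",
    "host syslogd 1.4.1: restart.",
]
for boot_line, before in lines_before_restarts(sample, context=2):
    print("=== " + boot_line)
    for entry in before:
        print("    " + entry)
```

    On a real box you would pass an open file instead of the sample list, for example open("/var/log/messages", errors="replace").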

  3. #3
    Join Date
    May 2008
    Posts
    340
    You can install various tools. One such tool is sysstat (sar), which records the server load, swap, disk I/O, etc., so you can check previous metrics later at any time. This can tell you whether the server is going down due to a load spike.

    Also, what do dmesg and boot.log say?

  4. #4
    /var/log/messages: just shows "syslogd 1.4.1: restart", with nothing before it to indicate a problem.

    /var/log/boot.log: empty

    /var/log/dmesg: pastebin.com/m2a3ed9e8

    How do I set up/configure sysstat (sar)? (I've never heard of it.)

  5. #5
    Join Date
    May 2008
    Posts
    340
    You can set up sar by installing the sysstat package via yum if you're using CentOS (on Debian/Ubuntu, use apt-get instead)


    or compile it from source by downloading it from,

    Twitter : http://twitter.com/eth1networks
    Contact Us : support[at]eth1.in

  6. #6
    So just "yum install sysstat"? Is there anything I need to configure? And what is it that I should look for in the logs?

  7. #7
    Join Date
    May 2008
    Posts
    340
    Yes, once installed you can check all statistics using the command,

    sar -A

  8. #8
    Join Date
    Mar 2009
    Location
    Austin, TX
    Posts
    935
    Is your physical hardware OK? E.g., is the RAM not bad?
    SysAdmin.xyz
    Having servers with customer data on them without proper monitoring is like having a one night stand without using protection - eventually, there will be an 'oh s**t!' moment.

  9. #9
    Just noticed it's already been installed.

    Anyway, here is my log (sar -A):
    errorlogs.netii.net/sar.html

    I'm just not sure what I'm trying to find.

  10. #10
    Join Date
    May 2008
    Posts
    340
    Server loads look stable.
    11:30:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
    11:40:01 AM         2       252      1.89      1.96      1.99
    11:50:01 AM         2       252      1.61      1.94      1.95
    12:00:01 PM         2       262      1.68      1.94      1.91
    12:10:01 PM         5       256      1.44      1.93      2.03
    12:20:01 PM         7       256      3.33      2.44      2.14
    12:30:01 PM         6       263      1.40      1.88      2.03
    12:40:01 PM         6       257      4.17      2.53      2.11
    12:50:01 PM         2       248      1.37      1.45      1.70
    01:00:01 PM        10       262      2.54      1.97      1.77
    01:10:01 PM         5       268      2.20      1.98      1.90
    01:20:01 PM         5       258      1.72      2.11      2.04
    01:30:01 PM         1       251      1.70      1.79      1.92
    01:40:01 PM         4       259      2.16      1.61      1.68
    01:50:01 PM         5       272      3.42      2.84      2.39
    02:00:01 PM         8       257      0.84      1.29      1.80
    02:10:01 PM         4       253      0.75      0.82      1.29
    02:20:01 PM         4       252      0.82      1.03      1.17
    02:30:01 PM         2       260      1.74      2.17      1.65
    02:40:01 PM         4       249      1.65      1.44      1.48
    02:50:01 PM         6       263      2.16      2.25      1.88
    03:00:01 PM         2       263      2.79      2.36      2.12
    03:10:01 PM         4       256      1.52      2.05      2.09
    03:20:01 PM         1       255      1.08      1.29      1.64
    03:30:01 PM         5       260      0.37      0.74      1.19
    03:40:01 PM         1       258      0.47      0.64      0.92
    03:50:01 PM         0       252      2.95      2.21      1.54
    04:00:01 PM         5       268      0.79      1.27      1.35
    04:10:01 PM         2       255      0.54      0.85      1.10
    04:20:01 PM         4       266      1.57      1.26      1.15
    04:30:01 PM         0       249      1.22      1.04      1.08
    04:40:01 PM         4       263      3.42      1.60      1.22
    04:50:01 PM         4       254      1.03      1.08      1.16
    Average:            4       258      1.76      1.68      1.67
    Does it happen intermittently? Check whether there are any cron jobs that could cause this.
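
    For a longer sar capture, something like the following Python sketch could flag load spikes instead of eyeballing the table. It assumes the 12-hour queue/load layout shown in this post (time, AM/PM, runq-sz, plist-sz, three load averages). The function names and the threshold of 3.0 are arbitrary choices for illustration, not recommendations.

```python
def parse_sar_queue(text):
    """Parse sar queue/load rows into dicts, skipping header and Average lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Expected: HH:MM:SS AM|PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
        if len(parts) != 7 or parts[1] not in ("AM", "PM"):
            continue
        try:
            runq, plist = int(parts[2]), int(parts[3])
            ld1, ld5, ld15 = (float(p) for p in parts[4:7])
        except ValueError:
            continue  # header row: column names instead of numbers
        rows.append({"time": parts[0] + " " + parts[1], "runq": runq,
                     "plist": plist, "ldavg1": ld1, "ldavg5": ld5,
                     "ldavg15": ld15})
    return rows


def load_spikes(rows, threshold=3.0):
    """Return rows whose 1-minute load average is at or above the threshold."""
    return [row for row in rows if row["ldavg1"] >= threshold]


# A few rows copied from the table above.
SAMPLE = """\
11:30:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
12:20:01 PM 7 256 3.33 2.44 2.14
12:30:01 PM 6 263 1.40 1.88 2.03
12:40:01 PM 6 257 4.17 2.53 2.11
Average: 4 258 1.76 1.68 1.67
"""
for row in load_spikes(parse_sar_queue(SAMPLE)):
    print(row["time"], row["ldavg1"])
```

    Comparing the flagged times against the boot times from "last reboot" would show whether crashes line up with load at all.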

  11. #11
    Looking at the last reboot list, it's 100% random. The only thing I can think of is that it's either LiteSpeed or MySQL. If I block all traffic, the server is fine and does not crash. So I'm pretty sure it is something with a running service that gets traffic.
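
    One way to put numbers on "random" is to compute the gap between consecutive boots and the hour of day each boot happened; a cron job would tend to show up as repeating hours or regular gaps. This is only a rough Python sketch. It assumes the "last reboot" column layout from the first post, and since that output carries no year, the year is passed in (2009 here is an assumption). The function names are made up.

```python
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}


def boot_times(last_output, year):
    """Extract boot timestamps from `last reboot` output, oldest first."""
    times = []
    for line in last_output.splitlines():
        parts = line.split()
        # Expected: reboot system boot <kernel> <weekday> <month> <day> <HH:MM> ...
        if len(parts) < 8 or parts[0] != "reboot":
            continue
        hour, minute = parts[7].split(":")
        times.append(datetime(year, MONTHS[parts[5]], int(parts[6]),
                              int(hour), int(minute)))
    return sorted(times)


def gaps_in_hours(times):
    """Hours between each boot and the next one."""
    return [round((later - earlier).total_seconds() / 3600, 2)
            for earlier, later in zip(times, times[1:])]


# Three lines copied from the first post.
SAMPLE = """\
reboot   system boot  2.6.18-128.1.14. Tue Jul  7 11:21          (03:27)
reboot   system boot  2.6.18-128.1.14. Mon Jul  6 18:58          (19:50)
reboot   system boot  2.6.18-128.1.14. Mon Jul  6 15:04          (23:44)
"""
times = boot_times(SAMPLE, 2009)
print("gaps (hours):", gaps_in_hours(times))
print("boot hours:  ", [t.hour for t in times])
```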

    I've made edits to my.cnf, dropping the values very low, but I always have enough RAM anyway. MySQL seems stable, with nothing in its logs about crashing. I've set the LiteSpeed error logs to HIGH/DEBUG, so I'll see if that shows anything.

    But honestly... I'm stumped and out of ideas. :/

  12. #12
    Join Date
    Jan 2006
    Location
    Dallas, TX
    Posts
    106
    This smells like hardware to me. Could be RAM gone bad, could be a failing mobo or proc. That's the sort of thing that causes "random" issues like this. The fact that the server runs when you block all traffic doesn't point away from hardware because (a) you're reducing any load that would be going through the server (and the RAM and the mobo and the proc) and (b) it's really hard to prove a negative! :0
    Chris Gebhardt
    VIRTBIZ Internet Services
    Web Hosting, Dallas Colocation, Dedicated Server
    virtbiz.com | ph (972)485-4125 | toll-free (866)485-4125

  13. #13
    I would consider hardware, but my DC has already confirmed it's not and they even moved me to another server.

  14. #14
    Made some progress... I guess you could call it that...

    Basically, the server crashed today and the data center said something about the screen showing a stack overflow... not sure what that means.

  15. #15
    Join Date
    Feb 2008
    Location
    Houston, Texas, USA
    Posts
    3,262
    Hi,

    The fact that you had some lingering orphan inodes means that the shutdown/reboot wasn't graceful. It was a crash. By "stack overflow", your DC probably means a kernel stack overflow, which leads to a kernel panic. That explains the ext3 orphan cleanup after restart.

    If your chassis was indeed swapped and the hard drives replaced, have you installed any kernel module lately (or third-party software that might have installed one)?

    Regards
    UNIXy - Fully Managed Servers and Clusters - Established in 2006
    Server Management - Unlimited Servers. Unlimited Requests. One Plan!
    cPanel Varnish Plugin -- Seamless SSL Caching (Let's Encrypt, AutoSSL, etc)
    Slow Site or Server? Unable to handle traffic? Same day performance fix: joe@unixy

  16. #16
    It's your typical cPanel (CURRENT build) with MySQL (5.0.81) and LiteSpeed 4.0.5 running. As for kernel modules... not that I know of. The kernel running now is the default RH one:

    Linux hostname 2.6.18-128.1.14.el5PAE #1 SMP Wed Jun 17 07:15:54 EDT 2009 i686 i686 i386 GNU/Linux

    I've pretty much ruled LiteSpeed out as the cause (ran high debug and noticed nothing), so I've edited the my.cnf file to see whether I had the cache values too high, which might be causing an overflow.

  17. #17
    Nope...

    I'm just pretty much lost unless anyone has an idea, no matter how crazy...

    Server 1 (original) runs stable now that there is no traffic on it.
    Server 2 (new, this one) runs fine and then randomly crashes.

    So is it safe to say that traffic or a script is somehow causing a stack overflow? Is it remotely possible that a query or some PHP script could cause the server to randomly crash?

    I'm not sure if this might help, but would dropping the cPanel build down to STABLE fix the problem?

  18. #18
    Join Date
    Jun 2009
    Location
    Houston,Tx
    Posts
    46
    Look for core dumps.

    When software causes a crash, it may (depending on the software) leave a core dump.
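
    A rough Python sketch for hunting them down. It assumes the kernel's core_pattern is left at its default, so core files are named "core" or "core.<pid>" and land in the crashing process's working directory. The function names and the directories you search are up to you. Keep in mind this only finds user-space core files; a kernel panic itself would not leave one of these.

```python
import os
import re

# Default Linux naming: "core", or "core.<pid>" when core_uses_pid is set.
CORE_NAME = re.compile(r"^core(\.\d+)?$")


def is_core_name(filename):
    """True if the file name looks like a default-named core dump."""
    return CORE_NAME.match(filename) is not None


def find_core_files(root):
    """Walk `root` and return paths of regular files that look like core dumps."""
    found = []
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(directory, filename)
            if is_core_name(filename) and os.path.isfile(path):
                found.append(path)
    return sorted(found)


# Example: search a single directory tree (a missing path simply yields nothing).
for path in find_core_files("/tmp"):
    print(path, os.path.getsize(path), "bytes")
```

    Any file it turns up can then be opened with gdb together with the binary that produced it.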

    Also, since you said the old server has no traffic and is working fine, what type of hardware does your DC have you on?

    Two things come to mind:

    1 The hardware and driver versions do not match. What is the kernel version?

    2 Apache/LiteSpeed, which can cause memory overflows.


    The easiest to check is LiteSpeed. Switch it to Apache and see if that resolves the issue.

    If so, check your RPMs and memory-related kernel modules. LiteSpeed is extremely memory dependent. Upgrade your kernel and other utilities, then REINSTALL LiteSpeed. Upgrading a broken installation may not work; in my experience, a reinstall is best.

    If this does not help resolve the issue, try updating the kernel and all modules, look for core dumps, and run gdb on them to see what is causing it.

    Finally, if all else fails, please contact me. I'd be more than willing to take a look.

  19. #19
    Join Date
    Feb 2008
    Location
    Houston, Texas, USA
    Posts
    3,262
    A user program does not crash a stable Linux kernel. The kernel crashes when:

    • It trips on a hardware defect
    • It hits a bug in its own code
    • It hits a bug in third party code (loadable modules)


    A PHP or Apache program will not crash a stable kernel. If I had to bet, it would be a hardware defect. That's why I always recommend running business-critical tasks on server hardware (Xeon, ECC RAM, redundant power supply, RAID mirroring, etc.).

    Regards

  20. #20
    Join Date
    Jun 2009
    Location
    Houston,Tx
    Posts
    46
    I agree, third-party software does not cause the crash.

    Memory usage does. LiteSpeed is memory intensive and can grab too much memory, and instead of the system reallocating the memory elsewhere, LiteSpeed keeps grabbing more.

    Until there is no memory left, and now you have other services fighting to get their memory.

    It's not the direct cause, which is why I stated that if switching to Apache works, you should update the kernel and all other kernel modules, since the actual cause would be in those rather than in LiteSpeed itself. MySQL is normally the most memory-intensive software on a standard cPanel LAMP server. MySQL cannot crash the server directly, but let it eat up all the memory and the server will crash. I've seen it way too many times to disregard anything that can use memory and not give it up.

    You should add a 4th reason to your list:

    It cannot manage memory properly due to the software on the server, and the server configuration.

    Don't get me wrong: if you tweak the options a bit, the kernel can manage the memory FAR better, and you'd never run into a server crash. The services would flat out stop, and then you'd have a server that was powered on, but not truly functional.

    I definitely agree with you, and I only mean to explain my pattern of thought here.
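
    To test the memory theory rather than guess, a small Python sketch like this could be run from cron to log how much memory is really left. It parses /proc/meminfo-style text. Treating MemFree + Buffers + Cached as "roughly available" is a common approximation and an assumption here, and the 64 MB floor, the function names, and the sample numbers are made up.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def low_memory(values, min_available_kb=65536):
    """True if free memory plus buffers and page cache is below the floor."""
    available = (values.get("MemFree", 0) + values.get("Buffers", 0)
                 + values.get("Cached", 0))
    return available < min_available_kb


# Made-up sample, not taken from the server in this thread.
SAMPLE = """\
MemTotal:      4148240 kB
MemFree:           800 kB
Buffers:        120000 kB
Cached:         900000 kB
SwapFree:            0 kB
"""
info = parse_meminfo(SAMPLE)
print("MemFree kB:", info["MemFree"], "low memory:", low_memory(info))
```

    The point of counting buffers and cache is that a small MemFree number by itself says little, because Linux uses otherwise idle RAM for cache and hands it back under pressure. On a live system the input would be open("/proc/meminfo").read().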

  21. #21
    I would consider it a memory issue... but it crashed with 800k Free...

    As for the LiteSpeed Core Dump, I have it enabled under Web Console -> General -> Enable Core Dump: Yes. Where would I find the dump(s)?

    As for the kernel, what version should I consider upgrading to and how do I upgrade the modules? I would have thought that the RH Default would be fine.

  22. #22
    Join Date
    Feb 2008
    Location
    Houston, Texas, USA
    Posts
    3,262
    Frankly, only your provider can troubleshoot this issue, especially when it causes the machine to crash. All we are doing as responders in this forum is shooting in the dark. A server crash can result from one or more of hundreds of possible causes. It has to be meticulously investigated, and the results interpreted, by an experienced systems person. I have a feeling you're wasting your time implementing fixes that may have nothing to do with the root cause.

    I wish there were something I (or others) could do but help coming from this forum is very limited. Right now, your provider should be all hands on deck exploring all clues and possibilities. They gave you an important clue with the on-console kernel stack overflow and I highly recommend you focus on that report. Ask your provider to prove the kernel crash is not hardware related.

    Good luck

  23. #23
    Join Date
    Jun 2009
    Location
    Houston,Tx
    Posts
    46
    Well, to answer your questions, a few things must be considered.

    1, what is the kernel version now? (cmd: uname -a , this will show us that and more, star out the hostname)

    2, What version of redhat are you using? ( cmd: cat /etc/redhat-release)

    3, What kernel modules are currently loaded? (cmd: lsmod)

    4, What is your system telling you? (cmd: dmesg)

    5, What are the last 30 lines of /var/log/messages? (cmd: tail -30 /var/log/messages)

    Once we know this, we should be able to assist you with upgrading.
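
    If you'd rather grab all five in one go, here is a hedged Python sketch that runs the commands listed above and collects their output into a single report. The command list mirrors this post; the function names, the labels, and the 30-second timeout are assumptions. Commands that are missing or fail are recorded instead of stopping the run.

```python
import subprocess
import sys

# The five checks from the list above, as argument vectors.
DIAGNOSTICS = {
    "kernel version": ["uname", "-a"],
    "distribution release": ["cat", "/etc/redhat-release"],
    "loaded kernel modules": ["lsmod"],
    "kernel ring buffer": ["dmesg"],
    "recent syslog": ["tail", "-30", "/var/log/messages"],
}


def run_diagnostics(commands, timeout=30):
    """Run each command and map its label to stdout or a short failure note."""
    results = {}
    for label, argv in commands.items():
        try:
            proc = subprocess.run(argv, capture_output=True, encoding="utf-8",
                                  errors="replace", timeout=timeout)
        except (OSError, subprocess.TimeoutExpired) as exc:
            results[label] = "[not run] %s" % exc
            continue
        if proc.returncode == 0:
            results[label] = proc.stdout
        else:
            results[label] = "[exit %d] %s" % (proc.returncode,
                                               proc.stderr.strip())
    return results


def write_report(results, out=sys.stdout):
    """Write each result under a header line naming the check."""
    for label, output in results.items():
        out.write("===== %s =====\n%s\n" % (label, output.rstrip()))
```

    Calling write_report(run_diagnostics(DIAGNOSTICS)) as root prints everything in order. Remember to star out the hostname before posting the result.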

    I would really see if you could hire or obtain a system administrator.
    Greg Borbonus
    greg@ableadmins.com LinkedIn Profile
    skype: greg.borbonus
    832-699-0461

  24. #24
    1, what is the kernel version now? (cmd: uname -a , this will show us that and more, star out the hostname)
    # uname -a
    Linux hostname 2.6.18-128.1.14.el5PAE #1 SMP Wed Jun 17 07:15:54 EDT 2009 i686 i686 i386 GNU/Linux

    2, What version of redhat are you using? ( cmd: cat /etc/redhat-release)
    # cat /etc/redhat-release
    CentOS release 5.3 (Final)

    3, What is currently installed?(cmd: lsmod).
    http://pastebin.com/m7a28c465

    4, What is your system telling you? (cmd: dmesg)
    http://pastebin.com/m16d2070e (from the log file /var/log/dmesg)

    5, What are the last 30 lines of /var/log/messages? (cmd: tail -30 /var/log/messages)
    Just a bunch of firewall log entries about connections being dropped (no DDoS floods though)

  25. #25
    Join Date
    Jun 2009
    Location
    Houston,Tx
    Posts
    46
    2 things.

    1, looks like your kernel could use an update (yum update kernel, if it's not custom; a custom kernel is entirely possible and would have been installed by your host).

    If it is custom, ask your host to update, which is recommended, even if it's not custom.

    From the output of dmesg, it would appear that perhaps the kernel is custom.

    2, It looks like you have SELinux turned on. Run getenforce to find out, and turn it off (edit /etc/selinux/config). Permissive is not a good idea; I'd recommend disabling it completely.
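
    To see what mode the box will come up in after the next reboot (getenforce only reports the current state), a small read-only Python sketch can pull the SELINUX= value out of the config file. It assumes the standard key=value layout of /etc/selinux/config; the function name and the sample text are illustrative.

```python
def selinux_mode_from_config(config_text):
    """Return the value of the SELINUX= setting in lower case, or None if absent."""
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "SELINUX":
            return value.strip().lower()
    return None


# Illustrative sample in the usual layout of the config file.
SAMPLE_CONFIG = """\
# This file controls the state of SELinux on the system.
SELINUX=enforcing
SELINUXTYPE=targeted
"""
print(selinux_mode_from_config(SAMPLE_CONFIG))
```

    On the server itself the input would be open("/etc/selinux/config").read(). The sketch changes nothing; editing the file and rebooting is still a manual step.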

    This is only one possible problem. Nothing I'm telling you will hurt the system; it is all safe to do.

    DO NOT UPGRADE YOUR KERNEL UNLESS YOU KNOW WHAT YOU ARE DOING. It must be installed by yum/rpm to be updated this way. Contact your host if you are unsure of how your kernel was installed.
