  1. #1

    Server keeps going down/crashing

    I have a server with SoftLayer. Every couple of days, at different times, the server will go down. I'll be unable to access DA or SSH, or log in via the IPMI text console. The server logs don't show anything unusual and we have ruled out the possibility of hardware faults.

    I'm 99% sure the server hasn't actually "crashed", since it's responding to ping and I can sometimes get the SSH login to come up before the connection closes. SL can't find any problems in the logs and they are also unable to log in while the problem is occurring.

    I have been told to monitor the server and report anything strange. Obviously, this isn't possible 24/7 so I'm looking into my options. Is there any software which will run every minute or couple minutes and dump everything running on the server at that time, so the next time it does crash, we could reboot and look at what happened just before?

    Appreciate all suggestions.

    Scott
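
One way to get the kind of periodic dump asked about above is a small script run from cron every minute. The following is only a minimal sketch, assuming Python 3 is available on the server; the log path and the list of commands are illustrative choices, not something from this thread.

```python
import datetime
import os
import subprocess

# Hypothetical log location; choose a partition with free space.
LOG_PATH = "/var/log/snapshot.log"

# Standard commands whose output is worth keeping (illustrative selection).
COMMANDS = [
    ["uptime"],
    ["free", "-m"],
    ["ps", "auxww"],
]


def run(cmd):
    """Run one command and return its output, or a note if it fails."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=20,
        )
        return result.stdout
    except (OSError, subprocess.SubprocessError) as exc:
        return "[failed: {}]\n".format(exc)


def build_snapshot(commands=COMMANDS):
    """Return one timestamped block containing the output of every command."""
    stamp = datetime.datetime.now().isoformat()
    parts = ["===== {} =====\n".format(stamp)]
    for cmd in commands:
        parts.append("--- {} ---\n".format(" ".join(cmd)))
        parts.append(run(cmd))
    return "".join(parts)


def write_snapshot(path=LOG_PATH, commands=COMMANDS):
    """Append a snapshot to the log and force it to disk."""
    with open(path, "a") as log:
        log.write(build_snapshot(commands))
        log.flush()
        # fsync so the last snapshot is more likely to survive a hard hang.
        os.fsync(log.fileno())
```

To use it, add a final line calling `write_snapshot()` and schedule the file from cron at a one-minute interval. After the next reboot, the last blocks in the log show what was running shortly before the hang. The log grows quickly, so it would need rotating.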

  2. #2
    Join Date
    Jan 2004
    Location
    York, UK
    Posts
    371
    Do you have anything running that automatically manages firewall rulesets? Like lfd, which I use to ban IPs from which brute-force password-guessing attempts occur, or bfd, which does a similar job.

    If so, try turning that off for a short while and see if that helps (don't turn the firewall off completely, of course, just the tools that change it without your intervention). It could be that such a tool is misconfigured or has a bug, and is mangling the firewall rules when the bug is triggered.

  3. #3
    I don't think it's the firewall. We have APF+BFD. We can ping the server but can't log in or access any services (DA, ftp, httpd). The datacenter can't log in via console either (it times out or gives a blank screen). But thanks for your suggestion.

  4. #4
    Join Date
    Feb 2004
    Location
    here and there
    Posts
    767
    Sounds like it's crashing to me - something is causing a lock, but the stack is still able to reply to ICMP ping packets. I've seen this on some machines before.

    What OS? Have you compiled your own kernel? Any weird software you're running?
    Dedicated Servers, Virtual Machines, Colocation, BGP & IPs
    objx.net - AS33333 - Salt Lake, Utah
    awknet.com - AS17048 - Los Angeles, California

  5. #5
    It's running RHEL 4 (64-bit), 2.6.18 kernel w/ grsecurity, but it was running fine for 20+ days, then started crashing every day (or every other day). Nothing unusual, Apache 2.2, PHP 5.1.6, MySQL 5, DA, ... etc.

    I'll try an older grsec kernel for a few days though (or maybe compile a new one) since the 2.6.18 patch was from ~spender so may not be stable/tested much.

    Another note: I tried running a PHP script from the command line and it's been segfaulting halfway through.

  6. #6
    Join Date
    Nov 2005
    Location
    Great Falls, VA
    Posts
    160
    Same thing happened to mine starting about a week ago. Nothing very unusual showed up in logs, etc. After crashing about twice a day for 4 days and ruling out every other possibility, we did a chassis swap and it has been perfectly fine since then.

  7. #7
    besposito, thanks. I had the ram changed but that didn't make a difference. I'll try the latest kernel (2.6.18.2 w/ grsec, hopefully 2.6.18.1 patch will work) and failing that I'll ask for the chassis to be swapped.

  8. #8
    Join Date
    Feb 2004
    Location
    here and there
    Posts
    767
    I'd go directly to the chassis swap. If it ran fine for 20+ days without a hitch it's likely power related, or you've got a short somehow...

  9. #9
    Join Date
    Nov 2005
    Location
    Great Falls, VA
    Posts
    160
    Quote Originally Posted by bloghost
    besposito, thanks. I had the ram changed but that didn't make a difference. I'll try the latest kernel (2.6.18.2 w/ grsec, hopefully 2.6.18.1 patch will work) and failing that I'll ask for the chassis to be swapped.
    Actually, that's funny you mention that. We had the RAM changed as well and it continued with the same problem until we did a chassis swap. By the way, I'm besposito, I accidentally logged in with my old name.
    JetNet, LLC
    Fast, Reliable, Affordable
    Shared • Reseller • Dedicated
    http://www.jetnethost.com

  10. #10
    I'm on 2.6.18.2 now. I'll hold out and keep my phone close so if it goes down, I can have them swap the chassis completely. Thanks!

  11. #11
    Server went down again a couple hours ago. SoftLayer have scheduled a full chassis swap for later today.

  12. #12
    Join Date
    Nov 2005
    Location
    Great Falls, VA
    Posts
    160
    Good, that should correct the issue.

  13. #13
    Let us know how it goes

    BTW, regarding the monitoring software: you should have at least sar installed. It's an absolute minimum. There are also tools for on-site and off-site monitoring: monit, nagios, OpenNMS.
    If you only have 1 server to monitor, check monit first. I do suggest you start using monitoring software even if this issue (hopefully) is resolved by the full chassis swap. Good luck.
    :: Mountain Network Systems :: 323-933-9291
    eCommerce solutions since 1995
    http://www.webcart.net/
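
For the off-site side of monitoring, the symptom in this thread (ping answers, but SSH, HTTP and DA do not) can be caught by a check that opens TCP connections instead of pinging. The sketch below is an assumption-laden illustration, not how monit or nagios work internally: the host name is a placeholder, and the port list assumes the usual defaults (22 for SSH, 80 for HTTP, 2222 for DirectAdmin).

```python
import socket
import time

# Hypothetical host and ports; adjust to the services actually in use.
HOST = "server.example.com"
PORTS = {"ssh": 22, "http": 80, "directadmin": 2222}


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_services(host, ports, timeout=5.0):
    """Map each service name to True (reachable) or False."""
    return {name: port_open(host, port, timeout) for name, port in ports.items()}


def format_report(host, results):
    """Build a one-line, timestamped status summary suitable for a log."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    down = sorted(name for name, ok in results.items() if not ok)
    status = "DOWN: " + ", ".join(down) if down else "all reachable"
    return "{} {} {}".format(stamp, host, status)
```

Run from another machine every minute or so, `format_report(HOST, check_services(HOST, PORTS))` gives a timeline of when services stopped answering. A bare connect can still succeed during a partial hang (the first post mentions the SSH login sometimes appearing before the connection drops), so reading the service banner would be a stricter check.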

  14. #14
    Join Date
    Nov 2002
    Location
    WebHostingTalk
    Posts
    8,901
    * Moved to Technical and Security Issues...

    Sirius
    I support the Human Rights Campaign!
    Moving to the Tampa, Florida area? Check out life in the suburbs in Trinity, Florida.

  15. #15
    Quote Originally Posted by Webcart
    Let us know how it goes

    BTW, regarding the monitoring software: you should have at least sar installed. It's an absolute minimum. There are also tools for on-site and off-site monitoring: monit, nagios, OpenNMS.
    If you only have 1 server to monitor, check monit first. I do suggest you start using monitoring software even if this issue (hopefully) is resolved by the full chassis swap. Good luck.
    Thanks, I'll have a look. btw, I have more than 1 server.

    02:30:29 up 1 day, 13:34, 1 user, load average: 0.27, 0.20, 0.21

  16. #16
    It crashed again this morning unfortunately.
