  1. #1
    Join Date
    Jun 2006
    Posts
    1,112

    TIME_WAIT and connection time outs

    Hi

    I've one main web server, the problem is that many people (now including myself) are often receiving "Connection timed out" messages in their web browser when trying to visit websites. This web server is a CentOS 5 machine and the HTTP server in use is Apache 2.2.

    Of course, I've considered contacting server admin people who will look at this sort of thing on a one-off price or manage my servers at a periodic billing rate - but I'd much prefer to see what others have to say here first... hopefully learn some new stuff. It isn't a huge problem right now, but it can be annoying browsing the websites because a refresh would be required to connect again. I've learnt everything I know about Linux etc myself so far, through the likes of WebHostingTalk.. now is time for me to learn about TCP, HTTP, Apache and more if anybody has any ideas about this problem.

    When running netstat, I'm seeing a rather large number of connections in TIME_WAIT. I'm thinking this could have something to do with the connection time outs?

    Here is my netstat output for TCP: http://pastebin.com/m8646943 - notice all of the HTTP TIME_WAIT's for gangsternation.net? (also, a couple of other sites with less traffic)

    Any information, guidance, thoughts and more would be great! Remember that I'm not looking to go into server management right now.. I'd rather try and learn - I am happy for any recommendations though if you really think it's worth it!

    Thanks very much

  2. #2
    Join Date
    Nov 2002
    Location
    Portland, Oregon
    Posts
    2,992
    Netstat is a little utility that many administrators use to monitor the network connections on their servers. It is quite useful for tracking down that small subset of performance bottlenecks that aren't attributable to yet another piece of convoluted application code that some careless programmer wrote and now you have to take care of. But I digress.

    C:\>netstat -np tcp

    Active Connections

    Proto  Local Address    Foreign Address      State

    TCP    192.168.0.1:80   192.168.0.12:1217    ESTABLISHED
    TCP    192.168.0.1:80   192.168.0.5:1218     TIME_WAIT
    TCP    192.168.0.1:80   192.168.0.234:1252   TIME_WAIT
    TCP    192.168.0.1:80   192.168.0.37:1267    ESTABLISHED
    TCP    192.168.0.1:80   192.168.0.23:1298    TIME_WAIT
    TCP    192.168.0.1:80   192.168.0.32:1345    TIME_WAIT

    And so on and on, for many, many lines. Each line here represents a connection between a TCP socket on your server and a matching one on some other machine--usually an HTTP client such as a browser or proxy server, but depending on your architecture you might also see connections to other kinds of servers (database, application, directory, etc.). Each connection has a unique combination of IP addresses and port numbers that identify the endpoints to which the sockets are bound. More to the point, each one also has a state indicator. As connections are set up, used, and torn down, they pass through a variety of these states, most of which aren't shown here because they come and go quite quickly.

    The connections in the ESTABLISHED state are, well, established--they are neither being set up nor torn down but just used. This is what you will often see the most of. But what about the others? On a busy HTTP server, the number of sockets in this TIME_WAIT state can far exceed those in the ESTABLISHED state. For instance, I checked an IIS 6.0 box that serves a fairly busy corporate site earlier today and got 124 ESTABLISHED connections versus 431 in TIME_WAIT.
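    If you want to do the same tally yourself, a quick sketch (using the sample lines from the output above--on a Linux box the usual one-liner is `netstat -ant | awk '{print $6}' | sort | uniq -c`):

    ```python
    from collections import Counter

    # Tally connection states from netstat-style output. The sample text
    # here is the six connections shown in the post above.
    netstat_output = """\
    TCP 192.168.0.1:80 192.168.0.12:1217 ESTABLISHED
    TCP 192.168.0.1:80 192.168.0.5:1218 TIME_WAIT
    TCP 192.168.0.1:80 192.168.0.234:1252 TIME_WAIT
    TCP 192.168.0.1:80 192.168.0.37:1267 ESTABLISHED
    TCP 192.168.0.1:80 192.168.0.23:1298 TIME_WAIT
    TCP 192.168.0.1:80 192.168.0.32:1345 TIME_WAIT
    """

    # The state is the last whitespace-separated field on each line.
    counts = Counter(line.split()[-1] for line in netstat_output.splitlines())
    print(counts)  # Counter({'TIME_WAIT': 4, 'ESTABLISHED': 2})
    ```

    Run that against real `netstat -ant` output and you get an instant picture of how your server's connections are distributed across states.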

    What does this all mean? More importantly, is it something you should be worried about?

    The answers are:

    1. It's complicated.

    2. Maybe.

    To understand what all those TIME_WAITs are doing there, it's useful to review (or learn) a little TCP. I'll wait here while you brush up on RFC793.

    That was fast. Just kidding. The bit you need to know is so simple, even I can explain it.

    As you know, TCP provides a reliable connection between two endpoints, across which data can be sent in segmented form. As part of this, TCP also provides a mechanism for gracefully shutting down such connections. This is accomplished with a four-way handshake, which can be diagrammed like so:

    Server                            Client

      -------------- FIN ------------->

      <------------- ACK --------------

      <------------- FIN --------------

      -------------- ACK ------------->

    As you can see by this very sophisticated diagram, a graceful shutdown requires the two endpoints to exchange some TCP/IP packets with the FIN and ACK bits set, in a certain sequence. This exchange of packets in turn corresponds to certain state changes on each side of the connection. In the diagram, I've labeled the two sides "Server" and "Client" such that the sequence of events mirrors what usually happens when HTTP connections are closed.

    Here is what happens, step-by-step:

    1. First the application at one endpoint--in this example, that would be the Web server--initiates what is called an "active close." The Web server itself is now done with the connection, but the TCP implementation that supplied the socket it was using still has some work to do. It sends a FIN to the other endpoint and goes into a state called FIN_WAIT_1.

    2. Next the TCP endpoint on the browser's side of the connection acknowledges the server's FIN by sending back an ACK, and goes into a state called CLOSE_WAIT. When the server side receives this ACK, it switches to a state called FIN_WAIT_2. The connection is now half-closed.

    3. At this point, the socket on the client side is in a "passive close," meaning it waits for the application that was using it (the browser) to close. When this happens, the client sends its own FIN to the server, and deallocates the socket on the client side. It's done.

    4. When the server gets that last FIN, it of course sends back an ACK to acknowledge it, and then goes into the infamous TIME_WAIT state. For how long? Ah, there's the rub.
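    The four steps above can be sketched as a small transition table (a simplification of the RFC 793 state machine covering only this graceful-close path--note that the passive side actually passes through a LAST_ACK state between steps 3 and 4, which the step-by-step above glosses over):

    ```python
    # (current_state, event) -> next_state, for each side of the close.
    # "app_close" means the application calls close(); the other events
    # are segment arrivals from the peer.
    ACTIVE = {  # the side that closes first (the server here)
        ("ESTABLISHED", "app_close"): "FIN_WAIT_1",  # step 1: send FIN
        ("FIN_WAIT_1", "recv_ack"):   "FIN_WAIT_2",  # step 2: our FIN acked
        ("FIN_WAIT_2", "recv_fin"):   "TIME_WAIT",   # step 4: peer's FIN
    }
    PASSIVE = {  # the side that closes second (the browser)
        ("ESTABLISHED", "recv_fin"):  "CLOSE_WAIT",  # step 2: got FIN, sent ACK
        ("CLOSE_WAIT", "app_close"):  "LAST_ACK",    # step 3: send our own FIN
        ("LAST_ACK", "recv_ack"):     "CLOSED",      # final ACK arrives, done
    }

    def run(table, events, state="ESTABLISHED"):
        """Walk a sequence of events through one side's transition table."""
        for event in events:
            state = table[(state, event)]
        return state

    print(run(ACTIVE, ["app_close", "recv_ack", "recv_fin"]))   # TIME_WAIT
    print(run(PASSIVE, ["recv_fin", "app_close", "recv_ack"]))  # CLOSED
    ```

    The asymmetry is the whole point: the passive closer ends up fully CLOSED, while the active closer is left holding the TIME_WAIT bag.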

    The socket that initiated the close is supposed to stay in this state for twice the Maximum Segment Lifetime--2MSL in geek speak. The MSL is supposed to be the length of time a TCP segment can stay alive in the network. So, 2MSL makes sure that any segments still out there when the close starts have time to arrive and be discarded. Why bother with this, you ask?

    Because of delayed duplicates, that's why. Given the nature of TCP/IP, it's possible that, after an active close has commenced, there are still duplicate packets running around, trying desperately to make their way to their destination sockets. If a new socket binds to the same IP/port combination before these old packets have had time to get flushed out of the network, old and new data could become intermixed. Imagine the havoc this could cause around the office: "You got JavaScript in my JPEG!"

    So, TIME_WAIT was invented to keep new connections from being haunted by the ghosts of connections past. That seems like a good thing. So what's the problem?

    The problem is that 2MSL happens to be a rather long time--240 seconds, by default. There are several costs associated with this. The state for each socket is maintained in a data structure called a TCP Control Block (TCB). When IP packets come in they have to be associated with the right TCB, and the more TCBs there are, the longer that search takes. Modern implementations of TCP combat this by using a hash table instead of a linear search. Also, since each TIME_WAIT ties up an IP/port combination, too many of them can lead to exhaustion of the default number of ephemeral ports available for handling new requests. And even if the TCB search is relatively fast, and even if there are plenty of ports to bind to, the extra TCBs still take up memory on the server side. In short, the need to limit the costs of TIME_WAIT turns out to be a long-standing problem. In fact, this was part of the original case for persistent connections in HTTP 1.1.
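    The port-exhaustion cost can be made concrete with some back-of-the-envelope arithmetic (assuming the 240-second default above and the classic default ephemeral port range of 1024-5000 on a box making outbound connections--your own range and timer may differ):

    ```python
    time_wait_seconds = 240        # default 2MSL interval discussed above
    ephemeral_ports = 5000 - 1024  # classic default ephemeral port range

    # Each outbound connection ties up one ephemeral port for 2MSL after the
    # active close, so the sustainable new-connection rate is bounded by:
    max_rate = ephemeral_ports / time_wait_seconds
    print(round(max_rate, 1))  # 16.6 -- only ~16 new connections per second
    ```

    That ceiling is why a busy box can start refusing new connections long before CPU or bandwidth become a problem.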

    The good news is that you can address this problem by shortening the TIME_WAIT interval. This article by Brett Hill explains how to do so for IIS. As Brett explains, four minutes is probably longer than needed for duplicate packets to flush out of the network, given that modern network latencies tend to be much shorter than that. The bad news is that, while shortening the interval is quite common, it still entails risks. As Faber, Touch and Yue (who are the real experts on this) explain: "The size of the MSL to maintain a given memory usage level is inversely proportional to the connection rate." In other words, the more you find yourself needing to reduce the length of TIME_WAIT, the more likely doing so will cause problems.

    How's that for a Catch-22?
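    One related mitigation the write-up above doesn't mention: for the specific problem of a restarted server being unable to re-bind its port while old connections drain through TIME_WAIT, the SO_REUSEADDR socket option exists. Here is a minimal runnable sketch on the loopback interface (the port number is whatever the OS hands out; nothing here is from the thread above):

    ```python
    import socket

    # Set up a listener on an ephemeral loopback port and make one connection.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    conn, _ = listener.accept()

    conn.close()    # server side closes first: the "active close" -> TIME_WAIT
    client.close()
    listener.close()

    # Re-bind the same port immediately. With SO_REUSEADDR this succeeds even
    # if the old connection is still sitting in TIME_WAIT; without the option,
    # bind() would typically fail with EADDRINUSE until 2MSL expires.
    s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s2.bind(("127.0.0.1", port))
    s2.close()
    ```

    This doesn't make the TIME_WAIT entries go away--it just lets a restarting daemon (Apache included) reclaim its listening port without waiting out the timer.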

    ---------------------------------------------------
    Personal opinion without looking at the server... I'd lean towards a network/bandwidth issue. Perhaps too many HTTP connections are overwhelming Apache (though that's less likely). If there is nothing relevant in /var/log/messages, /var/log/httpd/error.log, and the other standard logs, the next place to look would probably be the network connection. When you notice the timeouts, are you able to ping your domain and/or IP?
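    If you want something a bit more systematic than pinging by hand, here's a small sketch that probes TCP reachability and can be left running in a loop to timestamp the outages (the hostname in the usage comment is just your own domain from the netstat paste, used as a placeholder):

    ```python
    import socket
    import time

    def tcp_check(host, port, timeout=5.0):
        """Attempt a TCP connect; return (reachable, seconds_elapsed)."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True, time.monotonic() - start
        except OSError:
            return False, time.monotonic() - start

    # Hypothetical usage -- run in a loop and log, e.g.:
    #   ok, elapsed = tcp_check("gangsternation.net", 80)
    #   print(time.strftime("%H:%M:%S"),
    #         "OK" if ok else "FAIL",
    #         f"{elapsed * 1000:.0f} ms")
    ```

    Correlating those timestamps with the moments your browser shows "Connection timed out" would tell you whether it's the whole network path (ping and port 80 both fail) or just the HTTP server.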
    Last edited by Johnny Cache; 03-30-2008 at 07:37 PM.

  3. #3
    Join Date
    Nov 2002
    Location
    Portland, Oregon
    Posts
    2,992

  4. #4
    Join Date
    Jun 2006
    Posts
    1,112
    Thanks for the reply 5ivepdx. Yeah, I read that article after a quick search on Google earlier. I've checked the Apache error log and /var/log/messages--absolutely nothing pointing to the issues I'm having, sadly.

    I'll do a bit more research now--ping the domain and see if it fails at the same time I'm seeing connection time outs.

  5. #5
    Join Date
    Jun 2006
    Posts
    1,112
    Ok, some results back from pinging. It seems that whenever the browser displays "connection timed out", pinging also fails with "Request timed out"... so I'm thinking a network issue? Where should I be looking from here? It's odd though, because the usual ping is around 30-35ms, but sometimes it will jump up to 40-125ms randomly! The larger ping times may be down to my ISP though, I guess...

    Thanks
    Last edited by DevMonkey; 03-30-2008 at 08:04 PM. Reason: Removing signature from second post, no need to display it twice!

  6. #6
    Join Date
    Jun 2006
    Posts
    1,112
    Odd actually, I've also noticed the same "connection timed out" messages when accessing images on a different static-serving web server with the same provider. This server is using lighttpd though, not Apache. Should I therefore be asking my provider about this? I'm worried that it could just be dismissed if I don't provide enough information about what's going on..

  7. #7
    Join Date
    Nov 2002
    Location
    Portland, Oregon
    Posts
    2,992
    If the different server is one of yours, then your bandwidth is probably being jammed up with the web requests.

    If the server is elsewhere and you can't ping it, it's something going on with the datacenter/ISP. Let them know you receive sporadic network timeouts with inability to ping the IPs, and ask them to research it.

  8. #8
    Join Date
    Jun 2006
    Posts
    1,112
    Yeah the other server is mine too, both web servers for different content, each running a different HTTP server - both experiencing connection time outs every now and again.

  9. #9
    Join Date
    Jun 2006
    Posts
    1,112
    A quick update: this was a problem between my provider and a certain (large) ISP--problem solved!
