Hey salmanpasta,
Sorry about that incorrect code snippet. While testing I originally had it a bit different, using sed to show which domains were getting requests as well as the IPs. I chopped that bit out so I could show you how to grab all of the unique IPs hitting your server globally, and it looks like somewhere along the line I pasted the wrong thing.
The original code I had was this:
Code:
zgrep "20/Feb/2013:06:4" /home/*/logs/* | grep "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0" | sed -e 's#/home/.*/logs/##' -e 's#-Feb-2013.gz:# #' | awk '{print $2}' | sort -n | uniq -c | sort -n
That one does correctly display the IPs, and here is what I was also using to include the domains:
Code:
zgrep "20/Feb/2013:06:4" /home/*/logs/* | grep "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9) Gecko/2008052906 Firefox/3.0" | sed -e 's#/home/.*/logs/##' -e 's#-Feb-2013.gz:# #' | awk '{print $1,$2}' | sort -nk2 -k1 | uniq -c | sort -n
Here is some dummy example output of what that provides (request count, domain, IP):
Code:
85 shop.example.com 125.125.125.125
85 example.com 127.127.127.127
97 example.com 124.124.124.124
187 example.com 123.123.123.123
Nice job on figuring out how to find the IPs on your own! The way you did it is also really nice in this case, since it is going to capture requests from all of February with that user-agent instead of just the 10-minute window I was grabbing from.
In this case because we noticed a particularly old, odd, foreign user-agent, and then noticed that all of the IPs seem to belong to an Amazon service that can be used by people trying to make themselves the next Google by making their own crawlers, it would probably be safe to block any IPs with more than a handful of requests.
However, do keep in mind, just for knowledge's sake, that the IP you block might be a shared IP address on that Amazon server. One user on that server running a badly configured crawler that doesn't abide by robots.txt rules might be doing so without the consent or knowledge of the other users on that same server. So if someone else happened to use that server's IP at some point to do some legitimate stuff on your website, that connection would also be blocked. Whereas serving a
403 - Access Denied HTTP response based off the user-agent would still allow the good users on that server to communicate with your server, while blocking the bad.
Now that said, in most cases I'd say go ahead and block the IPs of potential bots that could cause problems, especially if they hit your sites semi-regularly, to help ensure good server availability for human visitors. But in some cases IPs are going to be hard to keep up with, since a bot creator can simply jump from one server IP to another and keep hitting you, so blocking by user-agent, or by anything really unique in the type of requests they're sending to your server, is always going to be a good long-term solution.
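If you do want to go the user-agent route on an Apache server, here's a minimal .htaccess sketch of the idea (this assumes mod_setenvif is available, and the pattern is just the old Firefox 3.0 user-agent from your logs, so adjust it to whatever you actually want to block):
Code:
# Hypothetical example: tag requests with that old user-agent, then deny them
SetEnvIfNoCase User-Agent "Gecko/2008052906 Firefox/3\.0" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Anyone sending that user-agent would get the 403 response, while other visitors coming from the same shared IPs would be unaffected.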
If you do consistently notice that you're getting hit by bots from a certain provider, you could even take a measure as drastic as blocking their entire IP ranges, which I cover in my article about blocking a range of IP addresses.
If that's something you're interested in doing for Amazon's EC2 service, you can find their IP ranges
here.
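As a rough sketch of what blocking a single IP or a whole range could look like in .htaccess (again assuming Apache, and the addresses below are just placeholders rather than real EC2 ranges):
Code:
# Hypothetical example: deny one IP plus a CIDR range
Order Allow,Deny
Allow from all
Deny from 123.123.123.123
Deny from 203.0.113.0/24
You'd just add another Deny from line for each additional IP or range you want to shut out.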
Most legitimate crawlers, like Alexa's, will not have any kind of web browser or OS mentioned in their user-agent, so a crawler-like visitor that does claim to be a normal browser is usually a sign of a bot trying to disguise itself as legitimate traffic. Most good bots will also tell you where to go to find out more about them; in this case, here's Alexa's full user-agent string:
Code:
"ia_archiver (+http://www.alexa.com/site/help/webmasters; crawler@alexa.com)"