    Robots.txt Cloaking Trick by WebMasterWorld

    Is anyone familiar with this trick? Details,
    searchenginegenie dot com/seo-blog/2005/12/webmasterworld-cloaks-robotstxt-file dot html

    I'm not clear on how to implement it.

    Basically it looks like a robots whitelist technique. Before, they were using robots.txt as a blacklist and spending one day a week trying to keep up with the bad guys proliferating. Now they just whitelist a few big engines and reject the rest. But I'm not clear how.

    The page you're referencing is coming up on 4 years old. Google reserves the right to remove sites like this and I'd bet that other search engines are the same.
    Your Opinion Quite Irrelevant

    I'm not interested. Let webmasters decide whether the technique is "appropriate." This thread is a technical discussion not an editorial page.

    If Google wishes to help, it could offer a proper blacklist technique. Instead it's busy with things like
    archlinux dot me/dusty/2009/08/23/why-im-quitting-gmail/

    The guy said he spent one day a week fighting robots. Four years later, I don't see that things have improved.

    And actually...

    Google *is* indexing his site...anyway.

    He's a webmaster's webmaster. He runs a webmaster's forum. He used to *write a blog* inside his robots.txt comments. Give me a break? I'm not going to second guess. He's been there and done that.

    No objection holds water. Remember, robots.txt is *already* meant to enable/disable robots selectively. This 'trick' doesn't change the design, its ethics, or its syntax.

    If a webmaster always returns valid robots.txt format, there is no ethics violation. He applies different logic to enable/disable, that's all.

    The big boys themselves admit robots.txt is not sufficient! Google and Yahoo have custom robots.txt directives. So *they* violate the rules. It gets a little worse, too. Google *has* been caught not obeying robots.txt.

    And if you think a blacklist is simpler than a whitelist, I invite you to write rules for the 300,000 user agents listed at botsvsbrowsers dot com .

    sorry - i missed this post the first time around.

    > how does this work

    You just add TXT as an executable extension in your httpd.conf file. Then just set the file up as a script. It gets executed on every request for robots.txt, so it can look at who is asking and decide what to send back.
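    For anyone who wants to see the whitelist logic itself, here is a minimal sketch in Python. It is not the script WebmasterWorld ran (that was never posted here), and it uses a small WSGI app instead of the Apache executable-extension setup described above. The names ALLOWED_BOTS, robots_txt_for and app are made up for illustration, and the bot tokens in the whitelist are only examples. Keep in mind that a User-Agent header can be faked, so this only sorts out bots that identify themselves honestly.

```python
# Hypothetical whitelist of user-agent substrings; adjust to taste.
ALLOWED_BOTS = ("googlebot", "slurp", "msnbot")

# Both responses are valid robots.txt; only the rules differ.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
DENY_ALL = "User-agent: *\nDisallow: /\n"


def robots_txt_for(user_agent):
    """Return the robots.txt body to serve for a given User-Agent string."""
    ua = (user_agent or "").lower()
    if any(bot in ua for bot in ALLOWED_BOTS):
        return ALLOW_ALL
    return DENY_ALL


def app(environ, start_response):
    """WSGI entry point; assumes the server routes /robots.txt to it."""
    body = robots_txt_for(environ.get("HTTP_USER_AGENT", "")).encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

    Whitelisted bots get an empty Disallow (crawl everything); everyone else, including requests with no User-Agent at all, gets "Disallow: /".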

    The rest of the story:

    We ran that robots.txt script for about 6 years. At this point, we have given up on both blacklists and whitelists. The really bad bots ignore robots.txt entirely. The good bots have extended the standard so much that it is no longer anything like the original. The robots.txt system is a complete joke. (Note that the robots.txt "recommendation" was never approved by any web standards organization on the planet - even the search engines don't agree on all syntax and proprietary extensions.)

    > cloaking

    This is not cloaking. Cloaking is showing visitors one thing and the search engines another. Robots.txt is not intended for humans - it is intended for bots. If you show the code that produces the page - is it really cloaking? No on both counts.

    The only real protection you can offer your site is to take advantage of the so-called 'first click free' program. Run an IP tracker and block those that abuse your site after X number of page views.
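    A rough sketch of what such an IP tracker could look like, in Python. This is only an illustration of the "block after X page views" idea, not the tracker actually used on the site. The class name PageViewLimiter and its parameters are hypothetical, and counting views inside a sliding time window is an assumption on my part, since the post only says "X number of page views". It keeps counts in memory, so a real site would need shared storage across server processes.

```python
import time
from collections import defaultdict, deque


class PageViewLimiter:
    """Tracks page views per IP and refuses IPs that exceed a limit."""

    def __init__(self, max_views=100, window_seconds=3600):
        self.max_views = max_views            # the "X" page views
        self.window_seconds = window_seconds  # assumed sliding window
        self._hits = defaultdict(deque)       # ip -> timestamps of views

    def allow(self, ip, now=None):
        """Record a page view for ip; return False if it should be blocked."""
        now = time.time() if now is None else now
        hits = self._hits[ip]
        cutoff = now - self.window_seconds
        # Forget views that have aged out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()
        if len(hits) >= self.max_views:
            return False
        hits.append(now)
        return True
```

    The request handler would call allow() with the client's address on every page view and return a 403 (or a login page) when it comes back False.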

