    Using a spider to check for copyright violation

    I have invested time in putting together a text document that I have openly shared with an online community. Now I believe that someone might be claiming authorship of my work. Searching for a string on Google returns the expected hits, but my "nemesis" probably would not be indexed by a search engine.

    Could I create (or download) a spider that would search for a specific string in a text file and merely report the URL back to me via email when found?
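The core check described here (does a given page contain a specific string?) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`normalize`, `contains_phrase`, `fetch_text`, `check_urls`) are hypothetical, the URLs to check must already be known, and whitespace and case are normalized so that re-wrapped copies of the text still match.

```python
import re
import urllib.request


def normalize(text):
    """Lowercase and collapse whitespace so re-wrapped text still matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_phrase(page_text, phrase):
    """Return True if the phrase occurs in the page text after normalizing both."""
    return normalize(phrase) in normalize(page_text)


def fetch_text(url, timeout=10):
    """Download a URL and decode it, falling back to UTF-8 if no charset is sent."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def check_urls(urls, phrase, fetch=fetch_text):
    """Return the URLs whose content contains the phrase; skip pages that fail to load."""
    hits = []
    for url in urls:
        try:
            text = fetch(url)
        except (OSError, ValueError):
            # Network errors, HTTP errors and malformed URLs are skipped.
            continue
        if contains_phrase(text, phrase):
            hits.append(url)
    return hits
```

Picking a distinctive sentence from the document as the phrase should keep false positives low. The `fetch` parameter is there so the matching logic can be tried against local text without touching the network.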

    I'm aware of the philosophical issue about open source, and my having freely placed it on the web. I'm specifically interested in the mechanics of searching web sites for the contents of text files.



    I've never dealt with bots and spiders... yet.


    It's possible... I've done one before (to a limited extent). It just depends on how much time and bandwidth you have available.

    If you know how to program, you could write a simple application that can run anywhere, using a control such as INET or Winsock to fetch pages and search them. You could then use the same type of control to email you the results when it finds something.
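The "email you the results" step might look like the following. This is a sketch in Python using the standard `smtplib` and `email` modules rather than the INET or Winsock controls mentioned above; the function names, the addresses, and the mail server host and port are all placeholders to be replaced with real values.

```python
import smtplib
from email.message import EmailMessage


def build_report(hits, phrase, sender, recipient):
    """Build a plain-text email listing every URL where the phrase was found."""
    msg = EmailMessage()
    msg["Subject"] = f"Spider report: {len(hits)} page(s) matched"
    msg["From"] = sender
    msg["To"] = recipient
    lines = [f"Phrase searched: {phrase}", ""] + list(hits)
    msg.set_content("\n".join(lines))
    return msg


def send_report(msg, host="localhost", port=25):
    """Send the message through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

For example, `send_report(build_report(hits, phrase, "me@example.com", "me@example.com"))` would mail the list to yourself, assuming a mail server is listening on the given host and port.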

    I suggest asking around about this type of thing.

    The information that I find on the Internet about spidering is about how to use a spider to improve your standing on a search engine, or to download all the MP3s or porn from a given website.

    I obviously don't have the URL for the site. I was told by an associate to check a URL and I found my work credited to someone else. They told me they got it from another URL which I also checked and contacted. I've been naive about this. I don't intend to bring anyone to court, but I'd like to be able to email them and ask them to remove the content (or else credit me for the work).

    After finding these two, I googled them and found they weren't indexed on the major search engines. I'm sure there are more out there, and I would guess that a spider would be the most efficient way to find them. Maybe I'm barking up the wrong tree. Is there a better way?



    Well, without a starting point, there is no way for your spider to find your content. The whole point of hyperlinks is so that there is a way to reach pages from other content.

    If you have a page that is never linked from another page, it is very unlikely that your spider will find it. There are a lot of search engines other than the popular Google, MSN or Yahoo.
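To make the point about starting points concrete, here is a minimal breadth-first crawler sketch in Python. It can only ever reach pages that are linked, directly or indirectly, from the seed URLs it is given. The names (`LinkParser`, `extract_links`, `crawl`) are hypothetical, the `fetch` argument is any function that returns a page's HTML for a URL, and the sketch does not honor robots.txt or rate limits, which a real spider should.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs (fragments removed) for all links in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl from the seeds; return the URLs fetched, in order."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        visited.append(url)
        for link in extract_links(url, html):
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A page that nothing links to never enters the queue, so the choice of seeds matters more than the crawler itself.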

    Think of it this way -- how would you find where your content was posted? If you aren't able to locate it "pragmatically" -- that is, from typing a keyword in a search engine -- then it will be very difficult to write a spider that does the same.

    There are other issues with this -- how will you prove you hold the copyright to the content?

    Good point. I don't know how to start. That may end my quest right there! I guess that I would look at sites that had similar content, and harvest the URLs of where people had surfed from. I know this is available to the web host (it is on my site) but may not be available to a bot. I know these people would not have links to their sites, so the breadcrumbs that I would be looking for would be the users.

    If they posted to the site maybe. I don't know. This is not my area of expertise hence my posting here.

    As for how I would resolve it afterwards, there is no legal remedy since this isn't legally protected. I'm counting on people doing the right thing after I contact them. But that isn't really the point of the post.


    You could use DMOZ/ODP data for a starting point, along with referrer logs from your web host.
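As a rough illustration of the referrer-log idea, the Python sketch below pulls external referrer URLs out of access log lines so they can be used as seeds for a spider. It assumes the host writes logs in the Apache "combined" format (with quoted referrer and user-agent fields at the end of each line); other servers may differ. The function name `referrer_seeds` and the `own_host` parameter are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Apache "combined" format (an assumption about the host's logs):
# host ident user [date] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"'
)


def referrer_seeds(log_lines, own_host):
    """Return unique external referrer URLs, in the order first seen."""
    seeds = []
    seen = set()
    for line in log_lines:
        match = COMBINED.match(line)
        if not match:
            continue  # line is not in the assumed format
        referer = match.group("referer")
        parts = urlsplit(referer)
        if parts.scheme not in ("http", "https"):
            continue  # "-" or empty means no referrer was sent
        if parts.hostname is None or parts.hostname == own_host:
            continue  # ignore visits that came from your own pages
        if referer not in seen:
            seen.add(referer)
            seeds.append(referer)
    return seeds
```

The returned list only covers pages that visitors actually followed a link from, so it complements a directory-based seed list rather than replacing it.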

    How much bandwidth would you have available to do this?

