  1. #1
    Join Date
    Sep 2001
    Posts
    86

    Question getting data without a browser

    I have noticed that there are several services that grab data from a web site and then use that data to offer a service of some kind.

    Here is an example: overture.com is a click-through advertiser. About a month ago I received several e-mail solicitations from companies offering special reports and services built on Overture's data. I later found out that Overture did not authorize the use of the data. So my question is, how did these companies grab the data without Overture's help or permission? I don't think any laws were broken, because by virtue of being on the web the data is available for all to access. This company wasn't using a web browser to access the data, but must have been using some type of script that pulled the data and put it into their web pages.

    I have also noticed this with many price comparison web sites. They pull the data from several sites without those sites' help and provide a price comparison service.

    This all leads me to my security question: how do you prevent others from accessing your data if you don't want it used in one of these price comparison engines?

    Thanks in advance for any info.

  2. #2
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,416
    You could use mod_rewrite, a redirect, or some similar technique to send such "bots" off to never-never land. If the bot doesn't try to mimic a browser, the user-agent string in your logs will probably be a good clue. You'd need to know what to look for, of course.

    If you find a persistent offender from a specific IP address or range, you can block them too.
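
    As a rough sketch of the mod_rewrite approach above - assuming Apache with mod_rewrite enabled, and assuming the bot announces itself with a distinctive user-agent string (the bot names and IP range below are placeholders; substitute what you actually see in your access log):

    ```apache
    # .htaccess -- refuse requests whose User-Agent matches known scrapers
    RewriteEngine On
    # "BadBot" and "DataGrabber" are hypothetical names for illustration
    RewriteCond %{HTTP_USER_AGENT} (BadBot|DataGrabber) [NC]
    RewriteRule .* - [F,L]

    # Block a persistent offender by IP range (Apache 2.x access control)
    <Limit GET POST>
      Order allow,deny
      Allow from all
      Deny from 192.0.2.0/24
    </Limit>
    ```

    The [F] flag returns a 403 Forbidden; a determined scraper can always change its user-agent string, so treat this as a speed bump rather than a lock.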

  3. #3
    Join Date
    Sep 2001
    Posts
    86
    How exactly does the 'bot' work?

  4. #4
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,416
    A 'bot' is just a program that accesses hosts - web, FTP, whatever - and does something.

    Google and other search engines run programs that access your site, suck data off, index it, follow links, and move on. Consider those 'bots'.

    Well-behaved robots will follow the robots.txt protocol, and if you've denied access to some or all of your site, such engines will move on.
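
    For reference, robots.txt is just a text file at the root of your site. A minimal sketch (the crawler name is a placeholder) - note that compliance is entirely voluntary on the crawler's part:

    ```
    # Keep all robots out of the whole site
    User-agent: *
    Disallow: /

    # Or shut a single named crawler out of one directory
    User-agent: ExampleBot
    Disallow: /prices/
    ```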

    However, in the example you gave - price comparison - it's unlikely that any robot application will respect the robots.txt restrictions. Your only method is to block them based on IP or on what gets reported in the user agent. Check your HTTP access log and you'll see different strings for bots, different browsers, etc.

    To be clear - it's not just browsers driven by humans that can access a web server. For example, if you have access to wget, something as simple as

    wget --mirror http://yoursite.com/

    will suck the whole site down (--mirror removes the recursion depth limit; a plain wget -r stops at the default depth of 5).

  5. #5
    Join Date
    Sep 2001
    Posts
    86
    Ahhh.... I see. We run a site that always has lower prices than our online competitors'. Someone in our office wanted to set it up so that we could display our competitors' prices in real time next to ours, showing that we are the lowest. We sell the same 1000+ products as our competitors and don't have the time to retrieve the prices one by one with a browser. Could we grab all the pricing info at once with the bot you described?

    If yes, how would we put the pricing info into HTML tables and have it display correctly? It seems like we would have to grab the data, assign it to a variable, and then output that variable in our HTML page. Is this possible?

  6. #6
    Join Date
    Nov 2001
    Location
    Vancouver
    Posts
    2,416
    What you would like to do will likely require some custom scripting. Depending on how consistent the other site is, it may be possible to use a parser of some sort to ease the data extraction load.

    Much assembly required, not something we can do in posts here.
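
    To sketch what that custom scripting might look like - assuming the competitor's page uses consistent markup (the sample HTML below is invented for illustration) - Python's standard html.parser can pull out product/price pairs, which you could then drop into your own HTML table:

    ```python
    # Sketch of a screen scraper: parse product/price cells out of a page.
    # SAMPLE_PAGE stands in for the competitor's markup, which is an assumption.
    from html.parser import HTMLParser

    SAMPLE_PAGE = """
    <table>
      <tr><td class="item">Widget</td><td class="price">$9.99</td></tr>
      <tr><td class="item">Gadget</td><td class="price">$14.50</td></tr>
    </table>
    """

    class PriceScraper(HTMLParser):
        """Collects text from <td class="item"> / <td class="price"> cells."""
        def __init__(self):
            super().__init__()
            self.current = None    # class of the <td> we are inside, if any
            self.items = []        # list of (name, price) tuples
            self._pending = None   # item name waiting for its price

        def handle_starttag(self, tag, attrs):
            if tag == "td":
                self.current = dict(attrs).get("class")

        def handle_data(self, data):
            text = data.strip()
            if not text or self.current is None:
                return
            if self.current == "item":
                self._pending = text
            elif self.current == "price" and self._pending is not None:
                self.items.append((self._pending, text))
                self._pending = None

        def handle_endtag(self, tag):
            if tag == "td":
                self.current = None

    def scrape(html):
        parser = PriceScraper()
        parser.feed(html)
        return parser.items

    # In real use the page would come over the network, e.g. with
    # urllib.request.urlopen(...) or cURL, instead of a hard-coded string.
    print(scrape(SAMPLE_PAGE))  # [('Widget', '$9.99'), ('Gadget', '$14.50')]
    ```

    From the returned list you can emit your own table rows next to your prices. The fragile part is the assumption about the markup: the moment the competitor redesigns their page, the scraper needs updating.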

    There may be other avenues open to you - try a Google query for the following (exactly as shown on the line below):

    "screen scraping" html

    And you may find something that can ease the job. But I'm fairly certain you will be doing a whack of scripting to come up with something reliable enough.

    Now if your competitor could be talked into putting all 1000 items in one nice neat table, or better yet, an RSS feed (don't ask), it would be much simpler. Go ahead and ask!
    “Even those who arrange and design shrubberies are under
    considerable economic stress at this period in history.”
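
    For context on the RSS suggestion above: an RSS feed is just an XML file the publisher regenerates whenever the data changes, so a consumer can fetch structured items instead of scraping HTML. A minimal sketch (all names and values invented):

    ```xml
    <?xml version="1.0"?>
    <rss version="2.0">
      <channel>
        <title>Example Price List</title>
        <link>http://competitor.example.com/prices</link>
        <description>Hypothetical feed of current prices</description>
        <item>
          <title>Widget</title>
          <description>$9.99</description>
        </item>
      </channel>
    </rss>
    ```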

  7. #7
    Join Date
    Sep 2001
    Posts
    86
    Hey, thanks for the search idea. That turned up some good information. What about cURL with PHP - would that do the trick?

    OK I'll ask..... what is RSS?

  8. #8
    Join Date
    Sep 2001
    Posts
    86
    Well, I did a little more research today and found that there are several companies that have built applications that do this sort of thing. www.connotate.com seems to be the best match for what I am looking for. I have not had a chance to talk with a salesperson or get pricing info yet. Does anyone have any experience with Connotate?
