Web Hosting Talk

View Full Version : best way to crawl images on other sites?


jjk2
06-20-2008, 10:35 PM
I am trying to make a small image search engine.

I am able to do this simply by using cURL:

1) download the index page
2) parse it
3) output the images as thumbnails
4) when the user clicks on an image, they go to the actual site

But here's my dilemma: it will make frequent, repeated connections, going through step 1) several thousand times. I think this could get me banned by the site owner.

This gets really complicated. I guess I could use wget with a random proxy, but that costs a lot of time and memory.

Another question: is it better to store the parsed contents in a DB and recall them every time someone makes a search query? That would not return the latest images from a website, though.

How do Google Images or Yahoo Images do this, then? Is there an open-source solution I can build on?

Many thanks for reading.

Xeentech
06-21-2008, 01:11 AM
Why would you ever have to download the index page "several thousand times"? I'd certainly ban you if you were constantly re-downloading the same page.

When a user posts a link to a site on one of my forum/CMS/media/social sites, the back end downloads the page, parses the HTML for <img /> and other media, and grabs the media. It then shows the selection to the user so they can pick the icon.

Why would you have to request it thousands of times?

jjk2
06-21-2008, 02:32 AM
You'd have to use file_get_contents, wget, or curl to retrieve the HTML page containing the <img> tags.

Then the <img> tags can be parsed.

Now, if a user on my site makes a search query, the script will end up doing many wgets on many HTML pages.

That could add up to enough requests to get me banned.

I'm trying to find a less bandwidth-intensive method.
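
One way to cut those repeated requests, sketched here in Python as an assumption rather than anything posted in the thread: crawl each page once, store the image URLs it contains in a local table, and answer search queries from that table only. The schema and the function names (open_index, store_images, search_images) are hypothetical.

```python
import sqlite3

def open_index(path=":memory:"):
    """Open (or create) a local image index. The schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        " page_url TEXT NOT NULL,"
        " image_url TEXT NOT NULL,"
        " alt_text TEXT NOT NULL DEFAULT '',"
        " UNIQUE (page_url, image_url))"
    )
    return conn

def store_images(conn, page_url, images):
    """Record (image_url, alt_text) pairs found on one crawled page.

    Re-storing the same page replaces its rows instead of duplicating them.
    """
    conn.executemany(
        "INSERT OR REPLACE INTO images (page_url, image_url, alt_text)"
        " VALUES (?, ?, ?)",
        [(page_url, image_url, alt_text) for image_url, alt_text in images],
    )
    conn.commit()

def search_images(conn, term):
    """Answer a search from the local index; no remote request is made.

    Matching is a plain substring test on the alt text, which is an
    assumption about what "search" means here. LIKE wildcards in the
    term are not escaped in this sketch.
    """
    rows = conn.execute(
        "SELECT image_url, page_url FROM images"
        " WHERE alt_text LIKE ? ORDER BY image_url",
        ("%" + term + "%",),
    )
    return rows.fetchall()
```

With this layout a search costs one local query, and the remote sites are only contacted when the crawler decides to refresh a page.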

case
06-21-2008, 03:24 AM
I'd recommend ditching curl and wget and looking at a solution like WWW::Mechanize.

Here is a quick script that will return all the URLs of the images on any specific page.

Expanding this script to crawl multiple sites would be simple: convert $url to an array and step through each site with a loop.




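#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;

my $url = 'http://creativecommons.org/image';

# autocheck => 0 so that we report a failed request ourselves below
my $browser = WWW::Mechanize->new( autocheck => 0 );

# get() returns a response object even on failure, so test success() instead
$browser->get( $url );
die "Unable to get $url!\n" unless $browser->success();

# Print the src of every <img> on the page, exactly as written in the HTML
foreach my $img ( $browser->find_all_images() ) {
    print $img->url() . "\n";
}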


Which outputs:


http://creativecommons.org/wp-content/themes/cc4/images/license-8.png
http://creativecommons.org/wp-content/themes/cc4/images/find-8.png
http://creativecommons.org/wp-content/themes/cc4/images/cc-title-8.png
/images/categories/image.png
/images/features/150illegalart.jpg
/images/features/150flickr.jpg
/images/commons/sc.png
/images/commons/cci.png
/images/commons/learn.png
http://i.creativecommons.org/l/by/3.0/88x31.png
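
The output above mixes absolute URLs with site-relative paths such as /images/commons/sc.png, so a crawler has to resolve each one against the page URL before it can fetch or link to the image. A minimal Python sketch of that step, using only the standard library; the names ImgCollector and image_urls are made up for illustration, and this is not a description of what WWW::Mechanize does internally.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def image_urls(page_url, html):
    """Return absolute image URLs for one page.

    Relative paths are resolved against page_url; a <base href> tag,
    if present, is ignored in this sketch.
    """
    collector = ImgCollector()
    collector.feed(html)
    collector.close()
    return [urljoin(page_url, src) for src in collector.sources]
```

Called with the page URL from the script above, a src of /images/commons/sc.png would come back as http://creativecommons.org/images/commons/sc.png, while URLs that are already absolute pass through unchanged.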

jjk2
06-21-2008, 03:38 AM
That's bloody amazing!

What is WWW::Mechanize's advantage over curl and wget?

That Perl script looks a lot simpler than PHP.

I'd better start learning Perl...

dollar
06-21-2008, 04:14 AM
You should be caching the images on your local machine as a courtesy to the website owners. If you've ever searched with Google Images, you'll know that it's often out of date with the actual images (they're broken, different, on a different page, etc.).

I would guess that they check at a set interval (say, every 2 weeks) whether an image has been updated, and if so they update their database with the new image.
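
The interval-based refresh guessed at above could look roughly like this in Python. The two-week default only reuses the example figure from this post and is not a known value for any search engine; RefreshCache and its fetch callback are a hypothetical design.

```python
import time

TWO_WEEKS = 14 * 24 * 60 * 60  # seconds; example interval only

class RefreshCache:
    """Keep one cached copy per URL and re-fetch only after max_age seconds."""

    def __init__(self, fetch, max_age=TWO_WEEKS, clock=time.time):
        self._fetch = fetch      # callable: url -> content (does the real download)
        self._max_age = max_age
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._entries = {}       # url -> (fetched_at, content)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._max_age:
            return entry[1]      # still fresh: no request to the remote site
        content = self._fetch(url)
        self._entries[url] = (now, content)
        return content
```

Each remote URL is then requested at most once per interval, however many visitors ask for it, at the cost of sometimes serving a copy that is out of date.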

jjk2
06-21-2008, 03:32 PM
I think a better solution now is simply scraping the results returned by Google or Yahoo.

I think this is much more efficient, but I don't know if this is allowed or not.

aditya2071990
06-25-2008, 06:13 AM
but I don't know if this is allowed or not.

Of course it won't be allowed... Imagine if I built an image search that scrapes your results. It's like saying, 'You do all the hard work of crawling the web, while I simply sit back and steal all your results, and won't even give you credit.'

It ain't advisable, mate... try and think of something else.