Web Hosting Talk







copy all links on a website


walrus21
06-16-2007, 08:20 PM
Hello,
Does anyone know of a way to copy all the links on a website? Say, a short piece of VB code that can go to a site and copy all the links on it? Just wanting to see if this can be done. Thanks, all!

ISPserver
06-18-2007, 10:20 AM
As I understand it, you want a full copy of the site?

If so, you can use any 'Website Copier' or 'Offline Browser'.

mwatkins
06-18-2007, 11:15 AM
If mirroring the site is what the OP wants, wget will provide all that is needed. Install it and read man wget.

But the OP's question suggests that what is wanted is a spider to crawl an entire site and collect links. For a single page, in Python:

import re
from urllib2 import urlopen

def get_links(url):
    # Fetch the page and return the unique values of all double-quoted href attributes.
    return set(re.findall(r'href="(.*?)"', urlopen(url).read()))
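The regex above only matches double-quoted href attributes, and it picks them up from any tag. A hedged alternative sketch for Python 3, using the standard-library html.parser module, might look like the following. LinkCollector and extract_links are names made up for this illustration, it only looks at <a> tags, and it works on an HTML string you have already downloaded:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, with value None for attributes without a value.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.add(value)


def extract_links(html):
    """Return the set of href values found on <a> tags in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Unlike the regex, this also handles single-quoted and unquoted attribute values, at the cost of a few more lines.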

mwatkins
06-18-2007, 11:40 AM
Note that get_links() as defined above is only an example; you'd want to test for various exceptions. My own library of utilities includes something similar, but it returns two values: a list of "internal" links and a list of "external" links.
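As a rough illustration of the exception handling mentioned above, here is a minimal Python 3 sketch (urllib2 became urllib.request in Python 3). The name get_links_safe, the opener parameter, the 10-second timeout, and the choice to decode the page as UTF-8 and return an empty set on failure are all assumptions made for this example, not part of the original function:

```python
import re
from contextlib import closing
from urllib.request import urlopen


def get_links_safe(url, opener=urlopen, timeout=10):
    """Like get_links(), but return an empty set if the page cannot be fetched.

    opener is injectable so the function can be tried without network access.
    """
    try:
        with closing(opener(url, timeout=timeout)) as response:
            body = response.read()
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses (DNS failure, refused
        # connection, 404, ...); ValueError covers malformed URLs.
        return set()
    # Assumption: treat the page as UTF-8 and replace undecodable bytes.
    text = body.decode("utf-8", errors="replace")
    return set(re.findall(r'href="(.*?)"', text))
```

Whether to swallow the error, log it, or re-raise it depends on what the caller needs; returning an empty set is just the simplest option to show.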

Meant to post an example:

>>> links = get_links('http://www.webhostingtalk.com/')
>>> len(links)
485

A subset of the above 485:

>>> len([link for link in links if link.startswith('http')])
47
>>> print [link for link in links if link.startswith('http')]
['http://www.webhostingtalk.com/news/google-joins-intel-hp-ibm-microsoft-in-climate-savers-computing-initiative/', 'http://www.devpapers.com/', 'http://www.pixelpapers.com/', 'http://www.programmingtalk.com/', 'http://www.inetinteractive.com',

[snip]
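The startswith('http') filter above counts every absolute URL, including links that point back to the same site. One way the internal/external split could be sketched in Python 3 is shown below. split_links is a hypothetical name, and comparing host names exactly (so that a www and a non-www host count as different sites) is an assumption of this example:

```python
from urllib.parse import urljoin, urlparse


def split_links(base_url, links):
    """Split links into (internal, external) sets of absolute URLs."""
    base_host = urlparse(base_url).netloc.lower()
    internal, external = set(), set()
    for link in links:
        # Resolve relative links such as "/news/" against the page URL.
        absolute = urljoin(base_url, link)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parts.netloc.lower() == base_host:
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external
```

Called as split_links('http://www.webhostingtalk.com/', links), this would put 'http://www.devpapers.com/' in the external set and any link on www.webhostingtalk.com in the internal set.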

I won't post "example" spider code because poorly written / ill-behaved web spiders are an abomination.

jts-online
06-18-2007, 12:30 PM
You can manage this using Perl or HTTrack.

jerrysanders
06-18-2007, 11:50 PM
If you want to download an entire website with HTTrack, you could use a line like this:

httrack http://www.somesite.com --depth=3 --ext-depth=0 --path=/home/httrack/ --sockets=1 --clean -*.gif -*.jpg -*.png -*.ico -*.css -*.js

Hope that helps :)