Web Hosting Talk







How to write a program (Java or C) to download a whole web site


david hu
03-24-2004, 12:57 AM
Hi,

How do I write a program to download all the web pages of a specific web site? Ideally I could download both the static pages and the dynamic pages (like ASP, JSP, or servlet-generated pages) through my own program.

David

robeyh
03-24-2004, 01:19 AM
If you're doing this server-side just use tar + gz or bz2. Client side you're never going to be able to get the dynamic pages unless you know the architecture of the site in so much detail that you must have access to the server.

Really, you likely don't need to write a program to do this. wget already has a "mirroring" feature (curl fetches individual URLs and has no recursive mode). I'm sure there are programs written for Windows as well.

tigre
03-24-2004, 02:19 AM
You will only be able to download the HTML output unless you are FTPing or SSHing into an account.

If I understand correctly, you want a program that automatically gets all files of a certain site. Try searching for spiders.

A spider starts at a start page and moves on to the links on that page. If programmed to do so, it can download the files and images. The spider recursively visits all pages of the site, restricting itself to the specific domain.
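
The spider described above can be sketched in Java roughly as follows. This is only a minimal illustration, not a finished crawler: the class name SpiderSketch, the regex-based link extraction, and the maxPages limit are assumptions made for the example, and a real spider would use a proper HTML parser, respect robots.txt, and save each page to disk. The fetch step is passed in as a function so the crawl logic can be tried without a network connection.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpiderSketch {
    // Crude href matcher (assumption for the sketch); fragments after '#' are dropped.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    /** Resolves every href found in the page against the page's own URL. */
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    /** Breadth-first crawl from start; only links on start's host are followed. */
    static Set<URI> crawl(URI start, Function<URI, String> fetch, int maxPages) {
        Set<URI> visited = new LinkedHashSet<>();   // every URL we attempted
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI page = queue.poll();
            if (!visited.add(page)) continue;       // already seen
            String html = fetch.apply(page);
            if (html == null) continue;             // fetch failed
            for (URI link : extractLinks(page, html)) {
                boolean sameHost = start.getHost() != null
                        && start.getHost().equalsIgnoreCase(link.getHost());
                if (sameHost && !visited.contains(link)) queue.add(link);
            }
        }
        return visited;
    }

    /** Fetches one page over HTTP (Java 11+); returns null on failure. */
    static String httpFetch(URI page) {
        try {
            HttpRequest req = HttpRequest.newBuilder(page).GET().build();
            return CLIENT.send(req, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException | IllegalArgumentException e) {
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: java SpiderSketch.java <start-url>");
            return;
        }
        // Start URLs should include a path, e.g. http://example.com/
        crawl(URI.create(args[0]), SpiderSketch::httpFetch, 100)
            .forEach(System.out::println);
    }
}
```

Note that this only ever sees what the server sends back, so for dynamic pages it collects the generated HTML, not the code behind it.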

david hu
03-24-2004, 10:18 AM
What I really want is software (ideally with source code) that can download the web sites I choose every 24 hours (or at whatever interval I specify). I would also like to be able to configure how many sites to download.

robeyh said that "Client side you're never going to be able to get the dynamic pages unless you know the architecture of the site in so much detail that you must have access to the server." Why can't the software do that on the client side? I do want to download sites that generate their pages dynamically.

stdunbar
03-24-2004, 12:06 PM
Do a Google search for wget. wget -r will traverse an entire site for you. wget can also log into the site, depending on the security it uses, and it supports SSL.
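
For david's request to re-download a configurable list of sites every so many hours, one option is a small Java wrapper that runs wget on a timer. The following is a hedged sketch, not a tested tool: the class name MirrorScheduler and its command-line arguments are made up for the example, it assumes wget is installed and on the PATH, and it uses wget's --mirror option (recursive download with timestamping) rather than plain -r.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class MirrorScheduler {
    /** Builds the wget invocation that mirrors one site into the current directory. */
    static List<String> buildCommand(String siteUrl) {
        return List.of("wget", "--mirror", "--no-parent",
                       "--convert-links", "--page-requisites", siteUrl);
    }

    /** Runs wget once for one site and reports its exit code. */
    static void mirrorOnce(String siteUrl) {
        try {
            Process p = new ProcessBuilder(buildCommand(siteUrl)).inheritIO().start();
            System.out.println(siteUrl + " finished with exit code " + p.waitFor());
        } catch (IOException e) {
            System.err.println("could not run wget for " + siteUrl + ": " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: java MirrorScheduler.java <hours> <url> [<url> ...]");
            return;
        }
        long hours = Long.parseLong(args[0]);   // must be greater than zero
        List<String> sites = List.of(args).subList(1, args.length);
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Mirror every site now, then again every <hours> hours until the JVM is stopped.
        timer.scheduleAtFixedRate(
            () -> sites.forEach(MirrorScheduler::mirrorOnce), 0, hours, TimeUnit.HOURS);
    }
}
```

On a Unix machine a cron entry that calls wget directly would do the same job with no Java at all; the wrapper is only worth it if you want the schedule and site list under your own program's control.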

robeyh
03-24-2004, 12:27 PM
david-

The problem is that with dynamic pages, the page given to the client is not the actual page, only the output the server-side code generated for that request. To properly mirror dynamic pages you need the code itself as well as any information sources used by that code. For static pages wget should work just fine.
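
A toy illustration of that point, with everything in it invented for the example (the class DynamicPageSketch, the stock map standing in for a database, and the render method standing in for a JSP or servlet): the client can save the HTML that render returns, but never the method or the data behind it, so the saved copy goes stale as soon as the data changes.

```java
import java.util.HashMap;
import java.util.Map;

class DynamicPageSketch {
    // Stands in for the site's database; a client never sees this.
    static final Map<String, Integer> stock = new HashMap<>();

    /** Server-side code: builds the HTML from the current data on each request. */
    static String render(String item) {
        int count = stock.getOrDefault(item, 0);
        return "<html><body>" + item + ": " + count + " in stock</body></html>";
    }

    public static void main(String[] args) {
        stock.put("widget", 5);
        String mirrored = render("widget");   // what a client-side mirror would save
        stock.put("widget", 2);               // the data changes on the server
        System.out.println("saved copy: " + mirrored);
        System.out.println("live page:  " + render("widget"));
    }
}
```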

nnormal
03-24-2004, 12:42 PM
this might be the equivalent of duck hunting with an anti aircraft gun, but what you want is handled very well by Subversion + WebDAV + Apache. Not only will it allow you to download your entire site, but it will also keep track of every revision you have made to it.

http://subversion.tigris.org/

stdunbar
03-24-2004, 12:48 PM
Thanks nnormal - that's the best expression I've heard in a while! :D


Originally posted by nnormal
this might be the equivalent of duck hunting with an anti aircraft gun

robeyh
03-24-2004, 12:56 PM
:) exactly what I do. And I happen to enjoy duck hunting with an aa gun.

I think that david's problem is that he wants to do this for sites that he doesn't have server access to.

astraeuz
03-24-2004, 01:48 PM
Offline Explorer Pro/Enterprise is highly recommended software for this task on Windows. Try http://www.metaproducts.com/