
07-02-2002, 03:03 PM
WHT Addict
Join Date: Feb 2002
Location: UK
Posts: 120
Hi all,
Is wget the best way to download a complete copy of a website? Parts of the site require a username and password. Has anyone used it for this before?
Can lynx do a good job of it too?
Is there a better Linux utility for this?
Any help much appreciated,
Infinite 
__________________
Do'h!

07-02-2002, 03:25 PM
Web Hosting Master
Join Date: Apr 2001
Location: Montana USA
Posts: 673
wget -m, I think. From the help output:
wget --help
GNU Wget 1.7, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...
Mandatory arguments to long options are mandatory for short options too.
Startup:
-V, --version display the version of Wget and exit.
-h, --help print this help.
-b, --background go to background after startup.
-e, --execute=COMMAND execute a `.wgetrc'-style command.
Logging and input file:
-o, --output-file=FILE log messages to FILE.
-a, --append-output=FILE append messages to FILE.
-d, --debug print debug output.
-q, --quiet quiet (no output).
-v, --verbose be verbose (this is the default).
-nv, --non-verbose turn off verboseness, without being quiet.
-i, --input-file=FILE download URLs found in FILE.
-F, --force-html treat input file as HTML.
-B, --base=URL prepends URL to relative links in -F -i file.
--sslcertfile=FILE optional client certificate.
--sslcertkey=KEYFILE optional keyfile for this certificate.
Download:
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host.
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits).
-O --output-document=FILE write documents to FILE.
-nc, --no-clobber don't clobber existing files or use .# suffixes.
-c, --continue resume getting a partially-downloaded file.
--dot-style=STYLE set retrieval display style.
-N, --timestamping don't re-retrieve files unless newer than local.
-S, --server-response print server response.
--spider don't download anything.
-T, --timeout=SECONDS set the read timeout to SECONDS.
-w, --wait=SECONDS wait SECONDS between retrievals.
--waitretry=SECONDS wait 1...SECONDS between retries of a retrieval.
-Y, --proxy=on/off turn proxy on or off.
-Q, --quota=NUMBER set retrieval quota to NUMBER.
Directories:
-nd --no-directories don't create directories.
-x, --force-directories force creation of directories.
-nH, --no-host-directories don't create host directories.
-P, --directory-prefix=PREFIX save files to PREFIX/...
--cut-dirs=NUMBER ignore NUMBER remote directory components.
HTTP options:
--http-user=USER set http user to USER.
--http-passwd=PASS set http password to PASS.
-C, --cache=on/off (dis)allow server-cached data (normally allowed).
-E, --html-extension save all text/html documents with .html extension.
--ignore-length ignore `Content-Length' header field.
--header=STRING insert STRING among the headers.
--proxy-user=USER set USER as proxy username.
--proxy-passwd=PASS set PASS as proxy password.
--referer=URL include `Referer: URL' header in HTTP request.
-s, --save-headers save the HTTP headers to file.
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION.
--no-http-keep-alive disable HTTP keep-alive (persistent connections).
--cookies=off don't use cookies.
--load-cookies=FILE load cookies from FILE before session.
--save-cookies=FILE save cookies to FILE after session.
FTP options:
-nr, --dont-remove-listing don't remove `.listing' files.
-g, --glob=on/off turn file name globbing on or off.
--passive-ftp use the "passive" transfer mode.
--retr-symlinks when recursing, get linked-to files (not dirs).
Recursive retrieval:
-r, --recursive recursive web-suck -- use with care!
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite).
--delete-after delete files locally after downloading them.
-k, --convert-links convert non-relative links to relative.
-K, --backup-converted before converting file X, back up as X.orig.
-m, --mirror shortcut option equivalent to -r -N -l inf -nr.
-p, --page-requisites get all images, etc. needed to display HTML page.
Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions.
-R, --reject=LIST comma-separated list of rejected extensions.
-D, --domains=LIST comma-separated list of accepted domains.
--exclude-domains=LIST comma-separated list of rejected domains.
--follow-ftp follow FTP links from HTML documents.
--follow-tags=LIST comma-separated list of followed HTML tags.
-G, --ignore-tags=LIST comma-separated list of ignored HTML tags.
-H, --span-hosts go to foreign hosts when recursive.
-L, --relative follow relative links only.
-I, --include-directories=LIST list of allowed directories.
-X, --exclude-directories=LIST list of excluded directories.
-nh, --no-host-lookup don't DNS-lookup hosts.
-np, --no-parent don't ascend to the parent directory.
Mail bug reports and suggestions to <bug-wget@gnu.org>.
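For the password-protected part of the original question, the options listed above suggest combining -m with --http-user and --http-passwd. Here is a minimal sketch, assuming the site uses HTTP Basic authentication rather than a login form. The function name, URL, user name, and password are placeholders, and newer Wget releases spell the password option --http-password.

```shell
#!/bin/sh
# Mirror a password-protected area with the options from the help text above.
# Assumptions: HTTP Basic auth; every value passed in is a placeholder.
# WGET can be overridden (for example WGET=echo) to do a dry run.
mirror_site() {
    url=$1
    user=$2
    pass=$3
    # -m   mirror (recursion, timestamping, infinite depth)
    # -np  do not ascend above the starting directory
    "${WGET:-wget}" -m -np --http-user="$user" --http-passwd="$pass" "$url"
}

# Example call (hypothetical site and credentials):
# mirror_site http://www.example.com/members/ someuser somepass
```

Keep in mind that a password given on the command line is visible to other users in the process list while the command runs.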
__________________
John Masterson
Former Hosting Company Owner

07-04-2002, 05:17 AM
WHT Addict
Join Date: Feb 2002
Location: UK
Posts: 120
Thanks magnafix. Has anyone used wget regularly? Would you recommend setting it up as a cron job?
Cheers,
Infinite 
__________________
Do'h!

07-04-2002, 05:26 AM
Fool about Town
Join Date: Sep 2001
Location: Madras
Posts: 737
I have used (and still use) wget, and it just rocks.
For mirroring, use:
wget -m -nH http://sourcesite.com
The -nH (no host directories) option avoids copying everything under a directory called sourcesite.com.
I haven't tried password-protected directories, but they should work just as well because the .htaccess is also copied.
Cheers
Balaji
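On the earlier cron question, the command above can be wrapped in a small script so that it writes to a fixed directory and logs to a file instead of the terminal. This is only a sketch: the function name, paths, and schedule are made up for illustration, and only options from the help listing earlier in the thread are used.

```shell
#!/bin/sh
# Mirror a site into a fixed directory, in a form suitable for cron.
# Assumptions: the URL, destination and log path are placeholders.
# WGET can be overridden (for example WGET=echo) to do a dry run.
mirror_to_dir() {
    url=$1
    dest=$2
    log=$3
    # -m   mirror: recursion plus timestamping, so unchanged files are skipped
    # -nH  no host directory, files land directly under $dest
    # -P   directory prefix to save into
    # -o   write log messages to a file
    "${WGET:-wget}" -m -nH -P "$dest" -o "$log" "$url"
}

# Example call (hypothetical paths):
# mirror_to_dir http://sourcesite.com/ /var/www/mirror /var/log/mirror.log
#
# Example crontab entry, daily at 03:15 (script path is hypothetical):
# 15 3 * * * /usr/local/bin/mirror-site.sh
```

Because -m turns on timestamping, repeat runs should only re-fetch files that changed on the server, which is what makes a scheduled job reasonable.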
__________________
Offering Managed Servers - for an exclusive clientèle who value uptime, caring support and superior technology.

07-04-2002, 05:31 AM
WHT Addict
Join Date: Feb 2002
Location: UK
Posts: 120
Quote:
Originally posted by MotleyFool
I have used / use wget and it just rocks..
Thanks MotleyFool, I'll give it a try!
Cheers,
Infinite 
__________________
Do'h!

07-04-2002, 03:47 PM
Web Hosting Master
Join Date: Nov 2001
Location: Vancouver
Posts: 2,416
Assuming Apache: if you need wget to copy .htaccess files, you might have to alter your httpd.conf and comment out the following section:
<Files ~ "^\.ht">
Order allow,deny
Deny from all
Satisfy All
</Files>
By default, most Apache installs prevent web clients (including wget) from viewing .htaccess contents.
I could be wrong, since I've only used wget to mirror visible files, but it's something to check.

07-05-2002, 09:00 AM
WHT Addict
Join Date: Feb 2002
Location: UK
Posts: 120
Thanks mwatkins, I'll make a copy of .htaccess by hand and make sure the other server keeps the same password.
Cheers,
Infinite 
__________________
Do'h!

07-05-2002, 10:53 AM
Web Hosting Master
Join Date: Dec 2001
Location: Singapore
Posts: 747
Hi.
Does wget -m -nH http://site1.com also work on CGI scripts and PHP pages?
I mean, if I run it on a forum site, will it work too? For example:
wget -m -nH http://webhostingtalk.com
Just curious.

__________________
fulltime sysadmin since 1997!

07-05-2002, 08:48 PM
Web Hosting Master
Join Date: Jul 2001
Posts: 889
admin0: it will run, but the result won't be useful for any practical purpose. WHT is served by PHP scripts, and you would be downloading only the HTML those scripts output, not the scripts themselves... if that made any sense.. hehe..

07-05-2002, 10:27 PM
Junior Guru Wannabe
Join Date: Jun 2002
Posts: 80
What if your site has a MySQL database? Does it work too?

07-06-2002, 08:49 AM
WHT Addict
Join Date: Feb 2002
Location: UK
Posts: 120
aquos, wget fetches HTML files from the web server just as your web browser does. It won't get any MySQL tables or databases, just as your browser won't.
admin0, if you wanted to mirror a forum, you would have to do it at the database level; that's where all the info is, including posts, profiles, etc. You could set up a cron job to dump the database, zip it up, and place it in a password-protected area. The other server could then wget the dump file, unzip it, and load it into MySQL. Someone may have a simpler way, though.
wget will only take a copy of the pages (HTML, images, etc.) that it finds on the server; it's similar to a search engine spider.
HTH,
Infinite 
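The dump-and-fetch idea above could look roughly like the sketch below. It is an illustration under several assumptions, not a tested setup: the function names, database name, paths, URL, and credentials are all placeholders, gzip stands in for "zip", and mysqldump and mysql are assumed to pick up their login details from a client config file such as ~/.my.cnf.

```shell
#!/bin/sh
# Source server: dump one database and compress it into a protected web directory.
# Assumptions: database name and output path are placeholders.
dump_db() {
    db=$1
    out=$2
    # mysqldump writes SQL to stdout; gzip compresses it on the fly.
    "${MYSQLDUMP:-mysqldump}" "$db" | gzip > "$out"
}

# Mirror server: fetch the dump with HTTP auth and load it into MySQL.
# Assumptions: URL, user, password and database name are placeholders.
fetch_and_load() {
    url=$1
    user=$2
    pass=$3
    db=$4
    # -q    quiet
    # -O -  write the downloaded file to stdout so it can be piped
    "${WGET:-wget}" -q -O - --http-user="$user" --http-passwd="$pass" "$url" \
        | gunzip | "${MYSQL:-mysql}" "$db"
}

# Example calls (hypothetical names), each run from cron on its own server:
# dump_db forum /var/www/private/forum.sql.gz
# fetch_and_load http://www.example.com/private/forum.sql.gz someuser somepass forum
```

Piping the download straight into mysql avoids leaving an unpacked copy of the dump on disk, at the cost of having nothing to inspect if the load fails.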
__________________
Do'h!

07-06-2002, 10:46 AM
Web Hosting Master
Join Date: Dec 2001
Location: Singapore
Posts: 747
was just curious !!

__________________
fulltime sysadmin since 1997!

07-06-2002, 11:35 AM
WHT Addict
Join Date: Feb 2002
Location: UK
Posts: 120
Quote:
Originally posted by admin0
was just curious !!
no worries admin0 
__________________
Do'h!