Thread: mirror website

  1. #1

    mirror website

    Hi all,

    Is wget the best way to download a complete copy of a website? Parts of the site require a username and password; has anyone used wget with that before?

    Can lynx do a good job of it too?

    Or is there a better Linux utility for this?

    Any help much appreciated,
    Infinite
    Do'h!

  2. #2
    Join Date: Apr 2001
    Location: Montana USA
    Posts: 673
    wget -m I think.

    wget --help
    GNU Wget 1.7, a non-interactive network retriever.
    Usage: wget [OPTION]... [URL]...

    Mandatory arguments to long options are mandatory for short options too.

    Startup:
    -V, --version display the version of Wget and exit.
    -h, --help print this help.
    -b, --background go to background after startup.
    -e, --execute=COMMAND execute a `.wgetrc'-style command.

    Logging and input file:
    -o, --output-file=FILE log messages to FILE.
    -a, --append-output=FILE append messages to FILE.
    -d, --debug print debug output.
    -q, --quiet quiet (no output).
    -v, --verbose be verbose (this is the default).
    -nv, --non-verbose turn off verboseness, without being quiet.
    -i, --input-file=FILE download URLs found in FILE.
    -F, --force-html treat input file as HTML.
    -B, --base=URL prepends URL to relative links in -F -i file.
    --sslcertfile=FILE optional client certificate.
    --sslcertkey=KEYFILE optional keyfile for this certificate.

    Download:
    --bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host.
    -t, --tries=NUMBER set number of retries to NUMBER (0 unlimits).
    -O --output-document=FILE write documents to FILE.
    -nc, --no-clobber don't clobber existing files or use .# suffixes.
    -c, --continue resume getting a partially-downloaded file.
    --dot-style=STYLE set retrieval display style.
    -N, --timestamping don't re-retrieve files unless newer than local.
    -S, --server-response print server response.
    --spider don't download anything.
    -T, --timeout=SECONDS set the read timeout to SECONDS.
    -w, --wait=SECONDS wait SECONDS between retrievals.
    --waitretry=SECONDS wait 1...SECONDS between retries of a retrieval.
    -Y, --proxy=on/off turn proxy on or off.
    -Q, --quota=NUMBER set retrieval quota to NUMBER.

    Directories:
    -nd --no-directories don't create directories.
    -x, --force-directories force creation of directories.
    -nH, --no-host-directories don't create host directories.
    -P, --directory-prefix=PREFIX save files to PREFIX/...
    --cut-dirs=NUMBER ignore NUMBER remote directory components.

    HTTP options:
    --http-user=USER set http user to USER.
    --http-passwd=PASS set http password to PASS.
    -C, --cache=on/off (dis)allow server-cached data (normally allowed).
    -E, --html-extension save all text/html documents with .html extension.
    --ignore-length ignore `Content-Length' header field.
    --header=STRING insert STRING among the headers.
    --proxy-user=USER set USER as proxy username.
    --proxy-passwd=PASS set PASS as proxy password.
    --referer=URL include `Referer: URL' header in HTTP request.
    -s, --save-headers save the HTTP headers to file.
    -U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION.
    --no-http-keep-alive disable HTTP keep-alive (persistent connections).
    --cookies=off don't use cookies.
    --load-cookies=FILE load cookies from FILE before session.
    --save-cookies=FILE save cookies to FILE after session.

    FTP options:
    -nr, --dont-remove-listing don't remove `.listing' files.
    -g, --glob=on/off turn file name globbing on or off.
    --passive-ftp use the "passive" transfer mode.
    --retr-symlinks when recursing, get linked-to files (not dirs).

    Recursive retrieval:
    -r, --recursive recursive web-suck -- use with care!
    -l, --level=NUMBER maximum recursion depth (inf or 0 for infinite).
    --delete-after delete files locally after downloading them.
    -k, --convert-links convert non-relative links to relative.
    -K, --backup-converted before converting file X, back up as X.orig.
    -m, --mirror shortcut option equivalent to -r -N -l inf -nr.
    -p, --page-requisites get all images, etc. needed to display HTML page.

    Recursive accept/reject:
    -A, --accept=LIST comma-separated list of accepted extensions.
    -R, --reject=LIST comma-separated list of rejected extensions.
    -D, --domains=LIST comma-separated list of accepted domains.
    --exclude-domains=LIST comma-separated list of rejected domains.
    --follow-ftp follow FTP links from HTML documents.
    --follow-tags=LIST comma-separated list of followed HTML tags.
    -G, --ignore-tags=LIST comma-separated list of ignored HTML tags.
    -H, --span-hosts go to foreign hosts when recursive.
    -L, --relative follow relative links only.
    -I, --include-directories=LIST list of allowed directories.
    -X, --exclude-directories=LIST list of excluded directories.
    -nh, --no-host-lookup don't DNS-lookup hosts.
    -np, --no-parent don't ascend to the parent directory.

    Mail bug reports and suggestions to <bug-wget@gnu.org>.
    John Masterson
    Former Hosting Company Owner

  3. #3
    Thanks magnafix. Has anyone used wget regularly? Would you recommend setting it up as a cron job?

    Cheers,
    Infinite
    Do'h!

  4. #4
    Join Date: Sep 2001
    Location: Sirkali Rural Tamilnadu
    Posts: 738
    I have used / use wget and it just rocks..

    For mirroring, use:

    wget -m -nH http://sourcesite.com

    (The -nH "no host directories" option avoids copying everything under a directory called sourcesite.com.)
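
For the cron question earlier in the thread, here is a rough sketch of a wrapper script around the same mirror command. The script name, the paths, and the schedule are made up for illustration; only the wget options come from the help output quoted above.

```shell
#!/bin/sh
# mirror-site.sh: hypothetical wrapper meant to be called from cron.
# SRC_URL, DEST_DIR and LOG_FILE are placeholders, not values from this thread.
SRC_URL="${SRC_URL:-http://sourcesite.com/}"
DEST_DIR="${DEST_DIR:-/var/www/mirror}"
LOG_FILE="${LOG_FILE:-/var/log/mirror-site.log}"

mirror_once() {
    # -m  : mirror; includes timestamping, so repeat runs can skip files
    #       that have not changed on the server
    # -nH : do not create a directory named after the host
    # -P  : save everything under DEST_DIR
    # -a  : append wget's messages to LOG_FILE
    # Set DRY_RUN=1 to print the command instead of running it.
    ${DRY_RUN:+echo} wget -m -nH -P "$DEST_DIR" -a "$LOG_FILE" "$SRC_URL"
}

# Only start the download when called as "mirror-site.sh run".
if [ "${1:-}" = "run" ]; then
    mirror_once
fi

# Example crontab entry (nightly at 03:00; the install path is an assumption):
# 0 3 * * * /usr/local/bin/mirror-site.sh run
```

Running it once by hand with DRY_RUN=1 shows the exact command before cron is trusted with it.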

    I haven't tried password-protected directories, but they should work just as well because the .htaccess is also copied.

    Cheers
    Balaji
    I am now happily selling Natural Herbal Hair Oil - happy to be so far removed from technology!

  5. #5
    Originally posted by MotleyFool:
    "I have used / use wget and it just rocks.."

    Thanks MotleyFool, I'll give it a try!

    Cheers,
    Infinite
    Do'h!

  6. #6
    Join Date: Nov 2001
    Location: Vancouver
    Posts: 2,422
    Assuming Apache -- If you need wget to copy .htaccess files, you might have to alter your httpd.conf and comment out the following section:

    <Files ~ "^\.ht">
    Order allow,deny
    Deny from all
    Satisfy All
    </Files>

    By default most Apache installs prevent web clients (including wget) from viewing .htaccess contents.

    Could be wrong - I've used wget before to mirror but only visible files - but a heads up on something to check on.
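
Fetching the protected pages themselves does not depend on copying .htaccess: wget can send a username and password with each request. A minimal sketch, assuming the area uses ordinary HTTP authentication; the URL and credentials are placeholders, and the option names are the ones listed in the wget 1.7 help above (newer wget releases spell the second one --http-password).

```shell
#!/bin/sh
# SITE_URL, SITE_USER and SITE_PASS are placeholders, not values from this thread.
SITE_URL="${SITE_URL:-http://sourcesite.com/}"
SITE_USER="${SITE_USER:-username}"
SITE_PASS="${SITE_PASS:-password}"

mirror_protected() {
    # -m -nH                      : mirror without a host-name directory
    # --http-user / --http-passwd : credentials sent to the web server
    #   (wget 1.7 spelling; later versions use --http-password)
    # Set DRY_RUN=1 to print the command instead of running it.
    ${DRY_RUN:+echo} wget -m -nH \
        --http-user="$SITE_USER" --http-passwd="$SITE_PASS" \
        "$SITE_URL"
}
```

Keep in mind that a password given on the command line can be visible to other users on the same machine while wget is running.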

  7. #7
    Thanks mwatkins, I'll make or copy a version of .htaccess myself and make sure the other server keeps the same password.

    Cheers,
    Infinite
    Do'h!

  8. #8
    Join Date: Dec 2001
    Location: Netherlands
    Posts: 849
    Hi.

    Does wget -m -nH http://site1.com also work on CGI and PHP pages?

    I mean, if I run that on a forum site, will it work too?

    e.g.:

    wget -m -nH http://webhostingtalk.com

    Just curious.

    # experienced Cloud/OpenStack Architect
    #
    # Feel free to PM me for any info or help to build your cloud.

  9. #9
    Join Date: Jul 2001
    Posts: 889
    admin0: wget will run, but the result won't be useful for any practical purpose: WHT is served by PHP scripts, and you would only be downloading the HTML those scripts output, not the scripts themselves... if that made any sense.. hehe..
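
If a frozen, read-only copy of the generated pages is all that is wanted, something like the following might do. This is only a sketch using options from the help output above; the function name is made up, and searching, logging in, and posting will not work in the copy.

```shell
#!/bin/sh
snapshot_site() {
    # $1 is the URL to copy.
    # -m -nH : mirror without a host-name directory
    # -E     : save text/html documents with an .html extension
    # -k     : convert links so the copy can be browsed locally
    # -p     : also fetch images and other files needed to display each page
    # Set DRY_RUN=1 to print the command instead of running it.
    ${DRY_RUN:+echo} wget -m -nH -E -k -p "$1"
}
```

On a large forum this can mean a very large number of requests, so the -w and -l options from the help output are worth a look before trying it.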


  10. #10
    Join Date: Jun 2002
    Posts: 80

    Question

    What if your site has a MySQL database? Does it work too?

  11. #11
    aquos, wget fetches HTML files from the web server, just as your web browser does. It won't get any MySQL tables or databases, any more than your browser would.

    admin0, if you wanted to mirror a forum, you would have to do it at the database level; that's where all the info is, including posts, profiles etc. You could set up a cron job to dump the database, zip it up, and place it in a password-protected area. Then the other server could wget the dump file, unzip it, and load it into MySQL. Someone may have a simpler way though.
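
The dump-and-fetch idea above could look roughly like this. It is a sketch only: the database name, directories, URL, and function names are placeholders, gzip stands in for "zip it up", and mysqldump and mysql are assumed to pick up their login details from a file such as ~/.my.cnf.

```shell
#!/bin/sh
# All values below are placeholders, not taken from this thread.
DB_NAME="${DB_NAME:-forum}"
DUMP_DIR="${DUMP_DIR:-/var/www/protected}"
DUMP_URL="${DUMP_URL:-http://sourcesite.com/protected/forum.sql.gz}"
WORK_DIR="${WORK_DIR:-/tmp}"

# On the source server (called from cron): dump the database and
# compress it into the password-protected web directory.
dump_db() {
    mysqldump "$DB_NAME" | gzip > "$DUMP_DIR/$DB_NAME.sql.gz"
}

# On the mirror server: fetch the dump with HTTP credentials, then
# decompress it and load it into MySQL. HTTP_USER and HTTP_PASS are
# expected to be set by the caller.
fetch_and_load() {
    wget --http-user="$HTTP_USER" --http-passwd="$HTTP_PASS" \
         -O "$WORK_DIR/$DB_NAME.sql.gz" "$DUMP_URL" &&
        gunzip -c "$WORK_DIR/$DB_NAME.sql.gz" | mysql "$DB_NAME"
}
```

Each function would be called from its own cron entry, with the mirror side scheduled late enough that the dump has finished.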

    wget only copies the HTML pages (and images etc.) that it finds on the server; it's similar to a search engine spider.

    HTH,
    Infinite
    Do'h!

  12. #12
    Join Date: Dec 2001
    Location: Netherlands
    Posts: 849
    was just curious !!



  13. #13
    Originally posted by admin0:
    "was just curious !!"

    No worries admin0
    Do'h!
