lost and found ( for me ? )

httrack : copy web contents

httrack copies (mirrors) a web site into a local directory so that you can browse it offline.

[root@fc17 ~]# uname -ri
3.4.6-2.fc17.x86_64 x86_64

[root@fc17 ~]# cat /etc/fedora-release
Fedora release 17 (Beefy Miracle)

install httrack via yum
[root@fc17 ~]# yum install httrack -y

create a directory to store web contents
$ mkdir web-copy

I’ll copy my blog site.
When you copy web contents, please be mindful of copyright, bandwidth, and the load you put on the server.

run httrack
$ httrack

Welcome to HTTrack Website Copier (Offline Browser) 3.43-9+libhtsjava.so.2
Copyright (C) Xavier Roche and other contributors
To see the option list, enter a blank line or try httrack --help

Enter project name :my-blog-copy

Base path (return=/home/hattori/websites/) :/home/hattori/web-copy

Enter URLs (separated by commas or blank spaces) :http://lost-and-found-narihiro.blogspot.jp

(enter) 1 Mirror Web Site(s)
2 Mirror Web Site(s) with Wizard
3 Just Get Files Indicated
4 Mirror ALL links in URLs (Multiple Mirror)
5 Test Links In URLs (Bookmark Test)
0 Quit
: 1

Proxy (return=none) :

You can define wildcards, like: -*.gif +www.*.com/*.zip -*img_*.zip
Wildcards (return=none) :

You can define additional options, such as recurse level (-r<number>), separed by blank spaces
To see the option list, type help
Additional options (return=none) :

---> Wizard command line: httrack http://lost-and-found-narihiro.blogspot.jp  -O "/home/hattori/web-copy/my-blog-copy"  -%v   

Ready to launch the mirror? (Y/n) :y

Mirror launched on Tue, 24 Jul 2012 23:56:52 by HTTrack Website Copier/3.43-9+libhtsjava.so.2 [XR&CO'2010]
mirroring http://lost-and-found-narihiro.blogspot.jp with the wizard help..
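
The wizard prints the command line it builds, so the same mirror can be started again without answering the prompts. Below is a minimal sketch of that, with some limits added to keep the load on the server low. The URL and output path come from the session above; the limit values (-c2, -A25000, -r3) are my own assumptions, not something the wizard chose, and the wizard's -%v flag is left out. The script only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Values taken from the wizard session above.
URL="http://lost-and-found-narihiro.blogspot.jp"
OUT="/home/hattori/web-copy/my-blog-copy"

# Assumed limits (not part of the original session):
#   -c2      at most 2 simultaneous connections
#   -A25000  at most about 25000 bytes per second
#   -r3      follow links at most 3 levels deep
LIMITS="-c2 -A25000 -r3"

# Print the command by default; set RUN=1 to actually start the mirror.
mirror() {
  if [ "${RUN:-0}" = "1" ]; then
    # $LIMITS is intentionally unquoted so it splits into separate options.
    httrack "$URL" -O "$OUT" $LIMITS
  else
    echo "httrack $URL -O $OUT $LIMITS"
  fi
}

mirror
```

Run it once as-is to check the printed command, then run it with RUN=1 in the environment to start copying.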

Here’s a packet capture of one request.
The User-Agent header presents httrack as a Mozilla-compatible browser.
Hypertext Transfer Protocol
   GET /2012/01/linux-mint-12-configure-ip-aliases.html HTTP/1.1\r\n
       [Expert Info (Chat/Sequence): GET /2012/01/linux-mint-12-configure-ip-aliases.html HTTP/1.1\r\n]
           [Message: GET /2012/01/linux-mint-12-configure-ip-aliases.html HTTP/1.1\r\n]
           [Severity level: Chat]
           [Group: Sequence]
       Request Method: GET
       Request URI: /2012/01/linux-mint-12-configure-ip-aliases.html
       Request Version: HTTP/1.1
   Referer: http://lost-and-found-narihiro.blogspot.jp/\r\n
   Cookie: $Version=1; blogger_TID=xxx; $Path=/\r\n
   Connection: Keep-Alive\r\n
   Host: lost-and-found-narihiro.blogspot.jp\r\n
   User-Agent: Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)\r\n
   Accept: image/png, image/jpeg, image/pjpeg, image/x-xbitmap, image/svg+xml, image/gif;q=0.9, */*;q=0.1\r\n
   Accept-Language: en, *\r\n
   Accept-Charset: iso-8859-1, iso-8859-*;q=0.9, utf-8;q=0.66, *;q=0.33\r\n
   Accept-Encoding: gzip, identity;q=0.9\r\n
   [Full request URI: http://lost-and-found-narihiro.blogspot.jp/2012/01/linux-mint-12-configure-ip-aliases.html]
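
The User-Agent in the capture is httrack's default. httrack has an -F option to send a different string, which can be useful if you want the site owner to see who is mirroring. A minimal sketch, assuming the same URL and output path as above; the UA value is a hypothetical example of mine, not something from the capture. As written it only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
URL="http://lost-and-found-narihiro.blogspot.jp"
OUT="/home/hattori/web-copy/my-blog-copy"

# Hypothetical identifying string; replace with your own.
UA="my-blog-backup/0.1"

# Print the command by default; set RUN=1 to actually start the mirror.
mirror_with_ua() {
  if [ "${RUN:-0}" = "1" ]; then
    # -F sets the User-Agent field sent in the HTTP request headers.
    httrack "$URL" -O "$OUT" -F "$UA"
  else
    echo "httrack $URL -O $OUT -F \"$UA\""
  fi
}

mirror_with_ua
```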

After the copy finishes, the web contents are stored under the “~/web-copy/my-blog-copy” directory.
[root@fc17 my-blog-copy]# pwd

[root@fc17 my-blog-copy]# ls
backblue.gif  hts-in_progress.lock  lost-and-found-narihiro.blogspot.jp
fade.gif      hts-log.txt
hts-cache     index.html
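
The listing above includes hts-in_progress.lock. A small sketch for checking a project directory before opening it, assuming (my assumption, not verified against the httrack source) that the lock file is only present while a mirror is running or after it was interrupted. The function name mirror_status is made up for this example.

```shell
#!/bin/sh
# Report the state of an httrack project directory.
# Assumption: hts-in_progress.lock exists only while a mirror is
# running (or was interrupted), and index.html is the entry page.
mirror_status() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing"
  elif [ -e "$dir/hts-in_progress.lock" ]; then
    echo "in-progress"
  elif [ -e "$dir/index.html" ]; then
    echo "done"
  else
    echo "empty"
  fi
}

# Default to the project directory used in this post.
mirror_status "${1:-$HOME/web-copy/my-blog-copy}"
```

If it reports in-progress for a long time, hts-log.txt in the same directory is the place to look.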

open index.html with a web browser.


You could also copy web contents with the wget command, like this:

$ wget --user-agent=Mozilla --mirror --wait=1 http://zzzzz
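
The wget command above downloads the pages, but the result is easier to browse offline with a few more options. A minimal sketch, keeping the placeholder URL from the command above; the extra options (--convert-links, --page-requisites, --no-parent, --limit-rate=50k) are my additions and the rate value is an arbitrary assumption. It only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Placeholder URL, as in the command above.
URL="http://zzzzz"

mirror_wget() {
  # --convert-links    rewrite links so the local copy works offline
  # --page-requisites  also fetch images, CSS, etc. needed by each page
  # --no-parent        do not climb above the starting directory
  # --limit-rate=50k   cap the download speed (assumed value)
  set -- --mirror --wait=1 --user-agent=Mozilla \
         --convert-links --page-requisites --no-parent \
         --limit-rate=50k "$URL"
  if [ "${RUN:-0}" = "1" ]; then
    wget "$@"
  else
    echo "wget $*"
  fi
}

mirror_wget
```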
