sitedonload

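Python site downloader: saves the HTML, CSS, images, and JavaScript for a given URL.

Feel free to contact me if you have any ideas about this project: [email protected]

A Python web page downloader for saving HTML, CSS, images, and JS. It supports sub-page downloads, scheduled downloads, command-line operation, and logging.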

# e.g.
python main.py -d '' -u http://m.sohu.com -o tmp

The site will be saved in the following structure:

$save_dir
|-- sitelog.log
|-- $save_time/
    |-- index.html
    |-- css/
        |-- *.css
    |-- images/
        |-- *.jpg/png/gif/ico
    |-- js/
        |-- *.js
    |-- subs/ (when using the `-s` option)
        |-- sub0/
            |-- sitelog.log
            |-- $save_time/
                |-- index.html
                |-- css/
                |-- images/
                |-- js/
        |-- sub1/

For each URL, the following content will be saved:

the log is saved in sitelog.log
the main page is saved in index.html
CSS is saved in css/
images are saved in images/
JavaScript is saved in js/
sub-pages are saved in subs/
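
The layout above can be sketched in a few lines of Python. This is only an illustration, not the project's own code: the helper name `make_snapshot_dirs` and the timestamp format are assumptions, and only the folder names come from the structure described here. It runs on Python 2.7 and 3.

```python
import os
import time

# Folder names taken from the structure above.
SUBDIRS = ('css', 'images', 'js')


def make_snapshot_dirs(save_dir, save_time=None):
    """Create save_dir/save_time/{css,images,js} and return the snapshot path.

    Hypothetical helper: the real project may name and build these differently.
    """
    if save_time is None:
        # Assumed timestamp format; the project's actual format is not documented.
        save_time = time.strftime('%Y%m%d%H%M%S')
    snapshot = os.path.join(save_dir, save_time)
    for name in SUBDIRS:
        path = os.path.join(snapshot, name)
        if not os.path.isdir(path):
            os.makedirs(path)
    return snapshot
```

Calling `make_snapshot_dirs('tmp')` would create `tmp/<timestamp>/css`, `images`, and `js`, ready for `index.html` to be written next to them.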

Note:

Ads are not included.

This project runs well on:

Sohu mobile site http://m.sohu.com
QQ News http://news.qq.com/
Zhihu top answers https://www.zhihu.com/topic/19550228/top-answers

You may need to change the code for your specific site. Don't worry, it's easy.

# e.g. for image URLs
On douban.com, the original image URL is stored in `<img data-origin="">`.
You just need to replace `<img src="">` with your image path.

Chrome is recommended for opening index.html.

The code follows PEP 8.
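
As a rough sketch of that kind of site-specific tweak, the snippet below reads the image URL from `data-origin` (falling back to `src`) and points `src` at a local path. It uses the standard `re` module for self-containment; the function name `rewrite_img_tags` and the `local_path_for` callback are hypothetical, and the project itself may do this differently (for example with lxml). It only handles double-quoted attributes.

```python
import re

IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')
# Match src="..." but not data-src="..." or similar.
SRC_ATTR = re.compile(r'(?<![\w-])src\s*=\s*"[^"]*"')


def rewrite_img_tags(html, local_path_for, url_attr='data-origin'):
    """Point each <img> src at a local path.

    The remote URL is read from url_attr when present, else from src.
    local_path_for is a caller-supplied function: remote URL -> local path.
    """
    def fix(match):
        tag = match.group(0)
        attrs = dict(ATTR.findall(tag))
        remote = attrs.get(url_attr) or attrs.get('src')
        if not remote:
            return tag  # nothing to rewrite
        new_src = 'src="%s"' % local_path_for(remote)
        if 'src' in attrs:
            return SRC_ATTR.sub(lambda _m: new_src, tag, count=1)
        # No src attribute yet: insert one right after "<img".
        return tag[:4] + ' ' + new_src + tag[4:]

    return IMG_TAG.sub(fix, html)
```

For a different site, only `url_attr` would need to change, e.g. `rewrite_img_tags(html, to_local, url_attr='original')`, where `to_local` is whatever maps a remote URL to a file under `images/`.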

before use

Python 2.7.x is required.
You should install the packages listed in the file dependence:

pip install -r dependence

Note that some packages are not easy to install, e.g. lxml. You may need to search online for an installation method.

how to use

cmd:

Usage:
  main.py -d <delaytime> -u <url>  [-o <save_dir>] [-s]

Arguments:
  delaytime     delay between saves in seconds, eg: 60
  save_dir      path to save your site, eg: 'tmp'
  url           url, eg: http://m.sohu.com

Options:
  -h --help     show this help
  -d            delaytime
  -o            save_dir
  -s            save sub urls
  -u            url

# e.g.
# save http://m.sohu.com to `tmp/backup`
python main.py -d '' -u http://m.sohu.com -o tmp/backup

# save http://m.sohu.com to `tmp/backup` every 60 seconds
python main.py -d 60 -u http://m.sohu.com -o tmp/backup

# save http://m.sohu.com and the top 20 sub-URLs to `tmp/backup` every 60 seconds
python main.py -d 60 -u http://m.sohu.com -o tmp/backup -s

python main.py -d 60 -u http://www.sina.com.cn -o tmp/backup

api:

from main import loop
# loop(url, save_dir, delaytime=None)
# e.g.; note that delaytime must be a string containing an integer

# save http://m.sohu.com to `tmp` every 60 seconds
loop('http://m.sohu.com', 'tmp', '60')

# save http://m.sohu.com and the top 20 sub-URLs to `tmp` every 60 seconds
loop('http://m.sohu.com', 'tmp', '60', True)
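
For readers curious how a delay given as an integer string can drive repeated saves, here is a minimal sketch. It is not the project's `loop()` implementation: `run_every`, `max_runs`, and the injectable `sleep` are hypothetical names added for illustration and testing, and treating an empty delay as "run once" is an assumption based on the `-d ''` examples above.

```python
import time


def run_every(task, delaytime, max_runs=None, sleep=time.sleep):
    """Run task() once if delaytime is '' or None, else every int(delaytime) seconds.

    Returns the number of times task() was called. max_runs bounds the
    repetitions (None means repeat until interrupted).
    """
    if not delaytime:
        task()
        return 1
    delay = int(delaytime)  # raises ValueError for a non-integer string
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break  # do not sleep after the final run
        sleep(delay)
    return runs
```

A call such as `run_every(save_once, '60')`, with `save_once` standing in for a single download, would repeat the save every 60 seconds until the process is stopped.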

todo

multiprocessing support (requires inter-process communication; to be verified)

history

add header to act as pc
add loop() for sleep function
add cmd using docopt
debug: xurljoin() was wrong for URLs containing /?, e.g. http://m.sohu.com/?...
debug: special img_url in m.sohu.com: has both src and original
debug: special img_url in m.sohu.com: src in same
debug: special url like //a/b...
debug: encoding problem appeared in 163.com
debug: gb2312, gbk UnicodeDecodeError
debug: wrong encoding when open with IE
debug: convert some relative urls to abs urls
debug: img url might be redirected
debug: \\ causes problem in url
debug: ssl warning
add function for zhihu.com: save png in
add command -s: use -s to save top 20 sub urls
add log function: now you can check sitelog.txt for error info
