forked from LittleRedHat/moh
succulentxb/moh
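## Environment setup

Create a virtual environment: `virtualenv venv`

Activate the virtual environment:

- macOS/Linux: `source ./venv/bin/activate`
- Windows: `venv/Scripts/activate`

Before the first run, install the dependencies: `pip install -r requirements.txt`

## Running the crawler

Change into the crawler directory: `cd crawl`

Run: `scrapy crawl moh -a domain=[NATION] -a debug=True`

Stop the command with `ctrl+\`. `ctrl+z` or `ctrl+c` does not terminate scrapy and leaves it running in the background.

## Command options

- `domain`: the abbreviation of the country to crawl.
- `debug`: turns on debug mode. In this mode crawled pages are not written to the database, and basic information about each crawled page is printed so you can check that the time and title were extracted correctly.

## Writing crawl rules

## TODO

- Download fonts referenced inside CSS
- Filter out downloads of unneeded files such as videos

## API changes

Added title search and date filtering. Example:

    {
      "should": [["cancer"]],
      "by": "title",
      "size": "20",
      "from": 0,
      "sort": "score",
      "filters": [
        { "name": "type", "value": ["html"] },
        { "name": "publish", "value": ["2014-01-01", "2017-01-01"] },
        { "name": "nation", "value": ["all"] },
        { "name": "language", "value": ["all"] }
      ]
    }

- `by`: `title` or `content`
- Date filter: dates use the format 'yyyy-mm-dd'

      { name: "date", "value": [start date, end date] }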
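
The request body above can be assembled programmatically. The following is only a minimal sketch, not part of this repository: `build_search_query` is a hypothetical helper, the date filter is named `publish` as in the full example (the short note uses `date`, so adjust to whatever the server expects), and wrapping the search terms in a single inner list is an assumption based on the example's `should` value.

```python
import json
from datetime import datetime


def build_search_query(terms, by="title", start=None, end=None, size=20, offset=0):
    """Build a request body shaped like the README example (hypothetical helper)."""
    if by not in ("title", "content"):
        raise ValueError("by must be 'title' or 'content'")

    filters = [{"name": "type", "value": ["html"]}]

    if start is not None and end is not None:
        # strptime raises ValueError when a string does not parse as year-month-day.
        start_day = datetime.strptime(start, "%Y-%m-%d")
        end_day = datetime.strptime(end, "%Y-%m-%d")
        if start_day > end_day:
            raise ValueError("start date must not be after end date")
        # Assumption: the filter name follows the full example ("publish").
        filters.append({"name": "publish", "value": [start, end]})

    filters.append({"name": "nation", "value": ["all"]})
    filters.append({"name": "language", "value": ["all"]})

    return {
        "should": [list(terms)],  # assumption: one group containing all terms
        "by": by,
        "size": str(size),        # the example passes size as a string
        "from": offset,
        "sort": "score",
        "filters": filters,
    }


query = build_search_query(["cancer"], by="title", start="2014-01-01", end="2017-01-01")
print(json.dumps(query, indent=2))
```

Validating the dates before sending keeps malformed ranges from reaching the server; how the API itself reacts to a bad date is not documented here.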