## Environment setup
Create a virtual environment: virtualenv venv
Activate it: source ./venv/bin/activate (mac/linux); on Windows, venv/Scripts/activate
Before the first run, install the dependencies: pip install -r requirements.txt
## Running the crawler
cd crawl
Run:
scrapy crawl moh -a domain=[NATION] -a debug=True
Stop the command with ctrl+\
ctrl+z or ctrl+c does not terminate scrapy and leaves it running in the background
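## Command options
domain is the abbreviation of the target country. debug turns on debug mode: crawled pages are not written to the database, and basic information about each crawled page is printed so you can check that the time and title are correct.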
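
Scrapy hands every `-a name=value` argument to the spider as a string, so `debug=True` arrives as the text "True" rather than a boolean. The sketch below shows one way such arguments could be normalized. The helper names `parse_debug_flag` and `parse_spider_args` are hypothetical and are not taken from this project's code.

```python
def parse_debug_flag(value):
    """Convert a command-line argument (normally a string) to a bool."""
    if isinstance(value, bool):
        return value
    if value is None:
        # Flag not given on the command line: debug mode stays off.
        return False
    return str(value).strip().lower() in ("true", "1", "yes")


def parse_spider_args(domain=None, debug=None):
    """Hypothetical helper: validate the values passed with -a."""
    if not domain:
        raise ValueError("domain is required, e.g. -a domain=[NATION]")
    return {"domain": domain.strip(), "debug": parse_debug_flag(debug)}
```

A spider's `__init__` could call `parse_spider_args(domain, debug)` and skip the database write when the resulting `debug` value is true.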
## Notes on defining crawl rules
## TODO
Download fonts referenced inside css
Filter out downloads of unneeded files such as videos
## API changes
Added title search and date filtering.
Example:
    {
        "should": [
            ["cancer"]
        ],
        "by": "title",
        "size": "20",
        "from": 0,
        "sort": "score",
        "filters": [
            {
                "name": "type",
                "value": ["html"]
            },
            {
                "name": "publish",
                "value": ["2014-01-01", "2017-01-01"]
            },
            {
                "name": "nation",
                "value": ["all"]
            },
            {
                "name": "language",
                "value": ["all"]
            }
        ]
    }
by -> title or content
Date filter -> dates use the format 'yyyy-mm-dd'
{
"name": "date",
"value": [start date, end date]
}
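
A request body of this shape can also be assembled in code. The sketch below is an assumption about how a client might build it, not part of the project: `build_search_body` and its parameters are hypothetical. The date filter name is left as a parameter because the full example uses "publish" while the short note uses "date".

```python
from datetime import datetime


def build_search_body(keyword, by="title", start=None, end=None,
                      date_filter_name="publish", size="20", offset=0):
    """Build a search request body with an optional date-range filter."""
    if by not in ("title", "content"):
        raise ValueError("by must be 'title' or 'content'")

    filters = [
        {"name": "type", "value": ["html"]},
        {"name": "nation", "value": ["all"]},
        {"name": "language", "value": ["all"]},
    ]

    if start and end:
        # strptime raises ValueError if a value is not a year-month-day date.
        first = datetime.strptime(start, "%Y-%m-%d")
        last = datetime.strptime(end, "%Y-%m-%d")
        if first > last:
            raise ValueError("start date must not be after end date")
        filters.append({"name": date_filter_name, "value": [start, end]})

    return {
        "should": [[keyword]],
        "by": by,
        "size": size,
        "from": offset,
        "sort": "score",
        "filters": filters,
    }
```

For instance, `build_search_body("cancer", start="2014-01-01", end="2017-01-01")` yields a title search limited to that date range. Passing the result to `json.dumps` gives the text to send.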