- 基于weibo.com的新版API构建,拥有最丰富的字段信息
- 多种采集模式,包含微博用户,推文,粉丝,关注,转发,评论,关键词搜索
- 核心代码仅100行,代码可读性高,可快速按需进行定制化改造
git clone [email protected]:nghuyong/WeiboSpider.git --depth 1 --no-single-branch
cd WeiboSpider
pip install -r requirements.txt
访问https://weibo.com/, 登陆账号,打开浏览器的开发者模式,再次刷新
复制weibo.com
数据包,network中的cookie值,右键copy value。编辑weibospider/cookies.json
,替换成刚刚复制的Cookie,给其命名。
具体来讲,只需要SUB
值即可:
{
"name": "SUB=_2A25OUHI9DeRhGeBN6VUX9SvEzT-IHXVtux51rDV6PUJbktAKLVCmkW1NRJ24IFM3w0MsIksejOBZJEcAgJZZobMv"
}
微博有较为严格的反爬措施,对单个cookie设置了访问频率限制(大约1次/秒),可收集多个帐号cookie提高爬取效率。
将多个cookie粘贴至weibospider/cookies.txt
中,一行一个,适当调整settings.py
文件中的DOWNLOAD_DELAY
和CONCURRENT_REQUESTS
。
推荐设置:
CONCURRENT_REQUESTS
:当cookie数小于8时,将设置为cookies个数,cookie个数大于8时设置为8DOWNLOAD_DELAY
:设置为cookies个数的倒数。例如拥有1个cookie时设置为1,拥有8个cookie时设置为0.2(每cookie访问频率不超过1次/秒)
重写fetch_proxy 方法,该方法需要返回一个代理ip,具体参考这里
根据自己实际需要重写./weibospider/spiders/*
中的start_requests
函数
采集的数据输出在output
文件夹中,命名为{spider.name}_{datetime}.jsonl
cd weibospider
python run_spider.py user
{
"crawl_time": 1666863485,
"_id": "1749127163",
"avatar_hd": "https://tvax4.sinaimg.cn/crop.0.0.1080.1080.1024/001Un9Srly8h3fpj11yjyj60u00u0q7f02.jpg?KID=imgbed,tva&Expires=1666874283&ssig=a%2FMfgFzvRo",
"nick_name": "雷军",
"verified": true,
"description": "小米董事长,金山软件董事长。业余爱好是天使投资。",
"followers_count": 22756103,
"friends_count": 1373,
"statuses_count": 14923,
"gender": "m",
"location": "北京 海淀区",
"mbrank": 7,
"mbtype": 12,
"verified_type": 0,
"verified_reason": "小米创办人,董事长兼CEO;金山软件董事长;天使投资人。",
"birthday": "",
"created_at": "2010-05-31 23:07:59",
"desc_text": "小米创办人,董事长兼CEO;金山软件董事长;天使投资人。",
"ip_location": "IP属地:北京",
"sunshine_credit": "信用极好",
"label_desc": [
"V指数 财经 75.30分",
"热门财经博主 数据飙升",
"昨日发博3,阅读数100万+,互动数1.9万",
"视频累计播放量9819.3万",
"群友 3132"
],
"company": "金山软件",
"education": {
"school": "武汉大学"
}
}
python run_spider.py fan
{
"crawl_time": 1666863563,
"_id": "1087770692_5968044974",
"follower_id": "1087770692",
"fan_info": {
"_id": "5968044974",
"avatar_hd": "https://tvax1.sinaimg.cn/default/images/default_avatar_male_180.gif?KID=imgbed,tva&Expires=1666874363&ssig=UuzaeK437R",
"nick_name": "用户5968044974",
"verified": false,
"description": "",
"followers_count": 0,
"friends_count": 195,
"statuses_count": 9,
"gender": "m",
"location": "其他",
"mbrank": 0,
"mbtype": 0,
"credit_score": 80,
"created_at": "2016-06-25 22:30:13"
}
}
...
python run_spider.py follow
{
"crawl_time": 1666863679,
"_id": "1087770692_7083568088",
"fan_id": "1087770692",
"follower_info": {
"_id": "7083568088",
"avatar_hd": "https://tvax4.sinaimg.cn/crop.0.0.1080.1080.1024/007JnVEcly8gyqd9jadjlj30u00u0gpn.jpg?KID=imgbed,tva&Expires=1666874479&ssig=9zhfeMPLzr",
"nick_name": "蒋昀霖",
"verified": true,
"description": "工作请联系:[email protected]",
"followers_count": 329216,
"friends_count": 58,
"statuses_count": 342,
"gender": "m",
"location": "北京",
"mbrank": 6,
"mbtype": 12,
"credit_score": 80,
"created_at": "2019-04-17 16:25:43",
"verified_type": 0,
"verified_reason": "东申未来 演员"
}
}
...
python run_spider.py tweet
{
"crawl_time": 1666864583,
"_id": "4762810834227120",
"mblogid": "LqlZNhJFm",
"created_at": "2022-04-27 10:20:54",
"geo": null,
"ip_location": null,
"reposts_count": 1907,
"comments_count": 1924,
"attitudes_count": 12169,
"source": "三星Galaxy S22 Ultra",
"content": "生于乱世纵横四海,义之所在不计生死,孤勇者陈恭一生当如是。#风起陇西今日开播# #风起陇西# 今晚,恭候你!",
"pic_urls": [],
"pic_num": 0,
"video": "http://f.video.weibocdn.com/o0/CmQEWK1ylx07VAm0nrxe01041200YDIc0E010.mp4?label=mp4_720p&template=1280x720.25.0&ori=0&ps=1CwnkDw1GXwCQx&Expires=1666868183&ssig=RlIeOt286i&KID=unistore,video",
"url": "https://weibo.com/1087770692/LqlZNhJFm"
}
...
python run_spider.py comment
{
"crawl_time": 1666863805,
"_id": 4826279188108038,
"created_at": "2022-10-19 13:41:29",
"like_counts": 1,
"ip_location": "来自河南",
"content": "五周年快乐呀,请坤哥哥继续保持这份热爱,奔赴下一场山海",
"comment_user": {
"_id": "2380967841",
"avatar_hd": "https://tvax4.sinaimg.cn/crop.0.0.888.888.1024/002B8iv7ly8gv647ipgxvj60oo0oojtk02.jpg?KID=imgbed,tva&Expires=1666874604&ssig=%2FdGaaIRkhf",
"nick_name": "流年执念的二瓜娇",
"verified": false,
"description": "蓝桉已遇释怀鸟,不爱万物唯爱你。",
"followers_count": 238,
"friends_count": 1655,
"statuses_count": 12546,
"gender": "f",
"location": "河南",
"mbrank": 6,
"mbtype": 11
}
}
...
python run_spider.py repost
{
"_id": "4826312651310475",
"mblogid": "Mb2vL5uUH",
"created_at": "2022-10-19 15:54:27",
"geo": null,
"ip_location": "发布于 德国",
"reposts_count": 0,
"comments_count": 0,
"attitudes_count": 0,
"source": "iPhone客户端",
"content": "共享[鼓掌][太开心][鼓掌]五周年快乐!//@陈坤:#山下学堂五周年# 五年, 感谢同行。",
"pic_urls": [],
"pic_num": 0,
"user": {
"_id": "2717869081",
"avatar_hd": "https://tvax1.sinaimg.cn/crop.0.0.160.160.1024/a1ff6419ly8gz1xoq9oolj204g04g745.jpg?KID=imgbed,tva&Expires=1666876939&ssig=Cl93CLjdB%2F",
"nick_name": "YuFeeC",
"verified": false,
"mbrank": 0,
"mbtype": 0
},
"url": "https://weibo.com/2717869081/Mb2vL5uUH",
"crawl_time": 1666866139
}
...
python run_spider.py search
{
"crawl_time": 1666869049,
"keyword": "丽江",
"_id": "4829255386537989",
"mblogid": "Mch46rqPr",
"created_at": "2022-10-27 18:47:50",
"geo": {
"type": "Point",
"coordinates": [
26.962427,
100.248299
],
"detail": {
"poiid": "B2094251D06FAAF44299",
"title": "山野文创旅拍圣地",
"type": "checkin",
"spot_type": "0"
}
},
"ip_location": "发布于 云南",
"reposts_count": 0,
"comments_count": 0,
"attitudes_count": 1,
"source": "iPhone1314iPhone客户端",
"content": "丽江小漾日出\n推出户外移动餐桌\n接受私人定制\n让美食融入美景心情自然美丽了!\n#小众宝藏旅行地##超出片的艺术街区# ",
"pic_urls": [
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k1a56c4oj234022onph",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19eb2kxj22ts1vvb2a",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k1a0wzglj22ua1w7hdw",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19wsafnj231x21a7wj",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19jd1xkj22oh1sbkjo",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19mma74j22ru1ukx6q",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19tf1bfj234022oe85",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19pk37pj234022okjm",
"https://wx1.sinaimg.cn/orj960/4b138405gy1h7k19g6nzfj20wi0lo7my"
],
"pic_num": 9,
"user": {
"_id": "1259570181",
"avatar_hd": "https://tvax1.sinaimg.cn/crop.0.0.1080.1080.1024/4b138405ly8gzfkfikyqvj20u00u0ag1.jpg?KID=imgbed,tva&Expires=1666879848&ssig=6PUDG5RonQ",
"nick_name": "飞鸟与鱼",
"verified": true,
"mbrank": 7,
"mbtype": 12,
"verified_type": 0
},
"url": "https://weibo.com/1259570181/Mch46rqPr"
}
...
- 2022.11: 支持针对单个关键词获取单天超过1200页的检索结果 #257
- 2022.11: 支持长微博全文的获取
- 2022.11: 基于关键词微博搜索支持指定时间范围
- 2022.10: 添加IP归属地信息的采集,包括用户数据,微博数据和微博评论数据
- 2022.10: 基于weibo.com站点对项目进行重构
- 已构建超大规模数据集WeiboCOV,可免费申请,包含2千万微博活跃用户以及6千万推文数据,参见这里