Merge pull request #374 from NanmiCoder/feature/baidu_tieba_20240805
feat: MediaCrawler supports Baidu Tieba
NanmiCoder authored Aug 8, 2024
2 parents 7e9a759 + 62ac454 commit a10cdcf
Showing 30 changed files with 7,527 additions and 268 deletions.
44 changes: 41 additions & 3 deletions README.md
@@ -7,7 +7,7 @@
> Click to view the more detailed disclaimer. [Jump there](#disclaimer)
# Repository Description

**Xiaohongshu crawler**, **Douyin crawler**, **Kuaishou crawler**, **Bilibili crawler**, **Weibo crawler**...
**Xiaohongshu crawler**, **Douyin crawler**, **Kuaishou crawler**, **Bilibili crawler**, **Weibo crawler**, **Baidu Tieba**...
It can currently crawl videos, images, comments, likes, reposts, and other data from Xiaohongshu, Douyin, Kuaishou, Bilibili, and Weibo.

Principle: [playwright](https://playwright.dev/) acts as a bridge, preserving the logged-in browser context after a successful login, and some encrypted parameters are obtained by executing JS expressions.
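The login-state reuse described above can be sketched with Playwright's persistent browser context. This is a minimal illustration, not the project's actual code; the user-data directory, URL, and JS expression are assumptions. It requires `pip install playwright` plus `playwright install chromium`.

```python
import asyncio

async def fetch_param_with_login_state(url: str, js_expr: str) -> str:
    # Imported lazily so this sketch stays importable without playwright installed.
    from playwright.async_api import async_playwright
    async with async_playwright() as p:
        # A persistent context keeps cookies/localStorage between runs,
        # so a successful login carries over to later crawls.
        context = await p.chromium.launch_persistent_context(
            user_data_dir="./demo_user_data_dir",  # hypothetical directory
            headless=True,
        )
        page = await context.new_page()
        await page.goto(url)
        # Run a JS expression inside the logged-in page context; this is
        # how encrypted/signed parameters can be read out of the page.
        result = await page.evaluate(js_expr)
        await context.close()
        return result

# Example call (needs a Chromium install):
# asyncio.run(fetch_param_with_login_state("https://example.com", "() => document.title"))
```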
@@ -22,6 +22,7 @@
| Kuaishou ||||||||
| Bilibili ||||||||
| Weibo ||||||||
| Tieba ||||||||


## Usage
@@ -99,14 +100,51 @@
- [MediaCrawler: refactoring the project cache with an abstract-class design](https://articles.zsxq.com/id_4ju73oxewt9j.html)
- [A hands-on guide to building your own IP proxy pool](https://articles.zsxq.com/id_38fza371ladm.html)



## Thanks to the Following Sponsors
- <a href="https://sider.ai/ad-land-redirect?source=github&p1=mi&p2=kk">Registering and installing this free Sider ChatGPT extension earns me a small reward 💰. I have used this extension for over half a year; as one of the most popular extensions on Google, the experience is very good.</a>
> After installing and registering the browser extension, just keep it for one day and I will receive a 3 yuan referral reward. Thank you all for supporting my continued open-source work.

Become a sponsor and display your product here; contact the author on WeChat: yzglan

## Tipping

If you find this project useful, feel free to leave a tip. Your support is my biggest motivation!

When tipping, you can add your name in the note and I will add you to the tip list.
<p>
<img alt="Tip via WeChat" src="static/images/wechat_pay.jpeg" style="width: 200px;margin-right: 140px;" />
<img alt="Tip via Alipay" src="static/images/zfb_pay.png" style="width: 200px" />
</p>

## Donation Info

PS: Please include your name in the note when donating; if you were missed, contact me and I will add you (sometimes there are many messages and some may slip through, my apologies).

| Donor | Amount | Date |
|-------------|----------|------------|
| * | 50 yuan | 2024-03-18 |
| * | 50 yuan | 2024-03-18 |
| * | 20 yuan | 2024-03-17 |
| * | 20 yuan | 2024-03-17 |
| * | 20 yuan | 2024-03-17 |
| Strem Gamer | 20 yuan | 2024-03-16 |
| * | 20 yuan | 2024-03-14 |
| Yuzu | 20 yuan | 2024-03-07 |
| ** | 100 yuan | 2024-03-03 |
| ** | 20 yuan | 2024-03-03 |
| Scarlett | 20 yuan | 2024-02-16 |
| Asun | 20 yuan | 2024-01-30 |
| * | 100 yuan | 2024-01-21 |
| allen | 20 yuan | 2024-01-10 |
| llllll | 20 yuan | 2024-01-07 |
| * | 20 yuan | 2023-12-29 |
| 50chen | 50 yuan | 2023-12-22 |
| xiongot | 20 yuan | 2023-12-17 |
| atom.hu | 20 yuan | 2023-12-16 |
| 一呆 | 20 yuan | 2023-12-01 |
| 坠落 | 50 yuan | 2023-11-08 |



## MediaCrawler Project Chat Group:
> Scan my personal WeChat below and add the note "github" to be pulled into the MediaCrawler project chat group (please be sure to include the note "github"; a WeChat assistant will pull you into the group automatically)
4 changes: 2 additions & 2 deletions cmd_arg/arg.py
@@ -7,8 +7,8 @@
async def parse_cmd():
# Read command-line args
parser = argparse.ArgumentParser(description='Media crawler program.')
parser.add_argument('--platform', type=str, help='Media platform select (xhs | dy | ks | bili | wb)',
choices=["xhs", "dy", "ks", "bili", "wb"], default=config.PLATFORM)
parser.add_argument('--platform', type=str, help='Media platform select (xhs | dy | ks | bili | wb | tieba)',
choices=["xhs", "dy", "ks", "bili", "wb", "tieba"], default=config.PLATFORM)
parser.add_argument('--lt', type=str, help='Login type (qrcode | phone | cookie)',
choices=["qrcode", "phone", "cookie"], default=config.LOGIN_TYPE)
parser.add_argument('--type', type=str, help='crawler type (search | detail | creator)',
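In isolation, the extended `--platform` flag can be reproduced with a standalone parser like this (a minimal sketch; the defaults here are illustrative, not the project's `config` values):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Mirrors the flags shown in the diff above; defaults are illustrative.
    parser = argparse.ArgumentParser(description='Media crawler program.')
    parser.add_argument('--platform', type=str,
                        help='Media platform select (xhs | dy | ks | bili | wb | tieba)',
                        choices=["xhs", "dy", "ks", "bili", "wb", "tieba"],
                        default="xhs")
    parser.add_argument('--lt', type=str,
                        help='Login type (qrcode | phone | cookie)',
                        choices=["qrcode", "phone", "cookie"], default="qrcode")
    parser.add_argument('--type', type=str,
                        help='crawler type (search | detail | creator)',
                        choices=["search", "detail", "creator"], default="search")
    return parser

args = build_parser().parse_args(["--platform", "tieba", "--lt", "cookie"])
print(args.platform, args.lt, args.type)  # tieba cookie search
```

With `choices` in place, an unknown value such as `--platform foo` exits with an argparse error rather than reaching the crawler factory.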
29 changes: 19 additions & 10 deletions config/base_config.py
@@ -28,7 +28,7 @@
SAVE_LOGIN_STATE = True

# Data save type option; three types supported: csv, db, json
SAVE_DATA_OPTION = "json" # csv or db or json
SAVE_DATA_OPTION = "csv" # csv or db or json

# Configuration of the browser user-data files cached for the user
USER_DATA_DIR = "%s_user_data_dir" # %s will be replaced by platform name
@@ -37,7 +37,7 @@
START_PAGE = 1

# Controls how many videos/posts are crawled
CRAWLER_MAX_NOTES_COUNT = 20
CRAWLER_MAX_NOTES_COUNT = 100

# Controls the number of concurrent crawlers
MAX_CONCURRENCY_NUM = 1
@@ -57,7 +57,7 @@
"6422c2750000000027000d88",
"64ca1b73000000000b028dd2",
"630d5b85000000001203ab41",
"668fe13000000000030241fa",  # mixed text and images
"668fe13000000000030241fa",  # mixed text and images
# ........................
]

Expand Down Expand Up @@ -88,6 +88,16 @@
# ........................
]

# List of specified Tieba post IDs to crawl
TIEBA_SPECIFIED_ID_LIST = [

]

# List of Tieba forum names; posts under these forums will be crawled
TIEBA_NAME_LIST = [
# "盗墓笔记"
]

# List of specified Xiaohongshu creator IDs
XHS_CREATOR_ID_LIST = [
"63e36c9a000000002703502b",
@@ -112,19 +122,18 @@
# ........................
]


#Word cloud settings
#Whether to generate a word cloud from comments
# Word cloud settings
# Whether to generate a word cloud from comments
ENABLE_GET_WORDCLOUD = False
# Custom words and their groups
#Adding rule: xx:yy, where xx is the custom word and yy is the name of the group that xx is assigned to.
# Adding rule: xx:yy, where xx is the custom word and yy is the name of the group that xx is assigned to.
CUSTOM_WORDS = {
    '零几': '年份',  # recognize "零几" as a single token
    '高频词': '专业术语'  # example custom word
}

#Stop-word (blocked word) file path
# Stop-word (blocked word) file path
STOP_WORDS_FILE = "./docs/hit_stopwords.txt"

#Chinese font file path
FONT_PATH= "./docs/STZHONGS.TTF"
# Chinese font file path
FONT_PATH = "./docs/STZHONGS.TTF"
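Taken together, the new Tieba options might be filled in as below. This is a hypothetical config fragment: the post ID is a made-up placeholder, and the forum name is the one already hinted at in the config comment.

```python
# Hypothetical example values for the new Tieba options (not real data).
PLATFORM = "tieba"
TIEBA_SPECIFIED_ID_LIST = [
    "1234567890",  # placeholder post ID for illustration
]
TIEBA_NAME_LIST = [
    "盗墓笔记",  # forum name taken from the commented example above
]
```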
1 change: 1 addition & 0 deletions constant/__init__.py
@@ -0,0 +1 @@
# -*- coding: utf-8 -*-
3 changes: 3 additions & 0 deletions constant/baidu_tieba.py
@@ -0,0 +1,3 @@
# -*- coding: utf-8 -*-

TIEBA_URL = 'https://tieba.baidu.com'
7 changes: 5 additions & 2 deletions main.py
@@ -8,6 +8,7 @@
from media_platform.bilibili import BilibiliCrawler
from media_platform.douyin import DouYinCrawler
from media_platform.kuaishou import KuaishouCrawler
from media_platform.tieba import TieBaCrawler
from media_platform.weibo import WeiboCrawler
from media_platform.xhs import XiaoHongShuCrawler

@@ -18,7 +19,8 @@ class CrawlerFactory:
"dy": DouYinCrawler,
"ks": KuaishouCrawler,
"bili": BilibiliCrawler,
"wb": WeiboCrawler
"wb": WeiboCrawler,
"tieba": TieBaCrawler
}

@staticmethod
@@ -28,6 +30,7 @@ def create_crawler(platform: str) -> AbstractCrawler:
raise ValueError("Invalid media platform. Currently only xhs, dy, ks, bili ... are supported")
return crawler_class()


async def main():
# parse cmd
await cmd_arg.parse_cmd()
@@ -38,7 +41,7 @@ async def main():

crawler = CrawlerFactory.create_crawler(platform=config.PLATFORM)
await crawler.start()

if config.SAVE_DATA_OPTION == "db":
await db.close()

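The registry pattern in `CrawlerFactory` can be shown standalone. The sketch below is not the project's implementation; only the platform key `"tieba"` and the class name `TieBaCrawler` come from the diff, and the stub classes stand in for the real crawlers.

```python
class AbstractCrawler:
    async def start(self) -> None:
        raise NotImplementedError

class TieBaCrawler(AbstractCrawler):
    # Stub standing in for media_platform.tieba.TieBaCrawler.
    async def start(self) -> None:
        print("crawling tieba...")

class CrawlerFactory:
    # Maps a platform key to its crawler class; supporting a new
    # platform is just one more entry here (other platforms omitted).
    CRAWLERS = {
        "tieba": TieBaCrawler,
    }

    @staticmethod
    def create_crawler(platform: str) -> AbstractCrawler:
        crawler_class = CrawlerFactory.CRAWLERS.get(platform)
        if not crawler_class:
            raise ValueError(f"Invalid media platform: {platform}")
        return crawler_class()

crawler = CrawlerFactory.create_crawler("tieba")
print(type(crawler).__name__)  # TieBaCrawler
```

Keeping the mapping as class objects (instantiated only in `create_crawler`) means registration has no side effects until a crawler is actually requested.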
2 changes: 2 additions & 0 deletions media_platform/tieba/__init__.py
@@ -0,0 +1,2 @@
# -*- coding: utf-8 -*-
from .core import TieBaCrawler
