[add] Shell script & GitHub action for fetching PDF & transform Markdown
TechQuery authored and xycjscs committed Jun 20, 2024
1 parent c1e6250 commit d39ae5a
Showing 3 changed files with 48 additions and 4 deletions.
19 changes: 19 additions & 0 deletions .github/workflows/fetch-PDF.yml
@@ -0,0 +1,19 @@
name: PDF downloader
on:
  push:
    branches:
      - main
    paths:
      - "data/**"
jobs:
  Download-and-Transform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - run: tool/fetch-PDF.sh data downloads

      - uses: actions/upload-artifact@v4
        with:
          name: PDF-Markdown
          path: downloads
15 changes: 11 additions & 4 deletions README.md
@@ -1,6 +1,7 @@
# KnowledgeBase-xiaoyibao

[![Web deployment](https://github.com/xycjscs/KnowledgeBase-xiaoyibao/actions/workflows/deploy-Web.yml/badge.svg)][1]
[![PDF downloader](https://github.com/xycjscs/KnowledgeBase-xiaoyibao/actions/workflows/fetch-PDF.yml/badge.svg)][2]

This is the knowledge base project within the xiaoyibao extension projects. It stores the professional medical materials needed to build RAG.

@@ -117,6 +118,14 @@ pnpm start
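The plan is for different JSON documents in the repository to store different `{title-description-link}` collections. The README automatically reads the JSON files to render the home page, and the download script automatically reads the JSON files and performs the downloads.

## Batch-download PDFs and convert them to Markdown

Run the `fetch-PDF.sh` script in the `tool` directory. It finds every PDF link in the JSON files under the `data` directory, downloads the PDFs into the `downloads` directory, and then converts them into standalone files such as Markdown and images.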

```sh
tool/fetch-PDF.sh data downloads
```
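
To preview which links would be picked up before anything is downloaded, the extraction step can be run on its own. What follows is only a sketch under stated assumptions and is not part of the repository: the `list_pdf_urls` helper, the throwaway `sample.json` file, and its field names are all hypothetical. The pattern follows the one used in `fetch-PDF.sh` but stops at quotes and whitespace, so that one JSON line cannot yield an over-long match.

```sh
#!/usr/bin/env bash
# Hypothetical helper: print every PDF URL found under a folder, one per line.
list_pdf_urls() {
    # -E extended regex, -i ignore case, -r recurse, -o matches only, -h no file names
    grep -Eiroh 'https?://[^"[:space:]]+\.pdf' "$1" | sort -u
}

# Throwaway sample data; the field names are assumptions about the JSON layout.
sample_dir=$(mktemp -d)
cat > "$sample_dir/sample.json" <<'EOF'
[
  { "title": "Guide A", "description": "demo", "link": "https://example.com/a.pdf" },
  { "title": "Guide B", "description": "demo", "link": "https://example.com/docs/b.PDF" },
  { "title": "Not a PDF", "description": "demo", "link": "https://example.com/page.html" }
]
EOF

# Lists the two PDF links and skips the HTML one.
list_pdf_urls "$sample_dir"
rm -r "$sample_dir"
```

Running `list_pdf_urls data` from the repository root should list the links the download step would act on, which makes it easier to spot a malformed entry before any network traffic happens.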

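## Wikipedia-style collaboration

The documents are co-authored in the style of Wikipedia: anyone can modify any content in them, including removing unsuitable sections.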
@@ -125,10 +134,7 @@ pnpm start

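## Features or materials still to be developed

- [ ] Automated conversion of PDF documents to Markdown text

- https://github.com/bsorrentino/pdf-tools
- https://github.com/opengovsg/pdf2md
- [x] Automated conversion of PDF documents to Markdown text

- [ ] QA pair database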

@@ -137,3 +143,4 @@ pnpm start
- [x] Automatically publish document updates at a single link

[1]: https://github.com/xycjscs/KnowledgeBase-xiaoyibao/actions/workflows/deploy-Web.yml
[2]: https://github.com/xycjscs/KnowledgeBase-xiaoyibao/actions/workflows/fetch-PDF.yml
18 changes: 18 additions & 0 deletions tool/fetch-PDF.sh
@@ -0,0 +1,18 @@
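#! /usr/bin/env bash

# Both arguments are required: the folder to scan and the folder to save into.
if [ -z "$1" ] || [ -z "$2" ]; then
    cat <<EOF
Usage:
    fetch-PDF.sh folder/with/PDF/URL/text/files folder/save/PDF/Markdown/files
EOF
    exit 1
fi

# Extract every PDF URL under $1 and download each one into $2.
grep -Eiroh 'https?://[^"[:space:]]+\.pdf' "$1" | xargs -I {} curl --create-dirs --output-dir "$2" -O {}
(
    # Convert from inside the output folder; "." replaces the fragile "../$2" path.
    cd "$2" || exit 1
    find . -type f -iname '*.pdf' | xargs -I {} pnpm --package=@bsorrentino/pdf-tools dlx pdftools pdf2md {}
)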
