How should the two datasets ['ViDoSeek','SlideVQA'] be prepared? #9

1137043480 · 2025-03-06T13:11:27Z

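./scripts/pdf2images.py sets datasets = ['ViDoSeek','SlideVQA']. How should these two datasets be prepared?
The ViDoSeek dataset can be found on Hugging Face at https://huggingface.co/datasets/autumncc/ViDoSeek/tree/main,
but how exactly should it be used? A detailed explanation would be appreciated.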

1137043480 · 2025-03-06T14:59:21Z

Steps:
1. Download https://huggingface.co/datasets/autumncc/ViDoSeek/tree/main into the ./data folder.
Commands (HF_ENDPOINT is only needed when downloading through the hf-mirror.com mirror):
    cd data
    export HF_ENDPOINT=https://hf-mirror.com
    huggingface-cli download autumncc/ViDoSeek --repo-type dataset --local-dir .
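
For anyone who prefers Python over the CLI, the same download can probably be done with snapshot_download from the huggingface_hub package. This is only a minimal sketch and not part of the repository: the helper names download_kwargs and download_vidoseek are made up here, the mirror endpoint is optional, and the paths assume the call is made from the project root.

```python
import os


def download_kwargs(local_dir="./data"):
    """Arguments that mirror the huggingface-cli download command."""
    return {
        "repo_id": "autumncc/ViDoSeek",
        "repo_type": "dataset",
        "local_dir": local_dir,
    }


def download_vidoseek(local_dir="./data", endpoint=None):
    """Download the dataset repo into local_dir and return the local path."""
    if endpoint:
        # HF_ENDPOINT has to be set before huggingface_hub is first imported.
        os.environ["HF_ENDPOINT"] = endpoint
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))
```

Calling download_vidoseek("./data", endpoint="https://hf-mirror.com") from the project root should leave the two zip files in ./data, the same as the CLI command.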

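2. Unzip the downloaded videoseek_pdf_document.zip and slidevqa_pdf_document.zip.
Both archives extract to a folder named pdf, so handle them one at a time: after unzipping the first archive, create a ViDoSeek folder inside data and move the pdf folder into it; then do the same for the second archive with a SlideVQA folder. ./scripts/pdf2images.py expects these two folders, ViDoSeek and SlideVQA, inside data.
Commands (run inside the data folder):
    unzip videoseek_pdf_document.zip
    mkdir -p ViDoSeek
    mv pdf ViDoSeek/
    unzip slidevqa_pdf_document.zip
    mkdir -p SlideVQA
    mv pdf SlideVQA/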
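
The unzip-and-move step can also be done in one go by extracting each archive directly into its dataset folder. The sketch below uses only the standard library; extract_into_dataset is a hypothetical helper name, and it assumes (as described above) that each archive contains a top-level pdf folder.

```python
import zipfile
from pathlib import Path


def extract_into_dataset(zip_path, dataset_dir):
    """Extract an archive into dataset_dir and return the resulting pdf folder."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dataset_dir)
    pdf_dir = dataset_dir / "pdf"
    if not pdf_dir.is_dir():
        # The archive layout differs from what this sketch assumes.
        raise FileNotFoundError(f"{zip_path} did not produce {pdf_dir}")
    return pdf_dir
```

For example, extract_into_dataset("data/videoseek_pdf_document.zip", "data/ViDoSeek") followed by extract_into_dataset("data/slidevqa_pdf_document.zip", "data/SlideVQA") avoids the two pdf folders colliding.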

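3. Convert the PDF files in the pdf folder of ViDoSeek and of SlideVQA into images, saved in an img folder inside each dataset folder, so that pdf and img sit at the same level. Running python ./scripts/pdf2images.py does this, but pdf2image needs Poppler to be installed first:
Step 1: Install Poppler
On Linux (e.g. Ubuntu/Debian), install it with:
    sudo apt-get update
    sudo apt-get install poppler-utils
Step 2: Verify the installation
Make sure the pdfinfo command is available:
    which pdfinfo # should print something like /usr/bin/pdfinfo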
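
The same check can be done from Python before starting a long conversion run. A minimal sketch: missing_tools is a hypothetical helper, and the default tool list assumes pdf2image calls the Poppler binaries pdfinfo and pdftoppm.

```python
import shutil


def missing_tools(tools=("pdfinfo", "pdftoppm")):
    """Return the names from tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

If missing_tools() returns a non-empty list, install poppler-utils before running the conversion script.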

In addition, ./scripts/pdf2images.py has to create the img folder itself; otherwise it fails because the folder does not exist. To speed up the conversion, pass thread_count=16 or 8 (depending on the number of CPU cores) to convert_from_path(filepath). pdf2image runs on the CPU, not the GPU, and this was 2-3 times faster than the default of 1.
The modified code:
    import os
    from tqdm import tqdm
    from pdf2image import convert_from_path

    datasets = ['ViDoSeek', 'SlideVQA']

    for dataset in datasets:
        root_path = f'./data/{dataset}'
        pdf_path = os.path.join(root_path, 'pdf')

        # Create the image output directory if it does not exist
        img_path = os.path.join(root_path, 'img')
        os.makedirs(img_path, exist_ok=True)  # key fix: create the directory automatically

        pdf_files = [file for file in os.listdir(pdf_path) if file.endswith('.pdf')]
        for filename in tqdm(pdf_files):
            filepath = os.path.join(pdf_path, filename)
            imgname = os.path.splitext(filename)[0]
            images = convert_from_path(filepath, thread_count=16)  # multithreading tested and works
            for i, image in enumerate(images):
                idx = i + 1  # page numbers start at 1
                # Save into the directory created above
                image.save(os.path.join(img_path, f'{imgname}_{idx}.jpg'), 'JPEG')
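
After the conversion finishes, it may be worth checking that every PDF produced at least its first page image. This is a rough sketch rather than part of the repository: pdfs_without_images is a hypothetical helper, and it relies on the {imgname}_{idx}.jpg naming used by the script above.

```python
import os


def pdfs_without_images(root_path):
    """List PDFs under root_path/pdf that have no <name>_1.jpg under root_path/img."""
    pdf_dir = os.path.join(root_path, "pdf")
    img_dir = os.path.join(root_path, "img")
    images = set(os.listdir(img_dir)) if os.path.isdir(img_dir) else set()
    missing = []
    for filename in sorted(os.listdir(pdf_dir)):
        if not filename.endswith(".pdf"):
            continue
        stem = os.path.splitext(filename)[0]
        if f"{stem}_1.jpg" not in images:
            missing.append(filename)
    return missing
```

Running pdfs_without_images("./data/SlideVQA") and getting an empty list back suggests that no PDF was skipped.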

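4. For Dependencies, follow the latest version of the project's instructions. Note that the machine must be able to reach Hugging Face so that the code can automatically download the required base_model and adapter_model.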

Dependencies

Create environment

conda create -n vidorag python=3.10

Clone project

git clone https://github.com/alibaba-nlp/ViDoRAG.git
cd ViDoRAG

Install requirements

pip install -r requirements.txt

5. In ./llms/vl_embedding.py, change the image path passed to embed_img in the if __name__ == "__main__": block; otherwise the example image in the project cannot be found. You can also add a few lines to print the results. The code:

    if __name__ == "__main__":
        colpali = VL_Embedding("vidore/colqwen2-v1.0")
        # image_embeddings = colpali.embed_img("./img/00a76e3a9a36255616e2dc14a6eb5dde598b321f_1.jpg")
        image_embeddings = colpali.embed_img("./data/ExampleDataset/img/00a76e3a9a36255616e2dc14a6eb5dde598b321f_1.jpg")

        text_embeddings = colpali.embed_text("Hello, world!")
        score = colpali.processor.score_multi_vector(image_embeddings, text_embeddings)

        # Print the results for inspection
        print(image_embeddings.shape)  # torch.Size([1, 779, 128])
        print(text_embeddings.shape)  # torch.Size([1, 26, 128])
        print(score)  # tensor([[115.5000]])

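6. Run python ./ingestion.py to perform the ingestion. This part uses the GPU, and processing the ViDoSeek and SlideVQA folders described above takes a long time. I tested with datasets = ['ExampleDataset', 'SlideVQA'] in the code, which mostly means processing the SlideVQA folder: on a V100-32G it ran for over 4 hours and used about 17 GB of GPU memory.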

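The above is shared from my own experience. I used deepseek r1 to help solve some of the problems, which made the process much easier.
The formatting changed after this was posted; copy the text locally for easier reading.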

chuanwei2050 · 2025-03-09T03:17:01Z

test

AaronPiUC · 2025-03-09T11:42:39Z

mark
