Punctuation restoration model splits English words apart when adding punctuation during inference #2194

bigcash · 2024-11-06T09:11:08Z

As the title says. The model used is iic/punc_ct-transformer_cn-en-common-vocab471067-large.
Docker image: modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.1.0-py310-torch2.3.0-tf2.16.1-1.18.0
Inside the container I upgraded funasr from the original version 1.1.6 to the latest release, funasr==1.1.14, and the problem still occurs.

🐛 Bug

from funasr import AutoModel

# Local path to the downloaded model (placeholder)
punc_model = 'local-model-path'
model = AutoModel(model=punc_model, model_revision="v2.0.4")
# The input contains a colon directly after "Interview"
punc_results = model.generate(input=['when is Interview: The Documentary playing in Loews Cineplex'])
print(punc_results)

The code prints:
[{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': ' When is I nt er view : The Documentary playing in Loews Cineplex.', 'punc_array': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2])}]

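The word "Interview" is split apart in the output! If the ":" is removed from the input text, the error does not occur, so the colon appears to be the trigger.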
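Since removing the colon avoids the split, one possible stopgap is to strip punctuation that is attached to the end of a word before calling `model.generate`. The sketch below is only an illustration: `strip_trailing_punct` is a hypothetical helper, not part of funasr, and it assumes that dropping the existing punctuation is acceptable for the use case (the model is expected to add punctuation itself).

```python
import re

# Hypothetical helper, not part of funasr.
# Matches runs of common punctuation that sit at the end of a word,
# i.e. are followed by whitespace or the end of the string.
_TRAILING_PUNCT = re.compile(r"[,.:;!?]+(?=\s|$)")


def strip_trailing_punct(text: str) -> str:
    """Remove word-final punctuation and collapse repeated whitespace."""
    without_punct = _TRAILING_PUNCT.sub("", text)
    return " ".join(without_punct.split())


cleaned = strip_trailing_punct(
    "when is Interview: The Documentary playing in Loews Cineplex"
)
print(cleaned)
```

The cleaned string can then be passed as the `input` of `model.generate` in place of the raw text. Punctuation inside a token, such as the dot in "3.5", is left untouched.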

Additional context

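The "Interview: The Documentary" in the input text is presumably the title of a TV program. Could the colon change the token sequence produced during tokenization, which then causes the word to be split when the tokens are converted back to text?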
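To see which inputs trigger the problem, it may help to compare each input with the returned `text` and list the words that no longer appear whole. The following is a minimal sketch; `find_split_words` is a hypothetical helper, and it only detects the symptom, it does not confirm that tokenization is the cause.

```python
import re


def find_split_words(input_text: str, output_text: str) -> list[str]:
    """Return input words that no longer appear as whole words in the output."""
    # Compare case-insensitively on alphanumeric runs only, so that added
    # punctuation and capitalisation changes are ignored.
    output_words = set(re.findall(r"[A-Za-z0-9]+", output_text.lower()))
    missing = []
    for word in re.findall(r"[A-Za-z0-9]+", input_text):
        if word.lower() not in output_words:
            missing.append(word)
    return missing


# Input and output text copied from this issue.
source = "when is Interview: The Documentary playing in Loews Cineplex"
result = " When is I nt er view : The Documentary playing in Loews Cineplex."
print(find_split_words(source, result))
```

Running this over a batch of sentences with and without attached punctuation would show whether the split only occurs when a mark such as ":" directly follows a word.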

bigcash added the bug (Something isn't working) label on Nov 6, 2024
