Punctuation restoration model splits words apart when adding English punctuation during inference #2194

Open
bigcash opened this issue Nov 6, 2024 · 0 comments
Labels
bug Something isn't working

Comments

bigcash commented Nov 6, 2024

As the title says. The model used is iic/punc_ct-transformer_cn-en-common-vocab471067-large
Docker image: modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.1.0-py310-torch2.3.0-tf2.16.1-1.18.0
Upgrading funasr inside the container (originally 1.1.6) to the latest version, funasr==1.1.14, does not fix the problem.

🐛 Bug

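from funasr import AutoModel

# Placeholder for the local path of the downloaded punctuation model
punc_model = 'local-model-path'
model = AutoModel(model=punc_model, model_revision="v2.0.4")
# The input contains a colon directly after "Interview"
punc_results = model.generate(input=['when is Interview: The Documentary playing in Loews Cineplex'])
print(punc_results)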

The code outputs:
[{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': ' When is I nt er view : The Documentary playing in Loews Cineplex.', 'punc_array': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2])}]

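The word "Interview" is split apart in the output ("I nt er view"). If the ":" is removed from the original text, the error does not occur, so the colon appears to be the cause.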
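To make this kind of regression easy to spot, here is a minimal sketch of a check that compares the input text with the restored text and lists input words that no longer appear as whole tokens. The helper name `find_split_words` is hypothetical and not part of funasr; the two strings are copied from the report above.

```python
import string


def find_split_words(source: str, restored: str) -> list[str]:
    """Return source words that are missing as whole tokens in the restored text.

    Hypothetical helper for illustration; not part of funasr.
    """
    def normalize(text: str) -> list[str]:
        # Lowercase each whitespace token and strip punctuation from its ends.
        tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
        # Drop tokens that were pure punctuation, such as a lone ":".
        return [tok for tok in tokens if tok]

    restored_tokens = set(normalize(restored))
    return [word for word in normalize(source) if word not in restored_tokens]


# Input and output strings taken from the report.
source = "when is Interview: The Documentary playing in Loews Cineplex"
restored = " When is I nt er view : The Documentary playing in Loews Cineplex."

print(find_split_words(source, restored))
```

For the strings in this report, the only word flagged is "interview". The check only compares whitespace-delimited tokens, so it would need adjusting for Chinese or mixed-language input.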

Additional context

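"Interview: The Documentary" in the input text is presumably the title of a TV program. Could it be that including the colon changes the token sequence produced by tokenization, which then causes the word to be split when the tokens are converted back to text?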
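The data already in the report is consistent with this hypothesis. The sketch below only counts tokens: it assumes (this is not confirmed from the funasr source) that `punc_array` holds one label per token the model processed, and it copies the tensor values into a plain list so the snippet runs without torch.

```python
# Input and output strings taken from the report.
source = "when is Interview: The Documentary playing in Loews Cineplex"
restored = " When is I nt er view : The Documentary playing in Loews Cineplex."

# punc_array copied from the reported output, as a plain list instead of a tensor.
punc_array = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]

source_words = source.split()      # whitespace-delimited words in the input
restored_tokens = restored.split() # whitespace-delimited tokens in the output

print(len(source_words))     # words in the input
print(len(restored_tokens))  # tokens in the output
print(len(punc_array))       # labels returned by the model
```

The input has 9 whitespace-delimited words, while both the output text and `punc_array` have 13 entries. The four extra entries match "Interview:" becoming the five pieces "I", "nt", "er", "view", ":", which suggests the split happens before the punctuation labels are predicted rather than when the text is reassembled.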

bigcash added the bug label Nov 6, 2024