from funasr import AutoModel

punc_model = 'local-model-path'
model = AutoModel(model=punc_model, model_revision="v2.0.4")
punc_results = model.generate(input=['when is Interview: The Documentary playing in Loews Cineplex'])
print(punc_results)
The code outputs:
[{'key': 'rand_key_2yW4Acq9GFz6Y', 'text': ' When is I nt er view : The Documentary playing in Loews Cineplex.', 'punc_array': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2])}]
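As the title says. The model used is iic/punc_ct-transformer_cn-en-common-vocab471067-large
Docker image: modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.1.0-py310-torch2.3.0-tf2.16.1-1.18.0
After upgrading funasr inside the container from the original version (1.1.6) to the latest version (funasr==1.1.14), the same problem still occurs.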
🐛 Bug
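The word "Interview" is split apart in the output ("I nt er view")! If the ":" is removed from the original text, the error does not occur, so the colon appears to be the cause.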
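To make the symptom easy to check across many inputs, here is a minimal sketch of a helper that flags input words which no longer appear as whole words in the punctuated output. `find_split_words` is a hypothetical name and not part of funasr. It assumes ASCII English text and compares case-insensitively, ignoring punctuation. The two strings are copied from the report above.

```python
import re

# Hypothetical helper (not part of funasr): report input words that are
# missing as whole words from the punctuated output text.
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def find_split_words(input_text, output_text):
    # Lowercase both sides so capitalization added by the model is ignored.
    out_words = set(WORD_RE.findall(output_text.lower()))
    in_words = WORD_RE.findall(input_text.lower())
    # Any input word absent from the output was altered or split.
    return [w for w in in_words if w not in out_words]

input_text = 'when is Interview: The Documentary playing in Loews Cineplex'
output_text = ' When is I nt er view : The Documentary playing in Loews Cineplex.'
print(find_split_words(input_text, output_text))
```

Running the same check on the output for the input without the colon should return an empty list if the word is left intact, which would help confirm that the colon is the trigger.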
Additional context
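"Interview: The Documentary" in the input text is presumably the name of a TV program. Could it be that the colon changes the token sequence produced by tokenization, and that this in turn causes the word to be split apart when the tokens are converted back to text?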