Chinese Embedding & Reranker Model Selection #111
Conclusion
Selection recommendations:
Embedding Models
PEG
Author: Tencent
Model: https://huggingface.co/TownsWu/PEG
Paper: https://arxiv.org/pdf/2311.11691.pdf
Focuses on optimizing retrieval performance.
GTE series
Author: Alibaba
Model: https://huggingface.co/thenlper/gte-large-zh
Paper: https://arxiv.org/abs/2308.03281
piccolo series
Author: SenseTime
Model: https://huggingface.co/sensenova/piccolo-large-zh
Comes with some fine-tuning tips.
stella series
Model: https://huggingface.co/infgrad/stella-large-zh-v2
Blog post: https://zhuanlan.zhihu.com/p/655322183
Fine-tuned from the piccolo model; supports a 1024-token sequence length. The blog post documents some of the training approach.
BGE series
Author: BAAI (Beijing Academy of Artificial Intelligence)
Model: https://huggingface.co/BAAI/bge-large-zh-v1.5
Paper: https://arxiv.org/pdf/2309.07597.pdf
GitHub: https://github.com/FlagOpen/FlagEmbedding
The most openly documented of these models, with fine-tuning example code included. BAAI also maintains the C-MTEB leaderboard.
m3e series
Author: MokaAI
Model: https://huggingface.co/moka-ai/m3e-large
GitHub: https://github.com/wangyuxinwhy/uniem
One of the earliest efforts in this space; arguably a pioneer of general-purpose Chinese embedding models, datasets, and evaluation.
multilingual-e5-large
Model: https://huggingface.co/intfloat/multilingual-e5-large
Paper: https://arxiv.org/pdf/2212.03533.pdf
Multilingual support.
tao-8k
Model: https://huggingface.co/amu/tao-8k
Supports an 8192-token sequence length, but little information is available about it.
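Whichever embedding model is chosen, first-stage retrieval works the same way: encode the query and documents into vectors, then rank by cosine similarity. The sketch below illustrates this with toy NumPy vectors standing in for real model output (in practice you would obtain the vectors from one of the models above, e.g. via the sentence-transformers library); the function itself is a generic similarity search, not any model's official API.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query.

    Embedding-based retrieval typically L2-normalizes vectors first,
    so cosine similarity reduces to a dot product.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]  # indices of the k highest scores
    return top, scores[top]

# Toy 4-dimensional "embeddings" standing in for real model output.
docs = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.0, 1.0, 0.1, 0.0],
    [0.8, 0.2, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])
idx, scores = cosine_top_k(query, docs, k=2)
print(idx, scores)  # most similar document indices first
```

Because the document vectors can be computed and indexed offline, this bi-encoder stage is cheap at query time; its candidates are what a reranker then refines.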
Reranker Models
bge-reranker series
Author: BAAI
Model: https://huggingface.co/BAAI/bge-reranker-large
GitHub: https://github.com/FlagOpen/FlagEmbedding
Based on the xlm-roberta model.
alime-reranker-large-zh
地址: https://huggingface.co/Pristinenlp/alime-reranker-large-zh
信息很少。也是基于 xlm-roberta 模型。
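A reranker fits into the pipeline as a second stage: it scores each (query, candidate) pair jointly and re-orders the first-stage candidates. The sketch below shows the pipeline shape only; the word-overlap scorer is a deliberately trivial stand-in for a real cross-encoder such as bge-reranker-large, which would be far more accurate (at higher latency, since it cannot precompute document representations).

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order first-stage candidates using a (query, doc) pair scorer.

    In production, score_fn would wrap a cross-encoder reranker model;
    here it is any callable returning a relevance score.
    """
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]

def overlap_score(query, doc):
    # Toy stand-in scorer: count of words shared by query and document.
    return len(set(query.split()) & set(doc.split()))

candidates = [
    "gte large zh model card",
    "bge reranker large is a cross encoder",
    "cooking recipes for dinner",
]
best = rerank("bge reranker cross encoder", candidates, overlap_score, top_n=1)
print(best[0])
```

The design point is the division of labor: the embedding model keeps recall high over the whole corpus, and the reranker spends its heavier per-pair computation only on the short candidate list.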
C-MTEB
We only care about the Reranking and Retrieval evaluations; see the MTEB leaderboard for results.
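When comparing leaderboard numbers or running your own evaluation, it helps to know how standard retrieval metrics are computed. The sketch below implements two common ones, reciprocal rank and recall@k, for a single query (benchmarks like C-MTEB average such per-query metrics over the whole query set; the exact metric each task reports may differ).

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top-k results."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(reciprocal_rank(ranked, relevant))   # first relevant hit at rank 2 -> 0.5
print(recall_at_k(ranked, relevant, 3))    # 1 of 2 relevant docs in top 3 -> 0.5
```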