forked from ninehills/blog
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[create-pull-request] automated change
- Loading branch information
Showing
2 changed files
with
98 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# 中文 Emebedding & Reranker 模型选型 | ||
|
||
> Author: **ninehills** | ||
> Labels: **blog** | ||
> Created: **2023-12-28T04:58:44Z** | ||
> Link and comments: <https://github.com/ninehills/blog/issues/111> | ||
|
||
### 结论 | ||
|
||
选型建议: | ||
|
||
1. 大部分模型的序列长度是 512 tokens。 8192 可尝试 tao-8k,1024 可尝试 stella。 | ||
2. 在专业数据领域上,嵌入模型的表现不如 BM25,但是微调可以大大提升效果。 | ||
3. 有微调需求且对模型训练了解较少的,建议选择 bge 系列(完善的训练脚本、负例挖掘等)。但多数模型都基于BERT,训练脚本也通用,其他模型也可以参考。 | ||
4. 重排模型选择很少,推荐使用 bge-reranker,也支持微调。reranker 模型因为单次输入较多,只能通过 GPU 部署。 | ||
|
||
|
||
## Embedding 模型 | ||
|
||
### PEG | ||
|
||
作者:腾讯 | ||
模型地址: https://huggingface.co/TownsWu/PEG | ||
论文: https://arxiv.org/pdf/2311.11691.pdf | ||
|
||
重点优化检索能力。 | ||
### GTE 系列 | ||
|
||
作者:阿里巴巴 | ||
模型地址: https://huggingface.co/thenlper/gte-large-zh | ||
论文: https://arxiv.org/abs/2308.03281 | ||
|
||
### picolo 系列 | ||
|
||
作者:商汤 | ||
地址: https://huggingface.co/sensenova/piccolo-large-zh | ||
|
||
有一些微调的小tips | ||
|
||
### stella 系列 | ||
|
||
地址:https://huggingface.co/infgrad/stella-large-zh-v2 | ||
博客文章: https://zhuanlan.zhihu.com/p/655322183 | ||
|
||
基于piccolo 模型fine-tuning,支持1024 序列长度。博客文章记录了一些训练思路。 | ||
|
||
### BGE 系列 | ||
|
||
|
||
作者:智源研究院 | ||
地址:https://huggingface.co/BAAI/bge-large-zh-v1.5 | ||
论文:https://arxiv.org/pdf/2309.07597.pdf | ||
Github:https://github.com/FlagOpen/FlagEmbedding | ||
|
||
开放信息最多的模型,也提供了fine-tuning 示例代码。同时也是 C-MTEB 榜单的维护者。 | ||
|
||
### m3e 系列 | ||
|
||
作者:MokaAI | ||
地址:https://huggingface.co/moka-ai/m3e-large | ||
Github:https://github.com/wangyuxinwhy/uniem | ||
|
||
研究的比较早,算是中文通用 Embedding 模型、数据集以及评测比较早的开拓者。 | ||
|
||
### multilingual-e5-large | ||
|
||
地址:https://huggingface.co/intfloat/multilingual-e5-large | ||
论文:https://arxiv.org/pdf/2212.03533.pdf | ||
|
||
多语言支持。 | ||
|
||
### tao-8k | ||
|
||
地址: https://huggingface.co/amu/tao-8k | ||
|
||
支持8192 序列长度,但是信息很少。 | ||
|
||
## Reranker 模型 | ||
|
||
### bge-reranker 系列 | ||
|
||
作者:智源研究院 | ||
地址:https://huggingface.co/BAAI/bge-reranker-large | ||
Github:[GitHub - FlagOpen/FlagEmbedding: Dense Retrieval and Retrieval-augmented LLMs](https://github.com/FlagOpen/FlagEmbedding) | ||
|
||
基于 [xlm-roberta](https://huggingface.co/models?other=xlm-roberta) 模型。 | ||
### alime-reranker-large-zh | ||
|
||
地址: https://huggingface.co/Pristinenlp/alime-reranker-large-zh | ||
|
||
信息很少。也是基于 [xlm-roberta](https://huggingface.co/models?other=xlm-roberta) 模型。 | ||
|
||
|
||
### C-MTEB | ||
|
||
我们只关心 Rerank 和 Retrieval 评测,结果见 [mteb](https://huggingface.co/spaces/mteb/leaderboard) |