From d7660438686166a1a9806ae9784f617603bfef42 Mon Sep 17 00:00:00 2001
From: ninehills <ninehills@users.noreply.github.com>
Date: Thu, 28 Dec 2023 05:07:04 +0000
Subject: [PATCH] [create-pull-request] automated change

---
 README.md       |  1 +
 articles/111.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)
 create mode 100644 articles/111.md

diff --git a/README.md b/README.md
index e0a9c66..17a2adf 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,6 @@
 # 九原山
 ## Posts
+- #111 [中文 Emebedding & Reranker 模型选型](articles/111.md) 2023-12-28 `blog`
 - #109 [Gemini Pro Vision 作为 表格 OCR 解决方案的简单测试](articles/109.md) 2023-12-20 `blog`
 - #107 [大语言模型（LLM）推理性能优化以及推理框架、后端的评测](articles/107.md) 2023-12-19 `blog`
 - #104 [Embedding 模型在 RAG 场景下的评估和微调](articles/104.md) 2023-11-03 `blog`
diff --git a/articles/111.md b/articles/111.md
new file mode 100644
index 0000000..9f415ed
--- /dev/null
+++ b/articles/111.md
@@ -0,0 +1,97 @@
+# 中文 Emebedding & Reranker 模型选型
+
+> Author: **ninehills**  
+> Labels: **blog**  
+> Created: **2023-12-28T04:58:44Z**  
+> Link and comments: <https://github.com/ninehills/blog/issues/111>  
+
+
+### 结论
+
+选型建议：
+
+1. 大部分模型的序列长度是 512 tokens。 8192 可尝试 tao-8k，1024 可尝试 stella。
+2. 在专业数据领域上，嵌入模型的表现不如 BM25，但是微调可以大大提升效果。
+3. 有微调需求且对模型训练了解较少的，建议选择 bge 系列（完善的训练脚本、负例挖掘等）。但多数模型都基于BERT，训练脚本也通用，其他模型也可以参考。
+4. 重排模型选择很少，推荐使用 bge-reranker，也支持微调。reranker 模型因为单次输入较多，只能通过 GPU 部署。
+
+
+## Embedding 模型
+
+### PEG
+
+作者：腾讯
+模型地址： https://huggingface.co/TownsWu/PEG
+论文： https://arxiv.org/pdf/2311.11691.pdf
+
+重点优化检索能力。
+### GTE 系列
+
+作者：阿里巴巴
+模型地址： https://huggingface.co/thenlper/gte-large-zh
+论文： https://arxiv.org/abs/2308.03281
+
+### picolo 系列
+
+作者：商汤
+地址： https://huggingface.co/sensenova/piccolo-large-zh
+
+有一些微调的小tips
+
+### stella 系列
+
+地址：https://huggingface.co/infgrad/stella-large-zh-v2
+博客文章： https://zhuanlan.zhihu.com/p/655322183
+
+基于piccolo 模型fine-tuning，支持1024 序列长度。博客文章记录了一些训练思路。
+
+### BGE 系列
+
+
+作者：智源研究院
+地址：https://huggingface.co/BAAI/bge-large-zh-v1.5
+论文：https://arxiv.org/pdf/2309.07597.pdf
+Github：https://github.com/FlagOpen/FlagEmbedding
+
+开放信息最多的模型，也提供了fine-tuning 示例代码。同时也是 C-MTEB 榜单的维护者。
+
+### m3e 系列
+
+作者：MokaAI
+地址：https://huggingface.co/moka-ai/m3e-large
+Github：https://github.com/wangyuxinwhy/uniem
+
+研究的比较早，算是中文通用 Embedding 模型、数据集以及评测比较早的开拓者。
+
+### multilingual-e5-large
+
+地址：https://huggingface.co/intfloat/multilingual-e5-large
+论文：https://arxiv.org/pdf/2212.03533.pdf
+
+多语言支持。
+
+### tao-8k 
+
+地址： https://huggingface.co/amu/tao-8k
+
+支持8192 序列长度，但是信息很少。
+
+## Reranker 模型
+
+### bge-reranker 系列
+
+作者：智源研究院
+地址：https://huggingface.co/BAAI/bge-reranker-large
+Github：[GitHub - FlagOpen/FlagEmbedding: Dense Retrieval and Retrieval-augmented LLMs](https://github.com/FlagOpen/FlagEmbedding)
+
+基于 [xlm-roberta](https://huggingface.co/models?other=xlm-roberta) 模型。
+### alime-reranker-large-zh
+
+地址： https://huggingface.co/Pristinenlp/alime-reranker-large-zh
+
+信息很少。也是基于 [xlm-roberta](https://huggingface.co/models?other=xlm-roberta) 模型。
+
+
+### C-MTEB 
+
+我们只关心 Rerank 和 Retrieval 评测，结果见 [mteb](https://huggingface.co/spaces/mteb/leaderboard)