Text Representations and Interactions
- The History of Information Retrieval Research
- Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval (Dai and Callan, 2019)
- Context-Aware Document Term Weighting for Ad-Hoc Search (Dai and Callan, 2020)
- Learning Passage Impacts for Inverted Indexes (Mallia et al., 2021)
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval (Bai et al., 2020)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (Khattab et al., 2020)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List (Gao et al., 2021)
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval (Formal et al., 2021)
- A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques (Lin and Ma, 2021)
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (Santhanam et al., 2021)
- Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation (Mackenzie et al., 2021)
- Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction (Hofstätter et al., 2022)
- SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval (Choi et al., 2022)
- A Dense Representation Framework for Lexical and Semantic Matching
- castorini/dhr: Dense hybrid representations for text retrieval
- SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval
Search systems typically consist of two stages: Retrieval and Re-ranking. In first-stage retrieval, documents are converted into a sparse (bag-of-words) representation and retrieved via exact lexical matching. The quality of the first-stage retrieval greatly affects overall performance, since it determines what gets re-ranked. However, conventional sparse representations suffer from the vocabulary mismatch problem. Various methods have been proposed to mitigate this issue; many of them are collected in resources such as caiyinqiong/Semantic-Retrieval-Models: A curated list of awesome papers for Semantic Retrieval (TOIS Accepted: Semantic Models for the First-stage Retrieval: A Comprehensive Review). We will cover only a part of them here.
In order for computers to process natural language, texts need to be converted into some numerical form. Assume that we have the following corpus.
ナイキの靴
ナイキのシューズ
アディダスの靴
ユニクロの服
シャープの加湿器
Users enter queries such as "ナイキ", "NIKE", "靴", "シューズ", and so on. We could use regex to retrieve items, but it is slow and only yields binary relevance. Therefore, we must convert the text into an efficient, meaningful representation that is suitable for computation.
To construct this representation, we create a vocabulary that pairs unique tokens with IDs, as follows.
{"ナイキ": 0, "の": 1, "アディダス": 2, "ユニクロ": 3, "靴": 4, "シューズ", 5, "服": 6, "シャープ": 7, "加湿器": 8}
I manually converted the items into both sparse and dense representations using my own heuristics.
id | text | tokens | token_ids | sparse | dense |
---|---|---|---|---|---|
query | "ナイキ" | ["ナイキ"] | [0] | [1, 0, 0, 0, 0, 0, 0, 0, 0] | [0.9, 0.2] |
doc_1 | "ナイキの靴" | ["ナイキ", "の", "靴"] | [0, 1, 4] | [1, 1, 0, 0, 1, 0, 0, 0, 0] | [0.9, 0.1] |
doc_2 | "アディダスの靴" | ["アディダス", "の", "靴"] | [2, 1, 4] | [0, 1, 1, 0, 1, 0, 0, 0, 0] | [0.8, 0.3] |
doc_3 | "ナイキのシューズ" | ["ナイキ", "の", "シューズ"] | [0, 1, 5] | [1, 1, 0, 0, 0, 1, 0, 0, 0] | [0.8, 0.3] |
doc_4 | "ユニクロの服" | ["ユニクロ", "の", "服"] | [3, 1, 6] | [0, 1, 0, 1, 0, 0, 1, 0, 0] | [0.6, 0.5] |
doc_5 | "シャープの加湿器" | ["シャープ", "の", "加湿器"] | [7, 1, 8] | [0, 1, 0, 0, 0, 0, 0, 1, 1] | [-0.6, -0.7] |
There are some interesting distinctions between sparse and dense representations.
First, let's take a look at the traditional text representation.
- This approach is much faster than using regex
- Creating a posting list allows for O(1) lookup (see the sketch after this list)
- The number of dimensions can be enormous
- Retrieval on Sparse Representations is referred to as Sparse Retrieval, in contrast to Dense Retrieval
- This type of representation is also called a bag-of-words (BoW) representation, since it treats phrases such as "John eats apple" and "apple eats John" as equivalent
- Term-matching methods are unable to handle polysemy
- In this representation, "apple" as a company is indistinguishable from "apple" as a fruit.
- "靴" retrieves results such as "ナイキの靴" and "アディダスの靴", but does not include "ナイキのシューズ" even though it contains the word "シューズ" which is semantically equivalent to "靴"
- This is a well-known problem referred to as the "Vocabulary Mismatch" issue
- To address it, synonym expansion techniques have typically been applied
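Here is a toy inverted index over the example corpus that illustrates both the posting-list lookup and the vocabulary-mismatch problem (a minimal sketch; document IDs follow the table above):

```python
from collections import defaultdict

# Pre-tokenized documents (IDs match the representation table above).
docs = {
    "doc_1": ["ナイキ", "の", "靴"],
    "doc_2": ["アディダス", "の", "靴"],
    "doc_3": ["ナイキ", "の", "シューズ"],
    "doc_4": ["ユニクロ", "の", "服"],
    "doc_5": ["シャープ", "の", "加湿器"],
}

# Inverted index: each term maps to a posting list of document IDs.
inverted_index: dict[str, set[str]] = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        inverted_index[token].add(doc_id)

def search(query_tokens: list[str]) -> set[str]:
    """Exact lexical match: union of the posting lists of the query terms."""
    results: set[str] = set()
    for token in query_tokens:
        results |= inverted_index.get(token, set())
    return results

print(search(["靴"]))
# => {"doc_1", "doc_2"} (order may vary)
# "ナイキのシューズ" is missed even though "シューズ" means "靴": vocabulary mismatch
```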
Compared to sparse representations, dense representations have fewer columns and their values are almost always non-zero. Converting sparse data into dense data can be seen as a form of dimensionality reduction, whereby multiple columns in the sparse space can be projected onto the same column in the dense representation.
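As a minimal illustration of that view, a projection matrix maps the 9-dimensional sparse vector to a 2-dimensional dense one (the matrix below is random; in practice the mapping is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# 9-dimensional binary bag-of-words vector for "ナイキの靴"
sparse = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0], dtype=float)

# A 2 x 9 projection matrix; real systems learn this mapping instead of
# sampling it randomly.
W = rng.normal(size=(2, 9))

dense = W @ sparse  # several sparse dimensions contribute to each dense dimension
print(dense.shape)  # => (2,)
```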
The field of semantic matching dates back to the 1980s, with the emergence of Latent Semantic Analysis. This was followed by Word2Vec, FastText, Recurrent Neural Networks (RNNs), and Transformers.
- Values are not restricted to being binary
- The number of dimensions in dense representations is significantly smaller compared to sparse representations
- The number of dimensions in dense representations is arbitrary (say 512 to 768)
- Dense retrieval is expected to improve recall (though it may negatively affect precision)
As mentioned earlier, there are several methods for converting text into dense representations. Let's take BERT as an example. To convert input text into a dense representation, BERT first tokenizes it into a sequence of tokens with special tokens as follows.
tokenize("ナイキの靴")
=> ["[CLS]", "ナイキ", "の", "靴", "[SEP]"]
=> [
[... Token vector ("[CLS]") ...],
[... Token vector ("ナイキ") ...],
[... Token vector ("の") ...],
[... Token vector ("靴") ...],
[... Token vector ("[SEP]") ...],
] (2D vector)
- The special token `[CLS]` represents a high-level summary of the entire input
- The special token `[SEP]` allows us to put multiple sentences into the input, as shown below:
Query: "ナイキ", Doc: "ナイキの靴"
=> ["[CLS]", "ナイキ", "[SEP]", "ナイキ", "の", "靴", "[SEP]"]
BERT produces a dense vector, referred to as an embedding, for each token in the input text. These token vectors encapsulate the meaning of each word while taking its surrounding context into account. As a result, the word "apple" is no longer ambiguous: it can be disambiguated based on its surrounding text.
More precisely, "ナイキの靴" is processed as follows.
from transformers import AutoTokenizer, AutoModel
model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
text = "ナイキの靴
tokenizer.encode()
# => [2, 3977, 28580, 5, 10178, 3]
tokenizer.convert_ids_to_tokens(token_ids)
# => ['[CLS]', 'ナイ', '##キ', 'の', '靴', '[SEP]']
tokens = tokenizer(["ナイキの靴"], return_tensors="pt")
model(**tokens).last_hidden_state
# => tensor([[[-9.3576e-04, -7.4973e-01, 5.8114e-01, ..., -2.4896e-01,
1.9931e-02, 2.8122e-02],
[ 1.9191e-01, 7.5111e-01, 9.1816e-03, ..., 3.1111e-01,
-2.7744e-01, 2.8827e-01],
[-1.5154e-01, -2.0920e-01, 6.2290e-02, ..., -1.0269e+00,
-9.6557e-02, 1.4448e-01],
[ 2.8207e-02, -6.2355e-01, -3.1020e-03, ..., -8.2310e-02,
2.3494e-01, -1.7168e-01],
[ 5.3084e-01, 1.1978e-01, 6.9479e-01, ..., -2.0963e-01,
-7.4948e-01, 5.9649e-02],
[ 5.3614e-01, -4.5566e-01, 5.2827e-01, ..., -1.0647e-01,
-2.3755e-01, 8.1002e-01]]], grad_fn=<NativeLayerNormBackward0>)
model(**tokens).last_hidden_state.shape
# => torch.Size([1, 6, 768])
"ナイキ" is further tokenized into subwords "ナイ" and "##キ".
Two operations are commonly used in information retrieval: max pooling and the dot product.
Max pooling is a pooling operation that takes, for each dimension, the maximum value across a set of vectors (here, token vectors), producing a single downsampled (pooled) vector.
"ナイキの靴"
=> ["ナイキ", "の", "靴"]
=> max_pool([
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0]
]) (2D)
=> [1, 1, 0, 0, 1, 0, 0, 0, 0] (1D)
According to Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors, max pooling can be regarded as an OR operator: max_pooling(["ナイキ", "の", "靴"]) corresponds to "ナイキ" OR "の" OR "靴".
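Here is a minimal NumPy sketch of the pooling step above (element-wise maximum over the token vectors):

```python
import numpy as np

# One-hot vectors for the tokens of "ナイキの靴"
token_vectors = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0],  # ナイキ
    [0, 1, 0, 0, 0, 0, 0, 0, 0],  # の
    [0, 0, 0, 0, 1, 0, 0, 0, 0],  # 靴
])

pooled = token_vectors.max(axis=0)  # maximum over the token axis
print(pooled)
# => [1 1 0 0 1 0 0 0 0]
```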
The dot product is the sum of the products of the corresponding elements of two vectors.
For sparse (binary) representations, it simply counts the number of matching words, as shown below.
Query: "ナイキ", Doc: "ナイキの靴"
=> dot(
Query: [1, 0, 0, 0, 0, 0, 0, 0, 0],
Doc: [1, 1, 0, 0, 1, 0, 0, 0, 0]
)
=> 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 1
Query: "ナイキ", Doc: "シャープの加湿器"
=> dot(
Query: [1, 0, 0, 0, 0, 0, 0, 0, 0],
Doc: [0, 1, 0, 0, 0, 0, 0, 1, 1]
)
=> 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 0
Of course, the dot product can also be applied to dense vectors.
Query: "ナイキ", Doc: "ナイキの靴"
=> dot(
Query: [0.9, 0.2],
Doc: [0.9, 0.1]
)
=> 0.81 + 0.02 = 0.83
Query: "ナイキ", Doc: "シャープの加湿器"
=> dot(
Query: [0.9, 0.2],
Doc: [-0.6, -0.7]
)
=> -0.54 + -0.14 = -0.68
In both representations, f("ナイキ", "ナイキの靴") > f("ナイキ", "シャープの加湿器").
query-doc | Dot Product (sparse) | Dot Product (dense) |
---|---|---|
"ナイキ"-"ナイキの靴" | 1 | 0.83 |
"ナイキ"-"アディダスの靴" | 0 | 0.78 |
"ナイキ"-"ユニクロの服" | 0 | 0.64 |
"ナイキ"-"シャープの加湿器" | 0 | -0.68 |
- Cross-Encoder (All-to-All interaction model)
score = f(encode("[CLS] query [SEP] doc [SEP]"))
- Index Time: Nothing can be precomputed
- Query Time: Run a computationally expensive PLM (Optimized for effectiveness)
- Dense Retriever (Bi-Encoder, Two-Tower Model, Siamese Network)
- Encode a query and a document separately and learn the interaction between them (see the sketch below)
score = f(encode(query), encode(doc))
- Index Time: Encode docs (PLM)
- Query Time: Encode query using a PLM (Hybrid)
- Late Interaction Model
- Unlike the encoders above, uni-encoder models do not run an expensive Transformer at query time. Queries are simply tokenized to retrieve documents from an inverted index, and then simple vector operations are applied to rank them.
score = f(one_hot_encode(query), encode(doc))
- Encode query without using a PLM, run a few matrix operations (Optimized for efficiency)
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
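Here is a minimal bi-encoder (dense retriever) sketch using the same Japanese BERT as in the earlier example: the query and the document are encoded separately, mean-pooled into single vectors, and scored with a dot product. An off-the-shelf BERT is not fine-tuned for retrieval, so the scores are only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def encode(text: str) -> torch.Tensor:
    """Encode a text into a single dense vector by mean-pooling token embeddings."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vecs = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return token_vecs.mean(dim=0)                          # (768,)

query_vec = encode("ナイキ")      # computed at query time
doc_vec = encode("ナイキの靴")    # in practice, precomputed at index time
score = torch.dot(query_vec, doc_vec)
print(float(score))
```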
reranker.rerank(
query="ナイキの靴",
products=["ナイキの靴", "アディダスの靴", "ユニクロの服", "シャープの加湿器"],
)
query-doc | Score |
---|---|
"ナイキ"-"ナイキの靴" | 0.1185 |
"ナイキ"-"アディダスの靴" | 0.116 |
"ナイキ"-"ユニクロの服" | 0.1107 |
"ナイキ"-"シャープの加湿器" | 0.0852 |
flowchart LR
DOTS[...]
DOTS2[...]
TV[token vecs]
SI[stopwords importance]
MM[MatMul]
TV2[token vecs]
DOTS --> TV -- StopwordReducer --> SI
SI --> MM
TV --> MM
MM --> TV2 --> DOTS2
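Here is a minimal sketch of what the diagram depicts. The StopwordReducer name comes from the diagram; the importance scores below are placeholders where a trained model would predict real values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Contextualized token vectors for ["ナイキ", "の", "靴"]; random placeholders here.
token_vecs = rng.normal(size=(3, 768))

# Per-token importance predicted by a module like the "StopwordReducer" above;
# a particle such as "の" would receive a low weight.
importance = np.array([1.0, 0.1, 0.9])

# "MatMul" in the diagram: scale each token vector by its importance.
weighted_token_vecs = np.diag(importance) @ token_vecs
print(weighted_token_vecs.shape)  # => (3, 768)
```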
term_importance_estimator.estimate(
"Duduma 偏光 レンズ メンズスポーツサングラス 超軽量 UV400 紫外線をカット / 自転車/釣り/野球/テニス/ゴルフ/レース/ランニング/ドライブ T90 (ブラックマットフレーム/ブラックレンズ)"
)
![](https://user-images.githubusercontent.com/883148/212504635-51fd750a-1a8b-48a4-a2d0-9a84d1604f2c.png)
From https://github.com/naver/splade
flowchart LR
OV[one-hot vector]
SV[sparse vector]
DP[Dot Product]
Query -- Tokenizer --> OV --> DP
Doc -- BERT --> SV --> DP
DP --> Score
where
reranker.rerank(
query="ナイキの靴",
products=["ナイキの靴", "アディダスの靴", "ユニクロの服", "シャープの加湿器"],
)
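Here is a minimal sketch of the flow in the diagram above: the document is mapped to a sparse vector over the BERT vocabulary (using a SPLADE-style log(1 + ReLU(logits)) activation with max pooling over the sequence, following https://github.com/naver/splade), while the query is only tokenized into a multi-hot bag of token IDs; the score is their dot product. The base Japanese BERT is not trained for this objective, so the scores are only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def encode_doc(text: str) -> torch.Tensor:
    """Doc side: MLM logits -> log(1 + ReLU) -> max pool over tokens (SPLADE-style)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]        # (seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))     # sparse-friendly activations
    return weights.max(dim=0).values              # (vocab_size,)

def encode_query(text: str) -> torch.Tensor:
    """Query side: no PLM, just a multi-hot bag of token IDs."""
    vec = torch.zeros(tokenizer.vocab_size)
    ids = tokenizer.encode(text, add_special_tokens=False)
    vec[ids] = 1.0
    return vec

score = torch.dot(encode_query("ナイキ"), encode_doc("ナイキの靴"))
print(float(score))
```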
flowchart LR
S[Source]
F[Filter]
GT[Generated Tokens]
L[logits]
SE[Search Engine]
subgraph Indexing
direction LR
Doc -- BERT --> L --> F --> GT
end
S --> Doc
GT --> SE
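Here is a minimal sketch of the indexing flow above: the document is run through a masked-language model, the logits are max-pooled and filtered down to the top-k tokens that do not already appear in the document, and those generated tokens would be appended to the document before it is sent to the search engine. The filtering rule below is my own assumption for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def generate_expansion_tokens(doc: str, top_k: int = 5) -> list[str]:
    inputs = tokenizer(doc, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]        # (seq_len, vocab_size)
    scores = logits.max(dim=0).values             # max-pool over the sequence
    original_ids = set(inputs["input_ids"][0].tolist())
    generated: list[str] = []
    for token_id in scores.argsort(descending=True).tolist():
        if token_id not in original_ids:          # Filter: keep only new tokens
            generated.append(tokenizer.convert_ids_to_tokens(token_id))
        if len(generated) == top_k:
            break
    return generated

# The generated tokens would be appended to the document and sent to the search engine.
print(generate_expansion_tokens("ナイキの靴"))
```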
token_generator.generate(
"[ナイキ] メンズ ランニングシューズ スニーカー エアマックスSC 通気性 クッション性 カジュアル デイリー スポーツ ウォーキング AIR MAX SC CW4555"
)
![](https://user-images.githubusercontent.com/883148/213895868-cd5ccad2-3ff9-49d0-87da-5ecb9024f510.png)