Text Representations and Interactions
- The History of Information Retrieval Research
- Context-Aware Sentence/Passage Term Importance Estimation For First Stage Retrieval (Dai and Callan, 2019)
- Context-Aware Document Term Weighting for Ad-Hoc Search (Dai and Callan, 2020)
- Learning Passage Impacts for Inverted Indexes (Mallia et al., 2021)
- SparTerm: Learning Term-based Sparse Representation for Fast Text Retrieval (Bai et al., 2020)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (Khattab et al., 2020)
- COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List (Gao et al., 2021)
- SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval (Formal et al., 2021)
- A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques (Lin and Ma, 2021)
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (Santhanam et al., 2021)
- Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation (Mackenzie et al., 2021)
- Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction (Hofstätter et al., 2022)
- SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval (Choi et al., 2022)
- A Dense Representation Framework for Lexical and Semantic Matching
- castorini/dhr: Dense hybrid representations for text retrieval
- SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval
Search systems typically consist of two stages: Retrieval and Re-ranking. In first-stage retrieval, documents are converted into a sparse (bag-of-words) representation and retrieved via exact lexical matching. The quality of the first-stage retrieval greatly affects overall performance, since it determines what gets re-ranked. However, conventional sparse representations suffer from the vocabulary mismatch problem. Various methods have been proposed to mitigate this issue; many of them are collected in resources such as caiyinqiong/Semantic-Retrieval-Models: A curated list of awesome papers for Semantic Retrieval (TOIS Accepted: Semantic Models for the First-stage Retrieval: A Comprehensive Review). We will cover only a part of them here.
In order for computers to process natural language, texts need to be converted into some numerical form. Assume that we have the following corpus.
ナイキの靴
ナイキのシューズ
アディダスの靴
ユニクロの服
シャープの加湿器
Users enter queries such as "ナイキ", "NIKE", "靴", "シューズ", and so on. We could use regex to retrieve items, but it is slow and only yields binary relevance. Therefore, we must convert the text into an efficient, meaningful representation that is suitable for computation.
To construct this representation, we create a vocabulary that pairs unique tokens with IDs, as follows.
{"ナイキ": 0, "の": 1, "アディダス": 2, "ユニクロ": 3, "靴": 4, "シューズ", 5, "服": 6, "シャープ": 7, "加湿器": 8}
I manually converted the items into both sparse and dense representations using my own heuristics.
id | text | tokens | token_ids | sparse | dense |
---|---|---|---|---|---|
query | "ナイキ" | ["ナイキ"] | [0] | [1, 0, 0, 0, 0, 0, 0, 0, 0] | [0.9, 0.2] |
doc_1 | "ナイキの靴" | ["ナイキ", "の", "靴"] | [0, 1, 4] | [1, 1, 0, 0, 1, 0, 0, 0, 0] | [0.9, 0.1] |
doc_2 | "アディダスの靴" | ["アディダス", "の", "靴"] | [2, 1, 4] | [0, 1, 1, 0, 1, 0, 0, 0, 0] | [0.8, 0.3] |
doc_3 | "ナイキのシューズ" | ["ナイキ", "の", "シューズ"] | [0, 1, 5] | [1, 1, 0, 0, 0, 1, 0, 0, 0] | [0.8, 0.3] |
doc_4 | "ユニクロの服" | ["ユニクロ", "の", "服"] | [3, 1, 6] | [0, 1, 0, 1, 0, 0, 1, 0, 0] | [0.6, 0.5] |
doc_5 | "シャープの加湿器" | ["シャープ", "の", "加湿器"] | [7, 1, 8] | [0, 1, 0, 0, 0, 0, 0, 1, 1] | [-0.6, -0.7] |
There are some interesting distinctions between sparse and dense representations.
First, let's take a look at the traditional text representation.
- This approach is much faster than using regex
- Creating a posting list allows for O(1) lookup (see the sketch after this list)
- The number of dimensions can be enormous
- Retrieval on Sparse Representations is referred to as Sparse Retrieval, in contrast to Dense Retrieval
- This type of representation is also called a bag-of-words (BoW) representation, since it treats phrases such as "John eats apple" and "apple eats John" as equivalent
- Term-matching methods are unable to handle polysemy
- In this representation, "apple" as a company is indistinguishable from "apple" as a fruit.
- "靴" retrieves results such as "ナイキの靴" and "アディダスの靴", but does not include "ナイキのシューズ" even though it contains the word "シューズ" which is semantically equivalent to "靴"
- This is a well-known problem referred to as the "Vocabulary Mismatch" issue
- To address it, synonym expansion techniques have typically been applied
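Here is a toy inverted index over the example corpus that illustrates both the posting-list lookup and the vocabulary-mismatch problem (a minimal sketch; document IDs follow the table above):

```python
from collections import defaultdict

# Pre-tokenized documents (IDs match the representation table above).
docs = {
    "doc_1": ["ナイキ", "の", "靴"],
    "doc_2": ["アディダス", "の", "靴"],
    "doc_3": ["ナイキ", "の", "シューズ"],
    "doc_4": ["ユニクロ", "の", "服"],
    "doc_5": ["シャープ", "の", "加湿器"],
}

# Inverted index: each term maps to a posting list of document IDs.
inverted_index: dict[str, set[str]] = defaultdict(set)
for doc_id, tokens in docs.items():
    for token in tokens:
        inverted_index[token].add(doc_id)

def search(query_tokens: list[str]) -> set[str]:
    """Exact lexical match: union of the posting lists of the query terms."""
    results: set[str] = set()
    for token in query_tokens:
        results |= inverted_index.get(token, set())
    return results

print(search(["靴"]))
# => {"doc_1", "doc_2"} (order may vary)
# "ナイキのシューズ" is missed even though "シューズ" means "靴": vocabulary mismatch
```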
Compared to sparse representations, dense representations have fewer columns and their values are almost always non-zero. Converting sparse data into dense data can be seen as a form of dimensionality reduction, whereby multiple columns in the sparse space can be projected onto the same column in the dense representation.
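As a minimal illustration of that view, a projection matrix maps the 9-dimensional sparse vector to a 2-dimensional dense one (the matrix below is random; in practice the mapping is learned):

```python
import numpy as np

rng = np.random.default_rng(0)

# 9-dimensional binary bag-of-words vector for "ナイキの靴"
sparse = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0], dtype=float)

# A 2 x 9 projection matrix; real systems learn this mapping instead of
# sampling it randomly.
W = rng.normal(size=(2, 9))

dense = W @ sparse  # several sparse dimensions contribute to each dense dimension
print(dense.shape)  # => (2,)
```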
The field of semantic matching dates back to the 1980s, with the emergence of Latent Semantic Analysis. This was followed by Word2Vec, FastText, Recurrent Neural Networks (RNNs), and Transformers.
- Values are not restricted to being binary
- The number of dimensions in dense representations is significantly smaller compared to sparse representations
- The number of dimensions in dense representations is arbitrary (say 512 to 768)
- Dense retrieval is expected to improve recall (though it may negatively affect precision)
As mentioned earlier, there are several methods for converting text into dense representations. Let's take BERT as an example. To convert input text into a dense representation, BERT first tokenizes it into a sequence of tokens with special tokens as follows.
tokenize("ナイキの靴")
=> ["[CLS]", "ナイキ", "の", "靴", "[SEP]"]
=> [
[... Token vector ("[CLS]") ...],
[... Token vector ("ナイキ") ...],
[... Token vector ("の") ...],
[... Token vector ("靴") ...],
[... Token vector ("[SEP]") ...],
] (2D vector)
- The special token `[CLS]` represents a high-level summary of the entire input
- The special token `[SEP]` allows us to put multiple sentences into the input, as shown below:
Query: "ナイキ", Doc: "ナイキの靴"
=> ["[CLS]", "ナイキ", "[SEP]", "ナイキ", "の", "靴", "[SEP]"]
BERT produces a dense vector, referred to as an embedding, for each token in the input text. These token vectors encapsulate the meaning of each word while taking its surrounding context into account. As a result, the word "apple" is no longer ambiguous: it can be disambiguated based on its surrounding text.
More precisely, "ナイキの靴" is processed as follows.
from transformers import AutoTokenizer, AutoModel
model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
text = "ナイキの靴
tokenizer.encode()
# => [2, 3977, 28580, 5, 10178, 3]
tokenizer.convert_ids_to_tokens(token_ids)
# => ['[CLS]', 'ナイ', '##キ', 'の', '靴', '[SEP]']
tokens = tokenizer(["ナイキの靴"], return_tensors="pt")
model(**tokens).last_hidden_state
# => tensor([[[-9.3576e-04, -7.4973e-01, 5.8114e-01, ..., -2.4896e-01,
1.9931e-02, 2.8122e-02],
[ 1.9191e-01, 7.5111e-01, 9.1816e-03, ..., 3.1111e-01,
-2.7744e-01, 2.8827e-01],
[-1.5154e-01, -2.0920e-01, 6.2290e-02, ..., -1.0269e+00,
-9.6557e-02, 1.4448e-01],
[ 2.8207e-02, -6.2355e-01, -3.1020e-03, ..., -8.2310e-02,
2.3494e-01, -1.7168e-01],
[ 5.3084e-01, 1.1978e-01, 6.9479e-01, ..., -2.0963e-01,
-7.4948e-01, 5.9649e-02],
[ 5.3614e-01, -4.5566e-01, 5.2827e-01, ..., -1.0647e-01,
-2.3755e-01, 8.1002e-01]]], grad_fn=<NativeLayerNormBackward0>)
model(**tokens).last_hidden_state.shape
# => torch.Size([1, 6, 768])
"ナイキ" is further tokenized into subwords "ナイ" and "##キ".
Two operations are commonly used in information retrieval: max pooling and the dot product.
Max pooling is a pooling operation that takes, for each dimension, the maximum value across a set of vectors (here, token vectors), producing a single downsampled (pooled) vector.
"ナイキの靴"
=> ["ナイキ", "の", "靴"]
=> max_pool([
[1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0]
]) (2D)
=> [1, 1, 0, 0, 1, 0, 0, 0, 0] (1D)
According to Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors, max pooling can be regarded as an OR operator: max_pooling(["ナイキ", "の", "靴"]) corresponds to "ナイキ" OR "の" OR "靴".
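Here is a minimal NumPy sketch of the pooling step above (element-wise maximum over the token vectors):

```python
import numpy as np

# One-hot vectors for the tokens of "ナイキの靴"
token_vectors = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 0],  # ナイキ
    [0, 1, 0, 0, 0, 0, 0, 0, 0],  # の
    [0, 0, 0, 0, 1, 0, 0, 0, 0],  # 靴
])

pooled = token_vectors.max(axis=0)  # maximum over the token axis
print(pooled)
# => [1 1 0 0 1 0 0 0 0]
```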
The dot product is the sum of the products of the corresponding elements of two vectors.
For sparse (binary) representations, it simply counts the number of matching words, as shown below.
Query: "ナイキ", Doc: "ナイキの靴"
=> dot(
Query: [1, 0, 0, 0, 0, 0, 0, 0, 0],
Doc: [1, 1, 0, 0, 1, 0, 0, 0, 0]
)
=> 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 1
Query: "ナイキ", Doc: "シャープの加湿器"
=> dot(
Query: [1, 0, 0, 0, 0, 0, 0, 0, 0],
Doc: [0, 1, 0, 0, 0, 0, 0, 1, 1]
)
=> 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 0
Of course, the dot product can also be applied to dense vectors.
Query: "ナイキ", Doc: "ナイキの靴"
=> dot(
Query: [0.9, 0.2],
Doc: [0.9, 0.1]
)
=> 0.81 + 0.02 = 0.83
Query: "ナイキ", Doc: "シャープの加湿器"
=> dot(
Query: [0.9, 0.2],
Doc: [-0.6, -0.7]
)
=> -0.54 + -0.14 = -0.68
In both representations, f("ナイキ", "ナイキの靴") > f("ナイキ", "シャープの加湿器").
query-doc | Dot Product (sparse) | Dot Product (dense) |
---|---|---|
"ナイキ"-"ナイキの靴" | 1 | 0.83 |
"ナイキ"-"アディダスの靴" | 0 | 0.78 |
"ナイキ"-"ユニクロの服" | 0 | 0.64 |
"ナイキ"-"シャープの加湿器" | 0 | -0.68 |
- Cross-Encoder (All-to-All interaction model)
score = f(encode("[CLS] query [SEP] doc [SEP]"))
- Index Time: Nothing can be precomputed
- Query Time: Run a computationally expensive PLM (Optimized for effectiveness)
- Dense Retriever (Bi-Encoder, Two-Tower Model, Siamese Network)
- Encode a query and a document separately and learn the interaction between them (see the sketch below)
score = f(encode(query), encode(doc))
- Index Time: Encode docs (PLM)
- Query Time: Encode query using a PLM (Hybrid)
- Late Interaction Model
- Unlike the encoders above, uni-encoder models do not run an expensive Transformer at query time. Queries are simply tokenized to retrieve documents from an inverted index, and then simple vector operations are applied to rank them.
score = f(one_hot_encode(query), encode(doc))
- Encode query without using a PLM, run a few matrix operations (Optimized for efficiency)
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
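Here is a minimal bi-encoder (dense retriever) sketch using the same Japanese BERT as in the earlier example: the query and the document are encoded separately, mean-pooled into single vectors, and scored with a dot product. An off-the-shelf BERT is not fine-tuned for retrieval, so the scores are only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def encode(text: str) -> torch.Tensor:
    """Encode a text into a single dense vector by mean-pooling token embeddings."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        token_vecs = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return token_vecs.mean(dim=0)                          # (768,)

query_vec = encode("ナイキ")      # computed at query time
doc_vec = encode("ナイキの靴")    # in practice, precomputed at index time
score = torch.dot(query_vec, doc_vec)
print(float(score))
```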
reranker.rerank(
query="ナイキの靴",
products=["ナイキの靴", "アディダスの靴", "ユニクロの服", "シャープの加湿器"],
)
query-doc | Score |
---|---|
"ナイキ"-"ナイキの靴" | 0.1185 |
"ナイキ"-"アディダスの靴" | 0.116 |
"ナイキ"-"ユニクロの服" | 0.1107 |
"ナイキ"-"シャープの加湿器" | 0.0852 |
flowchart LR
DOTS[...]
DOTS2[...]
TV[token vecs]
SI[stopwords importance]
MM[MatMul]
TV2[token vecs]
DOTS --> TV -- StopwordReducer --> SI
SI --> MM
TV --> MM
MM --> TV2 --> DOTS2
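Here is a minimal sketch of what the diagram depicts. The StopwordReducer name comes from the diagram; the importance scores below are placeholders where a trained model would predict real values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Contextualized token vectors for ["ナイキ", "の", "靴"]; random placeholders here.
token_vecs = rng.normal(size=(3, 768))

# Per-token importance predicted by a module like the "StopwordReducer" above;
# a particle such as "の" would receive a low weight.
importance = np.array([1.0, 0.1, 0.9])

# "MatMul" in the diagram: scale each token vector by its importance.
weighted_token_vecs = np.diag(importance) @ token_vecs
print(weighted_token_vecs.shape)  # => (3, 768)
```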
term_importance_estimator.estimate(
"Duduma 偏光 レンズ メンズスポーツサングラス 超軽量 UV400 紫外線をカット / 自転車/釣り/野球/テニス/ゴルフ/レース/ランニング/ドライブ T90 (ブラックマットフレーム/ブラックレンズ)"
)
![](https://user-images.githubusercontent.com/883148/212504635-51fd750a-1a8b-48a4-a2d0-9a84d1604f2c.png)
From https://github.com/naver/splade
flowchart LR
OV[one-hot vector]
SV[sparse vector]
DP[Dot Product]
Query -- Tokenizer --> OV --> DP
Doc -- BERT --> SV --> DP
DP --> Score
where
reranker.rerank(
query="ナイキの靴",
products=["ナイキの靴", "アディダスの靴", "ユニクロの服", "シャープの加湿器"],
)
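Here is a minimal sketch of the flow in the diagram above: the document is mapped to a sparse vector over the BERT vocabulary (using a SPLADE-style log(1 + ReLU(logits)) activation with max pooling over the sequence, following https://github.com/naver/splade), while the query is only tokenized into a multi-hot bag of token IDs; the score is their dot product. The base Japanese BERT is not trained for this objective, so the scores are only illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def encode_doc(text: str) -> torch.Tensor:
    """Doc side: MLM logits -> log(1 + ReLU) -> max pool over tokens (SPLADE-style)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]        # (seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))     # sparse-friendly activations
    return weights.max(dim=0).values              # (vocab_size,)

def encode_query(text: str) -> torch.Tensor:
    """Query side: no PLM, just a multi-hot bag of token IDs."""
    vec = torch.zeros(tokenizer.vocab_size)
    ids = tokenizer.encode(text, add_special_tokens=False)
    vec[ids] = 1.0
    return vec

score = torch.dot(encode_query("ナイキ"), encode_doc("ナイキの靴"))
print(float(score))
```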
flowchart LR
S[Source]
F[Filter]
GT[Generated Tokens]
L[logits]
SE[Search Engine]
subgraph Indexing
direction LR
Doc -- BERT --> L --> F --> GT
end
S --> Doc
GT --> SE
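Here is a minimal sketch of the indexing flow above: the document is run through a masked-language model, the logits are max-pooled and filtered down to the top-k tokens that do not already appear in the document, and those generated tokens would be appended to the document before it is sent to the search engine. The filtering rule below is my own assumption for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def generate_expansion_tokens(doc: str, top_k: int = 5) -> list[str]:
    inputs = tokenizer(doc, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0]        # (seq_len, vocab_size)
    scores = logits.max(dim=0).values             # max-pool over the sequence
    original_ids = set(inputs["input_ids"][0].tolist())
    generated: list[str] = []
    for token_id in scores.argsort(descending=True).tolist():
        if token_id not in original_ids:          # Filter: keep only new tokens
            generated.append(tokenizer.convert_ids_to_tokens(token_id))
        if len(generated) == top_k:
            break
    return generated

# The generated tokens would be appended to the document and sent to the search engine.
print(generate_expansion_tokens("ナイキの靴"))
```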
token_generator.generate(
"[ナイキ] メンズ ランニングシューズ スニーカー エアマックスSC 通気性 クッション性 カジュアル デイリー スポーツ ウォーキング AIR MAX SC CW4555"
)
![](https://user-images.githubusercontent.com/883148/213895868-cd5ccad2-3ff9-49d0-87da-5ecb9024f510.png)