
msc placeholder: LLM as a Google alternative #7438

Open
synctext opened this issue May 24, 2023 · 21 comments

@synctext
Member

synctext commented May 24, 2023

brainstorm Survey+thesis
1 MSc course left for Q1. Did ML course and has industry Kubernetes experience. Prior Google Summer of Code experience. Python == main working language. Possibly: https://bitcoinlib.readthedocs.io/ on Python side 💶 and LLM/semantic search from systems side for ECTS 🏫
survey ideas: guide to cloud-free local-first LLM. Both training, re-training, and inference.
Thesis could go into numerous directions

  • LLM cheap task re-tuning
  • LLM on smartphones
  • LLM+fact loading and the semantic gap
  • LLM with decentralised learning
  • P2P-LLM without any security
  • P2P-LLM with MeritRank trust framework
@kandrio

kandrio commented Jul 6, 2023

We currently want to decide about two things:

  1. literature survey
  2. thesis direction

Ultimately, injecting databases to LLMs seems really interesting to me. I like the idea of extending LLMs with fact loading and enabling them to reference their sources. Therefore, this kind of direction seems perfect for the thesis. What do you think?

@synctext what would be an ideal literature survey topic to help me gain knowledge towards that direction?

survey ideas: guide to cloud-free local-first LLM. Both training, re-training, and inference.

The above proposal looks interesting. However, I don't understand why you linked to the Bitcoinlib docs. Any papers you could point me to for the survey?

@InvictusRMC
Member

Hey Rowdy here, great that you'll be helping out. The superapp is, frankly speaking, a bit of a mess. Please reach out to me by email ([email protected]) to arrange a meeting to discuss the superapp. The last time we can have a face-to-face meeting is the 18th of July, after that, it'll have to be remote.

The current suspects of causing issues within the superapp:

  • Threading: the coroutine usage is suboptimal; IPv8 is slow & messages get lost due to buffer overflows.
  • Unused code: old (slow) student projects that are hugely impacting performance are bloating the app

Also, there are no e2e tests: we could use Espresso tests for the app.

@synctext synctext changed the title msc placeholder: systems person with ML expertise msc placeholder: LLM as a Google alternative Jul 12, 2023
@synctext
Member Author

synctext commented Jul 12, 2023

Discussed focus of survey, summer job, and thesis. Let's do Kotlin 🚀

@kandrio

kandrio commented Jul 17, 2023

I created a parent issue just for my summer work on the superapp:

From now on, I'll be exposing my findings and progress regarding the superapp there.

@kandrio

kandrio commented Jul 19, 2023

Papers I found on data-augmentation of GPT LLMs:

  1. Ghazvininejad et al., 2017: https://arxiv.org/pdf/2302.12813.pdf#page=11&zoom=100,401,665
  2. Dinan et al., 2018 (using Wikipedia articles): https://arxiv.org/pdf/2302.12813.pdf#page=11&zoom=100,401,252
  3. Shuster et al., 2022 (using web-search): https://arxiv.org/pdf/2302.12813.pdf#page=12&zoom=100,401,841
  4. Peng et al., 2022 (unstructured knowledge): https://arxiv.org/pdf/2302.12813.pdf#page=12&zoom=100,401,304

@kandrio

kandrio commented Aug 10, 2023

Literature Survey: Augmenting LLMs with Knowledge Retrieval

Overleaf Project: https://www.overleaf.com/read/fwyqhjskmdrc

I've been reading through a number of papers, the most recent one being: Internet-Augmented Dialogue Generation, by Facebook AI Research. This paper proposes a system that combines:

  1. Retrieval-augmented Generation, and
  2. Search Engine Augmented Generation

It provides a nice overview of different methods of knowledge retrieval (using neural networks and an unstructured knowledge base), and it also cites the original papers:

I plan to read through these papers by August 20th and write informative summaries for each of the methods.

One paper that summarizes all of the above (FiD and RAG) is:

There are also a number of papers talking about augmenting LLMs with a structured knowledge base (graph):

Google Bard

Google's AI experiment is called Bard. It uses knowledge retrieval and it is inspired by the following two papers:

@kandrio

kandrio commented Aug 13, 2023

Summary of paper about RAG (Retrieval Augmented Generation): Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020)

Preliminaries

seq2seq models

A seq2seq model predicts the probability of the next token, given an input sequence of words.

It consists of:

  • an encoder and
  • a decoder.

The encoder reads the input sequence one timestep at a time and produces a fixed-dimensional vector representation of the entire sequence. This vector is called a context vector and it contains all the information from the input sequence. The context vector is then passed to the decoder, which generates the output sequence one timestep at a time.

Beam Search

Beam Search is a heuristic search algorithm that explores a graph G by expanding only the K (beam width) most promising nodes at each step. Beam Search simulates the behavior of Breadth-First Search. Specifically:

  • It uses BFS to create a search tree.
  • At each level of the tree, it checks all the successors of the current level and keeps only the top K ones, while pruning the others.
  • The process repeats until the height of the tree is reached.

Beam Search in NLP: When using seq2seq models, we use Beam Search to find the sequence y that is most likely to follow an input sequence x. In mathematical notation, the probability we aim to maximize factorizes by the chain rule:

  • $p(y | x) = p(y_n | x, y_{1:n-1}) \cdot p(y_{1:n-1} | x) = p(y_n | x, y_{1:n-1}) \cdot p(y_{n-1} | x, y_{1:n-2}) \cdots p(y_1 | x)$
    Instead of keeping only the single most probable output token at each step, we keep the K most probable partial sequences and extend each of them until it reaches an <EOS> token. Then, we choose the sequence y (out of the K candidates) that maximizes p(y|x).
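To make the procedure concrete, here is a minimal sketch of Beam Search over a toy next-token model. The lookup table `TOY_MODEL`, its probabilities, and the function names are made up for illustration; a real seq2seq model would condition on the whole prefix and on the input x.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the last token.
TOY_MODEL = {
    "<BOS>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<EOS>": 0.2},
    "a": {"cat": 0.3, "dog": 0.6, "<EOS>": 0.1},
    "cat": {"<EOS>": 1.0},
    "dog": {"<EOS>": 1.0},
}


def next_token_probs(prefix):
    """Return p(next token | prefix) for the toy model."""
    return TOY_MODEL[prefix[-1]]


def beam_search(k, max_len=10):
    """Keep the k most probable partial sequences at every step."""
    beams = [(0.0, ["<BOS>"])]  # (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for token, p in next_token_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [token]))
        # Prune: only the k best candidates survive this level.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for logp, seq in candidates[:k]:
            (finished if seq[-1] == "<EOS>" else beams).append((logp, seq))
        if not beams:
            break
    best_logp, best_seq = max(finished or beams, key=lambda c: c[0])
    return best_seq, math.exp(best_logp)


sequence, probability = beam_search(k=2)
print(sequence, round(probability, 3))
```

Working in log-probabilities turns the product in the chain rule into a sum, which avoids numerical underflow for long sequences.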

Dense vector index

In a vector database, a document can correspond to one vector or many vectors, depending on the specific implementation of the database. A single vector captures the overall meaning of the document. This is often done by averaging the vectors of the words in the document. In other cases, a document may be represented by a vector for each word in the document. This is often done when it is important to be able to track the individual words in the document.

Indexing in a vector database is the process of organizing the vectors in the database in a way that makes it efficient to search for similar vectors. This is done by creating a data structure that maps each vector to a set of other vectors that are similar to it.

Top-K Sampling

This paper uses top-K sampling on the retriever side. This means that, instead of choosing only the document in the knowledge base that is most similar to the input, we take the K most similar documents and feed each one of them into the encoder.

Overview

The paper uses a pre-trained seq2seq model (BART) as the parametric memory (knowledge stored in the parameters of the model). BART is pre-trained on a large text corpus with a denoising objective and can be fine-tuned for text generation tasks such as question answering. The model's knowledge is stored in its parameters, which are a set of weights that are learned during training.

Additionally, the paper uses a dense vector index of Wikipedia as the non-parametric memory (knowledge stored in an indexed database). The Wikipedia index is a large database of text that has been pre-processed and indexed. This allows the RAG model to quickly retrieve relevant passages from Wikipedia. The system:

  1. uses the Inner Product to calculate the similarity between the given query and each passage in the database.
  2. gets the top-K similar passages.
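The two steps above can be sketched as an exhaustive inner-product search. The three-dimensional vectors and passage ids below are invented for illustration; real systems use learned embeddings with hundreds of dimensions and an approximate index instead of a full scan.

```python
import heapq


def inner_product(u, v):
    """Similarity score between two dense vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_passages(query_vec, passage_vecs, k):
    """Score every passage against the query and keep the k best.

    This is a linear scan, i.e. exactly the cost that MIPS indexes try to avoid.
    """
    scored = ((inner_product(query_vec, vec), pid) for pid, vec in passage_vecs.items())
    return heapq.nlargest(k, scored)


# Hypothetical passage embeddings.
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.1],
    "p3": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

top2 = top_k_passages(query, passages, k=2)
print(top2)
```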

These passages are then used to augment the model's knowledge, which allows it to generate more accurate and informative responses.

In summary, the RAG model encodes the input query, uses it to retrieve relevant passages from the non-parametric memory, and conditions the generator (the parametric memory) on those passages to produce more accurate and informative responses.

Components

Knowledge Base (Wikipedia)

The indexed (for fast retrieval) knowledge base serves as the aggregation of knowledge that the RAG model possesses.

Retriever (BERT)

The (pretrained) Retriever component solves the Maximum Inner Product Search (MIPS) problem: it finds the list of k documents with the highest similarity to the input query x. The documents are encoded as dense vectors by a BERT-base document encoder (BERTd), stored in an index, and compared with the BERTq vector of the input query. Because the index can be extremely large, computing the inner product of the query embedding with every document would be too slow; approximate MIPS algorithms avoid this by running in sublinear time.

NOTE: According to the authors of the paper, the training of the parameters of the BERTd encoder is costly and not very effective accuracy-wise. Therefore, during the fine-tuning stage, they only fine-tune the parameters of the query encoder BERTq.

Generator (BART)

The (pretrained) Generator component is a BART seq2seq model that receives the input query x and the list of retrieved documents z, and generates a response text. During training, the BART generator is fine-tuned. This paper proposes two different implementations for the Generator:

  • RAG-token
  • RAG-sequence

IMPORTANT: Both the retriever and the generator are pre-trained. The authors chose to update these two components only during the fine-tuning stage (end-to-end). Later on, we will analyze a paper called REALM which was the first that proposed end-to-end training of a similar retriever-generator architecture.

RAG-token

The RAG-token model takes into account all of the retrieved documents (separately) in order to generate each token of a sequence. It uses Beam Search to transition from token to token and, at each step i, it:

  1. calculates, for each retrieved document $z$, the probability of each token in the vocabulary being the next token in the sequence: $p_{\theta}(y_i | x, z, y_{1:i-1})$.
  2. calculates the transition probability for each token in the vocabulary by summing over the retrieved documents (marginalization): $p_{\theta}^{'}(y_i | x, y_{1:i-1}) = \sum_{z} {p_{\eta}(z | x) \cdot p_{\theta}(y_i | x, z, y_{1:i-1})}$, where $p_{\eta}(z | x)$ is the retriever's probability of document $z$.
  3. runs Beam Search by choosing the K next tokens ($y_i$) with the highest transition probability.
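The marginalization step can be illustrated with a tiny numeric sketch. The document ids, the retriever probabilities, and the per-document token distributions below are all invented; in RAG they would come from the retriever and from BART respectively.

```python
def marginal_next_token(doc_probs, token_probs_per_doc):
    """Sum p(z | x) * p(token | x, z, prefix) over all retrieved documents z."""
    marginal = {}
    for doc_id, p_doc in doc_probs.items():
        for token, p_token in token_probs_per_doc[doc_id].items():
            marginal[token] = marginal.get(token, 0.0) + p_doc * p_token
    return marginal


# Hypothetical retriever scores for two retrieved documents.
doc_probs = {"z1": 0.7, "z2": 0.3}
# Hypothetical generator distributions over a two-word vocabulary, one per document.
token_probs_per_doc = {
    "z1": {"Paris": 0.8, "Lyon": 0.2},
    "z2": {"Paris": 0.5, "Lyon": 0.5},
}

marginal = marginal_next_token(doc_probs, token_probs_per_doc)
print(marginal)
```

The resulting distribution still sums to one, so it can be used directly as the transition probability inside Beam Search.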

RAG-sequence

The RAG-sequence model conditions each generated sequence on a single retrieved document. Specifically, for each retrieved document, it uses Beam Search to generate K candidate sequences. Then, it scores the candidates by marginalizing over the retrieved documents and returns the sequence with the highest probability.

@kandrio

kandrio commented Aug 14, 2023

Summary of Paper about FiD (Fusion in Decoder): Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Preliminaries

Generative Models vs Extractive Models

Generative Models are trained to produce new text. They do this by learning the statistical relationships between words and phrases in a large corpus of text. When given a prompt, a generative model will try to produce text that is consistent with the statistical patterns it has learned.

NOTE: The authors of this paper interestingly found that generative models become more accurate as the number of retrieved passages increases, contrary to extractive models.

Extractive Models are trained to find specific pieces of information in a text, for example to answer a question or to identify the main points of a passage. When given a query, an extractive model will return the parts of the text (spans) that it believes are relevant to the query.

Spans

Spans are pieces of text that are likely to be the answer to a question. For example, if the question is "What is the color of the cat?", an extractive model might extract the span: "The cat is black" as the answer.

Overview

Overall, the idea behind this paper is quite similar to the idea behind RAG (#7438 (comment)), but with a twist...

Again, we have two main components:

  • the Retriever: uses DPR (just like RAG) to represent passages as dense BERT vectors and retrieves the most relevant passages with an approximate nearest-neighbor search implemented with the FAISS library.
  • the Generator model: a seq2seq model that receives the input query and retrieved passages to produce accurate answers.

The main difference between FiD and RAG is that:

  • in RAG, they perform fusion of the knowledge in the retrieved documents after the decoder: they produce predictions for the next token separately for each document and then they perform marginalization to find the most likely next token among all documents.
  • in FiD, they perform fusion of the knowledge in the retrieved documents before the decoder: they concatenate the input query with each retrieved passage and they separately feed each concatenation to the encoder. After that, they concatenate all the produced context vectors together (fusion) before feeding them to the decoder which performs attention across all documents (cross-attention).
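A shape-level sketch of the FiD fusion step follows. The function `toy_encode` is a made-up stand-in for a Transformer encoder, and the `question:` / `context:` prefixes are an assumption about the input format; only the "encode separately, then concatenate" structure is the point.

```python
def toy_encode(tokens):
    """Hypothetical encoder: returns one 2-d vector per input token."""
    return [[float(len(token)), float(position)] for position, token in enumerate(tokens)]


def fid_encoder_outputs(question, passages):
    """Encode each (question, passage) pair independently, then concatenate.

    The decoder would cross-attend over the returned list, so it sees every
    passage at once even though the encoder never mixed them.
    """
    fused = []
    for passage in passages:
        tokens = ("question: " + question + " context: " + passage).split()
        fused.extend(toy_encode(tokens))
    return fused


passages = [
    "Hamlet is a play by Shakespeare",
    "Shakespeare was born in Stratford",
]
fused = fid_encoder_outputs("who wrote Hamlet", passages)
print(len(fused))
```

Because each pair is encoded on its own, the encoder cost grows linearly with the number of passages, while the decoder still attends jointly over all of them.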

@kandrio

kandrio commented Aug 21, 2023

Augmenting LLMs with Knowledge Graphs

Graft-Net

Preliminaries

Question Subgraphs

A question subgraph is a subgraph of the knowledge base in which we have pruned the irrelevant (to a given question) nodes and edges. In addition, we have pruned the irrelevant documents as well, and we keep the ones that are likely to contain the answer.

The Knowledge Base

Triplestore Knowledge Base

A Triplestore knowledge base is a database that consists of subject-predicate-object triples. An example of such a triple is: (Subject: Albert Einstein, Predicate: was born in, Object: Ulm, Germany). Triples are a great form of representing factual knowledge because they capture the nature of the relationship between a subject and an object and can be easily processed by LLMs. We can view this Knowledge Base as a graph whose vertices are the various subjects and objects (entities) and the predicates are the edges between these entities. Each edge has a type that describes the kind of the relation between the connected entities.
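As a small illustration, triples can be stored as plain tuples and turned into an adjacency list whose edges carry the predicate as their type. Only the first triple comes from the text above; the other two are made-up examples.

```python
from collections import defaultdict

# (subject, predicate, object) triples; the last two are hypothetical.
triples = [
    ("Albert Einstein", "was born in", "Ulm, Germany"),
    ("Albert Einstein", "field", "Physics"),
    ("Ulm, Germany", "located in", "Germany"),
]


def build_graph(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


graph = build_graph(triples)
print(graph["Albert Einstein"])
```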

Text Corpus

A text corpus D is a set of documents {d1, . . . , d|D|} where each document is a sequence of words di = (w1, . . . , w|di|). Specifically, in the context of this paper, a document is essentially a sentence, and an article is a collection of documents.

NOTE: It has a similar structure to the knowledge-base from RAG or FiD.

Entity Linking

We assume that there is a set L of links (v, dp) connecting entity v with a word at position p, in document d.

Graph Convolutional Network (GCN)

GCNs are great for classification of nodes in a graph-structured knowledge base. Here's how a GCN works for an input graph:

  1. For each node, collect the embeddings of all its neighbors
  2. Average these embeddings into one embedding
  3. Use that embedding as input to a neural network layer (matrix multiplication + non-linearity)
  4. Produce an output embedding for each node
  5. Repeat for the next layer.

NOTE: The more layers the GCN has, the more multi-hop reasoning the model can perform, because each additional layer gathers information from neighbors one hop further away.
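The steps above can be sketched in plain Python. This is a simplified mean-aggregation layer with a ReLU non-linearity, not the exact formulation of any particular paper; the three-node graph, the embeddings, and the identity weight matrix are invented for the example.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def gcn_layer(embeddings, neighbors, weights):
    """One layer: average the neighbor embeddings, then apply weights and ReLU."""
    output = {}
    for node, own in embeddings.items():
        nbrs = neighbors[node]
        if nbrs:
            mean = [sum(embeddings[n][d] for n in nbrs) / len(nbrs) for d in range(len(own))]
        else:
            mean = own  # isolated node: fall back to its own embedding
        output[node] = [max(0.0, value) for value in matvec(weights, mean)]
    return output


# Hypothetical path graph a - b - c.
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
identity = [[1.0, 0.0], [0.0, 1.0]]

h1 = gcn_layer(embeddings, neighbors, identity)
h2 = gcn_layer(h1, neighbors, identity)  # second layer: node "a" now sees node "c"
print(h1, h2)
```

After the second layer, the embedding of `a` depends on `c`, which is two hops away; this is the multi-hop effect described in the note.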

Relational GCN

One problem arises when the knowledge-base graph is heterogeneous (more than one type of relation between entities). In that case, we want to take into consideration the type of relation that a node has with its neighbors before we average the embeddings. A relational GCN is similar to a regular GCN, but it uses a separate matrix for each type of relation. Therefore, when using a relational GCN, we aggregate the embeddings from all neighbors with a specific relation and we pass the averaged embedding through a separate layer for each relation.
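A minimal sketch of the relational variant: neighbors are grouped by relation type, each group is averaged and multiplied by its own weight matrix, and the per-relation results are summed. The relation names, embeddings, and weight matrices are hypothetical, and normalisation details differ between papers.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def rgcn_layer(embeddings, typed_neighbors, weights_by_relation, dim_out):
    """One relational layer: a separate weight matrix per relation type."""
    output = {}
    for node in embeddings:
        total = [0.0] * dim_out
        for relation, nbrs in typed_neighbors.get(node, {}).items():
            if not nbrs:
                continue
            dim_in = len(embeddings[nbrs[0]])
            mean = [sum(embeddings[n][d] for n in nbrs) / len(nbrs) for d in range(dim_in)]
            transformed = matvec(weights_by_relation[relation], mean)
            total = [t + v for t, v in zip(total, transformed)]
        output[node] = [max(0.0, value) for value in total]  # ReLU
    return output


# Hypothetical graph with two relation types.
embeddings = {"e1": [1.0, 0.0], "e2": [0.0, 1.0], "e3": [1.0, 1.0]}
typed_neighbors = {"e1": {"born_in": ["e2"], "works_at": ["e3"]}}
weights_by_relation = {
    "born_in": [[1.0, 0.0], [0.0, 1.0]],
    "works_at": [[2.0, 0.0], [0.0, 2.0]],
}

out = rgcn_layer(embeddings, typed_neighbors, weights_by_relation, dim_out=2)
print(out["e1"])
```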

Lucene

Lucene is a Java library created by Apache that facilitates data search in a large corpus of text.

Overview

Question Subgraph Retrieval

The retrieval of the question subgraph, Gq happens in two parallel pipelines:

  1. Knowledge Base Retrieval
  2. Text Retrieval

Knowledge Base Retrieval

During the knowledge base retrieval, we retrieve a subgraph of the triplestore knowledge base as follows:

  1. First, given the question q, we retrieve a set of seed entities, Sq that are relevant to the question.
  2. Then, we run the Personalized PageRank (PPR) method (Haveliwala, 2002) around these seeds to identify other entities which might be an answer to the question. During PPR, we assign weights to edges around the seed entities. Each edge weight is essentially the cosine similarity between:
    • the question vector, v(q): average of all word vectors in the question
    • the relation vector, v(r): average of all word vectors in the relation corresponding to that edge
  3. In the end, we retain the top E entities v1, . . . , vE by PPR score, along with any edges between them, and add them to the question subgraph, Gq.
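A minimal sketch of Personalized PageRank by power iteration is shown below. It assumes directed edges with strictly positive weights (in GRAFT-Net these would be derived from the question/relation similarity); the graph, the weights, the damping factor `alpha`, and the choice to return dangling mass to the seeds are all assumptions made for the example.

```python
def personalized_pagerank(edges, seeds, alpha=0.85, iterations=50):
    """edges: list of (source, target, weight) with weight > 0."""
    nodes = sorted({n for source, target, _ in edges for n in (source, target)})
    out_weight = {n: 0.0 for n in nodes}
    for source, _, weight in edges:
        out_weight[source] += weight
    # Random restarts always jump back to the seed entities.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for source, target, weight in edges:
            nxt[target] += alpha * score[source] * weight / out_weight[source]
        # Nodes without outgoing edges hand their mass back to the seeds.
        dangling = sum(score[n] for n in nodes if out_weight[n] == 0.0)
        for n in nodes:
            nxt[n] += alpha * dangling * restart[n]
        score = nxt
    return score


# Hypothetical weighted knowledge-base edges around one seed entity.
edges = [
    ("Albert Einstein", "Ulm", 0.9),
    ("Albert Einstein", "Physics", 0.1),
    ("Ulm", "Germany", 1.0),
]
scores = personalized_pagerank(edges, seeds={"Albert Einstein"})
top_entities = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_entities)
```

Keeping the top E entries of `scores` corresponds to step 3 of the list above.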
Text Retrieval

During the text retrieval phase, we retrieve documents (sentences) relevant to the question from the Wikipedia database. The text retrieval phase entails the following steps:

  1. First, we retrieve the top 5 most relevant Wikipedia articles. An article is a collection of documents (sentences). For that task, we use the weighted bag-of-words model from DrQA.
  2. Then, we populate a Lucene index with sentences from these articles, and retrieve the top ranking ones d1, ..., dD.

The Final Question Graph

The final question graph Gq consists of:

  • Vq: all retrieved entities and documents
  • Eq: all relations between the retrieved entities and all entity links between entities and documents

NOTE: Because the vertices of the graph can be either entities or documents, the graph is considered heterogeneous.

Overview of Graft-Net

Graft-Net consists of the following stages:

  1. The Question Subgraph Retrieval stage. This is a characteristic of early fusion: the process of combining information from the knowledge base and text early in the model, i.e., before the graph neural network is used.
  2. The answer selection stage, where they use a GCN variant (1, 2, 3) to do binary classification (answer, not-answer) on the nodes of the subgraph.

Pull-Net

Pull-Net uses the text corpus to supplement information extracted from the Triplestore in order to answer multi-hop questions. The subjects and objects in the triples contain links to relevant documents in the text corpus. PullNet uses these links to produce more factually-based answers.

Like GRAFT-Net, Pull-Net has an initial phase where it retrieves a question subgraph Gq. However, Pull-Net learns how to construct the subgraph, rather than using an ad-hoc subgraph-building strategy. More specifically, PullNet relies on a small set of retrieval operations, each of which expands a graph node by retrieving new information from the knowledge base or the corpus. PullNet learns when and where to apply these “pull” operations with another graph CNN classifier. The “pull” classifier is weakly supervised, using question-answer pairs.

The end result is a learned iterative process for subgraph construction, which begins with a small subgraph containing only the question text and the entities which it contains, and gradually expands the subgraph to contain information from the knowledge base and corpus that are likely to be useful. The process is especially effective for multi-hop questions.

@synctext
Member Author

synctext commented Aug 21, 2023

Note the mission of the lab is new fundamental theory, with practical grounding (re-invent The Web, Web3). This means we are not interested in new machine learning theory. It is a tool which failed us in 2005, and now finally might become production usable in 2028. We have now several phd and msc students active on Machine learning:

"Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback", very fascinating literature. All very detailed stuff and high-performance. Totally unsuitable for decentralised context with 1-2 billion connected smartphones with 8 cores each on average = 8-16 billion embedded CPU cores 😲
The Web3 context will take 8-10 years to mature: your thesis can be that cardinal starting point. Show that it can be done and scale infinitely. No data center, Beyond Federated Learning to gossip learning with trust. Augmentation of knowledge by trustworthy users is probably the first-principle operand.
For your thesis: decide how much distributed systems stuff is in there. Continuous LLM augmentation with a trust function (e.g. keeping up with Wikipedia edits on news stories) or the first fully self-organising LLM with self-evolution. Avoid competition with mainstream Big Tech labs, be Web3 ?

brainstorm

For achieving superhuman intelligence we need to invent a paradigm for storing all human knowledge and making it accessible for artificial reasoning engines or language models. @kandrio original thought, LLMs are simply too huge to work with practically. If we are able to split the facts and the language model part we enable further growth. The mixing of knowledge and language is sub-optimal. We only need a new model of intelligence to fix this 😄 Bridge the semantic gap. Another old problem known for decades is the problem of ambiguity and synonyms when adding new facts. Just adding a fact also implies embedding it and adding metadata. Establishing global consensus on The Internet on facts is notoriously hard. We failed to solve digital democracy on fact writing. Crowdsourcing LLM augmentation is unsolved. Metadata pollution will severely cripple your system performance, see the detailed overlapping issue of Is Justin Bieber Gay?. Currently the humans working at OpenAI decide on 4Chan/Reddit filtering versus unfiltered inclusion into their LLM. These OpenAI developers can also decide to feed live events into their LLM using an unfiltered Twitter feed: real-time event awareness.

Taxonomy of LLM augmentation. Explosion of a new topic which is only 3-4 years old. Lots of papers which build upon each other. Earliest paper is 2019! Title could be: LLM augmentation: a survey on this explosion. Superior to a taxonomy table is a "tree of knowledge evolution". More sensational survey title or grander scope: gathering all human knowledge for augmenting LLM with facts: a survey

update "LLM @ Android" Already very challenging and very sufficient for a TUDelft master thesis. Can you do minimal TFLite finetuning with size of LLM? "On-device LLM finetuning"

@kandrio

kandrio commented Sep 11, 2023

@kandrio

kandrio commented Sep 11, 2023

Atlas (next generation of RAG): Few-shot Learning with Retrieval Augmented Language Models (2022)

Atlas is essentially the next generation of RAG, for few-shot learning tasks.

When performing a task, from question answering to generating Wikipedia articles, Atlas starts by retrieving the top-k relevant documents from a large corpus of text with the retriever. Then, these documents are fed to the language model, along with the query, which in turn generates the output. Both the retriever and the language model are based on pre-trained transformer networks.

Atlas consists of:

  • a retriever: (based on the Contriever by Izacard et al., 2022) returns the top-k relevant documents based on their similarity with the query (dot-product between the query and document embeddings).
  • a generator: uses a T5 seq2seq model (ref) and employs the FiD technique that processes each document separately in the encoder and concatenates the embeddings before they enter the decoder.

Retriever

Like RAG, it entails a BERTq and a BERTd encoder. Unlike RAG, during fine-tuning of the retriever, Atlas trains both BERTq and BERTd (not only BERTq). Hence, the BERTd embeddings of all documents in the index need to be regularly re-computed so that they stay in sync with the updated BERTd. This is a computationally expensive task.

IMPORTANT: Atlas proposes jointly pre-training both the retriever and the generator model (similar to REALM) unlike RAG which uses pre-trained models and trains end-to-end only during fine-tuning.

@kandrio

kandrio commented Sep 12, 2023

REALM: Retrieval-Augmented Language Model Pre-Training (2020)

The first method to pre-train jointly the retriever and the generator. REALM uses an architecture that we've seen before (in RAG, FiD), but proposes a pre-training technique that yields great models.

Components

Just like RAG, we have two main components:

  • the Retriever consists of two sub-models
    • a BERT-based document encoder, Embeddoc
    • a BERT-based query encoder, Embedinput
  • the Generator (called the knowledge-augmented encoder in the paper) is a BERT-style model that predicts the answer for the masked query.

In REALM, all of the above models are trained during pre-training.

Initialization

At the beginning of training, if the retriever does not have good embeddings for Embedinput(x) and Embeddoc(z), the retrieved documents, z will likely be unrelated to x. This causes the generator to learn to ignore the retrieved documents. Once this occurs, the retriever does not receive a meaningful gradient and cannot improve, creating a vicious cycle.

To avoid this cold-start problem, the authors warm-start the retriever (Embedinput + Embeddoc) using a simple training objective known as the Inverse Cloze Task (ICT) where, given a sentence, the model is trained to retrieve the document where that sentence came from.
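A rough sketch of how ICT training pairs could be built from a document: one sentence is sampled as a pseudo-query and the rest of the document becomes the retrieval target. Removing the sampled sentence from its document is an assumption of this sketch, as are the function name and the example sentences.

```python
import random


def ict_example(document_sentences, rng):
    """Sample one sentence as the query; the remaining sentences are the target."""
    index = rng.randrange(len(document_sentences))
    query = document_sentences[index]
    context = document_sentences[:index] + document_sentences[index + 1:]
    return query, context


# Hypothetical three-sentence document.
document = [
    "A pyramidion is the capstone of a pyramid.",
    "It was often covered in gold leaf.",
    "Few pyramidia have survived.",
]
query, context = ict_example(document, random.Random(0))
print(query, context)
```

The retriever is then trained so that the embedding of `query` scores higher against its own `context` than against contexts from other documents.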

For the generator, the authors warm-start it with BERT pre-training. Specifically, they use the uncased BERT-base model (12 layers, 768 hidden units, 12 attention heads).

Pre-training

The unsupervised pre-training method that REALM proposes goes as follows:

  1. We randomly select sentences from the text corpus and mask specific tokens from each one.
  2. REALM receives as input a masked query, q. An example would be: "The [MASK] at the top of the pyramid".
  3. REALM outputs its token prediction (correct answer is "pyramidion")
  4. We backpropagate through the parameters $\theta$ of the retriever $p_{\theta}(z|x)$ and the parameters $\phi$ of the generator $p_{\phi}(y|x,z)$.

Computational Challenges

During pre-training, both the Embeddoc and the Embedinput are trained. Because the Embeddoc is updated during pre-training, after each backpropagation step, we need to:

  1. re-compute the document embeddings
  2. re-calculate the document index (in order to perform MIPS)

This is a computationally expensive task, especially for huge databases, such as Wikipedia which they used in this paper. So, the authors designed REALM such that the embedding updates happen every 100 backpropagation steps, as an asynchronous process.

Fine-tuning

The supervised fine-tuning method that the authors used in order to evaluate REALM on Open-domain Question Answering (Open-QA) goes as follows:

  1. We collect Q-A tuples, such as: ("What's the angle of an equilateral triangle", "60 degrees").
  2. REALM receives Q as input.
  3. REALM outputs its prediction.
  4. Like in pre-training, we backpropagate through the parameters $\theta$ of the retriever $p_{\theta}(z|x)$ and the parameters $\phi$ of the generator $p_{\phi}(y|x,z)$, but this time we leave the Embeddoc untouched. Therefore, fine-tuning is much less computationally expensive.

@kandrio

kandrio commented Sep 12, 2023

RETRO: Improving Language Models by Retrieving from Trillions of Tokens (2022)

This paper's breakthrough is that it managed to pre-train and augment a relatively small LLM (25× fewer parameters than GPT-3) with a database that is 2 trillion tokens large (1000× larger than similar retrieval-augmented LLMs).

One main difficulty with augmenting LLMs with external knowledge-bases is that training the retriever component can be computationally expensive, because as the document encoder improves, we need to re-compute the embeddings for each passage in the database. In this paper, they used a pre-trained document encoder, so they calculate the document embeddings once and never update them again. Therefore, the main bottleneck that they face when accessing the external database is finding the K nearest documents to the input query.

One main difference with related work is that in RETRO they don't retrieve single sentences, but chunks (a retrieved sentence along with the following sentence). I don't yet understand if that helps.

Overview

Here's an overview of how RETRO produces an answer to an input query, q:

  1. It splits the input query into chunks of 4 tokens
  2. For each chunk, cq of q, RETRO:
    a. calculates its embedding
    b. finds the 2 nearest neighbors in its knowledge base
    c. encodes cq through the encoder
    d. encodes the 2 nearest neighbors through the encoder
    e. interleaves the encodings of the nearest neighbors with the query chunk embeddings to perform cross-attention. NOTE: Neighbours of the first chunk only affect the last token of the first chunk and tokens from the second chunk.
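The chunk-and-retrieve part of the steps above (1, 2a and 2b) can be sketched as follows. Token-overlap (Jaccard) similarity is used here purely as a stand-in for the frozen embedding model, and the database entries are invented; the encoding and cross-attention steps are not shown.

```python
def split_into_chunks(tokens, chunk_size):
    """Cut the token sequence into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def jaccard(a, b):
    """Toy similarity: token overlap instead of embedding distance."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve_neighbors(chunk, database, k=2):
    """Return the k database entries most similar to the chunk."""
    ranked = sorted(database, key=lambda entry: jaccard(chunk, entry), reverse=True)
    return ranked[:k]


# Hypothetical retrieval database of tokenized passages.
database = [
    "paris is the capital of france".split(),
    "berlin is in germany".split(),
    "france is in europe".split(),
]
tokens = "the capital of france is paris and it is large".split()

chunks = split_into_chunks(tokens, chunk_size=4)
neighbors = [retrieve_neighbors(chunk, database, k=2) for chunk in chunks]
print(chunks)
```

Because retrieval happens once per chunk rather than once per token, the number of database lookups grows with the number of chunks, not with the sequence length in tokens.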

RETRO manages to perform attention with complexity that is linear in the number of retrieved passages.

@kandrio

kandrio commented Sep 13, 2023

LaMDA: Language Models for Dialog Applications (2022)

In this paper by Google, the authors manage to augment a language generation model with what they call a Toolset (TS).

The Toolset (TS)

The Toolset consists of:

  1. a calculator
  2. a translator
  3. an information retrieval system

The Toolset takes a single string as input and outputs a list of one or more strings. Each tool in TS expects a string and returns a list of strings. For example, the information retrieval system can take “How old is Rafael Nadal?”, and output [“Rafael Nadal / Age / 35”].

The information retrieval system is also capable of returning snippets of content from the open web, with their corresponding URLs. The TS tries an input string on all of its tools, and produces a final output list of strings by concatenating the output lists from every tool in the following order: calculator, translator, and information retrieval system. A tool will return an empty list of results if it can’t parse the input (e.g., the calculator cannot parse “How old is Rafael Nadal?”), and therefore does not contribute to the final output list.
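The string-in, list-of-strings-out interface described above can be sketched like this. The calculator handles only basic arithmetic, and the translator and information retrieval system are hypothetical lookup tables; none of this reflects how the real tools are implemented.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node):
    """Accept only plain numbers combined with + - * /."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("not an arithmetic expression")


def calculator(text):
    try:
        return [str(_evaluate(ast.parse(text, mode="eval").body))]
    except (SyntaxError, ValueError, ZeroDivisionError):
        return []  # input could not be parsed: contribute nothing


# Hypothetical lookup tables standing in for the real tools.
TRANSLATIONS = {"hello": "bonjour"}
FACTS = {"How old is Rafael Nadal?": ["Rafael Nadal / Age / 35"]}


def translator(text):
    return [TRANSLATIONS[text]] if text in TRANSLATIONS else []


def information_retrieval(text):
    return list(FACTS.get(text, []))


def toolset(text):
    """Try every tool and concatenate the outputs in a fixed order."""
    return calculator(text) + translator(text) + information_retrieval(text)


print(toolset("How old is Rafael Nadal?"))
print(toolset("135+7721"))
```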

NOTE: Little information is given on how the information retrieval system works, apart from the fact that it entails a database, but also can provide web snippets along with their URLs.

The Architecture

LaMDA consists of two main sub-models:

  1. LaMDA-Base: A regular generative model that is pre-trained on a large dataset. LaMDA-Base is the first model to receive a query from the user. It then generates a response that is checked and refined by LaMDA-Research.
  2. LaMDA-Research: A generative model that usually receives the output of LaMDA-Base as input and is fine-tuned to choose the recipient of its output (the Toolset or the User). In general, LaMDA-Research queries the Toolset in a loop, until it has sufficient information to generate a final response to the user.
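The interaction between the two sub-models and the Toolset can be sketched as a simple loop. All three callables below are scripted stand-ins (hypothetical names and outputs), and the `(recipient, text)` return value is an assumed way of representing the model's choice of recipient.

```python
def run_dialog(user_query, base_model, research_model, toolset, max_rounds=4):
    """Let the research model query the toolset until it addresses the user."""
    draft = base_model(user_query)
    context = [user_query, draft]
    for _ in range(max_rounds):
        recipient, text = research_model(context)
        if recipient == "User":
            return text
        context.append(text)            # the query sent to the toolset
        context.extend(toolset(text))   # the toolset's answers
    return draft  # give up and fall back to the unrefined draft


def toy_base(query):
    return "Rafael Nadal is 30 years old."


def toy_toolset(text):
    return ["Rafael Nadal / Age / 35"]


def toy_research(context):
    # Scripted behaviour: ask the toolset once, then answer the user.
    if any("Age / 35" in entry for entry in context):
        return ("User", "Rafael Nadal is 35 years old.")
    return ("TS", "How old is Rafael Nadal?")


answer = run_dialog("How old is Nadal?", toy_base, toy_research, toy_toolset)
print(answer)
```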

@kandrio

kandrio commented Sep 18, 2023

Internet-Augmented Dialogue Generation (2021)

Their method consists of two components:

  • A search query generator: an encoder-decoder Transformer that takes in the dialogue context as input, and generates a search query. This is given to the black-box search engine API, and N documents are returned.
  • A FiD-style encoder-decoder model that encodes each document individually, concatenates them to the dialogue context encoding, and then finally generates the next response.

We can train each of these modules separately if we have supervised data available for both tasks, the first module requiring (context, search query) pairs, and the second module requiring (context, response) pairs.

The search engine is a black box in this system, and could potentially be swapped out for any method. In IADG, they use the Bing Search API for their experiments to generate a list of URLs for each query. Then, they use these URLs as keys to find their page content.
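The two-stage pipeline above, with the search engine as a swappable black box, could be wired together roughly as follows. This is a sketch under assumptions: `generate_response` and the `toy_*` functions are hypothetical names, the toy index replaces the Bing Search API, and the "encodings" are plain strings rather than Transformer hidden states.

```python
from typing import Callable, List


def generate_response(dialogue_context: List[str],
                      query_generator: Callable[[List[str]], str],
                      search_engine: Callable[[str, int], List[str]],
                      fid_generator: Callable[[List[str], List[str]], str],
                      n_docs: int = 5) -> str:
    """Context -> search query -> N documents -> next response."""
    query = query_generator(dialogue_context)
    documents = search_engine(query, n_docs)   # black box: any backend fits here
    return fid_generator(dialogue_context, documents)


# Hypothetical toy components, only to make the sketch executable.
def toy_query_generator(context: List[str]) -> str:
    # Stand-in for the encoder-decoder that writes the search query.
    return context[-1].rstrip("?").lower()


TOY_INDEX = {"who wrote dune": ["Dune is a 1965 novel by Frank Herbert."]}


def toy_search_engine(query: str, n: int) -> List[str]:
    return TOY_INDEX.get(query, [])[:n]


def toy_fid_generator(context: List[str], documents: List[str]) -> str:
    context_encoding = [" ".join(context)]
    doc_encodings = [doc.strip() for doc in documents]  # one "encoding" per document
    fused = context_encoding + doc_encodings            # concatenated before decoding
    return fused[1] if len(fused) > 1 else "I am not sure."


reply = generate_response(["Who wrote Dune?"],
                          toy_query_generator, toy_search_engine, toy_fid_generator)
```

Because the search engine is just a function argument, swapping it for another retrieval method does not require touching the query generator or the response generator, which mirrors the point that the two modules can be trained separately.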

@kandrio

kandrio commented Sep 18, 2023

SeeKeR: Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion (2022)

One model to do both retrieval and generation (wow)

@kandrio

kandrio commented Sep 19, 2023

Draft

@synctext Here is the first complete draft of my literature survey:

Here's a snippet of my taxonomy table:

taxonomy

What do you think?

@kandrio

kandrio commented Sep 22, 2023

Code Implementation

I recently dived into the implementation details of Retrieval-Augmented Generation (RAG), one of the most influential papers that I reviewed for my literature survey (see this comment for a comprehensive review). RAG focuses on knowledge-intensive NLP tasks, as opposed to the dialogue-intensive tasks that a number of recent papers focus on.

The authors of RAG have open-sourced a specific version of their work, RAG-token, as part of the transformers Python library by Hugging Face.

I was able to access that model, and write an example script where I employed RAG to answer a simple question: "Who holds the record in 100m freestyle?"

Here is my script:

from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# a tokenizer receives an input text and breaks it into a list of tokens
# this way, it's easier for the model to understand the input query
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# initialize a pre-trained RAG Retriever which has access to a "dummy" subset of Wikipedia
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq",
    index_name="exact",
    use_dummy_dataset=True)

# initialize the RAG-token model that will generate the final answer to our query
# the generator of RAG-token will receive the retrieved evidence by the retriever
# along with the input question and it will produce an answer
model = RagTokenForGeneration.from_pretrained(
    "facebook/rag-token-nq",
    retriever=retriever)

# define our question and tokenize it. Expected answer: "michael phelps"
# (calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch)
input_dict = tokenizer(
    "who holds the record in 100m freestyle",
    return_tensors="pt")

# pass the question as input to RAG-token
generated = model.generate(input_ids=input_dict["input_ids"])

# print the answer
print(tokenizer.batch_decode(generated, skip_special_tokens=True))

Here is a screenshot that shows what RAG replied to my question (take a look at the bottom):

CC @synctext

Member Author

synctext commented Sep 27, 2023

WOW 👏 Impressive work. Only very minor comments:

  • "IV. Search-engine augmented generation": if you can be found by a search engine, you can then spam an AI? Please put this nuance into this very positively worded section, if you agree obviously.
  • Title: "Augmenting LLMs with Knowledge: a survey around hallucination prevention". Please remove the repetition of LLM in the title and make it as short as possible. Or even goal-oriented: "Preventing LLM Hallucination: a survey on knowledge augmentation".
  • It lacks opinion. With your taxonomy and nuanced "section V", do you feel comfortable promoting a single paper inside the conclusion section? Example: "The state-of-the-art within this area is as of today defined by the work presented in {DAG?}. The score of this work surpasses that of both general AI and non-generic specialised fact answering constructs."
  • Due to the impressive quality of this work and the requirement for the "excellent" grade, please turn this into a scientific arXiv publication, and link to my unique ID.

@kandrio

kandrio commented Sep 28, 2023

Updated version of the paper after @synctext's useful comments:
