
docs: add ADR for RAG Evaluations Framework #842

Open
wants to merge 22 commits into main
Conversation

@jalling97 (Contributor) commented Jul 26, 2024

  • Adds the ADR for the overall RAG Evaluations Framework
  • This will be added to for each component of the RAG Evals MVP Epic and closed when all pieces are done and all decisions have been documented in the ADR.

@jalling97 linked an issue Jul 26, 2024 that may be closed by this pull request

netlify bot commented Jul 26, 2024

Deploy Preview for leapfrogai-docs canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 13fce77 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/leapfrogai-docs/deploys/66fc332a11b160000826d13f |

@jalling97 self-assigned this Jul 26, 2024
@jalling97 added the ADR 🧐 Architecture Decision Record label Jul 26, 2024
@jalling97 changed the title from "(ADR) RAG Evaluations Framework" to "adr: RAG Evaluations Framework" Aug 27, 2024
@jalling97 changed the title from "adr: RAG Evaluations Framework" to "docs: RAG Evaluations Framework" Aug 27, 2024
@jalling97 added the documentation Improvements or additions to documentation label Aug 27, 2024
@jalling97 changed the title from "docs: RAG Evaluations Framework" to "docs: add RAG Evaluations Framework" Aug 27, 2024
@jalling97 changed the title from "docs: add RAG Evaluations Framework" to "docs: add ADR for RAG Evaluations Framework" Aug 27, 2024
adr/0007-rag-eval-framework.md

#### Rationale

These datasets were created to fill a gap in the openly available datasets that could otherwise have been used. For example, no existing openly available QA dataset contained all **4** components listed above: many had the questions, answers, and context, but none also included the source documents in a readily accessible manner. Therefore, the fastest and most effective course of action was to generate a QA dataset from source documentation using the [DeepEval Synthesizer](https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data).
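For illustration only, a minimal sketch of generating synthetic QA "goldens" from source documents with the DeepEval Synthesizer might look like the following. The document paths and parameter values are placeholders, and the exact API surface may differ between DeepEval versions.

```python
# Hypothetical sketch: generate synthetic question/answer goldens from source
# documents using DeepEval's Synthesizer. API details may vary by version.
from deepeval.synthesizer import Synthesizer

# By default the Synthesizer uses the LLM configured for DeepEval.
synthesizer = Synthesizer()

# Placeholder paths; a real run would point at the chosen technical
# documentation and DoD policy documents.
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/source_doc_1.pdf", "docs/source_doc_2.md"],
    max_goldens_per_document=5,
)

for golden in goldens:
    print(golden.input)            # generated question
    print(golden.expected_output)  # generated answer (may need to be enabled,
                                   # depending on synthesizer configuration)
    print(golden.context)          # context chunks pulled from the source docs
```

Because the goldens are generated directly from the source documents, the resulting dataset carries all four components (question, answer, context, and source document) that the openly available datasets lacked.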
Collaborator commented:
Explain why we picked the reference documents used as sources for this dataset. It was a mix of technical documentation and DoD policy docs, chosen to capture the type of data we generally expect our mission users to be working with.


## Context

LeapfrogAI uses RAG to provide context-aware responses to users who have specific data they need to reference. To ensure RAG operates at the level we need, we must gather measurable feedback from the RAG pipeline so we can improve it. We also need a standard we can show to mission heroes to demonstrate that we are in fact operating at that level. We do this with RAG-focused evaluations. This ADR documents all of the decisions and lessons learned from enabling a full-scale RAG evaluations pipeline MVP.
Collaborator commented:
In addition to evaluating and improving our "default" RAG implementation in LeapfrogAI, developing a standard approach to evals will help make delivery folks more efficient at evaluating custom RAG implementations they may build on and around the core LeapfrogAI.

adr/0007-rag-eval-framework.md
@jalling97 marked this pull request as ready for review September 30, 2024 21:36
@jalling97 requested a review from a team as a code owner September 30, 2024 21:36
@justinthelaw (Contributor) left a comment:
This looks excellent so far! Some minor comments and nits.

adr/0007-rag-eval-framework.md

#### Decision

The three models that will initially be evaluated are:
Contributor commented:
It may be good to provide a small section on why we chose the quantization levels and sizes of the models here, with reasons such as consistency in the number of active params or vRAM limitations, so that it doesn't seem like we selected 4-bit quantization for these model sizes arbitrarily.

Contributor (author) replied:
Made some updates, let me know what you think!


In order to reach an MVP, a single LLM judge will be utilized for the evaluations that require one. This first stage lets the evaluation framework begin producing results; as progress is made, additional LLM-based judges will be incorporated to develop an LLM-jury-style approach. For context, please see the following [paper](https://arxiv.org/pdf/2404.18796).

Claude 3.5 Sonnet was chosen as the first judge due to its high level of [performance](https://artificialanalysis.ai/models/claude-35-sonnet), which is crucial when utilizing an LLM judge. Additionally, it sits outside the family of models being evaluated; judging with an out-of-family model has been shown to be more effective than using a model from the same family, which is prone to [self-enhancement bias](https://arxiv.org/pdf/2306.05685).
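As a rough illustration (not the project's actual evaluation code), a single-judge call to Claude 3.5 Sonnet via the Anthropic SDK could look like the sketch below. The prompt wording, model ID pin, and 1-5 scoring scale are placeholders chosen for the example.

```python
# Hypothetical single-LLM-judge sketch; the prompt, scoring scale, and model
# pin are illustrative assumptions, not settings taken from the ADR.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def judge_faithfulness(question: str, context: str, answer: str) -> str:
    """Ask the judge model to rate how faithful a RAG answer is to its context."""
    prompt = (
        "You are grading a RAG system's answer for faithfulness to the "
        "retrieved context. Respond with a single integer from 1 (unsupported) "
        "to 5 (fully supported), followed by a one-sentence justification.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

An LLM-jury extension would run the same prompt against several judge models and aggregate their scores (for example, by majority vote or averaging), which is the direction described above.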
Contributor commented:
Can you provide a little more context on what performance means here? Maybe talk about its reasoning scores, like MMLU or GPQA? It might also be good to mention its large context window for digesting and analyzing the inputs/outputs of the evaluated models.

Contributor (author) replied:
Made some updates, let me know what you think!

adr/0007-rag-eval-framework.md
Labels: ADR 🧐 Architecture Decision Record, documentation Improvements or additions to documentation
Projects: None yet
Development

Successfully merging this pull request may close these issues.

ADR: RAG Evaluations Framework
3 participants