
docs: add ADR for RAG Evaluations Framework #842

Open
wants to merge 22 commits into main
Conversation

@jalling97 (Contributor) commented Jul 26, 2024

  • Adds the ADR for the overall RAG Evaluations Framework
  • This will be added to for each component of the RAG Evals MVP Epic and closed when all pieces are done and all decisions have been documented in the ADR.

@jalling97 linked an issue Jul 26, 2024 that may be closed by this pull request

netlify bot commented Jul 26, 2024

Deploy Preview for leapfrogai-docs canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 13fce77 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/leapfrogai-docs/deploys/66fc332a11b160000826d13f |

@jalling97 self-assigned this Jul 26, 2024
@jalling97 added the ADR 🧐 Architecture Decision Record label Jul 26, 2024
@jalling97 changed the title from "(ADR) RAG Evaluations Framework" to "adr: RAG Evaluations Framework" Aug 27, 2024
@jalling97 changed the title from "adr: RAG Evaluations Framework" to "docs: RAG Evaluations Framework" Aug 27, 2024
@jalling97 added the documentation Improvements or additions to documentation label Aug 27, 2024
@jalling97 changed the title from "docs: RAG Evaluations Framework" to "docs: add RAG Evaluations Framework" Aug 27, 2024
@jalling97 changed the title from "docs: add RAG Evaluations Framework" to "docs: add ADR for RAG Evaluations Framework" Aug 27, 2024
adr/0007-rag-eval-framework.md

#### Rationale

These datasets were created to fill a gap in the openly available datasets that could otherwise have been used. For example, no existing openly available QA dataset contained all **4** components listed above: many had the questions, answers, and context, but none also included the source documents in a readily accessible manner. Therefore, the fastest and most effective course of action was to generate a QA dataset from source documentation using the [DeepEval Synthesizer](https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data).
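For illustration only, a minimal sketch of generating synthetic QA "goldens" from source documents with the DeepEval Synthesizer might look like the following. The document paths and parameter values are placeholders, and the exact API surface may differ between DeepEval versions.

```python
# Hypothetical sketch: generate synthetic question/answer goldens from source
# documents using DeepEval's Synthesizer. API details may vary by version.
from deepeval.synthesizer import Synthesizer

# By default the Synthesizer uses the LLM configured for DeepEval.
synthesizer = Synthesizer()

# Placeholder paths; a real run would point at the chosen technical
# documentation and DoD policy documents.
goldens = synthesizer.generate_goldens_from_docs(
    document_paths=["docs/source_doc_1.pdf", "docs/source_doc_2.md"],
    max_goldens_per_document=5,
)

for golden in goldens:
    print(golden.input)            # generated question
    print(golden.expected_output)  # generated answer (may need to be enabled,
                                   # depending on synthesizer configuration)
    print(golden.context)          # context chunks pulled from the source docs
```

Because the goldens are generated directly from the source documents, the resulting dataset carries all four components (question, answer, context, and source document) that the openly available datasets lacked.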
Collaborator commented:
Explain why we picked the reference documents used as sources for this dataset. It was a mix of technical documentation and DoD policy docs, chosen to capture the type of data we generally expect our mission users to be working with.


## Context

LeapfrogAI uses RAG to provide context-aware responses to users who have specific data they need to reference. To ensure RAG operates at the level we need, we must gather measurable feedback from the RAG pipeline so we can improve it. We also need a standard we can show to mission heroes to demonstrate that we are in fact operating at that level. We do this with RAG-focused evaluations. This ADR documents all of the decisions and lessons learned from enabling a full-scale RAG evaluations pipeline MVP.
Collaborator commented:
In addition to evaluating and improving our "default" RAG implementation in LeapfrogAI, developing a standard approach to evals will help make delivery folks more efficient at evaluating custom RAG implementations they may build on and around the core LeapfrogAI.

adr/0007-rag-eval-framework.md
@jalling97 marked this pull request as ready for review September 30, 2024 21:36
@jalling97 requested a review from a team as a code owner September 30, 2024 21:36
@justinthelaw (Contributor) left a comment:
This looks excellent so far! Some minor comments and nits.

adr/0007-rag-eval-framework.md

#### Decision

The three models that will initially be evaluated are:
Contributor commented:
It may be good to provide a small section on why we chose the quantization levels and sizes of the models here, with reasons such as consistency in the number of active params or vRAM limitations, so that it doesn't seem like we selected 4-bit quantization for these model sizes arbitrarily.

Contributor (author) replied:
Made some updates, let me know what you think!


In order to reach an MVP, a single LLM judge will be utilized for the evaluations that require one. This first stage lets the evaluation framework begin producing results; as progress is made, additional LLM-based judges will be incorporated to develop an LLM-jury-style approach. For context, please see the following [paper](https://arxiv.org/pdf/2404.18796).

Claude 3.5 Sonnet was chosen as the first judge due to its high level of [performance](https://artificialanalysis.ai/models/claude-35-sonnet), which is crucial when utilizing an LLM judge. Additionally, it sits outside the family of models being evaluated; judging with an out-of-family model has been shown to be more effective than using a model from the same family, which is prone to [self-enhancement bias](https://arxiv.org/pdf/2306.05685).
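As a rough illustration (not the project's actual evaluation code), a single-judge call to Claude 3.5 Sonnet via the Anthropic SDK could look like the sketch below. The prompt wording, model ID pin, and 1-5 scoring scale are placeholders chosen for the example.

```python
# Hypothetical single-LLM-judge sketch; the prompt, scoring scale, and model
# pin are illustrative assumptions, not settings taken from the ADR.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def judge_faithfulness(question: str, context: str, answer: str) -> str:
    """Ask the judge model to rate how faithful a RAG answer is to its context."""
    prompt = (
        "You are grading a RAG system's answer for faithfulness to the "
        "retrieved context. Respond with a single integer from 1 (unsupported) "
        "to 5 (fully supported), followed by a one-sentence justification.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

An LLM-jury extension would run the same prompt against several judge models and aggregate their scores (for example, by majority vote or averaging), which is the direction described above.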
Contributor commented:
Can you provide a little more context on what performance means here? Maybe talk about its reasoning scores, like MMLU or GPQA? It might also be good to mention its large context window for digesting and analyzing the inputs/outputs of the evaluated models.

Contributor (author) replied:
Made some updates, let me know what you think!

adr/0007-rag-eval-framework.md
Labels: ADR 🧐 Architecture Decision Record, documentation Improvements or additions to documentation
Projects: None yet
Development

Successfully merging this pull request may close these issues.

ADR: RAG Evaluations Framework
3 participants