CLI utility for running the Open French Law RAG experiment.
- 📰 Blog post: https://lil.law.harvard.edu/blog/2025/01/21/open-french-law-rag/
- 📖 Case study: https://lil.law.harvard.edu/open-french-law-rag
This Retrieval Augmented Generation pipeline:
- Ingests the COLD French Law Dataset into a vector store.
  - Only French content is ingested; the English translations present in the dataset are not part of this experiment.
- Uses the resulting vector store and a combination of text generation models to answer a series of questions.
  - Questions are asked both in English and in French.
  - Questions are asked both with and without context retrieved from the vector store.
  - Questions are asked against both an OpenAI model and an open-source model, the latter run via Ollama.
- Outputs raw results to CSV.
This pipeline requires Python 3.11+ and Python Poetry.
Pulling and pushing data from HuggingFace may require the HuggingFace CLI and valid authentication.
```shell
git clone https://github.com/harvard-lil/open-french-law-chabot.git
poetry install
```
Copy `.env.example` as `.env` and edit it to provide the pipeline with credentials for the OpenAI API and Ollama.
```shell
cp .env.example .env
```
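As a rough illustration, the resulting `.env` might contain entries along these lines. The variable names below are placeholders, not the pipeline's actual schema — refer to `.env.example` for the real keys.

```env
# Hypothetical example values; the actual variable names are defined in .env.example
OPENAI_API_KEY="sk-..."
OLLAMA_API_URL="http://localhost:11434"
```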
The pipeline's configuration can be further edited via /const/init.py.
This script generates a vector store out of the content from the COLD French Law Dataset.
```shell
# See: ingest.py --help for a list of available options
poetry run python ingest.py
```
See output under `/database`.
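To give a sense of what ingestion involves, the sketch below shows the kind of chunking step a vector-store ingestion script typically performs before embedding: splitting each document into overlapping windows so that retrieval can surface passages rather than whole texts. This is an illustrative toy, not this repository's actual implementation; the chunk sizes and the sample text are arbitrary.

```python
# Illustrative chunking step for vector-store ingestion; not the
# pipeline's actual implementation. Chunk sizes are arbitrary examples.

def chunk_text(text: str, max_chars: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping chunks suitable for embedding."""
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap  # step forward, keeping some overlap
    return chunks

# Stand-in for a single entry from the COLD French Law Dataset:
article = "Article 1382 du Code civil ... " * 40
chunks = chunk_text(article)
# Each chunk would then be embedded and stored alongside its metadata.
```

In a real ingestion run, each chunk is passed through an embedding model and written to the vector store together with identifiers pointing back to the source document.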
This script runs the full list of questions through the pipeline and writes the output to CSV.
```shell
# See: ask.py --help for a list of available options
poetry run python ask.py
```
See output under `/output/*.csv`.
The experiment's output is organized in groups:

| Group name | Text gen. model | RAG | Language |
|---|---|---|---|
| a_en | Llama 2 70B | NO | EN |
| a_fr | Llama 2 70B | NO | FR |
| b_en | Llama 2 70B | YES | EN |
| b_fr | Llama 2 70B | YES | FR |
| c_en | GPT-4 | NO | EN |
| c_fr | GPT-4 | NO | FR |
| d_en | GPT-4 | YES | EN |
| d_fr | GPT-4 | YES | FR |