This project is a Retrieval-Augmented Generation (RAG) implementation that uses a fusion of BM25 and cosine similarity scores for context retrieval.
- Setup Instructions
  - Prerequisites
  - Installation
  - Configuration
- Usage Instructions
  - Starting the Services
  - Stopping the Services
- API Documentation
  - Upload Files
  - Perform OCR
  - Extract Information
- Testing the API
  - Upload Files
  - Perform OCR
  - Extract Information
- Example and Demo
## Setup Instructions

### Prerequisites

Before running the project, make sure you have the following prerequisites installed:
- Docker
- Docker Compose
### Installation

- Clone the repository:

  ```bash
  git clone ...
  ```

- Navigate to the project directory:

  ```bash
  cd ...
  ```
### Configuration

The project uses a `config.ini` file to store configuration settings. Here's an explanation of each section and its parameters:
- `[minio-config]`
  - `endpoint`: The URL and port of the Minio server.
  - `access_key`: The access key for authenticating with Minio.
  - `secret_key`: The secret key for authenticating with Minio.
  - `bucket_name`: The name of the bucket to use in Minio for storing files.
- `[redis-config]`
  - `host`: The hostname or IP address of the Redis server.
  - `port`: The port number of the Redis server.
- `[query-config]`
  - `window_size`: The size of the window used for sliding-window extraction.
  - `top_k`: The number of top matching chunks to consider for extraction.
  - `cosine_link_threshold`: The cosine similarity threshold above which chunks are considered linked.
  - `alpha`: The complementary weight of the BM25 and cosine similarity scores in the final score: `alpha * bm25_score + (1 - alpha) * cosine_score` (see the sketch below this list).
- `[document-config]`
  - `min_chunk_length`: The minimum length of a text chunk.
  - `max_chunk_length`: The maximum length of a text chunk.
- `[openai-config]`
  - `api_key`: The API key for accessing the OpenAI API.
  - `gpt_model`: The GPT model to use for question answering.
  - `embedding_model`: The embedding model to use for text embeddings.
  - `embedding_dimension`: The dimension of the text embeddings.
  - `gpt_context_len`: The maximum context length for the GPT model.
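For illustration, here is a minimal sketch of how such a score fusion can be computed. The function name, the min-max normalization, and the example scores are assumptions made for this sketch, not the project's actual implementation.

```python
from typing import Dict

def fuse_scores(bm25_scores: Dict[str, float],
                cosine_scores: Dict[str, float],
                alpha: float) -> Dict[str, float]:
    """Fuse per-chunk scores: alpha * bm25_score + (1 - alpha) * cosine_score."""

    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        # Min-max normalize so both signals live on comparable scales
        # (an assumption; the project may normalize differently or not at all).
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    bm25 = normalize(bm25_scores)
    cosine = normalize(cosine_scores)
    return {
        chunk: alpha * bm25.get(chunk, 0.0) + (1 - alpha) * cosine.get(chunk, 0.0)
        for chunk in set(bm25) | set(cosine)
    }

# With alpha = 0.4, cosine similarity carries slightly more weight than BM25.
print(fuse_scores({"chunk-1": 3.2, "chunk-2": 1.1},
                  {"chunk-1": 0.82, "chunk-2": 0.91},
                  alpha=0.4))
```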
Make sure to update the `config.ini` file with your own configuration values before running the project. At the very least, set your own `api_key`.
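As a rough sketch, a filled-in `config.ini` might look like the following. Every value below is a placeholder or assumption (the `endpoint` and `host` values assume the Docker Compose service names, and the model names are only examples); replace them with your own settings.

```ini
[minio-config]
endpoint = minio:9000
access_key = minioadmin
secret_key = minioadmin
bucket_name = documents

[redis-config]
host = redis
port = 6379

[query-config]
window_size = 3
top_k = 5
cosine_link_threshold = 0.8
alpha = 0.5

[document-config]
min_chunk_length = 200
max_chunk_length = 1000

[openai-config]
api_key = sk-...
gpt_model = gpt-4o-mini
embedding_model = text-embedding-3-small
embedding_dimension = 1536
gpt_context_len = 128000
```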
The `config.ini` file also includes a `[MOCK_OCR_SAVED]` section that maps file hashes to saved OCR results:

- `[MOCK_OCR_SAVED]`
  - `MD5_HASH_1`: `FILE_PATH_1`
  - `MD5_HASH_2`: `FILE_PATH_2`
  - `MD5_HASH_3`: `FILE_PATH_3`
  - ...

Each `FILE_PATH` should be a JSON file parseable by Python's `json` library, with the following dictionary structure:

```json
{
    "analyzeResult": {
        "content": ""
    }
}
```

- `content` (str): should contain the contents of the file uploaded through the `/upload` endpoint, as a string.
- `MD5_HASH` (str): should be the MD5 hash of the uploaded file, so that it can be mapped to the relevant JSON file and its contents ingested.
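For reference, the MD5 hash used as the key can be computed from an uploaded file with Python's standard library. This is a small sketch; the file name is just an example.

```python
import hashlib

def md5_of_file(path: str) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(8192), b""):
            digest.update(block)
    return digest.hexdigest()

print(md5_of_file("file1.pdf"))  # e.g. 99ef153b76c24ee4703f3b9e025bab09
```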
## Usage Instructions

### Starting the Services

To start the services using Docker Compose, run the following command:

```bash
docker-compose up --build -d
```

This builds the Docker images for the FastAPI app and the Celery worker, and starts all the services defined in the `docker-compose.yaml` file, including Redis, Minio, Redis Commander, the FastAPI app, and the Celery worker.
### Stopping the Services

To stop the services, run the following command:

```bash
docker-compose down
```

This will stop and remove all the containers associated with the project.
## API Documentation

### Upload Files

- Endpoint: `/upload`
- Method: POST
- Request Body:
  - `files`: List of files to upload (multipart/form-data)
- Response:
  - `status`: Success or error status
  - `data`: List of uploaded files with their file IDs and signed URLs
### Perform OCR

- Endpoint: `/ocr`
- Method: POST
- Request Body:
  - `signed_url`: Signed URL of the file to perform OCR on
- Response:
  - `status`: Success, error, or processing status
  - `data`: Task ID and message if the file is queued for processing
  - `message`: Error message if an error occurs
### Extract Information

- Endpoint: `/extract`
- Method: POST
- Request Body:
  - `file_hash`: Hash of the file to extract information from
  - `query`: Query to extract information based on
- Response:
  - `status`: Success, error, or processing status
  - `answer`: Extracted information if the file has been processed, or the processing status if it is still in progress
  - `message`: Error message if an error occurs
## Testing the API

To test the API endpoints, you can use `curl` or an API client such as Postman. Here are examples of how to test the endpoints using `curl`:
### Upload Files

- Prepare the files you want to upload (e.g., `file1.pdf`, `file2.png`).
- Run the following command to upload the files (a Python alternative is shown after this list):

  ```bash
  curl -X POST -F "files=@file1.pdf" -F "files=@file2.png" http://localhost:8181/upload
  ```

  This command uploads `file1.pdf` and `file2.png` to the `/upload` endpoint.

- Check the response to verify the upload status and get the file IDs and signed URLs.
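If you prefer Python over `curl`, the same upload can be done with the `requests` library. This is a sketch based on the request and response fields documented above; it assumes `requests` is installed and the files exist locally.

```python
import requests

# Upload two files as multipart/form-data to the /upload endpoint.
with open("file1.pdf", "rb") as f1, open("file2.png", "rb") as f2:
    response = requests.post(
        "http://localhost:8181/upload",
        files=[("files", f1), ("files", f2)],
    )

# The response should contain the upload status plus file IDs and signed URLs.
print(response.json())
```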
### Perform OCR

- Obtain the signed URL of the file you want to perform OCR on.
- Run the following command to initiate OCR processing (a Python alternative is shown after this list):

  ```bash
  curl -X POST -H "Content-Type: application/json" -d '{"signed_url": "https://example.com/file.pdf"}' http://localhost:8181/ocr
  ```

  Replace `https://example.com/file.pdf` with the actual signed URL of the file.

- Check the response to verify the OCR processing status and get the task ID if the file is queued for processing.
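The equivalent request in Python, as a sketch; replace the signed URL with one returned by the upload step.

```python
import requests

# Queue OCR processing for a file identified by its signed URL.
response = requests.post(
    "http://localhost:8181/ocr",
    json={"signed_url": "https://example.com/file.pdf"},
)

# On success the response reports the status and, if queued, a task ID.
print(response.json())
```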
### Extract Information

- Obtain the MD5 file hash of the processed file you want to extract information from.
- Run the following command to extract information based on a query (a Python alternative is shown after this list):

  ```bash
  curl -X POST -H "Content-Type: application/json" -d '{"file_hash": "99ef153b76c24ee4703f3b9e025bab09", "query": "What is the maximum height allowed for buildings?"}' http://localhost:8181/extract
  ```

  Replace `99ef153b76c24ee4703f3b9e025bab09` with the actual MD5 file hash and provide your desired query.

- Check the response to get the extracted information if the file has been processed, or the processing status if it is still in progress.
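Because `/extract` keeps reporting a processing status until OCR and ingestion have finished, it can be convenient to wrap the call in a small retry loop. The sketch below is an assumption about how to poll (the exact status values are not documented here); it simply re-sends the request a few times and prints whatever the API returns.

```python
import time

import requests

payload = {
    "file_hash": "99ef153b76c24ee4703f3b9e025bab09",  # replace with your file's MD5 hash
    "query": "What is the maximum height allowed for buildings?",
}

# Re-send the request a few times, since the answer is only available
# once the document has been processed.
for attempt in range(10):
    body = requests.post("http://localhost:8181/extract", json=payload).json()
    print(f"attempt {attempt + 1}: {body}")
    if body.get("answer"):  # stop once an answer comes back; adjust to your status values
        break
    time.sleep(5)
```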
Note: Make sure the services are running locally, or replace `http://localhost:8181` with the appropriate URL where your API is hosted.
## Example and Demo

- Make sure YARI is running.
- `demo/demo.py`: a Streamlit-based demo. Make sure Streamlit is installed before running it (e.g., with `streamlit run demo/demo.py`).
- `examples/example.sh`: a bash/curl-based example.
Parts of this readme have been generated by an LLM.