This project provides a framework for managing scientific document databases and generating structured scientific texts. It uses OpenAI models and the LangChain framework to streamline querying, organizing, and using research materials.
- Document Parsing and Storage: Extracts content from PDFs and stores it in a searchable vector database.
- Contextual Querying: Uses OpenAI embeddings to find relevant documents for a given query.
- Scientific Text Generation: Generates structured scientific text grounded in the retrieved context and written in the style expected by top-tier conferences.
- Flexible API Integration: Uses the OpenAI API for both embedding and text generation tasks.
- Clone the Repository
git clone <repository-url>
cd Paper_RAG
- Install Dependencies
Ensure Python 3.8+ is installed. Then run:
pip install -r requirements.txt
- Set Up Your OpenAI API Key
At runtime, the script prompts you for your OpenAI API key. Alternatively, set it as an environment variable:
export OPENAI_API_KEY="your-api-key"
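The key lookup described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `resolve_api_key` is a hypothetical helper, and the order assumed here (environment variable first, interactive prompt second) is an assumption.

```python
import os
from getpass import getpass


def resolve_api_key(env_var="OPENAI_API_KEY", prompt=getpass):
    """Return the API key from the environment, or prompt for it if unset.

    Hypothetical helper: the project's scripts may handle this differently.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive prompt; getpass hides the typed input.
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key
```

Reading the environment variable first lets the scripts run unattended, while the prompt keeps the key out of shell history for interactive use.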
- Preparing the Database
Place your documents in their respective directories:
  - Primary context documents: data/pdf/
  - Supplemental scientific articles: data/articles/
Then process the documents and populate the vector database:
python create_database.py
- Querying the Database
Run the script with your query:
python query_database.py "Your research question or topic"
The script will:
- Search the database for relevant documents.
- Generate a scientific text based on the found documents and the provided query.
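To make the search step concrete, here is a toy sketch of embedding-based retrieval in plain Python. It is a stand-in for what the vector database does internally, not the project's code: the function names and the three-dimensional vectors are made up for illustration, and cosine similarity is an assumption (the metric Chroma actually uses depends on how the collection is configured).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as similar to nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed_chunks, k=3):
    """Return the k chunk texts whose embeddings best match the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy 3-dimensional "embeddings"; real OpenAI embeddings have far more dimensions.
chunks = [
    ("qubits and superposition", [0.9, 0.1, 0.0]),
    ("gradient descent basics", [0.1, 0.9, 0.0]),
    ("quantum speedups for ML", [0.7, 0.6, 0.1]),
]
print(top_k([1.0, 0.2, 0.0], chunks, k=2))
# prints ['qubits and superposition', 'quantum speedups for ML']
```

The retrieved chunks are what the generation step receives as context, so the value of `k` trades prompt length against coverage.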
- create_database.py: Handles document loading, splitting, and storage into the vector database.
- query_database.py: Facilitates querying the database and generating responses using OpenAI models.
- Utility Functions:
  - load_documents: Loads PDF documents.
  - split_text: Splits documents into manageable chunks for processing.
  - save_to_chroma: Saves document embeddings to the Chroma vector database.
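The chunking idea behind split_text can be illustrated with a small, dependency-free sketch. This is not the project's implementation, which presumably relies on a LangChain text splitter: `chunk_text` is a hypothetical name, and the character-based sizes and defaults are assumptions.

```python
def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that text near a
    boundary keeps some of its surrounding context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text.
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Smaller chunks give more precise retrieval hits but less context per hit; the overlap softens the effect of cutting in the middle of a passage.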
The generated scientific text adheres to the following template:
Using exclusively the following contexts:
Primary context (your thesis document): {context}
Supplemental context (scientific articles from top-tier conferences): {external_context}
Compose a scientifically structured text on the topic below, suitable for submission to top-tier scientific conferences. The text should:
- Clearly and accurately explain the concept or technique.
- Include relevant analysis or discussion based on the context.
- Be well-structured and coherent.
Specific topic: {question}
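Filling the template above could look like the sketch below. The template text and its three placeholders come from this README; `build_prompt`, its parameters, and the `---` separator between chunks are assumptions made for illustration.

```python
PROMPT_TEMPLATE = """\
Using exclusively the following contexts:

Primary context (your thesis document): {context}

Supplemental context (scientific articles from top-tier conferences): {external_context}

Compose a scientifically structured text on the topic below, suitable for submission to top-tier scientific conferences. The text should:
- Clearly and accurately explain the concept or technique.
- Include relevant analysis or discussion based on the context.
- Be well-structured and coherent.

Specific topic: {question}
"""


def build_prompt(context_chunks, external_chunks, question, separator="\n\n---\n\n"):
    """Join the retrieved chunks and substitute them into the template.

    Hypothetical helper; the separator is an arbitrary choice.
    """
    return PROMPT_TEMPLATE.format(
        context=separator.join(context_chunks),
        external_context=separator.join(external_chunks),
        question=question,
    )
```

Calling `build_prompt(["chunk A", "chunk B"], ["article chunk"], "Explain the implications of quantum computing in AI.")` returns a single string ready to send to the model.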
- Populate the directories with relevant PDFs.
- Create the database:
python create_database.py
- Query for a topic:
python query_database.py "Explain the implications of quantum computing in AI."
- Python 3.8+
- Libraries: See requirements.txt. Includes:
  - langchain_community
  - langchain_openai
  - PyPDFLoader
  - Chroma
- Permission Errors: Ensure you have sufficient permissions to delete or create directories.
- Corrupt PDF Files: Validate your PDFs manually if they fail to load.
- API Issues: Check your OpenAI API key and rate limits.
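As a quick first check for the corrupt-PDF case, the sketch below flags files that do not start with the `%PDF-` signature. It is only a heuristic and does not prove a file can be parsed; the function names are hypothetical and not part of the project.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


def find_suspect_pdfs(directory):
    """List .pdf files under directory that are empty or lack the PDF signature."""
    return sorted(
        str(p) for p in Path(directory).rglob("*.pdf") if not looks_like_pdf(p)
    )
```

Running `find_suspect_pdfs("data/pdf")` before building the database narrows down which files to inspect by hand.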
- Add support for non-PDF formats.
- Improve error handling for database operations.
- Extend the prompt template for different scientific disciplines.
Author: Afonso Carvalho
License: MIT