This project demonstrates how to create a question-answering (QA) chatbot using FAISS for vector storage and LangChain for building the retrieval-based QA chain. The chatbot scrapes content from provided URLs, processes the text into embeddings, and answers questions using a fine-tuned LaMini-T5 model.
- Embedding model: `all-mpnet-base-v2`, used to create the embeddings and to retrieve them.
- Vector database: FAISS (Facebook AI Similarity Search), used to store the embeddings.
- Language model: LaMini-T5 (developed by MBZUAI, the Mohamed bin Zayed University of Artificial Intelligence in the UAE), used to refine the response and to store the conversation.
`Internal_links.txt`: File containing URLs to scrape
Answers to User Queries: The chatbot provides contextually relevant answers to questions based on text scraped from the provided URLs. It uses retrieval-based methods to find the most relevant text chunks and generates responses with a fine-tuned language model (LaMini-T5).
Source References: For each answer, the chatbot includes the source URL from which the information was retrieved. If no relevant information is found, it apologizes and notifies the user.
Embedded Context: The text scraped from URLs is processed into semantic embeddings, allowing fast and accurate retrieval for queries.
- Scrapes text content from URLs and internal links.
- Creates and stores embeddings using FAISS vector database.
- Leverages a fine-tuned language model (LaMini-T5) for answering questions.
- Allows saving and loading FAISS vector databases for efficiency.
The program will ask whether to load an existing FAISS database or create a new one from scratch. After processing embeddings, you can ask questions based on the content of the scraped URLs.
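The load-or-create decision can be sketched as below. This is a minimal illustration, not the project's actual code: it assumes the database was saved with LangChain's `FAISS.save_local`, which by default writes `index.faiss` and `index.pkl` into the target folder. The directory name `faiss_db` and the helper names `has_saved_index` and `choose_action` are hypothetical.

```python
from pathlib import Path

# Hypothetical value; the real FAISS_DB_DIR is set in the project's configuration.
FAISS_DB_DIR = "faiss_db"


def has_saved_index(db_dir, index_name="index"):
    """Return True if both files written by FAISS.save_local are present."""
    folder = Path(db_dir)
    return all(
        (folder / f"{index_name}{suffix}").is_file()
        for suffix in (".faiss", ".pkl")
    )


def choose_action(db_dir, user_wants_load):
    """Return "load" only when the user asked for it and a saved index exists."""
    if user_wants_load and has_saved_index(db_dir):
        return "load"
    # Fall back to rebuilding so a missing or partial index never causes a crash.
    return "create"
```

Falling back to `"create"` when the files are missing means a first run still works even if the user answers "load" at the prompt.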
- Scraping URLs: The script fetches content from the provided URLs and internal links, extracting text from paragraphs and headings.
- Creating Embeddings: The text is split into chunks and processed into embeddings using the `all-mpnet-base-v2` model.
- QA Chain: A retrieval-based QA chain is built using the LaMini-T5 model for text generation and FAISS for vector retrieval.
- Query Processing: The chatbot retrieves relevant text chunks and generates responses based on user queries.
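The first two steps above (extracting paragraph and heading text, then splitting it into chunks) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: the class and function names are hypothetical, the chunk size and overlap are placeholder values, and the parser assumes every `<p>` and heading tag is explicitly closed.

```python
from html.parser import HTMLParser

# Tags treated as "paragraphs and headings" (an assumption about the scraper's scope).
TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphHeadingParser(HTMLParser):
    """Collects the text found inside <p> and <h1>-<h6> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a tracked tag
        self.buffer = []  # text fragments of the block being read
        self.blocks = []  # finished paragraphs and headings

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace before storing the block.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.blocks.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_text(html):
    """Return paragraph and heading text, one block per line."""
    parser = ParagraphHeadingParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)


def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into character chunks that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += chunk_size - overlap
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk. In the actual pipeline, each chunk would then be embedded with `all-mpnet-base-v2` and added to the FAISS index.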
- `FAISS_DB_DIR`: Directory to save or load the FAISS vector database.
- `Internal_links.txt`: File containing URLs to scrape.
- `MAX_RETRIES` and `RETRY_DELAY`: Configure retry attempts for failed network requests.
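As a rough sketch of how `MAX_RETRIES` and `RETRY_DELAY` might drive a retry loop: the values below are placeholders, `fetch_with_retries` is a hypothetical helper, and `MAX_RETRIES` is assumed here to mean the total number of attempts. The `fetch` argument stands in for whatever function actually downloads a page.

```python
import time

# Placeholder values; the project's real settings may differ.
MAX_RETRIES = 3
RETRY_DELAY = 2.0  # seconds to wait between attempts


def fetch_with_retries(fetch, url, max_retries=MAX_RETRIES, retry_delay=RETRY_DELAY):
    """Call fetch(url), retrying on network-style errors.

    OSError covers ConnectionError, TimeoutError and urllib.error.URLError;
    requests.RequestException also derives from it.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except OSError as exc:
            last_error = exc
            if attempt < max_retries:
                time.sleep(retry_delay)  # pause before the next attempt
    raise RuntimeError(f"giving up on {url} after {max_retries} attempts") from last_error
```

Passing the download function in as a parameter keeps the retry logic independent of the HTTP library and easy to test with a stub.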
- Ensure the URLs provided are accessible and contain sufficient text content.
- Internal links are filtered to exclude image files (e.g., `.jpg`, `.png`).
- Processing large websites with many links may take significant time and resources.
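The image-file filter mentioned in the notes could look roughly like the sketch below. Only `.jpg` and `.png` come from this README; the other extensions, the same-host definition of "internal", and the function names are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# .jpg and .png are named in the README; the remaining extensions are assumptions.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"}


def is_image_url(url):
    """Return True if the URL path ends in a known image extension."""
    path = urlparse(url).path  # ignores any query string or fragment
    return posixpath.splitext(path)[1].lower() in IMAGE_EXTENSIONS


def filter_internal_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep same-host, non-image pages."""
    base_host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if parsed.netloc != base_host:
            continue  # "internal" is taken to mean "same host"
        if is_image_url(absolute):
            continue
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Deduplicating while preserving order also helps with the last note above, since each page of a large site is fetched at most once.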