Web_Scraping_Chatbot

Overview

This project demonstrates how to create a question-answering (QA) chatbot using FAISS for vector storage and LangChain for building the retrieval-based QA chain. The chatbot scrapes content from provided URLs, processes the text into embeddings, and answers questions using a fine-tuned LaMini-T5 model.

Embedding model: all-mpnet-base-v2, used to create the embeddings and to embed queries for retrieval.
Vector database: FAISS (Facebook AI Similarity Search), used to store the embeddings.
Language model: LaMini-T5 (developed by MBZUAI, the Mohamed bin Zayed University of Artificial Intelligence in the UAE), used to refine the response and to store the conversation.

Inputs

Internal_links.txt: File containing the URLs to scrape.
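
As a rough illustration of how such an input file could be read, here is a minimal sketch. It assumes one URL per line; the `read_urls` helper, the `#` comment convention, and the de-duplication step are hypothetical and not taken from training.py.

```python
from pathlib import Path


def read_urls(path: str = "Internal_links.txt") -> list[str]:
    """Return the unique http(s) URLs listed one per line in `path`."""
    urls: list[str] = []
    seen: set[str] = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Skip blank lines and (assumed) comment lines.
        if not url or url.startswith("#"):
            continue
        # Ignore anything that is not a web URL.
        if not url.startswith(("http://", "https://")):
            continue
        # Keep the first occurrence only, preserving file order.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Keeping the file order makes the scraping sequence predictable, which helps when a run is interrupted and restarted.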

Outputs

Answers to User Queries: The chatbot provides contextually relevant answers to questions based on text scraped from the provided URLs. It retrieves the most relevant text chunks and generates a response from them with a fine-tuned language model (LaMini-T5).

Source References: For each answer, the chatbot includes the source URL from which the information was retrieved. If no relevant information is found, it apologizes and notifies the user.

Embedded Context: The text scraped from URLs is processed into semantic embeddings, allowing fast and accurate retrieval for queries.
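
The retrieval and source-reference behaviour described above can be sketched in a few lines. This is a toy under stated assumptions, not the project's code: word-count vectors stand in for all-mpnet-base-v2 embeddings, a brute-force cosine scan stands in for FAISS, and the function names, the `min_score` threshold, and the apology wording are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2, min_score=0.1):
    """chunks is a list of (text, source_url); return the top-k hits above min_score."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), text, url) for text, url in chunks]
    relevant = [hit for hit in scored if hit[0] >= min_score]
    return sorted(relevant, key=lambda hit: hit[0], reverse=True)[:k]


def answer_with_source(query, chunks):
    """Return the best chunk with its source URL, or an apology if nothing matches."""
    hits = retrieve(query, chunks)
    if not hits:
        return "Sorry, I could not find relevant information in the scraped pages."
    _, text, url = hits[0]
    return f"{text}\nSource: {url}"
```

In the real pipeline the retrieved chunks would be passed to LaMini-T5 to generate the final wording; the sketch returns the best chunk directly to keep the source-tracking step visible.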

Features

Scrapes text content from URLs and internal links.
Creates embeddings and stores them in a FAISS vector database.
Leverages a fine-tuned language model (LaMini-T5) for answering questions.
Allows saving and loading FAISS vector databases for efficiency.

The program will ask whether to load an existing FAISS database or create a new one from scratch. After processing embeddings, you can ask questions based on the content of the scraped URLs.
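
The load-or-create decision might look like the sketch below. The `load_or_create` helper and its `load` and `create` callbacks are hypothetical; in a LangChain setup they would plausibly wrap the FAISS vector store's `load_local` and `save_local` methods, but that wiring is an assumption rather than something shown in this README.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_create(
    db_dir: str,
    want_load: bool,
    load: Callable[[str], T],
    create: Callable[[str], T],
) -> T:
    """Load a saved index from db_dir if requested and present, else build a new one."""
    path = Path(db_dir)
    has_saved_index = path.is_dir() and any(path.iterdir())
    if want_load and has_saved_index:
        return load(db_dir)
    # Fall back to a fresh build when nothing has been saved yet.
    return create(db_dir)
```

The `want_load` flag could come from an interactive prompt such as `input("Load existing database? (y/n) ").strip().lower() == "y"`. Falling back to a fresh build avoids a crash when the user asks to load a database that was never saved.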

How It Works

Scraping URLs: The script fetches content from the provided URLs and internal links, extracting text from paragraphs and headings.
Creating Embeddings: The text is split into chunks and processed into embeddings using the all-mpnet-base-v2 model.
QA Chain: A retrieval-based QA chain is built using the LaMini-T5 model for text generation and FAISS for vector retrieval.
Query Processing: The chatbot retrieves relevant text chunks and generates responses based on user queries.
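
The first two steps above, extracting paragraph and heading text and splitting it into chunks, can be sketched with the standard library alone. This is an illustration under assumptions: scripts of this kind more commonly use requests with BeautifulSoup and a LangChain text splitter, and the class name, function names, `chunk_size`, and `overlap` values here are hypothetical. The parser also assumes that `<p>` tags are explicitly closed.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}


class ParagraphHeadingParser(HTMLParser):
    """Collect the text found inside <p> and <h1>..<h6> elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._depth = 0  # greater than 0 while inside a paragraph or heading
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace inside the finished block.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_text(html: str) -> str:
    """Return paragraph and heading text, one block per line."""
    parser = ParagraphHeadingParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.blocks)


def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += chunk_size - overlap
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.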

Configuration

FAISS_DB_DIR: Directory to save or load the FAISS vector database.
Internal_links.txt: File containing URLs to scrape.
MAX_RETRIES and RETRY_DELAY: Configure retry attempts for failed network requests.
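
To show how the two retry settings might interact, here is a minimal sketch. The default values, the `fetch_with_retries` helper, and the reading of MAX_RETRIES as the total number of attempts are assumptions, not values taken from training.py.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_RETRIES = 3    # assumed: total number of attempts per request
RETRY_DELAY = 2.0  # assumed: seconds to wait between attempts


def fetch_with_retries(
    fetch: Callable[[], T],
    max_retries: int = MAX_RETRIES,
    retry_delay: float = RETRY_DELAY,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except OSError as exc:
            # urllib's URLError, socket timeouts, and requests' RequestException
            # all derive from OSError, so this covers typical network failures.
            last_error = exc
            if attempt < max_retries:
                sleep(retry_delay)
    raise RuntimeError(f"request failed after {max_retries} attempts") from last_error
```

A call could look like `fetch_with_retries(lambda: urllib.request.urlopen(url, timeout=10).read())`. Passing `sleep` as a parameter makes the helper testable without real waiting.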

Notes

Ensure the URLs provided are accessible and contain sufficient text content.
Internal links are filtered to exclude image files (e.g., .jpg, .png).
Processing large websites with many links may take significant time and resources.
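
The image-file filter on internal links could be implemented along these lines. This is a sketch under assumptions: only .jpg and .png are named above, so the other extensions, the same-host rule for what counts as "internal", and the `internal_links` helper are hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# .jpg and .png come from the notes above; the rest are assumed additions.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp")


def internal_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url and keep same-host, non-image pages."""
    base_host = urlparse(base_url).netloc
    result: list[str] = []
    for href in hrefs:
        # Make the link absolute and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # e.g. mailto: or javascript: links
        if parsed.netloc != base_host:
            continue  # external site
        if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
            continue  # image file, no text to scrape
        if absolute not in result:
            result.append(absolute)
    return result
```

Capping the number of links followed per site is one way to bound the time and resources mentioned above.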

Repository: click2cloud-sagarB/link_chatbot
