Web_Scraping_Chatbot (Sugarcane)

Overview

This project implements a web scraping-based question-answering (QA) chatbot for retrieving sugarcane-related information from web sources. It uses BeautifulSoup for web scraping, FAISS for vector storage, Azure OpenAI for embeddings, and LaMini-T5 for text generation. The chatbot processes scraped content into vector embeddings, retrieves relevant information, and generates responses based on user queries.

Model: text-embedding-3-large (Azure OpenAI) → Creates the embeddings used to retrieve relevant text.
FAISS (Facebook AI Similarity Search) → Stores and searches embeddings efficiently.
Model: LaMini-T5 (developed by MBZUAI, the Mohamed bin Zayed University of Artificial Intelligence) → Generates refined responses and maintains conversational coherence.

Inputs

sugarcane_testing_link.txt: A text file containing URLs to scrape.
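
As a minimal sketch of how such a file could be read, assuming it holds one URL per line: the helper name `load_urls` and the convention of skipping `#` comment lines are illustrative assumptions, not taken from the project code.

```python
from pathlib import Path


def load_urls(path: str) -> list[str]:
    """Return unique, non-empty URLs from a text file, preserving order."""
    seen = set()
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        url = line.strip()
        # Skip blank lines and (assumed convention) lines starting with '#'.
        if not url or url.startswith("#"):
            continue
        # Drop duplicates so the same page is not scraped twice.
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

A call such as `load_urls("sugarcane_testing_link.txt")` would then return the list of URLs to scrape.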

Outputs

Answers to User Queries: Retrieves relevant information from the scraped sources using FAISS and Azure OpenAI embeddings, then answers the user's question with LaMini-T5.
Source References: Displays the URLs of the retrieved information for transparency. If no relevant content is found, the chatbot notifies the user.
Embedded Context: Extracted text is processed into semantic embeddings, which enables fast retrieval for queries.
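
The source-reference behavior described above could look roughly like the following sketch. The function name `format_response` and the exact wording of the "not found" message are hypothetical; the project may present its output differently.

```python
def format_response(answer: str, source_urls: list[str]) -> str:
    """Attach source URLs to an answer, or report that nothing was found."""
    if not source_urls:
        # Hypothetical wording for the "no relevant content" notice.
        return "Sorry, I could not find relevant information in the scraped sources."
    # dict.fromkeys removes duplicate URLs while keeping their original order.
    unique_sources = list(dict.fromkeys(source_urls))
    lines = [answer.strip(), "", "Sources:"]
    lines += [f"- {url}" for url in unique_sources]
    return "\n".join(lines)
```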

Features

Web Scraping: Extracts text content from specified URLs, filtering out ads and irrelevant elements.
Text Processing: Cleans, tokenizes, and lemmatizes extracted text for improved NLP performance.
FAISS Vector Storage: Converts text into embeddings and stores them for quick retrieval.
LaMini-T5 Model: Uses a fine-tuned Transformer model to generate responses.
Azure OpenAI Embeddings: Transforms text into numerical representations for efficient search.
Persistent FAISS Database: Saves and loads embeddings to avoid redundant processing.
Retry Mechanism: Handles request failures with automatic retries.
Interactive Chatbot: Engages users in a conversational format.
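
One way a retry mechanism like the one listed above might be written is sketched below, using only the standard library. The retry count, the exponential backoff, and the names `fetch_with_retries` and `_default_fetch` are assumptions for illustration; the project's actual retry policy is not described here.

```python
import time
import urllib.request


def _default_fetch(url: str) -> str:
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_with_retries(url, fetch=_default_fetch, retries=3, backoff=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on network errors with exponential backoff."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as err:  # urllib.error.URLError and timeouts subclass OSError
            last_error = err
            # Wait backoff, 2*backoff, 4*backoff, ... between attempts.
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error
```

Passing `fetch` and `sleep` as parameters keeps the retry logic separate from the network call, so it can be exercised without real HTTP requests.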

How It Works

Scraping URLs: Reads URLs from sugarcane_testing_link.txt, fetches content from the web, and extracts text from paragraphs, headings, and tables. Removes advertisements and non-informative elements.
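
The extraction step could be sketched with BeautifulSoup as follows. Which tags count as "non-informative" is an assumption here (`NOISE_TAGS` is a hypothetical list), and real ad removal usually needs site-specific rules.

```python
from bs4 import BeautifulSoup

# Assumed set of tags that rarely carry article content.
NOISE_TAGS = ["script", "style", "nav", "footer", "header", "aside", "form", "iframe"]
# Headings, paragraphs, and table cells, as described in the step above.
CONTENT_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6", "p", "th", "td"]


def extract_text(html: str) -> str:
    """Return the text of headings, paragraphs, and table cells in document order."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove noise elements together with everything inside them.
    for tag in soup.find_all(NOISE_TAGS):
        tag.decompose()
    parts = []
    for element in soup.find_all(CONTENT_TAGS):
        text = element.get_text(" ", strip=True)
        if text:
            parts.append(text)
    # Note: a paragraph nested inside a table cell would appear twice;
    # this sketch does not deduplicate nested content tags.
    return "\n".join(parts)
```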
Creating Embeddings: Processes extracted text into semantic embeddings using text-embedding-3-large and stores them in a FAISS vector database for efficient search.
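
Before text is embedded, it is commonly split into smaller chunks so that each embedding covers a focused passage. The sketch below assumes word-based chunks with a small overlap and keeps the source URL with every chunk so it can be cited later; the function name `chunk_text` and the default sizes are illustrative assumptions, not the project's settings.

```python
def chunk_text(text: str, url: str, chunk_size: int = 200, overlap: int = 40) -> list[dict]:
    """Split text into overlapping word windows, each tagged with its source URL."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({"text": " ".join(window), "source": url})
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk's `text` would then be sent to the embedding model, and the resulting vector stored alongside the chunk's `source`.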
Retrieval and Answer Generation: When a user asks a question, the chatbot searches FAISS for the most relevant text chunks, uses LaMini-T5 to generate a natural response from them, and provides source references for transparency.
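
To make the retrieval step concrete, the sketch below ranks chunks by cosine similarity in plain Python. This is a stand-in for the FAISS search, not the project's code: FAISS performs the same kind of nearest-neighbor lookup far more efficiently. The names `top_k` and `build_prompt`, the `min_score` cutoff, and the prompt wording are all assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, chunks, k=3, min_score=0.0):
    """Return up to k (score, chunk) pairs, best first, scoring at least min_score."""
    scored = sorted(
        ((cosine(query_vec, vec), chunk) for vec, chunk in zip(chunk_vecs, chunks)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(score, chunk) for score, chunk in scored[:k] if score >= min_score]


def build_prompt(question: str, hits) -> str:
    """Combine retrieved chunk texts and the question into one generation prompt."""
    context = "\n".join(chunk["text"] for _, chunk in hits)
    # Hypothetical prompt template for the text-generation model.
    return f"Answer the question using the context.\nContext: {context}\nQuestion: {question}"
```

An empty result from `top_k` corresponds to the case where no relevant content is found, and the `source` field of each returned chunk supplies the reference URLs.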

Notes

Ensure the URLs provided are accessible and contain sufficient text content.
Processing large websites with many links may take significant time and resources.

Repository: click2cloud-amodi/Link_Chatbot
