Webscraping

Repo for Petra Oil Bitumen

Introduction

The Petra Oil Web Scraper is a proprietary Python-based tool developed to streamline data extraction from websites. This documentation serves as a comprehensive guide to its functionalities, setup, and usage.

Overview

This scraper is purpose-built to extract internal links from specified websites, perform keyword searches within the extracted links, and provide detailed information based on user-defined queries.

Key Features

Link Extraction

Description: Extracts internal links from targeted websites.

Implementation: Utilizes Python libraries for web scraping, navigating the HTML structure to identify anchor tags.
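As an illustration of the anchor-tag approach described above, here is a minimal sketch using only the Python standard library. The names `AnchorCollector` and `extract_internal_links` are hypothetical, and treating "internal" as "same host as the base URL" is an assumption; the actual scraper may use different libraries and rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_internal_links(html, base_url):
    """Return absolute links in `html` that share the host of `base_url`.

    Assumption: a link is "internal" when its host matches the base URL's host.
    """
    collector = AnchorCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        if urlparse(absolute).netloc == base_host:
            links.append(absolute)
    return links
```

Relative links such as `/about` are resolved against the base URL before the host comparison, so they count as internal.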

Keyword Search

Description: Conducts keyword searches within the extracted links.

Implementation: Employs advanced string matching algorithms to locate defined keywords within the link set.
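The source does not say which matching algorithm is used, so the following is only a sketch of the simplest option: a case-insensitive substring match over the link URLs. The function name `search_links` is hypothetical.

```python
def search_links(links, keywords):
    """Map each keyword to the links whose URL contains it.

    Assumption: matching is a case-insensitive substring test on the URL text.
    """
    results = {}
    for keyword in keywords:
        needle = keyword.lower()
        results[keyword] = [link for link in links if needle in link.lower()]
    return results
```

A keyword with no matches maps to an empty list, which makes it easy to report which keywords were not found.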

Information Prompt

Description: Allows users to request specific details related to keywords or website content.

Implementation: Utilizes user-input queries to generate targeted responses based on search results or website content.

Duplicate Elimination

Description: Filters and eliminates duplicate links obtained during the scraping process.

Implementation: Employs robust data structures to ensure uniqueness and prevent redundancy in the list of extracted links.
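One common way to guarantee uniqueness is a set of already-seen URLs. The sketch below assumes that approach and also assumes that links differing only by a `#fragment` count as duplicates; neither detail is confirmed by the source, and `unique_links` is a hypothetical name.

```python
from urllib.parse import urldefrag


def unique_links(links):
    """Return links with duplicates removed, keeping first-seen order.

    Assumption: URLs that differ only in their #fragment are the same page.
    """
    seen = set()
    unique = []
    for link in links:
        normalized = urldefrag(link).url  # strip any #fragment
        if normalized not in seen:
            seen.add(normalized)
            unique.append(normalized)
    return unique
```

Keeping a separate list alongside the set preserves the order in which links were discovered, which a plain `set` would not.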

Depth Configuration

Description: Allows customization of link traversal depth.

Implementation: Utilizes a tree traversal mechanism to control the exploration depth within the link structure.
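To show how a depth limit can bound link traversal, here is a sketch of a breadth-first crawl. The choice of breadth-first order, the function name `crawl`, and the `get_links` callback (any function that returns the links found on a page) are assumptions made for illustration, not the repository's actual implementation.

```python
from collections import deque


def crawl(start_url, get_links, max_depth):
    """Visit pages breadth-first, following links at most `max_depth` hops.

    `get_links(url)` is a caller-supplied function returning the links on a page.
    Returns a list of (url, depth) pairs in visit order.
    """
    visited = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

With `max_depth=0` only the start page is visited; each increment allows one more hop away from it. The `visited` set also prevents loops when pages link back to each other.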

Web Interface (Current State)

Description: The current iteration of the web interface is driven primarily by JavaScript.

Implementation: Built mainly in JavaScript, using modern JavaScript frameworks, to provide an interactive and responsive interface for data input and retrieval.

Usage Instructions

Interacting with the Scraper

Input: Provide the website link for link extraction and keyword search.

Define Keywords: Set specific keywords for the search operation.

Depth Configuration: Customize the depth parameter for link traversal.

Prompt Information: Enter queries for specific information related to keywords or website content.

Output

1. Retrieved links from the provided website.
2. Keyword search results highlighting occurrences within the extracted links.
3. Prompted information based on user queries.

Development Setup

Prerequisites

Python environment with required dependencies.

MongoDB instance for storing extracted data.

Required Dependencies

Python Libraries: See the requirements.txt file for the Python libraries, and their versions, that the scraper requires. Install them using pip install -r requirements.txt.

Database Configuration

The scraper utilizes MongoDB for storing extracted data.

Installation: Install MongoDB and set up a local or remote instance.

Configuration: Update the MongoDB connection settings in the config.py file to specify the database connection details such as hostname, port, username, and password.
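The layout of config.py is not shown here, so the following is only a sketch of how the four settings named above (hostname, port, username, password) could be combined into a standard MongoDB connection URI. The function name `build_mongo_uri` and its defaults are hypothetical.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="localhost", port=27017, username=None, password=None):
    """Build a mongodb:// connection URI from individual settings.

    Credentials are percent-encoded so characters like '@' or '/' in a
    password do not break the URI. Hypothetical helper, not from config.py.
    """
    if username and password:
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    else:
        credentials = ""
    return f"mongodb://{credentials}{host}:{port}/"
```

If the project connects through pymongo (an assumption), the resulting string can be passed directly to `pymongo.MongoClient(uri)`.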

Contribution Guidelines

This is a private repository; contributions are restricted to authorized team members only.

License

This project is a proprietary system owned by Petra Oil Bitumen and is not open to the public.

