Streamlit Web Scraper


🌐 A Clean & Simple Web Scraper Built With Streamlit, BeautifulSoup, and Selenium

This Streamlit Web Scraper extracts text content from any website, whether it's static or JavaScript-heavy, and saves the data into neatly formatted Markdown files. Ideal for personal research, data collection, or sharing website content with others.

✨ Key Features

  • Scrape Dynamic and Static Websites: Supports JavaScript-rendered content using Selenium and traditional HTML scraping using BeautifulSoup.
  • Single Markdown File Output: Consolidates all scraped data into one clean and structured Markdown file, organized by page and section.
  • SQLite-Based Scrape History: Logs all scraped sessions for future access, allowing you to view or download previous scrapes at any time.
  • Slack Integration (Coming Soon): In an upcoming release, scraped data can be sent directly to a Slack channel for easy sharing and collaboration.
  • Cross-Device & Version-Control Friendly: Built with Git in mind, enabling seamless version control and multi-device collaboration.
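The SQLite-based scrape history could be modeled along these lines. This is a minimal stdlib-only sketch: the table name, columns, and function names are assumptions for illustration, not the app's actual schema.

```python
import sqlite3
from datetime import datetime, timezone

def init_db(path="scrape_history.db"):
    """Open the history database and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scrapes (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               url TEXT NOT NULL,
               scraped_at TEXT NOT NULL,
               markdown TEXT NOT NULL
           )"""
    )
    return conn

def log_scrape(conn, url, markdown):
    """Record one scrape session and return its row id."""
    cur = conn.execute(
        "INSERT INTO scrapes (url, scraped_at, markdown) VALUES (?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), markdown),
    )
    conn.commit()
    return cur.lastrowid

def list_scrapes(conn):
    """Return (id, url, scraped_at) for past sessions, newest first."""
    return conn.execute(
        "SELECT id, url, scraped_at FROM scrapes ORDER BY id DESC"
    ).fetchall()
```

Because the history lives in a single SQLite file, past scrapes remain downloadable across app restarts without any external database server.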

🚀 Getting Started

Prerequisites

Make sure you have the following installed:

  • Python 3 and pip
  • A Selenium-compatible browser and matching WebDriver (e.g., Chrome with ChromeDriver), needed for scraping JavaScript-heavy sites

Installation

  1. Clone the repository to your local machine:

    git clone https://github.com/mirkotrotta/streamlit_web_scraper.git
    cd streamlit_web_scraper
  2. Set up a virtual environment (optional but recommended):

    python3 -m venv venv
    source venv/bin/activate   # on Windows: venv\Scripts\activate
  3. Install the required dependencies:

    pip install -r requirements.txt
  4. Set up environment variables (if required, e.g., for future Slack integration):

    • Create a .env file and add any necessary environment variables, such as SLACK_BOT_TOKEN (for future integration).
  5. Run the Streamlit App:

    streamlit run app.py
  6. Open your browser and go to http://localhost:8501 to use the app.
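For step 4 above, the .env file only needs entries for features you plan to use. A hypothetical example (the token value is a placeholder; only the upcoming Slack integration reads it):

```
# .env — optional; required only for the planned Slack integration
SLACK_BOT_TOKEN=xoxb-your-token-here
```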


🛠 How to Use

Scraping a Website

  1. Enter the URL of the website you want to scrape.
  2. Select whether the website is dynamic (i.e., JavaScript-heavy) or static.
  3. Click "Scrape Website": The scraper will retrieve text-based content from the site, organizing it by page and section into a Markdown file.
  4. Download the Markdown file: Once the scrape is complete, download the file directly through the interface or access previously saved scrapes.
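The single-file Markdown output described above can be pictured roughly like this. A stdlib-only sketch; the page/section data shape and function name are illustrative, not the app's actual code:

```python
def build_markdown(pages):
    """Consolidate scraped pages into one Markdown document.

    `pages` maps a page title to a list of (section_heading, text) pairs;
    each page becomes an H1 and each section an H2 beneath it.
    """
    parts = []
    for page_title, sections in pages.items():
        parts.append(f"# {page_title}")
        for heading, text in sections:
            parts.append(f"## {heading}")
            parts.append(text.strip())
    return "\n\n".join(parts) + "\n"
```

Organizing the output this way keeps one scrape per file, so the result is easy to diff, version with Git, or share as-is.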

Features in Progress

  • Slack Integration: Soon, you'll be able to send scraped data to a specified Slack channel.
  • More Framework Support: Future experiments include integrating with additional frameworks and APIs for advanced scraping scenarios.

🔄 Roadmap

Planned Features

  • Slack Integration: Scrape data and automatically send it to a Slack channel for quick collaboration.
  • API Integration: Adding support for scraping APIs and handling authentication where required.
  • Advanced Scraping Techniques: Experimenting with frameworks like Playwright for even better dynamic content handling.

🤝 Contributing

Contributions are welcome! If you have suggestions for improvements, feel free to:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/YourFeature)
  3. Commit your changes (git commit -m 'Add YourFeature')
  4. Push to the branch (git push origin feature/YourFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.


👤 Author

Mirkotrotta


💬 Contact

For any inquiries, questions, or feedback, feel free to open an issue or contact me.
