This Streamlit Web Scraper extracts text content from any website, whether it's static or JavaScript-heavy, and saves the data into neatly formatted Markdown files. Ideal for personal research, data collection, or sharing website content with others.
- Scrape Dynamic and Static Websites: Supports JavaScript-rendered content using Selenium and traditional HTML scraping using BeautifulSoup.
- Single Markdown File Output: Consolidates all scraped data into one clean and structured Markdown file, organized by page and section.
- SQLite-Based Scrape History: Logs all scraped sessions for future access, allowing you to view or download previous scrapes at any time.
- Slack Integration (Coming Soon): In an upcoming release, scraped data can be sent directly to a Slack channel for easy sharing and collaboration.
- Cross-Device & Version-Control Friendly: Built with Git in mind, enabling seamless version control and multi-device collaboration.
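As a rough illustration of how these pieces could fit together, here is a minimal sketch. The function names `scrape`, `extract_sections`, and `render_markdown` are hypothetical, not the app's actual API, and the Selenium path assumes a chromedriver is available on your PATH:

```python
def render_markdown(url, sections):
    """Join (heading, text) pairs into one Markdown document per page."""
    lines = [f"# {url}", ""]
    for heading, text in sections:
        lines += [f"## {heading}", "", text, ""]
    return "\n".join(lines)

def extract_sections(html):
    """Pair each heading with the text of the element that follows it."""
    from bs4 import BeautifulSoup  # same parser the static mode relies on
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for h in soup.find_all(["h1", "h2", "h3"]):
        sib = h.find_next_sibling()
        sections.append((h.get_text(strip=True),
                         sib.get_text(" ", strip=True) if sib else ""))
    return sections

def scrape(url, dynamic=False):
    if dynamic:
        # JavaScript-heavy site: let Selenium render the page first.
        from selenium import webdriver
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            html = driver.page_source
        finally:
            driver.quit()
    else:
        # Static site: a plain HTTP fetch is enough.
        import requests
        html = requests.get(url, timeout=10).text
    return render_markdown(url, extract_sections(html))
```

Either path ends in the same place: plain HTML handed to BeautifulSoup, then flattened into Markdown organized by page and section.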
Make sure you have the following installed:
- Python 3.7+
- Git
- Virtualenv (optional but recommended)
- Clone the repository to your local machine:

  ```bash
  git clone https://github.com/mirkotrotta/streamlit_web_scraper.git
  cd streamlit_web_scraper
  ```
- Set up a virtual environment (optional, but recommended):

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # on Windows: venv\Scripts\activate
  ```
- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables (if required, e.g., for future Slack integration):
  - Create a `.env` file and add any necessary environment variables, such as `SLACK_BOT_TOKEN` (for future integration).
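Once a `.env` file exists, the token can be read at runtime. A minimal sketch, where the helper name `get_slack_token` is hypothetical and loading the file via python-dotenv is an assumption:

```python
import os

def get_slack_token(default=None):
    # A .env file is typically loaded into the environment first with
    # python-dotenv (assumption: it is installed), e.g.:
    #   from dotenv import load_dotenv; load_dotenv()
    return os.environ.get("SLACK_BOT_TOKEN", default)
```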
- Run the Streamlit app:

  ```bash
  streamlit run app.py
  ```
- Open your browser and go to `http://localhost:8501` to use the app.
- Enter the URL of the website you want to scrape.
- Select whether the website is dynamic (i.e., JavaScript-heavy) or static.
- Click "Scrape Website": The scraper will retrieve text-based content from the site, organizing it by page and section into a Markdown file.
- Download the Markdown file: Once the scrape is complete, download the file directly through the interface or access previously saved scrapes.
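The SQLite-backed scrape history could look roughly like this. This is a sketch only: the table name `scrapes` and its columns are assumptions, not the app's actual schema:

```python
import sqlite3
from datetime import datetime, timezone

def init_history(conn):
    """Create the history table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scrapes (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               url TEXT NOT NULL,
               scraped_at TEXT NOT NULL,
               markdown TEXT NOT NULL
           )"""
    )

def log_scrape(conn, url, markdown):
    """Record one finished scrape with a UTC timestamp."""
    conn.execute(
        "INSERT INTO scrapes (url, scraped_at, markdown) VALUES (?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), markdown),
    )
    conn.commit()

def list_scrapes(conn):
    """Return (id, url, scraped_at) rows, newest first."""
    return conn.execute(
        "SELECT id, url, scraped_at FROM scrapes ORDER BY id DESC"
    ).fetchall()
```

Listing only the metadata keeps the history view cheap; the stored Markdown is fetched by `id` when a previous scrape is downloaded.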
- Slack Integration: Soon, you'll be able to send scraped data to a specified Slack channel.
- More Framework Support: Future experiments include integrating with additional frameworks and APIs for advanced scraping scenarios.
- Slack Integration: Scrape data and automatically send it to a Slack channel for quick collaboration.
- API Integration: Adding support for scraping APIs and handling authentication where required.
- Advanced Scraping Techniques: Experimenting with frameworks like Playwright for even better dynamic content handling.
Contributions are welcome! If you have suggestions for improvements, feel free to:
- Fork the repository
- Create a new branch (`git checkout -b feature/YourFeature`)
- Commit your changes (`git commit -m 'Add YourFeature'`)
- Push to the branch (`git push origin feature/YourFeature`)
- Open a Pull Request
This project is licensed under the MIT License. See the LICENSE file for details.
Mirkotrotta
For any inquiries, questions, or feedback, feel free to open an issue or contact me.