This README outlines the design, implementation, and usage of a Python script that uses Selenium to click the "Random Entry" link on the Stanford Encyclopedia of Philosophy (SEP) main page, scrape the resulting entry, and print the formatted content to the console.
This script is designed to perform the following tasks:
- Navigate to the SEP main page.
- Click on the "Random Entry" link.
- Scrape the content of the resulting page.
- Format the content and print it to the console.
The script uses Selenium WebDriver to automate the browser interaction, Requests to fetch the entry's HTML, and BeautifulSoup to parse and format it. It depends on the following modules:
- requests: To send HTTP requests.
- selenium: To control the web browser.
- webdriver_manager: To manage the WebDriver binaries.
- bs4 (BeautifulSoup): To parse and format the HTML content.
- time: To add delays for page loading.
The script defines two functions:
- scrape_and_format_content(url): Fetches and formats content from a given URL.
- main(): The main function to set up the WebDriver, navigate to the SEP main page, click the "Random Entry" link, and call the scraping function.
scrape_and_format_content(url):
- Input: URL of the webpage to scrape.
- Process:
- Sends a GET request to the provided URL.
- Checks if the request was successful.
- Parses the HTML content using BeautifulSoup.
- Extracts the title from the h1 tag.
- Extracts the main content from the div with id='main-text'.
- Combines the title and content into a formatted string.
- Prints the formatted content to the console.
- Output: None (prints content to the console).
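The parsing steps above can be exercised in isolation, without a network request. The snippet below is a minimal sketch using an inline HTML stand-in shaped like a SEP entry (the markup and text here are illustrative, not a real page):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for a SEP entry page: an h1 title plus a div with
# id='main-text', mirroring the structure the script expects.
html = """
<html><body>
<h1>Example Entry</h1>
<div id="main-text"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Same extraction logic as scrape_and_format_content()
title = soup.find('h1').get_text()
content = soup.find('div', {'id': 'main-text'}).get_text(separator='\n')
print(f"Title: {title}\n\nContent:\n{content}")
```

This mirrors the function's extraction logic exactly, so it can double as a quick sanity check for the selectors if SEP's markup ever changes.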
main():
- Input: None
- Process:
- Sets the URL of the SEP main page.
- Sets up the Selenium WebDriver.
- Navigates to the SEP main page.
- Waits for the "Random Entry" link to be clickable and clicks it.
- Waits for the new page to load.
- Retrieves the URL of the new page.
- Calls scrape_and_format_content(url) with the new page URL.
- Closes the WebDriver.
- Output: None (calls the scraping function and prints content).
- Python 3.x
- Required Python libraries: requests, selenium, beautifulsoup4, webdriver_manager
Install the required libraries using pip:
```shell
pip install requests selenium beautifulsoup4 webdriver_manager
```
```python
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time


# Function to scrape and format content
def scrape_and_format_content(url):
    # Send a GET request to the webpage
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract the title of the article
        title = soup.find('h1').get_text()

        # Extract the main content
        content = soup.find('div', {'id': 'main-text'}).get_text(separator='\n')

        # Combine the title and content
        full_content = f"Title: {title}\n\nContent:\n{content}"

        # Print the content to the console
        print(full_content)
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")


def main():
    main_url = 'https://plato.stanford.edu/index.html'

    # Set up the Selenium WebDriver; webdriver_manager downloads a
    # matching ChromeDriver binary automatically
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    driver.get(main_url)

    # Wait for the "Random Entry" link to be clickable and click it
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.LINK_TEXT, "Random Entry"))
    ).click()

    # Wait for the new page to load
    time.sleep(5)

    # Get the URL of the new page
    random_entry_url = driver.current_url

    # Scrape and format the content of the new page
    scrape_and_format_content(random_entry_url)

    # Close the WebDriver
    driver.quit()


if __name__ == "__main__":
    main()
```
- Ensure all prerequisites are met and required libraries are installed.
- Run the script using a Python interpreter: `python scrape_random_entry.py`
- The script will navigate to the SEP main page, click the "Random Entry" link, scrape the resulting page, and print the formatted content to the console.
- Add error handling for Selenium WebDriver actions.
- Implement logging for better traceability.
- Enhance the script to handle different page structures if necessary.
- Extend the functionality to save the scraped content to a file.
This README provides a detailed overview of the web scraping and content formatting script. The script automates the process of navigating to a webpage, interacting with elements, and extracting useful information for display in the console.