Merge branch 'post_course_optimisation'
# Conflicts: # README.md
Showing 9 changed files with 324 additions and 343 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,89 @@ | ||
# P2_mignon_helene | ||
**Deliverable for Project 2 of the OpenClassrooms Python Application Developer path:** | |
Scraping of books.toscrape.com with BeautifulSoup4; exports book data to .csv files and cover images to the 'exports' folder. | |
<p align="center"> | ||
<img src="img/logo_bookstore.png" alt="logo" /> | ||
</p> | ||
<h1 align="center">Scraping <em>BooksToScrape</em></h1> | ||
<p align="center"> | ||
<a href="https://www.python.org"> | ||
<img src="https://img.shields.io/badge/Python-3.6+-3776AB?style=flat&logo=python&logoColor=white" alt="python-badge"> | ||
</a> | ||
<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/"> | ||
<img src="https://img.shields.io/badge/BeautifulSoup-4.9+-d71b60?style=flat" alt="Beautiful Soup"> | ||
</a> | ||
<a href="https://github.com/psf/requests"> | ||
<img src="https://img.shields.io/badge/Requests-2.25+-00838f?style=flat" alt="Requests"> | ||
</a> | ||
</p> | ||
|
||
**This application was optimised after the end of the course; see [Optimised version](https://github.com/hmignon/P2_mignon_helene/tree/post_course_optimisation).** | |
# About the project | ||
|
||
_Notes: This program prompts the user to paste the URL of the site (https://books.toscrape.com/index.html) or of the category they wish to export. Tested on Windows 10, Python 3.9.5._ | |
**OpenClassrooms Python Developer Project #2: Use Python Basics for Market Analysis** | ||
|
||
---------------------------------------------- | ||
## Windows: | |
In Windows PowerShell, navigate to the desired folder. | |
### Get the project | |
Scrapes [books.toscrape.com](https://books.toscrape.com) with **BeautifulSoup4** and **Requests**, | |
exports the data to .csv files, and downloads cover images to the *exports* folder. | |
|
||
$ git clone https://github.com/hmignon/P2_mignon_helene.git | ||
_Tested on Windows 10, Python 3.9.5._ | ||
|
||
### Activate the virtual environment | |
$ cd P2_mignon_helene | ||
$ python -m venv env | ||
$ ~env\scripts\activate | ||
|
||
### Install the required packages | |
$ pip install -r requirements.txt | ||
### Post-course optimisation | ||
This project has been optimised after the end of the OpenClassrooms course. | ||
To view the previous version, go to [this commit](https://github.com/hmignon/P2_mignon_helene/tree/163c5f5b2c730e7b308d01f31479702fb7c1e8e9). | ||
|
||
### Run the program | |
$ python main.py | ||
|
||
---------------------------------------------- | ||
## macOS and Linux: | |
In the terminal, navigate to the desired folder. | |
### Get the project | |
Improvements made to this project include: | ||
- Using OOP for the main scraper | ||
- Parsing of command line arguments for options | ||
- Optimising loops for faster execution time | ||
- JSON export | |
|
||
# Setup | ||
|
||
### Clone the repository | ||
|
||
$ git clone https://github.com/hmignon/P2_mignon_helene.git | ||
- `git clone https://github.com/hmignon/P2_mignon_helene.git` | ||
|
||
### Activate the virtual environment | |
$ cd P2_mignon_helene | ||
$ python3 -m venv env | ||
$ source env/bin/activate | ||
### Create the virtual environment | ||
|
||
- `cd P2_mignon_helene` | ||
- `python -m venv env` | ||
- Activate the environment: `source env/bin/activate` (macOS and Linux) or `env\Scripts\activate` (Windows) | |
|
||
### Install the required packages | |
$ pip install -r requirements.txt | ||
### Install required packages | ||
|
||
- `pip install -r requirements.txt` | ||
|
||
## Run the project | ||
|
||
To scrape all of [books.toscrape.com](https://books.toscrape.com) to .csv files, | |
run the command `python main.py`. | |
|
||
You can scrape a single category with the `--category` argument, which takes either a **category name** or a **full URL**. | |
For example, the following two commands yield the same results: | |
|
||
``` | ||
python main.py --category travel | ||
# OR | |
python main.py --category https://books.toscrape.com/catalogue/category/books/travel_2/index.html | ||
``` | ||
|
||
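A minimal sketch of how a `--category` value could be normalised so that a name and a full URL lead to the same page. The `resolve_category` function and the `CATEGORY_INDEX` mapping are hypothetical illustrations, not taken from the project; the real scraper presumably builds its category list from the site itself.

```python
from urllib.parse import urlparse

# Hypothetical name -> URL index, hard-coded here for illustration only.
CATEGORY_INDEX = {
    "travel": "https://books.toscrape.com/catalogue/category/books/travel_2/index.html",
}


def resolve_category(value, index=CATEGORY_INDEX):
    """Return a category URL for either a category name or a full URL."""
    value = value.strip()
    if urlparse(value).scheme in ("http", "https"):
        return value  # already a full URL, use it unchanged
    try:
        return index[value.lower()]  # names are matched case-insensitively
    except KeyError:
        raise ValueError(f"unknown category: {value!r}") from None
```

With this helper, `resolve_category("travel")` and `resolve_category` called with the full Travel URL return the same string.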
A **JSON** export option is available; it is marginally faster than exporting to **CSV**. | |
Both export types can be used in the same scraping process. | ||
|
||
``` | ||
python main.py -j    # or --json | |
python main.py -c    # or --csv | |
python main.py -c -j | ||
``` | ||
|
||
Cover image downloads can be skipped with `--ignore-covers`. | |
|
||
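The options described above could be parsed with `argparse` roughly as follows. This is a sketch, not the project's actual parser: only the flags named in this README are included, the help strings are invented, and the fallback to CSV when neither export flag is given is an assumption based on plain `python main.py` being documented as exporting .csv files.

```python
import argparse


def build_parser():
    """Build a parser for the flags documented in this README."""
    parser = argparse.ArgumentParser(description="Scrape books.toscrape.com")
    parser.add_argument("--category", help="category name or full category URL")
    parser.add_argument("-c", "--csv", action="store_true", help="export to csv")
    parser.add_argument("-j", "--json", action="store_true", help="export to json")
    parser.add_argument(
        "--ignore-covers", action="store_true", help="skip cover image downloads"
    )
    return parser


def parse_options(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and apply the default export type."""
    args = build_parser().parse_args(argv)
    # Assumption: with neither -c nor -j, fall back to csv export.
    if not (args.csv or args.json):
        args.csv = True
    return args
```

For instance, `parse_options(["--category", "travel", "-j"])` yields a namespace with `category="travel"`, `json=True` and `csv=False`.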
**Full list of optional arguments:** | ||
|
||
<p align="center"> | ||
<img src="img/help.png" alt="help" /> | ||
</p> | ||
|
||
### Using csv files | ||
|
||
### Run the program | |
$ python3 main.py | ||
If you wish to open the exported csv files in any spreadsheet software (Microsoft Excel, LibreOffice/OpenOffice Calc, Google Sheets...), | ||
please make sure to select the following options: | ||
- UTF-8 encoding | ||
- comma (,) as *separator* | ||
- double-quote (") as *string delimiter* |
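
The same three settings apply when reading an exported file from Python instead of a spreadsheet. A minimal sketch with the standard `csv` module follows; the `read_export` helper and the file path in the usage note are hypothetical.

```python
import csv


def read_export(path):
    """Read an exported csv file into a list of dicts, one per row."""
    # UTF-8 encoding, comma separator, double-quote string delimiter.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter=",", quotechar='"')
        return list(reader)
```

For example, `rows = read_export("exports/travel.csv")` would return one dictionary per book, keyed by the column names in the file's header row.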