This project represents a practical scenario in which a major shoe company aims to analyze the shoe market on Mercado Livre to develop a competitive pricing strategy. The objectives are to:
- Identify the most prominent brands across the first 20 pages of search results.
- Calculate the average price of athletic shoes by brand.
- Assess customer satisfaction for each brand (based on reviews).
To achieve these objectives, we'll implement an ETL (Extract, Transform, Load) pipeline using Python and widely used open-source libraries:
- Extraction: Utilize the `Scrapy` framework to scrape product data from Mercado Livre.
- Transformation: Use `pandas` to clean and format the raw data into structured DataFrames.
- Loading: Store the processed data in an `SQLite3` local database.
- Visualization: Present insights through an interactive dashboard built with `Streamlit`.
Create and activate a conda environment, then install the required libraries:
conda create -n scrapy-env
conda activate scrapy-env
conda install scrapy pandas streamlit
- Start a Scrapy project with the following command:
scrapy startproject collect
- Move into the project folder and create a spider for scraping:
cd collect
scrapy genspider mercadolivre https://lista.mercadolivre.com.br/tenis-corrida-masculino
The spider will be responsible for sending requests, parsing the necessary data, and handling pagination.
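Running `genspider` scaffolds the `mercadolivre.py` module inside the project's `spiders` folder. Depending on your Scrapy version, the generated skeleton looks roughly like this:

```python
import scrapy


class MercadolivreSpider(scrapy.Spider):
    name = "mercadolivre"
    allowed_domains = ["lista.mercadolivre.com.br"]
    start_urls = ["https://lista.mercadolivre.com.br/tenis-corrida-masculino"]

    def parse(self, response):
        # The parsing and pagination logic is added later in the walkthrough.
        pass
```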
To ensure the site accepts your requests, set your user agent in the project's `settings.py`:
USER_AGENT = 'your_user_agent'
You can find your user agent by searching "my user agent" online.
Disable the `robots.txt` rule to allow scraping multiple pages:
ROBOTSTXT_OBEY = False
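For reference, the two settings together in `settings.py` look like this; the user agent string below is only a placeholder, so replace it with your own:

```python
# settings.py (excerpt)

# Identify requests with a browser-like user agent (placeholder value; use yours).
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

# Ignore robots.txt so the spider can crawl the paginated listing.
ROBOTSTXT_OBEY = False
```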
Scrapy allows testing commands directly in the terminal:
scrapy shell
For example:
fetch('https://lista.mercadolivre.com.br/tenis-corrida-masculino')
This will fetch the HTML content of the page.
Some useful commands for parsing:
- To capture the HTML block containing relevant product information:
response.css('div.ui-search-result__content')
- To count the number of items within the block:
len(response.css('div.ui-search-result__content'))
- Assign the content block to a `products` variable for further parsing:
products = response.css('div.ui-search-result__content')
- Extract brand names from the content (see the note on `.getall()` after this list):
products.css('.ui-search-item__brand-discoverability.ui-search-item__group__element::text').get()
- To exit the shell:
exit()
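Note that `.get()` returns only the first match across the selected blocks; to list the brand of every product on the page, `.getall()` can be used with the same selector:

```python
# Inside the Scrapy shell: one brand string per product found on the page.
products.css('.ui-search-item__brand-discoverability.ui-search-item__group__element::text').getall()
```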
After performing terminal tests, proceed with writing the scraping logic within the `parse` method in the `mercadolivre.py` spider.
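A minimal sketch of the finished spider is shown below. The brand selector comes from the shell session above; the title, price, review, and pagination selectors are assumptions about Mercado Livre's current markup and may need adjusting against the live page:

```python
import scrapy


class MercadolivreSpider(scrapy.Spider):
    name = "mercadolivre"
    allowed_domains = ["lista.mercadolivre.com.br"]
    start_urls = ["https://lista.mercadolivre.com.br/tenis-corrida-masculino"]
    page_count = 1
    max_pages = 20  # the analysis covers the first 20 result pages

    def parse(self, response):
        products = response.css('div.ui-search-result__content')

        for product in products:
            yield {
                'brand': product.css(
                    '.ui-search-item__brand-discoverability'
                    '.ui-search-item__group__element::text').get(),
                # Hypothetical selectors: title, price (split into reais and cents) and reviews.
                'name': product.css('h2.ui-search-item__title::text').get(),
                'price_reais': product.css('span.andes-money-amount__fraction::text').get(),
                'price_cents': product.css('span.andes-money-amount__cents::text').get(),
                'reviews_rating': product.css('span.ui-search-reviews__rating-number::text').get(),
                'reviews_amount': product.css('span.ui-search-reviews__amount::text').get(),
            }

        # Follow the pagination link until the page limit is reached (hypothetical selector).
        if self.page_count < self.max_pages:
            next_page = response.css('li.andes-pagination__button--next a::attr(href)').get()
            if next_page:
                self.page_count += 1
                yield response.follow(next_page, callback=self.parse)
```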
Execute the spider to gather data and save the results in a JSON Lines file:
scrapy crawl mercadolivre -o .data\data.jsonl
Use the `main.py` script located in the `transformation` directory to handle data transformation and loading.
First, read the `data.jsonl` file generated by Scrapy using `pandas`.
The transformation process includes handling missing values (converting `NULL` to `0`), adjusting data types, and combining price values (e.g., combining reais and cents into a single price column).
After cleaning the data, establish a connection to the SQLite3 database (`quotes.db`) and load the processed data into a table (`mercadolivre_itens`). SQLite operates directly on your machine, which simplifies the setup since no external server is needed.
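A minimal sketch of `transformation/main.py`, assuming the field names yielded by the spider sketch above and the file paths shown below (adjust both to your project layout):

```python
import sqlite3
from datetime import datetime

import pandas as pd

# Assumed paths: point these at the crawl output and the desired database location.
JSONL_PATH = 'data.jsonl'
DB_PATH = 'quotes.db'

# Read the raw data produced by Scrapy (one JSON object per line).
df = pd.read_json(JSONL_PATH, lines=True)

# Record when this batch was processed (optional metadata column).
df['_collection_date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# Reviews may be missing on some listings: convert NULL to 0 and make the columns numeric.
# (On the page the review count is usually wrapped in parentheses, e.g. "(123)".)
df['reviews_rating'] = pd.to_numeric(df['reviews_rating'], errors='coerce').fillna(0)
df['reviews_amount'] = pd.to_numeric(
    df['reviews_amount'].astype(str).str.strip('()'), errors='coerce').fillna(0)

# Adjust data types and combine reais and cents into a single price column;
# the thousands separator ('.') is stripped from the reais part first.
reais = pd.to_numeric(
    df['price_reais'].astype(str).str.replace('.', '', regex=False),
    errors='coerce').fillna(0)
cents = pd.to_numeric(df['price_cents'], errors='coerce').fillna(0)
df['price'] = reais + cents / 100

# Load the cleaned data into the local SQLite database.
conn = sqlite3.connect(DB_PATH)
df.to_sql('mercadolivre_itens', conn, if_exists='replace', index=False)
conn.close()
```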
Run the transformation and loading steps with:
python transformation/main.py
We'll develop the dashboard in `app.py`. First, connect to the SQLite database to retrieve the scraped and transformed data.
Next, design the layout of the Streamlit dashboard. The dashboard includes 3 Key Performance Indicators (KPIs) to summarize important insights for the client:
- Total number of products;
- Count of unique brands;
- Average price (new price).
Additionally, charts and tables are created to answer key business questions (see the `app.py` sketch after this list):
- What brands dominate the listings across multiple pages?
- What is the average price by brand?
- How does customer satisfaction vary across brands?
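A minimal sketch of `dashboard/app.py`, assuming the `brand`, `price`, and `reviews_rating` columns produced by the transformation sketch above and the `quotes.db` path used there:

```python
import sqlite3

import pandas as pd
import streamlit as st

# Load the transformed data from the local SQLite database (path is an assumption).
conn = sqlite3.connect('quotes.db')
df = pd.read_sql_query('SELECT * FROM mercadolivre_itens', conn)
conn.close()

st.title('Market Research: Running Shoes on Mercado Livre')

# KPIs: total products, unique brands, average (new) price.
col1, col2, col3 = st.columns(3)
col1.metric('Total products', len(df))
col2.metric('Unique brands', df['brand'].nunique())
col3.metric('Average price (R$)', f"{df['price'].mean():.2f}")

# Which brands dominate the listings?
st.subheader('Most frequent brands')
st.bar_chart(df['brand'].value_counts().head(10))

# What is the average price by brand?
st.subheader('Average price by brand')
st.bar_chart(df.groupby('brand')['price'].mean().sort_values(ascending=False))

# How does customer satisfaction (mean review rating) vary across brands?
st.subheader('Average review rating by brand')
st.bar_chart(df.groupby('brand')['reviews_rating'].mean().sort_values(ascending=False))
```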
Run the Streamlit app to view the dashboard:
streamlit run dashboard/app.py
All commands must be run from within the `src` folder:
- Data extraction: `scrapy crawl mercadolivre -o data.jsonl`
- Data transformation and loading: `python transformation/main.py`
- Dashboard visualization: `streamlit run dashboard/app.py`
This project demonstrates a complete workflow from data extraction to transformation, loading, and visualization. It uses open-source tools and Python to carry out the entire ETL process and create an insightful dashboard.