SmartQuake is a research project aimed at predicting earthquakes on a global scale using machine learning. It compiles data from 14 global datasets into a single robust dataset that can be leveraged for future earthquake-prediction research.
Access the dataset at various stages of acquisition:

- Visit the `data_scraping/` folder
- Visit the `data_processing/` folder
- Visit the `data_merging/` folder
This step involves scraping earthquake data from various sources, including text files, web pages, and PDFs. It utilizes BeautifulSoup for web scraping, Pandas for data manipulation, and Tabula for PDF data extraction.
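As an illustration of the web-scraping path, the sketch below parses an HTML table of events with BeautifulSoup and loads it into a pandas DataFrame. The HTML snippet and column names are made up for the example; a real run would fetch the page body from the catalog URL instead.

```python
# Minimal sketch of scraping a tabular catalog page with BeautifulSoup
# and pandas. The inline HTML stands in for a downloaded page.
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>Date</th><th>Magnitude</th><th>Location</th></tr>
  <tr><td>2021-03-04</td><td>6.3</td><td>Kermadec Islands</td></tr>
  <tr><td>2021-03-05</td><td>7.4</td><td>Kermadec Islands</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One list per table row, with cell text stripped of whitespace.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
df = pd.DataFrame(rows[1:], columns=rows[0])
df["Magnitude"] = df["Magnitude"].astype(float)
```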
- From Google Drive, download the raw datasets and place them under the `dataset/data_scraping/.../raw` folder.
- Install the required dependencies: `pip install -r requirements.txt`
- Run the scraping script: `python dataset/main.py`

The scraped datasets will be saved under `dataset/data_scraping/.../clean`.
- Initialization: Create an instance of the `Scraper` class with parameters:
  - `input_path`: Path to the input file (for text and PDF sources).
  - `output_path`: Path to save the output CSV.
  - `url`: Webpage URL to scrape (for web sources).
  - `start_time` and `end_time`: Date range for filtering data.
  - `header`: Column names for the output CSV.
  - `separator`: Character separating data in text files (default is space).
- Scraping:
  - `find_quakes_txt(num_skips=0)`: For text files; `num_skips` skips initial lines.
  - `find_quakes_web()`: For web pages. Scrapes data based on the body tag and predefined header.
- Example:

  ```python
  scraper = Scraper(
      input_path='input.txt',
      output_path='output.csv',
      url='http://example.com',
      header=['Date', 'Magnitude', 'Location'],
  )
  scraper.find_quakes_txt(num_skips=1)
  scraper.find_quakes_web()
  ```
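To make the text-file path concrete, here is a simplified, hypothetical stand-in for what `find_quakes_txt` does with a whitespace-separated catalog. The `Scraper` class itself lives in the repo; the function name, sample data, and column names below are illustrative only.

```python
# Hypothetical, simplified stand-in for the text-file scraping path:
# read a whitespace-separated catalog, skip initial lines, keep the
# given header. Input format and columns are illustrative.
import io
import pandas as pd

raw = """# Demo catalog (this comment line is skipped)
2020-01-01 5.1 34.05 -118.25
2020-06-15 6.0 35.68 139.69
"""

def find_quakes_txt(text, header, num_skips=0, separator=r"\s+"):
    """Parse `text`, skipping `num_skips` initial lines, into a DataFrame."""
    return pd.read_csv(io.StringIO(text), sep=separator, skiprows=num_skips,
                       names=header, engine="python")

df = find_quakes_txt(raw, header=["Date", "Magnitude", "Latitude", "Longitude"],
                     num_skips=1)
df.to_csv("output.csv", index=False)
```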
Data processing is the second step in the SmartQuake data pipeline. After scraping, the datasets are compiled into a standardized format in which all datasets share the same columns.
Processed earthquake CSV files will contain the following columns:
- Timestamp: Stored as a `pd.Timestamp` string in UTC (format: `YYYY-MM-DD HH:MM:SS.millisecond+00:00`).
- Magnitude: Moment magnitude (Mw).
- Latitude: Range within [-90, 90].
- Longitude: Range within [-180, 180].
- Depth: Depth in kilometers (optional; may be `None` for older records).
All datasets are sorted chronologically and contain no duplicates.
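The column contract above can be checked programmatically. A minimal sketch, using two made-up sample rows rather than real catalog data:

```python
# Illustrative sanity checks for a processed catalog, following the
# column contract above. The two sample rows are fabricated.
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["1906-04-18 13:12:00", "2011-03-11 05:46:24"], utc=True),
    "Magnitude": [7.9, 9.1],         # moment magnitude (Mw)
    "Latitude": [37.75, 38.30],      # must lie in [-90, 90]
    "Longitude": [-122.55, 142.37],  # must lie in [-180, 180]
    "Depth": [None, 29.0],           # km; None allowed for older records
})

assert df["Latitude"].between(-90, 90).all()
assert df["Longitude"].between(-180, 180).all()
assert df["Timestamp"].is_monotonic_increasing  # chronological order
assert not df.duplicated().any()                # no duplicate rows
```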
- `data_processor.py`: Contains the `DataProcessor` class for standardized processing.
- `run_processor.py`: Runs the `DataProcessor` on all scraped datasets.
- `processed/`: Folder containing the processed output datasets (CSVs).
- Ensure that all `clean` datasets exist in the `dataset/data_scraping/.../clean` folder.
- Verify that the `processed/` folder exists in `data_processing/`.
- Run `run_processor.py`: `python data_processing/run_processor.py`
- After completion, check for the processed CSVs in the `processed/` folder before proceeding to the merging step.
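The driver's overall shape can be sketched as follows. This is a hypothetical outline, not the repo's code: the real standardization lives in the `DataProcessor` class, and `standardize` below is a placeholder that only deduplicates and sorts.

```python
# Hypothetical sketch of the processing driver: walk every scraped
# `clean` CSV and write a standardized copy under `processed/`.
import tempfile
from pathlib import Path
import pandas as pd

def standardize(df):
    """Placeholder for DataProcessor: deduplicate and sort chronologically."""
    return df.drop_duplicates().sort_values("Timestamp").reset_index(drop=True)

def run(clean_root, processed_dir):
    processed_dir = Path(processed_dir)
    processed_dir.mkdir(parents=True, exist_ok=True)
    for csv_path in Path(clean_root).glob("*/clean/*.csv"):
        standardize(pd.read_csv(csv_path)).to_csv(
            processed_dir / csv_path.name, index=False)

# Demo on a throwaway tree mirroring data_scraping/<source>/clean/.
root = Path(tempfile.mkdtemp())
clean = root / "demo" / "clean"
clean.mkdir(parents=True)
pd.DataFrame({"Timestamp": ["2020-02-01", "2020-01-01", "2020-01-01"],
              "Magnitude": [5.0, 4.2, 4.2]}).to_csv(clean / "demo.csv",
                                                    index=False)
run(root, root / "processed")
result = pd.read_csv(root / "processed" / "demo.csv")
```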
The merging process combines all processed datasets into a single file for machine learning model input. This step preserves the same columns and ensures chronological order without duplicates.
- `helper.py`: Contains helper functions for merging.
- `merge.py`: Merges non-USGS/SAGE datasets into `Various-Catalogs.csv`.
- `usgs_pre_1950/`: Folder containing scripts for USGS data processing and merging.
- `final/`: Folder containing `usgs_sage_various_merge.py`, which merges all datasets into `Completed-Merge.csv`.
- Compile Processed Datasets: Ensure all processed datasets are in `data_processing/processed/` (excluding USGS/SAGE datasets).
- First Merge: Run `merge.py` to create `Various-Catalogs.csv` and move it to the `data_merging/final` folder.
- USGS Data Processing: Visit the Google Drive and directly download `USGS_SAGE_Merged.csv`. Store the file in `data_merging/final/` for the next step.
- Final Merge: Run `usgs_sage_various_merge.py` to merge all datasets into `Completed-Merge.csv`.
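The merge step boils down to concatenating the processed catalogs, dropping duplicate events, and restoring chronological order. A minimal sketch with fabricated data; the real logic lives in `merge.py` and `usgs_sage_various_merge.py`, and the duplicate-key columns chosen here are an assumption:

```python
# Illustrative merge: concatenate catalogs, drop duplicate events
# (keyed here on Timestamp/Latitude/Longitude, an assumption), and
# sort chronologically.
import pandas as pd

def merge_catalogs(frames):
    merged = pd.concat(frames, ignore_index=True)
    merged = merged.drop_duplicates(subset=["Timestamp", "Latitude", "Longitude"])
    return merged.sort_values("Timestamp").reset_index(drop=True)

a = pd.DataFrame({"Timestamp": ["2000-01-02", "2000-01-01"],
                  "Latitude": [10.0, 20.0], "Longitude": [30.0, 40.0],
                  "Magnitude": [5.5, 6.1]})
b = pd.DataFrame({"Timestamp": ["2000-01-01", "2000-01-03"],
                  "Latitude": [20.0, 25.0], "Longitude": [40.0, 45.0],
                  "Magnitude": [6.1, 4.9]})  # first row duplicates a's second

merged = merge_catalogs([a, b])
# merged.to_csv("Completed-Merge.csv", index=False)
```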
| Dataset | Status | Link | Additional Comments |
|---|---|---|---|
| Argentina | good | Link | Downloaded manually |
| Canada | good | Link | Downloaded manually |
| Japan | good | Link | Downloaded manually |
| GHEA | good | Link | Downloaded manually |
| NOAA | good | Link | Downloaded manually |
| SoCal | good | Link | Downloaded manually |
| Turkey | good | Link | Downloaded manually |
| World Tremor | good | Link | Downloaded manually |
| East Africa | good | Link | Downloaded manually |
| Intensity | good | Link | Downloaded manually |
| PNW Tremor | good | Link | Downloaded manually |
| South Asia | good | Link | Downloaded manually |
| Texas | good | Link | Downloaded manually |
| USGS | good | Link | Downloaded through the Python scraper; takes a long time to finish |
| SAGE | deprecated | Link | The official webpage advises using USGS instead |