SmartQuake is a research project aimed at predicting earthquakes on a global scale using machine learning. It compiles data from 14 global datasets into a single robust dataset that can be leveraged for future earthquake-prediction research.
Access the dataset at various stages of acquisition:

- Visit the `data_scraping/` folder
- Visit the `data_processing/` folder
- Visit the `data_merging/` folder
This step involves scraping earthquake data from various sources, including text files, web pages, and PDFs. It utilizes BeautifulSoup for web scraping, Pandas for data manipulation, and Tabula for PDF data extraction.
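As an illustration of the web-scraping path, the sketch below parses an HTML table of events with BeautifulSoup and loads it into a pandas DataFrame. The HTML snippet and column names are made up for the example; a real run would fetch the page body from the catalog URL instead.

```python
# Minimal sketch of scraping a tabular catalog page with BeautifulSoup
# and pandas. The inline HTML stands in for a downloaded page.
from bs4 import BeautifulSoup
import pandas as pd

html = """
<table>
  <tr><th>Date</th><th>Magnitude</th><th>Location</th></tr>
  <tr><td>2021-03-04</td><td>6.3</td><td>Kermadec Islands</td></tr>
  <tr><td>2021-03-05</td><td>7.4</td><td>Kermadec Islands</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# One list per table row, with cell text stripped of whitespace.
rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
df = pd.DataFrame(rows[1:], columns=rows[0])
df["Magnitude"] = df["Magnitude"].astype(float)
```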
- From Google Drive, download the raw datasets and place them under the `dataset/data_scraping/.../raw` folder.
- Install the required dependencies: `pip install -r requirements.txt`
- Run the scraping script: `python dataset/main.py`

The scraped datasets will be saved under `dataset/data_scraping/.../clean`.
- Initialization: Create an instance of the `Scraper` class with parameters:
  - `input_path`: Path to the input file (for text and PDF sources).
  - `output_path`: Path to save the output CSV.
  - `url`: Webpage URL to scrape (for web sources).
  - `start_time` and `end_time`: Date range for filtering data.
  - `header`: Column names for the output CSV.
  - `separator`: Character separating data in text files (default is space).
- Scraping:
  - `find_quakes_txt(num_skips=0)`: For text files; `num_skips` skips initial lines.
  - `find_quakes_web()`: For web pages. Scrapes data based on the body tag and predefined header.
- Example:

  ```python
  scraper = Scraper(
      input_path='input.txt',
      output_path='output.csv',
      url='http://example.com',
      header=['Date', 'Magnitude', 'Location'],
  )
  scraper.find_quakes_txt(num_skips=1)
  scraper.find_quakes_web()
  ```
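To make the text-file path concrete, here is a simplified, hypothetical stand-in for what `find_quakes_txt` does with a whitespace-separated catalog. The `Scraper` class itself lives in the repo; the function name, sample data, and column names below are illustrative only.

```python
# Hypothetical, simplified stand-in for the text-file scraping path:
# read a whitespace-separated catalog, skip initial lines, keep the
# given header. Input format and columns are illustrative.
import io
import pandas as pd

raw = """# Demo catalog (this comment line is skipped)
2020-01-01 5.1 34.05 -118.25
2020-06-15 6.0 35.68 139.69
"""

def find_quakes_txt(text, header, num_skips=0, separator=r"\s+"):
    """Parse `text`, skipping `num_skips` initial lines, into a DataFrame."""
    return pd.read_csv(io.StringIO(text), sep=separator, skiprows=num_skips,
                       names=header, engine="python")

df = find_quakes_txt(raw, header=["Date", "Magnitude", "Latitude", "Longitude"],
                     num_skips=1)
df.to_csv("output.csv", index=False)
```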
Data processing is the second step in the SmartQuake data pipeline. After scraping, the datasets are compiled into a standardized format in which all datasets share the same columns.
Processed earthquake CSV files will contain the following columns:
- Timestamp: Stored as a `pd.Timestamp` string in UTC (format: `YYYY-MM-DD HH:MM:SS.millisecond+00:00`).
- Magnitude: Moment magnitude (Mw).
- Latitude: Range within [-90, 90].
- Longitude: Range within [-180, 180].
- Depth: Depth in kilometers (optional; may be `None` for older records).
All datasets are sorted chronologically and contain no duplicates.
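The column contract above can be checked programmatically. A minimal sketch, using two made-up sample rows rather than real catalog data:

```python
# Illustrative sanity checks for a processed catalog, following the
# column contract above. The two sample rows are fabricated.
import pandas as pd

df = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["1906-04-18 13:12:00", "2011-03-11 05:46:24"], utc=True),
    "Magnitude": [7.9, 9.1],         # moment magnitude (Mw)
    "Latitude": [37.75, 38.30],      # must lie in [-90, 90]
    "Longitude": [-122.55, 142.37],  # must lie in [-180, 180]
    "Depth": [None, 29.0],           # km; None allowed for older records
})

assert df["Latitude"].between(-90, 90).all()
assert df["Longitude"].between(-180, 180).all()
assert df["Timestamp"].is_monotonic_increasing  # chronological order
assert not df.duplicated().any()                # no duplicate rows
```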
- `data_processor.py`: Contains the `DataProcessor` class for standardized processing.
- `run_processor.py`: Runs the `DataProcessor` on all scraped datasets.
- `processed/`: Folder containing the processed output datasets (CSVs).
- Ensure that all `clean` datasets exist in the `dataset/data_scraping/.../clean` folder.
- Verify that the `processed/` folder exists in `data_processing/`.
- Run `run_processor.py`: `python data_processing/run_processor.py`
- After completion, check for the processed CSVs in the `processed/` folder before proceeding to the merging step.
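The driver's overall shape can be sketched as follows. This is a hypothetical outline, not the repo's code: the real standardization lives in the `DataProcessor` class, and `standardize` below is a placeholder that only deduplicates and sorts.

```python
# Hypothetical sketch of the processing driver: walk every scraped
# `clean` CSV and write a standardized copy under `processed/`.
import tempfile
from pathlib import Path
import pandas as pd

def standardize(df):
    """Placeholder for DataProcessor: deduplicate and sort chronologically."""
    return df.drop_duplicates().sort_values("Timestamp").reset_index(drop=True)

def run(clean_root, processed_dir):
    processed_dir = Path(processed_dir)
    processed_dir.mkdir(parents=True, exist_ok=True)
    for csv_path in Path(clean_root).glob("*/clean/*.csv"):
        standardize(pd.read_csv(csv_path)).to_csv(
            processed_dir / csv_path.name, index=False)

# Demo on a throwaway tree mirroring data_scraping/<source>/clean/.
root = Path(tempfile.mkdtemp())
clean = root / "demo" / "clean"
clean.mkdir(parents=True)
pd.DataFrame({"Timestamp": ["2020-02-01", "2020-01-01", "2020-01-01"],
              "Magnitude": [5.0, 4.2, 4.2]}).to_csv(clean / "demo.csv",
                                                    index=False)
run(root, root / "processed")
result = pd.read_csv(root / "processed" / "demo.csv")
```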
The merging process combines all processed datasets into a single file for machine learning model input. This step preserves the same columns and ensures chronological order without duplicates.
- `helper.py`: Contains helper functions for merging.
- `merge.py`: Merges non-USGS/SAGE datasets into `Various-Catalogs.csv`.
- `usgs_pre_1950/`: Folder containing scripts for USGS data processing and merging.
- `final/`: Folder containing `usgs_sage_various_merge.py`, which merges all datasets into `Completed-Merge.csv`.
- Compile Processed Datasets: Ensure all processed datasets are in `data_processing/processed/` (excluding USGS/SAGE datasets).
- First Merge: Run `merge.py` to create `Various-Catalogs.csv` and move it to the `data_merging/final` folder.
- USGS Data Processing: Visit the Google Drive and directly download `USGS_SAGE_Merged.csv`. Store the file in `data_merging/final/` for the next step.
- Final Merge: Run `usgs_sage_various_merge.py` to merge all datasets into `Completed-Merge.csv`.
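The merge step boils down to concatenating the processed catalogs, dropping duplicate events, and restoring chronological order. A minimal sketch with fabricated data; the real logic lives in `merge.py` and `usgs_sage_various_merge.py`, and the duplicate-key columns chosen here are an assumption:

```python
# Illustrative merge: concatenate catalogs, drop duplicate events
# (keyed here on Timestamp/Latitude/Longitude, an assumption), and
# sort chronologically.
import pandas as pd

def merge_catalogs(frames):
    merged = pd.concat(frames, ignore_index=True)
    merged = merged.drop_duplicates(subset=["Timestamp", "Latitude", "Longitude"])
    return merged.sort_values("Timestamp").reset_index(drop=True)

a = pd.DataFrame({"Timestamp": ["2000-01-02", "2000-01-01"],
                  "Latitude": [10.0, 20.0], "Longitude": [30.0, 40.0],
                  "Magnitude": [5.5, 6.1]})
b = pd.DataFrame({"Timestamp": ["2000-01-01", "2000-01-03"],
                  "Latitude": [20.0, 25.0], "Longitude": [40.0, 45.0],
                  "Magnitude": [6.1, 4.9]})  # first row duplicates a's second

merged = merge_catalogs([a, b])
# merged.to_csv("Completed-Merge.csv", index=False)
```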
| Dataset | Status | Link | Additional Comments |
|---|---|---|---|
| Argentina | good | Link | Downloaded manually |
| Canada | good | Link | Downloaded manually |
| Japan | good | Link | Downloaded manually |
| GHEA | good | Link | Downloaded manually |
| NOAA | good | Link | Downloaded manually |
| SoCal | good | Link | Downloaded manually |
| Turkey | good | Link | Downloaded manually |
| World Tremor | good | Link | Downloaded manually |
| East Africa | good | Link | Downloaded manually |
| Intensity | good | Link | Downloaded manually |
| PNW Tremor | good | Link | Downloaded manually |
| South Asia | good | Link | Downloaded manually |
| Texas | good | Link | Downloaded manually |
| USGS | good | Link | Downloaded through the Python scraper; takes a long time to finish |
| SAGE | deprecated | Link | The official webpage advises using USGS instead |