[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7950794.svg)](https://doi.org/10.5281/zenodo.7950794)
Replication package for paper "Detecting Edgeworth Cycles" by Timothy Holt, Mitsuru Igami, and Simon Scheidegger (2023). We hope that this repository can also serve as a "sandbox" for researchers interested in studying gasoline price data in more detail by providing various tools to assist in the data analysis process.
The scripts in this repository allow the user to:
- Test the various parametric, random forest, and LSTM models that were detailed in the paper.
- Plot random samples of the labeled data.
- Use the non-parametric machine learning models to classify external data.
- Efficiently parse the Tankerkoenig German retail gasoline price data into easily usable data structures including CSV.
Please cite "Detecting Edgeworth Cycles" in your publications if this resource helps your research or teaching:
Paper Citation:
```bibtex
@article{holt2024detecting,
  title   = {Detecting Edgeworth Cycles},
  author  = {Holt, Timothy and Igami, Mitsuru and Scheidegger, Simon},
  journal = {The Journal of Law and Economics},
  year    = {2024},
  doi     = {10.1086/726224}
}
```
Electronic Resource and Data Citation:
```bibtex
@dataset{holt2023data,
  author    = {Holt, Timothy and Igami, Mitsuru and Scheidegger, Simon},
  title     = {Replication Package for: Detecting Edgeworth Cycles},
  month     = nov,
  year      = {2023},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.10126406},
  url       = {https://doi.org/10.5281/zenodo.10126406}
}
```
To run the code, you must first download (clone) this repository:
- Open a terminal and navigate to your desired directory using `cd directory_path`.
- Clone the repository using:
git clone https://github.com/tabholt/detecting_edgeworth_cycles.git
Note: This command may not work on Windows unless you have previously installed Git. In that case, either install Git or download this repository from the GitHub webpage and unzip it in your desired directory.
- Navigate into the repository using `cd detecting_edgeworth_cycles`.
Note: You may name this directory anything you like; if you do, update the name after the `cd` command accordingly.
- `plot_sample.py`: plot a random sample of data from a given region
- `run_parametric_models.py`: train and test parametric models
- `run_rf_model.py`: train and test the random forest model
- `run_lstm_model.py`: train and test LSTM models
- `nonparametric_classify_external_data.py`: use pre-trained LSTM or random forest models to classify external data sets
- `de_rawdata_parse_postal_region.py`: efficiently parse raw Tankerkoenig data into convenient data structures
- `convert_price_window_json_csv.py`: convert price window data structure files from JSON to CSV format and vice versa
- `Estimation_Framework.py`: main code defining the parametric models and feature interfaces
- `RF_Framework.py`: main code defining the random forest models
- `LSTM_Framework.py`: main code defining the LSTM models
- `Label_Class.py`: defines the data structure underpinning single observations, as well as collections of observations
- `Model_Loader.py`: a convenient interface for loading data and splitting it into training and test sets
- `model_settings.py`: dictionaries and lists of important default parameters that make the framework function
- `sta_info_functions.py`: functions to parse station info files (excluding prices)
- `price_db_functions.py`: functions to parse station price files
- `observation_windows_functions.py`: functions to create station window observations of quarterly average daily prices
- `Station_Class.py`: data structure representing a single gas station with all relevant data
- `Timer_Utility.py`: convenient utility for recording timing and performance in Python code
- `parameters.py`: set of default parameters for parsing the German raw data
- `german_label_db.json`: data from Germany
- `nsw_label_db.json`: data from New South Wales
- `wa_label_db.json`: data from Western Australia
Note: The data files must be unzipped after downloading the repository (see the Running the Scripts section for details).
- tankerkoenig German raw data (updated daily):
git clone https://[email protected]/tankerkoenig/tankerkoenig-data/_git/tankerkoenig-data
- Detrended labeled data for all DE price windows from Q4-2014 to Q4-2020 (inclusive):
cd label_databases
curl https://drive.switch.ch/index.php/s/pwq1Sw0RssyDuUC/download --output ALL_detrended_price_windows.json
- Only about 10 percent of observations contain human labels, coded as {1: cycling, 0.5: maybe cycling, 0: not cycling}.
Note: These commands may not work on Windows. In that case, either install Git and/or curl, or download the data from the following links: tankerkoenig, detrended_price_windows, and then unzip them in your desired directory.
This code requires Python 3.8 or later, with the following packages and their associated dependencies:
- matplotlib (3.7.1)
- numpy (1.24.3)
- pandas (1.5.3)
- python (3.10.11)
- scikit-learn (1.2.2)
- scipy (1.10.1)
- seaborn (0.12.2)
- tensorflow (2.10.0)
Note: The code should be broadly compatible with recent versions of the above packages. Specific version numbers are included only for long-term replicability purposes.
The easiest way to create an environment to run the code is with Miniconda or Anaconda. Miniconda is recommended, since it is the simplest and lightest installation, but the following setup instructions work for both. Miniconda is installed from the terminal, while Anaconda provides a graphical installer.
- Test your installation using `conda list`.
- Create a new environment (set of installed packages) from the provided environment YML file using:
conda env create -f replication_conda_environment.yml
Note: you will need to be in the repository's main folder for this to work, since the command requires the `replication_conda_environment.yml` file located there.
- Activate the new environment using:
conda activate edgeworth_replication_env
- (Optional) Verify that the installation was successful by running `conda list` and checking that the packages noted in the requirements section are listed.

Note: You will need to run `conda activate edgeworth_replication_env` every time you open a new terminal session.
Each script requires some arguments to be passed at run time to define your chosen parameters, such as the region or the type of model to run. These arguments are denoted arg1, arg2, arg3. To run a script, use:
python script_name.py arg1 arg2 arg3
replacing `script_name.py` with the name of your script and arg1, arg2, arg3 with your chosen parameter values.
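For example, the following command (explained in the parametric models section below) trains and tests Method 1 (PRNR) on the German data:
python run_parametric_models.py de PRNR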
Note: You must extract the JSON files from the zip file `detecting_edgeworth_cycles/label_databases/label_data_files.zip` before you can run the scripts.
- Use your favorite unzipping program to extract the contents of the zipped file.
- Ensure that the files `german_label_db.json`, `nsw_label_db.json`, and `wa_label_db.json` are in the directory `detecting_edgeworth_cycles/label_databases/`.
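Once extracted, these label databases are ordinary JSON files and can be inspected directly in Python. The minimal sketch below assumes only that the file is valid JSON; see `Label_Class.py` for the data structure the framework builds on top of it.

```python
import json

# Load the German label database after unzipping (path per the step above).
with open('label_databases/german_label_db.json') as f:
    db = json.load(f)

# Peek at the top-level container without assuming its internal layout.
print(type(db))
print(len(db))
```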
To plot a random sample of labeled data, run the script `plot_sample.py` with:
- arg1 = region in {wa, nsw, de}
- arg2 = $n$, the number of random observations to plot (positive integer)
Note: You will need to close the plot that pops up in order to see the subsequent plot.
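For example, to plot five random observations from New South Wales:
python plot_sample.py nsw 5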
To train and test parametric models, run the script `run_parametric_models.py` with:
- arg1 = region in {wa, nsw, de}
- arg2 = method in {PRNR, MIMD, NMC, MBPI, FT0, FT1, FT2, LS0, LS1, LS2, CS0, CS1, WAVY, all} - if 'all' is passed, a model will be built, trained, and tested for each method.
Once the model has run, results will be printed to the terminal and saved in a CSV log file called `parametric_model_log.csv`. Running multiple models will append new lines to this log file.
Advanced: To save the optimal parameter values once the model has been trained, set the variable `save_model = True` in the parameters section of the Python script. This is necessary if you later want to use the optimal parameters to classify external data (see below).
Correspondences with Methods in Paper
Shortcut | Description |
---|---|
PRNR | Method 1: Positive Runs vs. Negative Runs |
MIMD | Method 2: Mean Increase vs. Mean Decrease |
NMC | Method 3: Negative Median Change |
MBPI | Method 4: Many Big Price Increases |
FT0 | Method 5: Fourier Transform (maximum value) |
FT1 | alternate Fourier Transform (tallest peak) |
FT2 | alternate Fourier Transform (Herfindahl–Hirschman Index) |
LS0 | Method 6: Lomb-Scargle Periodogram (maximum value) |
LS1 | alternate Lomb-Scargle Periodogram (tallest peak) |
LS2 | alternate Lomb-Scargle Periodogram (Herfindahl–Hirschman Index) |
CS0 | Method 7: Cubic Splines (number of roots) |
CS1 | alternate Cubic Splines (integral value) |
WAVY | number of times detrended price crosses its mean |
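As a concrete illustration of the simplest method in this family, the sketch below computes a WAVY-style statistic exactly as described in the table: detrend a price series and count how many times it crosses its own mean. This is an illustrative reimplementation, not the repository's code; see `Estimation_Framework.py` for the actual model definitions.

```python
import numpy as np

def wavy_statistic(prices):
    """Count how many times a linearly detrended price series crosses its mean.

    Illustrative reimplementation of the WAVY measure described above;
    see Estimation_Framework.py for the repository's actual definition.
    """
    prices = np.asarray(prices, dtype=float)
    t = np.arange(len(prices))
    slope, intercept = np.polyfit(t, prices, deg=1)  # least-squares linear trend
    detrended = prices - (slope * t + intercept)
    deviations = detrended - detrended.mean()
    # A crossing occurs wherever consecutive deviations change sign.
    return int(np.sum(np.sign(deviations[:-1]) != np.sign(deviations[1:])))

# A sawtooth-like (cycling) series versus a smooth U-shaped one:
cycling = [100, 95, 110, 96, 92, 108, 97, 93, 109, 98]
smooth = [120, 112, 106, 102, 100, 100, 102, 106, 112, 120]
# The cycling series crosses its mean many more times (7 vs. 2).
print(wavy_statistic(cycling), wavy_statistic(smooth))
```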
To train and test the random forest model, run the script `run_rf_model.py` with:
- arg1 = region in {wa, nsw, de}
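For example:
python run_rf_model.py wa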
Once the model has run, results will be printed to the terminal and saved in a CSV log file called `random_forest_model_log.csv`. Running the script multiple times will append new lines to this log file.
Advanced: To save a model once it has been trained, set the variable `save_model = True` in the parameters section of the Python script.
To train and test LSTM models, run the script `run_lstm_model.py` with:
- arg1 = region in {wa, nsw, de}
- arg2 = number of training epochs (positive integer)
- arg3 = ensemble model boolean in {0, 1}
A training epoch is a single pass through the data set. Model fit will improve with the number of epochs until over-fitting sets in. The paper used 100 epochs; fewer than 10 is not recommended.
The ensemble model boolean indicates whether to use the ensemble LSTM model or the basic one: 0 gives the basic model, 1 gives the ensemble model.
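For example, to train the ensemble LSTM model on the German data with the 100 epochs used in the paper:
python run_lstm_model.py de 100 1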
Once the model has run, results will be printed to the terminal and saved in a CSV log file called `lstm_model_log.csv`. Running the script multiple times will append new lines to this log file.
Advanced: To save a model once it has been trained, set the variable `save_model = True` in the parameters section of the Python script.
Advanced: The results from Figure 2 (Gains from Additional Data) can be simulated by changing the variable `train_fraction` in the parameters section of any of the LSTM, RF, or parametric model scripts. This modifies the proportion of the data set used to train the models; for example, `train_fraction = 0.5` trains on half of the labeled data.
To use previously trained, optimal theta values to classify a data set contained in a JSON or CSV file:
- Ensure that the data set has the same format as the price window files, in either CSV or JSON (i.e., like `label_databases/german_label_db.json`). Not all data fields need to be present, but there must at least be a price series and a unique identifier column for the observations. For an example of the CSV format, see the output of `convert_price_window_json_csv.py`.
- Set the `external_data_path` parameter of the `parametric_classify_external_data.py` script to the file containing your data.
- Run the script `parametric_classify_external_data.py` with:
  - arg1 = region in {wa, nsw, de}
  - arg2 = method in {PRNR, MIMD, NMC, MBPI, FT0, FT1, FT2, LS0, LS1, LS2, CS0, CS1, WAVY, all}
- Classification results can be found in the specified file.
Note: The region and method arguments load the previously found optimal values of the parameter $\theta$ for that region-method combination.
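For example, to classify your external data using the optimal PRNR parameter for Germany:
python parametric_classify_external_data.py de PRNR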
To use previously trained and saved models to classify a data set contained in a JSON or CSV file:
- Ensure that the data set has the same format as the price window files, in either CSV or JSON (i.e., like `label_databases/german_label_db.json`). Not all data fields need to be present, but there must at least be a price series and a unique identifier column for the observations. For an example of the CSV format, see the output of `convert_price_window_json_csv.py`.
- Modify the basic settings in the set parameters section of the `nonparametric_classify_external_data.py` script (see the sketch after this list). You will need to insert:
  - the training set hash (see the logs) of an RF or LSTM model that you previously trained and saved
  - the path and filename of your external data file, in JSON or CSV format
  - the type of model in {'rf', 'lstm_basic', 'lstm_ensemble'}
  - the filename where you wish to save the results (either CSV or JSON extensions are accepted)
- Run the script `nonparametric_classify_external_data.py`.
- Classification results can be found in the specified file.
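The sketch below shows what the set parameters section might look like once filled in. The variable names are illustrative assumptions, not the script's actual identifiers; check `nonparametric_classify_external_data.py` for the exact names.

```python
# Illustrative values only -- the variable names here are assumptions;
# consult the set parameters section of the script for the real ones.
train_set_hash = 'abc123'                    # hash printed in the RF/LSTM log when the model was saved
external_data_path = 'my_external_data.csv'  # your data, in JSON or CSV format
model_type = 'lstm_ensemble'                 # one of {'rf', 'lstm_basic', 'lstm_ensemble'}
results_path = 'classified_results.csv'      # CSV or JSON extension accepted
```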
Note: Model performance will generally be degraded by biases or other features of the external data that were not also present in the training data. Proceed with caution when using this feature.
To efficiently parse the Tankerkoenig raw data into easy-to-use data structures:
- Download the Tankerkoenig data set using:
git clone https://[email protected]/tankerkoenig/tankerkoenig-data/_git/tankerkoenig-data
- Run the script `de_rawdata_parse_postal_region.py` with:
  - arg1 = postal region in {0...9, all}
A German postal region is the area covered by the set of postal codes starting with the specified digit; more information can be found in the Wikipedia article on postal codes in Germany. Passing the argument 'all' will parse all regions from 0 to 9.
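For example, to parse only postal region 1:
python de_rawdata_parse_postal_region.py 1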
The output of this script will be found by default in the folder `de_databases`. It contains three types of files:
- Station Info Database: This is the file that contains only the header data for the various gas stations. No price data is included in this file, to make it small and fast to load. This file is available by default in both JSON as well as serialized Python pkl format.
- Price Station Database: These files contain a serialized dictionary of station objects (similar to the station info database) that contains a list of price observations and timestamps of all of the price reports from the given station. These databases are partitioned into postal regions.
- Price Windows Files: These JSON files contain series of quarterly observations of daily average prices for each station in the given postal region. This 90-day price window data structure is the basis of the Detecting Edgeworth Cycles paper.
Note:
- Processing a region should take about 20-50 minutes, so fully processing all regions will take several hours.
- This step requires at least 8 GB of RAM, but 16 GB or more is recommended.
- The script is preconfigured to expect the folder `tankerkoenig-data` in the main directory. This can be changed via the `dirs['raw_german_data_folder']` variable in `framework/de_data_parsing/parameters.py`.
- The script will ask before overwriting data.
Price window data files produced by `de_rawdata_parse_postal_region.py` can be converted into CSV files using the script `convert_price_window_json_csv.py`. Conversion can also be performed on user-generated data, in either direction between CSV and JSON formats, provided the data conforms to the same structure as the provided base data.
To use the converter, run the script `convert_price_window_json_csv.py` with:
- arg1 = input_filename (str)
The input filename must include a path and point to either a JSON or CSV file.
Output:
- For JSON input, it converts the data to a CSV file with the same name.
- For CSV input, it converts the data to a JSON file with the same name.
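For example (recalling that the label databases share the price window format), the following command produces `label_databases/german_label_db.csv`:
python convert_price_window_json_csv.py label_databases/german_label_db.json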