[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7950794.svg)](https://doi.org/10.5281/zenodo.7950794)
Replication package for paper "Detecting Edgeworth Cycles" by Timothy Holt, Mitsuru Igami, and Simon Scheidegger (2023). We hope that this repository can also serve as a "sandbox" for researchers interested in studying gasoline price data in more detail by providing various tools to assist in the data analysis process.
The scripts in this repository allow the user to:
- Test the various parametric, random forest, and LSTM models that were detailed in the paper.
- Plot random samples of the labeled data.
- Use the non-parametric machine learning models to classify external data.
- Efficiently parse the Tankerkoenig German retail gasoline price data into easily usable data structures including CSV.
Please cite "Detecting Edgeworth Cycles" in your publications if this resource helps your research or teaching:
Paper Citation:
```bibtex
@article{holt2024detecting,
  title   = {Detecting Edgeworth Cycles},
  author  = {Holt, Timothy and Igami, Mitsuru and Scheidegger, Simon},
  journal = {The Journal of Law and Economics},
  year    = {2024},
  doi     = {10.1086/726224}
}
```
Electronic Resource and Data Citation:
```bibtex
@dataset{holt2023data,
  author    = {Holt, Timothy and Igami, Mitsuru and Scheidegger, Simon},
  title     = {Replication Package for: Detecting Edgeworth Cycles},
  month     = nov,
  year      = {2023},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.10126406},
  url       = {https://doi.org/10.5281/zenodo.10126406}
}
```
To run the code, you must first download (clone) this repository:
- Open a terminal and navigate to your desired directory using `cd directory_path`.
- Clone the repository using:
git clone https://github.com/tabholt/detecting_edgeworth_cycles.git
Note: This command may not work on Windows unless you have previously installed Git. In that case, either install Git or download this repository from the GitHub webpage and unzip it in your desired directory.
- Navigate into the repository using `cd detecting_edgeworth_cycles`.
Note: You may name this directory anything you like; if you do, update the name after the `cd` command accordingly.
- `plot_sample.py`: plot a random sample of data from a given region
- `run_parametric_models.py`: train and test parametric models
- `run_rf_model.py`: train and test the random forest model
- `run_lstm_model.py`: train and test LSTM models
- `nonparametric_classify_external_data.py`: use pre-trained LSTM or random forest models to classify external data sets
- `de_rawdata_parse_postal_region.py`: efficiently parse raw Tankerkoenig data into convenient data structures
- `convert_price_window_json_csv.py`: convert price window data structure files from JSON to CSV format and vice versa
- `Estimation_Framework.py`: main code defining the parametric models and feature interfaces
- `RF_Framework.py`: main code defining the random forest models
- `LSTM_Framework.py`: main code defining the LSTM models
- `Label_Class.py`: defines the data structure underpinning single observations, as well as collections of observations
- `Model_Loader.py`: a convenient interface for loading data and splitting it into training and test sets
- `model_settings.py`: dictionaries and lists of important default parameters that make the framework function
- `sta_info_functions.py`: functions to parse station info files (excluding prices)
- `price_db_functions.py`: functions to parse station price files
- `observation_windows_functions.py`: functions to create station window observations of quarterly average daily prices
- `Station_Class.py`: data structure representing a single gas station with all relevant data
- `Timer_Utility.py`: convenient utility for recording timing and performance in Python code
- `parameters.py`: set of default parameters for parsing the German raw data
- `german_label_db.json`: data from Germany
- `nsw_label_db.json`: data from New South Wales
- `wa_label_db.json`: data from Western Australia
Note: The data files must be unzipped after downloading the repository (see the Running the Scripts section for details).
- tankerkoenig German raw data (updated daily):
git clone https://[email protected]/tankerkoenig/tankerkoenig-data/_git/tankerkoenig-data
- Detrended labeled data for all DE price windows from Q4-2014 to Q4-2020 (inclusive):
cd label_databases
curl https://drive.switch.ch/index.php/s/pwq1Sw0RssyDuUC/download --output ALL_detrended_price_windows.json
- Only about 10 percent of observations contain human labels, coded as {1: cycling, 0.5: maybe cycling, 0: not cycling}.
Note: These commands may not work on Windows. In that case, either install Git and/or curl, or download the data from the following links: tankerkoenig, detrended_price_windows, and then unzip them in your desired directory.
This code requires Python 3.8 or later, with the following packages and their associated dependencies:
- matplotlib (3.7.1)
- numpy (1.24.3)
- pandas (1.5.3)
- python (3.10.11)
- scikit-learn (1.2.2)
- scipy (1.10.1)
- seaborn (0.12.2)
- tensorflow (2.10.0)
Note: The code should be broadly compatible with recent versions of the above packages. Specific version numbers are included only for long-term replicability purposes.
The easiest way to create an environment to run the code is with Miniconda or Anaconda. Miniconda is recommended, since it is the simplest and lightest installation, but the following setup instructions work for both. Miniconda is installed from the terminal, while Anaconda provides a graphical installer.
- Test your installation using `conda list`.
- Create a new environment (set of installed packages) from the provided environment YML file using:
conda env create -f replication_conda_environment.yml
Note: you will need to be in the repository's main folder for this to work, since the command requires the `replication_conda_environment.yml` file located there.
- Activate the new environment using:
conda activate edgeworth_replication_env
- (Optional) Verify that the installation was successful by running `conda list` and checking that the packages noted in the requirements section are listed.

Note: You will need to run `conda activate edgeworth_replication_env` every time you open a new terminal session.
Each script requires some arguments to be passed at run time to define your chosen parameters, such as the region or the type of model to run. These arguments are denoted arg1, arg2, arg3. To run a script, use:
python script_name.py arg1 arg2 arg3
replacing `script_name.py` with the name of your script and arg1, arg2, arg3 with your chosen parameter values.
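For example, the following command (explained in the parametric models section below) trains and tests Method 1 (PRNR) on the German data:
python run_parametric_models.py de PRNR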
Note: You must extract the JSON files from the zip file `detecting_edgeworth_cycles/label_databases/label_data_files.zip` before you can run the scripts.
- Use your favorite unzipping program to extract the contents of the zipped file.
- Ensure that the files `german_label_db.json`, `nsw_label_db.json`, and `wa_label_db.json` are in the directory `detecting_edgeworth_cycles/label_databases/`.
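Once extracted, these label databases are ordinary JSON files and can be inspected directly in Python. The minimal sketch below assumes only that the file is valid JSON; see `Label_Class.py` for the data structure the framework builds on top of it.

```python
import json

# Load the German label database after unzipping (path per the step above).
with open('label_databases/german_label_db.json') as f:
    db = json.load(f)

# Peek at the top-level container without assuming its internal layout.
print(type(db))
print(len(db))
```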
To plot a random sample of labeled data, run the script `plot_sample.py` with:
- arg1 = region in {wa, nsw, de}
- arg2 = $n$, the number of random observations to plot (positive integer)
Note: You will need to close the plot that pops up in order to see the subsequent plot.
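For example, to plot five random observations from New South Wales:
python plot_sample.py nsw 5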
To train and test parametric models, run the script `run_parametric_models.py` with:
- arg1 = region in {wa, nsw, de}
- arg2 = method in {PRNR, MIMD, NMC, MBPI, FT0, FT1, FT2, LS0, LS1, LS2, CS0, CS1, WAVY, all} - if 'all' is passed, a model will be built, trained, and tested for each method.
Once the model has run, results will be printed to the terminal and saved in a CSV log file called `parametric_model_log.csv`. Running multiple models will append new lines to this log file.
Advanced: To save the optimal parameter values once the model has been trained, set the variable `save_model = True` in the parameters section of the Python script. This is necessary if you later want to use the optimal parameters to classify external data (see below).
Correspondences with Methods in Paper
Shortcut | Description |
---|---|
PRNR | Method 1: Positive Runs vs. Negative Runs |
MIMD | Method 2: Mean Increase vs. Mean Decrease |
NMC | Method 3: Negative Median Change |
MBPI | Method 4: Many Big Price Increases |
FT0 | Method 5: Fourier Transform (maximum value) |
FT1 | alternate Fourier Transform (tallest peak) |
FT2 | alternate Fourier Transform (Herfindahl–Hirschman Index) |
LS0 | Method 6: Lomb-Scargle Periodogram (maximum value) |
LS1 | alternate Lomb-Scargle Periodogram (tallest peak) |
LS2 | alternate Lomb-Scargle Periodogram (Herfindahl–Hirschman Index) |
CS0 | Method 7: Cubic Splines (number of roots) |
CS1 | alternate Cubic Splines (integral value) |
WAVY | number of times detrended price crosses its mean |
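As a concrete illustration of the simplest method in this family, the sketch below computes a WAVY-style statistic exactly as described in the table: detrend a price series and count how many times it crosses its own mean. This is an illustrative reimplementation, not the repository's code; see `Estimation_Framework.py` for the actual model definitions.

```python
import numpy as np

def wavy_statistic(prices):
    """Count how many times a linearly detrended price series crosses its mean.

    Illustrative reimplementation of the WAVY measure described above;
    see Estimation_Framework.py for the repository's actual definition.
    """
    prices = np.asarray(prices, dtype=float)
    t = np.arange(len(prices))
    slope, intercept = np.polyfit(t, prices, deg=1)  # least-squares linear trend
    detrended = prices - (slope * t + intercept)
    deviations = detrended - detrended.mean()
    # A crossing occurs wherever consecutive deviations change sign.
    return int(np.sum(np.sign(deviations[:-1]) != np.sign(deviations[1:])))

# A sawtooth-like (cycling) series versus a smooth U-shaped one:
cycling = [100, 95, 110, 96, 92, 108, 97, 93, 109, 98]
smooth = [120, 112, 106, 102, 100, 100, 102, 106, 112, 120]
# The cycling series crosses its mean many more times (7 vs. 2).
print(wavy_statistic(cycling), wavy_statistic(smooth))
```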
To train and test the random forest model, run the script `run_rf_model.py` with:
- arg1 = region in {wa, nsw, de}
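For example:
python run_rf_model.py wa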
Once the model has run, results will be printed to the terminal and saved in a CSV log file called `random_forest_model_log.csv`. Running the script multiple times will append new lines to this log file.
Advanced: To save a model once it has been trained, set the variable `save_model = True` in the parameters section of the Python script.
To train and test LSTM models, run the script `run_lstm_model.py` with:
- arg1 = region in {wa, nsw, de}
- arg2 = number of training epochs (positive integer)
- arg3 = ensemble model boolean in {0, 1}
A training epoch is a single pass through the data set. Model fit will improve with the number of epochs until over-fitting sets in. The paper used 100 epochs; fewer than 10 is not recommended.
The ensemble model boolean indicates whether to use the ensemble LSTM model or the basic one: 0 gives the basic model, 1 gives the ensemble model.
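For example, to train the ensemble LSTM model on the German data with the 100 epochs used in the paper:
python run_lstm_model.py de 100 1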
Once the model has run, results will be printed to the terminal and saved in a CSV log file called `lstm_model_log.csv`. Running the script multiple times will append new lines to this log file.
Advanced: To save a model once it has been trained, set the variable `save_model = True` in the parameters section of the Python script.
Advanced: The results from Figure 2 (Gains from Additional Data) can be simulated by changing the variable `train_fraction` in the parameters section of any of the LSTM, RF, or parametric model scripts. This modifies the proportion of the data set used to train the models; for example, `train_fraction = 0.5` trains on half of the labeled data.
To use previously trained, optimal theta values to classify a data set contained in a JSON or CSV file:
- Ensure that the data set has the same format as the price window files, in either CSV or JSON (i.e., like `label_databases/german_label_db.json`). Not all data fields need to be present, but there must at least be a price series and a unique identifier column for the observations. For an example of the CSV format, see the output of `convert_price_window_json_csv.py`.
- Set the `external_data_path` parameter of the `parametric_classify_external_data.py` script to the file containing your data.
- Run the script `parametric_classify_external_data.py` with:
  - arg1 = region in {wa, nsw, de}
  - arg2 = method in {PRNR, MIMD, NMC, MBPI, FT0, FT1, FT2, LS0, LS1, LS2, CS0, CS1, WAVY, all}
- Classification results can be found in the specified file.
Note: The region and method arguments load the previously found optimal values of the parameter $\theta$ for that region-method combination.
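For example, to classify your external data using the optimal PRNR parameter for Germany:
python parametric_classify_external_data.py de PRNR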
To use previously trained and saved models to classify a data set contained in a JSON or CSV file:
- Ensure that the data set has the same format as the price window files, in either CSV or JSON (i.e., like `label_databases/german_label_db.json`). Not all data fields need to be present, but there must at least be a price series and a unique identifier column for the observations. For an example of the CSV format, see the output of `convert_price_window_json_csv.py`.
- Modify the basic settings in the set parameters section of the `nonparametric_classify_external_data.py` script (see the sketch after this list). You will need to insert:
  - the training set hash (see the logs) of an RF or LSTM model that you previously trained and saved
  - the path and filename of your external data file, in JSON or CSV format
  - the type of model in {'rf', 'lstm_basic', 'lstm_ensemble'}
  - the filename where you wish to save the results (either CSV or JSON extensions are accepted)
- Run the script `nonparametric_classify_external_data.py`.
- Classification results can be found in the specified file.
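The sketch below shows what the set parameters section might look like once filled in. The variable names are illustrative assumptions, not the script's actual identifiers; check `nonparametric_classify_external_data.py` for the exact names.

```python
# Illustrative values only -- the variable names here are assumptions;
# consult the set parameters section of the script for the real ones.
train_set_hash = 'abc123'                    # hash printed in the RF/LSTM log when the model was saved
external_data_path = 'my_external_data.csv'  # your data, in JSON or CSV format
model_type = 'lstm_ensemble'                 # one of {'rf', 'lstm_basic', 'lstm_ensemble'}
results_path = 'classified_results.csv'      # CSV or JSON extension accepted
```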
Note: Model performance will generally be degraded by biases or other features of the external data that were not also present in the training data. Proceed with caution when using this feature.
To efficiently parse the Tankerkoenig raw data into easy-to-use data structures:
- Download the Tankerkoenig data set using:
git clone https://[email protected]/tankerkoenig/tankerkoenig-data/_git/tankerkoenig-data
- Run the script `de_rawdata_parse_postal_region.py` with:
  - arg1 = postal region in {0...9, all}
A German postal region is the area covered by the set of postal codes starting with the specified digit; more information can be found in the Wikipedia article on postal codes in Germany. Passing the argument 'all' will parse all regions from 0 to 9.
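For example, to parse only postal region 1:
python de_rawdata_parse_postal_region.py 1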
The output of this script will be found by default in the folder `de_databases`. It contains three types of files:
- Station Info Database: This is the file that contains only the header data for the various gas stations. No price data is included in this file, to make it small and fast to load. This file is available by default in both JSON as well as serialized Python pkl format.
- Price Station Database: These files contain a serialized dictionary of station objects (similar to the station info database) that contains a list of price observations and timestamps of all of the price reports from the given station. These databases are partitioned into postal regions.
- Price Windows Files: These JSON files contain series of quarterly observations of daily average prices for each station in the given postal region. This 90-day price window data structure is the basis of the Detecting Edgeworth Cycles paper.
Note:
- Processing a region should take about 20-50 minutes, so fully processing all regions will take several hours.
- This step requires at least 8 GB of RAM, but 16 GB or more is recommended.
- The script is preconfigured to expect the folder `tankerkoenig-data` in the main directory. This can be changed via the `dirs['raw_german_data_folder']` variable in `framework/de_data_parsing/parameters.py`.
- The script will ask before overwriting data.
Price window data files produced by `de_rawdata_parse_postal_region.py` can be converted into CSV files using the script `convert_price_window_json_csv.py`. Conversion can also be performed on user-generated data, in either direction between CSV and JSON formats, provided the data conforms to the same structure as the provided base data.
To use the converter, run the script `convert_price_window_json_csv.py` with:
- arg1 = input_filename (str)
The input filename must include a path and point to either a JSON or CSV file.
Output:
- For JSON input, it converts the data to a CSV file with the same name.
- For CSV input, it converts the data to a JSON file with the same name.
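For example (recalling that the label databases share the price window format), the following command produces `label_databases/german_label_db.csv`:
python convert_price_window_json_csv.py label_databases/german_label_db.json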