Improves Bulk Download Script Example
AlexCatarino committed May 23, 2024
1 parent 5bb6430 commit a06f0a8
Showing 6 changed files with 111 additions and 147 deletions.
@@ -11,34 +11,14 @@

<p>After you subscribe to dataset updates, to update your local copy of the CFD dataset, use the <a href="https://www.quantconnect.com/datasets/oanda-cfd-data/cli">CLI Command Generator</a> to generate your download command and then run it in a terminal in your <a href="https://www.quantconnect.com/docs/v2/lean-cli/initialization/organization-workspaces">organization workspace</a>. Alternatively, instead of directly calling the <code>lean data download</code> command, you can place a Python script in the <span class="public-directory-name">data</span> directory of your organization workspace and run it to update your data files. The following example script updates all data resolutions:</p>

<div class="section-example-container">
<pre class="python">import os
from datetime import datetime
from pytz import timezone
<?
$dataset = "CFD Data";
$securityType = "cfd";
$market = "oanda";
$ticker = "xauusd";
$highResolutions = "[\"minute\", \"second\"]";
$extraArgs = "";
include(DOCS_RESOURCES."/datasets/download_bulk_data_script.php");
?>

# Define a method to download the data
def download_data(resolution, overwrite=False):
    print(f"Updating {resolution} data...")
    command = f'lean data download --dataset "CFD Data" --data-type "Bulk" --resolution "{resolution}"'
    if overwrite:
        command += " --overwrite"
    os.system(command)

# Update minute and second data files
END_DATE = datetime.now(timezone("US/Eastern")).strftime("%Y%m%d")
new_data_available = False
for resolution in ["second", "minute"]:
    latest_date = sorted([f for f in os.listdir(f"cfd/oanda/{resolution}/xauusd")])[-1].split('_')[0]
    if latest_date &gt;= END_DATE:
        print(f"{resolution} data is already up to date.")
        continue
    new_data_available = True
    download_data(resolution)

# Update daily and hourly data files
if new_data_available:
    for resolution in ["hour", "daily"]:
        download_data(resolution, True)</pre>
</div>

<p>The preceding script checks the date of the most recent XAUUSD data you have for second and minute resolutions. If there is new data available for either of these resolutions, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
<p>The preceding script checks the date of the most recent XAUUSD data you have for second and minute resolutions. If there is new data available for either of these resolutions, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
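The date check described above can be sketched in isolation. This is a minimal illustration, not the script from the commit: it assumes each file in a resolution directory is named with a leading `YYYYMMDD` date followed by an underscore (the example names such as `20240520_quote.zip` are hypothetical), and the helper names `latest_date_on_disk` and `is_up_to_date` are invented for this sketch.

```python
import os
import tempfile


def latest_date_on_disk(directory):
    """Return the newest YYYYMMDD prefix among the file names, or None if the directory is empty."""
    names = sorted(os.listdir(directory))
    if not names:
        return None
    # Assumption: file names look like "<YYYYMMDD>_<suffix>", so the date is the text before "_"
    return names[-1].split("_")[0]


def is_up_to_date(latest_on_disk, latest_on_cloud):
    """Compare two YYYYMMDD strings; equal-length date strings sort in calendar order."""
    return latest_on_disk is not None and latest_on_disk >= latest_on_cloud


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical file names used only to exercise the helpers
        for name in ("20240520_quote.zip", "20240521_quote.zip", "20240517_quote.zip"):
            open(os.path.join(root, name), "w").close()
        latest = latest_date_on_disk(root)
        print(latest)                             # 20240521
        print(is_up_to_date(latest, "20240522"))  # False
```

Because the dates are zero-padded strings of equal length, plain string comparison gives the same result as comparing parsed dates, which is why the script can skip date parsing.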
@@ -11,34 +11,14 @@

<p>After you subscribe to dataset updates, to update your local copy of the Forex dataset, use the <a href="https://www.quantconnect.com/datasets/oanda-forex/cli">CLI Command Generator</a> to generate your download command and then run it in a terminal in your <a href="https://www.quantconnect.com/docs/v2/lean-cli/initialization/organization-workspaces">organization workspace</a>. Alternatively, instead of directly calling the <code>lean data download</code> command, you can place a Python script in the <span class="public-directory-name">data</span> directory of your organization workspace and run it to update your data files. The following example script updates all data resolutions:</p>

<div class="section-example-container">
<pre class="python">import os
from datetime import datetime
from pytz import timezone

# Define a method to download the data
def download_data(resolution, overwrite=False):
    print(f"Updating {resolution} data...")
    command = f'lean data download --dataset "FOREX Data" --data-type "Bulk" --resolution "{resolution}"'
    if overwrite:
        command += " --overwrite"
    os.system(command)

# Update minute and second data files
END_DATE = datetime.now(timezone("US/Eastern")).strftime("%Y%m%d")
new_data_available = False
for resolution in ["second", "minute"]:
    latest_date = sorted([f for f in os.listdir(f"forex/oanda/{resolution}/eurusd")])[-1].split('_')[0]
    if latest_date &gt;= END_DATE:
        print(f"{resolution} data is already up to date.")
        continue
    new_data_available = True
    download_data(resolution)

# Update daily and hourly data files
if new_data_available:
    for resolution in ["hour", "daily"]:
        download_data(resolution, True)</pre>
</div>

<p>To update your local dataset, the preceding script checks the date of the most recent EURUSD data you have for second and minute resolutions. If there is new data available for either of these resolutions, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
<?
$dataset = "FOREX Data";
$securityType = "forex";
$market = "oanda";
$ticker = "eurusd";
$highResolutions = "[\"minute\", \"second\"]";
$extraArgs = "";
include(DOCS_RESOURCES."/datasets/download_bulk_data_script.php");
?>

<p>To update your local dataset, the preceding script checks the date of the most recent EURUSD data you have for all resolutions. If there is new data available for any of these resolutions, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
@@ -15,34 +15,14 @@

<p>Alternatively, instead of directly calling the <code>lean data download</code> command, you can place a Python script in the <span class="public-directory-name">data</span> directory of your organization workspace and run it to update your data files. The following example script updates all data resolutions:</p>

<div class="section-example-container">
<pre class="python">import os
from datetime import datetime
from pytz import timezone
<?
$dataset = "US Equities";
$securityType = "equity";
$market = "usa";
$ticker = "spy";
$highResolutions = "[\"minute\", \"second\", \"tick\"]";
$extraArgs = "";
include(DOCS_RESOURCES."/datasets/download_bulk_data_script.php");
?>

# Define a method to download the data
def download_data(resolution, overwrite=False):
    print(f"Updating {resolution} data...")
    command = f'lean data download --dataset "US Equities" --data-type "Bulk" --resolution "{resolution}"'
    if overwrite:
        command += " --overwrite"
    os.system(command)

# Update minute, second, and tick data files
END_DATE = datetime.now(timezone("US/Eastern")).strftime("%Y%m%d")
new_data_available = False
for resolution in ["tick", "second", "minute"]:
    latest_date = sorted([f for f in os.listdir(f"equity/usa/{resolution}/spy")])[-1].split('_')[0]
    if latest_date &gt;= END_DATE:
        print(f"{resolution} data is already up to date.")
        continue
    new_data_available = True
    download_data(resolution)

# Update daily and hourly data files
if new_data_available:
    for resolution in ["hour", "daily"]:
        download_data(resolution, True)</pre>
</div>

<p>The preceding script checks the date of the most recent SPY data you have for tick, second, and minute resolutions. If there is new data available for any of these resolutions, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
<p>The preceding script checks the date of the most recent SPY data you have for all resolutions. If there is new data available for any of these resolutions, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
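For the hourly and daily resolutions, the shared script reads the last row of a single zipped CSV file and keeps the first eight characters of its first column. The sketch below shows the same idea using only the standard library instead of pandas. It assumes the archive holds one CSV file whose first column begins with a `YYYYMMDD` date; the function name `latest_date_in_zip`, the inner file name `spy.csv`, and the sample rows are hypothetical.

```python
import csv
import io
import os
import tempfile
import zipfile


def latest_date_in_zip(zip_path):
    """Return the YYYYMMDD prefix of the first column in the last non-empty row, or None."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive contains exactly one CSV file
        inner_name = archive.namelist()[0]
        with archive.open(inner_name) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            rows = [row for row in reader if row]
    if not rows:
        return None
    return rows[-1][0][:8]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "spy.zip")
        with zipfile.ZipFile(path, "w") as archive:
            # Hypothetical rows: timestamp followed by price fields
            archive.writestr("spy.csv", "20240520 00:00,1,2\n20240521 00:00,3,4\n")
        print(latest_date_in_zip(path))
```

Reading the whole file to find the last row is acceptable for daily and hourly files, which are small; a tick or second file would call for a different approach.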
@@ -13,31 +13,14 @@

<p>Alternatively, instead of directly calling the <code>lean data download</code> command, you can place a Python script in the <span class="public-directory-name">data</span> directory of your organization workspace and run it to update your data files. The following example script updates all data resolutions:</p>

<div class="section-example-container">
<pre class="python">import os
from datetime import datetime
from pytz import timezone

# Define a method to download the data
def download_data(resolution, overwrite=False):
    print(f"Updating {resolution} data...")
    command = f'lean data download --dataset "US Equity Options" --data-type "Bulk" --option-style "American" --resolution "{resolution}"'
    if overwrite:
        command += " --overwrite"
    os.system(command)

# Update data files
END_DATE = datetime.now(timezone("US/Eastern")).strftime("%Y%m%d")
latest_date = sorted([f for f in os.listdir(f"option/usa/minute/aapl")])[-1].split('_')[0]
if latest_date &gt;= END_DATE:
    print(f"Your data is already up to date.")
else:
    download_data("minute")
    for resolution in ['hour', 'daily']:
        download_data(resolution, True)</pre>
</div>

<p>The preceding script checks the date of the most recent minute resolution data you have for AAPL. If there is new minute data available, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>



<?
$dataset = "US Equity Options";
$securityType = "option";
$market = "usa";
$ticker = "aapl";
$highResolutions = "[\"minute\"]";
$extraArgs = "--option-style \"American\"";
include(DOCS_RESOURCES."/datasets/download_bulk_data_script.php");
?>

<p>The preceding script checks the date of the most recent minute resolution data you have for AAPL. If there is new minute data available, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
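The US Equity Options page passes an extra `--option-style "American"` argument through `$extraArgs`. As a rough sketch of how such a command could be assembled, the hypothetical helper `build_download_command` below returns an argument list instead of a single string, which avoids manual quoting. It uses only the flags that appear in this commit (`--dataset`, `--data-type`, `--option-style`, `--resolution`, `--start`, `--end`, `--overwrite`) and does not run anything.

```python
import shlex


def build_download_command(dataset, resolution, extra_args=(), start=None, end=None, overwrite=False):
    """Assemble the `lean data download` arguments as a list (hypothetical helper)."""
    command = [
        "lean", "data", "download",
        "--dataset", dataset,
        "--data-type", "Bulk",
        *extra_args,
        "--resolution", resolution,
    ]
    if start:
        # Mirror the script in this commit: when no end date is given, reuse the start date
        command += ["--start", start, "--end", end or start]
    if overwrite:
        command.append("--overwrite")
    return command


if __name__ == "__main__":
    command = build_download_command(
        "US Equity Options",
        "minute",
        extra_args=("--option-style", "American"),
        start="20240520",
        end="20240522",
    )
    # shlex.join quotes arguments that contain spaces, giving a shell-safe preview
    print(shlex.join(command))
```

A list like this can be handed to `subprocess.run(command, check=True)`, which raises an error on a non-zero exit code, whereas `os.system` only returns the status.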
@@ -8,31 +8,14 @@

<p>After you subscribe to dataset updates, to update your local copy of the US Index Options dataset, use the <a href="https://www.quantconnect.com/datasets/algoseek-us-index-options/cli">CLI Command Generator</a> to generate your download command and then run it in a terminal in your <a href="https://www.quantconnect.com/docs/v2/lean-cli/initialization/organization-workspaces">organization workspace</a>. Alternatively, instead of directly calling the <code>lean data download</code> command, you can place a Python script in the <span class="public-directory-name">data</span> directory of your organization workspace and run it to update your data files. The following example script updates all data resolutions:</p>

<div class="section-example-container">
<pre class="python">import os
from datetime import datetime
from pytz import timezone

# Define a method to download the data
def download_data(resolution, overwrite=False):
    print(f"Updating {resolution} data...")
    command = f'lean data download --dataset "US Index Options" --data-type "Bulk" --resolution "{resolution}"'
    if overwrite:
        command += " --overwrite"
    os.system(command)

# Update data files
END_DATE = datetime.now(timezone("US/Eastern")).strftime("%Y%m%d")
latest_date = sorted([f for f in os.listdir(f"indexoption/usa/minute/spx")])[-1].split('_')[0]
if latest_date &gt;= END_DATE:
    print(f"Your data is already up to date.")
else:
    download_data("minute")
    for resolution in ['hour', 'daily']:
        download_data(resolution, True)</pre>
</div>

<p>The preceding script checks the date of the most recent minute resolution data you have for SPX. If there is new minute data available, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>



<?
$dataset = "US Index Options";
$securityType = "indexoption";
$market = "usa";
$ticker = "spx";
$highResolutions = "[\"minute\"]";
$extraArgs = "";
include(DOCS_RESOURCES."/datasets/download_bulk_data_script.php");
?>

<p>The preceding script checks the date of the most recent minute resolution data you have for SPX. If there is new minute data available, it downloads the new data files and overwrites your hourly and daily files. If you don't intend to download all resolutions, adjust this script to your needs.</p>
58 changes: 58 additions & 0 deletions Resources/datasets/download_bulk_data_script.php
@@ -0,0 +1,58 @@
<div class="section-example-container">
<pre class="python">import os
import pandas as pd
from datetime import datetime, time, timedelta
from pytz import timezone
from os.path import abspath, dirname
os.chdir(dirname(abspath(__file__)))

OVERWRITE = False

# Define a method to download the data
def __download_data(resolution, start=None, end=None):
    print(f"Updating {resolution} data...")
    command = f'lean data download --dataset "<?=$dataset?>" --data-type "Bulk" <?=$extraArgs?> --resolution "{resolution}"'
    if start:
        end = end if end else start
        command += f" --start {start} --end {end}"
    if OVERWRITE:
        command += " --overwrite"
    print(command)
    os.system(command)

def __get_end_date() -> str:
    now = datetime.now(timezone("US/Eastern"))
    if now.time() > time(7,30):
        return (now - timedelta(1)).strftime("%Y%m%d")
    print('New data is available at 07:30 AM EST')
    return (now - timedelta(2)).strftime("%Y%m%d")

def __download_high_frequency_data(latest_on_cloud):
    for resolution in <?=$highResolutions?>:
        dir_name = f"<?=$securityType?>/<?=$market?>/{resolution}/<?=$ticker?>".lower()
        if not os.path.exists(dir_name):
            __download_data(resolution, '19980101')
            continue
        latest_on_disk = sorted(os.listdir(dir_name))[-1].split('_')[0]
        if latest_on_disk >= latest_on_cloud:
            print(f"{resolution} data is already up to date.")
            continue
        __download_data(resolution, latest_on_disk, latest_on_cloud)

def __download_low_frequency_data(latest_on_cloud):
    for resolution in ["daily", "hour"]:
        file_name = f"<?=$securityType?>/<?=$market?>/{resolution}/<?=$ticker?>.zip".lower()
        if not os.path.exists(file_name):
            __download_data(resolution)
            continue
        latest_on_disk = str(pd.read_csv(file_name, header=None)[0].iloc[-1])[:8]
        if latest_on_disk >= latest_on_cloud:
            print(f"{resolution} data is already up to date.")
            continue
        __download_data(resolution)

if __name__ == "__main__":
    latest_on_cloud = __get_end_date()
    __download_low_frequency_data(latest_on_cloud)
    __download_high_frequency_data(latest_on_cloud)</pre>
</div>
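The cutoff logic in `__get_end_date` can be exercised without waiting for a particular time of day by passing the current time in as a parameter. The sketch below is a simplified variant under stated assumptions: `latest_available_date` is a hypothetical name, and `now` is assumed to be a naive `datetime` that already represents US/Eastern wall-clock time, so no `pytz` conversion happens here.

```python
from datetime import datetime, time, timedelta


def latest_available_date(now, cutoff=time(7, 30)):
    """Return the newest date (YYYYMMDD) expected to be downloadable at `now`.

    After the cutoff, yesterday's data is assumed to be available;
    at or before the cutoff, only the data from two days ago is.
    """
    days_back = 1 if now.time() > cutoff else 2
    return (now - timedelta(days=days_back)).strftime("%Y%m%d")


if __name__ == "__main__":
    print(latest_available_date(datetime(2024, 5, 23, 8, 0)))  # 20240522
    print(latest_available_date(datetime(2024, 5, 23, 7, 0)))  # 20240521
```

Note that the comparison is strict, so a run at exactly 07:30 still falls back two days, matching the `>` comparison in the committed script.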
