Separate rel_path from rel_url_path, use one url #113

Merged 3 commits on Oct 25, 2022
18 changes: 12 additions & 6 deletions README.rst
@@ -263,17 +263,23 @@ datalogistik_metadata.ini
Schema of the table.

``url``
Download url in case this is a single-file table.

``base_url``
Base download url in case this is a multi-file table. Each file will append
their `rel_path` to this to form the full download url.
Download url for the table. This can be:
* A URL specifying the file to be downloaded for that table (which could be a
single file, or a directory that contains many files to be downloaded)
* A base URL that is concatenated with ``rel_url_path``s in the ``files`` attribute
if the table is a multi-file table and it is preferable to list out the files
Comment on lines +267 to +270
Contributor Author:

I'm sure this RST isn't right; please suggest the right syntax.

Member:

I think you just need more line breaks (usually the solution to RST woes :) ). Also watch out for the "``s"

    ``url``

        Download url for the table. This can be:

        * A URL specifying the file to be downloaded for that table (which could be a 
          single file, or a directory that contains many files to be downloaded)
        * A base URL that is concatenated with ``rel_url_path`` in the ``files`` attribute 
          if the table is a multi-file table and it is preferable to list out the files


``files``
A list of files in this table. Each entry in the list has the following properties:

``rel_path``
Path to the file, relative to the directory of this table.
Path to the file(s), relative to the directory of this table. This is the
location on disk in the cache.

``rel_url_path``
URL path to the file(s), relative to the directory where this table is stored
remotely. This is used only when downloading the file, and is only necessary when a
multi-file table has the files that make up the table listed out individually.

``file_size``
Size of the file.
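
For illustration, here is a hypothetical multi-file table entry, written as the equivalent Python dict (the table name, URL, sizes, and checksums are all made up). Each file is downloaded from ``url`` joined with its ``rel_url_path`` and cached under the table's directory, so its on-disk ``rel_path`` ends up mirroring ``rel_url_path``:

    # A sketch of a multi-file table entry; all values are illustrative.
    table_entry = {
        "table": "taxi_multi",
        "url": "https://example.com/datasets/taxi",  # base URL shared by all files
        "files": [
            # each file lists the URL suffix it is fetched from
            {"rel_url_path": "2021/part-0.parquet", "file_size": 12345, "md5": "<checksum>"},
            {"rel_url_path": "2021/part-1.parquet", "file_size": 67890, "md5": "<checksum>"},
        ],
    }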
76 changes: 37 additions & 39 deletions datalogistik/dataset.py
@@ -357,7 +357,6 @@ def validate_table_files(self, table):
f"No metadata found for table {table}, could not perform validation (assuming valid)"
)
return True

new_file_listing = self.create_file_listing(table)
# we can't perform a simple equality check on the whole listing,
# because the orig_file_listing does not contain the metadata file.
@@ -372,6 +371,8 @@ def validate_table_files(self, table):
found = None
for new_file in new_file_listing:
if new_file["rel_path"] == orig_file["rel_path"]:
# drop the rel_url_path for comparison, because it's not relevant!
orig_file.pop("rel_url_path", None)
Comment on lines +374 to +375
Contributor Author:

We might want to do the reverse of this some day and only check a subset of specific fields instead of dropping this, but this was the easiest way to get to working code.

found = new_file
break

@@ -448,7 +449,7 @@ def download(self):
msg = (
"No table entries were found. "
"To download a dataset, at least 1 table entry must exist "
"that has a 'url' or 'base_url' property."
"that has a 'url' property."
)
log.error(msg)
raise ValueError(msg)
@@ -460,51 +461,48 @@ def download(self):

# For now, we always download all tables. So we need to loop through each table
for table in self.tables:

# There are 2 possible types downloads:
# 1 - The table entry has a url property. Either this table is a single-file, or it is a single
# download that will produce multiple files.
# 2 - multi file (either single or multi table). The table entry has a base_url property,
# and each file has a rel_path property. This is appended to the base_url to form
# the download link. The files will be placed in the table directory (generated from the table name).

# create table dir
table_path = self.ensure_table_loc(table)

# Type 1
if table.url:
# Note that table_path will be a file if this is a single-file table,
# and a dir if it is a multi-file table (download_file will produce multiple files)
util.download_file(table.url, output_path=table_path)
util.set_readonly(table_path)

# Type 2
elif table.base_url:
if len(table.files) <= 1:
msg = f"Single-file table '{table.table}' has 'base_url' property set. It should only have a 'url'."
for file in table.files:
Collaborator:

Don't we want to support repo entries that do not have a files entry? A simple entry could just have a table entry with a url.

Contributor Author:

That would be nice, but let's do that in a follow-on or later PR. The one (open) question there is: do we want to support supplying the checksum, etc. at a higher level without the files entry? That gets complicated, so IMO we probably shouldn't, but let's make an issue for that + discuss there.

# Validate that there is a url at all
if not table.url:
msg = (
f"Could not find a url property for Table '{table.table}'."
)
log.error(msg)
raise ValueError(msg)
for file in table.files:
# contains the suffix for the download url
rel_path = file.get("rel_path")
if not rel_path:
msg = f"Missing rel_path property for multi-file table '{table.table}'."
raise RuntimeError(msg)

# contains the suffix for the download url
rel_url_path = file.get("rel_url_path")

# if the filename is not at the end of full_path, join
table_url = table.url
if rel_url_path and not table_url.endswith(rel_url_path):
table_url = table_url + "/" + rel_url_path

download_path = table_path

# Set the rel_path
file["rel_path"] = download_path.name
Collaborator:

Why overwrite file["rel_path"] here? Don't we need it to figure out where to place the file? The contents could be different from download_path.name, especially if it is placed in a subdir (because of partitioning, for example), and then the validation would fail.

Or did I misunderstand, and is rel_path now not a user input anymore? It seems I did, but then my question is: if rel_url_path is set, it is always identical to rel_path, so do we need the distinction?

Contributor Author:

it is always identical to rel_path, so do we need the distinction?

You're right that it's (almost) always identical, but there are two cases where it's not identical:

  • in the multi-file table scenario below
  • when the table name is different from the filename
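
A sketch of those two cases (all names, sizes, and URLs here are invented for illustration):

    # Case 1: multi-file table. rel_url_path describes the remote layout; after
    # download, each file's rel_path mirrors its rel_url_path under the table dir.
    multi_file_table = {
        "table": "taxi",
        "url": "https://example.com/taxi",  # base URL
        "files": [{"rel_url_path": "2021/part-0.parquet"}],
    }

    # Case 2: single-file table whose name differs from the remote filename.
    # The download URL ends in 2016Q4.csv.gz, but the cached file is named after
    # the table, so rel_path ("fanniemae_sample.csv.gz") != rel_url_path.
    single_file_table = {
        "table": "fanniemae_sample",
        "url": "https://example.com/fanniemae/2016Q4.csv.gz",
    }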


# if this is a multi file table, then we need to do validate that
# there are rel_paths + append them. All files constituting a table
# must be in a dir with name table.name (created by ensure_table_loc)
# note that the resulting dir structure is not necessarily flat,
# because the table can have multiple levels of partitioning.
if len(table.files) > 1 or table.multi_file:
if not rel_url_path:
msg = f"Missing rel_url_path property for multi-file table '{table.table}'."
log.error(msg)
raise ValueError(msg)
download_path = download_path / rel_url_path

# All files constituting a table must be in a dir with name table.name (created by ensure_table_loc)
# note that the resulting dir structure is not necessarily flat,
# because the table can have multiple levels of partitioning.
download_path = table_path / rel_path
url = table.base_url + "/" + rel_path

util.download_file(url, output_path=download_path)
util.set_readonly(download_path)
# but for multi-file tables, we override this with the rel_url_path
file["rel_path"] = rel_url_path

else:
msg = f"Could not find a url or base_url property for Table '{table.table}'."
log.error(msg)
raise RuntimeError(msg)
util.download_file(table_url, output_path=download_path)
util.set_readonly(download_path)

# Try validation in case the dataset info contained checksums
if not self.validate_table_files(table):
1 change: 0 additions & 1 deletion datalogistik/table.py
@@ -33,4 +33,3 @@ class Table:
header_line: Optional[bool] = None
dim: Optional[List] = field(default_factory=list)
url: Optional[str] = None
base_url: Optional[str] = None
11 changes: 1 addition & 10 deletions repo.json
@@ -12,7 +12,6 @@
"header_line": false,
"files": [
{
"rel_path": "2016Q4.csv.gz",
"file_size": 262125134,
"md5": "6e8ff7ae76033c452a9b72350461a4e6"
}
@@ -77,7 +76,6 @@
],
"files": [
{
"rel_path": "nyctaxi_2010-01.csv.gz",
"file_size": 591876633,
"md5": "22bd86bf54da91faf26fa8a243644a9f"
}
@@ -98,7 +96,6 @@
],
"files": [
{
"rel_path": "chi_traffic_2020_Q1.parquet",
"file_size": 182895135,
"md5": "b53ea738eda7ee051bcf33b3056b83f6"
}
@@ -119,7 +116,6 @@
],
"files": [
{
"rel_path": "type_strings.parquet",
"file_size": 87174822,
"md5": "927c0cf481fb3c4870806b096b525d90"
}
@@ -140,7 +136,6 @@
],
"files": [
{
"rel_path": "type_dict.parquet",
"file_size": 2890770,
"md5": "4186ad2f3072db6558aba3c535da1b97"
}
@@ -161,7 +156,6 @@
],
"files": [
{
"rel_path": "type_integers.parquet",
"file_size": 15882666,
"md5": "7d44caa83025c9c7ef2ded67081d6d68"
}
@@ -182,7 +176,6 @@
],
"files": [
{
"rel_path": "type_floats.parquet",
"file_size": 23851672,
"md5": "d083e46ddeff5cc5a3be3ea5c2f7f9d7"
}
@@ -203,7 +196,6 @@
],
"files": [
{
"rel_path": "type_nested.parquet",
"file_size": 130538033,
"md5": "4bcf6735b00b32fccb6da338262def23"
}
@@ -224,12 +216,11 @@
],
"files": [
{
"rel_path": "type_simple_features.parquet",
"file_size": 28637722,
"md5": "7776bea2a08e0466f77058676e2bb567"
}
]
}
]
}
]
]
@@ -7,7 +7,7 @@
"tables": [
{
"table": "taxi_2013",
"base_url": "http://www.example.com",
"url": "http://www.example.com",
"files": [
{
"file_size": 5653,
10 changes: 9 additions & 1 deletion tests/test_datalogistik.py
@@ -297,9 +297,12 @@ def test_compress(comp_string):


# Integration-style tests
@pytest.mark.skipif(sys.platform == "win32", reason="windows path separator")
def test_main(capsys):
# This should be in the cache already, so no conversion needed
exact_dataset = dataset.Dataset(name="fanniemae_sample", format="csv", delim="|")
exact_dataset = dataset.Dataset(
name="fanniemae_sample", format="csv", delim="|", compression="gzip"
Contributor Author:

It turns out we were missing the compression here, so this was actually transforming a dataset (oops!). I've added to the test below an assertion that the hash is the same so we catch that if it happens again (with a comment why...)

Collaborator:

Oh wow, good catch! It has proved rather difficult to get the default behavior right in all situations. Maybe we should reconsider the None versus "uncompressed" distinction (where None is basically "dont-care"); then our equality operator could just return equal if one of the objects has compression "dont-care".

)

with pytest.raises(SystemExit) as e:
datalogistik.main(exact_dataset)
@@ -310,6 +313,11 @@ def test_main(capsys):
assert captured["name"] == "fanniemae_sample"
assert captured["format"] == "csv"
assert isinstance(captured["tables"], dict)
# this is the path from the fixtures; if this doesn't match, we've actually converted and not just found the extant one
assert (
captured["tables"]["fanniemae_sample"]["path"]
== "tests/fixtures/test_cache/fanniemae_sample/a77e575/fanniemae_sample.csv.gz"
)
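
For reference, the asserted path implies a cache layout of roughly <cache_root>/<dataset_name>/<variant hash>/<table file(s)> — a reading of the fixture path above, not a documented contract:

    from pathlib import Path

    # Sketch: the cache location implied by the fixture path in the assertion
    # above; the hash segment ("a77e575") is assumed to be derived from the
    # dataset's parameters.
    cache_root = Path("tests/fixtures/test_cache")
    cached_file = cache_root / "fanniemae_sample" / "a77e575" / "fanniemae_sample.csv.gz"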


@pytest.mark.skipif(sys.platform == "win32", reason="windows errors on the cleanup")