Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate rel_path from rel_url_path, use one url #113

Merged
merged 3 commits into from
Oct 25, 2022
Merged

Separate rel_path from rel_url_path, use one url #113

merged 3 commits into from
Oct 25, 2022

Conversation

jonkeane
Copy link
Contributor

@jonkeane jonkeane commented Oct 20, 2022

Resolves #110 and #111

This introduces a disambiguation of rel_path from rel_url_path. The idea is that:

  • rel_path is that path in the cache, on disk
  • rel_url_path is the path needed for downloading only

This distinction is really only needed for cases where the table name (which is what a single-file table will be named once downloaded) is different from the path that it is stored as. We might not even need this "feature" though we have a table where that happens | is true already. We could also overwrite | change rel_path after we download the file, but that doesn't feel great (in case we maybe want to redownload? maybe it's not worth maintaining at all...)

rel_url_path is really only needed in cases where there is a multi-file table, where the files are listed (e.g. not the case of the taxi s3 bucket, we don't yet have examples of this!). But it can be specified elsewhere if that's helpful.

All of this means that: for the easy path, an author of a new repo entry generally just uses url and that's it. But we can save typing if there are multi-files by using url as a base with rel_url_paths added on. And then we (internally! with code) maintain the rel_path rel_url_path if we need those.

Comment on lines +267 to +270
* A URL specifying the file to be downloaded for that table (which could be a
single file, or a directory that contains many files to be downloaded)
* A base URL that is concatenated with ``rel_url_path``s in the ``files`` attribute
if the table is a multi-file table and it is preferable to list out the files
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure this rst isn't right, please suggest the right syntax

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you just need more line breaks (usually the solution to RST woes :) ). Also watch out for the "``s"

    ``url``

        Download url for the table. This can be:

        * A URL specifying the file to be downloaded for that table (which could be a 
          single file, or a directory that contains many files to be downloaded)
        * A base URL that is concatenated with ``rel_url_path`` in the ``files`` attribute 
          if the table is a multi-file table and it is preferable to list out the files

Comment on lines +374 to +375
# drop the rel_url_path for comparison, because it's not relevant!
orig_file.pop("rel_url_path", None)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to do the reverse of this some day and only check a subset of specific fields instead of dropping this, but this was the easiest way to working code.

exact_dataset = dataset.Dataset(name="fanniemae_sample", format="csv", delim="|")
# import pdb; pdb.set_trace()
exact_dataset = dataset.Dataset(
name="fanniemae_sample", format="csv", delim="|", compression="gzip"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out we were missing the compression here, so this was actually transforming a dataset (oops!). I've added to the test below an assertion that the hash is the same so we catch that if it happens again (with a comment why...)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow good catch, it has proved rather difficult to get the default behavior right in all situations! Maybe we should reconsider the None versus "uncompressed" distinction (where None is basically "dont-care"), then our equality operator could just return equal if one of the objects has compression "dont-care"

Comment on lines +254 to +279
"table_url,rel_url_path,final_url",
[
(
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
None,
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
),
# we can append rel_url_path
(
"https://ursa-qa.s3.amazonaws.com/chitraffic",
"chi_traffic_2020_Q1.parquet",
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
),
# we don't duplicate rel_url_path when it's there
(
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
"chi_traffic_2020_Q1.parquet",
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
),
# we can have a tablename that is differnet from rel_url_path
(
"https://ursa-qa.s3.amazonaws.com/chitraffic",
"traffic.parquet",
"https://ursa-qa.s3.amazonaws.com/chitraffic/traffic.parquet",
),
],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably overkill (basically to support someone specifying the full url and the rel_path), but it's a handful of lines of code + some tests to cover a relatively confusing bit. We can pull it out (or turn the fix-up code that removes the filename if it's at the end of the url into validation that tells someone they've messed up their repo code)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it this way, because it's nice to still support a simple url (that is not made up by a base + rel_url_path)

Comment on lines +291 to +301
# We probably should move write_metadata out from download, but let's mock it for now
def _fake_write_metadata(self):
return True

monkeypatch.setattr(Dataset, "write_metadata", _fake_write_metadata)

# This probably also means that set_readonly isn't in the right place, but let's mock it away nonetheless
def _fake_set_readonly(path):
return True

monkeypatch.setattr("datalogistik.util.set_readonly", _fake_set_readonly)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In working on these tests I've left things in place as much as possible, but that we need to monkey patch these functions tells me we probably don't have Dataset.download() factored like we should. I would propose we have a method that does the downloading and no validation or anything (maybe Dataset._download()`), and then we wrap that with the validation etc. stuff later (so then when we test that we only need to mock the actual downloading bit and not also permissions+metadata writing+permissions munging).

#114

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Comment on lines +352 to +358
# add rel_url_path as if this were coming from a repo file with that.
ds = copy.deepcopy(multi_file_ds)
files = ds.tables[0].files
# add rel_url_path
[fl.update({"rel_url_path": fl["rel_path"]}) for fl in files]
# remote rel_path
[fl.pop("rel_path") for fl in files]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit funny, but it's because we're using metadata from the fixture cache to then test downloading. So we need to add rel_url_path in here as if it were in repo.json

We could also add rel_url_path in to the fixtures and take out some of this (though it's good that we're testing the rel_path setting which (rightfully) happens during downloading.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add rel_url_path in the metadata files in the fixture cache, I think these properties will be carried from the repo file entry so they would be there in a real metadata file too

@@ -298,7 +392,8 @@ def _simulate_download_by_copying(url, output_path):
# Note that if we were not working in a tmp dir, the dir should have been deleted
# and the exception message would be: "File integrity check for newly downloaded dataset failed."

# Test a succeeding download

def test_validated_download_dataset(monkeypatch):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I broke these out to make them a bit easier to see which failed when. Happy to put them back if we're 👎 on that

Copy link
Collaborator

@joosthooz joosthooz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, I posed some questions but feel free to merge

README.rst Outdated
location on disk in the cache.

``rel_url_path``
URL path to the file(s), relative to the directory of this table. This is used
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "relative to the base URL"?

elif table.base_url:
if len(table.files) <= 1:
msg = f"Single-file table '{table.table}' has 'base_url' property set. It should only have a 'url'."
for file in table.files:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to support repo entries that do not have a files entry? A simple entry could just have a table entry with a url.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be nice, but let's do that in a follow on or later PR. The one (open) question there is: do we want to support supplying the checksum, etc. at a higher level without the files entry. That gets complicated so IMO we probably shouldn't, but let's make an issue for that + discuss there

download_path = table_path

# Set the rel_path
file["rel_path"] = download_path.name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why overwrite file["rel_path"] here? Don't we need to it figure out where to place the file? The contents could be different from download_path.name, especially if it is placed in a subdir (because of partitioning for example), and then the validation would fail

Or did I misunderstand and is rel_path now not a user input anymore? It seems I did, but then my question is; if rel_url_path is set, it is alway identical to rel_path, so do we need the distinction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is alway identical to rel_path, so do we need the distinction?

You're right that it's (almost) always identical, but there are two cases where it's not identical:

  • in the multi-file table scenario below
  • when the table name is different from the filename

exact_dataset = dataset.Dataset(name="fanniemae_sample", format="csv", delim="|")
# import pdb; pdb.set_trace()
exact_dataset = dataset.Dataset(
name="fanniemae_sample", format="csv", delim="|", compression="gzip"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow good catch, it has proved rather difficult to get the default behavior right in all situations! Maybe we should reconsider the None versus "uncompressed" distinction (where None is basically "dont-care"), then our equality operator could just return equal if one of the objects has compression "dont-care"

README.rst Outdated

``files``
A list of files in this table. Each entry in the list has the following properties:

``rel_path``
Path to the file, relative to the directory of this table.
Path to the file(s), relative to the directory of this table. This is the the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate "the"

Comment on lines +254 to +279
"table_url,rel_url_path,final_url",
[
(
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
None,
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
),
# we can append rel_url_path
(
"https://ursa-qa.s3.amazonaws.com/chitraffic",
"chi_traffic_2020_Q1.parquet",
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
),
# we don't duplicate rel_url_path when it's there
(
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
"chi_traffic_2020_Q1.parquet",
"https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
),
# we can have a tablename that is differnet from rel_url_path
(
"https://ursa-qa.s3.amazonaws.com/chitraffic",
"traffic.parquet",
"https://ursa-qa.s3.amazonaws.com/chitraffic/traffic.parquet",
),
],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it this way, because it's nice to still support a simple url (that is not made up by a base + rel_url_path)

Comment on lines +352 to +358
# add rel_url_path as if this were coming from a repo file with that.
ds = copy.deepcopy(multi_file_ds)
files = ds.tables[0].files
# add rel_url_path
[fl.update({"rel_url_path": fl["rel_path"]}) for fl in files]
# remote rel_path
[fl.pop("rel_path") for fl in files]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add rel_url_path in the metadata files in the fixture cache, I think these properties will be carried from the repo file entry so they would be there in a real metadata file too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Possibly unify url and base_url
3 participants