Separate `rel_path` from `rel_url_path`, use one `url` #113

jonkeane · 2022-10-20T20:30:55Z

Resolves #110 and #111

This introduces a disambiguation of rel_path from rel_url_path. The idea is that:

rel_path is that path in the cache, on disk
rel_url_path is the path needed for downloading only

This distinction is really only needed for cases where the table name (which is what a single-file table will be named once downloaded) is different from the path that it is stored as. We might not even need this "feature" though we have a table where that happens | is true already. We could also overwrite | change rel_path after we download the file, but that doesn't feel great (in case we maybe want to redownload? maybe it's not worth maintaining at all...)

rel_url_path is really only needed in cases where there is a multi-file table, where the files are listed (e.g. not the case of the taxi s3 bucket, we don't yet have examples of this!). But it can be specified elsewhere if that's helpful.

All of this means that: for the easy path, an author of a new repo entry generally just uses url and that's it. But we can save typing if there are multi-files by using url as a base with rel_url_paths added on. And then we (internally! with code) maintain the rel_path rel_url_path if we need those.

jonkeane · 2022-10-20T20:31:58Z

README.rst

+        * A URL specifying the file to be downloaded for that table (which could be a 
+          single file, or a directory that contains many files to be downloaded)
+        * A base URL that is concatenated with ``rel_url_path``s in the ``files`` attribute 
+          if the table is a multi-file table and it is preferable to list out the files


I'm sure this rst isn't right, please suggest the right syntax

I think you just need more line breaks (usually the solution to RST woes :) ). Also watch out for the "``s"

``url`` Download url for the table. This can be: * A URL specifying the file to be downloaded for that table (which could be a single file, or a directory that contains many files to be downloaded) * A base URL that is concatenated with ``rel_url_path`` in the ``files`` attribute if the table is a multi-file table and it is preferable to list out the files

jonkeane · 2022-10-20T20:32:51Z

datalogistik/dataset.py

+                    # drop the rel_url_path for comparison, because it's not relevant!
+                    orig_file.pop("rel_url_path", None)


We might want to do the reverse of this some day and only check a subset of specific fields instead of dropping this, but this was the easiest way to working code.

jonkeane · 2022-10-20T20:34:07Z

tests/test_datalogistik.py

-    exact_dataset = dataset.Dataset(name="fanniemae_sample", format="csv", delim="|")
+    # import pdb; pdb.set_trace()
+    exact_dataset = dataset.Dataset(
+        name="fanniemae_sample", format="csv", delim="|", compression="gzip"


It turns out we were missing the compression here, so this was actually transforming a dataset (oops!). I've added to the test below an assertion that the hash is the same so we catch that if it happens again (with a comment why...)

Oh wow good catch, it has proved rather difficult to get the default behavior right in all situations! Maybe we should reconsider the None versus "uncompressed" distinction (where None is basically "dont-care"), then our equality operator could just return equal if one of the objects has compression "dont-care"

jonkeane · 2022-10-20T20:35:29Z

tests/test_dataset.py

+    "table_url,rel_url_path,final_url",
+    [
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+            None,
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+        ),
+        # we can append rel_url_path
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic",
+            "chi_traffic_2020_Q1.parquet",
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+        ),
+        # we don't duplicate rel_url_path when it's there
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+            "chi_traffic_2020_Q1.parquet",
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+        ),
+        # we can have a tablename that is differnet from rel_url_path
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic",
+            "traffic.parquet",
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/traffic.parquet",
+        ),
+    ],


This is probably overkill (basically to support someone specifying the full url and the rel_path), but it's a handful of lines of code + some tests to cover a relatively confusing bit. We can pull it out (or turn the fix-up code that removes the filename if it's at the end of the url into validation that tells someone they've messed up their repo code)

I like it this way, because it's nice to still support a simple url (that is not made up by a base + rel_url_path)

jonkeane · 2022-10-20T20:38:37Z

tests/test_dataset.py

+    # We probably should move write_metadata out from download, but let's mock it for now
+    def _fake_write_metadata(self):
+        return True
+
+    monkeypatch.setattr(Dataset, "write_metadata", _fake_write_metadata)
+
+    # This probably also means that set_readonly isn't in the right place, but let's mock it away nonetheless
+    def _fake_set_readonly(path):
+        return True
+
+    monkeypatch.setattr("datalogistik.util.set_readonly", _fake_set_readonly)


In working on these tests I've left things in place as much as possible, but that we need to monkey patch these functions tells me we probably don't have Dataset.download() factored like we should. I would propose we have a method that does the downloading and no validation or anything (maybe Dataset._download()`), and then we wrap that with the validation etc. stuff later (so then when we test that we only need to mock the actual downloading bit and not also permissions+metadata writing+permissions munging).

#114

jonkeane · 2022-10-20T20:40:07Z

tests/test_dataset.py

+    # add rel_url_path as if this were coming from a repo file with that.
+    ds = copy.deepcopy(multi_file_ds)
+    files = ds.tables[0].files
+    # add rel_url_path
+    [fl.update({"rel_url_path": fl["rel_path"]}) for fl in files]
+    # remote rel_path
+    [fl.pop("rel_path") for fl in files]


This is a bit funny, but it's because we're using metadata from the fixture cache to then test downloading. So we need to add rel_url_path in here as if it were in repo.json

We could also add rel_url_path in to the fixtures and take out some of this (though it's good that we're testing the rel_path setting which (rightfully) happens during downloading.

We could add rel_url_path in the metadata files in the fixture cache, I think these properties will be carried from the repo file entry so they would be there in a real metadata file too

jonkeane · 2022-10-20T20:40:38Z

tests/test_dataset.py

@@ -298,7 +392,8 @@ def _simulate_download_by_copying(url, output_path):
            # Note that if we were not working in a tmp dir, the dir should have been deleted
            # and the exception message would be: "File integrity check for newly downloaded dataset failed."

-    # Test a succeeding download
+
+def test_validated_download_dataset(monkeypatch):


I broke these out to make them a bit easier to see which failed when. Happy to put them back if we're 👎 on that

joosthooz

Thanks for this, I posed some questions but feel free to merge

joosthooz · 2022-10-21T09:00:13Z

README.rst

+            location on disk in the cache.
+
+        ``rel_url_path``
+            URL path to the file(s), relative to the directory of this table. This is used 


Should this be "relative to the base URL"?

joosthooz · 2022-10-21T09:10:38Z

datalogistik/dataset.py

-                elif table.base_url:
-                    if len(table.files) <= 1:
-                        msg = f"Single-file table '{table.table}' has 'base_url' property set. It should only have a 'url'."
+                for file in table.files:


Don't we want to support repo entries that do not have a files entry? A simple entry could just have a table entry with a url.

That would be nice, but let's do that in a follow on or later PR. The one (open) question there is: do we want to support supplying the checksum, etc. at a higher level without the files entry. That gets complicated so IMO we probably shouldn't, but let's make an issue for that + discuss there

joosthooz · 2022-10-21T09:15:14Z

datalogistik/dataset.py

+                    download_path = table_path
+
+                    # Set the rel_path
+                    file["rel_path"] = download_path.name


Why overwrite file["rel_path"] here? Don't we need to it figure out where to place the file? The contents could be different from download_path.name, especially if it is placed in a subdir (because of partitioning for example), and then the validation would fail

Or did I misunderstand and is rel_path now not a user input anymore? It seems I did, but then my question is; if rel_url_path is set, it is alway identical to rel_path, so do we need the distinction?

it is alway identical to rel_path, so do we need the distinction?

You're right that it's (almost) always identical, but there are two cases where it's not identical:

in the multi-file table scenario below

when the table name is different from the filename

joosthooz · 2022-10-21T09:20:54Z

tests/test_datalogistik.py

-    exact_dataset = dataset.Dataset(name="fanniemae_sample", format="csv", delim="|")
+    # import pdb; pdb.set_trace()
+    exact_dataset = dataset.Dataset(
+        name="fanniemae_sample", format="csv", delim="|", compression="gzip"


Oh wow good catch, it has proved rather difficult to get the default behavior right in all situations! Maybe we should reconsider the None versus "uncompressed" distinction (where None is basically "dont-care"), then our equality operator could just return equal if one of the objects has compression "dont-care"

joosthooz · 2022-10-21T09:22:26Z

README.rst


    ``files``
        A list of files in this table. Each entry in the list has the following properties:

        ``rel_path``
-            Path to the file, relative to the directory of this table.
+            Path to the file(s), relative to the directory of this table. This is the the 


duplicate "the"

joosthooz · 2022-10-21T09:25:38Z

tests/test_dataset.py

+    "table_url,rel_url_path,final_url",
+    [
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+            None,
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+        ),
+        # we can append rel_url_path
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic",
+            "chi_traffic_2020_Q1.parquet",
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+        ),
+        # we don't duplicate rel_url_path when it's there
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+            "chi_traffic_2020_Q1.parquet",
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/chi_traffic_2020_Q1.parquet",
+        ),
+        # we can have a tablename that is differnet from rel_url_path
+        (
+            "https://ursa-qa.s3.amazonaws.com/chitraffic",
+            "traffic.parquet",
+            "https://ursa-qa.s3.amazonaws.com/chitraffic/traffic.parquet",
+        ),
+    ],


I like it this way, because it's nice to still support a simple url (that is not made up by a base + rel_url_path)

joosthooz · 2022-10-21T09:38:03Z

tests/test_dataset.py

+    # add rel_url_path as if this were coming from a repo file with that.
+    ds = copy.deepcopy(multi_file_ds)
+    files = ds.tables[0].files
+    # add rel_url_path
+    [fl.update({"rel_url_path": fl["rel_path"]}) for fl in files]
+    # remote rel_path
+    [fl.pop("rel_path") for fl in files]


We could add rel_url_path in the metadata files in the fixture cache, I think these properties will be carried from the repo file entry so they would be there in a real metadata file too

separate rel_path from rel_url_path, use one url

665f4cb

jonkeane commented Oct 20, 2022

View reviewed changes

jonkeane requested review from joosthooz and austin3dickey October 20, 2022 20:40

joosthooz approved these changes Oct 21, 2022

View reviewed changes

jonkeane added 2 commits October 25, 2022 09:49

PR comments

5a82ec5

Windows CI silliness

93059cd

jonkeane merged commit 8c7d9e4 into main Oct 25, 2022

jonkeane deleted the rel_path branch October 25, 2022 15:15

jonkeane mentioned this pull request Oct 25, 2022

Add in a missing comment that pre-commit failed to include #115

Merged

joosthooz mentioned this pull request Nov 1, 2022

Differentiate rel_path as url suffix and relative path in cache #111

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate `rel_path` from `rel_url_path`, use one `url` #113

Separate `rel_path` from `rel_url_path`, use one `url` #113

jonkeane commented Oct 20, 2022 •

edited

Loading

jonkeane Oct 20, 2022

austin3dickey Oct 21, 2022

jonkeane Oct 20, 2022

jonkeane Oct 20, 2022

joosthooz Oct 21, 2022

jonkeane Oct 20, 2022

joosthooz Oct 21, 2022

jonkeane Oct 20, 2022

austin3dickey Oct 21, 2022

jonkeane Oct 20, 2022

joosthooz Oct 21, 2022

jonkeane Oct 20, 2022

joosthooz left a comment

joosthooz Oct 21, 2022

joosthooz Oct 21, 2022

jonkeane Oct 25, 2022

joosthooz Oct 21, 2022

jonkeane Oct 25, 2022

joosthooz Oct 21, 2022

joosthooz Oct 21, 2022

joosthooz Oct 21, 2022

joosthooz Oct 21, 2022

		# drop the rel_url_path for comparison, because it's not relevant!
		orig_file.pop("rel_url_path", None)

Separate rel_path from rel_url_path, use one url #113

Separate rel_path from rel_url_path, use one url #113

Conversation

jonkeane commented Oct 20, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

joosthooz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Separate `rel_path` from `rel_url_path`, use one `url` #113

Separate `rel_path` from `rel_url_path`, use one `url` #113

jonkeane commented Oct 20, 2022 •

edited

Loading