
docs: 📝 pseudo code and docstring for write_resource_parquet() #816

Draft · wants to merge 3 commits into main

Conversation

@lwjohnst86 (Member) commented on Oct 25, 2024:

Description

Based on @martonvago's suggestion, I'll write things in "pseudocode" from now on. But instead of pseudocode, I will write an outline of the Python function with how I think it might flow inside. Plus, I can write the full docstrings inside, so we don't need to move them over from the Quarto doc. I have NOT run, tested, or executed this in any way; this is purely how I think it might work, hence "pseudo" 😛. I'll add some comments directly to the code in the PR.

Closes #642

This PR needs an in-depth review.

Checklist

  • Updated documentation

@lwjohnst86 lwjohnst86 requested a review from a team as a code owner October 25, 2024 02:32
sprout/core/write_resource_parquet.py (outdated)
@@ -0,0 +1,79 @@
# ruff: noqa
def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
@lwjohnst86 (Member, Author) commented on Oct 25, 2024:

I couldn't decide how to handle the properties part, or even whether we should do any verification against the properties. My reasoning is that maybe, just maybe, someone might change the files outside of Sprout. So we can generally assume the raw data are already verified, but we can't guarantee it. Better to be safe than sorry!

@signekb (Member):

I could also see a situation where someone would change the files outside Sprout

sprout/core/write_resource_parquet.py (outdated)

Examples:

``` python
@lwjohnst86 (Member, Author):

Based on the decision post for using quartodoc to generate the docstrings into the website, we can actually include executable code in the examples by using `{python}`. Without the curly braces, it will not execute. Just something to keep in mind, if we want to show users what it might look like when executed.
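To illustrate (a sketch, not the actual docstring in this PR; the signature is copied from the diff): a docstring rendered through quartodoc/Quarto could contain both kinds of chunks.

```` python
from pathlib import Path


def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
):
    """Merge the raw files and write them as a Parquet file.

    Examples:

        ```{python}
        # With the curly braces, Quarto executes this chunk when
        # quartodoc renders the docs site.
        ```

        ```python
        # Without them, the chunk is displayed but never run.
        ```
    """
````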

Comment on lines 65 to 67
data_list = [polars.read_csv(path) for path in paths]
# Merge them all together.
data = polars.concat(data_list)
@lwjohnst86 (Member, Author):

I added this real code here because I was curious how Polars might do it. It seems it can!
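As a self-contained sketch of that merge step (the file names are made up for illustration):

``` python
from pathlib import Path

import polars

# Hypothetical raw files from two upload batches.
paths = [Path("raw/batch-1.csv"), Path("raw/batch-2.csv")]

# Read each raw file into its own DataFrame.
data_list = [polars.read_csv(path) for path in paths]

# Stack them vertically; this assumes all files share the same columns.
data = polars.concat(data_list)

# Write the merged data out as a single Parquet file.
data.write_parquet("data.parquet")
```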

verify_data(data, properties)

# Could include validation here?
# validate_data(data, properties)
@lwjohnst86 (Member, Author):

I think we can keep this as a TODO item and keep it commented, but not implement it for now.

@martonvago (Contributor) left a comment:

Very nice!! Just some questions.

I think I've developed some confusion about what the raw data files represent. Are they different versions of the data (with later versions overwriting earlier ones) or different sections of the data (e.g. one file for rows 1-100 and another one for rows 101-200)? Well, I guess there is no reason why they couldn't be used as both...

sprout/core/write_resource_parquet.py (outdated)
@@ -0,0 +1,79 @@
# ruff: noqa
def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
@martonvago (Contributor):

Is it necessary to have all these paths as arguments? I'm just thinking that given either raw_files_path or parquet_path, the others can be constructed. (Or just given the two IDs they can be constructed.)

@lwjohnst86 (Member, Author):

Yea, I was thinking that, but at least within core, I'd really like to avoid having internal assumptions about where things are located in the folder. That's why I wrote up the path_ functions, so that there is that separation. In the CLI and in the app, we should definitely have that assumption, but I'm not sure about in core. I'd like to keep it a bit more flexible, just in case.

@martonvago (Contributor):

Okay, that makes sense! I'm just thinking that sometimes (not necessarily here) you want to go from one path to another one (e.g. you know the resource folder path and want the path to datapackage.json), and I'm not always sure what I'm allowed to assume / how I'm allowed to manipulate them. (Not really a question now, but I'll ask it properly when it comes up 😅 )

@martonvago (Contributor):

What would you think about adding some functions that are the inverse of the path functions? So for path_package(package_id: int) -> Path we could have package_id_from_path_package(path: Path) -> int. This wouldn’t add any new assumptions (other than the invertibility) and it would allow us to go from one path to another via the IDs. We could structure them analogously to the path functions and keep them all together.
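A minimal sketch of such an inverse, assuming purely for illustration that path_package() puts each package in a folder named by its ID:

``` python
from pathlib import Path


def path_package(package_id: int) -> Path:
    # Hypothetical layout: each package lives in packages/<id>/.
    return Path("packages") / str(package_id)


def package_id_from_path_package(path: Path) -> int:
    # Invert path_package() by reading the ID back out of the
    # final path component.
    return int(path.name)


assert package_id_from_path_package(path_package(3)) == 3
```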

return data


def verify_data(data: DataFrame, properties: dict) -> Path:
@martonvago (Contributor):

frictionless-py should be able to do this for us hopefully!

@signekb (Member):

Are we moving away from frictionless-py altogether, or do we want to use it for verification still? In relation to the discussion in #826


# Confirms that the data matches the resource properties found in `datapackage.json`.
# Not sure if this is the best solution here to load in the properties file.
verify_is_file(properties_path)
@martonvago (Contributor):

Can the verification/validation methods be the same as in the write_resource_data_to_raw flow? There, we're processing one raw resource, while here it's all of them, but presumably the logic would be the same?

I think I'm a little confused about the difference between verify_data and validate_data. In the diagram for write_resource_data_to_raw it is the validate function that checks against the properties, but here it's verify.

@lwjohnst86 (Member, Author):

For the first question: yea, that makes sense, to have it the same as in write_resource_data_to_raw().

For the second question: both verify_data() and validate_data() compare against properties. Validating compares against only the constraints field. Think of the difference this way: with verify, it is checking that a given column is called age, that it is an int or float, and that it has a given number of rows. With validate, it is checking that age is between 0 and 120 (since most humans don't live past 120), but only if that constraints field is provided by the user.

I don't know how frictionless handles this; based on how I read the docs, it seems the validate function does mostly verifying. Which is a bit annoying semantically, but 🤷... Maybe I'll open an issue in their repo 😛
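As a toy sketch of that split (the helper bodies here are hypothetical, not Sprout's actual implementation):

``` python
import polars

# Hypothetical properties for a single `age` column.
properties = {
    "name": "age",
    "type": "integer",
    "constraints": {"minimum": 0, "maximum": 120},
}

data = polars.DataFrame({"age": [34, 56, 12]})


def verify_data(data: polars.DataFrame, properties: dict) -> None:
    # Verify structure: the column exists and has a numeric type.
    assert properties["name"] in data.columns
    assert data.schema[properties["name"]] in (polars.Int64, polars.Float64)


def validate_data(data: polars.DataFrame, properties: dict) -> None:
    # Validate values, but only if the user supplied constraints.
    constraints = properties.get("constraints")
    if constraints is None:
        return
    column = data[properties["name"]]
    assert column.min() >= constraints["minimum"]
    assert column.max() <= constraints["maximum"]


verify_data(data, properties)
validate_data(data, properties)
```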

@martonvago (Contributor):

Ah okay, I get it now, thanks! Yeah, based on what I've seen, their validate does both: it checks the data against the type and number of fields specified in the schema (so verifies it) and also checks it against any constraints (so validates it). If we want to treat these separately, we could try filtering the error messages (like we do for ResourceError vs PackageError).

Should we treat these separately? I've looked quite a bit at this for write_resource_data_to_raw, so could do it either way.
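For instance, assuming frictionless-py's documented Report.flatten() API (the file paths here are placeholders), the filtering might look like:

``` python
from frictionless import validate

# Validate the data against its schema; frictionless reports structural
# (verification) and constraint (validation) problems in one report.
report = validate("data.csv", schema="schema.json")

# Flatten the errors and split them by error type, similar to how
# ResourceError vs PackageError are filtered elsewhere.
errors = report.flatten(["type", "message"])
constraint_errors = [e for e in errors if e[0] == "constraint-error"]
verification_errors = [e for e in errors if e[0] != "constraint-error"]
```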

@signekb (Member):

In relation to the specifics of this, I'm awaiting the discussion in #826

@lwjohnst86 (Member, Author):

@martonvago I forgot to respond to your initial question.

Raw files are kept from the initial upload to keep a record just in case something happens.

A potential scenario: a first round of surveys is sent to people and that data gets uploaded to Sprout. That's one raw file. Maybe a few months later, the same survey is sent out and that data gets uploaded. That's another raw file. Those two raw files then get merged together and saved as the data.parquet file, which would be the file that researchers actually use for analyses.

@signekb (Member) left a comment:

The overall picture of this makes sense to me as well 👍


@lwjohnst86 lwjohnst86 marked this pull request as draft November 13, 2024 14:13
Labels: none yet
Projects: Status: Todo
Successfully merging this pull request may close: Create internal flow diagram for write_resource_parquet()
3 participants