
docs: 📝 pseudo code and docstring for write_resource_parquet() #816

Draft · wants to merge 3 commits into main

Conversation

@lwjohnst86 (Member) commented on Oct 25, 2024:

Description

Based on @martonvago's suggestion, I'll write things in "pseudocode" from now on. But instead of pseudocode, I will write an outline of the Python function with how I think it might flow inside. Plus, I can write the full docstrings inside, so we don't need to move them over from the Quarto doc. I have NOT run, tested, or executed this in any way; this is purely how I think it might work, hence "pseudo" 😛. I'll add some comments directly to the code in the PR.

Closes #642

This PR needs an in-depth review.

Checklist

  • Updated documentation

@lwjohnst86 lwjohnst86 requested a review from a team as a code owner October 25, 2024 02:32
sprout/core/write_resource_parquet.py (outdated)
@@ -0,0 +1,79 @@
# ruff: noqa
def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
@lwjohnst86 (Member, Author) commented on Oct 25, 2024:

I couldn't decide how to handle the properties part, or even whether we should do any verification against the properties. My reasoning is that maybe, just maybe, someone might change the files outside of Sprout. So we can generally assume the raw data are already verified, but we can't guarantee it. Better to be safe than sorry!

@signekb (Member):

I could also see a situation where someone would change the files outside Sprout

sprout/core/write_resource_parquet.py (outdated)

Examples:

``` python
@lwjohnst86 (Member, Author):

Based on the decision post for using quartodoc to generate the docstrings into the website, we can actually include executable code in the examples by using `{python}`. Without the curly braces, it will not execute. Just something to keep in mind, if we want to show users what it might look like when executed.
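To illustrate (a sketch, not the actual docstring in this PR; the signature is copied from the diff): a docstring rendered through quartodoc/Quarto could contain both kinds of chunks.

```` python
from pathlib import Path


def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
):
    """Merge the raw files and write them as a Parquet file.

    Examples:

        ```{python}
        # With the curly braces, Quarto executes this chunk when
        # quartodoc renders the docs site.
        ```

        ```python
        # Without them, the chunk is displayed but never run.
        ```
    """
````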

Comment on lines 65 to 67
data_list = [polars.read_csv(path) for path in paths]
# Merge them all together.
data = polars.concat(data_list)
@lwjohnst86 (Member, Author):

I added this real code here because I was curious how Polars might do it. It seems it can!
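As a self-contained sketch of that merge step (the file names are made up for illustration):

``` python
from pathlib import Path

import polars

# Hypothetical raw files from two upload batches.
paths = [Path("raw/batch-1.csv"), Path("raw/batch-2.csv")]

# Read each raw file into its own DataFrame.
data_list = [polars.read_csv(path) for path in paths]

# Stack them vertically; this assumes all files share the same columns.
data = polars.concat(data_list)

# Write the merged data out as a single Parquet file.
data.write_parquet("data.parquet")
```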

verify_data(data, properties)

# Could include validation here?
# validate_data(data, properties)
@lwjohnst86 (Member, Author):

I think we can keep this as a TODO item and keep it commented, but not implement it for now.

@martonvago (Contributor) left a comment:

Very nice!! Just some questions.

I think I've developed some confusion about what the raw data files represent. Are they different versions of the data (with later versions overwriting earlier ones) or different sections of the data (e.g. one file for rows 1-100 and another one for rows 101-200)? Well, I guess there is no reason why they couldn't be used as both...

sprout/core/write_resource_parquet.py (outdated)
@@ -0,0 +1,79 @@
# ruff: noqa
def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
@martonvago (Contributor):

Is it necessary to have all these paths as arguments? I'm just thinking that given either raw_files_path or parquet_path, the others can be constructed. (Or just given the two IDs they can be constructed.)

@lwjohnst86 (Member, Author):

Yea, I was thinking that, but at least within core, I'd really like to avoid having internal assumptions about where things are located in the folder. That's why I wrote up the path_ functions, so that there is that separation. In the CLI and in the app, we should definitely have that assumption, but I'm not sure about in core. I'd like to keep it a bit more flexible, just in case.

@martonvago (Contributor):

Okay, that makes sense! I'm just thinking that sometimes (not necessarily here) you want to go from one path to another one (e.g. you know the resource folder path and want the path to datapackage.json), and I'm not always sure what I'm allowed to assume / how I'm allowed to manipulate them. (Not really a question now, but I'll ask it properly when it comes up 😅 )

@martonvago (Contributor):

What would you think about adding some functions that are the inverse of the path functions? So for path_package(package_id: int) -> Path we could have package_id_from_path_package(path: Path) -> int. This wouldn’t add any new assumptions (other than the invertibility) and it would allow us to go from one path to another via the IDs. We could structure them analogously to the path functions and keep them all together.
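A minimal sketch of such an inverse, assuming purely for illustration that path_package() puts each package in a folder named by its ID:

``` python
from pathlib import Path


def path_package(package_id: int) -> Path:
    # Hypothetical layout: each package lives in packages/<id>/.
    return Path("packages") / str(package_id)


def package_id_from_path_package(path: Path) -> int:
    # Invert path_package() by reading the ID back out of the
    # final path component.
    return int(path.name)


assert package_id_from_path_package(path_package(3)) == 3
```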

return data


def verify_data(data: DataFrame, properties: dict) -> Path:
@martonvago (Contributor):

frictionless-py should be able to do this for us hopefully!

@signekb (Member):

Are we moving away from frictionless-py altogether, or do we want to use it for verification still? In relation to the discussion in #826


# Confirms that the data matches the resource properties found in `datapackage.json`.
# Not sure if this is the best solution here to load in the properties file.
verify_is_file(properties_path)
@martonvago (Contributor):

Can the verification/validation methods be the same as in the write_resource_data_to_raw flow? There, we're processing one raw resource, while here it's all of them, but presumably the logic would be the same?

I think I'm a little confused about the difference between verify_data and validate_data. In the diagram for write_resource_data_to_raw it is the validate function that checks against the properties, but here it's verify.

@lwjohnst86 (Member, Author):

For the first question: yea, that makes sense, to have it the same as in write_resource_data_to_raw().

For the second question: both verify_data() and validate_data() compare against properties. Validating compares against only the constraints field. Think of the difference this way: with verify, it is checking that a given column is called age, that it is an int or float, and that it has a given number of rows. With validate, it is checking that age is between 0 and 120 (since most humans don't live past 120), but only if that constraints field is provided by the user.

I don't know how frictionless handles this; based on how I read the docs, it seems the validate function does mostly verifying. Which is a bit annoying semantically, but 🤷... Maybe I'll open an issue in their repo 😛
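As a toy sketch of that split (the helper bodies here are hypothetical, not Sprout's actual implementation):

``` python
import polars

# Hypothetical properties for a single `age` column.
properties = {
    "name": "age",
    "type": "integer",
    "constraints": {"minimum": 0, "maximum": 120},
}

data = polars.DataFrame({"age": [34, 56, 12]})


def verify_data(data: polars.DataFrame, properties: dict) -> None:
    # Verify structure: the column exists and has a numeric type.
    assert properties["name"] in data.columns
    assert data.schema[properties["name"]] in (polars.Int64, polars.Float64)


def validate_data(data: polars.DataFrame, properties: dict) -> None:
    # Validate values, but only if the user supplied constraints.
    constraints = properties.get("constraints")
    if constraints is None:
        return
    column = data[properties["name"]]
    assert column.min() >= constraints["minimum"]
    assert column.max() <= constraints["maximum"]


verify_data(data, properties)
validate_data(data, properties)
```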

@martonvago (Contributor):

Ah okay, I get it now, thanks! Yeah, based on what I've seen, their validate does both: it checks the data against the type and number of fields specified in the schema (so verifies it) and also checks it against any constraints (so validates it). If we want to treat these separately, we could try filtering the error messages (like we do for ResourceError vs PackageError).

Should we treat these separately? I've looked quite a bit at this for write_resource_data_to_raw, so could do it either way.
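For instance, assuming frictionless-py's documented Report.flatten() API (the file paths here are placeholders), the filtering might look like:

``` python
from frictionless import validate

# Validate the data against its schema; frictionless reports structural
# (verification) and constraint (validation) problems in one report.
report = validate("data.csv", schema="schema.json")

# Flatten the errors and split them by error type, similar to how
# ResourceError vs PackageError are filtered elsewhere.
errors = report.flatten(["type", "message"])
constraint_errors = [e for e in errors if e[0] == "constraint-error"]
verification_errors = [e for e in errors if e[0] != "constraint-error"]
```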

@signekb (Member):

In relation to the specifics of this, I'm awaiting the discussion in #826

@lwjohnst86 (Member, Author):

@martonvago I forgot to respond to your initial question.

Raw files are kept from the initial upload to keep a record just in case something happens.

A potential scenario: a first round of surveys is sent to people and that data gets uploaded to Sprout. That's one raw file. Maybe a few months later, the same survey is sent out and that data gets uploaded. That's another raw file. Those two raw files then get merged together and saved as the data.parquet file, which would be the file that researchers actually use for analyses.

@signekb (Member) left a comment:

The overall picture of this makes sense to me as well 👍


@lwjohnst86 lwjohnst86 marked this pull request as draft November 13, 2024 14:13
Labels: none yet
Projects: Status: Todo
Successfully merging this pull request may close: Create internal flow diagram for write_resource_parquet()
3 participants