docs: 📝 pseudo code and docstring for write_resource_parquet()
#816
base: main
Conversation
@@ -0,0 +1,79 @@

```python
# ruff: noqa
def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
```
I couldn't decide how to handle the properties part, or even whether we should do any verification against the properties. My reasoning here is that maybe, just maybe, someone might make a change in the files outside of Sprout. So I thought, we can assume generally that the raw data are already verified, but can't guarantee it. Better to be safe than sorry!
I could also see a situation where someone would change the files outside Sprout
```python
    Examples:

    ``` python
```
Based on the decision post for using quartodoc to generate the docstrings into the website, we can actually include executable code in the examples by using `{python}`. Without the `{}`, it will not execute. Just something to keep in mind, if we want to show users what it might look like when executed.
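As a rough sketch of what that could look like inside a docstring (the function here is made up purely to show the `{python}` fence, and this assumes quartodoc passes the example through to Quarto for execution):

````python
def add_one(x: int) -> int:
    """Add one to a number.

    Examples:
        ```{python}
        # With the braces, Quarto executes this cell when the page renders.
        add_one(1)
        ```
    """
    return x + 1
````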
```python
    data_list = [polars.read_csv(path) for path in paths]
    # Merge them all together.
    data = polars.concat(data_list)
```
I added this real code here because I was curious how Polars might do it. It seems it can!
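For reference, a minimal runnable sketch of that merge step, assuming the raw files exist and share a schema (the file names here are made up):

```python
from pathlib import Path

import polars

# Hypothetical raw CSV files from two upload rounds.
raw_files_path = [Path("raw/2024-01.csv"), Path("raw/2024-02.csv")]

# Read each raw file into its own DataFrame.
data_list = [polars.read_csv(path) for path in raw_files_path]

# Stack them vertically; polars raises if the schemas don't line up.
data = polars.concat(data_list)
```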
```python
    verify_data(data, properties)

    # Could include validation here?
    # validate_data(data, properties)
```
I think we can keep this as a TODO item and keep it commented, but not implement it for now.
Very nice!! Just some questions.
I think I've developed some confusion about what the raw data files represent. Are they different versions of the data (with later versions overwriting earlier ones) or different sections of the data (e.g. one file for rows 1-100 and another one for rows 101-200)? Well, I guess there is no reason why they couldn't be used as both...
@@ -0,0 +1,79 @@

```python
# ruff: noqa
def write_resource_parquet(
    raw_files_path: list[Path], parquet_path: Path, properties_path: Path
```
Is it necessary to have all these paths as arguments? I'm just thinking that given either `raw_files_path` or `parquet_path`, the others can be constructed. (Or just given the two IDs they can be constructed.)
Yea, I was thinking that, but at least within core, I'd really like to avoid having internal assumptions about where things are located in the folder. Hence why I wrote up the `path_` functions, so that there is that separation. In the CLI and in the app, we should definitely have that assumption, but I'm not sure about in core. I'd like to keep it a bit more flexible, just in case.
Okay, that makes sense! I'm just thinking that sometimes (not necessarily here) you want to go from one path to another one (e.g. you know the resource folder path and want the path to `datapackage.json`), and I'm not always sure what I'm allowed to assume / how I'm allowed to manipulate them. (Not really a question now, but I'll ask it properly when it comes up 😅)
What would you think about adding some functions that are the inverse of the path functions? So for `path_package(package_id: int) -> Path` we could have `package_id_from_path_package(path: Path) -> int`. This wouldn't add any new assumptions (other than the invertibility) and it would allow us to go from one path to another via the IDs. We could structure them analogously to the path functions and keep them all together.
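A minimal sketch of that pairing, assuming a hypothetical layout where each package lives under `packages/<package_id>/` (the layout is invented here just for illustration):

```python
from pathlib import Path

def path_package(package_id: int) -> Path:
    # Map a package ID to its (hypothetical) folder.
    return Path("packages") / str(package_id)

def package_id_from_path_package(path: Path) -> int:
    # Invert path_package() by reading the ID back off the folder name.
    return int(path.name)
```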
```python
    return data


def verify_data(data: DataFrame, properties: dict) -> Path:
```
`frictionless-py` should be able to do this for us, hopefully!
Are we moving away from `frictionless-py` altogether, or do we want to use it for verification still? In relation to the discussion in #826.
```python
    # Confirms that the data matches the resource properties found in `datapackage.json`.
    # Not sure if this is the best solution here to load in the properties file.
    verify_is_file(properties_path)
```
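As a rough sketch of the loading step being questioned here (the helper name `read_properties` is made up, and the JSON layout follows the Data Package spec; neither is the decided approach):

```python
import json
from pathlib import Path

def read_properties(properties_path: Path) -> dict:
    # datapackage.json holds the package properties, including
    # the per-resource schemas that verify_data() would check against.
    return json.loads(properties_path.read_text())
```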
Can the verification/validation methods be the same as in the `write_resource_data_to_raw` flow? There, we're processing one raw resource, while here it's all of them, but presumably the logic would be the same?

I think I'm a little confused about the difference between `verify_data` and `validate_data`. In the diagram for `write_resource_data_to_raw` it is the `validate` function that checks against the properties, but here it's `verify`.
For the first question: yea, that makes sense, to have it the same as in `write_resource_data_to_raw()`.

For the second question: both `verify_data()` and `validate_data()` compare against properties. Validating compares against only the `constraints` field. Think of the difference like this: with verify, it is checking that a given column is called `age`, that it is an `int` or `float`, and that it has a given number of rows. With validate, it is checking that `age` is between 0 and 120 (since most humans don't live past 120), but only if that `constraints` field is provided by the user.

I don't know how frictionless handles this; based on how I read the docs, it seems the validate function does mostly verifying. Which is a bit annoying semantically, but 🤷... Maybe I'll open an issue in their repo 😛
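To make that split concrete, here is a hypothetical field descriptor in the style of the Frictionless Table Schema (the exact shape Sprout would use is not decided here):

```python
# Hypothetical resource properties for a single `age` column.
properties = {
    "schema": {
        "fields": [
            {
                "name": "age",     # verify_data(): checks name and type
                "type": "integer",
                "constraints": {   # validate_data(): only this optional block
                    "minimum": 0,
                    "maximum": 120,
                },
            }
        ]
    }
}
```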
Ah okay, I get it now, thanks! Yeah, based on what I've seen, their `validate` does both: it checks the data against the type and number of fields specified in the schema (so verifies it) and also checks it against any constraints (so validates it). If we want to treat these separately, we could try filtering the error messages (like we do for `ResourceError` vs `PackageError`).

Should we treat these separately? I've looked quite a bit at this for `write_resource_data_to_raw`, so could do it either way.
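A rough sketch of that filtering idea, assuming a frictionless-py v5-style API; the error-type name `"constraint-error"` is my reading of their docs and would need checking:

```python
from frictionless import Resource

# Validate the data against its schema; the report mixes both kinds of failure.
report = Resource("data.csv", schema="schema.json").validate()
errors = report.flatten(["type", "message"])

# Treat constraint failures as "validate" errors, everything else as "verify".
validation_errors = [e for e in errors if e[0] == "constraint-error"]
verification_errors = [e for e in errors if e[0] != "constraint-error"]
```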
In relation to the specifics of this, I'm awaiting the discussion in #826.

@martonvago I forgot to respond to your initial question. Raw files are kept from the initial upload to keep a record, just in case something happens. A potential scenario: a first round of surveys is sent to people and that data gets uploaded to Sprout. That's one raw file. Maybe a few months later, the same survey is sent out and that data gets uploaded. That's another raw file. So those two raw files get merged together and saved as the Parquet file.
The overall picture of this makes sense to me as well 👍
Description
Based on @martonvago's suggestion, I'll write things in "pseudocode" from now on. But instead of pseudocode, I will write an outline of the Python function with how I think it might flow inside. Plus, I can write the full docstrings inside, so we don't need to move them over from the Quarto doc. I have NOT run this, tested it, or executed it in any way; this is purely how I think it might work, hence "pseudo" 😛. I'll add some comments directly to the code in the PR.
Closes #642
This PR needs an in-depth review.
Checklist