Add functionality to specify schema #41
Conversation
From #36 (comment):

"This code has the basic functionality and is able to convert the fanniemae dataset. It can't handle datatypes with parameters yet."
Out of curiosity, are there files/datasets where 1MB has not been enough to resolve the schema? I ask because this could be an improvement that Arrow would benefit from as well.
They definitely exist, and there's no magic number here; at some point you just have to specify a schema. An example is an export from a database table sorted by timestamp where a nullable field was added at some point—values may be entirely null for millions+ of observations even though there's useful data at the end of the dataset. Especially for interactive work, being able to specify a partial schema just for columns you know are screwy can be handy; readr::cols() takes this approach with a default of inferring types for unspecified columns. Obviously specifying a complete schema is better, but forcing users to do so can lead to other antipatterns. What we have is good, just maybe a little idealistic.
@alistaire47 Thank you for the explanation. A possible solution for file-format standardization is to include metadata with a value per column identifying where the first non-null value appears; that way, tools can fetch the value directly for schema resolution.
@edponce Ooh, that would be great, and would handle a lot of problem cases! The remaining issues I've run into have all been with schema parameters of fields that are difficult to infer from data, e.g. the unit/timezone for timestamps. I think at least timestamp unit handling may be improved (i.e. they round-trip file <-> memory correctly with defaults) in more recent parquet versions, though?
@edponce thanks for contributing to the discussion! Suppose you go through the trouble of storing the row number where the first value is encountered; wouldn't it be just as much work to simply store the datatype of the column?
(Related JIRA BTW: https://issues.apache.org/jira/browse/ARROW-8221)
@joosthooz yeah, you are right. There are many ways data details can be provided to help with schema resolution. In an ideal world, most data formats in file/memory would have metadata (min, max, null count, nrows, ncols, types, etc.) that is consistent across formats. Another issue, which seems more related to this PR's work, revolves around how to handle data updates. Viewing data as a data structure that can be modified directly creates many issues; if datasets could only be modified via some sort of API, then metadata could be updated accordingly during data updates. BTW, this is just me thinking out loud. Thanks for the discussion!
Is it possible to have this PR be based on main instead of the validation branch? I know they might be inter-related in a way that's hard to tease apart, but it would make reviewing much easier if I could see only the schema changes here and not the validation bits.
nice, this is looking good!
Thanks for this; it's looking good. I've commented in a few places about the validation code that seems to be in here, though I think that code isn't actually intended to be merged in this PR (a rebase might help, or possibly even pulling those commits out entirely?).
We should definitely add some more tests that check various conversion possibilities (and now might be a good time to set up a general smallish dataset, containing all or many of the types we expect to see, that we can reuse across our tests).
datalogistik/util.py
Outdated
def arrow_type_function_lookup(function_name):
    if isinstance(function_name, str):
        pa_type_func = getattr(pa, function_name)
        return pa_type_func

    # The argument was not a pyarrow type (maybe a nested structure?)
    return None
Would you mind adding some tests for this function? I find it super helpful to read through tests for things like this to get a better idea of what the code is trying to do and what its defensiveness is trying to prevent.
I added tests for these new functions
tests/test_datalogistik.py
Outdated
    "format": "parquet",
    "partitioning-nrows": 0,
}
pydict = {"int": [1, 2], "str": ["a", "b"]}
Would it be helpful to have a dataset with one of each type of class and reuse that in tests like this?
tests/test_datalogistik.py
Outdated
).to_table()
print(converted_table.schema)
assert converted_table == orig_table
util.prune_cache_entry("test_parquet")
I'm not sure if this test does this (it doesn't look like it does, but maybe I'm missing something!), but could we add a test of the actual roundtrip and actually converting schema types?
It would be good to capture changes like int32 to float, int32 to int64, a numeric of some sort to a string, etc. Ideally we would have tests for all of them to confirm that they really do work (though this should fall through to pyarrow pretty quickly, so maybe we don't need total coverage, but a decent amount of coverage would give us confidence that we can do what we say we do).
This functionality is not supported yet, and I think it falls outside the scope of this PR. Even though the code for this might be rather simple, how to expose it to the user will be a challenge! Created a new issue: #58
It seems I need to rebase this; I thought retargeting to the main branch after merging #34 would fix it, but it didn't.
Ok, things should be up to date again; I'll get started on the recommendations.
…h (non-nested) Arrow datatype
…rwrite in the metadata after conversion
…t does not keep ordering of partitions
datalogistik/util.py
Outdated
co = csv.ConvertOptions()
# TODO: Should we autogenerate column names by default?
# Or add a property in the metadata about it?
# Or allow a fall-back to read_csv in case schema detection fails?
We've got both csv datasets with no headers (fanniemae) and with headers (nyctaxi). If we're specifying a schema, we should use the names specified there. If we're not...well, we probably should be
But we should be careful not to accidentally skip the first line when there are no headers, and not to overwrite names inferred from the first line of data; we need to store metadata about this for each dataset.
I added a property for this! We can now handle both. Default is no header line, so field names will be auto-generated. I had to upgrade our Arrow version for this, though, because I reported a problem with field name auto-generation which was resolved in v9.0.0... https://issues.apache.org/jira/browse/ARROW-16436
This is looking great, especially all these comprehensive tests! Just a few small things.
A couple of nits, but this looks really good!
        field_list.append(pa.field(field_name, arrow_type))

    output_schema = pa.schema(field_list)
    return output_schema
This is super useful. Do we have the reverse (serialize a schema to json) here somewhere as well? (Not sure if it's necessary here at the moment, but I've run into cases where it would've been very handy.) It might be nice to put all this schema stuff in a separate file so when I, um, borrow this code in the future it's in a nice self-contained package.
I know, that would be really handy; the inferred-schema was actually a workaround for not having it at first (but after thinking about it, I think it is good to know the difference between a user-specified schema and an inferred one).
The other way is more difficult, though. It might come down to a large case statement covering all the types with parameters. I remember seeing some Rust code for this at some point, maybe in DataFusion, but I couldn't find it anymore.
Let's implement this in the near future.
Side comment; I found this old spec for Arrow Schema JSON representation: https://github.com/apache/arrow/pull/158/files
    return arrow_type_function_lookup(type_name)(args)


# Convert the given dict to a pyarrow.schema
If we make all these comments into docstrings they'll pop up nicely for us in editors
Yeah, I should've done that properly from the start. Maybe we should start a PR that does that and adds type hints.
    compression=parquet_compression,
    use_deprecated_int96_timestamps=False,
    coerce_timestamps="us",
If you've upgraded to Arrow 9.0.0, this may be unnecessary now; some of the weird behavior around this was part of the old parquet format versions, and the default version got bumped in 9.0.0. Fine to leave here for safety, though.
Instead of relying on pyarrow.dataset's inference, we would like to be able to specify a schema in the repo file. This will help solve type problems in large files, because pyarrow.dataset inspects only the first 1MB of data to infer column types.