
Add functionality to specify schema #41

Merged: 39 commits, merged into main on Sep 1, 2022

Conversation

@joosthooz (Collaborator)

Instead of relying on pyarrow.dataset's type inference, we would like to be able to specify a schema in the repo file. This will help solve type problems in large files, because pyarrow.dataset only inspects the first 1 MB of data to infer column types.

@joosthooz linked an issue Aug 12, 2022 that may be closed by this pull request
@joosthooz (Collaborator, PR author) commented Aug 12, 2022

From #36 (comment):

We might need to go a little deeper than that for fields with options, e.g. unit/timezone for timestamps and precision/scale for decimal. Maybe just optional kwargs that can be passed through like

{
  "id": {"type_name": "string"},
  "click_ts": {"type_name": "timestamp", "type_args": {"unit": "ms"}},
  "some_value": {"type_name": "decimal128", "type_args": {"precision": 7, "index": 3}},
  ...
}

Or flat or otherwise—I don't really care about arrangement, just that it's flexible enough when we need it
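
For illustration, a mapping like the one above could be turned into a pyarrow.Schema roughly as follows. This is only a sketch: schema_from_dict is a hypothetical name, not the PR's actual helper.

# Hypothetical sketch: build a pyarrow.Schema from the dict layout above.
# Not the PR's implementation; the function name is illustrative.
import pyarrow as pa

def schema_from_dict(column_specs):
    fields = []
    for name, spec in column_specs.items():
        type_factory = getattr(pa, spec["type_name"])  # e.g. pa.timestamp, pa.decimal128
        arrow_type = type_factory(**spec.get("type_args", {}))
        fields.append(pa.field(name, arrow_type))
    return pa.schema(fields)

schema = schema_from_dict({
    "id": {"type_name": "string"},
    "click_ts": {"type_name": "timestamp", "type_args": {"unit": "ms"}},
    "some_value": {"type_name": "decimal128", "type_args": {"precision": 7, "scale": 3}},
})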

@joosthooz (Collaborator, PR author)

This code has the basic functionality and is able to convert the fanniemae dataset. It can't handle datatypes with parameters yet.

@edponce commented Aug 12, 2022

Out of curiosity, are there files/datasets where 1 MB has not been enough to resolve the schema? I ask because this could be an improvement that Arrow would benefit from as well.

@alistaire47 (Collaborator)

Out of curiosity, are there files/datasets where 1 MB has not been enough to resolve the schema? I ask because this could be an improvement that Arrow would benefit from as well.

They definitely exist, and there's no magic number here; at some point you just have to specify a schema. An example is an export from a database table sorted by timestamp where a nullable field was added at some point: values may be entirely null for millions of rows even though there is useful data at the end of the dataset.

Especially for interactive work, being able to specify a partial schema just for columns you know are screwy can be handy; readr::cols() takes this approach with a default of inferring types for unspecified columns. Obviously specifying a complete schema is better, but forcing users to do so can lead to other antipatterns. What we have is good, just maybe a little idealistic.
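
As a point of reference, pyarrow's CSV reader already supports this kind of partial override. A minimal sketch, assuming a hypothetical file and column names:

# Override types only for known-problematic columns and let pyarrow infer the
# rest, similar in spirit to readr::cols(). The file name and column names
# here are made up for illustration.
import pyarrow as pa
from pyarrow import csv

convert_options = csv.ConvertOptions(
    column_types={
        "click_ts": pa.timestamp("ms"),     # force the unit instead of inferring it
        "mostly_null_field": pa.float64(),  # avoid null-type inference on sparse columns
    }
)
table = csv.read_csv("data.csv", convert_options=convert_options)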

@edponce commented Aug 12, 2022

@alistaire47 Thank you for the explanation. A possible solution for file format standardization would be to include metadata with a value per column identifying where the first non-null value appears; that way, tools could fetch that value directly for schema resolution.

@alistaire47 (Collaborator)

A possible solution for file format standardization would be to include metadata with a value per column identifying where the first non-null value appears; that way, tools could fetch that value directly for schema resolution.

@edponce Ooh, that would be great, and it would handle a lot of problem cases! The remaining issues I've run into have all been about schema parameters of fields that are difficult to infer from data, e.g. unit/timezone for timestamps. I think at least timestamp unit handling may be improved (i.e. they roundtrip file <-> memory correctly with defaults) in more recent parquet versions, though?

@joosthooz (Collaborator, PR author)

@edponce thanks for contributing to the discussion! If you go through the trouble of storing the row number where the first non-null value is encountered, wouldn't it be just as much work to simply store the column's data type?

@joosthooz (Collaborator, PR author)

(Related JIRA BTW: https://issues.apache.org/jira/browse/ARROW-8221)

@joosthooz marked this pull request as ready for review August 15, 2022 15:53
@edponce commented Aug 15, 2022

@joosthooz yeah, you are right. There are many ways data details could be provided to help with schema resolution. In an ideal world, most file and in-memory data formats would carry metadata (min, max, null count, nrows, ncols, types, etc.) that is consistent across formats.

Another issue, which seems more related to this PR's work, is how to handle data updates. Viewing data as a structure that can be modified directly creates many issues. If datasets could only be modified via some sort of API, then the metadata could be updated accordingly during data updates.

BTW, this is just me thinking out loud. Thanks for the discussion!

@jonkeane (Contributor)

Is it possible to have this PR be based on main instead of the validation branch? I know they might be interrelated in a way that's hard to tease apart, but it would make reviewing much easier if I could see only the schema changes here and not the validation bits.

@joosthooz changed the base branch from main to 1-validate-file-integrity August 18, 2022 15:43
@joosthooz (Collaborator, PR author)

@jonkeane I changed the target branch to the validation branch, so now you only see the additional changes specific to this task. That should help. Once we merge #34 we can change the target branch back to main.

@austin3dickey (Member) left a comment:

nice, this is looking good!

@joosthooz changed the base branch from 1-validate-file-integrity to main August 19, 2022 12:15
@jonkeane (Contributor) left a comment:

Thanks for this, it's looking good. I've commented in a few places about the validation code that seems to be in here, though I think that code isn't actually intended to be here or to be merged in this PR (a rebase might help, or possibly even pulling those commits out entirely?).

We should definitely add some more tests that check various conversion possibilities (and now might be a good time to set up a general, smallish dataset, containing all or many of the types we expect to see, that we reuse across our tests).

Comment on lines 296 to 313
def arrow_type_function_lookup(function_name):
if isinstance(function_name, str):
pa_type_func = getattr(pa, function_name)
return pa_type_func

# The argument was not a pyarrow type (maybe a nested structure?)
return None
Contributor:

Would you mind adding some tests for this function? For things like this, I find it super helpful to read through tests to get a better idea of what the function is trying to do and what the defensiveness in it is trying to prevent.

Collaborator (PR author):

I added tests for these new functions.
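
For illustration, a test along these lines might look like the following. This is a hedged sketch, not the PR's actual test code, and it assumes the function lives in datalogistik/util.py:

# Hypothetical sketch of a test for arrow_type_function_lookup; not the PR's
# actual tests. Assumes the function is exposed from datalogistik.util.
import pyarrow as pa
from datalogistik import util

def test_arrow_type_function_lookup():
    # A string name should resolve to the pyarrow factory of the same name
    assert util.arrow_type_function_lookup("int64") is pa.int64
    assert util.arrow_type_function_lookup("timestamp") is pa.timestamp
    # Non-string input (e.g. a nested structure) should return None
    assert util.arrow_type_function_lookup({"nested": "struct"}) is None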

"format": "parquet",
"partitioning-nrows": 0,
}
pydict = {"int": [1, 2], "str": ["a", "b"]}
Contributor:

Would it be helpful to have a dataset with one of each type of class and reuse that in tests like this?
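
For example, a small reusable multi-type table for tests could look something like this. A sketch only; the column names and values are made up:

# Sketch of a small shared test table covering several pyarrow types.
# Column names and values are illustrative.
import decimal
import pyarrow as pa

multi_type_table = pa.table({
    "int_col": pa.array([1, 2], type=pa.int32()),
    "float_col": pa.array([1.5, 2.5], type=pa.float64()),
    "str_col": pa.array(["a", "b"], type=pa.string()),
    "ts_col": pa.array([0, 1], type=pa.timestamp("ms")),
    "dec_col": pa.array([decimal.Decimal("1.25"), decimal.Decimal("2.5")],
                        type=pa.decimal128(7, 3)),
    "bool_col": pa.array([True, False]),
})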

).to_table()
print(converted_table.schema)
assert converted_table == orig_table
util.prune_cache_entry("test_parquet")
Contributor:

I'm not sure if this test does this (it doesn't look like it does, but maybe I'm missing something!), but could we add a test of the actual roundtrip and actually converting schema types?

It would be good to capture: changing things like int32 to float, int32 to int64, a numeric of some sort to a string, etc. Ideally we would have tests for all of them to confirm that they really do all work (though this should fall through to pyarrow pretty quickly, so maybe we don't need total coverage, but at least a decent amount of coverage to have confidence we can do what we say we do would be good).

Collaborator (PR author):

This functionality is not supported yet, and I think it falls outside the scope of this PR. Even though the code for it might be rather simple, how to expose it to the user will be a challenge! I created a new issue: #58
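
For reference, the core of such a schema-conversion roundtrip test would likely come down to a cast along these lines. This is a sketch for illustration, not code from this PR:

# Sketch: casting a table to a user-specified schema, e.g. int32 -> int64 and
# a numeric column to string. Not part of this PR; illustrative only.
import pyarrow as pa

orig_table = pa.table({
    "a": pa.array([1, 2], type=pa.int32()),
    "b": pa.array([3.5, 4.5], type=pa.float64()),
})
target_schema = pa.schema([
    pa.field("a", pa.int64()),
    pa.field("b", pa.string()),
])
converted = orig_table.cast(target_schema)
assert converted.schema == target_schema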

@joosthooz (Collaborator, PR author)

It seems I need to rebase this; I thought retargeting to the main branch after merging #34 would fix it, but it didn't.
Thanks for all the comments, especially for all the ideas for tests 👍

@joosthooz (Collaborator, PR author)

OK, things should be up to date again; I'll get started on the recommendations.

@austin3dickey force-pushed the 36-add-functionality-to-specify-schema branch 2 times, most recently from 539a8e7 to c61df6b, August 30, 2022 14:35
co = csv.ConvertOptions()
# TODO: Should we autogenerate column names by default?
# Or add a property in the metadata about it?
# or allow a fall-back to read_csv in case schema detection fails?
Collaborator:

We've got CSV datasets both without headers (fanniemae) and with headers (nyctaxi). If we're specifying a schema, we should use the names specified there. If we're not... well, we probably should be.

Collaborator:

But we should be careful not to accidentally skip the first line if there are no headers and we override names that were inferred from the first line of data; we need to store metadata about this for each dataset.

Collaborator (PR author):

I added a property for this! We can now handle both. The default is no header line, so field names will be auto-generated. I had to upgrade our Arrow version for this, though, because I had reported a problem with field-name auto-generation that was only resolved in v9.0.0: https://issues.apache.org/jira/browse/ARROW-16436
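
For reference, the two cases map onto pyarrow's CSV read options roughly like this. A sketch only, with an illustrative schema, not the PR's exact code:

# Sketch of the two header cases with pyarrow.csv; not the PR's exact code.
import pyarrow as pa
from pyarrow import csv

schema = pa.schema([("id", pa.string()), ("value", pa.int64())])  # illustrative

# No header line (e.g. fanniemae): every row is data, so supply the names
# from the schema (or let pyarrow auto-generate them) and skip nothing.
no_header_opts = csv.ReadOptions(column_names=schema.names)

# Header line present (e.g. nyctaxi): when supplying explicit names, skip the
# header row so it is not read as data.
header_opts = csv.ReadOptions(column_names=schema.names, skip_rows=1)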

@austin3dickey (Member) left a comment:

This is looking great, especially all these comprehensive tests! Just a few small things.

@alistaire47 (Collaborator) left a comment:

A couple nits, but this looks really good!

field_list.append(pa.field(field_name, arrow_type))

output_schema = pa.schema(field_list)
return output_schema
Collaborator:

This is super useful. Do we have the reverse (serialize a schema to json) here somewhere as well? (Not sure if it's necessary here at the moment, but I've run into cases where it would've been very handy.) It might be nice to put all this schema stuff in a separate file so when I, um, borrow this code in the future it's in a nice self-contained package.

Collaborator (PR author):

I know, that would be really handy; the inferred schema was actually a workaround for not having it at first (though after thinking about it, I think it is good to know the difference between a user-specified schema and an inferred one).
The other way around is more difficult, though. It might come down to a large case statement covering all the parameterized types. I remember seeing some Rust code for this at some point, maybe in DataFusion, but I couldn't find it anymore.
Let's implement this in the near future.

Side comment: I found this old spec for an Arrow Schema JSON representation: https://github.com/apache/arrow/pull/158/files
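
For what it's worth, a rough sketch of that reverse direction could look like the following. It is purely illustrative, handles only a couple of parameterized types, and is not code from this PR:

# Rough sketch of serializing a pyarrow.Schema back to the dict format above.
# Only a few parameterized types are handled; purely illustrative.
import pyarrow as pa
import pyarrow.types as pat

def schema_to_dict(schema):
    out = {}
    for field in schema:
        t = field.type
        if pat.is_timestamp(t):
            spec = {"type_name": "timestamp",
                    "type_args": {"unit": t.unit, "tz": t.tz}}
        elif pat.is_decimal(t):
            spec = {"type_name": "decimal128",
                    "type_args": {"precision": t.precision, "scale": t.scale}}
        else:
            spec = {"type_name": str(t)}  # e.g. "int64", "string"
        out[field.name] = spec
    return out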

return arrow_type_function_lookup(type_name)(args)


# Convert the given dict to a pyarrow.schema
Collaborator:

If we make all these comments into docstrings, they'll pop up nicely for us in editors.

Collaborator (PR author):

Yeah, I should've done that properly from the start. Maybe we should start a PR that does that and adds type hints.
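
For instance, the lookup helper above could be documented roughly like this. A sketch of the suggested docstring-plus-type-hints style, not the follow-up work's actual code:

# Sketch of the comment-to-docstring/type-hint conversion being discussed;
# not the follow-up work's actual code.
from typing import Callable, Optional
import pyarrow as pa

def arrow_type_function_lookup(function_name) -> Optional[Callable]:
    """Look up a pyarrow type factory (e.g. pa.timestamp) by its name.

    Returns None if the argument is not a string (e.g. a nested structure).
    """
    if isinstance(function_name, str):
        return getattr(pa, function_name)
    return None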

Collaborator (PR author):

#59

-    compression=parquet_compression
+    compression=parquet_compression,
+    use_deprecated_int96_timestamps=False,
+    coerce_timestamps="us",
Collaborator:

If you've upgraded to Arrow 9.0.0, this may be unnecessary now; some of the weird behavior around this came from the old Parquet format versions, and the default format version was bumped in 9.0.0. Fine to leave here for safety, though.
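
For context, the relevant Parquet writer options look roughly like this. A sketch only; the output path is made up and the defaults differ between pyarrow releases:

# Sketch of the relevant Parquet writer options; defaults differ between
# pyarrow releases, and the output path is illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))})
pq.write_table(
    table,
    "example.parquet",
    compression="snappy",
    use_deprecated_int96_timestamps=False,  # keep the modern timestamp encoding
    coerce_timestamps="us",                 # downcast ns -> us for older readers
    allow_truncated_timestamps=True,        # needed when coercion loses precision
)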

@joosthooz merged commit 1e91c41 into main Sep 1, 2022