Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add functionality to specify schema #41
Add functionality to specify schema #41
Changes from 27 commits
b626cd7
a83d800
0db0571
1a5011d
a742722
da20871
406cec0
3c87dcd
0a67b29
50186e5
4a9e3ac
b139a4f
1e00a48
e6dfd39
1129d10
477a036
860a4d3
b1e862c
3ac13ea
b775bed
8885e3f
df615e5
6d395db
a2f938d
84ed050
389a548
7fcfa3e
274a490
cf6d998
0d5c06b
99f8391
b09f261
14941f2
fae88cb
ba487c6
a7cd552
c7c6008
57157b2
b645261
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make all these comments into docstrings they'll pop up nicely for us in editors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I shouldv'e done that properly from the start. Maybe we should start a PR that does that and adds type hints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#59
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is super useful. Do we have the reverse (serialize a schema to json) here somewhere as well? (Not sure if it's necessary here at the moment, but I've run into cases where it would've been very handy.) It might be nice to put all this schema stuff in a separate file so when I, um, borrow this code in the future it's in a nice self-contained package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, that would be really handy and the
inferred-schema
was actually a workaround for not having it at first (but then after thinking about it I think it is good to know the difference between a user-specified schema vs an inferred one).The other way is more difficult, though. It might come down to a large case statement with all the types with parameters. I remember seeing some Rust code for this at some point, maybe in datafusion, but I couldn't find it anymore.
Let's implement this in the near future.
Side comment; I found this old spec for Arrow Schema JSON representation: https://github.com/apache/arrow/pull/158/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've got both csv datasets with no headers (fanniemae) and with headers (nyctaxi). If we're specifying a schema, we should use the names specified there. If we're not...well, we probably should be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we should be careful not to accidentally skip the first line if there are no headers and we overwrite ones inferred from the first line of data; we need to store metadata about this for each dataset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a property for this! We can now handle both. Default is no header line, so field names will be auto-generated. I had to upgrade our Arrow version for this, though, because I reported a problem with field name auto-generation which was resolved in v9.0.0... https://issues.apache.org/jira/browse/ARROW-16436
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you've upgraded to arrow 9.0.0, this may be unnecessary now; some of the weird behavior around this was part of the old parquet versions, and the default version got bumped with 9.0.0. Fine to leave here for safety though.