
Add functionality to specify schema #41

Merged: 39 commits, merged into main on Sep 1, 2022

Conversation

@joosthooz (Collaborator)

Instead of relying on pyarrow.dataset's type inference, we would like to be able to specify a schema in the repo file. This will help solve type problems in large files, because pyarrow.dataset only inspects the first 1 MB of data to infer column types.

@joosthooz linked an issue Aug 12, 2022 that may be closed by this pull request
@joosthooz (Collaborator, PR author) commented Aug 12, 2022

From #36 (comment):

We might need to go a little deeper than that for fields with options, e.g. unit/timezone for timestamps and precision/scale for decimal. Maybe just optional kwargs that can be passed through like

{
  "id": {"type_name": "string"},
  "click_ts": {"type_name": "timestamp", "type_args": {"unit": "ms"}},
  "some_value": {"type_name": "decimal128", "type_args": {"precision": 7, "index": 3}},
  ...
}

Or flat or otherwise—I don't really care about arrangement, just that it's flexible enough when we need it
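
For illustration, a mapping like the one above could be turned into a pyarrow.Schema roughly as follows. This is only a sketch: schema_from_dict is a hypothetical name, not the PR's actual helper.

# Hypothetical sketch: build a pyarrow.Schema from the dict layout above.
# Not the PR's implementation; the function name is illustrative.
import pyarrow as pa

def schema_from_dict(column_specs):
    fields = []
    for name, spec in column_specs.items():
        type_factory = getattr(pa, spec["type_name"])  # e.g. pa.timestamp, pa.decimal128
        arrow_type = type_factory(**spec.get("type_args", {}))
        fields.append(pa.field(name, arrow_type))
    return pa.schema(fields)

schema = schema_from_dict({
    "id": {"type_name": "string"},
    "click_ts": {"type_name": "timestamp", "type_args": {"unit": "ms"}},
    "some_value": {"type_name": "decimal128", "type_args": {"precision": 7, "scale": 3}},
})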

@joosthooz (Collaborator, PR author)

This code has the basic functionality and is able to convert the fanniemae dataset. It can't handle datatypes with parameters yet.

@edponce commented Aug 12, 2022

Out of curiosity, are there files/datasets where 1 MB has not been enough to resolve the schema? I ask because this could be an improvement that Arrow would benefit from as well.

@alistaire47 (Collaborator)

Out of curiosity, are there files/datasets where 1 MB has not been enough to resolve the schema? I ask because this could be an improvement that Arrow would benefit from as well.

They definitely exist, and there's no magic number here; at some point you just have to specify a schema. An example is an export from a database table sorted by timestamp where a nullable field was added at some point: values may be entirely null for millions of rows even though there is useful data at the end of the dataset.

Especially for interactive work, being able to specify a partial schema just for columns you know are screwy can be handy; readr::cols() takes this approach with a default of inferring types for unspecified columns. Obviously specifying a complete schema is better, but forcing users to do so can lead to other antipatterns. What we have is good, just maybe a little idealistic.
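
As a point of reference, pyarrow's CSV reader already supports this kind of partial override. A minimal sketch, assuming a hypothetical file and column names:

# Override types only for known-problematic columns and let pyarrow infer the
# rest, similar in spirit to readr::cols(). The file name and column names
# here are made up for illustration.
import pyarrow as pa
from pyarrow import csv

convert_options = csv.ConvertOptions(
    column_types={
        "click_ts": pa.timestamp("ms"),     # force the unit instead of inferring it
        "mostly_null_field": pa.float64(),  # avoid null-type inference on sparse columns
    }
)
table = csv.read_csv("data.csv", convert_options=convert_options)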

@edponce commented Aug 12, 2022

@alistaire47 Thank you for the explanation. A possible solution for file format standardization would be to include metadata with a value per column identifying where the first non-null value appears; that way, tools could fetch that value directly for schema resolution.

@alistaire47 (Collaborator)

A possible solution for file format standardization would be to include metadata with a value per column identifying where the first non-null value appears; that way, tools could fetch that value directly for schema resolution.

@edponce Ooh, that would be great, and it would handle a lot of problem cases! The remaining issues I've run into have all been about schema parameters of fields that are difficult to infer from data, e.g. unit/timezone for timestamps. I think at least timestamp unit handling may be improved (i.e. they roundtrip file <-> memory correctly with defaults) in more recent parquet versions, though?

@joosthooz (Collaborator, PR author)

@edponce thanks for contributing to the discussion! If you go through the trouble of storing the row number where the first non-null value is encountered, wouldn't it be just as much work to simply store the column's data type?

@joosthooz (Collaborator, PR author)

(Related JIRA BTW: https://issues.apache.org/jira/browse/ARROW-8221)

@joosthooz marked this pull request as ready for review August 15, 2022 15:53
@edponce commented Aug 15, 2022

@joosthooz yeah, you are right. There are many ways data details could be provided to help with schema resolution. In an ideal world, most file and in-memory data formats would carry metadata (min, max, null count, nrows, ncols, types, etc.) that is consistent across formats.

Another issue, which seems more related to this PR's work, is how to handle data updates. Viewing data as a structure that can be modified directly creates many issues. If datasets could only be modified via some sort of API, then the metadata could be updated accordingly during data updates.

BTW, this is just me thinking out loud. Thanks for the discussion!

@jonkeane (Contributor)

Is it possible to have this PR be based on main instead of the validation branch? I know they might be interrelated in a way that's hard to tease apart, but it would make reviewing much easier if I could see only the schema changes here and not the validation bits.

@joosthooz changed the base branch from main to 1-validate-file-integrity August 18, 2022 15:43
@joosthooz (Collaborator, PR author)

@jonkeane I changed the target branch to the validation branch, so now you only see the additional changes specific to this task. That should help. Once we merge #34 we can change the target branch back to main.

@austin3dickey (Member) left a comment:

nice, this is looking good!

@joosthooz changed the base branch from 1-validate-file-integrity to main August 19, 2022 12:15
@jonkeane (Contributor) left a comment:

Thanks for this, it's looking good. I've commented in a few places about the validation code that seems to be in here, though I think that code isn't actually intended to be here or to be merged in this PR (a rebase might help, or possibly even pulling those commits out entirely?).

We should definitely add some more tests that check various conversion possibilities (and now might be a good time to set up a general, smallish dataset, containing all or many of the types we expect to see, that we reuse across our tests).

Comment on lines 296 to 313
def arrow_type_function_lookup(function_name):
if isinstance(function_name, str):
pa_type_func = getattr(pa, function_name)
return pa_type_func

# The argument was not a pyarrow type (maybe a nested structure?)
return None
Contributor:

Would you mind adding some tests for this function? For things like this, I find it super helpful to read through tests to get a better idea of what the function is trying to do and what the defensiveness in it is trying to prevent.

Collaborator (PR author):

I added tests for these new functions.
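
For illustration, a test along these lines might look like the following. This is a hedged sketch, not the PR's actual test code, and it assumes the function lives in datalogistik/util.py:

# Hypothetical sketch of a test for arrow_type_function_lookup; not the PR's
# actual tests. Assumes the function is exposed from datalogistik.util.
import pyarrow as pa
from datalogistik import util

def test_arrow_type_function_lookup():
    # A string name should resolve to the pyarrow factory of the same name
    assert util.arrow_type_function_lookup("int64") is pa.int64
    assert util.arrow_type_function_lookup("timestamp") is pa.timestamp
    # Non-string input (e.g. a nested structure) should return None
    assert util.arrow_type_function_lookup({"nested": "struct"}) is None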

"format": "parquet",
"partitioning-nrows": 0,
}
pydict = {"int": [1, 2], "str": ["a", "b"]}
Contributor:

Would it be helpful to have a dataset with one of each type of class and reuse that in tests like this?
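
For example, a small reusable multi-type table for tests could look something like this. A sketch only; the column names and values are made up:

# Sketch of a small shared test table covering several pyarrow types.
# Column names and values are illustrative.
import decimal
import pyarrow as pa

multi_type_table = pa.table({
    "int_col": pa.array([1, 2], type=pa.int32()),
    "float_col": pa.array([1.5, 2.5], type=pa.float64()),
    "str_col": pa.array(["a", "b"], type=pa.string()),
    "ts_col": pa.array([0, 1], type=pa.timestamp("ms")),
    "dec_col": pa.array([decimal.Decimal("1.25"), decimal.Decimal("2.5")],
                        type=pa.decimal128(7, 3)),
    "bool_col": pa.array([True, False]),
})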

).to_table()
print(converted_table.schema)
assert converted_table == orig_table
util.prune_cache_entry("test_parquet")
Contributor:

I'm not sure if this test does this (it doesn't look like it does, but maybe I'm missing something!), but could we add a test of the actual roundtrip and actually converting schema types?

It would be good to capture: changing things like int32 to float, int32 to int64, a numeric of some sort to a string, etc. Ideally we would have tests for all of them to confirm that they really do all work (though this should fall through to pyarrow pretty quickly, so maybe we don't need total coverage, but at least a decent amount of coverage to have confidence we can do what we say we do would be good).

Collaborator (PR author):

This functionality is not supported yet, and I think it falls outside the scope of this PR. Even though the code for it might be rather simple, how to expose it to the user will be a challenge! I created a new issue: #58
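
For reference, the core of such a schema-conversion roundtrip test would likely come down to a cast along these lines. This is a sketch for illustration, not code from this PR:

# Sketch: casting a table to a user-specified schema, e.g. int32 -> int64 and
# a numeric column to string. Not part of this PR; illustrative only.
import pyarrow as pa

orig_table = pa.table({
    "a": pa.array([1, 2], type=pa.int32()),
    "b": pa.array([3.5, 4.5], type=pa.float64()),
})
target_schema = pa.schema([
    pa.field("a", pa.int64()),
    pa.field("b", pa.string()),
])
converted = orig_table.cast(target_schema)
assert converted.schema == target_schema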

@joosthooz (Collaborator, PR author)

It seems I need to rebase this; I thought retargeting to the main branch after merging #34 would fix it, but it didn't.
Thanks for all the comments, especially for all the ideas for tests 👍

@joosthooz (Collaborator, PR author)

OK, things should be up to date again; I'll get started on the recommendations.

@austin3dickey force-pushed the 36-add-functionality-to-specify-schema branch 2 times, most recently from 539a8e7 to c61df6b, August 30, 2022 14:35
co = csv.ConvertOptions()
# TODO: Should we autogenerate column names by default?
# Or add a property in the metadata about it?
# or allow a fall-back to read_csv in case schema detection fails?
Collaborator:

We've got CSV datasets both without headers (fanniemae) and with headers (nyctaxi). If we're specifying a schema, we should use the names specified there. If we're not... well, we probably should be.

Collaborator:

But we should be careful not to accidentally skip the first line if there are no headers and we override names that were inferred from the first line of data; we need to store metadata about this for each dataset.

Collaborator (PR author):

I added a property for this! We can now handle both. The default is no header line, so field names will be auto-generated. I had to upgrade our Arrow version for this, though, because I had reported a problem with field-name auto-generation that was only resolved in v9.0.0: https://issues.apache.org/jira/browse/ARROW-16436
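
For reference, the two cases map onto pyarrow's CSV read options roughly like this. A sketch only, with an illustrative schema, not the PR's exact code:

# Sketch of the two header cases with pyarrow.csv; not the PR's exact code.
import pyarrow as pa
from pyarrow import csv

schema = pa.schema([("id", pa.string()), ("value", pa.int64())])  # illustrative

# No header line (e.g. fanniemae): every row is data, so supply the names
# from the schema (or let pyarrow auto-generate them) and skip nothing.
no_header_opts = csv.ReadOptions(column_names=schema.names)

# Header line present (e.g. nyctaxi): when supplying explicit names, skip the
# header row so it is not read as data.
header_opts = csv.ReadOptions(column_names=schema.names, skip_rows=1)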

@austin3dickey (Member) left a comment:

This is looking great, especially all these comprehensive tests! Just a few small things.

@alistaire47 (Collaborator) left a comment:

A couple nits, but this looks really good!

field_list.append(pa.field(field_name, arrow_type))

output_schema = pa.schema(field_list)
return output_schema
Collaborator:

This is super useful. Do we have the reverse (serialize a schema to json) here somewhere as well? (Not sure if it's necessary here at the moment, but I've run into cases where it would've been very handy.) It might be nice to put all this schema stuff in a separate file so when I, um, borrow this code in the future it's in a nice self-contained package.

Collaborator (PR author):

I know, that would be really handy; the inferred schema was actually a workaround for not having it at first (though after thinking about it, I think it is good to know the difference between a user-specified schema and an inferred one).
The other way around is more difficult, though. It might come down to a large case statement covering all the parameterized types. I remember seeing some Rust code for this at some point, maybe in DataFusion, but I couldn't find it anymore.
Let's implement this in the near future.

Side comment: I found this old spec for an Arrow Schema JSON representation: https://github.com/apache/arrow/pull/158/files
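
For what it's worth, a rough sketch of that reverse direction could look like the following. It is purely illustrative, handles only a couple of parameterized types, and is not code from this PR:

# Rough sketch of serializing a pyarrow.Schema back to the dict format above.
# Only a few parameterized types are handled; purely illustrative.
import pyarrow as pa
import pyarrow.types as pat

def schema_to_dict(schema):
    out = {}
    for field in schema:
        t = field.type
        if pat.is_timestamp(t):
            spec = {"type_name": "timestamp",
                    "type_args": {"unit": t.unit, "tz": t.tz}}
        elif pat.is_decimal(t):
            spec = {"type_name": "decimal128",
                    "type_args": {"precision": t.precision, "scale": t.scale}}
        else:
            spec = {"type_name": str(t)}  # e.g. "int64", "string"
        out[field.name] = spec
    return out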

return arrow_type_function_lookup(type_name)(args)


# Convert the given dict to a pyarrow.schema
Collaborator:

If we make all these comments into docstrings, they'll pop up nicely for us in editors.

Collaborator (PR author):

Yeah, I should've done that properly from the start. Maybe we should start a PR that does that and adds type hints.
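
For instance, the lookup helper above could be documented roughly like this. A sketch of the suggested docstring-plus-type-hints style, not the follow-up work's actual code:

# Sketch of the comment-to-docstring/type-hint conversion being discussed;
# not the follow-up work's actual code.
from typing import Callable, Optional
import pyarrow as pa

def arrow_type_function_lookup(function_name) -> Optional[Callable]:
    """Look up a pyarrow type factory (e.g. pa.timestamp) by its name.

    Returns None if the argument is not a string (e.g. a nested structure).
    """
    if isinstance(function_name, str):
        return getattr(pa, function_name)
    return None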

Collaborator (PR author):

#59

-    compression=parquet_compression
+    compression=parquet_compression,
+    use_deprecated_int96_timestamps=False,
+    coerce_timestamps="us",
Collaborator:

If you've upgraded to Arrow 9.0.0, this may be unnecessary now; some of the weird behavior around this came from the old Parquet format versions, and the default format version was bumped in 9.0.0. Fine to leave here for safety, though.
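
For context, the relevant Parquet writer options look roughly like this. A sketch only; the output path is made up and the defaults differ between pyarrow releases:

# Sketch of the relevant Parquet writer options; defaults differ between
# pyarrow releases, and the output path is illustrative.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": pa.array([0, 1], type=pa.timestamp("ns"))})
pq.write_table(
    table,
    "example.parquet",
    compression="snappy",
    use_deprecated_int96_timestamps=False,  # keep the modern timestamp encoding
    coerce_timestamps="us",                 # downcast ns -> us for older readers
    allow_truncated_timestamps=True,        # needed when coercion loses precision
)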

@joosthooz merged commit 1e91c41 into main Sep 1, 2022