Schema definition type #927

norberttech · 2024-01-20T17:31:35Z

Change Log

Added

Schema Definition::type() : Type
Type::merge() : Type

Fixed

Changed

Reduced size of CSV files to speed up tests

Removed

Schema definition union type
Schema narrowing that was replaced by Type::merge

Deprecated

Security

Description

When I first worked on Schema, there was no concept of PHP\Type, so schema definitions relied only on Entry::class. Later we introduced Types, which are more precise.
By exposing Types through Definitions we can also get rid of the recently added Schema::narrowing, as this logic was taken over by Type::merge(Type) : Type

We now have 3 mechanisms in place that deal with types:

Flow\ETL\PHP\Type\TypeDetector - detects the Type instance based on a variable; the main goal is to properly identify logical types. This class does not turn "2023-01-01" into a DateTime object, it will still detect it as a string.
Flow\ETL\Row\Factory\NativeEntryFactory - the goal of this class is to use TypeDetector or Schema (this is not yet fully implemented) to convert values into Entries. This class also does not autocast anything. If NativeEntryFactory did that, it would be impossible to use the Cast ScalarFunction properly, because function results go through NativeEntryFactory before they are turned into entries, which would end up in an infinite cycle of casting strings to DateTime objects.
Flow\ETL\Transformer\AutoCastTransformer - this is the class responsible for automatically detecting types based on data patterns. I need to extract that logic into an AutoCaster class.

Now those 3 things can be used in the following way:

extract data from, let's say, CSV (here TypeDetector will detect that each value is either String or Null)
NativeEntryFactory will create mostly StringEntries (with some nulls)
then we can use ->autoCast(), which will automatically turn those StringEntries into stricter types.

However, autocasting is not aware of the entire column context, so if we have something like this:

id,tracking_number
1,123456
2,123456
3,A12345

The first two rows will get tracking_number as IntegerEntry and the last one will get StringEntry.
This is where Schema::merge, or rather Definition::merge, comes into play.
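
The per-row behaviour described above can be illustrated with a small standalone sketch. It is not the AutoCastTransformer implementation: `sketch_auto_cast` is a hypothetical helper invented for this example, and it only distinguishes integers, floats and strings. The point is that each value is inspected on its own, so one column can end up with mixed types.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical per-value caster (not part of Flow).
 * It looks at a single value only and never at the rest of the column.
 */
function sketch_auto_cast(string $value): int|float|string
{
    // Digits only (with an optional leading minus) become an integer.
    if (preg_match('/^-?\d+$/', $value) === 1) {
        return (int) $value;
    }

    // Any other numeric string becomes a float.
    if (is_numeric($value)) {
        return (float) $value;
    }

    // Everything else stays a string.
    return $value;
}

// The tracking_number column from the CSV sample above.
$trackingNumbers = ['123456', '123456', 'A12345'];

// get_debug_type() reports the native PHP type name of each cast value.
$detected = array_map(
    static fn (string $value): string => get_debug_type(sketch_auto_cast($value)),
    $trackingNumbers
);

echo implode(', ', $detected), PHP_EOL; // int, int, string
```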

Whenever we merge two different schema definitions for the same entry, we try to merge them into a more flexible type, for example:

Integer and String will be merged into String, because all integers can be represented as strings but not all strings can be represented as integers.
Integer and Float will get merged into Float
Map and List will get merged into Array
etc...
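
A minimal sketch of the merge rules listed above, assuming types are represented as plain strings. `sketch_merge_types` and the type names are hypothetical and exist only for this example; the real Type::merge works on Type objects and covers more cases than these three pairs.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for the merge rules above (not Flow's Type::merge).
 * Returns the more flexible of two type names, or throws when this sketch
 * has no rule for the pair.
 */
function sketch_merge_types(string $left, string $right): string
{
    if ($left === $right) {
        return $left;
    }

    // Sort the pair so the lookup key does not depend on argument order.
    $pair = [$left, $right];
    sort($pair);
    $key = implode('|', $pair);

    // Only the three examples from the description are covered here.
    $rules = [
        'integer|string' => 'string',
        'float|integer' => 'float',
        'list|map' => 'array',
    ];

    if (!array_key_exists($key, $rules)) {
        throw new InvalidArgumentException("No merge rule for {$left} and {$right} in this sketch");
    }

    return $rules[$key];
}

echo sketch_merge_types('string', 'integer'), PHP_EOL; // string
echo sketch_merge_types('integer', 'float'), PHP_EOL;  // float
echo sketch_merge_types('map', 'list'), PHP_EOL;       // array
```

Sorting the pair keeps the merge symmetric, so merging Integer into String gives the same result as merging String into Integer.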

All of those things give us something called Schema Inferring: we iterate through the CSV file trying to guess the schema based on the given data, then narrow this schema, so that later, while reading that file again, we can apply that schema and avoid guessing again. It's especially useful when, in order to infer the schema, we don't go through the entire file but, let's say, only the first 1000 rows.
The bigger the dataset, the more beneficial schema inferring becomes. If we get the proper schema from 1k rows and we need to process 20k rows, the remaining 19k will not need to go through guessing and type checking, since the native entry factory will just use the provided schema to create entries.
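
The inferring flow described above can be sketched as follows, under heavy simplification. All function names are hypothetical and not part of Flow, the sketch only knows the 'integer' and 'string' types, and rows are plain associative arrays rather than Flow rows. It guesses a type per value within a sample, widens a column to 'string' when the guesses disagree, and then applies the resulting schema without guessing again.

```php
<?php

declare(strict_types=1);

/** Hypothetical guess for a single raw CSV value. */
function sketch_guess_type(string $value): string
{
    return preg_match('/^-?\d+$/', $value) === 1 ? 'integer' : 'string';
}

/**
 * Hypothetical inference over the first $sampleSize rows.
 *
 * @param iterable<array<string, string>> $rows
 * @return array<string, string> column name => type name
 */
function sketch_infer_schema(iterable $rows, int $sampleSize): array
{
    $schema = [];
    $seen = 0;

    foreach ($rows as $row) {
        if ($seen++ >= $sampleSize) {
            break;
        }

        foreach ($row as $column => $value) {
            $guess = sketch_guess_type($value);
            // Keep the type while guesses agree, otherwise widen to string.
            $schema[$column] = ($schema[$column] ?? $guess) === $guess ? $guess : 'string';
        }
    }

    return $schema;
}

/** Applies a previously inferred schema to one row, with no guessing. */
function sketch_apply_schema(array $row, array $schema): array
{
    $typed = [];

    foreach ($row as $column => $value) {
        $typed[$column] = ($schema[$column] ?? 'string') === 'integer' ? (int) $value : $value;
    }

    return $typed;
}

$rows = [
    ['id' => '1', 'tracking_number' => '123456'],
    ['id' => '2', 'tracking_number' => '123456'],
    ['id' => '3', 'tracking_number' => 'A12345'],
];

$schema = sketch_infer_schema($rows, 1000);
$typedRows = array_map(
    static fn (array $row): array => sketch_apply_schema($row, $schema),
    $rows
);

var_dump($schema, $typedRows[0]);
```

With the sample data, `tracking_number` is inferred as a string for the whole column, so the first two rows are no longer treated as integers.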

github-actions · 2024-01-20T17:34:14Z

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors

+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.110mb +0.08%  | 818.409ms +0.50% | ±0.58% -27.18%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.839mb +0.28%   | 340.877ms +0.46% | ±0.44% -18.01%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.909mb +0.49%   | 1.051s +0.23%    | ±1.19% -8.78%   |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 239.612mb +0.01% | 1.248s +0.64%    | ±1.98% +475.29% |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.670mb +0.11%   | 35.443ms +2.21%  | ±0.40% -19.87%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.672mb +0.11%   | 429.377ms -1.25% | ±0.89% +220.02% |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+

Transformers

+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev           |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 110.385mb +0.01% | 65.016ms +1.85% | ±3.12% +2233.25% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+

Loaders

+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 95.490mb +0.03%  | 467.875ms +1.14% | ±0.32% -78.26%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 53.913mb +0.03%  | 73.541ms +2.42%  | ±2.46% -8.14%   |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 106.325mb +0.08% | 52.399ms +0.25%  | ±1.04% +800.76% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 321.534mb +0.03% | 1.299s +4.01%    | ±0.85% +41.71%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.731mb +0.06%  | 41.239ms -0.34%  | ±0.50% -39.34%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+

Building Blocks

+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev           |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.497mb +0.02% | 482.191ms -2.77% | ±1.10% +69.69%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 59.975mb +0.04%  | 246.264ms +2.41% | ±0.49% +34.13%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.908mb +0.16%  | 52.506ms +3.64%  | ±1.65% +167.39%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.910mb +0.03%  | 432.337ms +0.69% | ±0.07% -90.66%   |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.449mb +0.13%  | 85.952ms +1.30%  | ±0.31% +1075.24% |
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 76.465mb +0.01%  | 3.393ms +3.41%   | ±2.27% +6.01%    |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 96.258mb +0.01%  | 190.178ms +1.83% | ±0.86% +185.48%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 74.784mb +0.01%  | 19.559ms +4.32%  | ±1.54% +144.68%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 77.705mb +0.01%  | 1.606ms -3.74%   | ±2.46% +163.00%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 77.705mb +0.01%  | 1.627ms -2.51%   | ±2.17% +63.73%   |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 74.817mb +0.01%  | 2.579ms +0.62%   | ±0.60% -14.68%   |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 75.346mb +0.01%  | 14.861ms -0.77%  | ±0.90% -66.74%   |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 75.346mb +0.01%  | 14.840ms +3.40%  | ±0.73% -16.06%   |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 73.249mb +0.01%  | 1.706μs -4.90%   | ±2.72% +1.92%    |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 73.249mb +0.01%  | 0.300μs -25.00%  | ±0.00% -100.00%  |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 86.805mb +0.09%  | 12.905ms +2.01%  | ±0.96% -3.50%    |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 116.165mb +0.01% | 64.469ms +1.11%  | ±1.36% +9.08%    |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 75.866mb +0.01%  | 1.522ms -4.27%   | ±2.01% -8.08%    |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 79.213mb +0.01%  | 58.586ms +0.18%  | ±0.42% -57.65%   |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 77.967mb +0.01%  | 4.232ms +11.40%  | ±3.31% +277.74%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 73.394mb +0.01%  | 39.926ms -1.23%  | ±1.54% +172.08%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 73.395mb +0.01%  | 40.579ms +0.44%  | ±0.26% -92.32%   |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 73.394mb +0.01%  | 40.775ms -0.13%  | ±0.88% -29.65%   |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 75.691mb +0.01%  | 7.371ms +0.56%   | ±0.57% +40.43%   |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 73.249mb +0.01%  | 29.258ms +0.38%  | ±0.96% +113.82%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 73.249mb +0.01%  | 13.754μs +2.55%  | ±3.25% +364.33%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 73.249mb +0.01%  | 15.612μs -1.23%  | ±0.60% +102.13%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 96.259mb +0.01%  | 185.998ms -2.88% | ±0.48% -7.64%    |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+

norberttech added 2 commits January 20, 2024 12:34

Removed union types from schema definitions

4edb171

Move merging schema definitions logic to types

4818e7d

github-actions bot added core adapter-ampavrohp adapter-csv adapter-parquet examples lib-parquet size: XL labels Jan 20, 2024

norberttech added 4 commits January 21, 2024 11:50

Make impossible to merge two schema definitions that are pointing to a different entry

6664f7e

Removed development leftovers

e10a210

Improved failed Definition::merge exception

f46a392

Removed schema narrowing

ede39db

norberttech merged commit e81e228 into flow-php:1.x Jan 21, 2024
20 checks passed

norberttech mentioned this pull request Jan 21, 2024

Expose PHP Type from Schema Definition #920

Closed
