Schema definition type #927
Merged: norberttech merged 6 commits into flow-php:1.x from norberttech:feature/schema-definition-type on Jan 21, 2024
github-actions bot added the core, adapter-ampavrohp, adapter-csv, adapter-parquet, examples, lib-parquet, and size: XL labels on Jan 20, 2024
Flow PHP - Benchmarks
Results of the benchmarks from this PR are compared with the results from 1.x branch.
Extractors
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench | bench_extract_10k | 1 | 3 | 35.110mb +0.08% | 818.409ms +0.50% | ±0.58% -27.18% |
| CSVExtractorBench | bench_extract_10k | 1 | 3 | 4.839mb +0.28% | 340.877ms +0.46% | ±0.44% -18.01% |
| JsonExtractorBench | bench_extract_10k | 1 | 3 | 4.909mb +0.49% | 1.051s +0.23% | ±1.19% -8.78% |
| ParquetExtractorBench | bench_extract_10k | 1 | 3 | 239.612mb +0.01% | 1.248s +0.64% | ±1.98% +475.29% |
| TextExtractorBench | bench_extract_10k | 1 | 3 | 4.670mb +0.11% | 35.443ms +2.21% | ±0.40% -19.87% |
| XmlExtractorBench | bench_extract_10k | 1 | 3 | 4.672mb +0.11% | 429.377ms -1.25% | ±0.89% +220.02% |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1 | 3 | 110.385mb +0.01% | 65.016ms +1.85% | ±3.12% +2233.25% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench | bench_load_10k | 1 | 3 | 95.490mb +0.03% | 467.875ms +1.14% | ±0.32% -78.26% |
| CSVLoaderBench | bench_load_10k | 1 | 3 | 53.913mb +0.03% | 73.541ms +2.42% | ±2.46% -8.14% |
| JsonLoaderBench | bench_load_10k | 1 | 3 | 106.325mb +0.08% | 52.399ms +0.25% | ±1.04% +800.76% |
| ParquetLoaderBench | bench_load_10k | 1 | 3 | 321.534mb +0.03% | 1.299s +4.01% | ±0.85% +41.71% |
| TextLoaderBench | bench_load_10k | 1 | 3 | 17.731mb +0.06% | 41.239ms -0.34% | ±0.50% -39.34% |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| benchmark | subject | revs | its | mem_peak | mode | rstdev |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 116.497mb +0.02% | 482.191ms -2.77% | ±1.10% +69.69% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 59.975mb +0.04% | 246.264ms +2.41% | ±0.49% +34.13% |
| NativeEntryFactoryBench | bench_entry_factory | 1 | 3 | 14.908mb +0.16% | 52.506ms +3.64% | ±1.65% +167.39% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 59.910mb +0.03% | 432.337ms +0.69% | ±0.07% -90.66% |
| TypeDetectorBench | bench_type_detector | 1 | 3 | 14.449mb +0.13% | 85.952ms +1.30% | ±0.31% +1075.24% |
| RowsBench | bench_chunk_10_on_10k | 2 | 3 | 76.465mb +0.01% | 3.393ms +3.41% | ±2.27% +6.01% |
| RowsBench | bench_diff_left_1k_on_10k | 2 | 3 | 96.258mb +0.01% | 190.178ms +1.83% | ±0.86% +185.48% |
| RowsBench | bench_diff_right_1k_on_10k | 2 | 3 | 74.784mb +0.01% | 19.559ms +4.32% | ±1.54% +144.68% |
| RowsBench | bench_drop_1k_on_10k | 2 | 3 | 77.705mb +0.01% | 1.606ms -3.74% | ±2.46% +163.00% |
| RowsBench | bench_drop_right_1k_on_10k | 2 | 3 | 77.705mb +0.01% | 1.627ms -2.51% | ±2.17% +63.73% |
| RowsBench | bench_entries_on_10k | 2 | 3 | 74.817mb +0.01% | 2.579ms +0.62% | ±0.60% -14.68% |
| RowsBench | bench_filter_on_10k | 2 | 3 | 75.346mb +0.01% | 14.861ms -0.77% | ±0.90% -66.74% |
| RowsBench | bench_find_on_10k | 2 | 3 | 75.346mb +0.01% | 14.840ms +3.40% | ±0.73% -16.06% |
| RowsBench | bench_find_one_on_10k | 10 | 3 | 73.249mb +0.01% | 1.706μs -4.90% | ±2.72% +1.92% |
| RowsBench | bench_first_on_10k | 10 | 3 | 73.249mb +0.01% | 0.300μs -25.00% | ±0.00% -100.00% |
| RowsBench | bench_flat_map_on_1k | 2 | 3 | 86.805mb +0.09% | 12.905ms +2.01% | ±0.96% -3.50% |
| RowsBench | bench_map_on_10k | 2 | 3 | 116.165mb +0.01% | 64.469ms +1.11% | ±1.36% +9.08% |
| RowsBench | bench_merge_1k_on_10k | 2 | 3 | 75.866mb +0.01% | 1.522ms -4.27% | ±2.01% -8.08% |
| RowsBench | bench_partition_by_on_10k | 2 | 3 | 79.213mb +0.01% | 58.586ms +0.18% | ±0.42% -57.65% |
| RowsBench | bench_remove_on_10k | 2 | 3 | 77.967mb +0.01% | 4.232ms +11.40% | ±3.31% +277.74% |
| RowsBench | bench_sort_asc_on_1k | 2 | 3 | 73.394mb +0.01% | 39.926ms -1.23% | ±1.54% +172.08% |
| RowsBench | bench_sort_by_on_1k | 2 | 3 | 73.395mb +0.01% | 40.579ms +0.44% | ±0.26% -92.32% |
| RowsBench | bench_sort_desc_on_1k | 2 | 3 | 73.394mb +0.01% | 40.775ms -0.13% | ±0.88% -29.65% |
| RowsBench | bench_sort_entries_on_1k | 2 | 3 | 75.691mb +0.01% | 7.371ms +0.56% | ±0.57% +40.43% |
| RowsBench | bench_sort_on_1k | 2 | 3 | 73.249mb +0.01% | 29.258ms +0.38% | ±0.96% +113.82% |
| RowsBench | bench_take_1k_on_10k | 10 | 3 | 73.249mb +0.01% | 13.754μs +2.55% | ±3.25% +364.33% |
| RowsBench | bench_take_right_1k_on_10k | 10 | 3 | 73.249mb +0.01% | 15.612μs -1.23% | ±0.60% +102.13% |
| RowsBench | bench_unique_on_1k | 2 | 3 | 96.259mb +0.01% | 185.998ms -2.88% | ±0.48% -7.64% |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
Change Log
Added
Fixed
Changed
Removed
Deprecated
Security
Description
When I first worked on Schema, there was no concept of PHP\Type; because of that, schema definitions relied only on Entry::class. Later, however, we introduced Types, which are more precise.
Thanks to exposing Types through Definitions, we can also get rid of the recently added Schema::narrowing, as this logic was taken over by Type::merge(Type) : Type
We now have 3 mechanisms in place that deal with types:
- Flow\ETL\PHP\Type\TypeDetector - detects the Type instance based on a variable; the main goal is to properly identify logical types. This class does not turn "2023-01-01" into DateTime objects; it will still detect it as a string.
- Flow\ETL\Row\Factory\NativeEntryFactory - the goal of this class is to use TypeDetector or Schema (this is not yet fully implemented) and convert values into Entries. This class also does not autocast anything. If NativeEntryFactory did that, it would be impossible to properly use the Cast ScalarFunction, because the results of functions go through NativeEntryFactory before they are turned into entries, which would end up in an infinite cycle of casting strings to DateTime objects.
- Flow\ETL\Transformer\AutoCastTransformer - this is the class responsible for automatically detecting types based on data patterns. I need to extract that logic into an AutoCaster class.
Now those 3 things can be used in the following way:
However, autocasting is not aware of the entire column context. If a column such as tracking_number holds values that look like integers in the first two rows but not in the last one, the first two rows will get tracking_number as IntegerEntry and the last one will be StringEntry.
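The per-value behaviour described above can be illustrated with a minimal, self-contained sketch. It does not use the Flow API: `autoCastValue()` and the sample rows are hypothetical, and the regular expressions are only an assumption about what "looks like an integer" means.

```php
<?php

declare(strict_types=1);

// Hypothetical helper, not part of Flow: casts a single value by
// looking only at that value, with no knowledge of the rest of the column.
function autoCastValue(string $value): int|float|string
{
    // Digits only (optionally signed): treat as an integer.
    if (preg_match('/^-?\d+$/', $value) === 1) {
        return (int) $value;
    }

    // Digits with a decimal part: treat as a float.
    if (preg_match('/^-?\d+\.\d+$/', $value) === 1) {
        return (float) $value;
    }

    // Anything else stays a string.
    return $value;
}

// Made-up sample rows, for illustration only.
$rows = [
    ['tracking_number' => '1000234'],
    ['tracking_number' => '1000235'],
    ['tracking_number' => 'PL-1000236'],
];

// Collect the PHP type each value ends up with after per-value casting.
$types = array_map(
    static fn (array $row): string => get_debug_type(autoCastValue($row['tracking_number'])),
    $rows
);

echo implode(', ', $types), PHP_EOL; // prints "int, int, string"
```

The same column ends up with two different types, which is the inconsistency that merging definitions is meant to resolve.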
And here is where Schema::merge or rather Definition::merge comes into play.
Whenever we merge two different schema definitions for the same entry, we try to merge them into a more flexible type, for example:
- Integer and String will be merged into String, because all integers can be represented as strings but not all strings can be represented as integers.
- Integer and Float will be merged into Float.
- Map and List will be merged into Array.
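The three merge rules listed above can be sketched as a small lookup. This is only an illustration under stated assumptions: types are reduced to plain strings, `mergeTypes()` is a hypothetical function rather than the real Type::merge, and throwing on any other pair is an assumption, since the description does not say what happens for pairs it does not list.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for Type::merge, using plain type names.
function mergeTypes(string $left, string $right): string
{
    // Identical types need no widening.
    if ($left === $right) {
        return $left;
    }

    // Sort the pair so the lookup does not depend on argument order.
    $pair = [$left, $right];
    sort($pair);

    return match ($pair) {
        ['integer', 'string'] => 'string', // every integer can be written as a string
        ['float', 'integer'] => 'float',
        ['list', 'map'] => 'array',
        // Assumption: pairs not covered by the rules above are rejected.
        default => throw new InvalidArgumentException("Cannot merge {$left} and {$right}"),
    };
}

echo mergeTypes('integer', 'string'), PHP_EOL; // prints "string"
echo mergeTypes('float', 'integer'), PHP_EOL;  // prints "float"
echo mergeTypes('map', 'list'), PHP_EOL;       // prints "array"
```

Sorting the pair keeps the merge symmetric, so the result does not depend on which definition was seen first.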
All of these things give us something called Schema Inferring: we iterate through the CSV file, trying to guess the schema based on the given data, then narrow this schema so that later, while reading that file again, we can apply it and avoid guessing again. It's especially useful when, in order to infer the schema, we do not go through the entire file but only, let's say, the first 1000 rows. The bigger the dataset, the more beneficial schema inferring becomes. If we get the proper schema from 1k rows and need to process 20k rows, the remaining 19k will not need to go through guessing and type checking, since the native entry factory will just use the provided schema to create entries.
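The idea of inferring a schema from a sample and then applying it without further guessing can be sketched as follows. Everything here is hypothetical and independent of the Flow API: `guessType()`, `widen()`, `inferSchema()` and `applySchema()` are illustrative names, types are plain strings, and the widening order integer < float < string is an assumption that extends the merge rules from the description.

```php
<?php

declare(strict_types=1);

// Hypothetical detector: guesses a type name from a raw CSV value.
function guessType(string $value): string
{
    if (preg_match('/^-?\d+$/', $value) === 1) {
        return 'integer';
    }
    if (preg_match('/^-?\d+\.\d+$/', $value) === 1) {
        return 'float';
    }

    return 'string';
}

// Assumed widening order: the more flexible of the two types wins.
function widen(string $left, string $right): string
{
    $rank = ['integer' => 0, 'float' => 1, 'string' => 2];

    return $rank[$left] >= $rank[$right] ? $left : $right;
}

// Guess a type per column from at most $sampleSize rows, widening on conflict.
function inferSchema(iterable $rows, int $sampleSize): array
{
    $schema = [];
    $seen = 0;
    foreach ($rows as $row) {
        if ($seen++ >= $sampleSize) {
            break; // the remaining rows are never inspected
        }
        foreach ($row as $column => $value) {
            $guess = guessType($value);
            $schema[$column] = isset($schema[$column]) ? widen($schema[$column], $guess) : $guess;
        }
    }

    return $schema;
}

// Cast a row using the known schema; no guessing happens here.
function applySchema(array $row, array $schema): array
{
    $typed = [];
    foreach ($row as $column => $value) {
        $typed[$column] = match ($schema[$column] ?? 'string') {
            'integer' => (int) $value,
            'float' => (float) $value,
            default => $value,
        };
    }

    return $typed;
}

// Made-up rows, for illustration only.
$rows = [
    ['id' => '1', 'price' => '10', 'tracking_number' => '1000234'],
    ['id' => '2', 'price' => '12.5', 'tracking_number' => 'PL-1000236'],
    ['id' => '3', 'price' => '7', 'tracking_number' => '1000237'],
];

$schema = inferSchema($rows, 2);          // sample only the first two rows
$third = applySchema($rows[2], $schema);  // third row reuses the inferred schema

var_dump($schema, $third);
```

In this sketch the third row keeps tracking_number as a string even though its own value is numeric, because the schema inferred from the sample already widened that column.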