Schema definition type #927

Merged
merged 6 commits into from
Jan 21, 2024

Conversation

Member

@norberttech norberttech commented Jan 20, 2024

Change Log

Added

  • Schema Definition::type() : Type
  • Type::merge() : Type

Fixed

Changed

  • reduced size of CSV files to speed up tests

Removed

  • Schema definition union type
  • Schema narrowing that was replaced by Type::merge

Deprecated

Security


Description

When I first worked on Schema, there was no concept of PHP\Type, so schema definitions relied only on Entry::class. Later we introduced Types, which are more precise.
Thanks to exposing Types through Definitions, we can also get rid of the recently added Schema::narrowing, as this logic was taken over by Type::merge(Type) : Type

We now have 3 mechanisms in place that deal with types:

  • Flow\ETL\PHP\Type\TypeDetector - detects the Type instance based on a variable; the main goal is to properly identify logical types. This class does not turn "2023-01-01" into a DateTime object, it will still detect it as a string.
  • Flow\ETL\Row\Factory\NativeEntryFactory - the goal of this class is to use TypeDetector or Schema (this is not yet fully implemented) and convert values into Entries. This class also does not autocast anything. If NativeEntryFactory did that, it would be impossible to properly use the Cast ScalarFunction, because function results go through NativeEntryFactory before they are turned into entries, which would end up in an infinite cycle of casting strings to DateTime objects.
  • Flow\ETL\Transformer\AutoCastTransformer - this is the class responsible for automatically detecting types based on data patterns. I need to extract that logic into an AutoCaster class.
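
The split between detecting a type and auto-casting a value can be illustrated with a minimal, self-contained sketch. The functions `detectType()` and `autoCast()` below are hypothetical stand-ins written for this example, not Flow's actual API, and the two patterns they recognise are an assumption chosen to keep the sketch short.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for TypeDetector: reports the type of a value as it is,
// without looking at the content of strings.
function detectType(mixed $value): string
{
    return match (true) {
        $value === null => 'null',
        is_int($value) => 'integer',
        is_float($value) => 'float',
        is_bool($value) => 'boolean',
        is_string($value) => 'string',
        is_array($value) => 'array',
        default => 'object',
    };
}

// Hypothetical stand-in for the auto-casting step: inspects string patterns
// and converts the value to a stricter PHP type when one matches.
function autoCast(mixed $value): mixed
{
    if (!is_string($value)) {
        return $value;
    }

    // Digits only (optionally signed): cast to integer.
    if (preg_match('/^-?\d+$/', $value) === 1) {
        return (int) $value;
    }

    // YYYY-MM-DD shape: cast to a date object (calendar validity is not checked here).
    if (preg_match('/^\d{4}-\d{2}-\d{2}$/', $value) === 1) {
        return new DateTimeImmutable($value);
    }

    return $value;
}

echo detectType('2023-01-01'), PHP_EOL;           // string
echo detectType(autoCast('123456')), PHP_EOL;     // integer
echo get_class(autoCast('2023-01-01')), PHP_EOL;  // DateTimeImmutable
```

Keeping the two steps separate means detection alone never changes a value; casting only happens when it is requested explicitly.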

Now those 3 things can be used in the following way:

  • extract data from, let's say, a CSV file (here TypeDetector will detect each value as either String or Null)
  • NativeEntryFactory will create mostly StringEntries (with some nulls)
  • then we can use ->autoCast(), which will automatically turn those StringEntries into stricter types.

However, autocasting is not aware of the entire column context, so if we have something like this:

id,tracking_number
1,123456
2,123456
3,A12345

The first two rows will get tracking_number as an IntegerEntry and the last one as a StringEntry.
This is where Schema::merge, or rather Definition::merge, comes into play.

Whenever we merge two different schema definitions for the same entry, we try to merge them into a more flexible type, for example:

  • Integer and String will be merged into String, because all integers can be represented as strings but not all strings can be represented as integers.
  • Integer and Float will get merged into Float
  • Map and List will get merged into Array
  • etc...
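
As a rough sketch of the rules listed above, merging can be modelled as a function over a pair of type names. `mergeTypes()` is a hypothetical helper that works on plain strings rather than Flow's Type objects, and it covers only the three pairs named in the list; any other combination throws, which is an assumption of this sketch and not necessarily what Type::merge does.

```php
<?php

declare(strict_types=1);

// Hypothetical merge rules mirroring the list above.
// Type names are plain strings here, not Flow's Type objects.
function mergeTypes(string $left, string $right): string
{
    if ($left === $right) {
        return $left;
    }

    // Sort the pair so the order of arguments does not matter.
    $pair = [$left, $right];
    sort($pair);

    return match ($pair) {
        ['integer', 'string'] => 'string',
        ['float', 'integer'] => 'float',
        ['list', 'map'] => 'array',
        default => throw new InvalidArgumentException("Cannot merge {$left} with {$right}"),
    };
}

// Types detected per row for tracking_number: 123456, 123456, A12345.
$detected = ['integer', 'integer', 'string'];

// Fold the per-row types into a single column type.
$columnType = array_reduce(array_slice($detected, 1), 'mergeTypes', $detected[0]);

echo $columnType, PHP_EOL; // string
```

Folding the per-row types this way gives the tracking_number column from the CSV sample a single String type, even though two of its three values look like integers.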

All of those things give us something called Schema Inferring: we iterate through the CSV file trying to guess the schema based on the given data, then narrow that schema so that later, while reading the file again, we can apply it and avoid guessing again. It's especially useful when, in order to infer the schema, we do not go through the entire file but, let's say, only the first 1000 rows.
The bigger the dataset, the more beneficial schema inferring becomes. If we get the proper schema from 1k rows and we need to process 20k rows, the remaining 19k will not need to go through guessing and type checking, since the native entry factory will just use the provided schema to create entries.
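
The infer-once, apply-later idea can be sketched as follows. All three functions are hypothetical and far simpler than Flow's implementation: `guessType()` knows only integers and strings, and conflicting guesses are widened to string, the single merge rule modelled here.

```php
<?php

declare(strict_types=1);

// Hypothetical per-value guess: only integers and strings are recognised.
function guessType(string $value): string
{
    return preg_match('/^-?\d+$/', $value) === 1 ? 'integer' : 'string';
}

// Guess a column => type map from the first $sampleSize rows only.
// Conflicting guesses for a column widen it to 'string'.
function inferSchema(array $rows, int $sampleSize): array
{
    $schema = [];

    foreach (array_slice($rows, 0, $sampleSize) as $row) {
        foreach ($row as $column => $value) {
            $guess = guessType($value);
            $schema[$column] = ($schema[$column] ?? $guess) === $guess ? $guess : 'string';
        }
    }

    return $schema;
}

// Apply a known schema without inspecting value patterns again.
function applySchema(array $row, array $schema): array
{
    $typed = [];

    foreach ($row as $column => $value) {
        $typed[$column] = $schema[$column] === 'integer' ? (int) $value : $value;
    }

    return $typed;
}

$rows = [
    ['id' => '1', 'tracking_number' => '123456'],
    ['id' => '2', 'tracking_number' => '123456'],
    ['id' => '3', 'tracking_number' => 'A12345'],
];

$schema = inferSchema($rows, 3);

var_dump($schema === ['id' => 'integer', 'tracking_number' => 'string']);                // bool(true)
var_dump(applySchema($rows[0], $schema) === ['id' => 1, 'tracking_number' => '123456']); // bool(true)
```

In this sketch a sample of only two rows would infer tracking_number as integer, so the sample has to be large enough to be representative of the column.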

Contributor

github-actions bot commented Jan 20, 2024

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak         | mode             | rstdev          |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
| AvroExtractorBench    | bench_extract_10k | 1    | 3   | 35.110mb +0.08%  | 818.409ms +0.50% | ±0.58% -27.18%  |
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 4.839mb +0.28%   | 340.877ms +0.46% | ±0.44% -18.01%  |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 4.909mb +0.49%   | 1.051s +0.23%    | ±1.19% -8.78%   |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 239.612mb +0.01% | 1.248s +0.64%    | ±1.98% +475.29% |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 4.670mb +0.11%   | 35.443ms +2.21%  | ±0.40% -19.87%  |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 4.672mb +0.11%   | 429.377ms -1.25% | ±0.89% +220.02% |
+-----------------------+-------------------+------+-----+------------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev           |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 110.385mb +0.01% | 65.016ms +1.85% | ±3.12% +2233.25% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+------------------+
Loaders
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode             | rstdev          |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
| AvroLoaderBench    | bench_load_10k | 1    | 3   | 95.490mb +0.03%  | 467.875ms +1.14% | ±0.32% -78.26%  |
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 53.913mb +0.03%  | 73.541ms +2.42%  | ±2.46% -8.14%   |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 106.325mb +0.08% | 52.399ms +0.25%  | ±1.04% +800.76% |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 321.534mb +0.03% | 1.299s +4.01%    | ±0.85% +41.71%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 17.731mb +0.06%  | 41.239ms -0.34%  | ±0.50% -39.34%  |
+--------------------+----------------+------+-----+------------------+------------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev           |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.497mb +0.02% | 482.191ms -2.77% | ±1.10% +69.69%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 59.975mb +0.04%  | 246.264ms +2.41% | ±0.49% +34.13%   |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.908mb +0.16%  | 52.506ms +3.64%  | ±1.65% +167.39%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.910mb +0.03%  | 432.337ms +0.69% | ±0.07% -90.66%   |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.449mb +0.13%  | 85.952ms +1.30%  | ±0.31% +1075.24% |
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 76.465mb +0.01%  | 3.393ms +3.41%   | ±2.27% +6.01%    |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 96.258mb +0.01%  | 190.178ms +1.83% | ±0.86% +185.48%  |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 74.784mb +0.01%  | 19.559ms +4.32%  | ±1.54% +144.68%  |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 77.705mb +0.01%  | 1.606ms -3.74%   | ±2.46% +163.00%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 77.705mb +0.01%  | 1.627ms -2.51%   | ±2.17% +63.73%   |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 74.817mb +0.01%  | 2.579ms +0.62%   | ±0.60% -14.68%   |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 75.346mb +0.01%  | 14.861ms -0.77%  | ±0.90% -66.74%   |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 75.346mb +0.01%  | 14.840ms +3.40%  | ±0.73% -16.06%   |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 73.249mb +0.01%  | 1.706μs -4.90%   | ±2.72% +1.92%    |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 73.249mb +0.01%  | 0.300μs -25.00%  | ±0.00% -100.00%  |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 86.805mb +0.09%  | 12.905ms +2.01%  | ±0.96% -3.50%    |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 116.165mb +0.01% | 64.469ms +1.11%  | ±1.36% +9.08%    |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 75.866mb +0.01%  | 1.522ms -4.27%   | ±2.01% -8.08%    |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 79.213mb +0.01%  | 58.586ms +0.18%  | ±0.42% -57.65%   |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 77.967mb +0.01%  | 4.232ms +11.40%  | ±3.31% +277.74%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 73.394mb +0.01%  | 39.926ms -1.23%  | ±1.54% +172.08%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 73.395mb +0.01%  | 40.579ms +0.44%  | ±0.26% -92.32%   |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 73.394mb +0.01%  | 40.775ms -0.13%  | ±0.88% -29.65%   |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 75.691mb +0.01%  | 7.371ms +0.56%   | ±0.57% +40.43%   |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 73.249mb +0.01%  | 29.258ms +0.38%  | ±0.96% +113.82%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 73.249mb +0.01%  | 13.754μs +2.55%  | ±3.25% +364.33%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 73.249mb +0.01%  | 15.612μs -1.23%  | ±0.60% +102.13%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 96.259mb +0.01%  | 185.998ms -2.88% | ±0.48% -7.64%    |
+-------------------------+----------------------------+------+-----+------------------+------------------+------------------+

@norberttech norberttech merged commit e81e228 into flow-php:1.x Jan 21, 2024
20 checks passed