Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UniqueFactory for creating random string or int from given range #1128

Merged
merged 8 commits into from
Jul 31, 2024

Conversation

mleczakm
Copy link
Contributor

Change Log

Added

  • UniqueFactory as source of random ints/strings with given length

Fixed

Changed

Removed

Deprecated

Security


Description

Solves #1124

Copy link
Contributor

github-actions bot commented Jul 20, 2024

Flow PHP - Benchmarks

Results of the benchmarks from this PR are compared with the results from 1.x branch.

Extractors
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| benchmark             | subject           | revs | its | mem_peak        | mode             | rstdev          |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
| CSVExtractorBench     | bench_extract_10k | 1    | 3   | 3.937mb +0.21%  | 510.985ms +0.64% | ±2.89% +482.09% |
| JsonExtractorBench    | bench_extract_10k | 1    | 3   | 3.969mb +0.21%  | 1.058s -1.14%    | ±0.80% +16.37%  |
| ParquetExtractorBench | bench_extract_10k | 1    | 3   | 28.530mb +0.03% | 422.100ms -3.32% | ±0.57% +11.36%  |
| TextExtractorBench    | bench_extract_10k | 1    | 3   | 3.696mb +0.22%  | 34.270ms +1.36%  | ±2.14% +137.70% |
| XmlExtractorBench     | bench_extract_10k | 1    | 3   | 3.643mb +0.22%  | 430.277ms -0.67% | ±0.29% -67.77%  |
+-----------------------+-------------------+------+-----+-----------------+------------------+-----------------+
Transformers
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| benchmark                   | subject                  | revs | its | mem_peak         | mode            | rstdev         |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
| RenameEntryTransformerBench | bench_transform_10k_rows | 1    | 3   | 116.038mb +0.01% | 59.390ms -0.97% | ±0.37% -78.93% |
+-----------------------------+--------------------------+------+-----+------------------+-----------------+----------------+
Loaders
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| benchmark          | subject        | revs | its | mem_peak         | mode            | rstdev          |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
| CSVLoaderBench     | bench_load_10k | 1    | 3   | 54.159mb +0.02%  | 83.817ms -4.08% | ±3.62% +59.10%  |
| JsonLoaderBench    | bench_load_10k | 1    | 3   | 106.526mb +0.01% | 51.253ms -2.13% | ±0.54% +2.92%   |
| ParquetLoaderBench | bench_load_10k | 1    | 3   | 123.808mb +0.01% | 1.230s -0.32%   | ±0.61% -79.90%  |
| TextLoaderBench    | bench_load_10k | 1    | 3   | 16.953mb +0.05%  | 43.739ms -2.65% | ±1.10% +452.90% |
+--------------------+----------------+------+-----+------------------+-----------------+-----------------+
Building Blocks
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| benchmark               | subject                    | revs | its | mem_peak         | mode             | rstdev          |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 59.711mb +0.01%  | 426.753ms -0.44% | ±0.28% -50.46%  |
| TypeDetectorBench       | bench_type_detector        | 1    | 3   | 14.250mb +0.03%  | 85.273ms +0.65%  | ±1.05% +64.91%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 116.529mb +0.00% | 498.090ms -5.32% | ±0.91% -70.48%  |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 60.007mb +0.01%  | 248.786ms -1.16% | ±2.15% +466.60% |
| NativeEntryFactoryBench | bench_entry_factory        | 1    | 3   | 14.941mb +0.03%  | 54.709ms +0.60%  | ±1.50% +859.45% |
| RowsBench               | bench_chunk_10_on_10k      | 2    | 3   | 86.800mb +0.00%  | 3.226ms -6.76%   | ±3.19% +27.24%  |
| RowsBench               | bench_diff_left_1k_on_10k  | 2    | 3   | 102.397mb +0.00% | 190.432ms +0.20% | ±1.01% -0.81%   |
| RowsBench               | bench_diff_right_1k_on_10k | 2    | 3   | 85.117mb +0.00%  | 19.307ms +3.46%  | ±1.69% +8.97%   |
| RowsBench               | bench_drop_1k_on_10k       | 2    | 3   | 88.040mb +0.00%  | 1.578ms -4.87%   | ±1.15% -63.34%  |
| RowsBench               | bench_drop_right_1k_on_10k | 2    | 3   | 88.040mb +0.00%  | 1.635ms -9.12%   | ±1.52% +60.95%  |
| RowsBench               | bench_entries_on_10k       | 2    | 3   | 85.152mb +0.00%  | 2.497ms -4.79%   | ±0.43% -50.92%  |
| RowsBench               | bench_filter_on_10k        | 2    | 3   | 85.681mb +0.00%  | 14.853ms -0.19%  | ±2.46% +121.71% |
| RowsBench               | bench_find_on_10k          | 2    | 3   | 85.681mb +0.00%  | 14.606ms -2.27%  | ±1.11% -4.98%   |
| RowsBench               | bench_find_one_on_10k      | 10   | 3   | 83.584mb +0.00%  | 1.606μs -11.07%  | ±2.89% +12.24%  |
| RowsBench               | bench_first_on_10k         | 10   | 3   | 83.584mb +0.00%  | 0.300μs -25.00%  | ±0.00% -100.00% |
| RowsBench               | bench_flat_map_on_1k       | 2    | 3   | 92.935mb +0.00%  | 12.098ms -0.69%  | ±1.19% +186.43% |
| RowsBench               | bench_map_on_10k           | 2    | 3   | 122.305mb +0.00% | 60.230ms -2.24%  | ±0.35% -79.31%  |
| RowsBench               | bench_merge_1k_on_10k      | 2    | 3   | 86.200mb +0.00%  | 1.231ms +0.60%   | ±1.89% +6.74%   |
| RowsBench               | bench_partition_by_on_10k  | 2    | 3   | 89.546mb +0.00%  | 62.001ms -4.43%  | ±1.21% +92.18%  |
| RowsBench               | bench_remove_on_10k        | 2    | 3   | 88.302mb +0.00%  | 3.825ms -1.45%   | ±0.64% -59.05%  |
| RowsBench               | bench_sort_asc_on_1k       | 2    | 3   | 83.728mb +0.00%  | 39.437ms -3.62%  | ±0.53% -74.37%  |
| RowsBench               | bench_sort_by_on_1k        | 2    | 3   | 83.729mb +0.00%  | 38.985ms -2.36%  | ±1.76% +57.43%  |
| RowsBench               | bench_sort_desc_on_1k      | 2    | 3   | 83.728mb +0.00%  | 40.850ms -0.26%  | ±1.75% +31.82%  |
| RowsBench               | bench_sort_entries_on_1k   | 2    | 3   | 86.026mb +0.00%  | 7.344ms -0.05%   | ±0.58% -44.39%  |
| RowsBench               | bench_sort_on_1k           | 2    | 3   | 83.584mb +0.00%  | 29.063ms -1.90%  | ±0.55% -56.78%  |
| RowsBench               | bench_take_1k_on_10k       | 10   | 3   | 83.584mb +0.00%  | 13.288μs -1.37%  | ±1.95% -43.91%  |
| RowsBench               | bench_take_right_1k_on_10k | 10   | 3   | 83.584mb +0.00%  | 16.010μs +0.44%  | ±2.76% -12.17%  |
| RowsBench               | bench_unique_on_1k         | 2    | 3   | 102.399mb +0.00% | 194.288ms +0.24% | ±0.40% -49.17%  |
+-------------------------+----------------------------+------+-----+------------------+------------------+-----------------+

src/core/etl/src/Flow/ETL/UniqueFactory.php Outdated Show resolved Hide resolved
src/core/etl/src/Flow/ETL/UniqueFactory.php Outdated Show resolved Hide resolved
src/core/etl/src/Flow/ETL/UniqueFactory.php Outdated Show resolved Hide resolved
@@ -163,13 +163,13 @@ public function test_partition_by_partitions_order() : void
function (int $i) : array {
$data = [];

$maxItems = \random_int(2, 10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More I look at this, more I think I would add new function for unique int/string to "hide" object call & improve DX perspective, WDYT?

@norberttech
Copy link
Member

norberttech commented Jul 21, 2024

The PR look good and we could merge it like that (beside few small things @stloyd pointed out) but I think we might be able to still improve even more. Here are my thoughts.

Naming - now when I look at it something is missing in the name, like RandomValueGenerator would be more descriptive.

Dependency Injection - I think I would turn RandomValueGenerator into an interface with NativePHPRandomValueGenerator that we can simply inject into objects that would need a unique values (I'm thinking about caching strategies here and more advanced data processing algorithms).

Scalar Functions - Additionally (can go to a standalone PR) we could also create Flow scalar functions that could be used during data processing, like:

df()
   ->read(...) 
   ->withEntry('random_string', random_string(length: 16))
   ->write(to_output())
   ->run()

df()
   ->read(...) 
   ->withEntry('random_string', random_string(length: ref('some_int_column'))))
   ->write(to_output())
   ->run()

This function would implement Flow\ETL\Function\ScalarFunction and could take $lenght as another ScalarFunction so this way length could be taken from a column or output of chained functions (some calculations). Additionally it would also need RandomValueGenerator $generator = new NatievPHPRandomValueGenerator() in case anyone would like to change the generation strategy.

DSL Functions - as pointed by @stloyd we probably should also add generate_random_string(int $length = 16) and generate_random_int(int $start = PHP_INT_MIN, int $end = PHP_INT_MAX, RandomValueGenerator $generator = new NatievPHPRandomValueGenerator()) functions so we can use them in tests instead of static factories.

Thoughts?

Copy link
Member

@norberttech norberttech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes described above are required to merge this PR :)

@github-actions github-actions bot added size: L and removed size: M labels Jul 31, 2024
@norberttech norberttech merged commit 63ef619 into flow-php:1.x Jul 31, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants