Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add faster implementation of _IonNature.ion_hash #22

Merged
merged 2 commits into from
Sep 3, 2021

Conversation

popematt
Copy link
Contributor

@popematt popematt commented Sep 2, 2021

Issue #, if available:

Not necessarily a fix; at least an improvement for #21

Description of changes:

New implementation of .ion_hash() that does away with the overhead of instantiating ion_writers. As part of this change, I improved the performance of the _escape() function, which has the side effect of also benefitting the ion_hash_reader and the ion_hash_writer.

I did some very informal performance testing using this script.

                      Time to hash single value             Ops/Sec                   Approx
Test Case                 (old)       (new)           (old)             (new)       Improvement
===============================================================================================
ion_hash_python_21     2734.39 μs   389.30 μs      365.71 ops/s      2568.72 ops/s       7x
int                     106.84 μs     9.36 μs     9359.53 ops/s    106820.70 ops/s      11x
bool                    100.58 μs     6.17 μs     9942.43 ops/s    162009.17 ops/s      16x
null                     98.02 μs     7.11 μs    10201.86 ops/s    140577.22 ops/s      14x
string                  104.39 μs     7.59 μs     9579.21 ops/s    131760.86 ops/s      14x
decimal                 125.09 μs    16.29 μs     7994.50 ops/s     61396.71 ops/s       8x
clob                    102.30 μs     8.38 μs     9775.37 ops/s    119359.82 ops/s      12x
timestamp               178.36 μs    40.24 μs     5606.68 ops/s     24850.82 ops/s       4x
empty_struct            139.38 μs     7.16 μs     7174.65 ops/s    139569.14 ops/s      19x
struct_one_field        307.97 μs    38.24 μs     3247.12 ops/s     26150.63 ops/s       8x
struct_few_fields       572.92 μs    75.29 μs     1745.44 ops/s     13281.10 ops/s       8x
struct_many_fields     1735.05 μs   329.23 μs      576.35 ops/s      3037.42 ops/s       5x
empty_list              118.40 μs     6.21 μs     8445.82 ops/s    161004.19 ops/s      19x
long_list              2890.12 μs   180.54 μs      346.01 ops/s      5539.09 ops/s      16x
nested_containers       487.38 μs    19.84 μs     2051.80 ops/s     50411.10 ops/s      25x
one_annotation          265.96 μs    11.36 μs     3760.01 ops/s     87993.25 ops/s      23x
many_annotations        506.17 μs    20.03 μs     1975.62 ops/s     49921.85 ops/s      25x
very_long_string        246.75 μs    10.28 μs     4052.64 ops/s     97308.19 ops/s      24x
very_long_blob          304.76 μs    15.67 μs     3281.30 ops/s     63807.71 ops/s      19x

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Copy link
Contributor

@cheqianh cheqianh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good!

Copy link

@tgregg tgregg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice.

ionhash/__init__.py Outdated Show resolved Hide resolved
Comment on lines +481 to +483
return _bytes.replace(bytes([_ESCAPE_BYTE]), bytes([_ESCAPE_BYTE, _ESCAPE_BYTE])) \
.replace(_BEGIN_MARKER, bytes([_ESCAPE_BYTE, _BEGIN_MARKER_BYTE])) \
.replace(_END_MARKER, bytes([_ESCAPE_BYTE, _END_MARKER_BYTE]))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we avoid recomputing bytes([...]) on each call by storing the values?

Copy link
Contributor Author

@popematt popematt Sep 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I have also heard that looking up non-local values can be expensive in Python, and if we want to compute once and store the value, then it would need to be a non-local value. I'll create a followup issue to see if it would improve performance.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine with me, although it does seem like a simple enough thing to try given that you already have a basic perf testing setup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants