Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Computing Ion hash is slow for larger Ion inputs #21

Open
desaikd opened this issue Aug 20, 2021 · 1 comment
Open

Computing Ion hash is slow for larger Ion inputs #21

desaikd opened this issue Aug 20, 2021 · 1 comment

Comments

@desaikd
Copy link

desaikd commented Aug 20, 2021

Reference issue: awslabs/amazon-qldb-driver-python#67

For the following document, ion_hash takes ~2.5 ms and native hash takes < 0.01 ms.

$ion_1_0 { id:"0001", type:"donut", name:"Cake", ppu:0.55, batters:{ batter:[ { id:"1001", type:"Regular" }, { id:"1002", type:"Chocolate" }, { id:"1003", type:"Blueberry" }, { id:"1004", type:"Devil's Food" } ] }, topping:[ { id:"5001", type:"None" }, { id:"5002", type:"Glazed" }, { id:"5005", type:"Sugar" }, { id:"5007", type:"Powdered Sugar" }, { id:"5006", type:"Chocolate with Sprinkles" }, { id:"5003", type:"Chocolate" }, { id:"5004", type:"Maple" } ] }

ion_hash function internally uses the reader and writer from ion-python implementation, it might be worth trying out new C extension work around ion-python which improves performance of reader and writer.

@popematt
Copy link
Contributor

popematt commented Aug 25, 2021

I've investigated the possibility of using the C extension for ion-hash-python, and it does not seem to be possible—here's why.

Currently, ion-hash-python works overall like this (source code is from here, comments added by me):

    # Creates a custom ion_writer that keeps a running hash value
    hw = hash_writer(blocking_writer(binary_writer(), BytesIO()), hfp)
    # Writes the Ion value to the custom ion_writer
    _dump(self, hw, _FROM_TYPE)
    hw.send(ION_STREAM_END_EVENT)
    # Signals the custom writer to output the current hash value
    return hw.send(HashEvent.DIGEST)

The _dump function has a pure Python implementation, so we would need to switch this code to use dump in order to use the C extension. However, their signatures are not compatible. The dump function requires an Ion value and a file-object, whereas _dump requires an Ion value and an ion_writer.

In theory, one could create a custom file-object that operates like the custom ion_writer—intercepting the data as it is being written and writing the hash of the data instead of the actual value. However, that relies on the assumption that values are written to the file-object one at a time, which is not necessarily true if there is any buffering going on. Even if the assumption were to hold, the custom file-object logic would have to read the values in order to track whether it is in any containers and to rewrite the ion binary values to their respective hashed values. I am also not sure if this is reasonable because the C code would need to be able to call the functions of our Python-code, custom file-object. Finally, given that the data would need to be read and rewritten (in Python code), it seems unlikely that this approach would result in a useful performance improvement.

The other, more reasonable option (in my opinion) is to call ion-hash-go from ion-hash-python (possibly using gopy). As long as we only need to delegate the ion_hash function, it should be simple to implement. If distributing a package with both Python and Go proves to be too challenging, we could create an ion-hash-c implementation that the ion-hash-python library can leverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants