Adds support for decoding floating-point typed arrays from RFC8746 #111

Open
wants to merge 1 commit into
base: master


@tgockel tgockel commented May 25, 2021

This adds support for decoding arrays of floating point numbers of IEEE 754 formats binary16, binary32, and binary64 in both the big- and little-endian form.


If this looks good, we can add unsigned and signed integers using the same general ideas...and also encoders for these special markers.
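
As a rough illustration of what decoding these arrays involves (not the code in this PR), here is a minimal sketch built on the standard `struct` module. The tag numbers are the floating-point typed-array tags from RFC 8746; the table name `FLOAT_TAGS` and the helper `decode_float_typed_array` are hypothetical names made up for this example, and binary128 is left out.

```python
import struct

# RFC 8746 tag number -> (struct byte-order prefix, struct format char, item size).
# Only binary16/32/64 are listed; binary128 (tags 83 and 87) is omitted.
FLOAT_TAGS = {
    80: (">", "e", 2),  # binary16, big-endian
    81: (">", "f", 4),  # binary32, big-endian
    82: (">", "d", 8),  # binary64, big-endian
    84: ("<", "e", 2),  # binary16, little-endian
    85: ("<", "f", 4),  # binary32, little-endian
    86: ("<", "d", 8),  # binary64, little-endian
}


def decode_float_typed_array(tag, payload):
    """Hypothetical helper: decode a tagged byte string into a tuple of floats."""
    order, code, size = FLOAT_TAGS[tag]
    count, remainder = divmod(len(payload), size)
    if remainder:
        raise ValueError("payload length is not a multiple of the item size")
    # e.g. "<2e" unpacks two little-endian half-precision values.
    return struct.unpack(f"{order}{count}{code}", payload)
```

Returning a tuple keeps the result immutable and hashable, at the cost of converting every element up front.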

@tgockel tgockel changed the title Work-in-progress: Adds support for typed arrays from RFC8746 Adds support for decoding floating-point typed arrays from RFC8746 May 26, 2021
@Sekenre
Collaborator

Sekenre commented May 28, 2021

Hi @tgockel, thanks for doing this. I did some experiments a while back decoding typed arrays into Python array.array types. I think that might be faster. It also lets you do a round-trip:

Sekenre@a117ad3

This is related to #32 and is maybe a simple way to handle it without needing numpy as a dependency.

Let me know what you think, I'm open to suggestions.
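
For readers who have not used `array.array`, the round-trip idea might look roughly like the sketch below. It is not the code from the linked commit: the two helper names are hypothetical, and it assumes the payload is already in native byte order and that typecode `'d'` is 8 bytes wide on the platform.

```python
from array import array


def decode_native_doubles(payload):
    """Hypothetical helper: load native-endian binary64 bytes into an array."""
    values = array("d")
    values.frombytes(payload)  # raises ValueError if the length is not a multiple of 8
    return values


def encode_native_doubles(values):
    """Hypothetical helper: the inverse, a plain copy of the underlying buffer."""
    return values.tobytes()


payload = array("d", [1.5, 2.5, 3.14]).tobytes()
decoded = decode_native_doubles(payload)
```

Because the array stores the machine representation directly, decoding and encoding are bulk copies rather than per-element conversions.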

@coveralls

coveralls commented May 28, 2021

Coverage Status

Coverage decreased (-0.3%) to 96.892% when pulling 70e6f6c on tgockel:rfc8746 into 9f30439 on agronholm:master.

@tgockel
Author

tgockel commented May 31, 2021

I had never seen the array module before, but it definitely seems like the right approach instead of the weird struct trickery I did. Unfortunately, array.array doesn't support half-precision floats, but I updated the single- and double-precision floating-point decoding to use it.

The biggest issue I see is immutability -- array.array does not have a convenient method like numpy's array.setflags(write=False) for this. I left comments with TODO(tgockel/111) for this, but I don't know an elegant way to address this one.

@Sekenre
Collaborator

Sekenre commented Jun 2, 2021

> The biggest issue I see is immutability -- array.array does not have a convenient method like numpy's array.setflags(write=False) for this. I left comments with TODO(tgockel/111) for this, but I don't know an elegant way to address this one.

If you want it to be immutable, you can wrap the bytes in a memoryview and then cast it, like this:

>>> my_array = memoryview(b'\x1f\x85\xebQ\xb8\x1e\t@').cast('d')
>>> assert my_array[0] == 3.14
>>> my_array[0] = 2.16
Traceback (most recent call last):
  File "<pyshell#57>", line 1, in <module>
    my_array[0] = 2.16
TypeError: cannot modify read-only memory

@tgockel
Author

tgockel commented Jun 2, 2021

That unfortunately doesn't work, because the ultimate point of making this read-only is so that it can be used as a key in a dictionary, and memoryview hashing has a shortcoming:

ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'
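
A small sketch that reproduces the restriction, reusing the 8 bytes from the memoryview example earlier in the thread. The variable names are illustrative only.

```python
raw = b"\x1f\x85\xebQ\xb8\x1e\t@"  # the 8 bytes used in the earlier example

byte_view = memoryview(raw)        # format 'B', read-only: hashable
double_view = byte_view.cast("d")  # format 'd', read-only: not hashable

# A read-only byte view hashes like the bytes it wraps.
bytes_hash_matches = hash(byte_view) == hash(raw)

try:
    hash(double_view)
except ValueError as exc:
    hash_error = str(exc)  # the "hashing is restricted to formats ..." message
else:
    hash_error = None
```

So a cast view gives read-only element access, but it cannot serve as a dictionary key without converting it to something else first.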

@Sekenre
Collaborator

Sekenre commented Jun 5, 2021

I tried writing a little class to represent a float16 array instead of converting to a list of floats and posted it here: https://codereview.stackexchange.com/q/261573/243247. This lets you write an encoder that can just copy the underlying buffer into the output. This could be added to cbor2.types.
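
To give a feel for the idea without following the link, here is a minimal sketch of such a wrapper. It is not the code posted on Code Review: the class name, constructor signature, and `tobytes` method are assumptions chosen for illustration.

```python
import struct


class Float16Array:
    """Hypothetical read-only view over packed binary16 values."""

    def __init__(self, data, byteorder="<"):
        if len(data) % 2:
            raise ValueError("binary16 data must have an even length")
        self._data = bytes(data)  # immutable copy of the buffer
        self._order = byteorder   # "<" little-endian or ">" big-endian

    def __len__(self):
        return len(self._data) // 2

    def __getitem__(self, index):
        if not -len(self) <= index < len(self):
            raise IndexError("index out of range")
        index %= len(self)  # map negative indices onto 0..len-1
        # Convert a single element on demand instead of building a list.
        return struct.unpack_from(self._order + "e", self._data, index * 2)[0]

    def tobytes(self):
        # An encoder could write this buffer out unchanged.
        return self._data
```

Elements are converted lazily, so decoding is cheap and re-encoding is a buffer copy. Hashing and equality are deliberately left undefined here.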

@tgockel
Author

tgockel commented Jun 6, 2021

There's an interesting question on hashing -- should the endianness of the generated source affect hashing? Let's say an x86 machine and an AArch64 machine both generate [1.5, 2.5] and encode it as a half-precision typed array...let's call them arr_le and arr_be. Should hash(arr_le) == hash(arr_be) hold? What about hash((1.5, 2.5))? I think a user would expect all 3 hashes to be equal.

This gets even more hairy when we get into integer vs. float comparisons. In Python, hash(2) == hash(2.0). Per the documentation of hash:

> Numeric values that compare equal have the same hash value (even if they are of different types, as is the case for 1 and 1.0).

This extends to tuples, as hash((2, 3, 4)) == hash((2.0, 3.0, 4.0)).

I'm not sure there is a good answer here. My solution of calling tuple(input) has the disadvantage of poor performance, but it only happens when a typed array is used as a key to a map, which I don't think happens all that frequently in the world.
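
A quick sketch of why the tuple conversion gives the expected behaviour. The names `arr_le` and `arr_be` follow the comment above; here they hold raw payload bytes produced with `struct`, which is an assumption made for the example.

```python
import struct

arr_le = struct.pack("<2e", 1.5, 2.5)  # little-endian binary16 payload
arr_be = struct.pack(">2e", 1.5, 2.5)  # big-endian binary16 payload

# The raw payloads differ, so hashing the bytes directly would disagree.
payloads_differ = arr_le != arr_be

# struct.unpack returns tuples of Python floats, which compare and hash by value.
key_le = struct.unpack("<2e", arr_le)
key_be = struct.unpack(">2e", arr_be)

# Either key finds an entry stored under the other.
lookup = {key_le: "found"}
```

The cost is one float object per element, paid only when the array is actually used as a map key.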

@Sekenre
Collaborator

Sekenre commented Jun 9, 2021

> should the endianness of the generated source affect hashing?

IMO: No, it should not. Foreign-endian data should always be converted to native-endian prior to hashing, and each platform should write arrays in its native format, since the array can always be unambiguously tagged as such.
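
One way this normalisation could look with `array.array`, as a sketch under stated assumptions: the helper name `to_native_doubles` and its `byteorder` parameter are hypothetical, and typecode `'d'` is assumed to be 8 bytes wide.

```python
import struct
import sys
from array import array


def to_native_doubles(payload, byteorder):
    """Hypothetical helper: byteorder is "little" or "big", as in sys.byteorder."""
    values = array("d")
    values.frombytes(payload)
    if byteorder != sys.byteorder:
        values.byteswap()  # in-place swap of each 8-byte item
    return values


le = to_native_doubles(struct.pack("<2d", 1.5, 2.5), "little")
be = to_native_doubles(struct.pack(">2d", 1.5, 2.5), "big")
```

After normalisation both arrays hold identical native bytes, so a hash computed over `tobytes()` would agree across sources. It would still differ from the hash of the equivalent tuple.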

> This extends to tuples, as hash((2, 3, 4)) == hash((2.0, 3.0, 4.0))

Does that hashing behaviour hold true for numpy 1d arrays? Would it just be easier to require numpy for handling these?

@tgockel
Author

tgockel commented Jun 14, 2021

numpy arrays avoid the problem by not being hashable.

@escherstair

@tgockel @Sekenre do you have plans to merge this pull request?
Typed arrays are exactly the feature I'm missing.

@brendan-simon-indt

Bump. Any movement on getting various floating point formats encoded with CBOR?

@agronholm
Owner

> Bump. Any movement on getting various floating point formats encoded with CBOR?

The problem with immutability/hashability has not been solved yet. If you want this faster, participate in the process of finding solutions.

@brendan-simon-indt

> > Bump. Any movement on getting various floating point formats encoded with CBOR?
>
> The problem with immutability/hashability has not been solved yet. If you want this faster, participate in the process of finding solutions.

I found a solution that works for me: casting to np.floatX, then back to float, then using canonical=True when encoding.

value_to_encode = float( np.float16( value ) )
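
The same rounding can be sketched without numpy by packing and unpacking with the `struct` module's half-precision format. The helper name `round_to_half` is hypothetical. The assumption is that an encoder which picks the shortest float representation can then use the 2-byte form, because the rounded value survives the round trip exactly.

```python
import struct


def round_to_half(value):
    """Hypothetical helper: round a Python float to the nearest binary16 value."""
    # Packing to 2 bytes and unpacking again discards the extra precision.
    # struct raises OverflowError if the value is too large for binary16.
    return struct.unpack("<e", struct.pack("<e", value))[0]


value_to_encode = round_to_half(3.14)
```

Note that this changes the value being sent (3.14 becomes 3.140625), so it only suits data where half precision is acceptable.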
