Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A minimal implementation of part 2 : msgpack #6

Closed
wants to merge 4 commits into from

Conversation

eregon
Copy link

@eregon eregon commented Apr 21, 2014

Just enough to pass the given tests, yet considering the extension to multiple extended types.
The Hash type is certainly the most interesting one to pack/unpack.
And the use of the Enumerator feels definitely as the best way considering we never need to go back.

@practicingruby
Copy link
Member

This is very close to what I did as a rough sketch when writing the exercise, good work!

Enumerator was a huge help here, it'll be interesting to see if someone submits a solution that does not use it so we can compare.

@practicingruby
Copy link
Member

One thing I noticed in #8 that also applies here: You're converting messagepack fixed strings as if they're raw binary data, but they're specified to be in UTF-8 format. There's an example on #8 that shows the difference.

On a tangential note, I've never seen String#b before. Neat!

@eregon eregon changed the title An minimal implementation of part 2 : msgpack A minimal implementation of part 2 : msgpack Apr 30, 2014
* UTF-8 is the chosen encoding for the 'str' type
@eregon
Copy link
Author

eregon commented Apr 30, 2014

Thanks, indeed I missed to convert to UTF-8 if not already the case in pack. But this conversion to UTF-8 is not always possible and might be lossy.
Notice I already thought to convert back to UTF-8 in unpack though.
In my complete solution, much better care is taken with strings, and actually strings of any encoding can be transferred.
I wonder what is the best way to unpack a string from an Enumerator of bytes.
bytesize.times.map { bytes.next }.pack("C#{bytesize}") could work but it creates an extra Array. The way I implemented is clear but might resize a lot the String.

@eregon
Copy link
Author

eregon commented Apr 30, 2014

It is not correct to use the U directive for this problem like you mention in #8.

"Ĕ".unpack("U*") # => [276]

276 is not representable as a single byte (it actually is the first UTF-8 codepoint of that string).

@practicingruby
Copy link
Member

@eregon: Hmm... I didn't realize that's what it would do, I guess I had assumed it'd handle multibyte sequences rather than requiring conversion to codepoints but that's clearly not the case.

So what's the right way to solve this?

@practicingruby
Copy link
Member

@eregon:

When going from MessagePack to Ruby, It appears like we can use force_encoding which should be safe, because MessagePack requires UTF-8 text. Can you confirm, or recommend a better solution?

>> "Ĕ".bytes
=> [196, 148]
>> x = "".b
=> ""
>> x << 196
=> "\xC4"
>> x << 148
=> "\xC4\x94"
>> x.encode("UTF-8")
Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
    from (irb):44:in `encode'
    from (irb):44
    from /Users/seacreature/.rubies/ruby-2.1.1/bin/irb:11:in `<main>'
>> x.force_encoding("UTF-8")
=> "Ĕ"

@eregon
Copy link
Author

eregon commented Apr 30, 2014

Indeed, force_encoding is the way to set the encoding of a String, and the only way to read byte by byte correctly is to make a binary String of them. One could use String#valid_encoding? to check if the result is indeed UTF-8 or not.

When going to bytes, the obvious choice is String#bytes (especially since it returns an Array), but String#unpack('C*') should be equivalent. The String should be first encoded in UTF-8 with encode('UTF-8') of course if it is to be decoded in UTF-8.

@practicingruby
Copy link
Member

OK, I think this is how I'm going to proceed in my own solution then. thanks!

As for the rather annoying thing with getting the next N elements from an Enumerator, I haven't been able to come up with a better solution. I had hoped enum.next(N) existed in Ruby, but sadly it does not seem to. I wonder if it's worth suggesting it.

@eregon
Copy link
Author

eregon commented Apr 30, 2014

Yeah, I expected as well enum.next(N) would work. take(N) somewhat works on IO as #each is modifying the Enumerator but I have no idea how to do that for a normal Enumerator. It is likely worth discussing it if this has not already been the case.

@practicingruby
Copy link
Member

I also tried take(N) without luck. I think I understand the source of the problem. With an I/O object an external counter is maintained... the position of the read pointer in the file. No such thing exists for arrays or other collections, as iteration is non-destructive.

Here's a hack I built that works around the problem:

class PersistentEnumerable
  def initialize(collection)
    @collection = collection
  end

  def each
    counter = -1

    Enumerator.new do |y|
      loop do
        counter += 1

        break if counter == @collection.length

        y.yield(@collection[counter])
      end
    end
  end
end


pe = PersistentEnumerable.new([1,3,5,7,9,11])

enum = pe.each 

p enum.take(2) #=> [1, 3]
p enum.take(2) #=> [5, 7]

# create a new closure
enum = pe.each

p enum.take(2) #=> [1,3]

Relying on storing state in closures is a little scary, but it works!

@practicingruby
Copy link
Member

Note my original PersistentEnumerable example had mistakes. Should work now. (for some definition of work 😁)

@eregon
Copy link
Author

eregon commented May 1, 2014

Neat, quite similar to what I was thinking!
I would say relying on @collection.length not changing is more dangerous than closure state.
If you do not mind another level of Enumerator, it can be defined for every Enumerator as well:

class Enumerator
  def to_persistent
    enum = rewind
    Enumerator.new do |y|
      loop { y << enum.next }
    end
  end
end

@eregon eregon closed this Nov 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants