Seek to frame numbers #19

Open
bebehei opened this issue Sep 16, 2024 · 2 comments
Labels: enhancement (New feature or request)

bebehei commented Sep 16, 2024

I'm currently testing the zstd seekable format. What I'm missing is a way to seek to a specific frame rather than to a byte position.

from pyzstd import SeekableZstdFile, ZstdFile

FILENAME="framed.zst"

with SeekableZstdFile(FILENAME, 'w') as filewriter:
    filewriter.write(b"first_frame")
    filewriter.flush(mode=ZstdFile.FLUSH_FRAME)
    filewriter.write(b"second_frame")
    filewriter.flush(mode=ZstdFile.FLUSH_FRAME)
    filewriter.write(b"third_frame")
    filewriter.close()

with SeekableZstdFile(FILENAME, 'r') as filereader:
    # Feature Request: Seek to frame numbers
    filereader.seek_to_frame(-1)
    # Prints out `third_frame`
    print(filereader.read())

I've got a big file (about 600GB uncompressed), which I read and segment logically into zstd frames via SeekableZstdFile.flush as written in the example above.

After all data is compressed, I flush once more and then append an index of every segment, with its metadata, as an additional frame. This index is just JSON-serialized text.
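
For illustration, the index frame described above could be built roughly like this. It is a minimal sketch using only the standard library; the helper name `build_index` and the field names (`name`, `frame`, `size`) are hypothetical and not taken from the actual program.

```python
import json


def build_index(segments):
    """Serialize segment metadata into the bytes for the final index frame.

    `segments` is an iterable of (name, uncompressed_size) pairs, in the
    order their frames were written. The JSON layout is an assumption.
    """
    index = [
        {"name": name, "frame": frame_no, "size": size}
        for frame_no, (name, size) in enumerate(segments)
    ]
    return json.dumps(index).encode("utf-8")


# Segment sizes match the three payloads from the example above.
index_bytes = build_index([("first", 11), ("second", 12), ("third", 11)])
```

The resulting bytes would be written with `filewriter.write()` after a final `flush(mode=ZstdFile.FLUSH_FRAME)`, so that the index ends up in a frame of its own.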

To find that metadata JSON, the program currently has to do an awkward binary search. Since the index already lives in its own frame, simply jumping to the last frame would find it in O(1).


zstd exposes the frame data in its .h files too: https://github.com/facebook/zstd/blob/20707e3718ee14250fb8a44b3bf023ea36bd88df/contrib/seekable_format/zstd_seekable.h#L199-L220

Is it possible to expose this in the Python module, too?
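
To make the O(1) argument concrete: the seekable format's seek table records the compressed and decompressed size of every frame, so a frame's start offset is simply the sum of the decompressed sizes before it. The following is a plain-Python sketch of that idea, not pyzstd's implementation; `frame_start_offsets` is a hypothetical name.

```python
from itertools import accumulate


def frame_start_offsets(decompressed_sizes):
    """Return the decompressed start offset of each frame."""
    # Prefix sums: frame i starts where frames 0..i-1 end.
    return [0, *accumulate(decompressed_sizes)][:-1]


# Sizes of b"first_frame", b"second_frame" and b"third_frame".
offsets = frame_start_offsets([11, 12, 11])
# Start of the last frame, i.e. the position a seek-to-frame(-1) would use.
last_frame_start = offsets[-1]
```

Once the prefix sums are computed, looking up any frame's start (including the last one) is a single list access, which is why no binary search over byte positions should be needed.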

Rogdham (Owner) commented Sep 22, 2024

Hello, unfortunately pyzstd does not have a public API for this as of now.

In the meantime, could you try the following?

with SeekableZstdFile(FILENAME, 'r') as filereader:
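    # Seek to last frame. Note: this goes through private attributes
    # (leading underscores), which are not a stable public API.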
    filereader.seek(filereader._buffer.raw._seek_table.get_frame_sizes(-1)[1])
    # Prints out `third_frame`
    print(filereader.read())
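
Until a public API exists, the workaround above could be kept in one place with a small wrapper. This is only a sketch: `seek_to_frame` is a hypothetical helper, not part of pyzstd, and it depends on the same private attributes as the snippet above, so it may break with any pyzstd release.

```python
def seek_to_frame(reader, index):
    """Hypothetical helper: seek `reader` to the start of frame `index`.

    Relies on the private attributes `_buffer.raw._seek_table` used in the
    workaround above; these are not part of pyzstd's public API.
    """
    try:
        seek_table = reader._buffer.raw._seek_table
    except AttributeError as exc:
        raise RuntimeError(
            "reader internals differ from what this sketch assumes"
        ) from exc
    # As in the workaround, element [1] is used as the decompressed position.
    return reader.seek(seek_table.get_frame_sizes(index)[1])
```

With an open `SeekableZstdFile` in read mode, `seek_to_frame(filereader, -1)` followed by `filereader.read()` would then behave like the workaround above.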

bebehei (Author) commented Oct 4, 2024

Thanks @Rogdham, this works flawlessly.

I wouldn't have been able to come up with this myself by introspection, so I would definitely vote for a public API for this. Others will probably run into this problem, too.

@Rogdham Rogdham added the enhancement New feature or request label Oct 5, 2024