I'm currently testing the zstd seekable frame format. What I'm missing is a way to seek to a specific frame number instead of to a byte position.
from pyzstd import SeekableZstdFile, ZstdFile

FILENAME = "framed.zst"

with SeekableZstdFile(FILENAME, 'w') as filewriter:
    filewriter.write(b"first_frame")
    filewriter.flush(mode=ZstdFile.FLUSH_FRAME)
    filewriter.write(b"second_frame")
    filewriter.flush(mode=ZstdFile.FLUSH_FRAME)
    filewriter.write(b"third_frame")
    filewriter.close()

with SeekableZstdFile(FILENAME, 'r') as filereader:
    # Feature Request: Seek to frame numbers
    filereader.seek_to_frame(-1)
    # Prints out `third_frame`
    print(filereader.read())
I've got a big file (about 600 GB uncompressed), which I read and segment logically into zstd frames via SeekableZstdFile.flush, as shown in the example above.
After all data is compressed, I flush once more and then append an index of every segment, together with its metadata, as an additional frame. This index is just JSON-serialized text.
To find that metadata JSON, the program currently has to do an awkward binary search. Since the index already sits in its own frame, simply jumping to the last frame would find it in O(1).
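To illustrate why a frame-number seek can be cheap, here is a minimal sketch of the lookup it implies. It assumes the decompressed size of every frame is already known (the seekable format stores these in its seek table); `frame_start_offsets` and `frame_start` are hypothetical helpers for illustration, not part of pyzstd.

```python
from itertools import accumulate


def frame_start_offsets(decompressed_sizes):
    """Return the decompressed start offset of every frame (hypothetical helper)."""
    # A frame starts where all preceding frames end, so the offsets are the
    # running total of the sizes, shifted by one position.
    return [0, *accumulate(decompressed_sizes)][:-1]


def frame_start(decompressed_sizes, frame_index):
    """Return the start offset of one frame; negative indices count from the end."""
    return frame_start_offsets(decompressed_sizes)[frame_index]


# Sizes of the three frames written in the example above.
sizes = [len(b"first_frame"), len(b"second_frame"), len(b"third_frame")]
last_frame_offset = frame_start(sizes, -1)
print(last_frame_offset)
```

Once the offsets are precomputed, each lookup is a plain list index, and the result could be passed to an ordinary byte-position `seek()`.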
Hello, unfortunately pyzstd does not have a public API for this as of now.
In the meantime, could you try the following?
with SeekableZstdFile(FILENAME, 'r') as filereader:
    # Seek to last frame
    filereader.seek(filereader._buffer.raw._seek_table.get_frame_sizes(-1)[1])
    # Prints out `third_frame`
    print(filereader.read())
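For readers who would rather not depend on private attributes, one possible alternative is to make the index locatable through the data itself. This is a different technique from the snippet above: a rough sketch in which a fixed-size trailer holding the index's start offset is written after the JSON index. The names `write_index` and `read_index` and the trailer layout are made up for illustration, and `io.BytesIO` stands in for the compressed file. Whether `SeekableZstdFile` handles `seek(..., io.SEEK_END)` efficiently is an assumption that has not been verified here.

```python
import io
import json
import struct

# Hypothetical trailer layout: one 8-byte little-endian unsigned offset.
TRAILER = struct.Struct("<Q")


def write_index(f, index):
    """Append the JSON index, followed by a trailer that records where it starts."""
    index_offset = f.tell()
    f.write(json.dumps(index).encode("utf-8"))
    f.write(TRAILER.pack(index_offset))


def read_index(f):
    """Locate the index via the trailer at the end of the stream and decode it."""
    trailer_pos = f.seek(-TRAILER.size, io.SEEK_END)
    (index_offset,) = TRAILER.unpack(f.read(TRAILER.size))
    f.seek(index_offset)
    # The index occupies everything between its start and the trailer.
    return json.loads(f.read(trailer_pos - index_offset).decode("utf-8"))


# Demonstration on an in-memory stream standing in for the data file.
buf = io.BytesIO()
buf.write(b"first_frame")
buf.write(b"second_frame")
write_index(buf, {"segments": [[0, 11], [11, 12]]})
index = read_index(buf)
print(index)
```

The reader then needs only two seeks regardless of file size, at the cost of a small change to the file layout.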
I wouldn't have been able to come up with this myself by introspection, so I would definitely vote for a public API for this. Others will probably run into this problem, too.
zstd exposes the frame data in its .h files too: https://github.com/facebook/zstd/blob/20707e3718ee14250fb8a44b3bf023ea36bd88df/contrib/seekable_format/zstd_seekable.h#L199-L220
Is it possible to expose this in the Python module, too?