-
On the web, the compressed size (gzip or brotli) is typically what matters most. Exceptions may include file types that are already compressed (JPEG, WebP, WOFF, etc.). Since the encoding of application code (JavaScript or Wasm) does not include any lossless compression (there is symbol minification, dead code elimination, and lazy loading, but those are lossy), you want to look at the compressed size.
-
On Fuchsia, we keep read-only blobs compressed on disk (currently using zstd, though we considered alternatives), so download size and disk size may correlate well.
-
My mental model is that compressed data has a higher impact on system capabilities than uncompressed data. As mentioned above, compressed data is what is carried over the wire, updated over time, and stored locally. The cost of compressed data dictates the reliability of OTA downloads and of live-patching data. It also makes or breaks the feasibility of shipping less data up front and downloading more dynamically as the system needs it.

Uncompressed data mostly guards runtime memory consumption, and with the heavy chunking we're aiming for in ICU4X, I hope this is a less significant issue, since we are less at risk of having to load a large amount of data to supply a system that needs only a small portion of it. And if a system genuinely needs a large amount of data, then the hardware has to provide that memory. If the hardware is too limited for the runtime memory to handle that amount of data, then this solution cannot be used, and I doubt that micro-optimizing runtime memory can achieve much there.

My main concern about using gzipped/compressed data as the base data unit is the problem space of deltas and diffing. With uncompressed data, it is easy to reason about architectures that can deliver OTA deltas between data revisions, patch live, and keep a system with minimal or spotty network connectivity reliably on the latest data across, say, 1 billion devices loosely connected to the provider source. But if the data is compressed, I'm worried that the architecture locks us into a reality of "blob A" vs. "blob B", where partial deltas are unsustainable and we resort to full downloads of data for every change. A rough sketch of the concern follows below.
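To make the delta concern concrete, here is a minimal sketch (not ICU4X code; it assumes the flate2 crate and uses made-up, repetitive sample data) that compares how far two slightly different data revisions diverge at the byte level, before and after gzip compression:

```rust
// A rough illustration of why byte-level deltas between compressed blobs
// tend to be much larger than deltas between the uncompressed data.
// Assumes the `flate2` crate; the sample data is entirely hypothetical.

use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::Write;

/// Gzip-compress a byte slice with the default compression level.
fn gzip(data: &[u8]) -> Vec<u8> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(data).unwrap();
    enc.finish().unwrap()
}

/// Count positions where two byte slices differ (plus any length difference),
/// as a crude stand-in for the size of a byte-level delta.
fn byte_delta(a: &[u8], b: &[u8]) -> usize {
    let common = a.len().min(b.len());
    let differing = a[..common]
        .iter()
        .zip(&b[..common])
        .filter(|(x, y)| x != y)
        .count();
    differing + a.len().abs_diff(b.len())
}

fn main() {
    // Two hypothetical "data revisions": a repetitive blob with one small patch.
    let rev_a: Vec<u8> = b"locale-data-record;".repeat(10_000);
    let mut rev_b = rev_a.clone();
    rev_b[50_000] = b'X'; // a single-byte change somewhere in the middle

    let gz_a = gzip(&rev_a);
    let gz_b = gzip(&rev_b);

    println!("uncompressed delta: {} bytes", byte_delta(&rev_a, &rev_b));
    println!(
        "compressed delta:   {} bytes (of {} compressed bytes)",
        byte_delta(&gz_a, &gz_b),
        gz_a.len()
    );
}
```

Exact numbers depend on the compressor and the data, but the uncompressed delta stays at one byte while the compressed streams typically diverge over a large fraction of the blob after the change point, which is what pushes an OTA pipeline toward full downloads.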
-
Comment from @gvictor regarding static memory usage:
-
I think we should put more emphasis on gzipped size.
-
When evaluating changes to ICU4X, we take code and data size very seriously.
Most of the time, growth in data size on disk, i.e., file size, correlates closely with the size of that data when gzipped. However, as demonstrated in #1839, there are real-life examples of data that is very large on disk but compresses extremely well.
My question is: which metric should we care more about, or should we weigh them both when making decisions?
Here is my understanding: disk size matters for the static memory usage of the app (for zero-copy data) as well as for the size of downloaded data files, while gzip size matters for download speed and perceived latency.
What principles should we apply when making decisions in this space?
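For concreteness, here is a minimal sketch of measuring both metrics for a single data file (the file path and the flate2 dependency are illustrative assumptions, not part of ICU4X):

```rust
// Report the two metrics discussed in this thread for one data file:
// its size on disk and its size after gzip compression.
// Assumes the `flate2` crate; the path below is hypothetical.

use flate2::write::GzEncoder;
use flate2::Compression;
use std::fs;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Hypothetical path to a generated data file.
    let path = "icu4x_data/decimal/symbols@1/en.postcard";
    let raw = fs::read(path)?;

    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(&raw)?;
    let gz = enc.finish()?;

    // Disk size drives static memory for zero-copy use;
    // gzip size approximates what is carried over the wire.
    println!("disk size: {} bytes", raw.len());
    println!(
        "gzip size: {} bytes ({:.1}% of disk size)",
        gz.len(),
        100.0 * gz.len() as f64 / raw.len() as f64
    );
    Ok(())
}
```

Tracking both numbers side by side for each change would let us weigh them case by case rather than committing to a single metric up front.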