-
On the web, the compressed size (gzip or brotli) is typically what matters most. Exceptions may include file types that are already compressed (JPEG, WebP, WOFF, etc.). Since the encoding of application code (JavaScript or Wasm) does not include any lossless compression (there is symbol minification, dead code elimination, and lazy loading, but those are lossy), you want to look at the compressed size.
-
On Fuchsia, we keep read-only blobs compressed on disk (currently using zstd, though we considered alternatives), so download size and disk size may correlate well.
-
My mental model is that compressed data has a higher impact on system capabilities than uncompressed data. As mentioned above, compressed data is what is carried over the wire, updated over time, and stored locally. The cost of compressed data dictates the reliability of OTA downloads and of live-patching data. It also makes or breaks the feasibility of shipping less data up front and downloading more dynamically as the system needs it.

Uncompressed data mostly guards runtime memory consumption, and with the heavy chunking we're aiming for in ICU4X, I hope this is a less significant issue, since we are less at risk of having to load a large amount of data to supply a system that needs only a small portion of it. And if a system genuinely needs a large amount of data, then the hardware has to provide that memory. If the hardware is too limited for the runtime memory to handle that amount of data, then this solution cannot be used, and I doubt that micro-optimizing runtime memory can achieve much there.

My main concern about using gzipped/compressed data as the base data unit is the problem space of deltas and diffing. With uncompressed data, it is easy to reason about architectures that can deliver OTA deltas between data revisions, patch live, and keep a system with minimal or spotty network connectivity reliably on the latest data across, say, 1 billion devices loosely connected to the provider source. But if the data is compressed, I'm worried that the architecture locks us into a reality of "blob A" vs. "blob B", where partial deltas are unsustainable and we resort to full downloads of data for every change. A rough sketch of the concern follows below.
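To make the delta concern concrete, here is a minimal sketch (not ICU4X code; it assumes the flate2 crate and uses made-up, repetitive sample data) that compares how far two slightly different data revisions diverge at the byte level, before and after gzip compression:

```rust
// A rough illustration of why byte-level deltas between compressed blobs
// tend to be much larger than deltas between the uncompressed data.
// Assumes the `flate2` crate; the sample data is entirely hypothetical.

use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::Write;

/// Gzip-compress a byte slice with the default compression level.
fn gzip(data: &[u8]) -> Vec<u8> {
    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(data).unwrap();
    enc.finish().unwrap()
}

/// Count positions where two byte slices differ (plus any length difference),
/// as a crude stand-in for the size of a byte-level delta.
fn byte_delta(a: &[u8], b: &[u8]) -> usize {
    let common = a.len().min(b.len());
    let differing = a[..common]
        .iter()
        .zip(&b[..common])
        .filter(|(x, y)| x != y)
        .count();
    differing + a.len().abs_diff(b.len())
}

fn main() {
    // Two hypothetical "data revisions": a repetitive blob with one small patch.
    let rev_a: Vec<u8> = b"locale-data-record;".repeat(10_000);
    let mut rev_b = rev_a.clone();
    rev_b[50_000] = b'X'; // a single-byte change somewhere in the middle

    let gz_a = gzip(&rev_a);
    let gz_b = gzip(&rev_b);

    println!("uncompressed delta: {} bytes", byte_delta(&rev_a, &rev_b));
    println!(
        "compressed delta:   {} bytes (of {} compressed bytes)",
        byte_delta(&gz_a, &gz_b),
        gz_a.len()
    );
}
```

Exact numbers depend on the compressor and the data, but the uncompressed delta stays at one byte while the compressed streams typically diverge over a large fraction of the blob after the change point, which is what pushes an OTA pipeline toward full downloads.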
-
Comment from @gvictor regarding static memory usage:
-
I think we should put more emphasis on gzipped size.
-
When evaluating changes to ICU4X, we take code and data size very seriously.
Most of the time, growth in data size on disk, i.e., file size, correlates closely with the size of that data when gzipped. However, as demonstrated in #1839, there are real-life examples of data that is very large on disk but compresses extremely well.
My question is: which metric should we care more about, or should we weigh them both when making decisions?
Here is my understanding: disk size matters for the static memory usage of the app (for zero-copy data) as well as for the size of downloaded data files, while gzip size matters for download speed and perceived latency.
What principles should we apply when making decisions in this space?
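For concreteness, here is a minimal sketch of measuring both metrics for a single data file (the file path and the flate2 dependency are illustrative assumptions, not part of ICU4X):

```rust
// Report the two metrics discussed in this thread for one data file:
// its size on disk and its size after gzip compression.
// Assumes the `flate2` crate; the path below is hypothetical.

use flate2::write::GzEncoder;
use flate2::Compression;
use std::fs;
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Hypothetical path to a generated data file.
    let path = "icu4x_data/decimal/symbols@1/en.postcard";
    let raw = fs::read(path)?;

    let mut enc = GzEncoder::new(Vec::new(), Compression::default());
    enc.write_all(&raw)?;
    let gz = enc.finish()?;

    // Disk size drives static memory for zero-copy use;
    // gzip size approximates what is carried over the wire.
    println!("disk size: {} bytes", raw.len());
    println!(
        "gzip size: {} bytes ({:.1}% of disk size)",
        gz.len(),
        100.0 * gz.len() as f64 / raw.len() as f64
    );
    Ok(())
}
```

Tracking both numbers side by side for each change would let us weigh them case by case rather than committing to a single metric up front.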