A customisable compression utility specialised for short inputs. Get the newest release.
Skrót allows you to build a model of your data and use it to compress short byte sequences of predictable contents. It can efficiently compress byte sequences shorter than 200 bytes. It's based on established dictionary-based data compression algorithms: LZMA and LZ4.
Interested in some numbers? Take a look at the results of our benchmarks.
Skrót comes in two variants:
- a native library written in portable C, and
- a pure Java library with no native dependencies.
The two implementations share no code, and the way they're built and used differs.
A tentative C API is specified in skr.h. It's implemented as a shared library, libskr. Its only dependencies are liblz4 and liblzma; they're present in the repositories of most contemporary GNU/Linux distributions and are available in Homebrew as lz4 and xz.
/* Builds a model out of a sample of representative data. */
int skr_model(const uint8_t* input, size_t input_len,
              uint8_t** output, size_t* output_len,
              skr_opts_t* opts);

/* Compresses input using a previously built model. */
int skr_compress(const uint8_t* model, size_t model_len,
                 const uint8_t* input, size_t input_len,
                 uint8_t** output, size_t* output_len,
                 skr_opts_t* opts);

/* Decompresses input using the model it was compressed with. */
int skr_decompress(const uint8_t* model, size_t model_len,
                   const uint8_t* input, size_t input_len,
                   uint8_t** output, size_t* output_len,
                   skr_opts_t* opts);
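For orientation, a round trip through this API might look as follows. This is a sketch resting on assumptions that skr.h itself would settle: that a non-zero return value signals an error, that each output buffer is allocated by libskr and released by the caller with free, and that passing NULL for skr_opts_t* selects default options.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include "skr.h"

int main(void) {
    /* A model sample and an input similar to it; both are arbitrary. */
    const uint8_t sample[] = "{\"id\":1,\"name\":\"example\"}";
    const uint8_t input[]  = "{\"id\":2,\"name\":\"sample\"}";
    uint8_t *model, *comp, *decomp;
    size_t model_len, comp_len, decomp_len;

    if (skr_model(sample, sizeof sample - 1, &model, &model_len, NULL))
        return 1;
    if (skr_compress(model, model_len, input, sizeof input - 1,
                     &comp, &comp_len, NULL))
        return 1;
    if (skr_decompress(model, model_len, comp, comp_len,
                       &decomp, &decomp_len, NULL))
        return 1;

    printf("%zu bytes in, %zu bytes compressed\n",
           sizeof input - 1, comp_len);
    free(model);
    free(comp);
    free(decomp);
    return 0;
}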
skr and its aliases unskr and mkskr are command-line wrappers around libskr. As of the time of writing they're written in Python. They're invoked in the following way:
- mkskr < raw_model > model builds a Skrót model file from standard input,
- skr model < input > compressed compresses the contents of standard input using the model, and
- unskr model < compressed > uncompressed decompresses the contents of standard input using the model.
Both skr and libskr can be built by invoking make.
The API of the Java library has some documentation.
The Java library is built with Leiningen; install it first if you want to build a JAR. Afterwards, execute lein jar in the java directory. The JAR will be placed in the java/target directory.
LZMA is a dictionary-based compression algorithm used in, among others, 7zip and xz-utils. LZ4 is a compression algorithm belonging to the same family as LZMA; it's used, among others, in the Linux kernel, Hadoop and the BSD implementation of ZFS. One interesting feature of these algorithms is that changes introduced at the end of an uncompressed input stream result only in changes at the end of the compressed output stream; the rest remains intact. Moreover, LZ4 and the basic form of LZMA, i.e. LZMA1, don't use any checksums.
Let's focus on LZMA1 in the rest of this section (the same applies to LZ4, though). Consider the following back-of-the-envelope reasoning. Let's take two non-empty byte sequences, a and b. Compressing a with LZMA1 produces comp_a. Compressing the concatenation of a and b produces comp_ab. If a and b are similar, the difference between the lengths of comp_ab and comp_a is small compared to the length of b. As an example of similar byte sequences consider a C source file and a corresponding header, or two JSON documents with an identical schema.
╭───┄─╮ ╭─────────╮
│ a │ → │ comp_a │
╰───┄─╯ ╰─────────╯
╭───┄─┬───┄─╮ ╭───────────╮
│ a │ b │ → │ comp_ab │
╰───┄─┴───┄─╯ ╰───────────╯
Sequences comp_a and comp_ab are identical up to index n: the first differing byte has index n, where n + 1 ≤ length(comp_a).
╭─────────────────╮
│ comp_a │
╰─────────────────╯
╭───────────────────────────╮
│ comp_ab │
╰───────────────────────────╯
╭─────────────┬─────────────╮
│ comp_a[:n] │ comp_ab[n:] │
╰─────────────┴─────────────╯
0 n
Since LZMA1 doesn't use any form of checksumming, in order to recover b all we need is comp_a, n and comp_ab[n:]. Typically, the difference between length(comp_a) and n is significantly smaller than 255. As a consequence, comp_ab[n:] should be relatively short as long as a and b are similar.
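This behaviour is easy to observe first-hand. The sketch below (an illustration, not a part of Skrót) compresses a and the concatenation of a and b with liblzma's raw LZMA1 encoder, which emits no framing or checksums, and measures the common prefix of comp_a and comp_ab; the inputs and buffer sizes are arbitrary choices. Compile with cc demo.c -llzma.

#include <lzma.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Compresses in[0..in_len) with a raw LZMA1 encoder; returns the
 * compressed length, or 0 on error. */
static size_t lzma1_compress(const uint8_t *in, size_t in_len,
                             uint8_t *out, size_t out_cap) {
    lzma_options_lzma opt;
    if (lzma_lzma_preset(&opt, LZMA_PRESET_DEFAULT))
        return 0;
    lzma_filter filters[] = {
        { .id = LZMA_FILTER_LZMA1, .options = &opt },
        { .id = LZMA_VLI_UNKNOWN, .options = NULL },
    };
    lzma_stream strm = LZMA_STREAM_INIT;
    if (lzma_raw_encoder(&strm, filters) != LZMA_OK)
        return 0;
    strm.next_in = in;
    strm.avail_in = in_len;
    strm.next_out = out;
    strm.avail_out = out_cap;
    lzma_ret ret = lzma_code(&strm, LZMA_FINISH);
    size_t len = out_cap - strm.avail_out;
    lzma_end(&strm);
    return ret == LZMA_STREAM_END ? len : 0;
}

int main(void) {
    const char *a = "struct point { int x; int y; };\n";
    const char *b = "struct point3 { int x; int y; int z; };\n";
    size_t a_len = strlen(a), b_len = strlen(b);
    uint8_t ab[128], comp_a[256], comp_ab[256];
    memcpy(ab, a, a_len);
    memcpy(ab + a_len, b, b_len);

    size_t ca = lzma1_compress((const uint8_t *)a, a_len,
                               comp_a, sizeof comp_a);
    size_t cab = lzma1_compress(ab, a_len + b_len,
                                comp_ab, sizeof comp_ab);

    /* Find the first index at which the two outputs differ. */
    size_t n = 0;
    while (n < ca && n < cab && comp_a[n] == comp_ab[n])
        n++;
    printf("length(comp_a) = %zu, length(comp_ab) = %zu, n = %zu\n",
           ca, cab, n);
    return 0;
}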
This allows us to build a specialised compression tool for any byte sequence x similar to a. A compressed representation of a, comp_a, is built into the tool. Given x, in order to compress it the tool:
- uncompresses comp_a to produce a,
- compresses the concatenation of a and x to produce comp_ax,
- determines the smallest n such that comp_ax[n] ≠ comp_a[n], and
- saves n and comp_ax[n:] as the compressed representation of x (see the sketch below).
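The compression half is small enough to sketch in C. All names below are hypothetical, and toy_compress is an identity stand-in for an LZMA1 or LZ4 encoder (identity trivially has the prefix-stability property the scheme relies on); a real implementation would call liblzma or liblz4, e.g. the lzma1_compress helper shown earlier. Error handling is omitted for brevity.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Identity stand-in for an LZMA1/LZ4 encoder; returns the output length. */
static size_t toy_compress(const uint8_t *in, size_t in_len, uint8_t *out) {
    memcpy(out, in, in_len);
    return in_len;
}

/* Compresses x against the model (a, comp_a): computes comp_ax, finds the
 * smallest n with comp_ax[n] != comp_a[n] and copies comp_ax[n:] to patch.
 * Returns the patch length; the pair (n, patch) is the compressed x. */
static size_t make_patch(const uint8_t *a, size_t a_len,
                         const uint8_t *comp_a, size_t comp_a_len,
                         const uint8_t *x, size_t x_len,
                         uint8_t *patch, size_t *n) {
    uint8_t *ax = malloc(a_len + x_len);       /* a ++ x */
    uint8_t *comp_ax = malloc(a_len + x_len);  /* identity: same size */
    memcpy(ax, a, a_len);
    memcpy(ax + a_len, x, x_len);
    size_t comp_ax_len = toy_compress(ax, a_len + x_len, comp_ax);

    size_t i = 0;
    while (i < comp_a_len && i < comp_ax_len && comp_a[i] == comp_ax[i])
        i++;
    *n = i;

    size_t patch_len = comp_ax_len - i;
    memcpy(patch, comp_ax + i, patch_len);
    free(ax);
    free(comp_ax);
    return patch_len;
}

int main(void) {
    /* With the identity stand-in, comp_a is simply a itself. */
    const uint8_t a[] = "{\"id\":1}", x[] = "{\"id\":2}";
    uint8_t patch[64];
    size_t n;
    size_t patch_len = make_patch(a, sizeof a - 1, a, sizeof a - 1,
                                  x, sizeof x - 1, patch, &n);
    printf("n = %zu, patch length = %zu\n", n, patch_len);
    return 0;
}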
In order to obtain x given n and the byte sequence patch, the tool:
- uncompresses comp_a to produce a, learning its length,
- reassembles comp_ax as the concatenation of comp_a[:n] and patch,
- uncompresses comp_ax to produce the concatenation of a and x, and
- recovers x by dropping the first length(a) bytes of the result (again, see the sketch below).
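The matching decompression half, under the same assumptions (hypothetical names, identity stand-in for the decoder):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Identity stand-in for the matching LZMA1/LZ4 decoder. */
static size_t toy_decompress(const uint8_t *in, size_t in_len, uint8_t *out) {
    memcpy(out, in, in_len);
    return in_len;
}

/* Recovers x from (n, patch): reassembles comp_ax = comp_a[:n] ++ patch,
 * decompresses it into a ++ x and drops the first a_len bytes. */
static size_t apply_patch(const uint8_t *comp_a, size_t a_len,
                          const uint8_t *patch, size_t patch_len,
                          size_t n, uint8_t *x) {
    size_t comp_ax_len = n + patch_len;
    uint8_t *comp_ax = malloc(comp_ax_len);
    uint8_t *ax = malloc(comp_ax_len);  /* identity: same size */
    memcpy(comp_ax, comp_a, n);
    memcpy(comp_ax + n, patch, patch_len);

    size_t ax_len = toy_decompress(comp_ax, comp_ax_len, ax);
    size_t x_len = ax_len - a_len;
    memcpy(x, ax + a_len, x_len);

    free(comp_ax);
    free(ax);
    return x_len;
}

int main(void) {
    /* Values mirror the compression sketch: with the identity stand-in,
     * comp_a = a, n = length(a) and the patch is x itself. */
    const uint8_t a[] = "{\"id\":1}";
    const uint8_t patch[] = "{\"id\":2}";
    uint8_t x[64];
    size_t x_len = apply_patch(a, sizeof a - 1, patch, sizeof patch - 1,
                               sizeof a - 1, x);
    printf("recovered %zu bytes: %.*s\n", x_len, (int)x_len, x);
    return 0;
}

Swapping the identity stand-ins for a raw LZMA1 encoder and decoder (e.g. liblzma's lzma_raw_encoder and lzma_raw_decoder with LZMA_FILTER_LZMA1, which add no framing or checksums) turns this pair of functions into essentially the scheme described above.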
That's the whole algorithm. It tends to work.
Smaz “is a simple compression library suitable for compressing very short strings”. It excels at compressing English.
Copyright (c) 2014 Jan Stępień
Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation
files (the "Software"), to deal in the Software without
restriction, including without limitation the rights to use,
copy, modify, merge, publish, distribute, sublicense, and/or
sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.