<html>
<head>
<title>jbig2enc: Documentation</title>
</head>
<body style="max-width: 70em; font-family: Arial; text-align: justify;">
<h1><tt>jbig2enc</tt>: Documentation</h1>
<p>Adam Langley <tt><[email protected]></tt></p>
<h5>What is JBIG2</h5>
<p>JBIG2 is an image compression standard from the same people who brought
you the JPEG format. It compresses 1bpp (black and white) images only.
These images can consist of <i>only</i> black and white; there are no
shades of gray - that would be a grayscale image. Any "gray" areas must,
therefore, be simulated using black dots in a pattern called <a
href="http://en.wikipedia.org/wiki/Halftone">halftoning</a>.</p>
<p>The JBIG2 standard has several major areas:</p>
<ul>
<li>Generic region coding</li>
<li>Symbol encoding (and text regions)</li>
<li>Refinement</li>
<li>Halftoning</li>
</ul>
<p>There are two major compression technologies which JBIG2 builds on:
<a href="http://en.wikipedia.org/wiki/Arithmetic_coding">arithmetic encoding</a>
and <a href="http://en.wikipedia.org/wiki/Huffman_coding">Huffman encoding</a>. You can
choose between them and even use both in the same JBIG2 file, but this is rare.
Arithmetic encoding is slower, but compresses better. Huffman encoding was
included in the standard because one of the (intended) users of JBIG2 was
fax machines, which might not have the processing power for arithmetic
coding.</p>
<p><tt>jbig2enc</tt> <i>only</i> supports arithmetic encoding.</p>
<h5>Generic region coding</h5>
<p>Generic region coding is used to compress bitmaps. It is progressive and
uses a context around the current pixel to be decoded to estimate the
probability that the pixel will be black. If the probability is 50% it uses
a single bit to encode that pixel. If the probability is 99% then it takes less
than a bit to encode a black pixel, but more than a bit to encode a white
one. (An outcome with probability <i>p</i> costs about -log<sub>2</sub> <i>p</i>
bits, so a 99%-probable black pixel costs roughly 0.015 bits, while the
1%-probable white pixel costs about 6.6 bits.)</p>
<p>The context can only refer to pixels above and to the left of the
current pixel, because the decoder doesn't know the values of any of the
other pixels yet (pixels are decoded left-to-right, top-to-bottom). Based
on the values of these pixels it estimates a probability and updates its
estimate for that context based on the actual pixel found. All contexts
start off with a 50% chance of being black.</p>
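<p>To make the bookkeeping concrete, here is a minimal C++ sketch (an
illustration, <i>not</i> jbig2enc's actual code): a context index is packed
from already-decoded neighbouring pixels, and a per-context counter tracks
how often that context has produced a black pixel. The exact template
layout and the counter-based estimator here are simplifying assumptions;
the real standard specifies fixed templates and a table-driven arithmetic
coder state machine.</p>
<pre>
#include <cstdint>
#include <vector>

// Illustrative 10-pixel context: pixels above and to the left of (x, y).
// The real JBIG2 templates use specific fixed layouts.
static int context_at(const std::vector<std::vector<int>> &img, int x, int y) {
  auto px = [&](int dx, int dy) -> int {
    int cx = x + dx, cy = y + dy;
    if (cy < 0 || cx < 0 || cy >= (int) img.size() || cx >= (int) img[cy].size())
      return 0;                    // outside the image counts as white
    return img[cy][cx];            // 1 = black, 0 = white
  };
  static const int offsets[10][2] = {
      {-1, 0}, {-2, 0},                                // current row
      {-2, -1}, {-1, -1}, {0, -1}, {1, -1}, {2, -1},   // one row up
      {-1, -2}, {0, -2}, {1, -2}};                     // two rows up
  int ctx = 0;
  for (const auto &o : offsets)
    ctx = (ctx << 1) | px(o[0], o[1]);
  return ctx;                      // 0 .. 1023
}

// Per-context adaptive estimate of P(black); every context starts at 50%.
struct AdaptiveModel {
  std::vector<uint32_t> black, total;
  AdaptiveModel() : black(1024, 1), total(1024, 2) {}
  double p_black(int ctx) const { return (double) black[ctx] / total[ctx]; }
  void update(int ctx, int pixel) { black[ctx] += pixel; total[ctx] += 1; }
};
</pre>
<p>The encoder feeds the estimate to the arithmetic coder and then calls
<tt>update()</tt> with the actual pixel; the decoder performs exactly the
same updates, so the two models stay in sync without any side
information.</p>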
<p>You can encode whole pages with this and you will end up with a perfect
reconstruction of the page. However, we can do better...</p>
<h5>Symbol encoding</h5>
<p>Most input images to JBIG2 encoders are scanned text. These have many
repeating symbols (letters). The idea of symbol encoding is to encode what
a letter “a” looks like and, for all the “a”s on
the page, just give their locations. (This is lossy encoding.)</p>
<p>Unfortunately, all scanned images have noise in them: no two
“a”s will look quite the same, so we have to group all the
symbols on a page into classes. Hopefully each member of a given class will
be the same letter, otherwise we might place the wrong letter on the page!
These very surprising errors are called cootoots.</p>
<p>However, assuming that we group the symbols correctly, we can get great
compression this way. Remember that the stricter the classifier, the more
symbol groups (classes) will be generated, leading to bigger files, but
also a lower risk of cootoots (misclassification).</p>
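<p>As a rough illustration of that trade-off, here is a toy matching test
in C++ (a simplification under assumed types, not jbig2enc's actual
classifier, which applies several tests): two symbol bitmaps land in the
same class when the fraction of matching pixels reaches a threshold.</p>
<pre>
#include <vector>

using Bitmap = std::vector<std::vector<int>>;  // 1 = black, 0 = white

// Toy test: same class when at least `threshold` of the pixels agree.
// Raising the threshold creates more, smaller classes (bigger files,
// fewer cootoots); lowering it does the opposite.
bool same_class(const Bitmap &a, const Bitmap &b, double threshold) {
  if (a.size() != b.size() || a.empty() || a[0].size() != b[0].size())
    return false;   // a real classifier would first normalise the sizes
  int match = 0, total = 0;
  for (size_t y = 0; y < a.size(); ++y)
    for (size_t x = 0; x < a[y].size(); ++x, ++total)
      match += (a[y][x] == b[y][x]);
  return (double) match / total >= threshold;
}
</pre>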
<p>This is great, but we can do better...</p>
<h5>Symbol retention</h5>
<p>Symbol retention is the process of compressing multi-page documents by
extracting the symbols from all the pages at once and classifying them all
together. Thus we only have to encode a single letter “a” for
the whole document (in an ideal world).</p>
<p>This is obviously slower, but generates smaller files (about half the
size on average, with a decent number of similar typeset pages).</p>
<p>One downside you should be aware of: if you are generating JBIG2 streams
for inclusion in a linearised PDF file, the PDF reader has to download all
the symbols before it can display the first page. There is a solution to this
involving multiple dictionaries and symbol importing, but that's not
currently supported by <tt>jbig2enc</tt>.</p>
<h5>Refinement</h5>
<p>Symbol encoding is lossy because of noise, which is classified away, and
also because the symbol classifier is imperfect. Refinement allows us, when
placing a symbol on the page, to encode the difference between the actual
symbol at that location and what the classifier told us was “close
enough”. We can choose to do this for each symbol on the page, so we
don't have to refine when we are only a couple of pixels off. If we refine
whenever we have a wrong pixel, we have lossless encoding using symbols.</p>
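<p>A sketch of that decision, using the same toy bitmap representation as
above (the tolerance corresponds to the <tt>-r</tt> option described
below; the two bitmaps are assumed to have equal dimensions):</p>
<pre>
#include <vector>

using Bitmap = std::vector<std::vector<int>>;  // 1 = black, 0 = white

// Count pixels where the placed class exemplar differs from the symbol
// actually on the page.
int wrong_pixels(const Bitmap &actual, const Bitmap &exemplar) {
  int diff = 0;
  for (size_t y = 0; y < actual.size(); ++y)
    for (size_t x = 0; x < actual[y].size(); ++x)
      diff += (actual[y][x] != exemplar[y][x]);
  return diff;
}

// Refine only when the mismatch exceeds the tolerance; a tolerance of 0
// refines every imperfect placement, giving lossless symbol encoding.
bool should_refine(const Bitmap &actual, const Bitmap &exemplar, int tolerance) {
  return wrong_pixels(actual, exemplar) > tolerance;
}
</pre>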
<h5>Halftoning</h5>
<p><tt>jbig2enc</tt> doesn't support this at all - so I will only mention
it quickly. The JBIG2 standard supports the efficient encoding of
halftoning by building a dictionary of halftone blocks (like the
dictionaries of symbols which we build for text pages). The lack of support
for halftones in G4 (the old fax standard) was a major weakness.</p>
<h5>Some numbers</h5>
<p>My sample is a set of 90 pages scanned from the middle of a
recent book. The scanned images are 300dpi grayscale and they are being
upsampled to 600dpi 1bpp for encoding.</p>
<ul>
<li>Generic encoding each page: 3435177 bytes</li>
<li>Symbol encoding each page (default classifier settings): 1075185 bytes</li>
<li>Symbol encoding with refinement for more than 10 incorrect pixels: 3382605 bytes</li>
</ul>
<h2>Command line options</h2>
<p><tt>jbig2enc</tt> comes with a handy command line tool for encoding
images; example invocations follow the option list below.</p>
<ul>
<li><tt>-d | --duplicate-line-removal</tt>: When encoding generic
regions, each scanline can be tagged to indicate that it's the same as
the previous scanline, and encoding of that scanline is skipped. This
drastically reduces the encoding time (by a factor of about 2 on some
images) although it doesn't typically save any bytes. This is an option
because some versions of <tt>jbig2dec</tt> (an open source decoding
library) cannot handle it.</li>
<li><tt>-p | --pdf</tt>: The PDF spec includes support for JBIG2
(Syntax→Filters→JBIG2Decode in the PDF references for versions
1.4 and above). However, PDF requires a slightly different format for
JBIG2 streams: no file/page headers or trailers, and all pages are
numbered 1. In symbol mode the output is a series of files:
<tt>symboltable</tt> and <tt>page-</tt><i>n</i> (numbered from 0).</li>
<li><tt>-s | --symbol-mode</tt>: use symbol encoding. Turn this on for
scanned text pages.</li>
<li><tt>-t <threshold></tt>: sets the fraction of pixels which have
to match in order for two symbols to be classed the same. This isn't
strictly true, as there are other tests as well, but increasing this will
generally increase the number of symbol classes.</li>
<li><tt>-T <threshold></tt>: sets the black threshold (0-255). Any gray value darker
than this is considered black. Anything lighter is considered white.</li>
<li><tt>-r | --refine <tolerance></tt>: (requires <tt>-s</tt>) turn
on refinement for symbols with more than <tt>tolerance</tt> incorrect
pixels. (10 is a good value for 300dpi, try 40 for 600dpi). Note: this is
known to crash Adobe products.</li>
<li><tt>-O <outfile></tt>: dump a PNG of the 1 bpp image before
encoding. Can be used to test loss.</li>
<li><tt>-2</tt> or <tt>-4</tt>: upscale either two or four times before
converting to black and white.</li>
<li><tt>-S</tt>: Segment an image into text and non-text regions. This
isn't perfect, but running non-text regions through the symbol compressor
is terrible, so it's worth doing if your input has images in it (like a
magazine page). You can also give the <tt>--image-output</tt> option to
set a filename to which the removed (non-text) parts are written (PNG
format).</li>
</ul>
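<p>For example (assuming the build produces the usual <tt>jbig2</tt>
binary; the file names here are placeholders):</p>
<pre>
# Symbol-encode a single scanned text page (output goes to stdout):
jbig2 -s page.png > page.jb2

# Multi-page document with symbol retention, emitting the PDF-ready
# streams described under -p (symboltable plus page-0, page-1, ...):
jbig2 -s -p page1.png page2.png page3.png
</pre>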
</body>
</html>