Skip to content
Will Clinger edited this page Jun 17, 2016 · 6 revisions

Immutable texts

In early 2015, John Cowan wrote an API for spans, which are sequences of characters of fixed length.

John's proposal was abandoned in favor of SRFI 130, which is a cursor-based string library, but immutable texts would have several advantages over Scheme's mutable strings:

  • space efficiency asymptotically approaching that of UTF-8
  • character-based random access in O(1) time
  • subtexts can share memory with the original text, making creation of subtexts much faster

I have already built and tested an implementation of spans that achieves those advantages.

Here I will sketch the design of a much simpler API for immutable texts and an implementation that should improve upon my best implementation of spans.

The API is simplified by using character indexes throughout. Cursors may be used internally by some operations, but are not exposed to users of immutable texts.

An immutable text can be represented as a record with these fields:

  • length of text
  • number of characters at the beginning of the first bytevector that do not belong to the text
  • vector of bytevectors
    • every bytevector is a UTF-8 sequence of characters (code points)
    • every bytevector except possibly the last contains exactly N characters

where N is a fixed parameter whose selection involves a time/space tradeoff. Character indexing will take O(N) time (which becomes O(1) time because N is fixed), but really small values of N increase the number of bytevectors needed and thus increase space.

The first two fields of that record could be combined into a single field, because the number of skipped characters will never be more than N. That trick would save 8 bytes per text, and would probably not cost any time.

Analysis

The following table compares several possible representations for texts and strings of length n, assuming Ascii characters and a 64-bit word with Larceny-like representations for vectors, bytevectors, and records. The "random access" and "substring" columns show the number of bytes that must be scanned or copied, neglecting range checking, header/record allocation/initialization, and other constant overhead.

                           space (bytes)    random access    substring
                                            (worst case)     (worst case)
immutable text             56+(16*n/N)+n          N            8*(n/N)
bytevector, UTF-8           8     +    n          n               n
bytevector, UTF-16          8     +  2*n        2*n             2*n
bytevector, UTF-32          8     +  4*n        4*n             4*n

Space

UTF representations of an empty text occupy 8 bytes of storage, while an immutable empty text occupies 40 bytes (1 word for each of three fields plus 2 words of record overhead; the vector containing an empty bytevector can be shared among all empty texts).

UTF representations of a 1-character text or 8-character text occupy 16 bytes (because of padding to a word boundary), while an immutable 1-character or 8-character text occupies 64 bytes.

For N=80, the additional vector element and bytevector header required by each additional N characters would be at most 20% of the space required by those characters. For N=160, that overhead would drop to 10%. For N=128, that overhead would be 12.5%.

                      as text    UTF-8    UTF-16    UTF-32
empty text               40         8        8         8
1-8 characters           64        16    <= 24     <= 40
9-16 characters          72        24    <= 40     <= 72
17-24 characters         80        32    <= 56    <= 104
25-32 characters         88        48    <= 72    <= 136
64 characters           120        72   <= 136    <= 264
128 characters          184       136   <= 264    <= 520
256 characters          328       264   <= 520   <= 1032
512 characters          616       520  <= 1032   <= 2056
1024 characters        1192      1032  <= 2056   <= 4104

text-ref

Suppose t is a text containing n characters, a vector v of bytevectors bv0, ..., bvk with the first i0 characters of bv0 not part of t. Then (text-ref t i) would be computed as follows:

  • make sure t is a text and i is a fixnum
  • fetch n and i0 from t
  • range check 0 <= i < n
  • compute
    • j = (div (+ i i0) N)
    • ii = (mod (+ i i0) N)
  • fetch bvj = (vector-ref v j)
  • if
    • either of these is true:
      • j is less than (- (vector-length v) 1) and (bytevector-length bvj) is N, or
      • j is (- (vector-length v) 1) and (bytevector-length bvj) is (mod (+ i0 n) N)
    • then return (integer->char (bytevector-ref bvj ii))
  • else scan bvj for character ii

Even the last step takes O(1) time because N is fixed.

(On 64-bit machines, you can scan 8 Ascii characters at a time by masking them to extract the high-order bits of each byte. That should be a worthwhile speedup for texts that contain a lot of Ascii characters.)

subtext

Subtexts can always be calculated without allocating new bytevectors to hold characters, but it will generally be necessary to allocate a new record and a new vector to hold the bytevectors whose characters belong to the subtext.

If the original text has only one bytevector, then its bytevector and vector can be shared with the subtext.

If the subtext's length is closer to 0 than to N, it may make sense to allocate a new vector and small bytevector provided the subtext's required bytevector length is significantly more than 16 bytes less than the original's bytevector length. This is probably in the noise.

text-append

When appending two texts, it is always possible to avoid copying more than N characters of the longer text.

If the second text's characters start at the beginning of its first bytevector, then it will not be necessary to copy any of its characters provided all of the first text's characters are copied (or its last bytevector contains N characters).

If the first text's last bytevector contains N characters, then it will not be necessary to copy any of its characters provided all of the second text's characters are copied (or its characters start at the beginning of its first bytevector).

If both are true, then no copying of characters is necessary.

The substring and text-append procedures can be written to make those special cases more common by biasing them toward creating texts whose characters either start at the beginning of the first bytevector or completely fill the last bytevector.

In the general case, when appending several texts, the thing to do is to find the longest sequence of texts that can be appended without copying characters, and then copy all characters on either side of that longest sequence.

Clone this wiki locally