-
Notifications
You must be signed in to change notification settings - Fork 32
ImmutableTexts
In early 2015, John Cowan wrote an API for spans, which are sequences of characters of fixed length.
John's proposal was abandoned in favor of SRFI 130, which is a cursor-based string library, but immutable texts would have several advantages over Scheme's mutable strings:
- space efficiency asymptotically approaching that of UTF-8
- character-based random access in O(1) time
- subtexts can share memory with the original text, making creation of subtexts much faster
I have already built and tested an implementation of spans that achieves those advantages.
Here I will sketch the design of a much simpler API for immutable texts and an implementation that should improve upon my best implementation of spans.
The API is simplified by using character indexes throughout. Cursors may be used internally by some operations, but are not exposed to users of immutable texts.
An immutable text can be represented as a record with these fields:
- length of text
- number of characters at the beginning of the first bytevector that do not belong to the text
- vector of bytevectors
- every bytevector is a UTF-8 sequence of characters (code points)
- every bytevector except possibly the last contains exactly N characters
where N is a fixed parameter whose selection involves a time/space tradeoff. Character indexing will take O(N) time (which becomes O(1) time because N is fixed), but really small values of N increase the number of bytevectors needed and thus increase space.
The first two fields of that record could be combined into a single field, because the number of skipped characters will never be more than N. That trick would save 8 bytes per text, and would probably not cost any time.
The following table compares several possible representations for texts and strings of length n, assuming Ascii characters and a 64-bit word with Larceny-like representations for vectors, bytevectors, and records. The "random access" and "substring" columns show the number of bytes that must be scanned or copied, neglecting range checking, header/record allocation/initialization, and other constant overhead.
space (bytes) random access substring
(worst case) (worst case)
immutable text 56+(16*n/N)+n N 8*(n/N)
bytevector, UTF-8 8 + n n n
bytevector, UTF-16 8 + 2*n 2*n 2*n
bytevector, UTF-32 8 + 4*n 4*n 4*n
UTF representations of an empty text occupy 8 bytes of storage, while an immutable empty text occupies 40 bytes (1 word for each of three fields plus 2 words of record overhead; the vector containing an empty bytevector can be shared among all empty texts).
UTF representations of a 1-character text or 8-character text occupy 16 bytes (because of padding to a word boundary), while an immutable 1-character or 8-character text occupies 64 bytes.
For N=80, the additional vector element and bytevector header required by each additional N characters would be at most 20% of the space required by those characters. For N=160, that overhead would drop to 10%. For N=128, that overhead would be 12.5%.
as text UTF-8 UTF-16 UTF-32
empty text 40 8 8 8
1-8 characters 64 16 <= 24 <= 40
9-16 characters 72 24 <= 40 <= 72
17-24 characters 80 32 <= 56 <= 104
25-32 characters 88 48 <= 72 <= 136
64 characters 120 72 <= 136 <= 264
128 characters 184 136 <= 264 <= 520
256 characters 328 264 <= 520 <= 1032
512 characters 616 520 <= 1032 <= 2056
1024 characters 1192 1032 <= 2056 <= 4104
Suppose t
is a text containing n characters,
a vector v
of bytevectors bv0, ..., bvk
with the first i0
characters of bv0
not part of t
.
Then (text-ref t i)
would be computed as follows:
- make sure
t
is a text andi
is a fixnum - fetch
n
andi0
fromt
- range check
0 <= i < n
- compute
j = (div (+ i i0) N)
ii = (mod (+ i i0) N)
- fetch
bvj = (vector-ref v j)
- if
- either of these is true:
-
j
is less than(- (vector-length v) 1)
and(bytevector-length bvj)
is N, or -
j
is(- (vector-length v) 1)
and(bytevector-length bvj)
is(mod (+ i0 n) N)
-
- then return
(integer->char (bytevector-ref bvj ii))
- either of these is true:
- else scan
bvj
for characterii
Even the last step takes O(1) time because N is fixed.
(On 64-bit machines, you can scan 8 Ascii characters at a time by masking them to extract the high-order bits of each byte. That should be a worthwhile speedup for texts that contain a lot of Ascii characters.)
Subtexts can always be calculated without allocating new bytevectors to hold characters, but it will generally be necessary to allocate a new record and a new vector to hold the bytevectors whose characters belong to the subtext.
If the original text has only one bytevector, then its bytevector and vector can be shared with the subtext.
If the subtext's length is closer to 0 than to N, it may make sense to allocate a new vector and small bytevector provided the subtext's required bytevector length is significantly more than 16 bytes less than the original's bytevector length. This is probably in the noise.
When appending two texts, it is always possible to avoid copying more than N characters of the longer text.
If the second text's characters start at the beginning of its first bytevector, then it will not be necessary to copy any of its characters provided all of the first text's characters are copied (or its last bytevector contains N characters).
If the first text's last bytevector contains N characters, then it will not be necessary to copy any of its characters provided all of the second text's characters are copied (or its characters start at the beginning of its first bytevector).
If both are true, then no copying of characters is necessary.
The substring
and text-append
procedures can be written to make
those special cases more common by biasing them toward creating texts
whose characters either start at the beginning of the first bytevector
or completely fill the last bytevector.
In the general case, when appending several texts, the thing to do is to find the longest sequence of texts that can be appended without copying characters, and then copy all characters on either side of that longest sequence.