-
Notifications
You must be signed in to change notification settings - Fork 32
StringRepresentations
= Representation of Strings =
Implementations of R5RS and R6RS Scheme can choose from a wide variety of representations for strings, including:
flat1:: bytevector-like objects, with 1 byte/character; works for ASCII or Latin-1 but not for Unicode
flat4:: bytevector-like objects, with 4 bytes/character; works for Unicode and R6RS
record4:: records that encapsulate a flat4
record2:: records that encapsulate a bytevector-like object with 2 bytes/character plus a table for supplementary characters
record1:: records that encapsulate a flat1 plus a table for non-ASCII characters
utf16:: records that encapsulate a bytevector-like object with UTF-16 encoding
utf8:: records that encapsulate a bytevector-like object with UTF-8 encoding
The utf16 and utf8 representations do not work very well for Scheme, but are discussed here because they may seem like the most obvious representations to programmers who work with Unicode in other languages. Before proceeding to detailed descriptions of these representations, here is a quick comparison:
#!rst
================== ======== ============== =============== ============= ===========
Representation Supports ``string-ref`` ``string-set!`` Space (32-bit and 64-bit)
------------------ -------- -------------- --------------- -------------------------
flat1 Latin-1 O(1) O(1) n+4 n+8
flat4 Unicode O(1) O(1) 4*n+4 4*n+8
record4 Unicode O(1) O(1) 4*n+20 4*n+32
record2 Unicode O(1) O(1) 2*n+20 2*n+40
record1 Unicode O(1) O(1) n+20 n+40
utf16 Unicode O(n) O(n) 2*n+20 2*n+32
utf8 Unicode O(n) O(n) n+20 n+32
================== ======== ============== =============== ============= ===========
The last two columns show the space that would be occupied by an n-character ASCII string in 32-bit and 64-bit implementations of Larceny. On both 32-bit and 64-bit systems, the space should be rounded up to the next multiple of 8.
This representation was used in Larceny through v0.93. In Larceny v0.94, the use of this representation became a build-time option for users who do not need Unicode.
A prototype of this representation was introduced in Larceny v0.93 and replaced the flat1 representation in v0.94, although the flat1 representation can still be specified at build time (for native Larceny) by changing one of the setup parameters. (Note that Larceny's flat4 representation is not UTF-32, because the constituent characters are boxed. See the discussion of record2 below.)
The main advantage of this representation over flat4 is that the length of a string can change dynamically.
The main advantage of this representation over record4 is that long strings take about half as much space. It may also improve performance when interoperating with external libraries that use UTF-16 encodings.
The auxiliary table is #f
for strings that have never contained any supplementary characters. For all such strings, the main bytevector-like object contains the standard UTF-16 encoding of a string.
When string-set!
is used to store a supplementary character into a record2 string whose auxiliary table is #f
, a second bytevector-like object is allocated, just as long as the original, to serve as the auxiliary table that holds the second code units of surrogate pairs. The first code units of surrogate pairs go into the original bytevector-like object. Once the auxiliary table has been allocated, the main bytevector-like object is unlikely to contain a valid UTF-16 encoding.
In the future, we might consider using record2 in Common Larceny, where the flat4 representation is less efficient. (Common Larceny has no boxed immediates, so all of the representations described here require string-ref
to box its result and string-set!
to unbox its argument.) The record2 representation might result in faster conversions between Scheme and native .NET representations, although it appears that fast conversions can be obtained only by using a less compact representation that would make string-ref
and string-set!
run slower.
The main advantage of this representation over record2 is that long strings of ASCII characters take about half as much space.
The auxiliary table is #f
for strings that have never contained any non-ASCII characters. For all such strings, the main bytevector-like object contains the standard UTF-8 encoding of an ASCII string.
When string-set!
is used to store a non-ASCII character into a record1 string whose auxiliary table is #f
, a bytevector (twice as long as the main table) is allocated to serve as the auxiliary table that holds the low-order 16 bits of all non-ASCII scalar values. If a byte of the main bytevector-like object corresponds to a non-ASCII character, then the low-order 5 bits of the byte contain the scalar value's high-order bits, and the byte's sign bit is set to indicate that it is a non-ASCII character whose 16 low-order bits must be obtained from the auxiliary table. Once the auxiliary table has been allocated, the main bytevector-like object is unlikely to contain a valid UTF-8 encoding.
The record1 representation should become a build-time option for native Larceny and for Petit Larceny.
The main advantage of this representation is ease of interoperation with external libraries that expect Unicode strings to be encoded in UTF-16. The disadvantage is that string-ref
and string-set!
take O(n) time. Caching will provide O(1) time for common access patterns, but storing a supplementary character into the string is likely to take O(n) time even with caching.
For Larceny, the record2 representation is better than utf16 in almost every respect for almost all applications.
The main advantage of this representation is ease of interoperation with external libraries that expect Unicode strings to be encoded in UTF-8. The disadvantage is that string-ref
and string-set!
take O(n) time. Caching will provide O(1) time for common access patterns, but storing a non-ASCII character into the string is likely to take O(n) time even with caching.
For Larceny, the record1 representation is better than utf8 provided most long strings consist of English text.
The auxiliary tables used by the record2 and record1 representations could be hash tables or association lists. That is an attractive option for interpreted systems, but would make string-ref
so complex that compilers would be unlikely to generate inline code for it. As with generic arithmetic, compilers might generate inline code for the common case, trapping to out-of-line code for the uncommon case.
The string-ref
operation does not involve locking to exclude other string-ref
operations, but it may involve locking to exclude a concurrent string-set!
or length-changing operation. On most architectures, with the flat4 representation, a string-ref
should never have to lock. With the record2 and record1 representations, and strings of fixed size, a string-ref
operation should not have to lock unless it accesses the auxiliary table. With the record4, utf16, and utf8 representations (and some careful programming), a string-ref
operation should never have to lock, even if the size of a string can change dynamically.
With representations other than flat4, the string-set!
procedure may have to acquire a lock, and some string-ref
operations may have to wait until the lock is released. On architectures that support atomic update of single bytes, storing an ASCII character into a flat1 or record1 string can be done without locking. On architectures that support atomic update of 16-bit quantities aligned on double-byte boundaries, storing a non-supplementary character into a record2 string can be done without locking.
With the utf16 and utf8 representations, every string-set!
operation will have to acquire a lock. Otherwise a concurrent string-set!
might have to replace the bytevector-like object with one of a different length, causing any side effect to the original to be lost.
With the record2 representation, string-set!
operations that store a supplementary character will have to acquire a lock. With the record1 representation, string-set!
operations that store a non-ASCII character will have to acquire a lock. The auxiliary table field can be used as the lock field, and concurrent string-set!
operations that do not involve locking may proceed while the auxiliary table is locked.
If the size of a string never changes, then fewer operations will have to wait. With the record4 representation, on 32-bit architectures, no operation should ever have to lock or wait.
All of the Unicode-capable representations require more space than the Latin-1 strings that Larceny used through v0.93. The following table shows how many bytes it would take to represent the 10917 strings, with an average length of 14.476 characters, that reside in the initial heap image for Larceny v0.93.
32-bit 64-bit
------ ------
flat1 239120 285000
flat4 697184 741760
record4 871872 1003792
record2 567016 786152
record1 413808 634376
utf16 567016 698808
utf8 413808 547032
In the initial heap image, the print names of symbols account for 190040 of the 239120 bytes occupied by strings. Storing those print names in compressed form, e.g. as a bytevector in UTF-8, would reduce the storage requirements to:
32-bit 64-bit
------ ------
flat1 239120 285000
flat4 346568 391960
record4 373640 432568
record2 302064 374688
record1 266192 339144
utf16 302064 361152
utf8 266192 325608