From f9ff3557accbfef7c4c1f458055e53b2e0efdd44 Mon Sep 17 00:00:00 2001
From: Matthew Malcomson
Date: Tue, 21 Feb 2023 10:41:29 +0000
Subject: [PATCH] Updates to bit-precise-types rationale

Have attempted to expose more of our rationale and use-case analysis.
Have also turned some prose into a more structured explanation using
bullet-points.
---
 design-documents/bit-precise-types.rst | 426 ++++++++++++++++++++-----
 1 file changed, 349 insertions(+), 77 deletions(-)

diff --git a/design-documents/bit-precise-types.rst b/design-documents/bit-precise-types.rst
index fad9dcce..08f77073 100644
--- a/design-documents/bit-precise-types.rst
+++ b/design-documents/bit-precise-types.rst
@@ -20,72 +20,327 @@ a different type.
 The proposal for these types can be found in following link.
 https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2763.pdf
 
-As the rationale mentioned, some applications have uses for a specific bit-width
-type. In the case of writing C code which can be used to determine FPGA
-hardware these specific bit-width types can lead to large performance and space
-savings.
+As the rationale in that proposal mentioned, some applications have uses for a
+specific bit-width type. In the case of writing C code which can be used to
+determine FPGA hardware these specific bit-width types can lead to large
+performance and space savings.
 
-From the perspective of the Arm ABI we have some trade-offs to determine. We
-need to choose a representation for these objects in memory and in registers
-along with the size and alignment of the objects. The main trade-offs we have
-identified in this case are on performance between different types of C-level
-operations, whether certain hardware-level atomic operations are possible, and
-general familiarity of programmers with the representation.
+From the perspective of the Arm ABI we have some trade-offs and decisions to
+make:
 
-For this particular type we are estimating that the use of ``_BitInt`` types
-will not be such that operations on these types are performance critical.
+- We need to choose a representation for these objects in registers.
+- We need to choose a representation, size and alignment of these objects in
+  memory.
+
+The main trade-offs we have identified in this case are:
+
+- Performance of different C-level operations.
+- Whether certain hardware-level atomic operations are possible.
+- Size cost of storing values in memory.
+- General familiarity of programmers with the representation.
+
+Since this is a new type there is large uncertainty about how it will be used
+by programmers in the future. Decisions we make here may also influence future
+usage. Nonetheless we must make trade-off decisions under this uncertainty.
+The sections below attempt to analyze possible use-cases to make our best
+guess as to how these types may be used when targeting Arm CPUs.
+
+
+Use-cases known of so far
+-------------------------
+
+Here we discuss the use-cases for bit-precise integer types that we have
+identified or been alerted to so far.
 
 There seem to be two different regimes for these types. The "small" regime
-where bit-precise types could be stored in a single register, and the "large"
-regime where bit-precise types must span multiple registers.
+where bit-precise types could be stored in a single general-purpose register,
+and the "large" regime where bit-precise types must span multiple
+general-purpose registers.
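+
+To make the distinction between the regimes concrete, a minimal sketch
+(assuming AArch64, where a general-purpose register is 64 bits wide)::
+
+   _BitInt(24)  a;  /* "small": fits within a single 64-bit register   */
+   _BitInt(200) b;  /* "large": must span multiple 64-bit registers    */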
+
+
+C code to describe FPGA behavior
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A major motivating use-case for this new type is to aid writing C code which
+describes the desired behavior of an FPGA. Without the availability of the new
+``_BitInt`` type such C code would semantically have much wider types than
+necessary when performing operations, especially given that operations on
+small integral types promote their operands to ``int``.
+
+If these wider than necessary operations end up in the FPGA they would use
+many more logic gates than necessary. Using ``_BitInt`` allows programmers to
+write code which directly expresses what is needed. This can ensure the FPGA
+description generated saves space and has better performance.
+
+The notable thing about this use-case is that though the C code may be run on
+an Arm architecture (e.g. for testing), the most critical use is when
+transferred to an FPGA (i.e. not an Arm architecture).
+
+That said, if the operation that this FPGA performs becomes popular there may
+be a need to run the code directly on CPUs in the future.
+
+The requirements on Arm ABIs from this use-case are relatively small since the
+main focus is around running on an FPGA. We believe it adds weight to both the
+need for performance and familiarity of programmers. This belief comes from
+the estimate that this may lead to bit-precise types being used in
+performance-critical code in the future, and that it may mean that bit-precise
+types are used on Arm architectures when testing FPGA descriptions (where ease
+of debugging can be prioritized).
+
+
+24-bit Color
+~~~~~~~~~~~~~
+
+Some image file-types use 24-bit color. The new ``_BitInt`` type may be used
+to hold such information.
+
+As it stands we do not know of any specific reason to use a bit-precise
+integral type as opposed to a structure of three bytes for these data types.
+
+If used for 24-bit color we believe that the performance of standard
+arithmetic operations would not be critical. This is because each 24-bit pixel
+usually represents three 8-bit colors, so operations are unlikely to be
+performed on the single value as a whole.
+
+We also believe that if used for 24-bit color it would be helpful to specify a
+size and alignment scheme such that an array of ``_BitInt(24)`` is well packed.
+
+
+Networking Protocols
+~~~~~~~~~~~~~~~~~~~~
+
+Many networking protocols have packed structures in order to minimize data
+sent over the wire. In order to be perfectly packed the code will need to use
+bit-fields rather than bit-precise types for storage, since bit-precise types
+must be addressable and hence at least byte-aligned.
+
+The incentive to use bit-precise integral types for networking code would be
+to maintain the best representation of the operation that is being performed.
+
+One negative of using bit-precise integral types for networking code would be
+that idioms like ``if (x + y > max_representable)`` where ``x`` and ``y`` have
+been loaded from small bit-fields would no longer be viable. We have seen such
+idioms for small values in networking code in the Linux kernel. These are
+intuitive to write, but if ``x`` and ``y`` were bit-precise types they would
+not work as expected: with ``unsigned _BitInt`` operands the addition wraps
+modulo ``2**N`` rather than being performed in ``int``, so the comparison
+could never trigger.
+
+If used in code handling networking protocols, our estimate is that the
+arithmetic manipulation performed on such values will not be the main
+performance bottleneck. This estimation comes from the belief that networking
+is often I/O bound, and that small packed values in networking protocols tend
+to have limited arithmetic performed on them.
+
+Hence we believe that ease of debugging of values in registers may be more
+critical than performance concerns in this use-case.
+
+
+To help the compiler optimize (e.g. for auto vectorization)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The fact that bit-precise types do not automatically promote to ``int``
+during operations could remove some casts which are necessary for C semantics
+but can obscure the intention of a user's code. One place this may help is in
+auto vectorization, where the compiler must be able to see through
+intermediate casts in order to identify the operations being performed.
+
+The incentive for this use-case is an increased likelihood of the compiler
+generating optimal auto-vectorized code.
+
+One point which might imply less take-up of this use-case is that compiler
+intrinsics are available for programmers who want to put in extra effort to
+ensure good vectorization of a loop. This means that using bit-precise types
+would be a mid-range option, providing less-guaranteed codegen improvement
+for less effort.
+
+The ABI should not have much of an effect on this use-case directly, since
+the optimization would be done in the target-independent part of compilers
+and the eventual operations in auto-vectorized code would be acting on vector
+machine types.
+
+That said, bit-precise types would also be used in the surrounding code.
+Given that in this use-case these types are added for performance reasons, it
+seems reasonable to guess that this concern around performance would apply to
+the surrounding code as well. Hence it seems that this use-case would benefit
+from prioritizing performance.
+
+In this use-case the programmer would be converting a codebase using either
+8-bit or 16-bit integers to a bit-precise type of the same size. Such a
+codebase may include calls to variadic functions (like ``printf``) in
+surrounding code. Variadic functions like this may be missed when changing
+types in a codebase, so it would be helpful if the bit-precise types passed
+to such functions matched the representation of the relevant standard
+integral types, in order to avoid extra difficulties during the conversion.
+The C semantics require that variadic arguments undergo the default argument
+promotions, which include the integer promotions. While ``int8_t`` and the
+like undergo integer promotion, ``_BitInt`` does not. Hence this use-case
+would benefit from having the representation of ``_BitInt(8)`` in the PCS
+match that of ``int``, and similarly for the 16-bit and unsigned variants
+(which implies having them sign- or zero-extended).
+
+One further point around this use-case is that decisions which do not affect
+8- and 16-bit types would not affect this use-case.
+
+
+For representing cryptography algorithms
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many cryptography algorithms perform operations on large objects. It seems
+that using a ``_BitInt(128)`` or ``_BitInt(256)`` could express cryptographic
+algorithms more concisely.
+
+For symmetric algorithms the existing block cipher and hash algorithms do not
+tend to operate on chunks this size as single integers. This seems like it
+will remain the case due to CPU limitations and a desire to understand the
+performance characteristics of written algorithms.
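+
+To make the conciseness argument concrete, the following sketch contrasts a
+manual 64-bit-limb addition with the same operation written using a
+bit-precise type. This is a sketch only: the function names are illustrative,
+limb 0 is taken to be least significant, and it assumes an implementation
+supporting a width of 256 (``BITINT_MAXWIDTH``)::
+
+   #include <stdint.h>
+
+   /* Without bit-precise types: 256-bit addition over four 64-bit limbs,
+      propagating the carry by hand.  */
+   void add256_limbs(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
+   {
+       unsigned carry = 0;
+       for (int i = 0; i < 4; i++) {
+           uint64_t t = a[i] + carry;
+           carry = t < carry;        /* carry out of a[i] + carry-in */
+           r[i] = t + b[i];
+           carry += r[i] < b[i];     /* carry out of t + b[i] */
+       }
+   }
+
+   /* With a bit-precise type: the same operation written directly, with
+      the compiler propagating carries.  */
+   unsigned _BitInt(256) add256(unsigned _BitInt(256) a,
+                                unsigned _BitInt(256) b)
+   {
+       return a + b;                 /* wraps modulo 2**256 */
+   }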
+
+For asymmetric algorithms something like elliptic curve cryptography seems
+like it could gain readability from using the new bit-precise types. However
+there would likely be concern around whether code generated from using these
+types is guaranteed to use constant-time operations.
+
+This use-case would only be using "large" bit-precise types. Moreover, all
+relevant sizes are powers of two.
+
+
+Translating some more esoteric languages to C
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+At the moment there exist some high-level languages which support arbitrary
+bit-width integers. Translating such languages to C would benefit from the
+new C type.
+
+We do not know of any specific use-case within these languages other than for
+cryptography algorithms as above. Hence the trade-offs in this space are
+assumed to be based on the trade-offs from the cryptography use-case above.
+
+We estimate the use of translating a more esoteric language to C to be less
+common than writing code directly in C. Hence the weighting of this use-case
+in our trade-offs is correspondingly lower than others.
+
+
+Possible transparent BigNum libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have heard of interest in using the new bit-precise integer types to
+implement transparent BigNum libraries in C.
+
+Unfortunately this use-case does not directly tell us what kind of code would
+be using these types (e.g. whether it would be algorithmic code or I/O bound
+code). Given the mention of 512x512 matrices in the comment where we heard of
+this, we assume that in general such a library would be CPU-bound code.
+
+Hence we assume that the main consideration here would be performance.
+
+
+Summary of use-case trade-offs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In our estimation, the C to FPGA use-case seems to be the most significant.
+We estimate that use in this space will account for the great majority of the
+use of this new type.
+
+Uses for cryptography, for networking, and for helping the compiler optimize
+certain code seem large enough to consider, but not as widespread.
+
+For the C to FPGA use-case, the majority of the use is not expected to be
+seen on Arm architectures. For helping the compiler optimize code we expect
+to only see bit-precise types with sizes matching those of standard integral
+types. Cryptographic uses are only expected for "large" sizes which are
+powers of two. Networking uses are likely to be using bit-fields for
+in-memory representations.
+
+All use-cases would have concerns around performance and the familiarity of
+representations. There does not seem to be a clear choice to prefer one or
+the other.
+
 Alignment and sizes
 -------------------
+
+Options and their trade-offs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 These types must be at least byte-aligned so they are addressable, and at least
-rounded to a byte boundary in size for ``sizeof``. Since these types have an
-aesthetic similarity to bit-fields one might expect better packing in an array
-of ``_BitInt(24)`` than an array of ``int32_t`` types (i.e. packing as good as a
-byte-array). However, this would require a low alignment of such types and that
-would mean loading and storing of even "small" sized ``_BitInt``'s crossing
-cache boundaries -- leading to an unnecessary performance hit and hindering any
-atomic operations on these.
+rounded to a byte boundary in size for ``sizeof``.
+
+"Small" regime
+//////////////
+
+For the "small" regime there are two obvious options:
+
+A. Byte alignment.
Alignment and size "as if" stored in the next-largest Fundamental Data Type. + (Where the Fundamental Data Types are defined in the relevant PCS documents). + +Option ``A`` has the following benefits: + +- Better packing in an array of ``_BitInt(24)`` than an array of ``int32_t``. + This is more relevant for bit-precise types than others since these types have + an aesthetic similarity to bit-fields and hence programmers might expect good + packing. + +Option ``B`` has the following benefits (both following from the alignment being +greater than or equal to the size of the object in memory): + +- Avoid a performance hit since loading and storing of these "small" sized + ``_BitInt``'s will not cross cache boundaries. +- Atomic loads and stores can be made on these objects. +- The representation of bit-precise types of the same size as standard integer + types will have the same alignment and size in memory. + +In the use-cases we have identified above we did not notice any special need for +tight packing. All of the use-cases we identified would benefit from better +performance characteristics, and the use-case to help the compiler in optimizing +some code would benefit greatly from ``_BitInt(8)`` having the same alignment +and size as a ``int8_t``. Hence for "small" sizes we are choosing to define a ``_BitInt(N)`` size and alignment according to the smallest Fundamental Data Type which has a bit-size greater or equal to ``N``. Similar for ``unsigned`` versions. + +"Large" regime +////////////// For "large" sizes the only approach considered has been to treat these -bit-precise types as an array of ``M`` sized chunks, for some ``M``. The two -"reasonable" choices for this ``M`` seem to either be register sized or -double-register sized. Choosing a register sized chunk would mean smaller sizes -of types for half of the values of ``N``, while choosing a double-register sized -chunk would allow atomic operations on types in the range between the register -and double-register sizes due to the associated extra alignment allowing -operations like ``CASP`` on aarch64 and ``LDRD`` on aarch32. Moreover, the -majority of "large" size use-cases proposed so far are of power-of-two sizes -like sha256 which would not be in the range which suffers in space-cost from -this choice. Finally, defining the ``_BitInt`` representation in this manner -means that on AArch32 a ``_BitInt(64)`` has the same alignment and size as a -``int64_t`` which is the largest size defined on that platform, and on AArch64 -a ``_BitInt(128)`` has the same alignment and size as a ``__int128`` which is -the largest type defined on that platform. This falls out of the fact that -double-register size maps to the largest integral Fundamental Data Type defined -on both platforms. - -Hence for "large" sizes we are choosing to define a ``_BitInt(N)`` size and -alignment by treating them "as if" they are an array of double-register sized -Fundamental Data Types. +bit-precise types as an array of ``M`` sized chunks, for some ``M``. + +There are two obvious choices for ``M``: + +A. Register sized. +B. Double-register sized. + +Option ``A`` has the following benefits: + +- Less space used for half of the values of ``N``. + +Option ``B`` has the following benefit: + +- Would allow atomic operations on types in the range between register + and double-register sizes. + This is due to the associated extra alignment allowing operations like + ``CASP`` on aarch64 and ``LDRD`` on aarch32. 
+- On AArch32 a ``_BitInt(64)`` would have the same alignment and size as an
+  ``int64_t``, and on AArch64 a ``_BitInt(128)`` would have the same alignment
+  and size as a ``__int128``.
+  These are the largest types defined on the relevant architectures, and
+  correspond to the largest integral Fundamental Data Type defined in the PCS
+  for both platforms.
+
+The "large" size use-cases we have identified so far are of power-of-two
+sizes. These sizes would not benefit from the positives of either of the
+options presented here.
+
+Hence for "large" sizes we are choosing based on an estimate of which choice
+is more "generally useful". Our estimate is that the benefits of option ``B``
+are more generally useful than those from option ``A``. That is, we choose to
+define the size and alignment of ``_BitInt(N > [register-size])`` types by
+treating them "as if" they are an array of double-register sized Fundamental
+Data Types.
 
 Representation in bits
 ----------------------
 
 There are two decisions around the representation of a "small" ``_BitInt``
 that we have identified. (1) Whether required bits are stored in the least
-significant end of a register or most significant end of a register. (2) Whether
-the "remaining" bits after rounding up to the size specified in `Alignment and
-sizes`_ are specified or not -- with how these bits would naturally be specified
-depending on the choice made for (1).
+significant end or most significant end of a register or region in memory. (2)
+Whether the "remaining" bits after rounding up to the size specified in
+`Alignment and sizes`_ are specified or not. The choice of *how* "remaining"
+bits would be specified would tie in to the choice made for (1).
+
 Options and their trade-offs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -100,19 +355,19 @@ C. Required bits stored in least significant end. Not-required bits are
    specified as zero- or sign-extended.
 
 While it would be possible to make different requirements for bit-precise
-integer types in memory vs in registers, we believe that the negatives of having
-to perform a transformation on loading and storing values and the programmer
-confusion associated with different representations are reason enough to not
-look into this option further. Especially since the differentiating factors
-were not drastically different between memory and register regimes.
-
-Similarly, it would be possible to define a representation that does something
-like specifying bits ``[2-7]`` of a ``_BitInt(2)`` but leaves bits ``[8-63]``
-unspecified. This would seem to choose the worst of both worlds in terms of
-performance, since one must both ensure "overflow" from an addition of
-``_BitInt(2)`` types does not affect the specified bits **and** ensure that the
-unspecified bits do not affect multiplication or division operations.
-Hence we do not look at variations of this kind.
+integer types in memory vs in registers, we believe that the combined negatives
+of this choice are reason enough not to look into the option further. These
+negatives are that code would have to perform a transformation on loading and
+storing values, and that having different representations in memory and in
+registers is likely to cause programmer confusion.
+
+Similarly, it would be possible to define a representation in registers that
+does something like specifying bits ``[2-7]`` of a ``_BitInt(2)`` but leaves
+bits ``[8-63]`` unspecified.
+This would seem to choose the worst of both worlds in terms of performance,
+since one must both ensure "overflow" from an addition of ``_BitInt(2)`` types
+does not affect the specified bits **and** ensure that the unspecified bits
+above bit number 7 do not affect multiplication or division operations. Hence
+we do not look at variations of this kind.
 
 For option ``A`` there is an extra choice around how "large" values are stored.
 One could either have the "padding" bits in the least significant "chunk", or
@@ -135,9 +390,9 @@ It has the following negatives:
 
 - This would be a less familiar representation to programmers. Especially the
   fact that a ``_BitInt(8)`` would not have the same representation in a
-  register as a ``char`` would likely cause confusion (e.g. when debugging, or
-  writing assembly code). This would likely be increased if other architectures
-  that programmers may use have a more familiar representation.
+  register as a ``char`` could cause confusion (e.g. when debugging, or writing
+  assembly code). This would likely be increased if other architectures that
+  programmers may use have a more familiar representation.
 
 - Operations ``*,/``, saving and loading values to memory, and casting to
   another type would all require extra cost.
@@ -145,6 +400,9 @@ It has the following negatives:
 
 - Operations ``+,-`` on "large" values (greater than one register) would
   require an extra instruction to "normalize" the carry-bit.
 
+- If used in calls to variadic functions that were written for standard
+  integral types, this can give surprising results.
+
 Option ``B`` has the following benefits:
 
@@ -162,14 +420,17 @@ It has the following negatives:
 
 - The AArch64 ``LD{S,U}MAX`` operations would not work naturally on small
   values of this representation.
 
-- Operations ``/,%,==,<,>,<=,>=,>>`` and widening conversions would not
-  require extra work.
+- Operations ``/,%,==,<,>,<=,>=,>>`` and widening conversions on operands
+  coming from an ABI boundary would require masking the operands.
 
 - On AArch32 this could cause surprises to developers, given that on this
   architecture small Fundamental Data Types are have zero- or sign-extended
   extra bits. So a ``char`` would not have the same representation as a
   ``_BitInt(8)`` on this architecture.
 
+- If used in calls to variadic functions that were written for standard
+  integral types, this can give surprising results.
+
 Option ``C`` has the following benefits:
 
@@ -182,6 +443,9 @@ Option ``C`` has the following benefits:
 
 - On AArch32 this could match the expectation of developers, with a
   ``_BitInt(8)`` in a register matching the representation of a ``char``.
 
+- If used in variadic function calls, mismatches between ``_BitInt`` types and
+  standard integral types would not cause as much of a problem.
+
 It has the following negatives:
 
 - The AArch64 ``LDADD`` operations would not work naturally.
 
@@ -194,20 +458,28 @@ It has the following negatives:
 Summary, suggestion, and reasoning
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Overall it seems that for operations on small values option ``A`` is more
-performant. However, when acting on "large" values (i.e. greater than the size
-of one register) it loses some of that benefit. Storing to and from memory
-would also come at a cost for this representation. This is also likely to be
-the most surprising representation for developers on an Arm platform.
+
+Overall it seems that option ``A`` is more performant for operations on small
+values. However, when acting on "large" values (i.e. greater than the size of
+one register) it loses some of that benefit. Storing to and from memory would
+also come at a cost for this representation. This is also likely to be the
+most surprising representation for developers on an Arm platform.
 
 Between option ``B`` and option ``C`` there is not a great difference in
 performance characteristics. However it should be noted that option ``C`` is
-likely the most natural extension of the AArch32 PCS rules for unspecified bits
-in a register containing a small Fundamental Data Type, while option ``B`` is
-the most natural extension of the similar rules in AArch64 PCS.
-
-As mentioned above, we do not expect operations on ``_BitInt`` types to be
-performance critical. Given that providing a productive environment for
-developers is valuable and following the "principle of least surprise" is a
-good way to achieve that, we suggest choosing option ``C`` for AArch32 and
-option ``B`` for AArch64.
+the most natural extension of the AArch32 PCS rules for unspecified bits in a
+register containing a small Fundamental Data Type, while option ``B`` is the
+most natural extension of the similar rules in the AArch64 PCS. Furthermore,
+option ``C`` would mean that accidental misuse of a bit-precise type instead
+of a standard integral type should not cause problems, while ``B`` could give
+strange values. This would be most visible with variadic functions.
+
+As mentioned above, both performance concerns and a familiar representation
+are valuable in the use-cases that we have identified. This has made the
+decision non-obvious. We have chosen to favor representation familiarity.
+
+Choosing between ``C`` and ``B`` is also non-obvious. It seems relatively
+clear to choose option ``C`` for AArch32. We choose option ``B`` for AArch64
+so that across most ABI boundaries a ``char`` and a ``_BitInt(8)`` have the
+same representation, but acknowledge that this could cause surprise to
+programmers when using variadic functions.
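+
+To illustrate the variadic-function concern, a sketch of the kind of
+accidental misuse we have in mind (passing a ``_BitInt(8)`` to ``%d`` is
+formally undefined behavior; the point here is what happens in practice under
+each option)::
+
+   #include <stdio.h>
+
+   void example(void)
+   {
+       signed char c = -1;
+       _BitInt(8)  b = -1;
+
+       /* Well-defined: c undergoes the default argument promotions and is
+          passed as an int, so this prints -1.  */
+       printf("%d\n", c);
+
+       /* _BitInt(8) does not promote, so the callee receives a _BitInt(8).
+          Under option C the register happens to hold a sign-extended value
+          and this would print -1; under option B the bits above bit 7 are
+          unspecified, so the value read as an int is unreliable.  */
+       printf("%d\n", b);
+   }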
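+
+Finally, returning to the rules chosen in `Alignment and sizes`_, a sketch of
+the sizes and alignments we believe those rules imply on AArch64 (where the
+register size is 8 bytes and the double-register size is 16 bytes; these
+numbers are our reading of the rules above rather than normative statements)::
+
+   /* "Small" regime: size and alignment of the smallest Fundamental Data
+      Type with a bit-size greater than or equal to N.  */
+   _Static_assert(sizeof(_BitInt(7))    == 1, "byte");
+   _Static_assert(sizeof(_BitInt(24))   == 4, "word, not 3 bytes");
+   _Static_assert(_Alignof(_BitInt(24)) == 4, "word alignment");
+
+   /* "Large" regime: an array of double-register (16-byte) chunks.  */
+   _Static_assert(sizeof(_BitInt(129))   == 32, "two 16-byte chunks");
+   _Static_assert(_Alignof(_BitInt(129)) == 16, "double-register alignment");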