From f9ff3557accbfef7c4c1f458055e53b2e0efdd44 Mon Sep 17 00:00:00 2001
From: Matthew Malcomson
Date: Tue, 21 Feb 2023 10:41:29 +0000
Subject: [PATCH] Updates to bit-precise-types rationale

Have attempted to expose more of our rationale and use-case analysis.
Have also turned some prose into a more structured explanation using
bullet-points.
---
 design-documents/bit-precise-types.rst | 426 ++++++++++++++++++++-----
 1 file changed, 349 insertions(+), 77 deletions(-)

diff --git a/design-documents/bit-precise-types.rst b/design-documents/bit-precise-types.rst
index fad9dcce..08f77073 100644
--- a/design-documents/bit-precise-types.rst
+++ b/design-documents/bit-precise-types.rst
@@ -20,72 +20,327 @@ a different type.
 The proposal for these types can be found in following link.
 https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2763.pdf
 
-As the rationale mentioned, some applications have uses for a specific bit-width
-type. In the case of writing C code which can be used to determine FPGA
-hardware these specific bit-width types can lead to large performance and space
-savings.
+As the rationale in that proposal mentioned, some applications have uses for a
+specific bit-width type. In the case of writing C code which can be used to
+determine FPGA hardware these specific bit-width types can lead to large
+performance and space savings.
 
-From the perspective of the Arm ABI we have some trade-offs to determine. We
-need to choose a representation for these objects in memory and in registers
-along with the size and alignment of the objects. The main trade-offs we have
-identified in this case are on performance between different types of C-level
-operations, whether certain hardware-level atomic operations are possible, and
-general familiarity of programmers with the representation.
+From the perspective of the Arm ABI we have some trade-offs and decisions to
+make:
 
-For this particular type we are estimating that the use of ``_BitInt`` types
-will not be such that operations on these types are performance critical.
+- We need to choose a representation for these objects in registers.
+- We need to choose a representation, size and alignment of these objects in
+  memory.
+
+The main trade-offs we have identified in this case are:
+
+- Performance of different C-level operations.
+- Whether certain hardware-level atomic operations are possible.
+- Size cost of storing values in memory.
+- General familiarity of programmers with the representation.
+
+Since this is a new type there is large uncertainty about how it will be used
+by programmers in the future. Decisions we make here may also influence future
+usage. Nonetheless we must make trade-off decisions under this uncertainty.
+The sections below attempt to analyze possible use-cases to make our best
+guess as to how these types may be used when targeting Arm CPUs.
+
+
+Use-cases known of so far
+-------------------------
+
+Here we discuss the use-cases for bit-precise integer types that we have
+identified or been alerted to so far.
 
 There seem to be two different regimes for these types. The "small" regime
-where bit-precise types could be stored in a single register, and the "large"
-regime where bit-precise types must span multiple registers.
+where bit-precise types could be stored in a single general-purpose register,
+and the "large" regime where bit-precise types must span multiple
+general-purpose registers.
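+
+To make the distinction between the regimes concrete, a minimal sketch
+(assuming AArch64, where a general-purpose register is 64 bits wide)::
+
+   _BitInt(24)  a;  /* "small": fits within a single 64-bit register   */
+   _BitInt(200) b;  /* "large": must span multiple 64-bit registers    */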
+
+
+C code to describe FPGA behavior
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A major motivating use-case for this new type is to aid writing C code which
+describes the desired behavior of an FPGA. Without the availability of the new
+``_BitInt`` type such C code would semantically have much wider types than
+necessary when performing operations, especially given that operations on
+small integral types promote their operands to ``int``.
+
+If these wider than necessary operations end up in the FPGA they would use
+many more logic gates than necessary. Using ``_BitInt`` allows programmers to
+write code which directly expresses what is needed. This can ensure the FPGA
+description generated saves space and has better performance.
+
+The notable thing about this use-case is that though the C code may be run on
+an Arm architecture (e.g. for testing), the most critical use is when
+transferred to an FPGA (i.e. not an Arm architecture).
+
+That said, if the operation that this FPGA performs becomes popular there may
+be a need to run the code directly on CPUs in the future.
+
+The requirements on Arm ABIs from this use-case are relatively small since the
+main focus is around running on an FPGA. We believe it adds weight to both the
+need for performance and familiarity of programmers. This belief comes from
+the estimate that this may lead to bit-precise types being used in
+performance-critical code in the future, and that it may mean that bit-precise
+types are used on Arm architectures when testing FPGA descriptions (where ease
+of debugging can be prioritized).
+
+
+24-bit Color
+~~~~~~~~~~~~~
+
+Some image file-types use 24-bit color. The new ``_BitInt`` type may be used
+to hold such information.
+
+As it stands we do not know of any specific reason to use a bit-precise
+integral type as opposed to a structure of three bytes for these data types.
+
+If used for 24-bit color we believe that the performance of standard
+arithmetic operations would not be critical. This is because each 24-bit pixel
+usually represents three 8-bit colors, so operations are unlikely to be
+performed on the single value as a whole.
+
+We also believe that if used for 24-bit color it would be helpful to specify a
+size and alignment scheme such that an array of ``_BitInt(24)`` is well packed.
+
+
+Networking Protocols
+~~~~~~~~~~~~~~~~~~~~
+
+Many networking protocols have packed structures in order to minimize data
+sent over the wire. In order to be perfectly packed the code will need to use
+bit-fields rather than bit-precise types for storage, since bit-precise types
+must be addressable and hence at least byte-aligned.
+
+The incentive to use bit-precise integral types for networking code would be
+to maintain the best representation of the operation that is being performed.
+
+One negative of using bit-precise integral types for networking code would be
+that idioms like ``if (x + y > max_representable)`` where ``x`` and ``y`` have
+been loaded from small bit-fields would no longer be viable. We have seen such
+idioms for small values in networking code in the Linux kernel. These are
+intuitive to write, but if ``x`` and ``y`` were bit-precise types they would
+not work as expected: with ``unsigned _BitInt`` operands the addition wraps
+modulo ``2**N`` rather than being performed in ``int``, so the comparison
+could never trigger.
+
+If used in code handling networking protocols, our estimate is that the
+arithmetic manipulation performed on such values will not be the main
+performance bottleneck. This estimation comes from the belief that networking
+is often I/O bound, and that small packed values in networking protocols tend
+to have limited arithmetic performed on them.
+
+Hence we believe that ease of debugging of values in registers may be more
+critical than performance concerns in this use-case.
+
+
+To help the compiler optimize (e.g. for auto vectorization)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The fact that bit-precise types do not automatically promote to ``int``
+during operations could remove some casts which are necessary for C semantics
+but can obscure the intention of a user's code. One place this may help is in
+auto vectorization, where the compiler must be able to see through
+intermediate casts in order to identify the operations being performed.
+
+The incentive for this use-case is an increased likelihood of the compiler
+generating optimal auto-vectorized code.
+
+One point which might imply less take-up of this use-case is that compiler
+intrinsics are available for programmers who want to put in extra effort to
+ensure good vectorization of a loop. This means that using bit-precise types
+would be a mid-range option, providing less-guaranteed codegen improvement
+for less effort.
+
+The ABI should not have much of an effect on this use-case directly, since
+the optimization would be done in the target-independent part of compilers
+and the eventual operations in auto-vectorized code would be acting on vector
+machine types.
+
+That said, bit-precise types would also be used in the surrounding code.
+Given that in this use-case these types are added for performance reasons, it
+seems reasonable to guess that this concern around performance would apply to
+the surrounding code as well. Hence it seems that this use-case would benefit
+from prioritizing performance.
+
+In this use-case the programmer would be converting a codebase using either
+8-bit or 16-bit integers to a bit-precise type of the same size. Such a
+codebase may include calls to variadic functions (like ``printf``) in
+surrounding code. Variadic functions like this may be missed when changing
+types in a codebase, so it would be helpful if the bit-precise types passed
+to such functions matched the representation of the relevant standard
+integral types, in order to avoid extra difficulties during the conversion.
+The C semantics require that variadic arguments undergo the default argument
+promotions, which include the integer promotions. While ``int8_t`` and the
+like undergo integer promotion, ``_BitInt`` does not. Hence this use-case
+would benefit from having the representation of ``_BitInt(8)`` in the PCS
+match that of ``int``, and similarly for the 16-bit and unsigned variants
+(which implies having them sign- or zero-extended).
+
+One further point around this use-case is that decisions which do not affect
+8- and 16-bit types would not affect this use-case.
+
+
+For representing cryptography algorithms
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many cryptography algorithms perform operations on large objects. It seems
+that using a ``_BitInt(128)`` or ``_BitInt(256)`` could express cryptographic
+algorithms more concisely.
+
+For symmetric algorithms the existing block cipher and hash algorithms do not
+tend to operate on chunks this size as single integers. This seems like it
+will remain the case due to CPU limitations and a desire to understand the
+performance characteristics of written algorithms.
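+
+To make the conciseness argument concrete, the following sketch contrasts a
+manual 64-bit-limb addition with the same operation written using a
+bit-precise type. This is a sketch only: the function names are illustrative,
+limb 0 is taken to be least significant, and it assumes an implementation
+supporting a width of 256 (``BITINT_MAXWIDTH``)::
+
+   #include <stdint.h>
+
+   /* Without bit-precise types: 256-bit addition over four 64-bit limbs,
+      propagating the carry by hand.  */
+   void add256_limbs(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
+   {
+       unsigned carry = 0;
+       for (int i = 0; i < 4; i++) {
+           uint64_t t = a[i] + carry;
+           carry = t < carry;        /* carry out of a[i] + carry-in */
+           r[i] = t + b[i];
+           carry += r[i] < b[i];     /* carry out of t + b[i] */
+       }
+   }
+
+   /* With a bit-precise type: the same operation written directly, with
+      the compiler propagating carries.  */
+   unsigned _BitInt(256) add256(unsigned _BitInt(256) a,
+                                unsigned _BitInt(256) b)
+   {
+       return a + b;                 /* wraps modulo 2**256 */
+   }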
+
+For asymmetric algorithms something like elliptic curve cryptography seems
+like it could gain readability from using the new bit-precise types. However
+there would likely be concern around whether code generated from using these
+types is guaranteed to use constant-time operations.
+
+This use-case would only be using "large" bit-precise types. Moreover, all
+relevant sizes are powers of two.
+
+
+Translating some more esoteric languages to C
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+At the moment there exist some high-level languages which support arbitrary
+bit-width integers. Translating such languages to C would benefit from the
+new C type.
+
+We do not know of any specific use-case within these languages other than for
+cryptography algorithms as above. Hence the trade-offs in this space are
+assumed to be based on the trade-offs from the cryptography use-case above.
+
+We estimate the use of translating a more esoteric language to C to be less
+common than writing code directly in C. Hence the weighting of this use-case
+in our trade-offs is correspondingly lower than others.
+
+
+Possible transparent BigNum libraries
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We have heard of interest in using the new bit-precise integer types to
+implement transparent BigNum libraries in C.
+
+Unfortunately this use-case does not directly tell us what kind of code would
+be using these types (e.g. whether it would be algorithmic code or I/O bound
+code). Given the mention of 512x512 matrices in the comment where we heard of
+this, we assume that in general such a library would be CPU-bound code.
+
+Hence we assume that the main consideration here would be performance.
+
+
+Summary of use-case trade-offs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In our estimation, the C to FPGA use-case seems to be the most significant.
+We estimate that use in this space will account for the great majority of the
+use of this new type.
+
+Uses for cryptography, for networking, and for helping the compiler optimize
+certain code seem large enough to consider, but not as widespread.
+
+For the C to FPGA use-case, the majority of the use is not expected to be
+seen on Arm architectures. For helping the compiler optimize code we expect
+to only see bit-precise types with sizes matching those of standard integral
+types. Cryptographic uses are only expected for "large" sizes which are
+powers of two. Networking uses are likely to be using bit-fields for
+in-memory representations.
+
+All use-cases would have concerns around performance and the familiarity of
+representations. There does not seem to be a clear choice to prefer one or
+the other.
+
 Alignment and sizes
 -------------------
+
+Options and their trade-offs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 These types must be at least byte-aligned so they are addressable, and at least
-rounded to a byte boundary in size for ``sizeof``. Since these types have an
-aesthetic similarity to bit-fields one might expect better packing in an array
-of ``_BitInt(24)`` than an array of ``int32_t`` types (i.e. packing as good as a
-byte-array). However, this would require a low alignment of such types and that
-would mean loading and storing of even "small" sized ``_BitInt``'s crossing
-cache boundaries -- leading to an unnecessary performance hit and hindering any
-atomic operations on these.
+rounded to a byte boundary in size for ``sizeof``.
+
+"Small" regime
+//////////////
+
+For the "small" regime there are two obvious options:
+
+A. Byte alignment.
Alignment and size "as if" stored in the next-largest Fundamental Data Type. + (Where the Fundamental Data Types are defined in the relevant PCS documents). + +Option ``A`` has the following benefits: + +- Better packing in an array of ``_BitInt(24)`` than an array of ``int32_t``. + This is more relevant for bit-precise types than others since these types have + an aesthetic similarity to bit-fields and hence programmers might expect good + packing. + +Option ``B`` has the following benefits (both following from the alignment being +greater than or equal to the size of the object in memory): + +- Avoid a performance hit since loading and storing of these "small" sized + ``_BitInt``'s will not cross cache boundaries. +- Atomic loads and stores can be made on these objects. +- The representation of bit-precise types of the same size as standard integer + types will have the same alignment and size in memory. + +In the use-cases we have identified above we did not notice any special need for +tight packing. All of the use-cases we identified would benefit from better +performance characteristics, and the use-case to help the compiler in optimizing +some code would benefit greatly from ``_BitInt(8)`` having the same alignment +and size as a ``int8_t``. Hence for "small" sizes we are choosing to define a ``_BitInt(N)`` size and alignment according to the smallest Fundamental Data Type which has a bit-size greater or equal to ``N``. Similar for ``unsigned`` versions. + +"Large" regime +////////////// For "large" sizes the only approach considered has been to treat these -bit-precise types as an array of ``M`` sized chunks, for some ``M``. The two -"reasonable" choices for this ``M`` seem to either be register sized or -double-register sized. Choosing a register sized chunk would mean smaller sizes -of types for half of the values of ``N``, while choosing a double-register sized -chunk would allow atomic operations on types in the range between the register -and double-register sizes due to the associated extra alignment allowing -operations like ``CASP`` on aarch64 and ``LDRD`` on aarch32. Moreover, the -majority of "large" size use-cases proposed so far are of power-of-two sizes -like sha256 which would not be in the range which suffers in space-cost from -this choice. Finally, defining the ``_BitInt`` representation in this manner -means that on AArch32 a ``_BitInt(64)`` has the same alignment and size as a -``int64_t`` which is the largest size defined on that platform, and on AArch64 -a ``_BitInt(128)`` has the same alignment and size as a ``__int128`` which is -the largest type defined on that platform. This falls out of the fact that -double-register size maps to the largest integral Fundamental Data Type defined -on both platforms. - -Hence for "large" sizes we are choosing to define a ``_BitInt(N)`` size and -alignment by treating them "as if" they are an array of double-register sized -Fundamental Data Types. +bit-precise types as an array of ``M`` sized chunks, for some ``M``. + +There are two obvious choices for ``M``: + +A. Register sized. +B. Double-register sized. + +Option ``A`` has the following benefits: + +- Less space used for half of the values of ``N``. + +Option ``B`` has the following benefit: + +- Would allow atomic operations on types in the range between register + and double-register sizes. + This is due to the associated extra alignment allowing operations like + ``CASP`` on aarch64 and ``LDRD`` on aarch32. 
+- On AArch32 a ``_BitInt(64)`` would have the same alignment and size as an
+  ``int64_t``, and on AArch64 a ``_BitInt(128)`` would have the same alignment
+  and size as a ``__int128``.
+  These are the largest types defined on the relevant architectures, and
+  correspond to the largest integral Fundamental Data Type defined in the PCS
+  for both platforms.
+
+The "large" size use-cases we have identified so far are of power-of-two
+sizes. These sizes would not benefit from the positives of either of the
+options presented here.
+
+Hence for "large" sizes we are choosing based on an estimate of which choice
+is more "generally useful". Our estimate is that the benefits of option ``B``
+are more generally useful than those from option ``A``. That is, we choose to
+define the size and alignment of ``_BitInt(N > [register-size])`` types by
+treating them "as if" they are an array of double-register sized Fundamental
+Data Types.
 
 Representation in bits
 ----------------------
 
 There are two decisions around the representation of a "small" ``_BitInt``
 that we have identified. (1) Whether required bits are stored in the least
-significant end of a register or most significant end of a register. (2) Whether
-the "remaining" bits after rounding up to the size specified in `Alignment and
-sizes`_ are specified or not -- with how these bits would naturally be specified
-depending on the choice made for (1).
+significant end or most significant end of a register or region in memory. (2)
+Whether the "remaining" bits after rounding up to the size specified in
+`Alignment and sizes`_ are specified or not. The choice of *how* "remaining"
+bits would be specified would tie in to the choice made for (1).
+
 Options and their trade-offs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -100,19 +355,19 @@ C. Required bits stored in least significant end. Not-required bits are
    specified as zero- or sign-extended.
 
 While it would be possible to make different requirements for bit-precise
-integer types in memory vs in registers, we believe that the negatives of having
-to perform a transformation on loading and storing values and the programmer
-confusion associated with different representations are reason enough to not
-look into this option further. Especially since the differentiating factors
-were not drastically different between memory and register regimes.
-
-Similarly, it would be possible to define a representation that does something
-like specifying bits ``[2-7]`` of a ``_BitInt(2)`` but leaves bits ``[8-63]``
-unspecified. This would seem to choose the worst of both worlds in terms of
-performance, since one must both ensure "overflow" from an addition of
-``_BitInt(2)`` types does not affect the specified bits **and** ensure that the
-unspecified bits do not affect multiplication or division operations.
-Hence we do not look at variations of this kind.
+integer types in memory vs in registers, we believe that the combined negatives
+of this choice are reason enough not to look into the option further. These
+negatives are that code would have to perform a transformation on loading and
+storing values, and that having different representations in memory and in
+registers is likely to cause programmer confusion.
+
+Similarly, it would be possible to define a representation in registers that
+does something like specifying bits ``[2-7]`` of a ``_BitInt(2)`` but leaves
+bits ``[8-63]`` unspecified.
+This would seem to choose the worst of both worlds in terms of performance,
+since one must both ensure "overflow" from an addition of ``_BitInt(2)`` types
+does not affect the specified bits **and** ensure that the unspecified bits
+above bit number 7 do not affect multiplication or division operations. Hence
+we do not look at variations of this kind.
 
 For option ``A`` there is an extra choice around how "large" values are stored.
 One could either have the "padding" bits in the least significant "chunk", or
@@ -135,9 +390,9 @@ It has the following negatives:
 
 - This would be a less familiar representation to programmers. Especially the
   fact that a ``_BitInt(8)`` would not have the same representation in a
-  register as a ``char`` would likely cause confusion (e.g. when debugging, or
-  writing assembly code). This would likely be increased if other architectures
-  that programmers may use have a more familiar representation.
+  register as a ``char`` could cause confusion (e.g. when debugging, or writing
+  assembly code). This would likely be increased if other architectures that
+  programmers may use have a more familiar representation.
 
 - Operations ``*,/``, saving and loading values to memory, and casting to
   another type would all require extra cost.
@@ -145,6 +400,9 @@ It has the following negatives:
 
 - Operations ``+,-`` on "large" values (greater than one register) would
   require an extra instruction to "normalize" the carry-bit.
 
+- If used in calls to variadic functions that were written for standard
+  integral types, this can give surprising results.
+
 Option ``B`` has the following benefits:
 
@@ -162,14 +420,17 @@ It has the following negatives:
 
 - The AArch64 ``LD{S,U}MAX`` operations would not work naturally on small
   values of this representation.
 
-- Operations ``/,%,==,<,>,<=,>=,>>`` and widening conversions would not
-  require extra work.
+- Operations ``/,%,==,<,>,<=,>=,>>`` and widening conversions on operands
+  coming from an ABI boundary would require masking the operands.
 
 - On AArch32 this could cause surprises to developers, given that on this
   architecture small Fundamental Data Types are have zero- or sign-extended
   extra bits. So a ``char`` would not have the same representation as a
   ``_BitInt(8)`` on this architecture.
 
+- If used in calls to variadic functions that were written for standard
+  integral types, this can give surprising results.
+
 Option ``C`` has the following benefits:
 
@@ -182,6 +443,9 @@ Option ``C`` has the following benefits:
 
 - On AArch32 this could match the expectation of developers, with a
   ``_BitInt(8)`` in a register matching the representation of a ``char``.
 
+- If used in variadic function calls, mismatches between ``_BitInt`` types and
+  standard integral types would not cause as much of a problem.
+
 It has the following negatives:
 
 - The AArch64 ``LDADD`` operations would not work naturally.
 
@@ -194,20 +458,28 @@ It has the following negatives:
 Summary, suggestion, and reasoning
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Overall it seems that for operations on small values option ``A`` is more
-performant. However, when acting on "large" values (i.e. greater than the size
-of one register) it loses some of that benefit. Storing to and from memory
-would also come at a cost for this representation. This is also likely to be
-the most surprising representation for developers on an Arm platform.
+
+Overall it seems that option ``A`` is more performant for operations on small
+values. However, when acting on "large" values (i.e. greater than the size of
+one register) it loses some of that benefit. Storing to and from memory would
+also come at a cost for this representation. This is also likely to be the
+most surprising representation for developers on an Arm platform.
 
 Between option ``B`` and option ``C`` there is not a great difference in
 performance characteristics. However it should be noted that option ``C`` is
-likely the most natural extension of the AArch32 PCS rules for unspecified bits
-in a register containing a small Fundamental Data Type, while option ``B`` is
-the most natural extension of the similar rules in AArch64 PCS.
-
-As mentioned above, we do not expect operations on ``_BitInt`` types to be
-performance critical. Given that providing a productive environment for
-developers is valuable and following the "principle of least surprise" is a
-good way to achieve that, we suggest choosing option ``C`` for AArch32 and
-option ``B`` for AArch64.
+the most natural extension of the AArch32 PCS rules for unspecified bits in a
+register containing a small Fundamental Data Type, while option ``B`` is the
+most natural extension of the similar rules in the AArch64 PCS. Furthermore,
+option ``C`` would mean that accidental misuse of a bit-precise type instead
+of a standard integral type should not cause problems, while ``B`` could give
+strange values. This would be most visible with variadic functions.
+
+As mentioned above, both performance concerns and a familiar representation
+are valuable in the use-cases that we have identified. This has made the
+decision non-obvious. We have chosen to favor representation familiarity.
+
+Choosing between ``C`` and ``B`` is also non-obvious. It seems relatively
+clear to choose option ``C`` for AArch32. We choose option ``B`` for AArch64
+so that across most ABI boundaries a ``char`` and a ``_BitInt(8)`` have the
+same representation, but acknowledge that this could cause surprise to
+programmers when using variadic functions.
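+
+To illustrate the variadic-function concern, a sketch of the kind of
+accidental misuse we have in mind (passing a ``_BitInt(8)`` to ``%d`` is
+formally undefined behavior; the point here is what happens in practice under
+each option)::
+
+   #include <stdio.h>
+
+   void example(void)
+   {
+       signed char c = -1;
+       _BitInt(8)  b = -1;
+
+       /* Well-defined: c undergoes the default argument promotions and is
+          passed as an int, so this prints -1.  */
+       printf("%d\n", c);
+
+       /* _BitInt(8) does not promote, so the callee receives a _BitInt(8).
+          Under option C the register happens to hold a sign-extended value
+          and this would print -1; under option B the bits above bit 7 are
+          unspecified, so the value read as an int is unreliable.  */
+       printf("%d\n", b);
+   }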
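+
+Finally, returning to the rules chosen in `Alignment and sizes`_, a sketch of
+the sizes and alignments we believe those rules imply on AArch64 (where the
+register size is 8 bytes and the double-register size is 16 bytes; these
+numbers are our reading of the rules above rather than normative statements)::
+
+   /* "Small" regime: size and alignment of the smallest Fundamental Data
+      Type with a bit-size greater than or equal to N.  */
+   _Static_assert(sizeof(_BitInt(7))    == 1, "byte");
+   _Static_assert(sizeof(_BitInt(24))   == 4, "word, not 3 bytes");
+   _Static_assert(_Alignof(_BitInt(24)) == 4, "word alignment");
+
+   /* "Large" regime: an array of double-register (16-byte) chunks.  */
+   _Static_assert(sizeof(_BitInt(129))   == 32, "two 16-byte chunks");
+   _Static_assert(_Alignof(_BitInt(129)) == 16, "double-register alignment");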