Change SEW to be the "largest element width" #425
Comments
Regarding error checking, an implementation still has to check whether EEW and/or LMUL underflow.
LEW would be a pretty disruptive change, and I think I've convinced myself that if we stick with fixed effective-width loads/stores in the base encoding ({8,16,32,SEW}), then there's no big advantage to changing the base EW reference point, and a big disadvantage that the assembler instructions would not read as well (e.g., w always means 2*v now). Either way, we need to add some more widening operations, but if we stick with the current SEW definition, we'll need ones that sign/zero-extend a <SEW type to SEW. As these are unary operations, they can easily fit in the encoding space.
LEW would definitely reduce the number of LMUL switches for DSP applications, where most (if not all: #287) of the single-width operations on narrow-width elements are loads and stores; as a data point, in my YUV to RGB implementation, this would reduce the number of vsetvli instructions from 4 to 2 per loop iteration. Given a sufficiently powerful system for loads and stores, like a Memory Element Width, this could even be reduced to 0. (For the record, I assume loads and stores that don't widen or narrow, at least for my applications, on the assumption that most implementations would prefer to perform LEW/2 * LEW/2 rather than LEW * LEW on inputs that happen to have the top LEW/2 bits at 0 or identical top LEW/2 + 1 bits.)
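To make that concrete, here is a sketch of the shape of loop I mean (not the actual YUV-to-RGB code; register choices and vsetvli counts are illustrative):

```
# sketch, current definition (loads/stores use EEW = SEW):
# 8b I/O forces a vtype change on each side of the 16b arithmetic
vsetvli t0, a0, e8, m1     # (1) 8b for the load
vle8.v  v0, (a1)           # 8b source pixels
vsetvli t0, a0, e16, m2    # (2) 16b for the arithmetic
# ... 16b arithmetic ...
vsetvli t0, a0, e8, m1     # (3) back to 8b for the store
vse8.v  v8, (a2)           # 8b result pixels
vsetvli t0, a0, e16, m2    # (4) restore 16b for the next stage

# sketch, LEW definition: SEW=16 names the largest width, and the 8b
# loads/stores encode EEW=8 directly, so vtype can stay put
vsetvli t0, a0, e16, m2
vle8.v  v0, (a1)
# ... 16b arithmetic ...
vse8.v  v8, (a2)
```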
SEW and LMUL values are essential to correct code execution regardless of load/store width encoding. They should be assembler variables, set automatically by vsetvli (and by vsetvl when its xs2 argument is statically defined). For vsetvl with a dynamic xs2, a manual assembler directive should be available. This should help in various situations, including validating that the SEW/LMUL ratio is maintained by a given vsetvli, and also the load/store syntax: with this in place, the assembler can translate e8 to the correct SEW * factor value in loads/stores, eliminating the readability concern.
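For illustration, a sketch of how this could look; the .vtype directive name is hypothetical, not an existing assembler feature:

```
vsetvli t0, a0, e16, m2    # assembler now tracks SEW=16, LMUL=2 automatically
vle8.v  v0, (a1)           # assembler can resolve e8 as the SEW * 1/2 encoding
vsetvl  t0, a0, a2         # dynamic xs2: assembler loses track of vtype
.vtype  e16, m2            # hypothetical manual directive restoring tracking
```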
I agree that the base should have as robust an encoding as possible without overcommitting available bits. Thus I would also want to be able to move 32b values when SEW=16. In addition, I believe that loads/stores are so important, so pivotal (e.g., matrix transforms), that flexibility and efficiency are both mandated. The compressed load/store format seeks to address efficiency; the encoding needs the flexibility of SEW * factors. I propose the factors be dependent upon the current SEW value, as sketched below:

- For SEW=8, the encoding yields factors of 1, 2, 4, and 8.
- For SEW=16, the encoding yields factors of 1/2, 1, 2, and 4.
- For SEW of 32 and above, the encoding yields factors of 1/4, 1/2, 1, and 2.

Thus we always support LEW = 2*SEW operations, and support load/store of SEW/2 and SEW/4 when they exist.
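Writing the factor tables out, with a hypothetical factor-suffixed mnemonic for the example:

```
# proposed SEW-dependent factor tables (sketch):
#   SEW=8   -> factors {1, 2, 4, 8}      i.e. EEW in {8, 16, 32, 64}
#   SEW=16  -> factors {1/2, 1, 2, 4}    i.e. EEW in {8, 16, 32, 64}
#   SEW>=32 -> factors {1/4, 1/2, 1, 2}  i.e. EEW from SEW/4 up to 2*SEW
vsetvli  t0, a0, e16, m2
vle.f2.v v4, (a2)          # hypothetical mnemonic: EEW = SEW*2 = 32
```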
This proposal is a modification of an earlier idea to add effective
element width to load/store instructions to mitigate dropping
fixed-width load/stores and to provide greater efficiency for
mixed-width floating-point codes.
This proposal redefines SEW to be the largest element width (LEW?),
and correspondingly the definition of widening/narrowing operations:
Previously a double-widening add was defined as:
2*SEW = SEW + SEW
the new proposal is to specify
SEW = SEW/2 + SEW/2
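To make the redefinition concrete, here is a sketch using the existing vwadd.vv (register choices illustrative); the operation itself is unchanged, only the vtype that describes it differs:

```
# current definition: SEW names the source width
vsetvli  t0, a0, e16, m2   # SEW=16, LMUL=2
vwadd.vv v8, v0, v4        # 32b = 16b + 16b; destination EMUL = 4

# proposed definition: SEW names the largest (destination) width
vsetvli  t0, a0, e32, m4   # SEW=32, LMUL=4, same physical operation
vwadd.vv v8, v0, v4        # 32b = 16b + 16b; source EMUL = 2
```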
This proposal does not change the behavior/implementation of existing
instructions, except to change how the effective EW and effective LMUL
are obtained from vtype.
For load/store instructions, we could modify these to have relative
element widths that are fractions of SEW {SEW, SEW/2, SEW/4, SEW/8}.
Assembly syntax could be something like vle, vlef2, vlef4, vlef8.
There is a challenge in readability in that these are relative to the
last vtype setting, and also, when SEW is less than 64, some become useless.
I think for this reason we should stick with fixed vle8, vle16, vle32,
vle, in the base encoding {8,16,32,SEW}. These are more readable and can
be used to interleave loads/stores of larger-than-SEW values without
changing vtype (e.g., moving 32b values when SEW=16). Extending past
the base, we could add relative-EEW loads/stores using an unused mop
bit, for example.
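For example, a sketch of the interleaving point (addresses and registers assumed, with effective LMUL scaled by EEW/SEW as this proposal describes):

```
vsetvli t0, a0, e16, m2    # SEW=16 for the main computation
vle16.v v0, (a1)           # EEW = 16 (same as SEW)
vle32.v v4, (a2)           # EEW = 32: move 32b data, EMUL = 4, no vtype change
vse32.v v4, (a3)
vadd.vv v8, v0, v0         # 16b compute continues under the same vtype
```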
Mask registers have an EEW of SEW/LMUL, and so defining the relative
sizes as fractions of SEW also makes it more likely that code can
save/restore a mask register without changing vtype. The fixed sizes
will cover some of these cases, though one might have to consider
saving/restoring more than needed if SEW/LMUL < 8.
There may actually be a minor hardware-checking advantage to knowing
that SEW holds the largest possible element width, since subsequent
instructions cannot request an effective SEW larger than
this.
this. Even with fixed-size load/stores, if the sizes are 8,16,32,SEW
then all of these will also be legal on standard implementations.
Some of the examples we've been discussing:
The "worst-case" example from TG slide:
This proposal, assuming quad-widening and fractional LMUL
Obviously, adding a quad-widening left shift would remove any
difference between two.
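For flavor, a sketch of the quad-widening-plus-fractional-LMUL shape; vqsll.vi is a hypothetical quad-widening left shift, not an existing instruction:

```
vsetvli  t0, a0, e32, m1   # SEW=32, the largest width in play
vle8.v   v0, (a1)          # EEW=8, so EMUL = 1/4 (fractional)
vqsll.vi v8, v0, 8         # hypothetical: widen 8b -> 32b and shift left by 8
```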
The previous widening compute operation:
The example code from #362:
Original fixed-width code assuming 32b float:
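As a stand-in for the original listing, a minimal sketch of the shape under discussion, using the old draft's widening load; the bias step, registers, and float conversion are assumed:

```
vsetvli t0, a0, e32, m4
vlbu.v  v0, (a1)           # old draft: zero-extend 8b elements to 32b at load
vadd.vx v8, v0, a2         # 32b integer bias (assumed step)
vfcvt.f.xu.v v16, v8       # to 32b float
```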
converts to:
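A corresponding sketch under this proposal; vzext.vf4 is a stand-in name for the unary zero-extend discussed above:

```
vsetvli   t0, a0, e32, m4
vle8.v    v0, (a1)         # EEW=8 load, EMUL = 1
vzext.vf4 v8, v0           # the one extra instruction: widen 8b -> 32b
vadd.vx   v16, v8, a2
vfcvt.f.xu.v v24, v16
```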
which is just one more instruction in the inner loop.
With a quad-widening add (vqadd.vx), there would be no penalty in this
case:
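Sketch (vqadd.vx is hypothetical: it would widen the 8b source to 32b and add the scalar in one operation, folding the extend into the add):

```
vsetvli  t0, a0, e32, m4
vle8.v   v0, (a1)          # EEW=8 load
vqadd.vx v8, v0, a2        # hypothetical: 8b widened to 32b plus scalar, one op
vfcvt.f.xu.v v16, v8       # instruction count matches the original loop
```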
The vle8 version also reduces the number of vector registers tied up
buffering load values when the loop is software-pipelined/unrolled,
allowing more loads in flight for a given LMUL.
Basically, in software-pipelined/unrolled loops, widen-at-load
instructions tie up more architectural registers than widen-at-use
(where the widening is part of the consuming instruction), reducing
flexibility in scheduling and/or forcing a smaller LMUL.
Even if an explicit widen instruction is used, at that point the
values are in the registers and the widen will proceed at max rate,
generally reducing architectural register occupancy versus having the
wider values in-flight from memory. For example:
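A sketch of widening at load, using the old draft's vlbu.v (registers illustrative):

```
# widen-at-load: in-flight load data is already 32b, occupying an m4 group
vsetvli t0, a0, e32, m4
vlbu.v  v0, (a1)           # 8b -> 32b at load; buffered values are wide
vadd.vv v8, v0, v16
```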
versus
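a sketch of widening at use, with vzext.vf4 again standing in for the proposed unary extend:

```
# widen-at-use: in-flight load data is 8b, occupying only an EMUL=1 group;
# the extend runs at full rate once the values are in registers
vsetvli   t0, a0, e32, m4
vle8.v    v0, (a1)         # buffered load values occupy 1 register, not 4
vzext.vf4 v8, v0
vadd.vv   v16, v8, v24
```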
As mentioned earlier, the widening loads also might run at a slower
rate unless full register write-port bandwidth is provided.
Another code example, extracted from 3x3 convolutions computing 32b += 16b * 8b:
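As a stand-in for the extracted listing, a sketch using existing operations and old-draft mnemonics; the 8b operand must first be widened to 16b before a double-widening macc can consume it:

```
vsetvli   t0, a0, e16, m2
vlb.v     v0, (a1)         # old draft: sign-extend 8b coefficients to 16b
vle.v     v4, (a2)         # 16b samples (old-draft vle.v, EEW = SEW)
vwmacc.vv v8, v0, v4       # 32b accumulator += 16b * 16b
```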
New quad instructions that did 32b += 16b * 8b would obviously help here (e.g., vqmacc.wv), and are maybe a good idea in an extension.
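A sketch of the hypothetical vqmacc.wv under the new SEW definition:

```
vsetvli   t0, a0, e32, m4  # SEW=32, the accumulator width
vle8.v    v0, (a1)         # 8b coefficients, EMUL = 1
vle16.v   v4, (a2)         # 16b samples,     EMUL = 2
vqmacc.wv v8, v0, v4       # hypothetical: 32b += 16b * 8b in a single op
```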