additional instructions to set vtype fields. #423
Comments
The non-transient instructions don't provide an advantage in the current design, but they provide some path to adding different bits in a future design. OTOH, the transient form could provide an advantage, but this type of instruction prefix is quite fragile. I believe ARM SVE had to go through similar hoops to get non-destructive operations. Note that there is no code size saving versus 64-bit instructions, and if those are defined well, they could easily be processed 32 bits at a time in the smallest implementations.
To avoid the exception/interrupt problems with the transient instruction, it would have to set some architectural state in vtype that is reset by a committing vector instruction. E.g., it can provide an "override" sew/lmul in the upper bits of vtype that supplants the current sew/lmul setting but is cleared by every vector instruction on commit. This would be very similar to how vstart operates.
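To make the override idea concrete, here is a minimal Python sketch of the behaviour described above. It is only a toy model under stated assumptions: the class `VtypeModel` and its method names are hypothetical and do not come from the spec or from any simulator.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class VtypeModel:
    """Toy model of the override idea; all names here are illustrative."""
    sew: int    # persistent element width in bits
    lmul: int   # persistent register group multiplier
    override: Optional[Tuple[int, int]] = None  # hypothetical one-shot (sew, lmul)

    def set_override(self, sew: int, lmul: int) -> None:
        # The override is architectural state, so it would be saved and
        # restored with vtype if a trap is taken before the next vector op.
        self.override = (sew, lmul)

    def effective(self) -> Tuple[int, int]:
        # The override, when present, supplants the persistent setting.
        if self.override is not None:
            return self.override
        return (self.sew, self.lmul)

    def commit_vector_instruction(self) -> Tuple[int, int]:
        # Report the setting the instruction ran with, then clear the
        # override on commit (analogous to how vstart is reset).
        used = self.effective()
        self.override = None
        return used
```

In this model a sequence "set override, vector op, vector op" runs the first vector op with the override and the second with the persistent sew/lmul.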
OK - on second reading, I think that's a limited form of what @David-Horner actually proposed for transient, but I couldn't understand why there are exception/interrupt issues, or some of the other concerns.
In point 5, I was suggesting microarchitecture optimization approaches. The proposal stands on its own without it.
It would be useful to boil each issue down to a specific proposal. I think this one has been refined to a proposal to add an instruction that would set some CSR bits (could be in the upper bits of vtype) to provide a new SEW/LMUL setting that is cleared on graduation of any vector instruction (similar to vstart). The benefit is for code where we want to change SEW/LMUL for one instruction, but keep a constant SEW/LMUL setting for most code.
The override removes one The downsides of this proposal are that it lengthens the critical path to determine SEW/LMUL and adds implementation/verification complexity. The benefits are also diminished by the presence of the new loads/stores with explicit SEW settings. I'm not in favor of adding this to the base for 1.0, as it appears to be a small benefit with some implementation/verification cost, and it would be moot for an ILEN=64 extension. It could possibly be added as an extension for ILEN=32 implementations.
The minutes of the TG meeting suggested that if the encoding runs out in vsetvli, we can introduce another instruction to set other bits.
That allows avoiding the register based vsetvl instruction in common cases.
However, within the encoding space used for vsetvli and vsetvl, there is no room for a further vsetvl2i with another 11-bit immediate.
Quoting further from meeting notes:
We don't need another instruction to calculate vl. The currently proposed additional fields, TAMA (Tail and Mask fill directives) and EDIV (Element DIVision extension), do not modify vl as the SEW/LMUL ratio is maintained. Similarly, a SEW scaling factor that also adjusts LMUL to maintain the SEW/LMUL ratio does not change vl.
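The claim that vl is unaffected follows from the spec's definition VLMAX = LMUL * VLEN / SEW: scaling SEW and LMUL by the same factor leaves VLMAX unchanged. A small worked check, assuming VLEN = 128 purely for illustration (the helper `vlmax` is not part of any tool):

```python
from fractions import Fraction


def vlmax(vlen: int, sew: int, lmul) -> int:
    """VLMAX = LMUL * VLEN / SEW; Fraction allows fractional LMUL values."""
    return int(Fraction(vlen) * Fraction(lmul) / sew)


VLEN = 128  # assumed vector register length in bits, for illustration only

# Same SEW/LMUL ratio (32): VLMAX is identical in all three cases.
same_ratio = [
    vlmax(VLEN, 32, 1),
    vlmax(VLEN, 64, 2),
    vlmax(VLEN, 16, Fraction(1, 2)),
]

# Changing SEW alone changes the ratio, and with it VLMAX.
changed_ratio = vlmax(VLEN, 64, 1)
```

Because a requested AVL is clamped against VLMAX, an unchanged VLMAX means the existing vl remains valid after such a modification.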
A vmodtype instruction can be encoded in the remaining opcode space that uses rs1, rs2 and rd as 15 additional bits for setting additional vtype fields.
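As a sketch of where those 15 bits would come from: the rd, rs1 and rs2 register specifiers sit at bits [11:7], [19:15] and [24:20] of a standard 32-bit RISC-V instruction. The helper below pulls them out of an instruction word. The function name and the order in which the three fields are concatenated into one payload are assumptions for illustration; the proposal does not fix either.

```python
def vmod_fields(insn: int) -> dict:
    """Extract the three 5-bit register specifier fields from a 32-bit word."""
    rd = (insn >> 7) & 0x1F    # bits [11:7]
    rs1 = (insn >> 15) & 0x1F  # bits [19:15]
    rs2 = (insn >> 20) & 0x1F  # bits [24:20]
    # Assumed packing: rs2 in the high bits, then rs1, then rd.
    payload = (rs2 << 10) | (rs1 << 5) | rd
    return {"rd": rd, "rs1": rs1, "rs2": rs2, "payload15": payload}


# Example word with rs2=0b10101, rs1=0b00011, rd=0b11111 and the OP-V
# major opcode (0x57) in the low seven bits.
example = (0b10101 << 20) | (0b00011 << 15) | (0b11111 << 7) | 0x57
fields = vmod_fields(example)
```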
Further the opcode space allows for 64 such instructions and variations of them. Obviously we don't need all 64 of them and we will want to reserve the opcode space for future needs. However, I have a proposal for two distinct instruction types.
The first is as outlined above, the vmodtype instruction, further defined here:
the 15 bits encoded in rs1, rs2 and rd are subdivided for specific purposes to modify the vtype register, that is its controlling fields.
rd could be reserved for now, to simplify decoding. Noting rd=0 already has special meaning.
rs1 could also be initially reserved as it only maps to immediate fields in the U-type format.
rs2 already maps into the immediate fields for I-type, so it is the obvious choice for initial use.
any modification must retain the SEW/LMUL ratio or fail, e.g. if the resultant LMUL>8.
any modifications that change SEW and hence LMUL will store the resultant sew/lmul bits in vtype (a persistent change)
other bits that change other existing or new vtype fields will store the modification of the affected field's bits in vtype.
It can be expected that the modification will simply overwrite the affected fields with the corresponding bits from the register fields.
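The ratio-preserving rule in the points above can be sketched as follows. This is a hedged illustration rather than a definition of vmodtype: the function name `rescale_sew` is made up, and only the LMUL>8 failure case mentioned above is checked.

```python
from fractions import Fraction


def rescale_sew(sew: int, lmul, new_sew: int):
    """Change SEW and adjust LMUL so that SEW/LMUL stays the same.

    Raises ValueError when the resulting LMUL would exceed 8, which is
    the failure case named in the proposal. Other legality checks
    (supported SEW values, minimum LMUL) are deliberately omitted.
    """
    new_lmul = Fraction(lmul) * new_sew / sew
    if new_lmul > 8:
        raise ValueError(f"resultant LMUL {new_lmul} exceeds 8")
    return new_sew, new_lmul


widened = rescale_sew(32, 2, 64)   # SEW doubles, so LMUL doubles to 4
narrowed = rescale_sew(32, 1, 16)  # SEW halves, so LMUL halves to 1/2
```

A call such as `rescale_sew(32, 8, 64)` raises, since the ratio could only be kept with LMUL=16.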
The second is a "transient" setting, vmodinstr (maybe OK name?) that could provide up to 15 prefix bits for the next executed vector instruction.
The potentially 15 bits are also sourced from rs1, rs2 and rd in the instruction.
As suggested above, rd could be excluded for simpler decoding and limited need.
For RVV32 there is a further reason to limit to 10 bits:
the proposed use will compete with persistent vtype bits "modded" by the above vmodtype instruction.
Of course, "10 bits should be enough for anyone": to misquote a famous misquote ;)
So let us, reasonably, assume 10 bits in vtype will suffice for this purpose for now.
All 10 bits are set in a corresponding 10-bit field in vtype (suggested bits [30:21] for RVV32 and bits [62:53] for RVV64)
All 10 bits are interpreted in conjunction with the next executed vector instruction, effectively increasing its instruction length by 10 bits.
At the completion of that vector instruction the 10 bits are cleared.
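For the RVV32 placement suggested above (bits [30:21]), the set and clear steps amount to simple mask operations on the vtype value. The helper names below are hypothetical, and the sketch assumes the field sits directly below bit 31 without touching it or the low bits.

```python
TRANSIENT_SHIFT = 21                      # low bit of the suggested [30:21] field
TRANSIENT_MASK = 0x3FF << TRANSIENT_SHIFT  # ten contiguous bits


def set_transient(vtype: int, bits10: int) -> int:
    """Model of vmodinstr: write the 10 prefix bits into vtype[30:21]."""
    if not 0 <= bits10 < (1 << 10):
        raise ValueError("transient field is 10 bits wide")
    return (vtype & ~TRANSIENT_MASK) | (bits10 << TRANSIENT_SHIFT)


def get_transient(vtype: int) -> int:
    """Read the prefix bits that the next vector instruction would see."""
    return (vtype & TRANSIENT_MASK) >> TRANSIENT_SHIFT


def clear_transient(vtype: int) -> int:
    """Model of vector instruction completion: the prefix bits are cleared."""
    return vtype & ~TRANSIENT_MASK
```

The persistent low-order fields pass through all three helpers unchanged, so only the 10-bit prefix differs before and after the prefixed instruction.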
NOTE: the following point is NOT essential to the proposal.
It is solely an attempt to suggest a micro-architectural approach to reduce the potential overhead of the transient form:
An implementation is free to physically implement the 10 transient bits.
In any case, the effect is: vmodinstr will update the transient bits and the following executed prefixed vector instruction clears them.