This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

additional instructions to set vtype fields. #423

Open
David-Horner opened this issue Apr 19, 2020 · 5 comments
Labels
Resolve after v1.0 Does not need to be resolved for v1.0 draft

Comments

@David-Horner
Contributor

David-Horner commented Apr 19, 2020

The minutes of the TG meeting suggested that if the encoding runs out in vsetvli, we can introduce another instruction to set other bits,

noting that later extensions could redefine vtype CSR written by
vsetvl as a "window" into a larger group of vector configuration CSRs.

That allows avoiding the register based vsetvl instruction in common cases.

However, within the encoding space used for vsetvli and vsetvl, there is no room for a further vsetvl2i with another 11 bit immediate.

Quoting further from meeting notes:

It was noted that there is no space to add more configuration
instructions in existing footprint,

We don't need another instruction to calculate vl. The currently proposed additional fields, TAMA (Tail and Mask fill directives) and EDIV (Element DIVision extension), do not modify vl because the SEW/LMUL ratio is maintained. Similarly, a SEW scaling factor that also adjusts LMUL to maintain the SEW/LMUL ratio does not change vl.
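As a rough check of the claim above, the sketch below computes VLMAX = VLEN × LMUL / SEW for several settings that share one SEW/LMUL ratio. The function name `vlmax` and the VLEN value of 128 are illustrative assumptions and not part of the proposal.

```python
from fractions import Fraction

def vlmax(vlen_bits, sew_bits, lmul):
    """VLMAX = VLEN * LMUL / SEW: the maximum element count of a register group."""
    return int(Fraction(vlen_bits) * Fraction(lmul) / sew_bits)

VLEN = 128  # assumed vector register length in bits, for illustration only

# (SEW, LMUL) pairs that all keep SEW/LMUL = 8.
same_ratio = [(8, 1), (16, 2), (32, 4), (64, 8)]
results = [vlmax(VLEN, sew, lmul) for sew, lmul in same_ratio]

# Every entry is identical, so vl = min(AVL, VLMAX) is unaffected by the change.
print(results)
```

Because VLMAX depends only on the ratio, any instruction that rescales SEW and LMUL together can leave vl untouched.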

A vmodtype instruction can be encoded in the remaining opcode space that uses rs1, rs2 and rd as 15 additional bits for setting additional vtype fields.

Further the opcode space allows for 64 such instructions and variations of them. Obviously we don't need all 64 of them and we will want to reserve the opcode space for future needs. However, I have a proposal for two distinct instruction types.

The first is as outlined above, the vmodtype instruction, further defined here:

  1. The 15 bits encoded in rs1, rs2 and rd are subdivided for specific purposes to modify the vtype register, that is, its controlling fields.
    rd could be reserved for now to simplify decoding, noting that rd=0 already has special meaning.
    rs1 could also be initially reserved as it only maps to immediate fields in the U-type format.
    rs2 already maps into the immediate fields for I-type, so it is the obvious choice for initial use.

  2. Any modification must retain the SEW/LMUL ratio or fail, e.g. if the resultant LMUL>8.

  3. Any modifications that change SEW, and hence LMUL, will store the resultant SEW/LMUL bits in vtype (a persistent change).

  4. Other bits that change other existing or new vtype fields will store the modification of the affected field bits in vtype.

    It can be expected that the modification will simply overwrite the affected fields with the corresponding bits from the register fields.
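A minimal sketch of how the 15 bits and the ratio rule could fit together. The bit positions of rd, rs1 and rs2 are the standard RISC-V ones. The ordering of the packed 15-bit value, the function names, and the idea of requesting a target SEW are assumptions made for illustration only, since the proposal does not fix an encoding.

```python
from fractions import Fraction

# Standard RISC-V positions of the register specifier fields in a 32-bit word.
RD_SHIFT, RS1_SHIFT, RS2_SHIFT = 7, 15, 20
FIELD_MASK = 0x1F

def register_fields(insn):
    """Return the (rd, rs1, rs2) specifier fields of an instruction word."""
    return ((insn >> RD_SHIFT) & FIELD_MASK,
            (insn >> RS1_SHIFT) & FIELD_MASK,
            (insn >> RS2_SHIFT) & FIELD_MASK)

def pack_15_bits(rd, rs1, rs2):
    """Combine the three 5-bit fields (assumed order: rs2 low, rs1, rd high)."""
    return rs2 | (rs1 << 5) | (rd << 10)

def rescale_sew(sew, lmul, new_sew, max_lmul=8):
    """Change SEW while keeping SEW/LMUL constant; fail if LMUL would exceed 8."""
    new_lmul = Fraction(lmul) * new_sew / sew
    if new_lmul > max_lmul:
        raise ValueError("resultant LMUL > 8: modification must fail")
    return new_sew, new_lmul

# Hypothetical instruction word with rd=1, rs1=3, rs2=5 (opcode bits are arbitrary).
insn = (5 << RS2_SHIFT) | (3 << RS1_SHIFT) | (1 << RD_SHIFT) | 0x57
fields = register_fields(insn)
packed = pack_15_bits(*fields)
```

For example, `rescale_sew(32, 4, 64)` yields SEW=64 with LMUL=8, while `rescale_sew(32, 8, 64)` raises because the ratio would require LMUL=16.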

The second is a "transient" setting, vmodinstr (maybe OK name?) that could provide up to 15 prefix bits for the next executed vector instruction.

  1. The potentially 15 bits are also sourced from rs1, rs2 and rd in the instruction.
    As suggested above, rd could be excluded for simpler decoding and limited need.
    For RVV32 there is a further reason to limit to 10 bits:
    the proposed use will compete with persistent vtype bits "modded" by the above vmodtype instruction.
    Of course, "10 bits should be enough for anyone": to misquote a famous misquote ;)
    So let us, reasonably, assume 10 bits in vtype will suffice for this purpose for now.

  2. All 10 bits are set in a corresponding 10 bit field in vtype (suggested bits [30:21] for RVV32 and bits [62:52] for RVV64)

  3. All 10 bits are interpreted in conjunction with the next executed vector instruction, effectively increasing its instruction length by 10 bits.

  4. At the completion of that vector instruction the 10 bits are cleared.
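The transient behaviour in the four points above can be sketched as a small state model. It uses the suggested RVV32 placement, vtype bits [30:21]. The class and method names are hypothetical, and the model ignores exceptions and interrupts entirely.

```python
PREFIX_SHIFT = 21                    # suggested RVV32 placement: vtype[30:21]
PREFIX_WIDTH = 10
PREFIX_MASK = ((1 << PREFIX_WIDTH) - 1) << PREFIX_SHIFT

class VtypeModel:
    """Toy model of vtype holding a 10-bit transient prefix field."""

    def __init__(self, vtype=0):
        self.vtype = vtype

    def vmodinstr(self, prefix):
        """Write the 10 prefix bits into the transient field of vtype."""
        if not 0 <= prefix < (1 << PREFIX_WIDTH):
            raise ValueError("prefix must fit in 10 bits")
        self.vtype = (self.vtype & ~PREFIX_MASK) | (prefix << PREFIX_SHIFT)

    def execute_vector(self):
        """Return the prefix seen by this vector instruction, then clear it."""
        prefix = (self.vtype & PREFIX_MASK) >> PREFIX_SHIFT
        self.vtype &= ~PREFIX_MASK
        return prefix

model = VtypeModel(vtype=0b1000)     # arbitrary persistent vtype bits
model.vmodinstr(0x155)
first = model.execute_vector()       # this instruction sees the prefix
second = model.execute_vector()      # the next one does not
```

In this model only the vector instruction that directly follows `vmodinstr` observes the prefix; the persistent vtype bits are unchanged afterwards.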

NOTE: the following point is NOT essential to the proposal.
It is solely an attempt to suggest a micro-architectural approach to reduce the potential overhead of the transient form:

  5. To aid the expectation that the store of this prefix into vtype can be made virtual and be virtually supported:

     a) vstart is not changed by the vmodinstr instruction.
    
     b) if vmodinstr sources unexpected bits, the instruction raises an illegal instruction exception (no surprise here, but it does support virtualization)
    
     c) if vmodinstr is not immediately followed by an appropriate vector instruction
    
         i) an exception occurs on the vmodinstr and
    
         ii) the prefix bits in vtype are cleared.
    
         (this supports a virtual write of the bits to vtype and clearing of them)
    
     d) if vector instruction immediately following vmodinstr does not support its 10 bits of prefix,
    
             i) the vector instruction raises an illegal instruction exception and
    
             ii) the prefix bits in vtype are cleared.
    
     e)  if an interrupt occurs during the execution of the prefixed vector instruction, it may
    
             i) update the epc to point to the preceding vmodinstr and also
    
             ii) set vstart appropriately for resumption of the prefixed instruction
    

An implementation is free to physically implement the 10 transient bits.

In any case, the effect is: vmodinstr will update the transient bits and the following executed prefixed vector instruction clears them.

@kasanovic
Collaborator

The non-transient instructions don't provide an advantage in the current design, but provide some path to adding different bits in a future design.

OTOH, the transient form could provide an advantage, but this type of instruction prefix is quite fragile. I believe ARM SVE had to jump through similar hoops to get non-destructive operations. Note that there is no code size saving versus 64-bit instructions, and if those are defined well, they could be easily processed 32 bits at a time in the smallest implementations.

@kasanovic
Collaborator

kasanovic commented Apr 26, 2020

To avoid the exception/interrupt problems with the transient instruction, it would have to set some architectural state in vtype that is reset by a committing vector instruction. E.g., can provide an "override" sew/lmul in upper bits of vtype that supplants current sew/lmul setting but that is cleared by every vector instruction on commit. This would be very similar to how vstart operates.

@kasanovic
Collaborator

OK - on second reading, I think that's a limited form of what @David-Horner actually proposed for transient, but I couldn't understand why there are exception/interrupt issues, or some of the other concerns.

@David-Horner
Contributor Author

In point 5, I was suggesting microarchitecture optimization approaches. The proposal stands on its own without it.
The exception/interrupt points were implementation-specific optimization considerations that would allow for not physically storing the 10 bits in vtype. I should leave such musings to micro-architects. However, proposals are more readily accepted if a perceived or potential technical challenge has an obvious solution.
I was attempting to do due diligence.
Obviously not well at all.

@kasanovic
Collaborator

It would be useful to boil each issue down to a specific proposal. I think this one has been refined to a proposal to add an instruction that would set some CSR bits (could be in the upper bits of vtype) to provide a new SEW/LMUL setting that is cleared on graduation of any vector instruction (similar to vstart).

The benefit is for code where you want to change SEW/LMUL for one instruction, but keep a constant SEW/LMUL setting for most code.

 vsetvli t0, e32,m4
 vwadd.vv v16, v8, v12 # SEW=32/LMUL=4
 csrwi tmpsewlmul, e64,m8 # Set temporary SEW=64/LMUL=8
 vadd.vv v24, v24, v16 # Operates at SEW=64/LMUL=8
 vsll.vi v4, v8, 2 # SEW=32/LMUL=4

The override removes one vsetvli that would otherwise be needed between the vadd.vv and the vsll.vi. It does not provide a code size benefit over a future ILEN=64 encoding, but could provide some of the ILEN=64 performance benefits in an ILEN=32 system.
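The sequence above can be traced with a small model of the override. This is only a sketch: the class and method names are hypothetical, `set_temporary` stands in for the `csrwi tmpsewlmul` write shown in the example, and whether vsetvli itself should clear a pending override is left open here.

```python
class SewLmulState:
    """Toy model: a persistent SEW/LMUL plus a one-shot override."""

    def __init__(self):
        self.base = None        # set by vsetvli, persists
        self.override = None    # temporary setting, cleared on vector commit

    def vsetvli(self, sew, lmul):
        self.base = (sew, lmul)

    def set_temporary(self, sew, lmul):
        self.override = (sew, lmul)

    def commit_vector(self):
        """Return the SEW/LMUL a vector instruction runs at, clearing the override."""
        effective = self.override if self.override is not None else self.base
        self.override = None
        return effective

state = SewLmulState()
state.vsetvli(32, 4)
trace = [state.commit_vector()]      # vwadd.vv
state.set_temporary(64, 8)
trace.append(state.commit_vector())  # vadd.vv runs under the override
trace.append(state.commit_vector())  # vsll.vi is back at the base setting
```

The trace shows the third instruction returning to SEW=32/LMUL=4 without a second vsetvli, which is the saving described above.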

The downsides of this proposal are that it lengthens the critical path to determine SEW/LMUL and adds implementation/verification complexity. The benefits are also diminished by the presence of the new loads/stores with explicit SEW settings.

I'm not in favor of adding this to the base for 1.0, as it appears to be a small benefit with some implementation/verification cost, and would be moot for an ILEN=64 extension. It could possibly be added as an extension for ILEN=32 implementations.

@kasanovic kasanovic added the Resolve after v1.0 Does not need to be resolved for v1.0 draft label Jun 28, 2020

2 participants