Allow zeroing tail as an implementation option #367
Comments
This seems quite clever, and I do like it. I do have some further
thoughts which I'll add below... some of them involve replacing or
changing this proposal in more radical ways.
(1) Consider adding "tail undefined" as one of the mode options. This
is easier to implement than tail zeroing, but gives the same
performance benefits. (I know of the information/security risks, but
in some applications, particularly lightweight embedded use, that is
not an issue.) Within the "tail undefined" mode, there can be a
recommendation that "implementations which care about information
leakage should write zeros to the tail under this mode" (note: I would
still consider keeping the "tail zeroed" mode, though if you read
below I can't see a direct programmer use for it so it could be
considered for removal)
(2) This proposal assumes there is only one way to preserve tail
elements in OoO, where reading the tail of the destination is
necessary so it can be rewritten to the physical register being
renamed to vd. This extra readout imposes overhead, which is undesired
because it impacts performance -- in fact, that's the entire
motivation for this proposal. However, there are alternative ways to
implement this -- I have one alternative in mind that has a lower
hardware cost to implement than tail zeroing, still allows OoO and
retains its performance benefits, and yet leaves the original tail
elements unmodified. That is, I don't think this proposal is
absolutely necessary to achieve performance, unless you specifically
want the feature of tail zeroing and you are fixated on that being the
only way to achieve performance.
(3) Since this proposal introduces additional state, and requires
portable software to provide multiple implementations depending upon
the underlying microarchitecture, I believe it is prudent to look at
other ways of doing this.
(4) The programmer shouldn't have to write two sequences, one for
performance W on architecture family X and another for performance Y
on architecture family Z, yet that's what will happen if this is
adopted.
(5) The proposal is designed and written from the perspective of
computer architects, who are a minority of users/readers of the spec.
I believe it should be designed and written for programmers, who will
be using the spec constantly to produce new programs. The ISA is a
contract to those users, not to the microarchitectural designers.
From this new point of view, the programmer is thinking "which mode is
the most natural for the code sequence that I am going to write?". Eg,
the programmer wants to say "the code below depends upon the tail
being zeroed", or "the code below depends on the tail remaining
undisturbed" or "the code below does not depend upon the tail". The
expected use:
(a) Most code sequences will use the last option, where it does not
depend upon the tail so it doesn't care which option.
(b) In only some (presumably short) code sequences will it want the
tail to remain undisturbed; in these cases, the code is shorter and
likely faster (less data copying, less register spilling, etc) if the
tail is left undisturbed.
(c) Although I can't think of any cases where the programmer wants to
say "the code below depends on the tail being zeroed", I suppose it
might be possible.
Why does this perspective matter? Because I don't think the ISA should
expose too much about "runtime modes" for performance. These quickly
become stale when there are new ways to do things. Once the
programmer's intent is known, the underlying microarchitecture can do
what it wants to implement that intent with maximum performance.
I don't think it's a clean design to ask the programmer to change the
code sequence depending upon the underlying microarchitecture (though
that will happen regardless, we should try to limit how often such
divergences may occur by more carefully designing the spec).
I have been thinking of another way to achieve this without having
modes. For example, do we need to define the behaviour of certain
instructions differently (eg, most instructions zero the tail, but
these small number of instructions -- such as vslide -- preserve it?).
Or, do we need to add an instruction or two, perhaps a "tail copy"
instruction that gets macro-op fused with its successor?
Guy
…On Wed, Feb 5, 2020 at 3:51 AM Roger Ferrer Ibáñez ***@***.***> wrote:
We (at the BSC) are aware of the trade-offs described for implementations when it comes to choose undisturbed tail or zeroing tail. However we believe the option of implementing zeroing of the tail in the V-extension has to exist. In particular for implementations tailored for the High-Performance Computing (HPC) market, undisturbed tail poses a problem for implementations using renaming "as additional cycles are required to read out old tail elements to copy to the tail of the new destination physical register"[1].
However, we are also sensitive to the fact that zeroing complicates smaller implementations of the V-extension not directed at the HPC market and as such it is regarded as undesirable.
Thus we suggest, that the possibility of zeroing the tail exists in a way that adds little burden if there is no intent to support zeroing.
This proposal adds the following architectural changes:
Add a new 2-bit field in bits 9:8 of the vtype CSR called vtail with the following meaning
00 preferred behaviour of the implementation (either undisturbed or zeroing)
01 undisturbed tail
10 zeroing tail
11 (reserved)
Add a new (unprivileged) RO CSR called vtaildefault. Bits 1:0 of such CSR state what is the preferred behaviour of the implementation and its values can only be 01 or 10.
An implementation of V-ext, under this proposal, must always implement undisturbed tail, so the minimal implementation of this proposal simply hardcodes vtaildefault to 01. The reset state of an implementation always sets vtail to 00.
Changing the tail behaviour can be done using vsetvli:
If no tail behaviour is specified the preferred behaviour is used (i.e. vtail is left as 00)
vsetvli x1, x2, e64
vsetvli x1, x2, e64,m1
If the operand u appears after the length multiplier, undisturbed tail is chosen and vtail is set to 01.
vsetvli x1, x2, e64,m1,u
If the operand z appears after the length multiplier, zeroing tail is chosen and vtail is set to 10
vsetvli x1, x2, e64,m1,z
If the implementation does not support zeroing, the vill bit of vtype is set.
Execution of an instruction then honours the tail behaviour in vtail:
The tail elements during a vector instruction’s execution are the elements past the current vector length setting.
When vtail = 01 the tail elements do not raise exceptions, and do not update any destination vector register group.
When vtail = 10 the tail elements do not raise exceptions, but do zero the results in any destination vector register group.
When vtail = 00 the implementation behaves either as vtail = 01 or vtail = 10.
Note: under this proposal, when vtail = 00 software may not rely on the actual contents of the tail of the destination vector register group unless it knows, beforehand, the preferred behaviour of the implementation, for instance having read vtaildefault.
[1] https://github.com/riscv/riscv-v-spec/blob/master/v-undisturbed-versus-zeroing.adoc
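For illustration, the tail semantics proposed above can be sketched as a small Python model. This is only a sketch under stated assumptions: registers are modelled as plain lists of VLMAX elements, and the names `VTail` and `apply_tail` are hypothetical, not part of the proposal or of any spec.

```python
from enum import IntEnum


class VTail(IntEnum):
    """Encodings of the proposed vtail field (11 is reserved)."""
    PREFERRED = 0b00    # use whatever vtaildefault reports
    UNDISTURBED = 0b01  # tail elements are left unchanged
    ZEROING = 0b10      # tail elements are written with zeros


def apply_tail(vtail, vtaildefault, old_dest, result, vl):
    """Return the destination contents after one vector instruction.

    old_dest: previous destination contents (list of VLMAX elements)
    result:   values computed for the body elements (at least vl of them)
    vl:       current vector length; elements past it form the tail
    """
    if vtaildefault not in (VTail.UNDISTURBED, VTail.ZEROING):
        raise ValueError("vtaildefault may only hold 01 or 10")
    mode = VTail(vtail)  # raises ValueError for the reserved value 11
    if mode == VTail.PREFERRED:
        mode = VTail(vtaildefault)
    body = list(result[:vl])
    if mode == VTail.ZEROING:
        tail = [0] * (len(old_dest) - vl)
    else:
        tail = list(old_dest[vl:])
    return body + tail
```

With `vtail = 00` the outcome depends on `vtaildefault`, which is exactly why the note says software may not rely on the tail contents in that mode without reading `vtaildefault` first.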
|
*I am very wary of complicating the software visible tail behaviour.*
Specifically, I am opposed to a change in default behaviour to support a
facility that is admittedly intended for a sub-market (HPC).
*First the good characteristics.*
1) Program control of tail behaviour on a (potentially) vector operation
(instruction) basis.
a) Facility can enable explicit tail behaviour (unchanged or zeroed).
b) Facility uses an existing instruction needed by the code and
no additional control is needed.
c) Facility uses existing vtype register for state information
(tail zero or unchanged) and no other state is needed.
d) if only (a), the explicit tail behaviour, is utilized, (b) and
(c) are sufficient.
There is no need for the |vtaildefault| CSR, neither to
set it nor interrogate it.
*And the bad.*
1) As proposed this could not reasonably be added post ratification.
It is too invasive, demanding EE support.
The advisement highlights the issues that pre-extension code (
in which vtail=00 is always true) must be written agnostic to tail
behaviour. This would most certainly not be the case as Section 5 of the
git document highlights potential software benefits of tail undisturbed
Note: under this proposal, *when |vtail = 00| software may not
rely on the actual contents of the tail of the destination vector
register group* unless it knows, beforehand, the preferred
behaviour of the implementation, for instance having read
|vtaildefault|.
2) The advisement also implicitly acknowledges agnostic to tail
behaviour code, but provides no support to identify or utilize this.
3) It conflates two performance benefits, ooo reg-rename and
software reliance on tail-zeroing.
4) It is unnecessarily complex, among others adding additional
state that must be saved on context switch.
5) it re-opens the tail-zero/unchanged debate/ambiguity by
providing overriding default behaviour.
6) because of the above it complicates the software ecosystem
unnecessarily without providing specific benefits.
This counter proposal addresses these concerns, retaining the good
points and avoiding the bad.
*Counter Proposal:*
Add a new 2-bit field in bits |9:8| of the |vtype| CSR called
|vtail| with the following meaning
* |00| undisturbed tail is required
* |01| code tolerates either undisturbed or zero tail (or any other
behaviour in tail portion)
* |10| tail zeroing is required
* |11| (reserved)
This has same pros as original.
Addresses bad aspects:
2) It appears Guy is thinking along similar lines and this addresses the
agnostic concern Guy raises.
For statically managed vsetvli the linkage editor will change the
agnostic to "requires undisturbed" for processors that only support the
base.
3) It separates the two optimizations in settings 01 and 10.
4) No other CSR are required.
5) Default behaviour is established unambiguously: undisturbed tail
is required.
6) Only the agnostic and the linkage editor change to 00 is necessarily
visible to the software eco system.
1) as a result of addressing these issues, this can be defined as an
extension.
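As a rough illustration of the counter proposal, the encoding and the link-editor rewrite mentioned under point 2 could be sketched in Python as follows. The function name `link_time_vtail` and the choice to raise an error when zeroing is required on a base-only target are assumptions made for this sketch, not something the counter proposal specifies.

```python
# Counter-proposal vtail encodings (11 is reserved).
UNDISTURBED = 0b00  # undisturbed tail is required
AGNOSTIC = 0b01     # code tolerates any tail behaviour
ZEROED = 0b10       # tail zeroing is required


def link_time_vtail(requested, supports_extension):
    """Hypothetical link-editor step: choose the vtail value to emit."""
    if requested not in (UNDISTURBED, AGNOSTIC, ZEROED):
        raise ValueError("vtail = 11 is reserved")
    if supports_extension:
        return requested
    if requested == AGNOSTIC:
        # Base-only hardware satisfies "don't care" with undisturbed tail.
        return UNDISTURBED
    if requested == ZEROED:
        # Assumption: a base-only target cannot honour required zeroing.
        raise ValueError("tail zeroing required but target implements only the base")
    return requested
```

Code written as agnostic therefore runs unchanged on base-only processors, while only code that explicitly requires zeroing depends on the extension.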
|
I left some details unspecified. An additional operand is needed for vsetvli to specify the "tail agnostic" setting. I'm OK with vill set if vtail is unsupported. However, I believe we should provide a note that emphasizes that implementations are free to trap on vsetvl{i}
Some open questions and observations: vtailag only implies tail non-undisturbed/unzero tail options are possible. Do we want to explicitly permit? Guy appears to recommend that tailag be the default. I like this approach: it aligns with programmer thinking/awareness. And it can be done in software (as explained in the prior note with linkeditor support). It does make the software model more complex. We provide a facility that in reality does not exist on some (perhaps most) implementations. I see no way to reasonably make tailag the default hardware standard. We anticipated that programmers will want explicit behaviour (tailun or tailzero) as there are software benefits to them. And as we believe software benefits most from tailun, we chose that as the (hardware) default. We can formulate this proposal as an optional extension to the base. |
At risk of stating the obvious, the granularity at which registers are
renamed is an implementation choice, and extra rename state can eliminate
this overhead. For a really big machine with, say, a 1024-bit datapath and
VLEN=4096, renaming at the 1024-bit granularity avoids the need to spend
extra cycles copying old values. The rename table snapshots are not
inconsiderable in this case (~1 Kbit apiece), but compare that to the ~256
Kbit PRF.
Have you got a specific machine configuration in mind where this is truly
problematic?
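The figures above can be checked with a short back-of-the-envelope calculation in Python. The sketch assumes 32 architectural vector registers and that each rename-table entry holds only a tag naming one 1024-bit physical chunk; real designs may carry extra per-entry state.

```python
VLEN = 4096            # bits per architectural vector register (from the comment)
GRANULE = 1024         # rename granularity, equal to the datapath width
PRF_BITS = 256 * 1024  # ~256 Kbit physical register file (from the comment)
ARCH_REGS = 32         # architectural vector registers (assumed)

chunks_per_reg = VLEN // GRANULE                # renamed chunks per register
physical_chunks = PRF_BITS // GRANULE           # chunks available in the PRF
tag_bits = (physical_chunks - 1).bit_length()   # bits needed to name one chunk
rename_table_bits = ARCH_REGS * chunks_per_reg * tag_bits

print(chunks_per_reg, physical_chunks, tag_bits, rename_table_bits)
```

Under these assumptions one rename-table snapshot comes to 32 x 4 x 8 = 1024 bits, which is consistent with the "~1 Kbit apiece" estimate against a ~256 Kbit PRF.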
|
On Wed, Feb 5, 2020 at 7:10 PM David-Horner ***@***.***> wrote:
Guy appears to recommend that tailag be the default. I like this approach
it aligns with programmer thinking/awareness. And it can be done in
software (as explained in the prior note with linkeditor support). It does
make the software model more complex. We provide a facility that in reality
does not exist on some (perhaps most) implementations.
I see no way to reasonably make tailag the default hardware standard. We
anticipated that programmers will want explicit behaviour (tailun or
tailzero) as there are software benefits to them. And as we believe
software most benefits tailun we chose that as the (hardware) default.
I'm not recommending tailag be the default. I support tailun as the
default, for the reasons already given.
However, I'm saying that most of the time the programmer would be fine with
tailag. By adopting tailun, that fits with the "most of the time" use as
well, and offers the performance benefits, which is why it should be
default.
For the implementations that think tailzero is needed, I am pretty sure
there are other implementation workarounds that do not require zeroing but
still allow OoO execution. So, I do not advocate including tailzero as an
option for implementation reasons. I do support it if we can justify it at
the software/programming level as an important use case. I would like to
see this use case.
If such a use case exists, then I suggest we think about a more restrictive
form of tailzero. Eg, instead of writing all tails with zeros, perhaps we
keep tailun behaviour but add a single unary instruction called "vtailzero"
that zeros all elements at positions from vl to VLMAX-1 in the specified
register or register group. This achieves the desired
software functionality, which can be implemented efficiently through fusion
with a preceding or successor instruction, and does not require
implementing any special way of writing 0s for practically all executed
instructions (just those tagged with vtailzero).
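The intended effect of such an instruction can be sketched in a few lines of Python. This is only a model of the semantics described above, with a register represented as a list of VLMAX elements; the function below is a hypothetical stand-in, not a defined instruction.

```python
def vtailzero(reg, vl):
    """Zero elements vl .. VLMAX-1 of reg, leaving the body untouched."""
    vlmax = len(reg)
    if not 0 <= vl <= vlmax:
        raise ValueError("vl must be in the range 0..VLMAX")
    return list(reg[:vl]) + [0] * (vlmax - vl)
```

Because only the registers named by this instruction are affected, every other instruction keeps tailun behaviour and needs no special zero-writing path.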
And, like Andrew has just posted, tailzero option is unnecessary for
performance if you use an adequate renaming strategy. (again, this is why
it shouldn't be in the ISA based solely on implementation performance)
Guy
… |
Thanks everyone for the comments. Much appreciated. I can't answer to all of them (I'm a compiler guy actually, my colleagues will address them) but I'd like to clarify our stance that led to this proposal. @David-Horner (counter-)proposal of having an agnostic tail is really good. Even better than my initial proposal because undisturbed is still a baseline that is always available. Now I realise that the "preferred tail behaviour" is not desirable and being able to state "I don't care what you do in the tail" is much better. We most of the time don't care about what ends in the tail. So this is why it felt unnecessary to us to have to honour the undisturbed tail semantics. I'm aware of the cases where undisturbed is better, and those do justify always having undisturbed available. In short: undisturbed must stay. However, we need a way to realize the agnostic case. Undisturbed is a way. Another one is zeroing. Before posting this proposal we did internally discuss having "undefined" values in the tail. But, as @vbx-glemieux already pointed out, it comes with security risks and, in a post-Spectre world, I would be extremely cautious to open new avenues in which we leak implementation details. Hence zeroing seemed reasonable. In the line of what @David-Horner suggested, we might simplify even further the proposal by making Under that simplified angle, a Kind regards. |
This was discussed in WG meeting and as @aswaterman pointed out is not actually a problem for OOO processor with renaming. Renaming has to be done at least at the granularity of individual vector register, not vector register group, limiting performance impact. Of course, renaming can be done at finer granularity. |
Hi all, this is Oscar from the BSC hardware team. Thank you all for the comments and suggestions. This is helpful, there are many valid and interesting points. We are aware that it could be possible to use the renaming table rather than copying values to implement undisturbed tail. In the example from @aswaterman, that is a perfectly reasonable solution. However, this does not work well for us. In our case, we have a VLEN=256*64=16384, and a data path of 64 * 8 lanes = 512. That means having 32 individual "sub-renamings" per register, so the overheads are much larger than in the example @aswaterman posted, and doing this would complicate the renaming logic of our design a lot. There are multiple ways to implement this, but freeing registers would be complex, and we may need indirection in accessing the physical register. I believe we may lose many advantages of vectors with this approach. I hope this clarifies why we are not comfortable with the current status. We are happy to implement in our design support for undisturbed tail with lower performance, we are simply looking for room in the specification to allow executing instructions at higher performance (when undisturbed tail isn't necessary). |
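The "32 sub-renamings" figure follows directly from the numbers quoted in this comment, as the small Python calculation below shows. The helper name `sub_renamings` and the count of 32 architectural vector registers are assumptions for the sketch.

```python
def sub_renamings(vlen_bits, granule_bits):
    """Number of independently renamed chunks per vector register."""
    if vlen_bits % granule_bits:
        raise ValueError("VLEN must be a multiple of the rename granule")
    return vlen_bits // granule_bits


VLEN = 256 * 64    # 16384 bits per vector register
DATAPATH = 64 * 8  # 8 lanes of 64 bits = 512 bits per beat
ARCH_REGS = 32     # architectural vector registers (assumed)

per_register = sub_renamings(VLEN, DATAPATH)
table_entries = ARCH_REGS * per_register

print(per_register, table_entries)
```

Renaming at datapath granularity would thus need on the order of a thousand rename-table entries, compared with 128 in the VLEN=4096 example.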
The discussions in WG were in the context of the appropriate default tail behaviour. With undisturbed, software will opportunistically use such regions to avoid spills, etc. However, a large body of code can be written that is tail contents agnostic. The vector ISA has catered to simple implementations; with among other things restrictions on overlapping source and destination registers. It is difficult to assess how large a systematic bias is present favouring simple and other "mainline" implementations. Roger's minimal single bit proposal targets this possibility. It could be deferred as it can be implemented as a standalone extension. I suggest we allocate currently unused bits in vtype as reserved for custom use. |
Miscellaneous thoughts: Appropriateness of bit 11 of vtype as custom use - Renaming can help mitigate the tail undisturbed cost but does not eliminate it. tailag as default software stance
If we can succeed in encouraging through the software ecosystem this mindset and code annotation, then all the code can benefit from BSC’s proposed extension. |
My experience is that tail undisturbed is useful behavior in some common idioms, including reductions, while tail zeroing is rarely useful, so I'd agree that if we're going to support options that the default is undisturbed and the option can be "don't care". I don't actually believe there are security implications to "don't care", the context switch code should zero/save/restore the state explicitly regardless. The don't care state will hence come from the same security domain. There are software portability concerns however. Bugs won't be portable between systems with the same VLEN, which will irritate programmers who like to see their bugs' behavior preserved reproducibly, including migrating between big/little cores with the same VLEN but different microarchitectures. The renaming cost is purely a cost/performance tradeoff. There is no need to rename at the granularity of a single beat of vector execution. For the BSC machine, it's OK to only rename at the granularity of every four beats (2048 bits). In rare cases where there's enough vector issue bandwidth and enough vector ILP to otherwise fill functional unit pipelines with different instructions every clock cycle there could be some slowdown from coarser execution granularity, but for example, NVIDIA GPUs always execute four beats atomically (last systems I checked). I don't believe renaming at sub-register granule has any additional complexity over renaming to handle LMUL (not saying it isn't complex, just not additional complexity beyond storage). Finer-grain renaming can also have benefits, with effectively more rename registers at smaller vector lengths. I understand the desire to avoid this hardware cost, but there is a software ecosystem cost here too. |
I believe that even in said granularized renaming, the zeroed blocks could be renamed to some kind of virtual zero register giving a bit less PRF pressure. As long as the agnostic behaviour is contained within an "if unsure, don't use it" configuration, we should be fine with it. I also agree that only 1 bit for proposed |
I wonder if it would help the issue the BSC folks have if we allowed VLEN to be variable instead of fixed. I assume a WARL field in a register that in most designs would be hardwired to one value, but in some designs could hold several values. For codes with shorter vectors, a smaller VLEN could be programmed. This might not be reasonable unless the granularity of short vs. long vector use was fairly large. |
Variable VLEN could easily be added as a feature at supervisor level without affecting unprivileged spec. Having as unprivileged feature is also possible, though I wonder how really useful given that it would be a global setting across whole program. More local VLEN setting should just be done via "vl". |
Supervisor level seems too heavy-weight. The compiler knows when there are no live vector registers and can change VLEN then. Shorter vector registers work with the same binary code but reduce tail copying. Changing vl still leaves the tail copying to be done in hardware. I'm only thinking of this for machines with very long VLEN. Most machines would support only one VLEN value. |
Compiler support for runtime-variable vector sizes is far from trivial, even if you make concessions about where the changes can occur (e.g., only on function entry/exit). It's not enough to know whether there are live vector registers, any live value anywhere (registers or memory) of any type (e.g. scalar integers computed from the vector size) can be a problem. It is far from trivial to adapt compiler IR(s) to keep track of & control these vector-size-dependent values, and it's dubious if it's worth the complexity and engineering effort. I developed a concrete proposal for how to achieve it in LLVM back when RISC-V V practically required it (spec versions before 0.6, IIRC) and it was still a very invasive proposal despite my best efforts to make it as acceptable as possible for the rest of the LLVM community. These problems are also why LLVM explicitly does not support using the equivalent SVE feature in this way. Starting different processes with different vector sizes is fine, of course, but changing the vector size of a running process is not supported. (I don't know how much discussion about this happened in GCC but I am rather sure that the problems described before are just as hard there.) Besides, I am skeptical how much making VLEN variable would help with the problem at hand. While large VLEN is unnecessary for some workloads (in particular, for loops with few iterations), this can't always be predicted at the time the code is written/compiled, and in other cases (e.g. when you need two or three iterations of strip-mining at maximum VLEN) it's not clear how good of a trade-off it is to increase dynamic instruction count just to reduce tail copying. |
I agree with @hanna-kruppe's analysis, which is why I suggested to only really support at privileged levels where it is more of an emulation support mechanism rather than a performance optimization. The regular vl-setting mechanism should handle dynamic run time lengths. |
In that case, I withdraw my "I wonder if..." about varying VLEN. |
I believe the tail-agnostic design cannot actually help a renamed The major optimization that motivates the tail-agnostic option is to avoid having to write the However, we cannot allow implementations to simply not write to the The security challenge is that the privileged layer is not able to clear this hidden microarchitectural state on a context swap in a deterministic non-microarchitecture dependent way. Having some way to explicitly clear the free physical register pool state is a possibility, but is architecturally messy. Note that regular whole vector register save/restore is sufficient to avoid this security hole I think this means we have to require tail-agnostic must be strictly either tail-undisturbed or tail-zeroed. But to make tail-zeroing efficient on long temporal vector registers requires the sub-vector-register renaming support anyway. Zeroing is actually worse than undisturbed in this case as all tail sub-vector-register units have to be renamed to point to single zero register, versus just left alone. Zeroing does also avoid reading the old physical destination register values, but this is only an So, I think tail-agnostic does not actually really save much over tail-undisturbed, even for long temporal vector registers that are renamed, and might be worse if requiring tail-zeroing and so we should only support tail-undisturbed in the standard. Providing a non-standard extension that allows state to leak between contexts would be an option. |
TL;DR
Avoiding the maintenance of sub-vector-register renaming.
Agreed. Also part of a win-win-win solution.
Agreed. Any tail allocated register component must be vetted or avoided.
Agreed. However, the microarchitecture could vet the vector physical register free list on each return from interrupt/exception. There may be many ways to do this that are not currently viable but become trivial with support for domain identification/tracking. An approach: By marking the elements in the list as unclean on a return from more priv mode, and marking as clean either by ensuring the specific register allocation will be fully overwritten by the vector operation (a common situation) or when explicitly "cleansed" (with zero or otherwise; this can occur during interrupt/exception return even before the first vector instruction is scheduled). A preferred method would be one that is inherent in normal operations, with no additional internal state for problem avoidance but primarily optimization use.
Agreed.
By "tail-zeroed" I believe you mean tail-fill with vetted data, i.e. data from architectural registers. For example, the tail fill value could be the last written element value, or derived from one or both of the source registers: basically whatever is convenient/optimal for the micro-architecture. Here's where I disagree: there are more options than those two. One idea is closely related to the variable VLEN suggestion. I propose here a possible micro-architecture implementation: a "valid-length" (a count of valid segments) defines the segments that are fully defined; segments beyond this are the agnostic tail. This is internal state information within the vector processor. It applies especially if such tail elements are not considered elements at all. That is, not only could the value be anything, but it is also acceptable to consider the values “invalid”, i.e. Not a Value (NaV). Access to tail segments could do any or all of the following
I realize NaV can introduce ambiguity and inconsistency, but if it can be tamed it could provide for meaningful optimizations.
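As a rough illustration of the "valid-length" idea, here is a minimal sketch. Everything in it is an assumption made for illustration: the class `SegmentedVReg`, the segment size, and the choice that reads of tail segments return the last written element (one of the vetted fill values mentioned above) rather than whatever the storage happens to hold.

```python
class SegmentedVReg:
    """Hypothetical vector register split into fixed-size segments."""

    def __init__(self, num_segments, seg_elems):
        self.seg_elems = seg_elems
        self.segments = [[0] * seg_elems for _ in range(num_segments)]
        self.valid_length = num_segments  # count of fully defined segments
        self.fill = 0                     # vetted value returned for the tail

    def write(self, values):
        """Write the body; segments past it become the agnostic tail."""
        n = self.seg_elems
        used = -(-len(values) // n)  # ceiling division
        for s in range(used):
            chunk = values[s * n:(s + 1) * n]
            self.segments[s][:len(chunk)] = chunk
        self.valid_length = used
        if values:
            self.fill = values[-1]  # last written element value

    def read(self):
        """Tail segments never expose their stored contents."""
        out = []
        for s, seg in enumerate(self.segments):
            if s < self.valid_length:
                out.extend(seg)
            else:
                out.extend([self.fill] * self.seg_elems)
        return out
```

Note that elements past the body inside the last valid segment keep their old contents in this sketch; only whole segments beyond `valid_length` are treated as tail.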
To me it appears that viable, reasonable, performant and practical implementations of tail-agnostic that do not leak state are possible. As a result I believe we should continue to consider tail-agnostic in the standard. However, |
An effect of tail-undisturbed is that vector operations now logically have an extra operand that represents the values of the tail elements. A compiler will have to assign this extra operand to the destination register of the vector operation (even to an "undefined" value when the code doesn't actually care about the tail). Masked intrinsics in the compiler often have a "merge" (or "dest") operand that states what the values of the inactive elements are. It looks very similar to the tail-undisturbed situation. The way I see it, however, the vector length represents the logical extent of the vector being processed. The mask does not represent that extent but a subset of it. As the inactive elements are still inside that logical extent, it makes sense to give them a value, hence the "merge" operand. There is value in tail-undisturbed for algorithms that accumulate partial results in a register. I'm worried, however, that this is the only case where tail-undisturbed is actually needed and that there are many other instances where the tail behaviour is not relevant. Being able to communicate this fact to the architecture seems beneficial. That said, I understand that mandating the possibility of zeroing can be a burden for smaller implementations. Maybe we can turn this into a, say, At the level of assembly it could look like this
|
A few comments from a HW perspective:
In order to implement tail-zeroing, we believe there are alternatives to sub-vector-register renaming (e.g. internally keeping the vector length for each register, masks, ...) that will work well in architectures with long temporal vectors.
|
Adding tail-zeroing leads to fragmentation and overburdens hardware implementations. @rofirrim I agree that there is value for the compiler to only track tail undisturbed when useful.
I do expect tail undisturbed to be more generally useful than just for algorithms that accumulate partial results in a register. Programmers and compiler writers continue to be extremely creative in using the target ISA's unique characteristics and fringe cases. However, I want to emphasize once again that the options need not be only tail-zeroing and tail-undisturbed. Instead of tail-zero, tail-agnostic does exactly what you want: it allows the compiler to "not care" and not need to track tail contents. So the vsetvli constructs would be:
I propose that the base include the extra bit in vtype and that it be set with ta/tu even if the hardware is always tu. |
One more agnostic approach is to zero the tail in the last sub-vector and leave the rest of the register undisturbed. |
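A minimal sketch of that mixed behaviour, assuming the register is divided into sub-vectors of `sub_len` elements (the function name and parameters are hypothetical):

```python
def write_with_partial_zeroing(old, result, vl, sub_len):
    """Return the new register contents after writing vl result elements.

    Tail elements in the sub-vector that holds the last body element are
    zeroed; all later sub-vectors keep their old contents.
    """
    new = list(old)
    new[:vl] = result[:vl]
    # End of the sub-vector containing element vl - 1 (ceiling to a multiple).
    end_of_last_sub = min(len(old), -(-vl // sub_len) * sub_len)
    for i in range(vl, end_of_last_sub):
        new[i] = 0
    return new
```

For example, with 8 elements, `sub_len = 4` and `vl = 3`, element 3 is zeroed and elements 4 to 7 are left undisturbed. When `vl` is a multiple of `sub_len`, nothing is zeroed.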
I believe it should be considered the same agnostic approach. Addendum:
|
We (at the BSC) are aware of the trade-offs described for implementations when it comes to choosing undisturbed tail or zeroing tail. However, we believe the option of implementing zeroing of the tail in the V-extension has to exist. In particular, for implementations tailored for the High-Performance Computing (HPC) market, undisturbed tail poses a problem for implementations using renaming "as additional cycles are required to read out old tail elements to copy to the tail of the new destination physical register"[1].
However, we are also sensitive to the fact that zeroing complicates smaller implementations of the V-extension not directed at the HPC market and as such it is regarded as undesirable.
Thus we suggest that the possibility of zeroing the tail exist in a way that adds little burden if there is no intent to support zeroing.
This proposal adds the following architectural changes:

- Add a new 2-bit field in bits 9:8 of the vtype CSR, called vtail, with the following meaning:
  - 00: preferred behaviour of the implementation (either undisturbed or zeroing)
  - 01: undisturbed tail
  - 10: zeroing tail
  - 11: (reserved)
- Add a new (unprivileged) RO CSR called vtaildefault. Bits 1:0 of this CSR state the preferred behaviour of the implementation, and their value can only be 01 or 10.

An implementation of V-ext, under this proposal, must always implement undisturbed tail, so the minimal implementation of this proposal simply hardcodes vtaildefault to 01. The reset state of an implementation always sets vtail to 00.

Changing the tail behaviour can be done using vsetvli:

- If nothing appears after the length multiplier, the preferred behaviour is used (vtail is set to 00).
- If u appears after the length multiplier, undisturbed tail is chosen and vtail is set to 01.
- If z appears after the length multiplier, zeroing tail is chosen and vtail is set to 10. If the implementation does not support zeroing, the vill bit of vtype is set.

Execution of an instruction then honours the tail behaviour in vtail:

The tail elements during a vector instruction’s execution are the elements past the current vector length setting.

- vtail = 01: the tail elements do not raise exceptions, and do not update any destination vector register group.
- vtail = 10: the tail elements do not raise exceptions, but do zero the results in any destination vector register group.
- vtail = 00: the implementation behaves either as vtail = 01 or vtail = 10.

[1] https://github.com/riscv/riscv-v-spec/blob/master/v-undisturbed-versus-zeroing.adoc
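
To make the proposal above easier to check, here is a small behavioural model. It is a sketch, not an implementation: the helper names are hypothetical, mapping an absent suffix to vtail = 00 is an assumption, and the model only reports whether vill would be set without modelling any other effect of an illegal vtype.

```python
VTAIL_SHIFT = 8                       # proposal: vtail occupies vtype bits 9:8
VTAIL_MASK = 0b11 << VTAIL_SHIFT
PREFERRED, UNDISTURBED, ZEROING = 0b00, 0b01, 0b10


def set_vtail(vtype, vtail):
    """Return vtype with its vtail field replaced."""
    return (vtype & ~VTAIL_MASK) | (vtail << VTAIL_SHIFT)


def get_vtail(vtype):
    return (vtype & VTAIL_MASK) >> VTAIL_SHIFT


def vsetvli_tail(vtype, suffix, supports_zeroing):
    """Return (new_vtype, vill) for a suffix of '', 'u' or 'z'.

    Assumption: an absent suffix selects the preferred behaviour (00).
    """
    requested = {"": PREFERRED, "u": UNDISTURBED, "z": ZEROING}[suffix]
    vill = requested == ZEROING and not supports_zeroing
    return set_vtail(vtype, requested), vill


def effective_mode(vtype, vtaildefault):
    """Resolve vtail = 00 through the read-only vtaildefault CSR."""
    mode = get_vtail(vtype)
    return vtaildefault if mode == PREFERRED else mode


def write_dest(old, result, vl, mode):
    """Apply the tail behaviour to a destination register of len(old) elements."""
    new = list(old)
    new[:vl] = result[:vl]
    if mode == ZEROING:
        new[vl:] = [0] * (len(old) - vl)
    return new  # UNDISTURBED leaves new[vl:] equal to old[vl:]
```

A minimal implementation would hardcode `vtaildefault` to `UNDISTURBED` and pass `supports_zeroing=False`, so a `z` request only ever results in vill being reported.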