This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Allow zeroing tail as an implementation option #367

Closed
rofirrim opened this issue Feb 5, 2020 · 25 comments

Comments

@rofirrim

rofirrim commented Feb 5, 2020

We (at the BSC) are aware of the trade-offs described for implementations when choosing between undisturbed tail and zeroing tail. However, we believe the option of implementing zeroing of the tail in the V-extension has to exist. In particular, for implementations tailored for the High-Performance Computing (HPC) market, undisturbed tail poses a problem when renaming is used, "as additional cycles are required to read out old tail elements to copy to the tail of the new destination physical register"[1].

However, we are also sensitive to the fact that zeroing complicates smaller implementations of the V-extension not directed at the HPC market and as such it is regarded as undesirable.

Thus we suggest that the possibility of zeroing the tail exist in a way that adds little burden when there is no intent to support zeroing.

This proposal adds the following architectural changes:

Add a new 2-bit field in bits 9:8 of the vtype CSR called vtail with the following meaning:

  • 00 preferred behaviour of the implementation (either undisturbed or zeroing)
  • 01 undisturbed tail
  • 10 zeroing tail
  • 11 (reserved)

Add a new (unprivileged) RO CSR called vtaildefault. Bits 1:0 of such CSR state what is the preferred behaviour of the implementation and its values can only be 01 or 10.

An implementation of V-ext, under this proposal, must always implement undisturbed tail, so the minimal implementation of this proposal simply hardcodes vtaildefault to 01. The reset state of an implementation always sets vtail to 00.
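
A minimal Python sketch may make the proposed encoding concrete. Only the bit positions (9:8 of vtype) and the four encodings come from the proposal above; the helper names `get_vtail`, `set_vtail` and `effective_tail` are hypothetical and exist only for illustration.

```python
# Sketch of the proposed vtail field; helper names are illustrative assumptions.
VTAIL_SHIFT = 8          # vtail occupies bits 9:8 of vtype (from the proposal)
VTAIL_MASK = 0b11

PREFERRED = 0b00         # implementation's preferred behaviour
UNDISTURBED = 0b01
ZEROING = 0b10
RESERVED = 0b11


def get_vtail(vtype: int) -> int:
    """Extract the 2-bit vtail field from a vtype value."""
    return (vtype >> VTAIL_SHIFT) & VTAIL_MASK


def set_vtail(vtype: int, vtail: int) -> int:
    """Return vtype with its vtail field replaced."""
    cleared = vtype & ~(VTAIL_MASK << VTAIL_SHIFT)
    return cleared | ((vtail & VTAIL_MASK) << VTAIL_SHIFT)


def effective_tail(vtype: int, vtaildefault: int) -> int:
    """Resolve vtail = 00 through bits 1:0 of the read-only vtaildefault CSR."""
    vtail = get_vtail(vtype)
    if vtail == RESERVED:
        raise ValueError("vtail = 11 is reserved")
    if vtail == PREFERRED:
        default = vtaildefault & 0b11
        if default not in (UNDISTURBED, ZEROING):
            raise ValueError("vtaildefault may only hold 01 or 10")
        return default
    return vtail
```

For a minimal implementation that hardcodes vtaildefault to 01, `effective_tail(0, 0b01)` resolves the reset state to undisturbed.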

Changing the tail behaviour can be done using vsetvli:

  • If no tail behaviour is specified, the preferred behaviour is used (i.e. vtail is left as 00).
        vsetvli x1, x2, e64
        vsetvli x1, x2, e64,m1
  • If the operand u appears after the length multiplier, undisturbed tail is chosen and vtail is set to 01.
        vsetvli x1, x2, e64,m1,u
  • If the operand z appears after the length multiplier, zeroing tail is chosen and vtail is set to 10.
        vsetvli x1, x2, e64,m1,z

If zeroing is requested on an implementation that does not support it, the vill bit of vtype is set.

Execution of an instruction then honours the tail behaviour in vtail:

The tail elements during a vector instruction’s execution are the elements past the current vector length setting.

  • When vtail = 01 the tail elements do not raise exceptions, and do not update any destination vector register group.
  • When vtail = 10 the tail elements do not raise exceptions, but do zero the results in any destination vector register group.
  • When vtail = 00 the implementation behaves either as vtail = 01 or vtail = 10.

Note: under this proposal, when vtail = 00 software may not rely on the actual contents of the tail of the destination vector register group unless it knows, beforehand, the preferred behaviour of the implementation, for instance having read vtaildefault.
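
The three behaviours can be illustrated with a small Python model of a destination register write. This is a sketch under simplifying assumptions (no masking, one register rather than a register group, elements held in a list); the function name `write_destination` and the `preferred` parameter are hypothetical.

```python
def write_destination(old_dest, result, vl, vtail, preferred=0b01):
    """Model a vector write under the proposed vtail settings.

    old_dest  : previous contents of the destination register (list of ints)
    result    : computed values; only the first vl entries are used
    vl        : current vector length, in elements
    vtail     : 0b00 (preferred), 0b01 (undisturbed) or 0b10 (zeroing)
    preferred : what vtaildefault would report (0b01 or 0b10); an assumption
    """
    if vtail == 0b00:
        vtail = preferred
    if vtail not in (0b01, 0b10):
        raise ValueError("unsupported vtail encoding")

    new_dest = list(old_dest)
    for i in range(vl):                  # body elements are always written
        new_dest[i] = result[i]
    if vtail == 0b10:                    # zeroing: clear everything past vl
        for i in range(vl, len(new_dest)):
            new_dest[i] = 0
    return new_dest                      # undisturbed: tail keeps old values
```

With `old_dest = [9, 9, 9, 9]`, `result = [1, 2, 3, 4]` and `vl = 2`, the undisturbed setting leaves the last two elements at 9, while the zeroing setting clears them.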

[1] https://github.com/riscv/riscv-v-spec/blob/master/v-undisturbed-versus-zeroing.adoc

@vbx-glemieux

vbx-glemieux commented Feb 5, 2020 via email

@David-Horner
Contributor

David-Horner commented Feb 5, 2020 via email

@David-Horner
Contributor

I left some details unspecified.

An additional operand is needed for vsetvli to specify the "tail agnostic" setting.
I would recommend that u and z not be used for tail undisturbed and zero respectively but rather a longer acronym that more clearly states the setting.
I agree that absence of all these operand values should mean the tail undisturbed default.
As a strawman I propose tailag, tailun, and tailzero as hopefully self explanatory acronyms.

I'm OK with vill set if vtail is unsupported. However, I believe we should provide a note that:

  • emphasizes that implementations are free to trap on vsetvl{i} for any specific combination of vtype values
  • allows the trap to be conditional (typically with a CSR configuration setting)
  • recommends not trapping on vtype settings that the implementation supports
  • highlights the benefit of immediate trapping (either explicitly itemizing and explaining or referencing another document). [Some of those being: definitive identification of the instruction, allowing precise determination of arguments, which in turn allows setting an equivalent vtype (in a typical case, substituting tailun for tailag)]

Some open questions and observations:
Although, an implementation could conceivably only support tailag and tailzero, emulating tailag via traps on the non-vsetvl{i}, are we wanting to mandate native tailag support?

vtailag only implies tail non-undisturbed/unzero tail options are possible. Do we want to explicitly permit?

Guy appears to recommend that tailag be the default. I like this approach; it aligns with programmer thinking/awareness. And it can be done in software (as explained in the prior note with linkeditor support). It does make the software model more complex: we provide a facility that in reality does not exist on some (perhaps most) implementations.

I see no way to reasonably make tailag the default hardware standard. We anticipated that programmers will want explicit behaviour (tailun or tailzero) as there are software benefits to them. And as we believe software benefits most from tailun, we chose that as the (hardware) default.

We can formulate this proposal as an optional extension to the base.

@aswaterman
Collaborator

aswaterman commented Feb 6, 2020 via email

@vbx-glemieux

vbx-glemieux commented Feb 6, 2020 via email

@rofirrim
Author

rofirrim commented Feb 6, 2020

Thanks everyone for the comments. Much appreciated. I can't answer all of them (I'm a compiler guy actually, my colleagues will address them) but I'd like to clarify the stance that led to this proposal.

@David-Horner (counter-)proposal of having an agnostic tail is really good. Even better than my initial proposal because undisturbed is still a baseline that is always available. Now I realise that the "preferred tail behaviour" is not desirable and being able to state "I don't care what you do in the tail" is much better.

Most of the time we don't care about what ends up in the tail. This is why it felt unnecessary to us to have to honour the undisturbed tail semantics. I'm aware of the cases where undisturbed is better, and those do justify always having undisturbed available. In short: undisturbed must stay.

However, we need a way to realize the agnostic case. Undisturbed is a way. Another one is zeroing. Before posting this proposal we did internally discuss having "undefined" values in the tail. But, as @vbx-glemieux already pointed out, it comes with security risks and, in a post-Spectre world, I would be extremely cautious to open new avenues in which we leak implementation details. Hence zeroing seemed reasonable.

In the line of what @David-Horner suggested, we might simplify the proposal even further by making vtail a single bit that means either "requires undisturbed in the tail" or "doesn't care what goes in the tail". For the latter case, we can make what goes in the tail implementation-defined. Zeroing is a valid implementation, and so are undisturbed and undefined (modulo security issues); even "all ones" (not necessarily meaningful) would be a valid implementation.

Under that simplified angle, a vtailzero instruction doesn't seem necessary to me.

Kind regards.

@solomatnikov

This was discussed in the WG meeting and, as @aswaterman pointed out, is not actually a problem for an OOO processor with renaming. Renaming has to be done at least at the granularity of an individual vector register, not a vector register group, which limits the performance impact. Of course, renaming can be done at finer granularity.

@opalomar

opalomar commented Feb 6, 2020

Hi all,

this is Oscar from the BSC hardware team. Thank you all for the comments and suggestions. This is helpful, there are many valid and interesting points.

We are aware that it could be possible to use the renaming table rather than copying values to implement undisturbed tail. In the example from @aswaterman, that is a perfectly reasonable solution.

However, this does not work well for us. In our case, we have VLEN = 256 * 64 = 16384 bits and a data path of 64 * 8 lanes = 512 bits. That means having 32 individual "sub-renamings" per register, so the overheads are much larger than in the example @aswaterman posted, and doing this would complicate the renaming logic of our design a lot. There are multiple ways to implement this, but freeing registers would be complex, and we may need indirection in accessing the physical register. I believe we may lose many advantages of vectors with this approach.
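
The arithmetic behind the 32 "sub-renamings" can be checked with a few lines of Python. The numbers are the ones quoted above; the variable names are only illustrative.

```python
# Figures quoted for the BSC design; names are illustrative.
ELEMENTS_PER_REGISTER = 256
ELEMENT_BITS = 64
LANES = 8

vlen_bits = ELEMENTS_PER_REGISTER * ELEMENT_BITS   # 256 * 64 = 16384
datapath_bits = ELEMENT_BITS * LANES               # 64 * 8 = 512

# One rename granule per data-path-wide slice of a vector register.
sub_renamings_per_register = vlen_bits // datapath_bits

print(vlen_bits, datapath_bits, sub_renamings_per_register)
```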

I hope this clarifies why we are not comfortable with the current status. We are happy to implement in our design support for undisturbed tail with lower performance, we are simply looking for room in the specification to allow executing instructions at higher performance (when undisturbed tail isn't necessary).

@David-Horner
Contributor

David-Horner commented Feb 7, 2020

The discussions in WG were in the context of the appropriate default tail behaviour.
Agnostic was dismissed early in the tail default discussion. The general consensus (as I perceived it) was that deterministic behaviour was preferred, for numerous philosophical and perception/practical reasons. The remaining choices were zero or undisturbed for the default.

With undisturbed, software will opportunistically use such regions to avoid spills, etc.
And there are some algorithms that could benefit from processing successively from small vl to larger vl keeping the tail data intact.

However, a large body of code can be written that is tail contents agnostic.
This code provides an opportunity to optimize hardware.
Specifically, unmasked vector operations do not need to read the target register set if agnostic is enabled.

The vector ISA has catered to simple implementations; with among other things restrictions on overlapping source and destination registers. It is difficult to assess how large a systematic bias is present favouring simple and other "mainline" implementations.
The WG discussion noted that masked operations do need to read the target set, so support of undisturbed was "almost free". That is not true when multiple work units are enabled to provide parallel operations. Most work units could be optimized for unmasked operation and thus save substantial real estate, complexity and power without "tail undisturbed".

Roger's minimal single bit proposal targets this possibility. It could be deferred as it can be implemented as a standalone extension.
However, if the whole ecosystem, and not just HPC sub-market is to benefit from tail agnostic code, support needs to be provisioned early on.

I suggest we allocate currently unused bits in vtype as reserved for custom use.
Of the total 11 bits mapped into vsetvli, 7 are in use with lmul/sew/ediv. Of the 4 remaining I suggest 1 bit (bit 10 in vtype) be reserved for custom use. This is 25% of the remaining vsetvli directly addressable bits. I further suggest we allocate bit 11 for custom use. This will allow a single load immediate to provide both lower custom use bits. And further I suggest we allocate from the vtype [30:12] further bits for custom use. Doing so now will allow proposals early on in this current design and testing phase that may be incorporated as their merit is empirically assessed.

@David-Horner
Contributor

David-Horner commented Feb 7, 2020

Miscellaneous thoughts:

Appropriateness of bit 11 of vtype as custom use -
bit 11 is particularly problematic to use with vsetvl, at least as part of the standard.
A load immediate will generate a negative value which would set vill when used with vsetvl.
To generate the bit 11 itself requires more than one instruction.
However, as a custom bit, setting it can have any meaning, including
- ignoring the vill
- zeroing bits [31:12]
- complementing all the bits (or just those [31:12])
Indeed, this is behaviour that would make the use of load immediate viable but would not be acceptable in the standard vector extension.
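
The sign-extension effect described above can be sketched in Python. The sketch assumes a single instruction with a 12-bit sign-extended immediate (as in RISC-V I-type instructions such as `addi`) and assumes vill sits in bit XLEN-1 of vtype, as in the draft specification; the function names are hypothetical.

```python
XLEN = 64  # assumption: RV64; the same reasoning applies to RV32


def sign_extend_imm12(imm12: int) -> int:
    """Interpret a 12-bit immediate field as a signed value."""
    imm12 &= 0xFFF
    return imm12 - 0x1000 if imm12 & 0x800 else imm12


def as_register(value: int, xlen: int = XLEN) -> int:
    """Two's-complement register image of a signed value."""
    return value & ((1 << xlen) - 1)


def top_bit_set(reg: int, xlen: int = XLEN) -> bool:
    """True if bit XLEN-1 (where vill is assumed to live) is set."""
    return bool((reg >> (xlen - 1)) & 1)


# A one-instruction immediate with bit 11 set sign-extends to a negative value,
# so every bit above bit 11, including bit XLEN-1, becomes 1.
reg_bit11 = as_register(sign_extend_imm12(0x800))
# An immediate that only sets bit 10 stays positive.
reg_bit10 = as_register(sign_extend_imm12(0x400))
```

This is why producing a value with only bit 11 set takes more than one instruction, as noted above.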

Renaming can help mitigate the tail undisturbed cost but does not eliminate it.
Again, in the context of determining the default tail behaviour this mitigation helped tip the scale to tail undisturbed. However, increased granularity causes non-linear increased overhead. If vl is not at the end of a granule, the partial fill of the tail elements requires a read of the target register that might otherwise not be needed (and is definitely not needed for an unmasked instruction).

tailag as default software stance
Tail undisturbed is established as the default hardware behaviour.
Software however does not need to assume that.
Rather it is better to have software assume that the hardware does not preserve tail data unless explicitly requested.
I believe this is the corollary to Guy’s statement:

From this new point of view, the programmer is thinking "which mode is
the most natural for the code sequence that I am going to write?". Eg,
the programmer wants to say "the code below depends upon the tail
being zeroed", or "the code below depends on the tail remaining
undisturbed" or "the code below does not depend upon the tail". The
expected use:

(a) Most code sequences will use the last option, where it does not
depend upon the tail so it doesn't care which option.

(b) In only some (presumably short) code sequences will it want the
tail to remain undisturbed; in these cases, the code is shorter and
likely faster (less data copying, less register spilling, etc) if the
tail is left undisturbed.

If we can succeed in encouraging through the software ecosystem this mindset and code annotation, then all the code can benefit from BSC’s proposed extension.

@kasanovic
Collaborator

My experience is that tail undisturbed is useful behavior in some common idioms, including reductions, while tail zeroing is rarely useful, so I'd agree that if we're going to support options that the default is undisturbed and the option can be "don't care".

I don't actually believe there are security implications to "don't care"; the context switch code should zero/save/restore the state explicitly regardless. The don't care state will hence come from the same security domain.

There are software portability concerns however. Bugs won't be portable between systems with the same VLEN, which will irritate programmers who like to see their bugs' behavior preserved reproducibly, including when migrating between big/little cores with the same VLEN but different microarchitectures.

The renaming cost is purely a cost/performance tradeoff. There is no need to rename at the granularity of a single beat of vector execution. For the BSC machine, it's OK to only rename at the granularity of every four beats (2048 bits). In rare cases where there's enough vector issue bandwidth and enough vector ILP to otherwise fill functional unit pipelines with different instructions every clock cycle, there could be some slowdown from coarser execution granularity, but, for example, NVIDIA GPUs always execute four beats atomically (last systems I checked). I don't believe renaming at sub-register granularity has any additional complexity over renaming to handle LMUL (not saying it isn't complex, just not additional complexity beyond storage). Finer-grain renaming can also have benefits, with effectively more rename registers at smaller vector lengths.
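
To see how rename granularity interacts with vl, here is a small Python sketch that classifies the granules of one register as fully active, partially active, or entirely tail. It assumes LMUL = 1 and uses the BSC figures quoted in this thread; `classify_granules` is a hypothetical helper, not something from the specification.

```python
def classify_granules(vl, sew_bits, vlen_bits, granule_bits):
    """Return (full, partial, tail) granule counts for one vector register.

    full    : granules completely covered by active elements
    partial : 0 or 1 granule that straddles vl (needs old data merged in
              under tail-undisturbed)
    tail    : granules entirely past vl (can keep their old mapping under
              sub-register renaming)
    """
    active_bits = vl * sew_bits
    if active_bits > vlen_bits or vlen_bits % granule_bits:
        raise ValueError("inconsistent vl/VLEN/granule parameters")
    total = vlen_bits // granule_bits
    full = active_bits // granule_bits
    partial = 1 if active_bits % granule_bits else 0
    return full, partial, total - full - partial


# VLEN = 16384 bits, SEW = 64, vl = 40 elements (2560 active bits)
coarse = classify_granules(40, 64, 16384, 2048)   # four-beat granules
fine = classify_granules(40, 64, 16384, 512)      # single-beat granules
```

With four-beat granules, vl = 40 leaves one partially filled granule, whereas with single-beat granules the same vl happens to end on a granule boundary.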

I understand the desire to avoid this hardware cost, but there is a software ecosystem cost here too.

@jnk0le

jnk0le commented Feb 7, 2020

I believe that even in said granularized renaming, the zeroed blocks could be renamed to some kind of virtual zero register, giving a bit less PRF pressure.

As long as the agnostic behaviour is contained within an "if unsure, don't use it" configuration, we should be fine with it. I also agree that only 1 bit for the proposed vtail is sufficient.

@billhuffman

I wonder if it would help the issue the BSC folks have if we allowed VLEN to be variable instead of fixed. I assume a WARL field in a register that in most designs would be hardwired to one value, but in some designs could hold several values. For codes with shorter vectors, a smaller VLEN could be programmed. This might not be reasonable unless the granularity of short vs. long vector use was fairly large.

@kasanovic
Collaborator

Variable VLEN could easily be added as a feature at supervisor level without affecting the unprivileged spec. Having it as an unprivileged feature is also possible, though I wonder how useful it really is, given that it would be a global setting across the whole program. More local VLEN setting should just be done via "vl".

@billhuffman

Supervisor level seems too heavy-weight. The compiler knows when there are no live vector registers and can change VLEN then. Shorter vector registers work with the same binary code but reduce tail copying. Changing vl still leaves the tail copying to be done in hardware.

I'm only thinking of this for machines with very long VLEN. Most machines would support only one VLEN value.

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 23, 2020

Compiler support for runtime-variable vector sizes is far from trivial, even if you make concessions about where the changes can occur (e.g., only on function entry/exit). It's not enough to know whether there are live vector registers; any live value anywhere (registers or memory) of any type (e.g. scalar integers computed from the vector size) can be a problem. It is far from trivial to adapt compiler IR(s) to keep track of and control these vector-size-dependent values, and it's dubious whether it's worth the complexity and engineering effort. I developed a concrete proposal for how to achieve it in LLVM back when RISC-V V practically required it (spec versions before 0.6, IIRC) and it was still a very invasive proposal despite my best efforts to make it as acceptable as possible for the rest of the LLVM community.

These problems are also why LLVM explicitly does not support using the equivalent SVE feature in this way. Starting different processes with different vector sizes is fine, of course, but changing the vector size of a running process is not supported. (I don't know how much discussion about this happened in GCC but I am rather sure that the problems described before are just as hard there.)

Besides, I am skeptical how much making VLEN variable would help with the problem at hand. While large VLEN is unnecessary for some workloads (in particular, for loops with few iterations), this can't always be predicted at the time the code is written/compiled, and in other cases (e.g. when you need two or three iterations of strip-mining at maximum VLEN) it's not clear how good of a trade-off it is to increase dynamic instruction count just to reduce tail copying.

@kasanovic
Collaborator

I agree with @hanna-kruppe's analysis, which is why I suggested only really supporting it at privileged levels, where it is more of an emulation support mechanism than a performance optimization. The regular vl-setting mechanism should handle dynamic run time lengths.

@billhuffman

In that case, I withdraw my "I wonder if..." about varying VLEN.

@kasanovic
Collaborator

I believe the tail-agnostic design cannot actually help a renamed
vector register implementation with long temporal vectors, because of
security concerns (my comment above regarding security was incorrect).

The major optimization that motivates the tail-agnostic option is to avoid having to write the
tail elements of the new physical destination vector register when not supporting sub-vector-register renaming. Another advantage is to avoid needing to read values from the old
physical destination vector register.

However, we cannot allow implementations to simply not write to the
tail of a new physical destination vector register allocated off the
vector physical register free list. The new physical register can
potentially contain data from some other context.

The security challenge is that the privileged layer is not able to clear this hidden microarchitectural state on a context swap in a deterministic non-microarchitecture dependent way. Having some way to explicitly clear the free physical register pool state is a possibility, but is architecturally messy. Note that regular whole vector register save/restore is sufficient to avoid this security hole
for non-tail-agnostic machines.

I think this means we have to require that tail-agnostic be strictly either tail-undisturbed or tail-zeroed. But to make tail-zeroing efficient on long temporal vector registers requires the sub-vector-register renaming support anyway. Zeroing is actually worse than undisturbed in this case, as all tail sub-vector-register units have to be renamed to point to a single zero register, versus just left alone.

Zeroing does also avoid reading the old physical destination register values, but this is only an
issue for the last active sub-vector as otherwise sub-vector renaming avoids the copy. Zeroing also provides a little more effective rename register capacity.

So, I think tail-agnostic does not actually save much over tail-undisturbed, even for long temporal vector registers that are renamed, and might be worse if it requires tail-zeroing, so we should only support tail-undisturbed in the standard.

Providing a non-standard extension that allows state to leak between contexts would be an option.

@David-Horner
Contributor

TL;DR
Suggest alternatives to avoid data/state leakage.
Suggest idea of tail-avoidance and tail-disabled approaches for tail-agnostic.
Suggest Not a Value (NaV) as also supported internal "value" for tail-agnostic support.
Propose individual "valid-length" approach to manage tail-agnostic support using above ideas.
Suggest if tail-agnostic is a non-standard extension support be provided. Consider #369.

@kasanovic

The major optimization that motivates the tail-agnostic option is to avoid having to write the
tail elements of the new physical destination vector register when not supporting sub-vector-register renaming.

Avoiding the maintenance of sub-vector-register renaming.
The win-win is to do both.

Another advantage is to avoid needing to read values from the old
physical destination vector register.

Agreed. Also part of a win-win-win solution.

we cannot allow implementations to simply not write to the
tail of a new physical destination vector register allocated off the
vector physical register free list.

Agreed. Any tail allocated register component must be vetted or avoided.

The security challenge is that the privileged layer is not able to clear this hidden microarchitectural state on a context swap in a deterministic non-microarchitecture dependent way.

Agreed.

However, the microarchitecture could vet the vector physical register free list on each return from interrupt/exception. There may be many ways to do this that are not currently viable but become trivial with support for domain identification/tracking.

An approach: mark the elements in the list as unclean on a return from a more privileged mode, and mark them as clean either by ensuring the specific register allocation will be fully overwritten by the vector operation (a common situation) or when explicitly "cleansed" (with zero or otherwise; this can occur during interrupt/exception return, even before the first vector instruction is scheduled).
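
The clean/unclean marking idea can be sketched as a toy Python model. This is only an illustration of the bookkeeping, not a hardware design; the class `PhysRegFreeList` and its method names are hypothetical.

```python
class PhysRegFreeList:
    """Toy model of a physical vector register free list with clean tracking."""

    def __init__(self, num_regs):
        self.free = list(range(num_regs))
        self.unclean = set()       # free registers that may hold foreign data
        self.cleanse_count = 0     # explicit fills performed (cost proxy)

    def return_from_privileged_mode(self):
        # Everything still on the free list may contain another context's data.
        self.unclean = set(self.free)

    def cleanse(self, reg):
        # Explicitly overwrite a register (with zero or any vetted value).
        if reg in self.unclean:
            self.unclean.discard(reg)
            self.cleanse_count += 1

    def allocate(self, fully_overwritten):
        reg = self.free.pop(0)
        if fully_overwritten:
            # The instruction writes every element, so no stale data survives.
            self.unclean.discard(reg)
        else:
            # A tail would remain, so the register must be cleansed first.
            self.cleanse(reg)
        return reg
```

In this model only allocations that leave a tail on an unclean register pay for an explicit fill; fully overwritten destinations become clean for free.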

A preferred method would be one that is inherent in normal operations, with no additional internal state for problem avoidance but primarily optimization use.

Note that regular whole vector register save/restore is sufficient to avoid this security hole
for non-tail-agnostic machines.

Agreed.

I think this means we have to require tail-agnostic must be strictly either tail-undisturbed or tail-zeroed.

By "tail-zeroed" I believe you mean tail-fill with vetted data. I.e. Data from architectural registers. e.g. The tail fill value could be the last written element value, or derived from one or either of the source registers. Basically whatever is convenient/optimal for the micro-architecture.

Here's where I disagree. There are more options than those two.
Specifically tail-avoidance and tail-disabled are possible. And some of these can yield the win-win-win benefit sought above.

One idea is closely related to the variable VLEN suggestion.
Architecturally visible VLEN changes are problematic.
But within the micro-architecture, specifying a VLEN per register could establish a tail-management avoidance or disabled approach. This is especially valuable for the use case described above.

I here propose a possible micro-architecture implementation:

A "valid-length" (count of valid segments) defines the segments that are fully defined, segments beyond this are the agnostic tail. This is internal state information within the vector processor.

This is especially so if such elements are considered not to be elements at all. That is, not only could the value be anything, but it is also acceptable to consider the values “invalid”, i.e. Not a Value (NaV).

Access to tail segments could do any or all of the following

  1. increase a performance counter
  2. trap and with extra exception information complete the operation, perhaps re-executing with modified vl. (consider vmax/min and agnostic data)
  3. provide constant data (like zeros)
  4. provide data from the last valid segment (reasonable if processing is always performed sequentially) This could be provided to the ALU, or directly provided to the
  5. provide data from the first segment (which in some designs will always be valid) and is consistent with the scalar processing. This is appropriate when parallel operations such as reductions are in play.
  6. for the duration of the instruction, use an effective vl of the min of vl and the 2 sources’ valid-lengths. For instructions running under tail-agnostic the valid-length is set based on this effective vl. When running under tail-undisturbed everything from the effective vl could be undisturbed which would also benefit non-rename implementations. This approach is appropriate to many operations (especially non-masked versions) in which the destination value is fully determined by the sources.
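
Option 6 can be sketched in Python as follows. For simplicity the sketch assumes valid-lengths are tracked in elements rather than segments, and it models only tail-agnostic instructions with two sources; the names `effective_vl` and `run_tail_agnostic` are hypothetical.

```python
def effective_vl(vl, src1_valid_length, src2_valid_length):
    """Effective vl for one instruction: min of vl and both sources' valid-lengths."""
    return min(vl, src1_valid_length, src2_valid_length)


def run_tail_agnostic(valid_length, dest, src1, src2, vl):
    """Execute one two-source instruction under tail-agnostic.

    valid_length maps register names to their internal valid-length.
    The destination's valid-length becomes the effective vl, so later
    consumers never operate on elements that were never defined.
    """
    evl = effective_vl(vl, valid_length[src1], valid_length[src2])
    valid_length[dest] = evl
    return evl


# Illustrative state: v2 was last written with only 6 valid elements.
lengths = {"v1": 16, "v2": 6, "v3": 16}
first = run_tail_agnostic(lengths, "v3", "v1", "v2", vl=8)
second = run_tail_agnostic(lengths, "v1", "v3", "v3", vl=4)
```

Here the first instruction is clamped by the shorter source, and the destination inherits that shorter valid-length.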

I realize NaV can introduce ambiguity and inconsistency, but if it can be tamed it could provide for meaningful optimizations.
Special cases of xor/and/or register with itself are often identified and optimized so anomalous behaviour is avoided.
Many of the possibilities when consistently applied (especially 5) do not cause anomalies to arise.
And a judicious definition of tail-agnostic is sufficient to allow option 6.
Something along the lines of: Successive agnostic operations combine their field of agnostic behaviour.
Further, NaV to some extent is a superposition that collapses with specific operations. E.g. An OS write back of a Whole Register Read will reset the valid-length of the register. As a result, an interrupt can cause specific values different from those without the interrupt, but they could still be consistent with the range of values allowed by tail-agnostic.

Providing a non-standard extension that allows state to leak between contexts would be an option.

To me it appears that viable, reasonable, performant and practical implementations of tail-agnostic that do not leak state are possible.

As a result I believe we should continue to consider tail-agnostic in the standard.

However,
Providing a non-standard extension that DOES NOT allow state to leak between contexts could also be an option.
And if so, #369 was presented for this purpose.

@rofirrim
Author

rofirrim commented Mar 4, 2020

An effect of tail-undisturbed is that vector operations now logically have an extra operand that represents the values of the tail elements. A compiler will have to assign this extra operand to the destination register of the vector operation (even to an "undefined" value when the code doesn't actually care about the tail).

Masked intrinsics in the compiler often have a "merge" (or "dest") operand that states what are the values of the inactive elements. It looks very similar to the tail-undisturbed situation.

The way I see it, however, is that the vector length represents the logical extent of the vector being processed. The mask does not represent such extent but a subset of it. As the inactive elements are still inside that logical extent, it makes sense to give them a value, hence the "merge" operand.

There is value in tail-undisturbed for algorithms that accumulate partial results on a register. I'm worried however, that this is just the only case where tail-undisturbed is actually needed and there are many other instances where the tail behaviour is not relevant. Being able to communicate this fact to the architecture seems beneficial.

That said I understand that mandating the possibility of zeroing can be a burden for smaller implementations. Maybe we can turn this into a, say, Zvzerotail extension of V-ext, that adds one bit in vtype to express the desired zeroing behaviour.

At the level of assembly it could look like this

# Tail-undisturbed (base, always valid)
vsetvli x1, x2, e64
vsetvli x1, x2, e64,m1
# Tail-zeroing (only valid under Zvzerotail)
vsetvli x1, x2, e64,m1,z
# Tail-undisturbed (only valid under Zvzerotail)
vsetvli x1, x2, e64,m1,u    # alias of `vsetvli x1, x2, e64,m1`

@opalomar

opalomar commented Mar 4, 2020

A few comments from a HW perspective:

  • "But to make tail-zeroing efficient on long temporal vector registers requires the sub-vector-register renaming support anyway. "

In order to implement tail-zeroing, we consider that there are alternatives to sub-vector-register renaming (e.g. keep internally the vector length for each register, masks, ...) that will work well in architectures with long temporal vectors.

  • The issue with the current tail-undisturbed scheme in architectures with register renaming is that the overhead is large for long temporal vectors, since it requires copying values. This may be alleviated by the "granularised renaming" proposed earlier. It limits the amount of values to copy, and can help increase the number of rename registers. However, it has significant complexity. It requires support in the rename, issue and commit logic, and in the logic reading from the register file (for example, an instruction may not start reading from the first "granule" if VSTART is larger than the granule size). It also has area overhead in the renaming logic structures.

  • Tail-undisturbed creates additional dependences, limiting concurrency. For example, in the sequence Load V0, Add v1<-v1,v0, Load V0, the second load artificially depends on the first one.
    This will happen, for example, in a reduction loop, and it prevents the second load from executing in parallel with the first one.
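
The artificial dependence can be made explicit with a small Python sketch. It assumes a renamed machine where write-after-write hazards are already removed, so only true data dependences remain; instructions are modelled as hypothetical `(dest, sources)` tuples.

```python
def registers_read(instr, tail_policy):
    """Registers an instruction must read before it can produce its result."""
    dest, sources = instr
    reads = set(sources)
    if tail_policy == "undisturbed":
        # The old destination value is needed to fill the tail of the new
        # physical register, so the destination becomes an implicit source.
        reads.add(dest)
    return reads


def depends_on(later, earlier, tail_policy):
    """True if `later` reads the register written by `earlier`."""
    earlier_dest, _ = earlier
    return earlier_dest in registers_read(later, tail_policy)


# The sequence from the comment: Load V0; Add v1 <- v1, v0; Load V0
load_a = ("v0", [])
add = ("v1", ["v1", "v0"])
load_b = ("v0", [])

serialised = depends_on(load_b, load_a, "undisturbed")
independent = not depends_on(load_b, load_a, "agnostic")
```

The dependence of the add on the first load is real under either policy; only the load-to-load dependence is an artefact of preserving the tail.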

@David-Horner
Contributor

Adding tail-zeroing leads to fragmentation and overburdening hardware implementations.
Adding tail-agnostic does not burden hardware implementations.
They can continue to use tail-undisturbed for tail-agnostic situations.

@rofirrim I agree that there is value for the compiler to only track tail undisturbed when useful.
Specifically, as you said to avoid tracking

even to an "undefined" value when the code doesn't actually care about the tail.

I do expect tail undisturbed to be more generally useful than just for algorithms that accumulate partial results on a register. Programmers and compiler writers continue to be extremely creative in using the target ISA's unique characteristics and fringe cases.

However, I want to emphasize once again that the options need not be tail-zeroing and tail-undisturbed. Instead of tail-zero, tail-agnostic does exactly what you want: allows the compiler to "not care" and not need to track tail contents.

So the vsetvli constructs would be:

# Tail-undisturbed always valid to use, but use when undisturbed is intended
vsetvli x1, x2, e64,m1,tu    # alias of `vsetvli x1, x2, e64,m1`
# Tail-agnostic also always valid to use, but use when tail values are not useful
vsetvli x1, x2, e64,m1,ta

I propose the base includes the extra bit in vtype and it be set with ta/tu even if the hardware is always tu.

@jnk0le

jnk0le commented Apr 22, 2020

One more agnostic approach is to zero the tail in last sub-vector and undisturb the rest of the register.
It benefits InO as well as sub-renamed OoO implementations, though bugs will be even harder to port.

@David-Horner
Contributor

David-Horner commented Apr 22, 2020

I believe it should be considered the same agnostic approach.
That is, each byte in the tail (or masked off) is either the undisturbed destination byte or the designated fill byte (currently proposed to be x'FF').
We can expect that fill bytes will be in SEW bit groups on SEW boundaries.
This meets a recommended criterion for agnostic:
that only two states need to be checked for validation.
Either undisturbed or designated value.
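
The two-state check can be sketched in Python. The register is modelled as a byte string before and after the instruction; the helper name `tail_is_valid_agnostic` is hypothetical, and the fill value x'FF' is the one proposed above. Masked-off elements are not modelled.

```python
def tail_is_valid_agnostic(old, new, vl, sew_bytes, fill=0xFF):
    """Check that every tail byte is either undisturbed or the fill byte.

    old, new  : destination register contents before/after, as bytes
    vl        : vector length in elements
    sew_bytes : element width in bytes
    """
    if len(old) != len(new):
        raise ValueError("register images must have the same length")
    start = vl * sew_bytes            # first byte of the tail
    return all(n == o or n == fill
               for o, n in zip(old[start:], new[start:]))


before = bytes([1, 2, 3, 4, 5, 6, 7, 8])
# Body (first 4 bytes) rewritten; one tail element filled, one left alone.
mixed = bytes([9, 9, 9, 9, 0xFF, 0xFF, 7, 8])
# A zeroed tail byte is neither the old value nor the fill byte.
zeroed = bytes([9, 9, 9, 9, 0, 0, 7, 8])

mixed_ok = tail_is_valid_agnostic(before, mixed, vl=2, sew_bytes=2)
zeroed_ok = tail_is_valid_agnostic(before, zeroed, vl=2, sew_bytes=2)
```

A validator built this way accepts an implementation that fills only part of the tail, which is the mixed case raised in the previous comment.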

Addendum:

  1. checking becomes more problematic if the granularity is less than a byte (not horribly, but significantly).
  2. byte granularity allows better poisoning of the agnostic values so that accidental dependence upon -1 values or unchanged values is detected.

10 participants