
Change SEW to be the "largest element width" #425

Closed
kasanovic opened this issue Apr 19, 2020 · 4 comments

Comments

@kasanovic
Collaborator

This proposal is a modification of an earlier idea to add an effective
element width to load/store instructions, to mitigate the loss of the
dropped fixed-width loads/stores and to provide greater efficiency for
mixed-width floating-point code.

This proposal redefines SEW to be the largest element width (LEW?),
and correspondingly redefines the widening/narrowing operations.

Previously, a double-widening add was defined as

  2*SEW = SEW + SEW

whereas the new proposal specifies

  SEW = SEW/2 + SEW/2
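
For concreteness, here is a minimal sketch of the same widening add
set up under each definition (register numbers are arbitrary):

  # Current definition: SEW names the source width of a widening op
  vsetvli t0, a0, e16,m1
  vwadd.vv v8, v4, v6    # 32b = 16b + 16b, i.e., 2*SEW = SEW + SEW

  # Proposed definition: SEW names the largest (destination) width
  vsetvli t0, a0, e32,m1
  vwadd.vv v8, v4, v6    # 32b = 16b + 16b, i.e., SEW = SEW/2 + SEW/2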

This proposal does not change the behavior/implementation of existing
instructions, except to change how the effective EW and effective LMUL
are obtained from vtype.

For load/store instructions, we could modify these to have relative
element widths that are fractions of SEW {SEW, SEW/2, SEW/4, SEW/8}.
Assembly syntax could be something like vle, vlef2, vlef4, vlef8.
There is a readability challenge in that these are relative to the
last vtype setting, and also, when SEW is less than 64, some become
useless.
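
To illustrate the readability concern, the same relative mnemonic
would denote different absolute widths depending on the preceding
vtype (vlef2 is the hypothetical SEW/2 load from above):

  vsetvli t0, a0, e32,m1
  vlef2.v v1, (a1)       # EEW = SEW/2 = 16b here

  vsetvli t0, a0, e16,m1
  vlef2.v v1, (a1)       # same mnemonic, but now EEW = 8b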

I think for this reason we should stick with fixed vle8, vle16, vle32,
vle, in the base encoding {8,16,32,SEW}. These are more readable and
can be used to interleave load/stores of larger-than-SEW values
without changing vtype (e.g., moving 32b values when SEW=16).
Extending past the base, we could add relative-EEW load/stores using
an unused mop bit, for example.
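
As a sketch of the interleaving point (register numbers arbitrary,
and assuming EMUL scales as EEW/SEW):

  vsetvli t0, a0, e16,m1   # SEW = 16 for the main computation
  vle32.v v8, (a2)         # 32b move: EEW = 32b, EMUL = 2 (v8-v9)
  vse32.v v8, (a3)         # ... without any vtype change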

Mask registers have an EEW of SEW/LMUL, and so defining the relative
sizes as fractions of SEW also makes it more likely that code can
save/restore a mask register without changing vtype. The fixed sizes
will cover some of these cases, though code might have to save/restore
more than needed if SEW/LMUL < 8.
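
For example, with SEW=32 and LMUL=8 the mask EEW works out to

  SEW/LMUL = 32/8 = 4 bits per element

which is below the smallest fixed width (8b), so a fixed-width
save/restore would move more bits than actually hold mask state.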

There may actually be a minor hardware-checking advantage to knowing
that SEW holds the largest possible element width, since following
instructions cannot request an effective SEW larger than this. Even
with fixed-size load/stores, if the sizes are {8,16,32,SEW}, then all
of these will also be legal on standard implementations.
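
A sketch of the checking point, reusing the hypothetical relative
mnemonic from above:

  vsetvli t0, a0, e32,m1   # SEW = 32 is the declared largest width
  vlef2.v v1, (a1)         # relative: EEW = SEW/2 <= SEW by construction
  vle16.v v2, (a2)         # fixed: 16 <= 32, legal on standard implementations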

Here are some of the examples we've been discussing.

The "worst-case" example from the TG slide:

  # int32_t a[i] = int8_t b[i] << 15
  # With fixed-width load/store
  vsetvli t0, a0, e32,m1
  vlb.v v4, (rb)
  vsll.vi v4, v4, 15
  vsw.v v4, (ra)

Under this proposal, assuming quad-widening and fractional LMUL:

  vsetvli t0, a0, e32,m1
  vle8.v v1, (rb)      # load bytes at EEW=8 (EMUL=1/4)
  vqcvt.x.x.v v4, v1   # Quad-widen to 32 bits
  vsll.vi v4, v4, 15   # 32-bit shift
  vse32.v v4, (ra)     # could also use vse

Obviously, adding a quad-widening left shift would remove any
difference between the two.
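
A sketch of that fused form, with vqsll.vi as a hypothetical
quad-widening shift mnemonic:

  vsetvli t0, a0, e32,m1
  vle8.v v1, (rb)        # load bytes, EEW=8
  vqsll.vi v4, v1, 15    # hypothetical: quad-widen and shift in one op
  vse32.v v4, (ra)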

The previous widening compute operation:

# Widening C[i]+=A[i]*B[i], where A and B are FP16, C is FP32

vsetvli t0, a0, e32,m8  # vtype SEW=32b
vle16.v v4, (a1)
vle16.v v8, (a2)
vle32.v v16, (a3)       # Get previous 32b values, no vsetvli needed.
vfwmacc.vv v16, v4, v8  # EEW,ELMUL of source operands is SEW/2,LMUL/2
vse32.v v16, (a3)       # EEW=32b, no vsetvli needed

The example code from #362

Original fixed-width code assuming 32b float.

   vsetvli t0, a0, e32,m8   # Assuming 32b float
loop:
   [...]                    # Other instructions with fixed vl and sew
   vlb.v v0, (x12)          # Get byte value
   vadd.vx v16, v0, x11     # Add scalar integer offset
   vfcvt.f.x.v v16, v16     # Convert to 32b floating value
   [...]

converts to:

   vsetvli t0, a0, e32, m8
loop:
   [...]
   vle8.v v0, (x12)         # Load bytes
   vqcvt.x.x.v v16, v0      # Quad-widening sign-extension
   vadd.vx v16, v16, x11    # Add offset
   vfcvt.f.x.v v16, v16     # Convert to float
   [...]

which is just one more instruction in the inner loop.

With a quad-widening add (vqadd.vx), there would be no penalty in this
case:

   vsetvli t0, a0, e32, m8
loop:
   [...]
   vle8.v v0, (x12)         # Load bytes
   vqadd.vx v16, v0, x11    # Add offset, quad-widen
   vfcvt.f.x.v v16, v16     # Convert to float
   [...]

The vle8 version also reduces the number of vector registers tied up
buffering load values when the loop is software-pipelined/unrolled,
allowing more loads in flight for a given LMUL.

Basically, in software-pipelined/unrolled loops, widen-at-load
instructions tie up more architectural registers than widen-at-use,
where the widening is part of the consuming instruction, reducing
flexibility in scheduling and/or reducing LMUL.

Even if an explicit widen instruction is used, at that point the
values are already in registers and the widen will proceed at the
maximum rate, generally reducing architectural register occupancy
versus having the wider values in flight from memory. For example:

      vsetvli t0, a0, e32,m8

      vlb.v v0, (x12)       # v0-v7 tied up while
      .                     # scheduling around memory latency
      .
      .
      vadd.vx v16, v0, x11  # consume here
      vlb.v v0, (x12)       # schedule next iteration here

versus

      vsetvli t0, a0, e32,m8

      vle8.v v0, (x12)       # only v0-v1 tied up
      .  vle8.v v2, (x13)    # fetch next iteration into v2-v3
      .
      .
      vqcvt.x.x.v v16, v0    # Quad-widening sign-extension
      vadd.vx v16, v16, x11  # Add offset

As mentioned earlier, the widened loads might also run at a slower
rate unless full register write-port bandwidth is provided.

Another code example, extracted from a 3x3 convolution with 32b += 16b * 8b:

      vsetvli t0, a0, e16,m2
      vle8.v v1, (x)      # occupies v1
      vwcvt.x.x.v v4, v1  # widen v1 into v4-v5
      vsetvli t0, a0, e32,m4
      vwmacc.vx v8, x11, v4  # widening mul-add into v8-v11

New quad instructions that did 32b += 16b * 8b would obviously help here (e.g., vqmacc.wv), and might be a good idea in an extension.
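
A hedged sketch of how such a fused instruction might collapse the
sequence above, assuming vqmacc.wv takes a 16b vector and an 8b vector
into a 32b accumulator (the mnemonic, the operand widths, and the
placeholder pointer y are assumptions):

      vsetvli t0, a0, e32,m4
      vle16.v v4, (y)        # 16b operand, EEW=16, occupies v4-v5
      vle8.v v1, (x)         # 8b operand, EEW=8
      vqmacc.wv v8, v4, v1   # hypothetical: 32b += 16b * 8b into v8-v11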

@kasanovic
Collaborator Author

Regarding error checking, an implementation still has to check whether EEW and/or LMUL underflow.
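
For instance, under the LEW definition a widening op at SEW=8 would
imply SEW/2 = 4b sources, which don't exist:

      vsetvli t0, a0, e8,m1
      vwadd.vv v4, v1, v2    # sources would need EEW = 4b: underflow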

@kasanovic
Collaborator Author

LEW would be a pretty disruptive change - and I think I've convinced myself that if we stick with fixed effective-width load/stores in the base ({8,16,32,SEW}), then there's no big advantage to changing the base EW point, and a big disadvantage that the assembler instructions would not read as well (e.g., w is always 2*v now). Either way, we need to add some more widening operations, but if we stick with the current SEW definition, we'll need ones that sign/zero-extend a <SEW type to SEW. As these are unary operations, they can easily fit in the encoding space.
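
A sketch of such a unary extend under the current SEW definition, with
vsext.vf4 as an assumed mnemonic for sign-extending SEW/4 sources to
SEW:

      vsetvli t0, a0, e32,m2
      vle8.v v1, (a1)        # EEW=8 load
      vsext.vf4 v4, v1       # assumed: unary sign-extend SEW/4 -> SEW (8b -> 32b)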

@ZPedro

ZPedro commented Apr 20, 2020

LEW would definitely reduce the number of LMUL switches for DSP applications, where most (if not all: #287) of the single-width operations on narrow-width elements are loads and stores; as a data point, in my YUV-to-RGB implementation, this would reduce the number of vsetvli from 4 to 2 per loop iteration. Given a sufficiently powerful system for loads and stores, like a Memory Element Width, this could even be reduced to 0.

(For the record, I assume loads and stores that don't widen or narrow, at least for my applications, on the assumption that most implementations would prefer to perform LEW/2 * LEW/2 rather than LEW * LEW on inputs that happen to have their top LEW/2 bits at 0 or identical top LEW/2 + 1 bits.)

@David-Horner
Contributor

@kasanovic

> I think for this reason
> [re-positioned below]
> we should stick with fixed vle8, vle16, vle32,
> vle, in base encoding {8,16,32,SEW}. These are more readable and can
> be used to interleave load/stores of larger than SEW values without
> changing vtype (e.g., moving 32b values when SEW=16).
> [re-positioned here]
> For load/store instructions, we could modify these to have relative
> element widths that are fractions of SEW {SEW, SEW/2, SEW/4, SEW/8}.
> Assembly syntax could be something like vle, vlef2, vlef4, vlef8.
> There is a challenge in readability that these are relative to last
> vtype setting,

SEW and LMUL values are essential to correct code execution regardless of load/store width encoding.

They should be assembler directive variables, set automatically by vsetvli (and by vsetvl when its xs2 argument is statically defined).

For a dynamic xs2 with vsetvl, a manual assembler directive should be available.

This should help in various situations, including validating that the SEW/LMUL ratio is maintained by a given vsetvli, and also for load/store syntax:

With this in place, the assembler can translate e.g. e8 to the correct SEW * factor value in load/stores, eliminating the readability concern.
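
A sketch of the intended translation (this is assembler behavior, not
new instructions):

      vsetvli t0, a0, e32,m1   # assembler records SEW=32, LMUL=1
      vle16.v v1, (a1)         # assembler maps 16b to the SEW/2 relative encoding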

> and also when SEW is less than 64, some become useless.

I agree that the base should have as robust an encoding as possible without overcommitting available bits.

Thus I would also want to move 32b values when SEW=16, and in addition:

  • move 64b values when SEW=16
  • move 64b values when SEW=256, 128, or 512
  • and various more combinations
  • and not waste encoding when SEW < 64.

I also believe that loads/stores are so important, so pivotal (e.g., matrix transforms), that flexibility and efficiency are both mandated.

The compressed load/store format seeks to address efficiency.

The encoding needs the flexibility of SEW * factors.

I propose the factors be dependent upon the current SEW value:

For SEW=8, the encoding yields factors of 1, 2, 4, and 8.

For SEW=16, the encoding yields factors of 1/2, 1, 2, and 4.

For SEW of 32 and above, the encoding yields factors of 1/4, 1/2, 1, and 2.

Thus we always support LEW = SEW * 2 operations, and support load/store of SEW/2 and SEW/4 when they exist.
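
Worked through for SEW=16, for example, the factors {1/2, 1, 2, 4} give

  EEW = 8b, 16b, 32b, 64b

so the SEW/2 narrow accesses and the LEW = SEW*2 (and here even SEW*4)
wide accesses are all reachable without a vtype change.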
