This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Shall we have insert instructions? #326

Open
HanKuanChen opened this issue Nov 13, 2019 · 3 comments
Labels
Resolve after v1.0 Does not need to be resolved for v1.0 draft

Comments

@HanKuanChen
Contributor

I noticed that #276 and #318 may require insert instructions.

In addition, if vinsert were supported, the problem below could be solved easily. Proposed semantics:

vinsert.vv vd, vs2, vs1, vm  # vd[vs2[i]] = vs1[i]
vinsert.vx vd, vs2, rs1, vm  # vd[vs2[i]] = x[rs1]
vinsert.vi vd, vs2, uimm, vm # vd[vs2[i]] = uimm
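To make the proposed semantics concrete, here is a small Python model of the three vinsert forms. It is a hypothetical sketch, not part of any RVV specification: the function name `vinsert`, processing elements in ascending order (so a later duplicate index wins), and ignoring out-of-range indices are all assumptions, since the proposal does not specify them.

```python
from typing import List, Optional, Sequence, Union

def vinsert(vd: List[float], vs2: Sequence[int],
            src: Union[Sequence[float], float], vl: int,
            mask: Optional[Sequence[bool]] = None) -> List[float]:
    """Hypothetical model of the proposed vinsert.vv/.vx/.vi.

    Returns a copy of vd with vd[vs2[i]] = src[i] (vector source, .vv) or
    vd[vs2[i]] = src (scalar source, .vx/.vi) for every active i < vl.
    Assumptions: ascending i order, so a later duplicate index overwrites
    an earlier one; indices >= VLMAX are ignored; mask=None means unmasked.
    """
    out = list(vd)
    vlmax = len(vd)
    for i in range(vl):
        if mask is not None and not mask[i]:
            continue  # masked-off element: no write
        idx = vs2[i]
        if idx >= vlmax:
            continue  # assumed behavior for an out-of-range index
        # .vv takes a per-element source; .vx/.vi broadcast one scalar
        out[idx] = src[i] if isinstance(src, (list, tuple)) else src
    return out
```

For example, `vinsert([0, 0, 0, 0], [3, 1], [10, 20], vl=2)` writes 10 to element 3 and 20 to element 1.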
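The problem: each fn is computed one at a time in a scalar loop, and the n results must then be multiplied element-wise by a coefficient vector and stored to des.

Option 1: store each fn to a memory buffer, then load the buffer as a vector.

mv ptr, buffer              # keep buffer as the base; ptr walks the buffer
for(i=0;i!=n;++i)
{
    # compute fn
    fsw fn, 0(ptr)
    addi ptr, ptr, 4
}
# buffer now holds f0, f1, f2, ..., f(n-1)
vlw.v v0, (buffer)
vfmul.vv v0, coefficient, v0
vsw.v v0, (des)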

If memory is slow, the n scalar stores and the vector load will reduce performance.
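A rough Python model of option 1, assuming 32-bit little-endian floats in a byte buffer; `compute` is a hypothetical stand-in for the per-element computation. It keeps a separate pointer for the stores so the vector load reads from the start of the buffer, and it counts memory operations to show where the cost comes from: n scalar stores plus one vector load and one vector store.

```python
import struct
from typing import Callable, List, Tuple

def option1(compute: Callable[[int], float], coeff: List[float],
            n: int) -> Tuple[List[float], int, int]:
    """Model of option 1. Returns (result, scalar_stores, vector_mem_ops).

    The returned list stands in for the vector stored to des.
    """
    mem = bytearray(4 * n)   # stand-in for the buffer in SRAM
    base = 0                 # 'buffer' start address
    ptr = base               # separate pointer advanced by the loop
    scalar_stores = 0
    for i in range(n):
        struct.pack_into("<f", mem, ptr, compute(i))  # fsw fn, 0(ptr)
        ptr += 4                                      # addi ptr, ptr, 4
        scalar_stores += 1
    # vlw.v v0, (buffer): load from the saved base, not the advanced pointer
    v0 = list(struct.unpack_from(f"<{n}f", mem, base))
    v0 = [c * x for c, x in zip(coeff, v0)]           # vfmul.vv
    vector_mem_ops = 2                                # vlw.v + vsw.v
    return v0, scalar_stores, vector_mem_ops
```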

Option 2: shift each value into a vector register with vslide1down.

for(i=0;i!=n;++i)
{
    # compute fn
    fmv.x.w xn, fn
    vslide1down.vx v0, v0, xn
}
vfmul.vv v0, coefficient, v0
vsw.v v0, (des)

If vl is very large, vslide1down.vx will reduce performance, because every slide rewrites all vl active elements just to add one new value.
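The cost argument can be made concrete with a simplified Python model of vslide1down.vx. Tail elements past vl are left untouched, and "cost" is counted as element writes; that is an assumption about how work scales, not a measured figure. Filling an n-element vector with vl = n takes n slides of n elements each, so the write count grows as n².

```python
from typing import List, Tuple

def vslide1down(vs2: List[float], x: float, vl: int) -> List[float]:
    """vd[i] = vs2[i+1] for i < vl-1 and vd[vl-1] = x.

    Simplified: elements past vl are left unchanged, no masking.
    """
    vd = list(vs2)
    for i in range(vl - 1):
        vd[i] = vs2[i + 1]
    vd[vl - 1] = x
    return vd

def option2(values: List[float], vl: int) -> Tuple[List[float], int]:
    """Push each value in with vslide1down; returns (register, element writes)."""
    v0 = [0.0] * vl
    writes = 0
    for x in values:
        v0 = vslide1down(v0, x, vl)
        writes += vl  # every slide rewrites all vl active elements
    return v0, writes
```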

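Option 3: write each value directly with vinsert, using vl = 1.

li one, 1
for(i=0;i!=n;++i)
{
    # compute fn
    fmv.x.w xn, fn
    vmv.s.x vindex, i              # vindex[0] = i
    vsetvli x0, one, e32, m8       # vl = 1
    vinsert.vx v0, vindex, xn      # v0[i] = xn
    # restore vl and vtype
}
vfmul.vv v0, coefficient, v0
vsw.v v0, (des)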

This solves both problems: no memory transfer is required, and each vinsert executes with vl = 1.
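Under a simplified cost model that counts only element writes, the vinsert approach writes exactly one element per value, compared with option 2, where each slide touches all vl elements. The sketch is hypothetical: `vinsert_vx_vl1` models the proposed vinsert.vx with vl = 1, and the model ignores the cost of changing vl with vsetvli in every iteration, which the thread does not quantify.

```python
from typing import List, Tuple

def vinsert_vx_vl1(vd: List[float], index: int, x: float) -> List[float]:
    """Proposed vinsert.vx with vl = 1: only element 0 of the index vector is
    active, so exactly one destination element vd[index] is written."""
    out = list(vd)
    out[index] = x
    return out

def option3(values: List[float], vlmax: int) -> Tuple[List[float], int]:
    """Insert values[i] at element i; returns (register, element writes)."""
    v0 = [0.0] * vlmax
    writes = 0
    for i, x in enumerate(values):
        # vmv.s.x vindex, i ; vsetvli x0, one, ... ; vinsert.vx v0, vindex, xn
        v0 = vinsert_vx_vl1(v0, i, x)
        writes += 1
    return v0, writes
```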

@aswaterman
Collaborator

We have considered this instruction. This is what we wrote in 9a4d92e:

NOTE: The complementary vins.v.x instruction, which allows a write
to any element in a vector register, has been removed. This
instruction would be the only instruction (apart from vsetvl) that
requires two integer source operands, and also would be slow to
execute in an implementation with vector register renaming, relegating
its main use to debugger modifications to state. The alternative and
more generally useful vslide1up and vslide1down instructions can
be used to update vector register state in place over a debug link
without accessing memory.

For your example, option 1 is the best answer. If fsw -> vlw performance is poor, you're fucked anyway :)

@HanKuanChen
Contributor Author

Why?
Suppose the buffer is located in DRAM because the system has very limited SRAM. Loads and stores to DRAM then cause a performance problem.
Options 2 and 3, however, only perform register operations with no DRAM access, so their speed depends purely on the CPU frequency (which is faster than memory access). Isn't that right?

@kasanovic
Collaborator

You don't put the buffer in DRAM. If you can't spare one vector register's worth of SRAM space, then you're in a strangely configured system (big vector registers but tiny SRAM). Also note that using a memory buffer allows you to unroll and remove index updates (e.g., fsw f0, (a0); fsw f1, 4(a0); fsw f2, 8(a0),...).
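Two small calculations behind this reply, as a Python sketch: the SRAM needed for the buffer is one vector register's worth, VLEN/8 bytes (times LMUL when a register group such as m8 is used), and the unrolled form stores f0, f1, f2, ... at fixed offsets from a single base register with no pointer or index updates. The function names and the 32-register limit check are illustrative assumptions; the VLEN value used below is an example, not from the thread.

```python
from typing import List

def buffer_bytes(vlen_bits: int, lmul: int = 1) -> int:
    """SRAM needed to hold one vector register (group): VLEN/8 * LMUL bytes."""
    return vlen_bits // 8 * lmul

def unrolled_stores(n: int, base: str = "a0") -> List[str]:
    """Emit fsw f0..f(n-1) at offsets 0, 4, 8, ... from one base register."""
    if n > 32:
        raise ValueError("only 32 FP registers f0..f31 can be named")
    return [f"fsw f{k}, {4 * k}({base})" for k in range(n)]
```

For example, with VLEN = 512 and LMUL = 8 (as in option 3's e32, m8), `buffer_bytes(512, 8)` gives 512 bytes.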

kasanovic added the Resolve after v1.0 (Does not need to be resolved for v1.0 draft) label on Jun 28, 2020

3 participants