Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

further question about instruction atomicity vs. faults #7

Closed
grayresearch opened this issue Mar 13, 2024 · 21 comments · Fixed by #8
Closed

further question about instruction atomicity vs. faults #7

grayresearch opened this issue Mar 13, 2024 · 21 comments · Fixed by #8

Comments

@grayresearch
Copy link

This issue clarifies my tech-chairs question about permissible behaviors of Zilsd WRT faults.

The tech-chairs discussion referred to the priv spec. Here, I believe, is the pertinent language: "On some implementations, misaligned loads, stores, and instruction fetches may also be decomposed into multiple accesses, some of which may succeed before an access-fault exception occurs." Also: "If mtval is written with a nonzero value when a misaligned load or store causes an access-fault or page-fault exception, then mtval will contain the virtual address of the portion of the access that caused the fault."

  1. For the Zilsd's spec "crack these instructions" and Fault Handling sections, perhaps the spec should explicitly require atomicity of the destination register pair writes. For example given ld x14,0(x14), the spec must be clear that the x14,x15 pair are updated atomically: it is never possible to observe any partial register updates of the ld destinations. Here if x14 addresses 64b that spans pages, and the second page is inaccessible, the exception handler must not observe that x14 was updated by the first 32b load -- this could preclude correct reissue of the faulting instruction. Similarly for ld x14,0(x15).

  2. Should the spec require any particular order of accesses and hence of faults -- or explicitly note that any order may occur? For example, for ld x4,0(x2) if the 64b spans two pages which are both inaccessible, should the spec constrain which of the two fault addresses occur first? Similarly, when the 64b address is misaligned and the implementation incurs an address-misaligned trap.

My take: Zilsd ld should follow what "ordinary ld" does, which is issue multiple accesses, and per the priv spec text above, order of accesses / observable order of traps is unspecified.

@tovine
Copy link
Collaborator

tovine commented Mar 13, 2024

Hi, my original spec was a bit more verbose on this topic. Do you think we should bring in more of the text from there to cover it better?

And you raise an interesting point about loads being atomic when it comes to updating the destination register, this definitely sounds like something we should require in order to make error handling and retry possible. Not sure how much this would complicate implementations, but I guess that's a challenge that those who support misaligned accesses without trapping would manage without big issues.

Initially I mainly thought about the case of aligned accesses and in that case you wouldn't be crossing page boundaries so it should either fail or not, but of course that's a rather tight restriction to enforce at the ISA level.

@christian-herber-nxp
Copy link
Collaborator

the current text on this topic is really based on the text included for Zcmp, see https://github.com/riscvarchive/riscv-code-size-reduction/blob/main/Zc-specification/pushpop.adoc#pushpop-fault-handling. Push/pop requires e.g. the entire pop sequence to be re-executed. Thus, implementations do not need to be constrained to updating all registers at the same time.

The text from the priv spec should hold also for this extension, but it technically does not cover it (and does not cover Zcmp).
Would be interesting to understand if this came up during Zcmp, and how mtval should be written on errors occuring during the sequence. @tariqkurd-repo could you give your input?

@grayresearch
Copy link
Author

grayresearch commented Mar 14, 2024

If I understand correctly, c.pop is idempotent, and so may be safely reissued over and over, because the only value that is read and written is sp, and sp is only written once, as the last action of the instruction, after all the loads complete.

Whereas with Zilsd ld x14,0(x14) and ld x14,0(x15), since x14 or x15 are both source and destination of a sequence of unordered operations, the instruction is not idempotent under multiple load faults unless the dest reg pair write is atomic.

Interrupts

I note c.pop is also interruptable such that the interrrupt handler may observe partial completion. Again, no problem because c.pop is idempotent.

The present Zilsd ld spec reads: "It is implementation defined whether interrupts can also be taken during the sequence execution." As with page and alignment faults, it is a problem if you can interrupt a Zilsd-ld such that the interrupted instruction may not be correctly reissued.

@grayresearch
Copy link
Author

One simple approach to preserve idempotency of ld, as well as partial sequence completion observability in traps and interrupts, that does not require atomic destination register pair writes, would be to constrain ld semantics so that the write to the rs1 source register, if any, must be the second of the two writes.

So ld x14,0(x14) would expand to lw x15,4(x14); lw x14,0(x14) whereas ld x14,0(x15) would expand to lw x14,0(x15); lw x15,4(x15).

It is not clear that the irregularity this introduces into an implementation is better or worse than the extra microarch complexity of deferring the first write-back until both loads complete.

@tariqkurd-repo
Copy link
Collaborator

@grayresearch is right in my opinion - the sequence must be correctly restartable after an interrupt and the possiblity of overwriting the address register with a partially executed load is quite annoying to say the least.

My take: Zilsd ld should follow what "ordinary ld" does, which is issue multiple accesses, and per the priv spec text above, order of accesses / observable order of traps is unspecified

I agree with this, and I think that it should also follow the normal load semantics of updating all or none of the destination register(s). For a store, a misaligned store is allowed to do a partial update if it crosses VM pages. I think that should also be applicable here, so that the first store could be committed if the second store fails due to different permissions on VM pages.

@ubc-guy
Copy link
Collaborator

ubc-guy commented Mar 14, 2024

@grayresearch I'm not sure changing the order of expansion is the "simple" approach.

I think it is far easier to forbid these types of instructions and either call this encoding reserved or throw an illegal instruction trap. This would, for example, forbid certain pointer-chasing tight loops which load-pair a "next" and "data" struct. However, this can be solved by unrolling the loop twice and using two different ld instructions for each unrolled loop body.

Forbidding certain combinations where regs are both source & dest is easy to detect by hardware. I'm pretty sure certain other instructions have similar restrictions when the rules get wonky, but I can't remember any off hand (too early, not enough coffee).

@ubc-guy
Copy link
Collaborator

ubc-guy commented Mar 14, 2024

from https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vec-operands

"A destination vector register group can overlap a source vector register group only if one of the following holds:
[...]
Any instruction encoding that violates the overlap constraints is reserved."

Also, from the same section, the following is an even simpler restriction (something I asked about during the meeting, and should be covered by the spec, is ensuring only the even register number is used as rd but reserve the odd encodings):

"Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding."

@tariqkurd-repo
Copy link
Collaborator

yes - forbidding the combination is sensible. Would you allow partial update though @ubc-guy ?

@ubc-guy
Copy link
Collaborator

ubc-guy commented Mar 14, 2024

what does partial update provide? how hard is it to implement? seems unlikely to be useful and hard to implement to me.

btw, can you simply encode a 32b LD instruction as two 16b C instructions? if so, then why use the 32b encoding space at all? specifying it in this way allows complete restart capability and avoids the ordering problem. however, you lose any guarantees you want to add (eg: atomic bus transaction, atomic update of both registers). you should be clear about the guarantees you want, because if you don't want them then I would just use the dual C instruction encoding because it is more flexible (the order of update is explicit and resuming after an exception is already clearly defined). I would be tempted to leave them out of the spec for this reason.

the 16b C.LD instructions are a different beast entirely and should remain in the spec, but then you still need to solve the problems identified by this github issue.

@tovine
Copy link
Collaborator

tovine commented Mar 14, 2024

Good point about using a pair of C.LW instead of an uncompressed LD.
I would say there are two potential reasons to keep 32-bit LD:

  • Atomic 64-bit load
  • Performance - by being able to load 64 bits in a single instruction you could theoretically move twice as much data per cycle compared to issuing two C.LW (given that your bus supports it, otherwise the point is kinda moot).

Given these points; does it even make sense to allow misaligned LD at all?
I mean, if you want (or need) to do a misaligned 64-bit load (into two adjacent registers) then you obviously don't need it for the performance reason (assuming that a non-aligned access won't be as fast as an aligned one), and if you don't need it for performance then you might be better off with two (compressed) LWs?

PS: Another point of keeping the 32-bit version around is the increased flexibility you get with the full register set and 12-bit immediate for a greater addressing range.

Not sure how much controversy it would cause, but if we decide that misaligned LD/SD is irrelevant then we could potentially repurpose the lower 3 immediate bits to increase the immediate range by a factor of 8 (although I can see this causing a bit of friction, as it would introduce another immediate format, so I won't die on this hill).
Alternatively we can just say that the encodings where bit 0 of each register field or imm[2:0] are non-zero are reserved for future use.

@christian-herber-nxp
Copy link
Collaborator

Also, from the same section, the following is an even simpler restriction (something I asked about during the meeting, and should be covered by the spec, is ensuring only the even register number is used as rd but reserve the odd encodings):

the specification already states

Use of misaligned (odd-numbered) registers for these operands is reserved

@christian-herber-nxp
Copy link
Collaborator

Not sure how much controversy it would cause, but if we decide that misaligned LD/SD is irrelevant then we could potentially repurpose the lower 3 immediate bits to increase the immediate range by a factor of 8 (although I can see this causing a bit of friction, as it would introduce another immediate format, so I won't die on this hill).
Alternatively we can just say that the encodings where bit 0 of each register field or imm[2:0] are non-zero are reserved for future use.

Misaligned load/store is generally a thing in the ISA, and I do not see a reason why this wouldn't apply here. Misaligned data structures are generally not something that leads to good performance, so likely this is something you are doing for reducing memory footprint.

@christian-herber-nxp
Copy link
Collaborator

I agree with this, and I think that it should also follow the normal load semantics of updating all or none of the destination register(s).

Totally agree. The current text does specify that

In implementations that crack Zilsd instructions for sequential execution, correct execution requires addressing idempotent memory, because the core must be able to handle traps detected during the sequence. The entire sequence is re-executed after returning from the trap handler, and multiple traps are possible during the sequence.

The requirement to be able to handle traps and to re-execute the sequence somewhat implies, that rd and rd+1 need to be updated atomically.
To make this explicit, I would again follow Zcmp, and add the section on the software view. This would read roughly like this:

From a software perspective the load/store pair instructions appears as:

  • load instructions:
    • A sequence of one or more loads reading the bytes of the double word without updating rd or rd+1
      • The bytes may be loaded in any order.
      • The bytes may be grouped into larger accesses.
      • Any of the bytes may be loaded multiple times.
    • An atomic write of the load result into rd and rd+1
  • store instructions:
    • A sequence of one or more stores writing the bytes of the double word
      • The bytes may be stored in any order.
      • The bytes may be grouped into larger accesses.
      • Any of the bytes may be stored multiple times.

I believe this would give all the flexibility and clarify any issues around misaligned load/stores and devices not having a 64b memory interface.

@christian-herber-nxp
Copy link
Collaborator

As this seems to have settled, I have created a PR. I cannot add you people as reviewers, so please hop on it yourself and leave a review. I have also raised the request to enable me to assign anyone here as reviewers.

@tovine
Copy link
Collaborator

tovine commented Mar 18, 2024

Misaligned load/store is generally a thing in the ISA, and I do not see a reason why this wouldn't apply here. Misaligned data structures are generally not something that leads to good performance, so likely this is something you are doing for reducing memory footprint.

Yes, and in this case you'd probably use the compressed variants anyway (as otherwise the same code size reduction could be achieved using a pair of c.lw/sw).
So in theory we could go a bit further and either reserve any non-zero values in the lower 3 immediate bits of the 32-bit versions (to save encoding space), or repurpose them for 8 times larger range (which would probably be a bit more controversial because it would essentially mean another encoding format).

But as long as it is legal to raise a misaligned access fault when the address isn't 64-bit aligned then it's fine with me.
Now that I think about it; maybe we should add an explicit note about this (since it's a greater-than-XLEN access) - from previous experience, even if something is seemingly obvious when writing the spec it can still be misinterpreted if there's any room for interpretation. :)

@christian-herber-nxp
Copy link
Collaborator

But as long as it is legal to raise a misaligned access fault when the address isn't 64-bit aligned then it's fine with me. Now that I think about it; maybe we should add an explicit note about this (since it's a greater-than-XLEN access) - from previous experience, even if something is seemingly obvious when writing the spec it can still be misinterpreted if there's any room for interpretation. :)

I think this is a separate issue from the current one. If you wish to discuss it, please open a separate issue (this issue is closed when the PR is merged).

Wider then XLEN load/stores are already a thing, e.g. in RV32ID. The privileged specification does acknowledge that wider than XLEN load/stores might be broken down and thus seen as multiple load/stores.

If a load/store is considered aligned or not is not dependent on XLEN, but just a property of the access size and the address.

@christian-herber-nxp
Copy link
Collaborator

Yes, and in this case you'd probably use the compressed variants anyway (as otherwise the same code size reduction could be achieved using a pair of c.lw/sw).
So in theory we could go a bit further and either reserve any non-zero values in the lower 3 immediate bits of the 32-bit versions (to save encoding space), or repurpose them for 8 times larger range (which would probably be a bit more controversial because it would essentially mean another encoding format).

The misaligned 32b encodings definitely have their place. 32b encodings of ld/sd have a performance advantage if you have a 64-bit memory port. They might be used in structures or so, where the fields you are targeting might not be aligned.
We can argue if this is the best use of the bits. But as you note, deviating from the existing definition would be controversial. With the same argument, you could also argue to change this for the RV64 ld/sd.

@christian-herber-nxp
Copy link
Collaborator

reponening to address the comment raised by @ubc-guy in the PR:

with the currently written up description, ld and sd instructions may appear as a load or store of byte operations in any order and may be repeated; further, the load performs a single atomic update to the register pair.

I have a few concerns with the way this is written:

a) atomic updates to the register pair require additional resources, particularly a dword size buffer which first accepts the byte and then commits; to avoid increasing latency by 1 cycle, the entire dword would also have to be bypassed with muxes. in other words, providing this simple software view requires an expensive hardware solution.

The implementation is simple if you have a 64b memory port (which is not too uncommon). This is also the only scenario in which a performance improvement would be expected. Without it, this extension will likely only provide code size improvements (maybe power could be something). It is an optional extension, so if the tradeoff is not working out, it does not have to be implemented.

b) this atomic update text seems to contradict an earlier note within the current spec:

NOTE: Therefore, implementations are not required to ensure atomicity in loading/storing to/from the individual registers relating to a 64-bit operand.

good catch. I will have to update that.

c) the cm.pop instruction does not require atomic updates of their multi-register destinations

Because that is repeatable, and no source registers are overwritten. For ld, it is possible to overwrite the source register contents if interrupted halfway, thus making it impossible to repeat the sequence.

d) do lh/lw/sh/sw instructions behave similarly, ie breaking down into bytes? if so, there are no further concerns here. however, the only similar description I could find regarding a breakdown into bytes is cm.pop and cm.push so we'll continue on to (e) below.

There is only the need to break something down if it is misaligned or wider than the memory port.
This is e.g. also the case for RV32ID, where double precision loads can be broken down.

e) from the Unpriv Manual, naturally aligned operations (eg halfword aligned for halfword operations) behave atomically but unaligned ones do not. the precise text from Unpriv is below:

Furthermore, whereas naturally aligned loads and stores are guaranteed to execute atomically, misaligned loads and
stores might not, and hence require additional synchronization to ensure atomicity.

NOTE: We do not mandate atomicity for misaligned accesses so execution environment
implementations can use an invisible machine trap and a software handler to handle
some or all misaligned accesses. If hardware misaligned support is provided, software can
exploit this by simply using regular load and store instructions. Hardware can then
automatically optimize accesses depending on whether runtime addresses are aligned.

f) the UnPriv Manual seems to contradict itself between lh/sw/sh/sw and cm.push/cm.pop, which is not good.

g) i think the correct response is a compromise, where individual words (not bytes) are updated atomically by cm.push/cm.pop; the solution here, for ld/sd in rv32, would then to also have individual words be updated atomically (but not the pair).

As indicated above, it is functionally necessary for the pair to be updated atomically. Otherwise, interrupts might not be recoverable.

@ubc-guy
Copy link
Collaborator

ubc-guy commented Mar 21, 2024

The implementation is simple if you have a 64b memory port (which is not too uncommon). This is also the only scenario in which a performance improvement would be expected. Without it, this extension will likely only provide code size improvements (maybe power could be something). It is an optional extension, so if the tradeoff is not working out, it does not have to be implemented.

System implementers will typically buy/license/use the core most readily available to them, which may have this extension. Yet, they have control over the memory bus and what is connected. They also don't write the software, which may use the new ld/sd instructions regardless of what is attached to the memory bus.

Narrow-width SDRAM, SRAM, or Flash is not uncommon, whether on-chip or not. Plus, software may be using ld/sd to access memory-mapped IO devices more efficiently.

Moreover, the cache-register file interface is typically 32b on an RV32 system. Writing back 64b with ld would require significant rework to the regfile (twice the wb bandwidth) and the cache interface. Otherwise, it will be done across 2 cycles anyways -- yes the hardware will make this part appear atomic from an instruction execution perspective, but it's not always using a 64b memory interface as you might expect.

Finally, unaligned ld will always take multiple cycles. This will actually be quite common (dual 32b words being aligned to a 32b boundary have a 50% chance of not being aligned on the optimized 64b boundary between cache and register file).

The cost to provide an atomic update for all of these cases is to add registers to buffer enough bits until the hardware can guarantee the atomic update to the register file.

If an atomic update is not required, a simple implementation could do multiple writes to the register file. In fact, if the register file has only a single 32b write port, it would always take 2 writes anyways, so the additional 64b register which provides the illusion of an atomic update is quite wasteful.

c) the cm.pop instruction does not require atomic updates of their multi-register destinations
Because that is repeatable, and no source registers are overwritten. For ld, it is possible to overwrite the source register contents if interrupted halfway, thus making it impossible to repeat the sequence.

earlier, I had suggested solving this the simple restriction that ld src and dst be non-overlapping. this same restriction is present in other instructions/extensions.

As indicated above, it is functionally necessary for the pair to be updated atomically. Otherwise, interrupts might not be recoverable.

the ld instruction is interruptable/resumable on idempotent memory when overlapping src/dst registers are specified. the idempotent restriction and the non-overlapping restriction have been used in other situations throughout RISC-V.

@christian-herber-nxp
Copy link
Collaborator

earlier, I had suggested solving this the simple restriction that ld src and dst be non-overlapping. this same restriction is present in other instructions/extensions.

Could you give some examples? That would help.
Consistency with the RV64 ld instructions seems important to me.
When it gets to alignment, this can be a SW choice. E.g. if you have a DSP or AI application, using Zilsd and P to do the number crunching, the data better be 64b aligned.

@christian-herber-nxp
Copy link
Collaborator

closing as the current solution does seem to have consensus

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants