-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
further question about instruction atomicity vs. faults #7
Comments
Hi, my original spec was a bit more verbose on this topic. Do you think we should bring in more of the text from there to cover it better? And you raise an interesting point about loads being atomic when it comes to updating the destination register, this definitely sounds like something we should require in order to make error handling and retry possible. Not sure how much this would complicate implementations, but I guess that's a challenge that those who support misaligned accesses without trapping would manage without big issues. Initially I mainly thought about the case of aligned accesses and in that case you wouldn't be crossing page boundaries so it should either fail or not, but of course that's a rather tight restriction to enforce at the ISA level. |
the current text on this topic is really based on the text included for Zcmp, see https://github.com/riscvarchive/riscv-code-size-reduction/blob/main/Zc-specification/pushpop.adoc#pushpop-fault-handling. Push/pop requires e.g. the entire pop sequence to be re-executed. Thus, implementations do not need to be constrained to updating all registers at the same time. The text from the priv spec should hold also for this extension, but it technically does not cover it (and does not cover Zcmp). |
If I understand correctly, c.pop is idempotent, and so may be safely reissued over and over, because the only value that is read and written is sp, and sp is only written once, as the last action of the instruction, after all the loads complete. Whereas with Zilsd InterruptsI note c.pop is also interruptable such that the interrrupt handler may observe partial completion. Again, no problem because c.pop is idempotent. The present Zilsd ld spec reads: "It is implementation defined whether interrupts can also be taken during the sequence execution." As with page and alignment faults, it is a problem if you can interrupt a Zilsd-ld such that the interrupted instruction may not be correctly reissued. |
One simple approach to preserve idempotency of So It is not clear that the irregularity this introduces into an implementation is better or worse than the extra microarch complexity of deferring the first write-back until both loads complete. |
@grayresearch is right in my opinion - the sequence must be correctly restartable after an interrupt and the possiblity of overwriting the address register with a partially executed load is quite annoying to say the least.
I agree with this, and I think that it should also follow the normal load semantics of updating all or none of the destination register(s). For a store, a misaligned store is allowed to do a partial update if it crosses VM pages. I think that should also be applicable here, so that the first store could be committed if the second store fails due to different permissions on VM pages. |
@grayresearch I'm not sure changing the order of expansion is the "simple" approach. I think it is far easier to forbid these types of instructions and either call this encoding reserved or throw an illegal instruction trap. This would, for example, forbid certain pointer-chasing tight loops which load-pair a "next" and "data" struct. However, this can be solved by unrolling the loop twice and using two different ld instructions for each unrolled loop body. Forbidding certain combinations where regs are both source & dest is easy to detect by hardware. I'm pretty sure certain other instructions have similar restrictions when the rules get wonky, but I can't remember any off hand (too early, not enough coffee). |
from https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vec-operands "A destination vector register group can overlap a source vector register group only if one of the following holds: Also, from the same section, the following is an even simpler restriction (something I asked about during the meeting, and should be covered by the spec, is ensuring only the even register number is used as rd but reserve the odd encodings): "Vector operands or results may occupy one or more vector registers depending on EMUL, but are always specified using the lowest-numbered vector register in the group. Using other than the lowest-numbered vector register to specify a vector register group is a reserved encoding." |
yes - forbidding the combination is sensible. Would you allow partial update though @ubc-guy ? |
what does partial update provide? how hard is it to implement? seems unlikely to be useful and hard to implement to me. btw, can you simply encode a 32b LD instruction as two 16b C instructions? if so, then why use the 32b encoding space at all? specifying it in this way allows complete restart capability and avoids the ordering problem. however, you lose any guarantees you want to add (eg: atomic bus transaction, atomic update of both registers). you should be clear about the guarantees you want, because if you don't want them then I would just use the dual C instruction encoding because it is more flexible (the order of update is explicit and resuming after an exception is already clearly defined). I would be tempted to leave them out of the spec for this reason. the 16b C.LD instructions are a different beast entirely and should remain in the spec, but then you still need to solve the problems identified by this github issue. |
Good point about using a pair of C.LW instead of an uncompressed LD.
Given these points; does it even make sense to allow misaligned PS: Another point of keeping the 32-bit version around is the increased flexibility you get with the full register set and 12-bit immediate for a greater addressing range. Not sure how much controversy it would cause, but if we decide that misaligned LD/SD is irrelevant then we could potentially repurpose the lower 3 immediate bits to increase the immediate range by a factor of 8 (although I can see this causing a bit of friction, as it would introduce another immediate format, so I won't die on this hill). |
the specification already states
|
Misaligned load/store is generally a thing in the ISA, and I do not see a reason why this wouldn't apply here. Misaligned data structures are generally not something that leads to good performance, so likely this is something you are doing for reducing memory footprint. |
Totally agree. The current text does specify that
The requirement to be able to handle traps and to re-execute the sequence somewhat implies, that From a software perspective the load/store pair instructions appears as:
I believe this would give all the flexibility and clarify any issues around misaligned load/stores and devices not having a 64b memory interface. |
As this seems to have settled, I have created a PR. I cannot add you people as reviewers, so please hop on it yourself and leave a review. I have also raised the request to enable me to assign anyone here as reviewers. |
Yes, and in this case you'd probably use the compressed variants anyway (as otherwise the same code size reduction could be achieved using a pair of But as long as it is legal to raise a misaligned access fault when the address isn't 64-bit aligned then it's fine with me. |
I think this is a separate issue from the current one. If you wish to discuss it, please open a separate issue (this issue is closed when the PR is merged). Wider then XLEN load/stores are already a thing, e.g. in RV32ID. The privileged specification does acknowledge that wider than XLEN load/stores might be broken down and thus seen as multiple load/stores. If a load/store is considered aligned or not is not dependent on XLEN, but just a property of the access size and the address. |
The misaligned 32b encodings definitely have their place. 32b encodings of ld/sd have a performance advantage if you have a 64-bit memory port. They might be used in structures or so, where the fields you are targeting might not be aligned. |
reponening to address the comment raised by @ubc-guy in the PR:
The implementation is simple if you have a 64b memory port (which is not too uncommon). This is also the only scenario in which a performance improvement would be expected. Without it, this extension will likely only provide code size improvements (maybe power could be something). It is an optional extension, so if the tradeoff is not working out, it does not have to be implemented.
good catch. I will have to update that.
Because that is repeatable, and no source registers are overwritten. For ld, it is possible to overwrite the source register contents if interrupted halfway, thus making it impossible to repeat the sequence.
There is only the need to break something down if it is misaligned or wider than the memory port.
As indicated above, it is functionally necessary for the pair to be updated atomically. Otherwise, interrupts might not be recoverable. |
System implementers will typically buy/license/use the core most readily available to them, which may have this extension. Yet, they have control over the memory bus and what is connected. They also don't write the software, which may use the new ld/sd instructions regardless of what is attached to the memory bus. Narrow-width SDRAM, SRAM, or Flash is not uncommon, whether on-chip or not. Plus, software may be using ld/sd to access memory-mapped IO devices more efficiently. Moreover, the cache-register file interface is typically 32b on an RV32 system. Writing back 64b with ld would require significant rework to the regfile (twice the wb bandwidth) and the cache interface. Otherwise, it will be done across 2 cycles anyways -- yes the hardware will make this part appear atomic from an instruction execution perspective, but it's not always using a 64b memory interface as you might expect. Finally, unaligned ld will always take multiple cycles. This will actually be quite common (dual 32b words being aligned to a 32b boundary have a 50% chance of not being aligned on the optimized 64b boundary between cache and register file). The cost to provide an atomic update for all of these cases is to add registers to buffer enough bits until the hardware can guarantee the atomic update to the register file. If an atomic update is not required, a simple implementation could do multiple writes to the register file. In fact, if the register file has only a single 32b write port, it would always take 2 writes anyways, so the additional 64b register which provides the illusion of an atomic update is quite wasteful.
earlier, I had suggested solving this the simple restriction that ld src and dst be non-overlapping. this same restriction is present in other instructions/extensions.
the ld instruction is interruptable/resumable on idempotent memory when overlapping src/dst registers are specified. the idempotent restriction and the non-overlapping restriction have been used in other situations throughout RISC-V. |
Could you give some examples? That would help. |
closing as the current solution does seem to have consensus |
This issue clarifies my tech-chairs question about permissible behaviors of Zilsd WRT faults.
The tech-chairs discussion referred to the priv spec. Here, I believe, is the pertinent language: "On some implementations, misaligned loads, stores, and instruction fetches may also be decomposed into multiple accesses, some of which may succeed before an access-fault exception occurs." Also: "If mtval is written with a nonzero value when a misaligned load or store causes an access-fault or page-fault exception, then mtval will contain the virtual address of the portion of the access that caused the fault."
For the Zilsd's spec "crack these instructions" and Fault Handling sections, perhaps the spec should explicitly require atomicity of the destination register pair writes. For example given
ld x14,0(x14)
, the spec must be clear that the x14,x15 pair are updated atomically: it is never possible to observe any partial register updates of the ld destinations. Here if x14 addresses 64b that spans pages, and the second page is inaccessible, the exception handler must not observe that x14 was updated by the first 32b load -- this could preclude correct reissue of the faulting instruction. Similarly forld x14,0(x15)
.Should the spec require any particular order of accesses and hence of faults -- or explicitly note that any order may occur? For example, for
ld x4,0(x2)
if the 64b spans two pages which are both inaccessible, should the spec constrain which of the two fault addresses occur first? Similarly, when the 64b address is misaligned and the implementation incurs an address-misaligned trap.My take: Zilsd ld should follow what "ordinary ld" does, which is issue multiple accesses, and per the priv spec text above, order of accesses / observable order of traps is unspecified.
The text was updated successfully, but these errors were encountered: