Skip to content

Commit

Permalink
[Grammar] 6-3 TMA-AMD.md
Browse files Browse the repository at this point in the history
  • Loading branch information
dendibakh authored Aug 9, 2024
1 parent 891d26f commit 20280e8
Showing 1 changed file with 2 additions and 2 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ A description of pipeline utilization metrics shown above can be found in [@AMDU

Advanced cryptography instructions are not trivial, so internally they are broken into smaller pieces ($\mu$ops). Once a processor encounters such an instruction, it retrieves $\mu$ops for it from the microcode. Microoperations are fetched from the microcode sequencer with a lower bandwidth than from regular instruction decoders, making it a potential source of performance bottlenecks. Crypto++ SHA256 implementation heavily uses instructions such as `SHA256MSG2`, `SHA256RNDS2`, and others which consist of multiple $\mu$ops according to [uops.info](https://uops.info/table.html)[^2] website.

The `retiring_microcode` metric indicates that 6.1% of dispatch slots were used by microcode operations that eventually retired. When comparing with its sibling metric `retiring_fastpath`, we can say that roughly every 4th instruction was a microcode operation. If we now look at the `frontend_bound_bandwidth` metric, we will see that 6.1% of dispatch slots were unused due to bandwidth bottleneck in the CPU frontend. This suggest that 6.1% of dispatch slots were wasted because the microcode sequencer has not been providing $\mu$ops while the backend could have consumed them. In this example, the `retiring_microcode` and `frontend_bound_bandwidth` metrics are tightly connected, however, the fact that they are equal is merely a coincidence.
The `retiring_microcode` metric indicates that 6.1% of dispatch slots were used by microcode operations that eventually retired. When comparing with its sibling metric `retiring_fastpath`, we can say that roughly every 4th instruction was a microcode operation. If we now look at the `frontend_bound_bandwidth` metric, we will see that 6.1% of dispatch slots were unused due to bandwidth bottleneck in the CPU frontend. This suggests that 6.1% of dispatch slots were wasted because the microcode sequencer has not been providing $\mu$ops while the backend could have consumed them. In this example, the `retiring_microcode` and `frontend_bound_bandwidth` metrics are tightly connected, however, the fact that they are equal is merely a coincidence.

The majority of cycles are stalled in the CPU backend (`backend_bound`), but only 1.7% of cycles are stalled waiting for memory accesses (`backend_bound_memory`). So, we know that the benchmark is mostly limited by the computing capabilities of the machine. As you will know from Part 2 of this book, it could be related to either data flow dependencies or execution throughput of certain cryptographic operations. They are less frequent than traditional `ADD`, `SUB`, `CMP`, and other instructions and thus can be often executed only on a single execution unit. A large number of such operations may saturate the execution throughput of this particular unit. Further analysis should involve a closer look at the source code and generated assembly, checking execution port utilization, finding data dependencies, etc.

Expand All @@ -39,4 +39,4 @@ In summary, Crypto++ implementation of SHA-256 on AMD Ryzen 9 7950X utilizes onl
When it comes to Windows, at the time of writing, TMA methodology is only supported on server platforms (codename Genoa), and not on client systems (codename Raphael). TMA support was added in AMD uProf version 4.1, but only in the command line tool `AMDuProfPcm` tool which is part of AMD uProf installation. You can consult [@AMDUprofManual, Chapter 2.8 Pipeline Utilization] for more details on how to run the analysis. The graphical version of AMD uProf doesn't have the TMA analysis yet.

[^1]: Crypto++ - [https://github.com/weidai11/cryptopp](https://github.com/weidai11/cryptopp)
[^2]: uops.info - [https://uops.info/table.html](https://uops.info/table.html)
[^2]: uops.info - [https://uops.info/table.html](https://uops.info/table.html)

0 comments on commit 20280e8

Please sign in to comment.