Support rounding modes for floating point math #529

Open
elliottslaughter opened this issue Apr 5, 2022 · 1 comment
Comments

@elliottslaughter
Member

Rounding modes are the last performance regression in the NVPTX codegen in a set of applications I'm looking at. See: StanfordLegion/legion#1041 (comment)

It appears that NVVM generates instructions like sub.f32 (no rounding mode) while NVPTX generates sub.rn.f32 (round-to-nearest). The latter matches what LLVM says its default is, so it is probably the more "correct" of the two. But we're losing about 20% performance because of this, so we need a way to fix it.

The constraints from LLVM appear to be as follows:

If you want to use anything other than the default, you have to mark the entire function as strictfp. You must be very careful not to mix strictfp and non-strictfp code. You can still achieve the default behavior in strictfp, but you have to do so by explicitly calling every floating point instruction with round-to-nearest mode. You also can't use LLVM's default floating point instructions, but have to use some special intrinsics instead. Basically, you're going to end up ripping up your entire floating point code generation, which seems like a pain.

Codegen is painful, but there is also the question of what the interface should be. We could do this with macros, but that seems like it will quickly get obnoxious (terralib.fadd(..., "round-to-nearest")). It might be better to set the rounding mode function-wide (fn:setroundingmode("round-to-nearest")). Then we still need to figure out the code generation, but we don't otherwise need to annotate floating point math within a function. You can still get different rounding modes by using different functions, and then use :setinlined(true) to combine them back into a single function at optimization time.

@elliottslaughter
Member Author

I've been digging some more, and I think this is a bit of a red herring.

The PTX semantics on rounding modes say:

If no rounding modifier is specified, default is .rn and instructions may be folded into a multiply-add.

This seems to be an odd conflation of two things: the default rounding mode (which is actually the same as LLVM's) and the equivalent of LLVM's contract flag. The latter allows fusing a multiply and an add into a single multiply-add, which is why this has a performance impact.
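To illustrate why contraction is observable and not just a speedup, here is a minimal Python sketch. It is not Terra, LLVM, or PTX code: the function names are hypothetical, it uses Python's 64-bit floats rather than f32, and it emulates a fused multiply-add by computing a*b+c exactly with rational arithmetic and rounding once at the end. The unfused version rounds the product first and then rounds the sum, so the two can disagree.

```python
from fractions import Fraction


def unfused_mul_add(a: float, b: float, c: float) -> float:
    """Multiply, round to nearest, then add and round again (two roundings)."""
    product = a * b          # rounded to the nearest double
    return product + c       # rounded a second time


def fused_mul_add(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: exact a*b+c, rounded once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding here
    return float(exact)      # single round-to-nearest conversion


# Inputs chosen so that the exact product 1 - 2**-60 is not representable
# as a double and rounds to 1.0 in the unfused version.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(unfused_mul_add(a, b, c))  # 0.0: the low-order bits were rounded away
print(fused_mul_add(a, b, c))    # -2**-60: the fused form keeps them
```

Both versions use round-to-nearest throughout; the only difference is whether the intermediate product is rounded, which is the distinction between the rounding mode and the contract flag.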

It seems that NVVM always emitted these default-rounding-mode instructions that allowed fusion. That's technically incorrect, since fusing without being asked to can change the numerical results. At any rate, I've now manually replicated the results with a hacky version of Terra, so I'm pretty sure that's where the performance is going.

In short, I don't think there's any LLVM rounding mode that would lead to the optimizations I need (since the PTX semantics are... strange), so while this may still be a useful feature, it's not responsible for the performance regression in StanfordLegion/legion#1041 and therefore shouldn't hold back #471.
