MMA-izing the prolongator and restrictor kernels #1497
- Still need to add the spin factor of 2;
- Still need to cover the `to_non_rel = true`;
…for loading from gmem.
…eature/prolongator-mma
…eature/prolongator-mma
… the process add an additional template to `kernel_param`.
…o feature/prolongator-mma
…underlying code such that FP16 works with rescaling.
…eature/prolongator-mma
Thanks for making the requested fixes, @hummingtree. Aside from a trivial comment I just made (`logQuda`), this is good to go as far as I am concerned.
Good news: this passes a visual review! Bad news: I hit an issue that's only present with
Command---with the tunecache I have, it only triggers with single precision,
Here's the error, it's CA-GCR on the coarsest level very quickly, after the first norm check after a dslash; it also breaks with any other solver, so it seems like it's the coarsest dslash itself. It's unique to a batched solve, seemingly independent of
Reference tunecache: tunecache_fail.tar.gz
Commit id: 49c0a58
Infinitely cleaner command... thanks @hummingtree
…s as they do not work; Add the logic to make sure the box sizes are not larger than the limit.
Thanks Evan for the tests! This should have been fixed in e8ca869.
With the recent bugfixes this is a go; the recent issue I filed is orthogonal to this work. Awesome work, @hummingtree!
cscs-ci run
As the name suggests, this PR adds initial support for MMA-izing the prolongator and restrictor kernels. In addition,
The encoding is the following:
The default types are:
`nVec` instantiation to use, e.g., if `nVec = 16, 32` are instantiated, for `nRHS = 5`, `nVec = 16` will be picked; for `nRHS = 24`, `nVec = 32` will be picked; for `nRHS = 96`, the `nVec = 32` kernel will be called 3 times to divide and conquer.

Remaining to-dos:
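
For reference, the `nVec` selection rule described in the PR text can be sketched in plain C++. This is an illustration only, not the code in this PR: `NVecChoice` and `pick_n_vec` are hypothetical names, and the rule itself (pick the smallest instantiated `nVec` that is at least `nRHS`, otherwise use the largest one and launch it repeatedly) is an assumption inferred from the three examples given.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result type: which instantiation to launch and how many times.
struct NVecChoice {
  int n_vec;      // instantiated kernel width that was picked
  int n_launches; // number of kernel calls needed to cover all right-hand sides
};

// Hypothetical helper, not part of QUDA.
// `instantiated` must be non-empty and sorted in ascending order.
inline NVecChoice pick_n_vec(const std::vector<int> &instantiated, int n_rhs)
{
  assert(!instantiated.empty() && n_rhs > 0);
  assert(std::is_sorted(instantiated.begin(), instantiated.end()));

  // Assumed rule: the smallest instantiation that covers every
  // right-hand side in a single launch wins.
  auto it = std::lower_bound(instantiated.begin(), instantiated.end(), n_rhs);
  if (it != instantiated.end()) return {*it, 1};

  // Otherwise fall back to the largest instantiation and split the
  // right-hand sides across several launches (rounding up).
  const int largest = instantiated.back();
  return {largest, (n_rhs + largest - 1) / largest};
}
```

With `{16, 32}` instantiated, `pick_n_vec({16, 32}, 5)` selects `nVec = 16`, `pick_n_vec({16, 32}, 24)` selects `nVec = 32`, and `pick_n_vec({16, 32}, 96)` selects `nVec = 32` with three launches, matching the examples in the description.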