Initial commit of the OpenCL integrators and drivers by CPS. #4
@stonecp Thanks Chris! I will work on getting this integrated into our framework.
@arghdos ping on this?
Ok, I see a few things we'll want to resolve before incorporating this:
So my question here is: do we want to spin off this functionality into its own pyJac-like package? Additionally, how exactly are the binary mechanism files created? I don't see that code in here.
I could switch over to using pyJac for the RHS function, which would also give me the Jacobian; I could then stop using my cklib (a Chemkin-like clone). The objects in OpenCL are JIT-compiled, so I just push the source file directly into the OCL run-time library and out pop the objects; I could take the pyJac-generated source and load it at run-time. Does the RHS from pyJac populate an array (e.g., wdot[]), or are all values scalars as in the past? I'll have to review the RHS API. For the SIMD work, I may have to go upstream to change the datatype written by pyJac, for example from double to double8. Again, I'll have to review the pyJac output. If the datatype in the generated source were C++ templated this would be easy, but that won't work with OCL sadly... yet another reason to stop using it.
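For reference, the JIT path is just OpenCL's standard runtime-compilation API; a minimal sketch with error handling elided (the helper name is mine):

```c
#include <CL/cl.h>

/* JIT-compile kernel source read from disk at run-time; `context`
   and `device` are assumed to be initialized elsewhere. */
cl_program build_program(cl_context context, cl_device_id device,
                         const char *src, size_t len)
{
    cl_int err;
    cl_program prog =
        clCreateProgramWithSource(context, 1, &src, &len, &err);
    /* Build options could inject e.g. -Ddouble_type=double8 here. */
    err = clBuildProgram(prog, 1, &device, "-cl-std=CL1.2", NULL, NULL);
    return prog;
}
```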
So the C-version of the RHS operates on simple arrays:
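Something along these lines (the exact pyJac signature is paraphrased from memory, so treat this as a sketch):

```c
/* RHS: fills dy with the time derivatives of the state vector y.
   t is the current time and pres the pressure, which acts as the
   lone piece of "user data". */
void dydt(const double t, const double pres,
          const double *y, double *dy);
```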
The CUDA version takes an additional struct argument (sketched below), which contains pointers to preallocated global device work-arrays (e.g., concentrations, fwd/reverse rates, etc.). I used this instead of local (in the CUDA sense) memory arrays because I've had problems before with exceeding the maximum local (CUDA) memory allowance per thread (especially for Jacobian arrays); it's generally simpler to use global device arrays and pass them in. For OpenCL this would require a workaround, but it's potentially possible: one option is to simply declare a maximum size and only force the JIT recompile if it's exceeded. My current OpenCL work for the new pyJac avoids this by having the arrays be dynamically sized (of length divisible by the work-item size), which requires no recompiling (ideal for, say, a CFD situation where the number of problems may change with every iteration).
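A hedged sketch of that CUDA signature (again paraphrased):

```c
/* d_mem bundles pointers to preallocated global device work arrays
   (concentrations, fwd/reverse rates, ...), avoiding per-thread
   local-memory limits. */
struct mechanism_memory;  /* defined in the pyJac-generated headers */
__device__ void dydt(const double t, const double pres,
                     const double *y, double *dy,
                     const struct mechanism_memory *d_mem);
```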
It's all C-based, and it should be relatively easy (if somewhat tedious) to add a new OpenCL language target to the current version of pyJac that simply writes all "double"s as "double8", etc. The new version of pyJac is generated using loopy, and I can fairly easily add support for vector types; I've currently just used regular arrays and let OpenCL's implicit vectorizer figure it out for simplicity, but it would be good to see whether there is a significant performance benefit from using the vector types explicitly (see the sketch below). One more issue that currently isn't well handled in accelerInt: user data. Currently the RHS function expects a user parameter (here, the pressure), but this clearly should be revised towards some sort of CVODEs-like user-data mechanism.
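To make the double-to-double8 change concrete, an explicitly vectorized kernel would look something like this (hypothetical sketch; NSP stands in for the species count):

```c
#define NSP 53  /* species count; e.g. GRI-Mech 3.0 */

/* Each work-item advances eight states at once via OpenCL vector
   types; the same source written with plain double instead relies
   on the implicit vectorizer. */
__kernel void rhs_simd(__global const double8 *y,
                       __global const double8 *pres,
                       __global double8 *dy)
{
    const int i = get_global_id(0);
    /* Placeholder arithmetic: double8 operations map one-to-one
       onto SIMD lanes. */
    dy[i * NSP] = -pres[i] * y[i * NSP];
}
```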
@stonecp @arghdos where did we leave this?
@kyleniemeyer I'm beginning the process of coupling pyJac-v2 to accelerInt this week (as soon as all my unit tests pass). RE: the memory structure & user data, I've changed pyJac-v2 to adopt a model closer to what @stonecp used in this PR; essentially there is a work buffer from which all the internally used buffers are extracted via pointer arithmetic, e.g.:
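(illustration only; the names and counts are placeholders):

```c
/* Sub-buffers carved from one flat work buffer by pointer offsets;
   NSP and NREAC are the species and reaction counts. */
void unpack_work_buffer(double *work_buffer, int NSP, int NREAC)
{
    double *conc      = work_buffer;        /* species concentrations */
    double *fwd_rates = conc + NSP;         /* forward reaction rates */
    double *rev_rates = fwd_rates + NREAC;  /* reverse reaction rates */
    (void)rev_rates;  /* ...passed on to the actual chemistry kernels */
}
```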
This has the added benefit of de-duplicating a lot of internal memory storage, and it fits well with the copy-in/out driver that @stonecp created to power the initial-condition loop. pyJac-v2 now supports explicit-SIMD types for wide vectorization as well (though currently only with my own branch of loopy, as I still need to get that PR merged). So, long story short, the bulk of the issues I raised above have been dealt with, and v2 should be fairly easy to couple to the new solver. One thing I would still like feedback on: what should we be expecting for user_data? For reference, CVODEs expects a user RHS function like:
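That is, the standard CVRhsFn type from the SUNDIALS documentation:

```c
/* CVODES right-hand-side callback: user_data is an opaque pointer
   registered once and handed back on every call. */
int f(realtype t, N_Vector y, N_Vector ydot, void *user_data);
```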
and Jacobian of:
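In recent SUNDIALS releases the dense Jacobian callback takes the SUNMatrix form (older versions used DlsMat instead):

```c
/* CVODES dense Jacobian callback: fill J with df/dy evaluated at
   (t, y); tmp1..tmp3 are scratch vectors provided by the solver. */
int jac(realtype t, N_Vector y, N_Vector fy, SUNMatrix J,
        void *user_data,
        N_Vector tmp1, N_Vector tmp2, N_Vector tmp3);
```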
Currently V2 uses functions like:
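Roughly of this shape (paraphrased; the exact names and argument ordering in the generated code may differ):

```c
/* pyJac-v2 style: the per-condition pressure array and the scratch
   work buffer are explicit arguments rather than an opaque
   user-data pointer. */
void species_rates(const double *P_arr,  /* per-condition pressures */
                   const double *phi,    /* state vectors           */
                   double *dphi,         /* output derivatives      */
                   double *rwk);         /* scratch work buffer     */
```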
Obviously this works for us, but I think it would be worthwhile (with an eye on our future accelerInt paper) to come up with a more generalized calling interface; I was thinking something like:
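(a sketch, not a settled API):

```c
/* Generalized, CVODEs-like interface: the integrator supplies the
   scratch buffer, and user_data is forwarded untouched to the RHS. */
int dydt(const double t, const double *y, double *dy,
         void *user_data, double *rwk);
```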
But this presents a few challenges in a vectorized environment, namely:
I believe the answer to this is yes, even if it's annoying to copy over arrays of structs etc.
My opinion is yes for compatibility, even if our application has no use for it.
To see what I mean, take our case of the pressure array, which contains a different pressure for each initial condition. In the 'driver' kernel model, however, the RHS function expects to be called on a local copy of the state (in the sense of 'I copied this data from the global state vector of all initial conditions into my own private working buffer'), so the user data has to be offset to the current condition as well; indexing a void pointer is tricky, hence a call like:
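(hypothetical driver-side code, with `index` naming the current initial condition):

```c
/* Recover the double* pressure array from the opaque pointer and
   offset it to this initial condition before calling the RHS. */
dydt(t, y_local, dy_local, &((double *)user_data)[index], rwk);
```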
(where the cast-and-offset picks out the pressure belonging to the condition being integrated). So there are a few solutions here I'd like both your feedback on:
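One option is to keep the opaque void pointer and push the unpacking onto the user's RHS; a hypothetical sketch:

```c
/* Option 1: hand the user the raw pointer plus the condition index
   and let their RHS do the unpacking itself. */
void user_rhs(const double t, const double *y, double *dy,
              void *user_data, const int index)
{
    const double pres = ((const double *)user_data)[index];
    /* ... evaluate dy using pres ... */
}
```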
The downside here is that the user has to correctly unpack some rather tricky pointer arithmetic, and we have to expose the internal data layout to make that possible.
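Alternatively, the user-data type could be fixed at compile / JIT time so the driver can do the indexing itself (again a hypothetical sketch):

```c
/* Option 2: a user-data struct known when the OpenCL source is
   compiled, letting the driver index it without void* gymnastics. */
typedef struct {
    double pres;  /* pressure for one initial condition */
} user_data_t;

void user_rhs(const double t, const double *y, double *dy,
              const user_data_t *ud);  /* already offset to the
                                          current condition */
```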
The plus side here is that we skip the concerns above, at the cost of all this having to happen at compile time, which makes it difficult to link to run-time-defined objects (@kyleniemeyer this might be a bigger issue when we get around to hooking accelerInt up to the scipy stack).
Kyle and Nick,
Here's my initial commit of the OpenCL code used for the WIP paper on SIMD parallelism for kinetics.