CUDA Reconstruction Stuck (2024.05.04.) #569
Comments
No idea. Is this the right value to put? |
Note this is |
Yes, I worry that the CKF is counting the branches forever with the maximum value of the maximum branches per seed. |
No. That's really just the default value in our code. This was a good suggestion, but apparently that's not where the code gets stuck. 😦
|
Then it might be the KF where the propagation gets stuck. But you really need to let us know where it gets stuck. |
😛 Let's not get quite that basic just yet. It is indeed the fitting. I can at least attach a normal GDB process, even if
|
Does this only happen for CUDA, or also for CPU? |
Fair question. But the host version does run through.
|
The KF is supposed to reproduce mathematically identical tracks to the ones found by the CKF; at least that is what I intended. |
Also, please make this specific data available somewhere so I can test and debug by myself. |
Let's start with the important part first. I put the ODD ttbar files here yesterday: https://cernbox.cern.ch/s/aLswvi2pNcBX9wr Just make sure that you have ~100 GB of free space if you want to download it. 😦 (~24 GB for the TGZ, and ~65 GB for the uncompressed files.) I could upload just the one problematic event as well if you'd like. 🤔 At the same time: the plot thickens. After building the code with #568 locally fixed, in Debug mode, the bloody thing runs through. 😮
I'll do some further testing next, but it very much seems to be some sort of race condition, which the slower-running debug binary manages to avoid. 🤔 |
Unfortunately I'm none the wiser, now that I've looked at the code for a bit. I mean, the propagation code can of course get into an endless loop relatively easily. So I even tried to "activate" the aborter that we have in traccc's fitting code, plus added one assertion in the only place where a memory error could relatively easily happen.

```diff
diff --git a/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp b/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
index 898e6ef8..565a27ce 100644
--- a/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
+++ b/core/include/traccc/fitting/kalman_filter/kalman_fitter.hpp
@@ -106,7 +106,7 @@ class kalman_fitter {
     }
 
     /// Individual actor states
-    typename aborter::state m_aborter_state{};
+    typename aborter::state m_aborter_state{detray::unit<scalar_type>::m};
     typename transporter::state m_transporter_state{};
     typename interactor::state m_interactor_state{};
     typename fit_actor::state m_fit_actor_state;
@@ -229,6 +229,7 @@ class kalman_fitter {
         auto& track_states = fitter_state.m_fit_actor_state.m_track_states;
 
         // Fit parameter = smoothed track parameter at the first surface
+        assert(!track_states.empty());
         fit_res.fit_params = track_states[0].smoothed();
 
         for (const auto& trk_state : track_states) {
```

But they made no difference. The code still goes into an endless loop in optimized mode, and finishes in debug mode. 😕 Though at least I learned about nvtop along the way. 😛 |
Have you tried slapping a print statement on this loop to see if it loops forever? Of course, if it really is a race condition, the additional timing might avoid it, but it could be worth a try. |
Aborting with a limited path length won't help when the track is oscillating around the same surface. (I tried the same thing in the CKF and it was a no-no.) But as I said, the infinite loop should not happen at all, at least for the traccc KF running after the CKF. |
In case it's oscillating around a surface, you could also try to either reduce the overstepping tolerance or increase the minimum step size (is that still checked correctly?). If the stepper cannot increase the step size enough after stepping onto a surface with a mini-step, we might indeed end up oscillating around that surface... |
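For reference, here's a minimal sketch of the knobs being talked about (not tested against this issue; the field names and units are from my memory of the detray propagation config and may differ in the version used here, so treat them as assumptions):

```cpp
// "cfg" stands for the detray propagation configuration that ends up as
// m_cfg inside the propagator (see the propagator.hpp snippet later in this
// thread, which reads m_cfg.navigation / m_cfg.stepping). The field names
// here are assumptions, not verified against the exact detray version in use.

// Allow less overstepping behind a surface (the tolerance is negative).
cfg.navigation.overstep_tolerance = -100.f * detray::unit<float>::um;

// Keep the stepper from shrinking its step size indefinitely after a
// mini-step onto a surface.
cfg.stepping.min_stepsize = 1.f * detray::unit<float>::um;
```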
It should not happen, but pretty clearly it does. 🤔 Adding

```diff
diff --git a/core/include/detray/propagator/propagator.hpp b/core/include/detray/propagator/propagator.hpp
index e361f4e6..3b47ec4c 100644
--- a/core/include/detray/propagator/propagator.hpp
+++ b/core/include/detray/propagator/propagator.hpp
@@ -151,8 +151,11 @@ struct propagator {
         m_navigator.update(propagation, m_cfg.navigation);
 
         // Run while there is a heartbeat
+        int i = 0;
         while (propagation._heartbeat) {
 
+            printf("starting iteration %i\n", i++);
+
             // Take the step
             propagation._heartbeat &=
                 m_stepper.step(propagation, m_cfg.stepping);
```

to the code, I ended up killing my test job at:
It's not actually a "race condition" that I suspect here, 🤔 since this code is not doing any cooperation between threads that I could see. What I suspect is a floating-point issue: that the CKF, for whatever reason, propagates one particle just differently enough using "fast math" that the KF doesn't reproduce it. So yeah, I'm very much getting the sense that an aborter based on the number of iterations is the way to go here... |
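To make the floating-point point concrete, here's a standalone toy (nothing traccc-specific, and the values are made up): whether a multiply-add gets contracted into a single FMA, which fast-math style compiler settings make much more likely, already changes the last bits of the result. Differences of that size are exactly what could make the KF fail to retrace a track that the CKF found.

```cpp
// Toy demonstration: a fused multiply-add rounds once, a separate
// multiply-then-add rounds twice, so the two results can differ in the
// last bits. Compiler settings decide which one an innocent-looking
// "a * b + c" actually turns into.
#include <cmath>
#include <cstdio>

int main() {
    const double a = 1.0 + 1e-8;
    const double b = 1.0 - 1e-8;
    const double c = -1.0;

    const double prod = a * b;               // rounded once
    const double separate = prod + c;        // rounded again
    const double fused = std::fma(a, b, c);  // rounded only once, at the end

    std::printf("separate : %.17e\n", separate);
    std::printf("fused    : %.17e\n", fused);
    return 0;
}
```

The two printouts differ in the last digits, which is all it takes for a chi-square cut or a navigation decision to flip on one particular track.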
I am pretty aware that it clearly happens. |
I'm very happy to leave you to it. 🤔 Would you like some additional help with getting the cells that produce this behaviour? (I could put the files one-by-one onto EOS for instance.) |
Can the GPU or CUDA generate different results from exactly the same sequence of calculations?
I would appreciate it. |
Though adding an "aborter" is still something that we should consider, since this is likely not going to be the last bug in our code. And making the code print a warning and then continue would be much preferable to getting into endless loops on weird events. For context: we are just trying to process O(100) events here. In the foreseeable future I hope that we'll be able to go up to processing millions. And at that point we'll need reasonable output about errors, and not just endlessly looping jobs. Since that's exactly what we had with the NSW reconstruction as well... 😦 |
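Just to make the "warn and continue" idea concrete, here is a sketch of a hard iteration cap on top of the propagator loop quoted earlier in this thread. The limit is a made-up placeholder, and in practice this should probably live in a dedicated aborter actor rather than in the loop itself:

```cpp
// Illustrative only: the while loop from the propagator.hpp snippet above,
// with a hypothetical safety cap. max_steps is a placeholder, not a tuned
// number.
int n_steps = 0;
constexpr int max_steps = 100000;

while (propagation._heartbeat) {

    if (++n_steps > max_steps) {
        printf("WARNING: propagation did not finish after %d steps, "
               "giving up on this track\n", max_steps);
        propagation._heartbeat = false;
        break;
    }

    // Take the step
    propagation._heartbeat &=
        m_stepper.step(propagation, m_cfg.stepping);

    // ... rest of the original loop body unchanged ...
}
```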
Sounds legit. |
You can find the one problematic event here: https://cernbox.cern.ch/s/jQ3TYzcLX0cAgQz A "standard setup" of the main branch (with the latest data file downloaded for the ODD geometry files that I've been using), run on this one event with the sort of commands that I've put plenty of into this issue, should reproduce this endless loop. Though I have to admit, I didn't even try it on a different NVIDIA GPU yet, just the one in my desktop. 🤔 So it's not out of the question that a slightly different GPU may not even reproduce the issue. |
Processing every ODD ttbar event that I made in #561, I have one that makes the reconstruction run forever. 😕
The application is just stuck on that file, with both my CPU and GPU reporting to be busy. 🤔
I don't see this behaviour on any of the other events that I simulated. So I can imagine two things:
In the end, both of them come down to the same thing, 🤔 since even on "bad events" we can't afford to have our code go into an endless loop.
Note that I haven't found yet which algorithm/kernel is doing it. Unfortunately attaching cuda-gdb to a running process is a lot more difficult than I first thought. 😦 So I thought I'd open the issue with just this little information for now.