This has been known for a while, but finally documenting it here. It's happened on MARCC, AWS EC2, and Stampede2.
When the initial cluster.py process starts (the parent), it eventually forks child worker processes via Python's multiprocessing module. This mostly works well and gets around the GIL.
However, at least one of the third-party libraries we depend on (not our code) holds one or more internal locks, and these are frequently copied in their locked state as part of the memory the child processes inherit at fork time. The release only ever happens on the parent process's copy of the lock; the children's inherited copies stay locked forever, so the workers silently hang until terminated by an external signal. A minimal sketch of the failure mode is below.
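The following is a hypothetical minimal reproduction, not taken from cluster.py, assuming a platform where the "fork" start method is available (Linux, and explicitly requested here via get_context). A lock held by a background thread in the parent at fork time is copied into the child already locked, and nothing in the child will ever release it:

```python
import multiprocessing
import threading
import time

lock = threading.Lock()

def hold_lock_briefly():
    # Parent-side thread: holds the lock across the fork, then releases it.
    # Only the parent's copy of the lock ever sees that release.
    with lock:
        time.sleep(2)

def child_work():
    # Runs in the forked child. The inherited copy of the lock is still in
    # the locked state and no thread here will release it, so this blocks.
    with lock:
        print("child acquired lock")  # never reached if forked while locked

if __name__ == "__main__":
    threading.Thread(target=hold_lock_briefly).start()
    time.sleep(0.5)  # fork while the lock is held

    ctx = multiprocessing.get_context("fork")
    p = ctx.Process(target=child_work)
    p.start()
    p.join(timeout=5)
    print("child still alive after 5s (hung):", p.is_alive())
    p.terminate()
```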
The hang always occurs during initialization, before any work is done. Both SQLAlchemy and the logging module (via its queue handlers) use locks internally and are the main suspects.
Attempts have been made to fix this by passing references to the logging queue and the SQLAlchemy db handle into the children, but they have repeatedly failed. This makes me think more than one inherited lock can trigger the hang, not just one.
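One possible mitigation, sketched below and not what cluster.py currently does: start the workers with the "spawn" start method so each child begins with a fresh interpreter and never inherits locked state, build the SQLAlchemy engine inside the worker process, and route worker logging through a QueueHandler to a single QueueListener in the parent. The DB URL and function names are placeholders for illustration only.

```python
import logging
import logging.handlers
import multiprocessing

from sqlalchemy import create_engine

DB_URL = "sqlite:///example.db"  # placeholder, assumption for this sketch

def worker_init(log_queue):
    # Each worker only writes log records to the shared queue; the parent's
    # QueueListener is the one process that actually emits them.
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(log_queue))
    root.setLevel(logging.INFO)

def work(task_id):
    # Engine created inside the worker (per task here, for brevity), so no
    # connection pool or its internal locks cross the process boundary.
    engine = create_engine(DB_URL)
    logging.info("worker handling task %s", task_id)
    engine.dispose()
    return task_id

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    log_queue = ctx.Queue()
    listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
    listener.start()

    with ctx.Pool(processes=4, initializer=worker_init, initargs=(log_queue,)) as pool:
        print(pool.map(work, range(8)))

    listener.stop()
```

The trade-off is that "spawn" re-imports the worker module instead of inheriting the parent's memory, so startup is slower and anything the workers need must be passed explicitly or rebuilt in the child, but that is exactly what prevents locked locks from being carried across the fork.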