This has been known for a while, but finally documenting it here. It's happened on MARCC, AWS EC2, and Stampede2.
When the initial cluster.py process starts (the parent), it eventually forks child worker processes via Python's multiprocessing module. This mostly works well and gets around the GIL.
However, at least one of the third-party libraries we depend on (not our code) holds one or more internal locks, and these are frequently copied in their locked state as part of the memory the child processes inherit at fork time. The release only ever happens on the parent process's copy of the lock; the children's inherited copies stay locked forever, so the workers silently hang until terminated by an external signal. A minimal sketch of the failure mode is below.
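The following is a hypothetical minimal reproduction, not taken from cluster.py, assuming a platform where the "fork" start method is available (Linux, and explicitly requested here via get_context). A lock held by a background thread in the parent at fork time is copied into the child already locked, and nothing in the child will ever release it:

```python
import multiprocessing
import threading
import time

lock = threading.Lock()

def hold_lock_briefly():
    # Parent-side thread: holds the lock across the fork, then releases it.
    # Only the parent's copy of the lock ever sees that release.
    with lock:
        time.sleep(2)

def child_work():
    # Runs in the forked child. The inherited copy of the lock is still in
    # the locked state and no thread here will release it, so this blocks.
    with lock:
        print("child acquired lock")  # never reached if forked while locked

if __name__ == "__main__":
    threading.Thread(target=hold_lock_briefly).start()
    time.sleep(0.5)  # fork while the lock is held

    ctx = multiprocessing.get_context("fork")
    p = ctx.Process(target=child_work)
    p.start()
    p.join(timeout=5)
    print("child still alive after 5s (hung):", p.is_alive())
    p.terminate()
```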
The hang always occurs during initialization, before any work is done. Both SQLAlchemy and the logging module (via its queue handlers) use locks internally and are the main suspects.
Attempts have been made to fix this by passing references to the logging queue and the SQLAlchemy db handle into the children, but they have repeatedly failed. This makes me think more than one inherited lock can trigger the hang, not just one.
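One possible mitigation, sketched below and not what cluster.py currently does: start the workers with the "spawn" start method so each child begins with a fresh interpreter and never inherits locked state, build the SQLAlchemy engine inside the worker process, and route worker logging through a QueueHandler to a single QueueListener in the parent. The DB URL and function names are placeholders for illustration only.

```python
import logging
import logging.handlers
import multiprocessing

from sqlalchemy import create_engine

DB_URL = "sqlite:///example.db"  # placeholder, assumption for this sketch

def worker_init(log_queue):
    # Each worker only writes log records to the shared queue; the parent's
    # QueueListener is the one process that actually emits them.
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(log_queue))
    root.setLevel(logging.INFO)

def work(task_id):
    # Engine created inside the worker (per task here, for brevity), so no
    # connection pool or its internal locks cross the process boundary.
    engine = create_engine(DB_URL)
    logging.info("worker handling task %s", task_id)
    engine.dispose()
    return task_id

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    log_queue = ctx.Queue()
    listener = logging.handlers.QueueListener(log_queue, logging.StreamHandler())
    listener.start()

    with ctx.Pool(processes=4, initializer=worker_init, initargs=(log_queue,)) as pool:
        print(pool.map(work, range(8)))

    listener.stop()
```

The trade-off is that "spawn" re-imports the worker module instead of inheriting the parent's memory, so startup is slower and anything the workers need must be passed explicitly or rebuilt in the child, but that is exactly what prevents locked locks from being carried across the fork.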