Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hold a hard reference to Ruby threads #6143

Merged
merged 3 commits into from
Apr 14, 2020

Conversation

headius
Copy link
Member

@headius headius commented Mar 27, 2020

The old logic maintained a weak reference to the native thread
associated with a Ruby thread, in order to avoid keeping its
resources alive longer than the thread's lifecycle. However in
the case shown in #6142, it seems likely that under heavy load the
native thread gets collected before it starts running. Because of
the logic we have to silently ignore collected references, the
NativeThread.start method simply returns if the thread has gone
away, giving us no indication that the thread never actually ran.

This patch splits our NativeThread into two types:

  • RubyNativeThread, which holds a hard reference to the native
    Thread and implements start to call Thread#start.
  • AdoptedNativeThread, which holds a weak reference to the native
    Thread and errors (BUG!) if we attempt to call call start on
    that adopted thread.

I believe this will fix the issue in #6142. This may explain other
"one in a million" failures we have never quite tracked down.

@headius headius added this to the JRuby 9.3.0.0 milestone Mar 27, 2020
@headius
Copy link
Member Author

headius commented Mar 27, 2020

The second commit here may actually be the only fix needed. We hold a hard reference to the thread but once we are no longer using that local variable the JVM may consider it dereferenced. Doing the Thread.start via the WeakReference opens up the possibility that GC will have claimed the object. Instead, I modified the logic to start the native Thread directly, so it remains referenced until (at least) the Thread#start call has completed.

@headius
Copy link
Member Author

headius commented Mar 27, 2020

I got some confirmation from our JVM friends that this theory could very well be true. Even though we hold a reference to the Thread object in a local variable, by the time we call start we are no longer using that local variable. The start call itself happens after traversing the WeakReference, so if the JVM happens to GC before that point but after the last use of the local variable, we might see the WR get cleared.

Incredible "luck" to catch this on film.

@headius headius linked an issue Mar 27, 2020 that may be closed by this pull request
@headius
Copy link
Member Author

headius commented Mar 27, 2020

Strange failure in concurrent-ruby here makes me reluctant to merge this just yet. The CyclicBarrier test tries to wait for all threads to finish, but at least one of them seems to still be waiting with this stack trace:

  #number_waiting
    without any waiting thread
      should be equal to zero
    with waiting threads
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:460)
org.jruby.RubyThread$SleepTask.run(RubyThread.java:1531)
org.jruby.RubyThread.executeBlockingTask(RubyThread.java:1584)
org.jruby.RubyThread.wait_timeout(RubyThread.java:2132)
com.concurrent_ruby.ext.SynchronizationLibrary$JRubyLockableObject.nsWait(SynchronizationLibrary.java:254)
com.concurrent_ruby.ext.SynchronizationLibrary$JRubyLockableObject$INVOKER$i$0$1$nsWait.call(SynchronizationLibrary$JRubyLockableObject$INVOKER$i$0$1$nsWait.gen)
org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:815)
org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:203)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:172)
org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:316)
org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)
org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:86)
org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:156)
org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:143)
org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:182)
org.jruby.runtime.callsite.CachingCallSite.callIter(CachingCallSite.java:191)
org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:337)
org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:72)
org.jruby.ir.interpreter.Interpreter.INTERPRET_BLOCK(Interpreter.java:116)
org.jruby.runtime.MixedModeIRBlockBody.commonYieldPath(MixedModeIRBlockBody.java:137)
org.jruby.runtime.IRBlockBody.doYield(IRBlockBody.java:172)
org.jruby.runtime.BlockBody.yield(BlockBody.java:108)
org.jruby.runtime.Block.yield(Block.java:184)
com.concurrent_ruby.ext.SynchronizationLibrary$JRubyLockableObject.rubySynchronize(SynchronizationLibrary.java:232)

I'm not sure if this is a problem related to my change (which seems unlikely) or something flaky about the concurrent-ruby tests. I will make the simpler change on master (ensure the thread remains hard referenced until start returns) and use this PR to work on additional cleanup and refactoring of this thread lifecycle.

The old logic maintained a weak reference to the native thread
associated with a Ruby thread, in order to avoid keeping its
resources alive longer than the thread's lifecycle. However in
the case shown in jruby#6142, it seems likely that under heavy load the
native thread gets collected before it starts running. Because of
the logic we have to silently ignore collected references, the
NativeThread.start method simply returns if the thread has gone
away, giving us no indication that the thread never actually ran.

This patch splits our NativeThread into two types:

* RubyNativeThread, which holds a hard reference to the native
  Thread and implements start to call Thread#start.
* AdoptedNativeThread, which holds a weak reference to the native
  Thread and errors (BUG!) if we attempt to call call start on
  that adopted thread.

I believe this will fix the issue in jruby#6142. This may explain other
"one in a million" failures we have never quite tracked down.
This also calls start on the native thread directly, rather than
by traversing the native thread object.
Also some misc cleanup surrounding the new ThreadLike impls.
@headius
Copy link
Member Author

headius commented Apr 14, 2020

The concurrent-ruby failure appears to be a bad test, which I filed in ruby-concurrency/concurrent-ruby#862

@headius headius merged commit a38cc23 into jruby:master Apr 14, 2020
@headius headius deleted the hard_ref_ruby_threads branch April 14, 2020 22:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Thread didn't execute?
1 participant