[CRaC] Fix hangup after restoring #34372

Open
wants to merge 1 commit into
base: main



@YaSuenag YaSuenag commented Feb 6, 2025

I ran the following ApplicationRunner Spring Boot app and took a checkpoint with CRIU. After restoring, the app did not finish.

  @Override
  public void run(ApplicationArguments args) throws Exception {
    if(args.containsOption("checkpoint")){
      System.out.println("Ready to obtain checkpoint...");
      // Wait for the restore...
      cpCoordinator.await();
    }
    System.out.println("from Spring Boot App");
  }

I took a thread dump and got the following stack trace. It shows the prevent-shutdown thread, started by the beforeCheckpoint CRaC handler, waiting in CyclicBarrier.await().

"prevent-shutdown" #29 [1504] prio=5 os_prio=0 cpu=0.17ms elapsed=25.76s tid=0x00007feb1017db00 nid=1504 waiting on condition  [0x00007feb4e22b000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
        - parking to wait for  <0x000000008a9279b0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:371)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block([email protected]/AbstractQueuedSynchronizer.java:519)
        at java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3780)
        at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3725)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await([email protected]/AbstractQueuedSynchronizer.java:1707)
        at java.util.concurrent.CyclicBarrier.dowait([email protected]/CyclicBarrier.java:236)
        at java.util.concurrent.CyclicBarrier.await([email protected]/CyclicBarrier.java:364)
        at org.springframework.context.support.DefaultLifecycleProcessor$CracResourceAdapter.awaitPreventShutdownBarrier(DefaultLifecycleProcessor.java:634)
        at org.springframework.context.support.DefaultLifecycleProcessor$CracResourceAdapter.lambda$beforeCheckpoint$0(DefaultLifecycleProcessor.java:606)
        at org.springframework.context.support.DefaultLifecycleProcessor$CracResourceAdapter$$Lambda/0x00007feb501c37c0.run(Unknown Source)
        at java.lang.Thread.runWith([email protected]/Thread.java:1596)
        at java.lang.Thread.run([email protected]/Thread.java:1583)

I investigated CracResourceAdapter: the prevent-shutdown thread might pass through the second awaitPreventShutdownBarrier() call if that thread runs before the awaitPreventShutdownBarrier() call in beforeCheckpoint().

We need separate barriers for beforeCheckpoint and afterRestore for this to work as expected.
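
To illustrate the idea of one barrier per phase, here is a minimal standalone sketch. It is not Spring's CracResourceAdapter and not the patch in this PR; the class name TwoPhaseBarrierSketch, its fields, and the awaitHelperExit() helper are all hypothetical, and the checkpoint itself is only simulated by calling the two methods in order.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical sketch: not Spring's CracResourceAdapter and not this PR's patch.
class TwoPhaseBarrierSketch {

    // One two-party barrier per phase, so an arrival that belongs to one
    // phase can never satisfy a wait that belongs to the other phase.
    private final CyclicBarrier checkpointBarrier = new CyclicBarrier(2);
    private final CyclicBarrier restoreBarrier = new CyclicBarrier(2);
    private Thread helper;

    // Stand-in for a beforeCheckpoint() callback.
    void beforeCheckpoint() {
        helper = new Thread(() -> {
            await(checkpointBarrier); // rendezvous with beforeCheckpoint()
            // A real checkpoint/restore would happen while the helper is parked below.
            await(restoreBarrier);    // released only by afterRestore()
        }, "prevent-shutdown-sketch");
        helper.setDaemon(false);      // non-daemon: keeps the JVM alive while parked
        helper.start();
        await(checkpointBarrier);
    }

    // Stand-in for an afterRestore() callback.
    void afterRestore() {
        await(restoreBarrier);
    }

    // Waits up to the given time for the helper to finish; true if it has exited.
    boolean awaitHelperExit(long millis) {
        try {
            helper.join(millis);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
        }
        return !helper.isAlive();
    }

    private static void await(CyclicBarrier barrier) {
        try {
            barrier.await();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        } catch (BrokenBarrierException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        TwoPhaseBarrierSketch sketch = new TwoPhaseBarrierSketch();
        sketch.beforeCheckpoint();
        sketch.afterRestore();
        System.out.println("helper exited: " + sketch.awaitHelperExit(5000));
    }
}
```

With a dedicated barrier for each phase, the helper thread's second wait can only be released by afterRestore(), regardless of how the threads are scheduled around the first rendezvous.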

Signed-off-by: Yasumasa Suenaga <[email protected]>
@YaSuenag YaSuenag force-pushed the pr/crac-restore-hang branch from 5136e9e to 13fbbd1 Compare February 6, 2025 03:19
@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged or decided on label Feb 6, 2025
@sdeleuze sdeleuze self-assigned this Feb 6, 2025
@sdeleuze sdeleuze added in: core Issues in core modules (aop, beans, core, context, expression) type: enhancement A general enhancement labels Feb 6, 2025