Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignite daemon dead or terracotta process dead #34

Closed
mathieucarbou opened this issue Apr 4, 2020 · 3 comments
Closed

Ignite daemon dead or terracotta process dead #34

mathieucarbou opened this issue Apr 4, 2020 · 3 comments
Assignees
Labels
bug Something isn't working

Comments

@mathieucarbou
Copy link
Member

mathieucarbou commented Apr 4, 2020

On the tc-platform build (https://github.com/Terracotta-OSS/terracotta-platform/pull/) with Azure, we often run into build failures caused by the ignite process failing (dead), or tc nodes not found by the ignite process.
These are cryptic errors hard to troubleshoot since there is n oeasy way to pick up the logs in Azure.
So I am opeeing an issue here to make you all 3 aware that there are some random build failures because of the angela process handling mechanism.
@akomakom : I don't if it is possible in Azure to "pin" a build that failed and look at the content, logs, etc.
We could put everything in trace mode but it won't help if we do not have access of the workspace of a failed build.

I am unable to reproduce locally... (Azure only issue perhaps... ?)

@mathieucarbou
Copy link
Member Author

For example, this one: https://dev.azure.com/TerracottaCI/terracotta/_build/results?buildId=3693&view=logs&j=8dbdaf4e-75b0-549d-49ae-570ac1daf5bb&t=da6f2a7a-ce43-5148-4474-fcade0333e00

2020-04-04T13:18:02.3893674Z 2020-04-04 13:18:02.388 INFO  o.t.a.c.Tsa:278 - Stopping TC server on localhost
2020-04-04T13:18:02.7328865Z 2020-04-04 13:18:02.731 INFO  o.t.a.e.tsa:98 - [node-1-2] 2020-04-04 13:18:02,731 WARN - NodeID[localhost:11863] left the cluster
2020-04-04T13:18:03.3920666Z 2020-04-04 13:18:03.391 INFO  o.t.a.c.u.RetryUtils:61 - Condition became true after 1MILLISECONDS
2020-04-04T13:18:03.3942479Z 2020-04-04 13:18:03.393 INFO  o.t.a.c.Tsa:278 - Stopping TC server on localhost
2020-04-04T13:18:05.3973087Z 2020-04-04 13:18:05.396 INFO  o.t.a.c.u.RetryUtils:61 - Condition became true after 1MILLISECONDS
2020-04-04T13:18:05.4319845Z 2020-04-04 13:18:05.427 INFO  o.t.a.c.ClusterFactory:260 - shutting down local agent
2020-04-04T13:18:05.4555579Z [ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 214.499 s <<< FAILURE! - in org.terracotta.dynamic_config.system_tests.activated.ConfigSyncIT
2020-04-04T13:18:05.4557515Z [ERROR] testPassiveZapsAppendLogHistoryMismatch(org.terracotta.dynamic_config.system_tests.activated.ConfigSyncIT)  Time elapsed: 4.195 s  <<< ERROR!
2020-04-04T13:18:05.4630874Z class org.apache.ignite.IgniteException: Remote job threw user exception (override or implement ComputeTask.result(..) method if you would like to have automatic failover for this exception): Failed to execute job due to unexpected runtime exception [jobId=27475354171-e3db64c7-bf4a-46d1-81e7-098f3cb38d56, ses=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=org.terracotta.angela.client.Tsa$$Lambda$273/0x000000010044fc40, dep=GridDeployment [ts=1586006205745, depMode=SHARED, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader@277050dc, clsLdrId=69f65354171-2114264b-4702-43f1-83de-94e20c2cba79, userVer=0, loc=true, sampleClsName=org.terracotta.angela.common.util.IgniteCommonHelper, pendingUndeploy=false, undeployed=false, usage=1], taskClsName=org.terracotta.angela.client.Tsa$$Lambda$273/0x000000010044fc40, sesId=17475354171-e3db64c7-bf4a-46d1-81e7-098f3cb38d56, startTime=1586006207337, endTime=9223372036854775807, taskNodeId=e3db64c7-bf4a-46d1-81e7-098f3cb38d56, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader@277050dc, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, internal=false, topPred=null, subjId=e3db64c7-bf4a-46d1-81e7-098f3cb38d56, mapFut=IgniteFuture [orig=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1588404998]], execName=null], jobId=27475354171-e3db64c7-bf4a-46d1-81e7-098f3cb38d56], err=TC server died while waiting on state-change condition]
2020-04-04T13:18:05.4634884Z 	at org.apache.ignite.compute.ComputeTaskAdapter.result(ComputeTaskAdapter.java:102)
2020-04-04T13:18:05.4635712Z 	at org.apache.ignite.internal.processors.task.GridTaskWorker$5.apply(GridTaskWorker.java:1053)
2020-04-04T13:18:05.4642468Z 	at org.apache.ignite.internal.processors.task.GridTaskWorker$5.apply(GridTaskWorker.java:1046)
2020-04-04T13:18:05.4651579Z 	at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6852)
2020-04-04T13:18:05.4652514Z 	at org.apache.ignite.internal.processors.task.GridTaskWorker.result(GridTaskWorker.java:1046)
2020-04-04T13:18:05.4653323Z 	at org.apache.ignite.internal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:864)
2020-04-04T13:18:05.4667019Z 	at org.apache.ignite.internal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:1081)
2020-04-04T13:18:05.4668455Z 	at org.apache.ignite.internal.processors.task.GridTaskProcessor$JobMessageListener.onMessage(GridTaskProcessor.java:1316)
2020-04-04T13:18:05.4681722Z 	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
2020-04-04T13:18:05.4682773Z 	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197)
2020-04-04T13:18:05.4683703Z 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
2020-04-04T13:18:05.4698263Z 	at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093)
2020-04-04T13:18:05.4699263Z 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2020-04-04T13:18:05.4700038Z 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2020-04-04T13:18:05.4700713Z 	at java.base/java.lang.Thread.run(Thread.java:834)
2020-04-04T13:18:05.4768245Z Caused by: class org.apache.ignite.compute.ComputeUserUndeclaredException: Failed to execute job due to unexpected runtime exception [jobId=27475354171-e3db64c7-bf4a-46d1-81e7-098f3cb38d56, ses=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=org.terracotta.angela.client.Tsa$$Lambda$273/0x000000010044fc40, dep=GridDeployment [ts=1586006205745, depMode=SHARED, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader@277050dc, clsLdrId=69f65354171-2114264b-4702-43f1-83de-94e20c2cba79, userVer=0, loc=true, sampleClsName=org.terracotta.angela.common.util.IgniteCommonHelper, pendingUndeploy=false, undeployed=false, usage=1], taskClsName=org.terracotta.angela.client.Tsa$$Lambda$273/0x000000010044fc40, sesId=17475354171-e3db64c7-bf4a-46d1-81e7-098f3cb38d56, startTime=1586006207337, endTime=9223372036854775807, taskNodeId=e3db64c7-bf4a-46d1-81e7-098f3cb38d56, clsLdr=jdk.internal.loader.ClassLoaders$AppClassLoader@277050dc, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, internal=false, topPred=null, subjId=e3db64c7-bf4a-46d1-81e7-098f3cb38d56, mapFut=IgniteFuture [orig=GridFutureAdapter [ignoreInterrupts=false, state=INIT, res=null, hash=1588404998]], execName=null], jobId=27475354171-e3db64c7-bf4a-46d1-81e7-098f3cb38d56], err=TC server died while waiting on state-change condition]
2020-04-04T13:18:05.4772612Z 	at org.apache.ignite.internal.processors.job.GridJobWorker.handleThrowable(GridJobWorker.java:689)
2020-04-04T13:18:05.4787017Z 	at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:621)
2020-04-04T13:18:05.4788133Z 	at org.apache.ignite.internal.processors.job.GridJobWorker.body(GridJobWorker.java:491)
2020-04-04T13:18:05.4788904Z 	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
2020-04-04T13:18:05.4802259Z 	at org.apache.ignite.internal.processors.job.GridJobProcessor.processJobExecuteRequest(GridJobProcessor.java:1191)
2020-04-04T13:18:05.4803305Z 	at org.apache.ignite.internal.processors.job.GridJobProcessor$JobExecutionListener.onMessage(GridJobProcessor.java:1923)
2020-04-04T13:18:05.4803969Z 	... 7 more
2020-04-04T13:18:05.4804951Z Caused by: java.lang.RuntimeException: TC server died while waiting on state-change condition
2020-04-04T13:18:05.4818226Z 	at org.terracotta.angela.common.TerracottaServerInstance.waitForState(TerracottaServerInstance.java:148)
2020-04-04T13:18:05.4819308Z 	at org.terracotta.angela.agent.AgentController.waitForTsaInState(AgentController.java:295)
2020-04-04T13:18:05.4820192Z 	at org.terracotta.angela.client.Tsa.lambda$start$7b0bf020$1(Tsa.java:249)
2020-04-04T13:18:05.4833717Z 	at org.apache.ignite.internal.processors.closure.GridClosureProcessor$C4.execute(GridClosureProcessor.java:1944)
2020-04-04T13:18:05.4834663Z 	at org.apache.ignite.internal.processors.job.GridJobWorker$2.call(GridJobWorker.java:568)
2020-04-04T13:18:05.4835448Z 	at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6820)
2020-04-04T13:18:05.4840758Z 	at org.apache.ignite.internal.processors.job.GridJobWorker.execute0(GridJobWorker.java:562)
2020-04-04T13:18:05.4851799Z 	... 11 more
2020-04-04T13:18:05.4852390Z 
2020-04-04T13:18:05.4852974Z [INFO] Running org.terracotta.dynamic_config.system_tests.activated.DiagnosticCommand1x2IT

@mathieucarbou
Copy link
Member Author

@saurabhagas also: RetryUtils:61 - Condition became true after 1MILLISECONDS missing space

@mathieucarbou mathieucarbou added the bug Something isn't working label Apr 4, 2020
@aurbroszniowski
Copy link
Member

aurbroszniowski commented Apr 6, 2020

The problem comes from Azure: The root cause is displayed in the console (therefore Angela can not know it) and the console output is separated in Azure

If you look at the stack trace:

  • IgniteException: Remote job threw user exception
    Means that when using Ignite to execute a remote task, the remote task threw an exception

  • Caused by: java.lang.RuntimeException: TC server died while waiting on state-change condition
    Means that this remote exception was that the Terracotta server died, when changing of state. In Angela, a process moves between different states, enumerated in org.terracotta.angela.common.TerracottaServerState
    If something wrong happens during an action that is supposed to change that state (e.g. starting a Terracotta server will move it from STOPPED to STARTED), then this exception is thrown.
    If the error doesn't come from Angela, it cannot obviously gives the root cause and you need then to look into the appropriate location (for the Terracotta server, it's in the console, and the terracotta server log).
    So for this example, if you look in the console log, you see:

2020-04-04 13:16:49.456 INFO  o.t.a.e.tsa:98 - [node-1-1] 2020-04-04 13:16:49,455 ERROR - com.tc.net.groups.GroupException: java.net.BindException: Address already in use

https://dev.azure.com/TerracottaCI/terracotta/_build/results?buildId=3693&view=ms.vss-test-web.build-test-results-tab&runId=1004988&resultId=100704&paneView=attachments

The Terracotta server couldn't start because another one is started using the same port, and it shut down. The STARTED stage couldn't then never be reached and Angela reported it.

I've added the expected condition to the Angela exception so next time it helps a bit on understanding what stage should be happening.

I've opened 2 tickets to propose some user experience improvement related to this ticket:
#35
#36

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants