error while running CifarApp #130

Open
prakhar21 opened this issue Jun 8, 2016 · 2 comments

@prakhar21

When I run CifarApp on a Spark cluster, the following error comes up:

16/06/08 12:50:04 INFO DAGScheduler: ResultStage 14 (foreach at CifarApp.scala:105) failed in 0.040 s
16/06/08 12:50:04 INFO DAGScheduler: Job 8 failed: foreach at CifarApp.scala:105, took 0.049292 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 43, localhost): java.lang.ArrayIndexOutOfBoundsException

@robertnishihara
Member

It looks like the failing line is

 workers.foreach(_ => workerStore.get[CaffeSolver]("solver").trainNet.setWeights(broadcastWeights.value))

It's possible that the lookup workerStore.get[CaffeSolver] is failing. So perhaps try just

 workers.foreach(_ => workerStore.get[CaffeSolver]("solver"))

and see if that succeeds or fails.

If that is failing, it may be that some worker does not have a net on it. How many nodes are you using? And what are you passing into CifarApp for the number of workers?
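
A minimal sketch of that isolation step, using plain Scala collections instead of a Spark cluster. `StubStore` and `failingWorkers` are hypothetical names invented for illustration and are not part of SparkNet; only the `get[T]("solver")` call shape is taken from the line above. The idea is to run the lookup alone on each worker and record which ones throw.

```scala
import scala.collection.mutable
import scala.util.Try

// Hypothetical stand-in for a per-worker store; NOT SparkNet's actual class.
class StubStore {
  private val items = mutable.Map.empty[String, Any]

  def put(key: String, value: Any): Unit = {
    items(key) = value
  }

  // Throws NoSuchElementException when the key is absent.
  def get[T](key: String): T = items(key).asInstanceOf[T]
}

object LookupCheck {
  // Run only the lookup on each worker and return the indices that fail.
  def failingWorkers(stores: Seq[StubStore], key: String): Seq[Int] =
    stores.zipWithIndex.collect {
      case (store, i) if Try(store.get[AnyRef](key)).isFailure => i
    }

  def main(args: Array[String]): Unit = {
    val withSolver = new StubStore
    withSolver.put("solver", "placeholder solver")
    val empty = new StubStore // simulates a worker that has no net on it

    println(failingWorkers(Seq(withSolver, empty), "solver"))
  }
}
```

If the list of failing indices is non-empty, the problem is the lookup itself rather than `setWeights`, which points back to the questions about node and worker counts.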

@hckuo2

hckuo2 commented Oct 9, 2016

@robertnishihara I tried that, but the following errors were raised:

F1009 04:25:49.868021  8028 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
*** Check failure stack trace: ***
F1009 04:25:49.868021  8027 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
F1009 04:25:49.868021  8029 blob.cpp:21] Check failed: count_ == other.count() (1000000 vs. 100)
*** Check failure stack trace: ***
Aborted
