Error: NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize #24

Open
fy88fy opened this issue Dec 23, 2020 · 17 comments

@fy88fy

fy88fy commented Dec 23, 2020

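Hello, I have a question.
I am testing PyTorch-On-Angel and the job fails on submission. The class I can find in the code is com.tencent.angel.graph.utils.params.HasBatchSize, not the com.tencent.angel.spark.ml.graph.params.HasBatchSize named in the error. What could be causing this?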

param mode = yarn-client
Exception in thread "main" java.lang.NoClassDefFoundError: com/tencent/angel/spark/ml/graph/params/HasBatchSize
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:53)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.tencent.angel.spark.ml.graph.params.HasBatchSize
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 25 more

@ouyangwen-it
Contributor

ouyangwen-it commented Dec 30, 2020

Which branch are you using, and which Spark version? Try the latest 0.2.1 branch.
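
A NoClassDefFoundError of this kind often means the jars on the classpath were built from different versions. One way to check is to see which jar, if any, actually contains the class named in the error. The following is a minimal sketch, assuming `unzip` is installed; the function names `class_to_entry` and `jar_has_class` and the `./lib` path are hypothetical, not part of PyTorch-On-Angel.

```shell
#!/bin/sh
# Hypothetical helpers for checking which jar provides a class.

# Convert a fully qualified class name to its entry path inside a jar.
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the jar lists the class; relies on `unzip -l` printing entry names.
jar_has_class() {
    unzip -l "$1" 2>/dev/null | grep -qF "$(class_to_entry "$2")"
}

# Example (the ./lib path is a placeholder):
# for jar in ./lib/*.jar; do
#     jar_has_class "$jar" com.tencent.angel.spark.ml.graph.params.HasBatchSize && echo "$jar"
# done
```

If no jar lists the old package path but one lists com/tencent/angel/graph/utils/params/HasBatchSize.class, the application jar and the Angel jars were probably built against different versions.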

@dongxuej

I am using the master branch of pytorch-on-angel, and Spark is 2.4.5U2. We run in a CPU-only environment.

@ouyangwen-it
Contributor

ouyangwen-it commented Dec 30, 2020

Use the 0.2.1 branch, and use Angel 3.1.0 for the Angel environment.

@dongxuej

Our Angel is also built from master. Is that the same as 3.1.0, or do I have to download branch 3.1.0?

@ouyangwen-it
Contributor

master is fine.

@dongxuej

dongxuej commented Jan 4, 2021

The ClassNotFound problem is solved, but I have now run into a new one:

Exception in thread "main" java.lang.UnsatisfiedLinkError: no torch_angel in java.library.path

The parameters I pass are as follows:
input=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/fanqizha/subgraph/input/20191231/
output=hdfs://jinrong-hadoop3-1v/home/hdp-jinrong-stargraph/jiadongxue/angel/model/20191231_deepfm/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
    --conf spark.ps.instances=5 \
    --conf spark.ps.cores=1 \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.memory=5g \
    --conf spark.ps.log.level=INFO \
    --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
    --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
    --conf spark.executor.extraLibraryPath=./torch/lib \
    --conf spark.driver.extraLibraryPath=./torch/lib \
    --conf spark.executorEnv.OMP_NUM_THREADS=2 \
    --conf spark.executorEnv.MKL_NUM_THREADS=2 \
    --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
    --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
    --jars $SONA_SPARK_JARS \
    --name "deepfm for torch on angel" \
    --archives /home/work/software/angel/bin/torchlib.zip#torch \
    --files /home/work/software/angel/bin/deepfm.pt \
    --driver-memory 5g \
    --num-executors 5 \
    --executor-cores 1 \
    --executor-memory 5g \
    --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
    ./pytorch-on-angel-0.2.1.jar \
    trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
    stepSize:0.001 numEpoch:10 testRatio:0.1 \
    angelModelOutputPath:$output mode:yarn-client

The error is as follows:
image

@ouyangwen-it
Contributor

ouyangwen-it commented Jan 4, 2021

image
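The directory layout of your unpacked dependency archive probably does not match the paths you configured. Spark's --archives option unpacks the archive from HDFS into the executor's working directory, under a directory named after the alias that follows the #. The resulting path should therefore be ./torch/(the directory structure inside your archive).
--archives should be given an HDFS path.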
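
The layout check described here can be sketched in shell. This is a minimal sketch, assuming a Linux cluster, where the JVM resolves System.loadLibrary("torch_angel") to a file named libtorch_angel.so somewhere on java.library.path. The function name `find_native_lib` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory holds the shared library
# that System.loadLibrary("<name>") looks for on Linux (lib<name>.so).
find_native_lib() {
    if [ -f "$1/lib$2.so" ]; then
        echo "found: $1/lib$2.so"
        return 0
    fi
    echo "missing: $1/lib$2.so" >&2
    return 1
}

# Example: with --archives torchlib.zip#torch, an archive whose top-level
# directory is lib/ is expected to appear as ./torch/lib in the container.
# find_native_lib ./torch/lib torch_angel
```

Running the same check against a local unpack of torchlib.zip shows whether the -Djava.library.path entries point at the directory that actually holds the library.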

@fy88fy
Author

fy88fy commented Jan 4, 2021

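Hello, I have now run into a problem: the job cannot be submitted to YARN because the deepfm.pt file cannot be found on HDFS. Could you take a look?
The script is configured as follows: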

#!/bin/bash
input=hdfs://xxxx-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://xxxx-1v/home/hdp/jia/angel/model/20191231_louvain/
source ./spark-on-angel-env.sh
echo "------------------"
#JAVA_LIBRARY_PATH=/home/work/software/java/lib
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib
echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
       --conf spark.ps.instances=5 \
       --conf spark.ps.cores=1 \
       --conf spark.ps.jars=$SONA_ANGEL_JARS \
       --conf spark.ps.memory=5g \
       --conf spark.ps.log.level=INFO \
       --archives hdfs://XXXX-hadoop3-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
       --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:./torch/lib \
       --conf spark.executor.extraLibraryPath=./torch/lib \
       --conf spark.driver.extraLibraryPath=./torch/lib \
       --conf spark.executorEnv.OMP_NUM_THREADS=2 \
       --conf spark.executorEnv.MKL_NUM_THREADS=2 \
       --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
       --conf spark.hadoop.fs.defaultFS=hdfs://xxxx-hadoop3-1v/ \
       --jars $SONA_SPARK_JARS  \
       --name "deepfm for torch on angel" \
       --files deepfm.pt \
       --driver-memory 5g \
       --num-executors 5 \
       --executor-cores 1 \
       --executor-memory 5g \
       --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
       ./pytorch-on-angel-0.2.1.jar \
       trainInput:$input batchSize:128 torchModelPath:deepfm.pt \
       stepSize:0.001 numEpoch:10 testRatio:0.1 \
       angelModelOutputPath:$output mode:yarn-client

image

@ouyangwen-it
Contributor

Could you try submitting the Spark job in yarn-cluster mode?

@fy88fy
Author

fy88fy commented Jan 5, 2021

After switching to yarn-cluster, I get the following error:
image

image

@ouyangwen-it
Contributor

What does the directory structure look like after you unpack torchlib.zip?

@fy88fy
Author

fy88fy commented Jan 5, 2021

torchlib.zip unpacks to a lib directory, and lib contains many .a files.
image

@ouyangwen-it
Contributor

Could you print the current working directory inside RecommendationExample and check whether it contains torch/lib?

@fy88fy
Author

fy88fy commented Jan 13, 2021

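Hello, in the end I configured the environment on every node of the cluster, and yarn-client mode now works. But occasionally I get the error below. What causes it?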

21/01/13 13:07:37 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:38 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:39 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:40 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:41 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:42 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:43 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:44 INFO Client: Application report for application_1609301285435_0588 (state: ACCEPTED)
21/01/13 13:07:45 INFO Client: Application report for application_1609301285435_0588 (state: RUNNING)
21/01/13 13:07:45 INFO Client: 
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: 10.x.121.219
	 ApplicationMaster RPC port: -1
	 queue: root.default
	 start time: 1610514454704
	 final status: UNDEFINED
	 tracking URL: http://xxx.net:8888/proxy/application_1609301285435_0588/
	 user: user-2001
numDataPartitions=7500
numDataPartitions=7500
type: BIAS_WEIGHT_EMBEDDING_MATS name:DeepFM mats_dims: 130,10,1,1,10,5,1,1,5,1,1,1
optimizer: AsyncAdam eta=0.001 decay=0.001
from driver start Angel PS! 
AppMaster capability = <memory:2048, vCores:1, gCores:0>
validate_auc=0.8820555586167144 time=12161ms                                     
train_auc=0.8922190567461854 validate_auc=0.8954355298794986 time=3696ms        
train_auc=0.9007353906278361 validate_auc=0.9038945708702969 time=2840ms        
train_auc=0.9039252388499588 validate_auc=0.9064111964938902 time=2716ms        
train_auc=0.9068299665301356 validate_auc=0.9090327256697933 time=2602ms        
train_auc=0.9075874962681636 validate_auc=0.909377106018285 time=2575ms         
train_auc=0.9093007058840281 validate_auc=0.9113024570743141 time=2535ms        
train_auc=0.9108590019247338 validate_auc=0.9131101889959353 time=2603ms        
train_auc=0.9103151686457647 validate_auc=0.9114020624674163 time=2616ms        
train_auc=0.9114118253731684 validate_auc=0.9126884555230131 time=2597ms        
Exception in thread "main" com.tencent.angel.exception.AngelException: Model save falied Detail failed log:
ParameterServer_0:save task com.tencent.angel.model.PSMatricesSaveContext@6fc796e8 failed:java.io.IOException: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
	at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
	at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
	at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
	at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
	at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
	at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

	at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:154)
	at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:121)
	at com.tencent.angel.ps.io.save.PSModelSaver.lambda$save$9(PSModelSaver.java:184)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: SAVE matrix failed:java.io.IOException: SAVE model partitions failed.null
	at com.tencent.angel.model.output.format.MatrixFormatImpl.save(MatrixFormatImpl.java:135)
	at com.tencent.angel.ps.io.PSModelIOExecutor.saveMatrix(PSModelIOExecutor.java:237)
	at com.tencent.angel.ps.io.PSModelIOExecutor.process(PSModelIOExecutor.java:216)
	at com.tencent.angel.ps.io.PSModelIOExecutor.access$000(PSModelIOExecutor.java:39)
	at com.tencent.angel.ps.io.PSModelIOExecutor$MatrixDiskIOOp.compute(PSModelIOExecutor.java:195)
	at java.util.concurrent.RecursiveAction.exec(RecursiveAction.java:189)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

	at com.tencent.angel.ps.io.PSModelIOExecutor.save(PSModelIOExecutor.java:151)
	... 3 more


	at com.tencent.angel.client.AngelClient.isSaveCompleted(AngelClient.java:670)
	at com.tencent.angel.client.AngelClient.save(AngelClient.java:381)
	at com.tencent.angel.client.AngelPSClient.save(AngelPSClient.java:146)
	at com.tencent.angel.spark.context.AngelPSContext$.save(AngelPSContext.scala:300)
	at com.tencent.angel.spark.context.AngelPSContext.save(AngelPSContext.scala:258)
	at com.tencent.angel.pytorch.recommendation.RecommendPSModel.savePSModel(RecommendPSModel.scala:234)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample$.main(RecommendationExample.scala:88)
	at com.tencent.angel.pytorch.examples.supervised.RecommendationExample.main(RecommendationExample.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

input=hdfs://XXX-1v/home/yarn/pytorch-on-angel/census_148d_train.libsvm.tmp
output=hdfs://XXX-1v/home/user/pytorch-on-angel/${DT}
source /home/work/software/angel/bin/spark-on-angel-env.sh
echo "------------------"
JAVA_LIBRARY_PATH=/home/work/software/angel/lib:/home/work/software/java/lib

echo $JAVA_LIBRARY_PATH
$SPARK_HOME/bin/spark-submit \
    --conf spark.ps.instances=15 \
    --conf spark.ps.cores=1 \
    --conf spark.ps.jars=$SONA_ANGEL_JARS \
    --conf spark.ps.memory=5g \
    --conf spark.ps.log.level=INFO \
    --archives hdfs://XXX-1v/home/yarn/pytorch-on-angel/torchlib.zip#torch \
    --conf spark.driver.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
    --conf spark.executor.extraJavaOptions=-Djava.library.path=$JAVA_LIBRARY_PATH:.:/home/work/software/angel/bin/torch/lib \
    --conf spark.executor.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
    --conf spark.driver.extraLibraryPath=/home/work/software/angel/bin/torch/lib \
    --conf spark.executorEnv.OMP_NUM_THREADS=2 \
    --conf spark.executorEnv.MKL_NUM_THREADS=2 \
    --conf spark.executorEnv.JAVA_HOME=/home/work/software/java \
    --conf spark.yarn.appMasterEnv.JAVA_HOME=/home/work/software/java \
    --conf spark.hadoop.fs.defaultFS=hdfs://XXX-1v/ \
    --jars $SONA_SPARK_JARS \
    --name "deepfm for torch" \
    --driver-memory 5g \
    --num-executors 15 \
    --executor-cores 5 \
    --executor-memory 8g \
    --class com.tencent.angel.pytorch.examples.supervised.RecommendationExample \
    /home/work/software/angel/bin/pytorch-on-angel-0.2.1.jar.old.jar \
    trainInput:$input batchSize:128 torchModelPath:/home/work/software/angel/bin/deepfm.pt \
    stepSize:0.001 numEpoch:10 testRatio:0.1 \
    angelModelOutputPath:$output mode:yarn-client

@ouyangwen-it
Contributor

The log shows that the error occurs while saving the model. You can check the logs on the PS side.

@fy88fy
Author

fy88fy commented Jan 15, 2021

The PS log shows the following errors:
image
image
image
image

@ouyangwen-it
Contributor

Look at the log of the specific PS that failed, ParameterServer_0. For how to view it, see the documentation: https://github.com/Angel-ML/angel/wiki/%E5%B7%A5%E7%A8%8B%E5%B8%B8%E8%A7%81%E9%97%AE%E9%A2%98
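
Independently of the procedure on the linked wiki page, one generic way to narrow down a failing PS is to save the aggregated YARN logs to a file and filter them. The following is a minimal sketch, assuming YARN log aggregation is enabled and that the PS name appears in the saved log. The function name `ps_errors` and the file name `app.log` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print exception lines (with line numbers) that appear
# at or after the first mention of a given parameter server in a saved log.
ps_errors() {
    awk -v ps="$2" '
        index($0, ps) { seen = 1 }
        seen && /Exception|Caused by/ { print NR ": " $0 }
    ' "$1"
}

# Example: fetch the aggregated logs first, then filter them.
# yarn logs -applicationId application_1609301285435_0588 > app.log
# ps_errors app.log ParameterServer_0
```

The first "Caused by" line printed for ParameterServer_0 is usually a better starting point than the driver-side AngelException, which only reports that the save failed.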
