
Error while training XGBoost example on K8s #86

rakshithvasudev opened this issue Jun 5, 2020 · 0 comments

rakshithvasudev commented Jun 5, 2020

Hello there,

I was following the guide mentioned here. I ran the Maven build for both the cuda10-0 and cuda10-1 profiles, each against its matching image (CUDA 10.0 and CUDA 10.1), and both error out the same way, as shown below.
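For reference, the two builds were invoked roughly as follows (a sketch from memory; I believe the classifier property is cuda.classifier per the sample project's build guide, so adjust if the pom differs):

# Build the sample apps jar once per CUDA version (property name assumed
# from the sample project's build docs; adjust if your pom differs).
mvn package -Dcuda.classifier=cuda10-0   # for the CUDA 10.0 image
mvn package -Dcuda.classifier=cuda10-1   # for the CUDA 10.1 image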

++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_K8S_CMD=driver
+ case "$SPARK_K8S_CMD" in
+ shift 1
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -n '' ']'
+ PYSPARK_ARGS=
+ '[' -n '' ']'
+ R_ARGS=
+ '[' -n '' ']'
+ '[' '' == 2 ']'
+ '[' '' == 3 ']'
+ case "$SPARK_K8S_CMD" in
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /sbin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=10.45.0.0 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class ai.rapids.spark.examples.mortgage.GPUMain spark-internal -trainDataPath=/rapids-spark/xgboost4j_spark/data/mortgage/csv/train/mortgage_train_merged.csv -evalDataPath=/rapids-spark/xgboost4j_spark/data/mortgage/csv/test/mortgage_eval_merged.csv -format=csv -numWorkers=1 -treeMethod=gpu_hist -numRound=100 -maxDepth=8
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/spark/jars/sample_xgboost_apps-0.1.5-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/06/05 13:38:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/05 13:38:06 INFO SparkContext: Running Spark version 2.4.3
20/06/05 13:38:06 INFO SparkContext: Submitted application: Mortgage-GPU-csv
20/06/05 13:38:06 INFO SecurityManager: Changing view acls to: 185
20/06/05 13:38:06 INFO SecurityManager: Changing modify acls to: 185
20/06/05 13:38:06 INFO SecurityManager: Changing view acls groups to: 
20/06/05 13:38:06 INFO SecurityManager: Changing modify acls groups to: 
20/06/05 13:38:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(185); groups with view permissions: Set(); users  with modify permissions: Set(185); groups with modify permissions: Set()
20/06/05 13:38:07 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
20/06/05 13:38:07 INFO SparkEnv: Registering MapOutputTracker
20/06/05 13:38:07 INFO SparkEnv: Registering BlockManagerMaster
20/06/05 13:38:07 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/06/05 13:38:07 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/06/05 13:38:07 INFO DiskBlockManager: Created local directory at /var/data/spark-b49dff72-85e4-4719-bb58-491ea8e8ab0b/blockmgr-cc59ce7c-1b10-47bc-b098-d737465b96de
20/06/05 13:38:07 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
20/06/05 13:38:07 INFO SparkEnv: Registering OutputCommitCoordinator
20/06/05 13:38:07 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/06/05 13:38:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://spark-1591364282553-driver-svc.spark-rak.svc:4040
20/06/05 13:38:07 INFO SparkContext: Added JAR file:///rapids-spark/xgboost4j_spark/jars/sample_xgboost_apps-0.1.5-jar-with-dependencies.jar at spark://spark-1591364282553-driver-svc.spark-rak.svc:7078/jars/sample_xgboost_apps-0.1.5-jar-with-dependencies.jar with timestamp 1591364287515
20/06/05 13:38:08 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
20/06/05 13:38:08 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079.
20/06/05 13:38:08 INFO NettyBlockTransferService: Server created on spark-1591364282553-driver-svc.spark-rak.svc:7079
20/06/05 13:38:08 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/06/05 13:38:08 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, spark-1591364282553-driver-svc.spark-rak.svc, 7079, None)
20/06/05 13:38:08 INFO BlockManagerMasterEndpoint: Registering block manager spark-1591364282553-driver-svc.spark-rak.svc:7079 with 2004.6 MB RAM, BlockManagerId(driver, spark-1591364282553-driver-svc.spark-rak.svc, 7079, None)
20/06/05 13:38:08 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark-1591364282553-driver-svc.spark-rak.svc, 7079, None)
20/06/05 13:38:08 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark-1591364282553-driver-svc.spark-rak.svc, 7079, None)
20/06/05 13:38:11 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.45.0.23:40096) with ID 1
20/06/05 13:38:11 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/06/05 13:38:11 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
20/06/05 13:38:11 INFO SharedState: Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
20/06/05 13:38:11 INFO BlockManagerMasterEndpoint: Registering block manager 10.45.0.23:40115 with 4.4 GB RAM, BlockManagerId(1, 10.45.0.23, 40115, None)
20/06/05 13:38:12 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint

------ Training ------
20/06/05 13:38:12 INFO XGBoostSpark: GpuDataset Running XGBoost 1.0.0-Beta5 with parameters: 
alpha -> 0.0
min_child_weight -> 30.0
sample_type -> uniform
base_score -> 0.5
colsample_bylevel -> 1.0
grow_policy -> depthwise
skip_drop -> 0.0
lambda_bias -> 0.0
silent -> 0
scale_pos_weight -> 2.0
seed -> 0
cache_training_set -> false
features_col -> features
num_early_stopping_rounds -> 0
label_col -> delinquency_12
num_workers -> 1
subsample -> 1.0
lambda -> 1.0
max_depth -> 8
probability_col -> probability
raw_prediction_col -> rawPrediction
tree_limit -> 0
custom_eval -> null
rate_drop -> 0.0
max_bin -> 16
use_external_memory -> false
objective -> binary:logistic
features_cols -> List(orig_channel, first_home_buyer, loan_purpose, property_type, occupancy_status, property_state, product_type, relocation_mortgage_indicator, seller_name, mod_flag, orig_interest_rate, orig_upb, orig_loan_term, orig_ltv, orig_cltv, num_borrowers, dti, borrower_credit_score, num_units, zip, mortgage_insurance_percent, current_loan_delinquency_status, current_actual_upb, interest_rate, loan_age, msa, non_interest_bearing_upb)
eval_metric -> error
num_round -> 100
timeout_request_workers -> 1800000
missing -> 0.0
checkpoint_path -> 
tracker_conf -> TrackerConf(0,python)
tree_method -> gpu_hist
max_delta_step -> 0.0
eta -> 0.1
max_leaves -> 256
verbosity -> 1
colsample_bytree -> 1.0
normalize_type -> tree
custom_obj -> null
gamma -> 0.1
check_group_integrity -> true
sketch_eps -> 0.03
nthread -> 1
prediction_col -> prediction
checkpoint_interval -> -1
20/06/05 13:38:12 INFO XGBoostSpark: File split in repartition is enabled.
20/06/05 13:38:12 INFO GpuDataset: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes, total bytes is 904414646.
20/06/05 13:38:12 INFO GpuDataset: Config 'spark.rapids.splitFile' is set to true
20/06/05 13:38:12 WARN XGBoostSpark: Missing weight column!
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.45.0.0, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
20/06/05 13:38:12 INFO RabitTracker$TrackerProcessLogger: 2020-06-05 13:38:12,469 INFO start listen on 10.45.0.0:9091
20/06/05 13:38:12 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 211.9 KB, free 2004.4 MB)
20/06/05 13:38:12 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.0 KB, free 2004.4 MB)
20/06/05 13:38:12 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on spark-1591364282553-driver-svc.spark-rak.svc:7079 (size: 19.0 KB, free: 2004.6 MB)
20/06/05 13:38:12 INFO SparkContext: Created broadcast 0 from broadcast at GpuDataset.scala:117
20/06/05 13:38:12 INFO XGBoostSpark: starting GPU training with timeout set as 1800000 msfor waiting for resources
20/06/05 13:38:12 INFO SparkContext: Starting job: foreachPartition at XGBoost.scala:686
20/06/05 13:38:12 INFO DAGScheduler: Got job 0 (foreachPartition at XGBoost.scala:686) with 1 output partitions
20/06/05 13:38:12 INFO DAGScheduler: Final stage: ResultStage 0 (foreachPartition at XGBoost.scala:686)
20/06/05 13:38:12 INFO DAGScheduler: Parents of final stage: List()
20/06/05 13:38:12 INFO DAGScheduler: Missing parents: List()
20/06/05 13:38:12 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at mapPartitions at GpuDataset.scala:131), which has no missing parents
20/06/05 13:38:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.8 KB, free 2004.4 MB)
20/06/05 13:38:12 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.6 KB, free 2004.4 MB)
20/06/05 13:38:12 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on spark-1591364282553-driver-svc.spark-rak.svc:7079 (size: 4.6 KB, free: 2004.6 MB)
20/06/05 13:38:12 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
20/06/05 13:38:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at mapPartitions at GpuDataset.scala:131) (first 15 tasks are for partitions Vector(0))
20/06/05 13:38:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/06/05 13:38:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.45.0.23, executor 1, partition 0, PROCESS_LOCAL, 8507 bytes)
20/06/05 13:38:13 WARN ServletHandler: Error for /api/v1/applications/spark-7fbde8a747bd4eb59056e96d00d16550/executors
java.lang.AbstractMethodError: javax.ws.rs.core.UriBuilder.uri(Ljava/lang/String;)Ljavax/ws/rs/core/UriBuilder;
	at javax.ws.rs.core.UriBuilder.fromUri(UriBuilder.java:119)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:298)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
	at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
	at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
	at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
	at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
	at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.spark_project.jetty.server.Server.handle(Server.java:539)
	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
	at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
	at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:748)
20/06/05 13:38:13 WARN XGBoostSpark: Unable to read total number of alive cores from REST API.Health Check will be ignored.
java.io.IOException: Server returned HTTP response code: 500 for URL: http://spark-1591364282553-driver-svc.spark-rak.svc:4040/api/v1/applications/spark-7fbde8a747bd4eb59056e96d00d16550/executors
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
	at java.net.URL.openStream(URL.java:1068)
	at org.codehaus.jackson.JsonFactory._optimizedStreamFromURL(JsonFactory.java:935)
	at org.codehaus.jackson.JsonFactory.createJsonParser(JsonFactory.java:530)
	at org.codehaus.jackson.map.ObjectMapper.readTree(ObjectMapper.java:1590)
	at org.apache.spark.SparkParallelismTracker.org$apache$spark$SparkParallelismTracker$$numAliveCores(SparkParallelismTracker.scala:56)
	at org.apache.spark.SparkParallelismTracker$$anonfun$executeHonorForGpu$1.apply$mcZ$sp(SparkParallelismTracker.scala:160)
	at org.apache.spark.SparkParallelismTracker$$anonfun$1.apply$mcV$sp(SparkParallelismTracker.scala:100)
	at org.apache.spark.SparkParallelismTracker$$anonfun$1.apply(SparkParallelismTracker.scala:100)
	at org.apache.spark.SparkParallelismTracker$$anonfun$1.apply(SparkParallelismTracker.scala:100)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20/06/05 13:38:13 WARN ServletHandler: Error for /api/v1/applications/spark-7fbde8a747bd4eb59056e96d00d16550/executors
java.lang.AbstractMethodError: javax.ws.rs.core.UriBuilder.uri(Ljava/lang/String;)Ljavax/ws/rs/core/UriBuilder;
	at javax.ws.rs.core.UriBuilder.fromUri(UriBuilder.java:119)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:298)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:228)
	at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
	at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
	at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
	at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
	at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.spark_project.jetty.server.Server.handle(Server.java:539)
	at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
	at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
	at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
	at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:748)
20/06/05 13:38:13 WARN XGBoostSpark: Unable to read total number of alive worker from REST API.Health Check will be ignored.
java.io.IOException: Server returned HTTP response code: 500 for URL: http://spark-1591364282553-driver-svc.spark-rak.svc:4040/api/v1/applications/spark-7fbde8a747bd4eb59056e96d00d16550/executors
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1900)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
	at java.net.URL.openStream(URL.java:1068)
	at org.codehaus.jackson.JsonFactory._optimizedStreamFromURL(JsonFactory.java:935)
	at org.codehaus.jackson.JsonFactory.createJsonParser(JsonFactory.java:530)
	at org.codehaus.jackson.map.ObjectMapper.readTree(ObjectMapper.java:1590)
	at org.apache.spark.SparkParallelismTracker.org$apache$spark$SparkParallelismTracker$$numAliveWorkers(SparkParallelismTracker.scala:80)
	at org.apache.spark.SparkParallelismTracker$$anonfun$executeHonorForGpu$1.apply$mcZ$sp(SparkParallelismTracker.scala:160)
	at org.apache.spark.SparkParallelismTracker$$anonfun$1.apply$mcV$sp(SparkParallelismTracker.scala:100)
	at org.apache.spark.SparkParallelismTracker$$anonfun$1.apply(SparkParallelismTracker.scala:100)
	at org.apache.spark.SparkParallelismTracker$$anonfun$1.apply(SparkParallelismTracker.scala:100)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
20/06/05 13:38:15 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.45.0.23:40115 (size: 4.6 KB, free: 4.4 GB)
20/06/05 13:38:18 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.45.0.23, executor 1): java.lang.RuntimeException: Error running cudaGetDevice after allocation
	at ml.dmlc.xgboost4j.java.XGBoostSparkJNI.allocateGpuDevice(Native Method)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$appendGpuIdToParameters(XGBoost.scala:597)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForGpuDataset$1.apply(XGBoost.scala:626)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForGpuDataset$1.apply(XGBoost.scala:625)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset$$anonfun$ml$dmlc$xgboost4j$scala$spark$rapids$GpuDataset$$getBatchMapper$1.apply(GpuDataset.scala:516)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset$$anonfun$ml$dmlc$xgboost4j$scala$spark$rapids$GpuDataset$$getBatchMapper$1.apply(GpuDataset.scala:515)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

20/06/05 13:38:18 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 10.45.0.23, executor 1, partition 0, PROCESS_LOCAL, 8507 bytes)
20/06/05 13:38:18 ERROR XGBoostTaskFailedListener: Training Task Failed during XGBoost Training: ExceptionFailure(java.lang.RuntimeException,Error running cudaGetDevice after allocation,[Ljava.lang.StackTraceElement;@3d094646,java.lang.RuntimeException: Error running cudaGetDevice after allocation
	at ml.dmlc.xgboost4j.java.XGBoostSparkJNI.allocateGpuDevice(Native Method)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$appendGpuIdToParameters(XGBoost.scala:597)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForGpuDataset$1.apply(XGBoost.scala:626)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$ml$dmlc$xgboost4j$scala$spark$XGBoost$$trainForGpuDataset$1.apply(XGBoost.scala:625)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset$$anonfun$ml$dmlc$xgboost4j$scala$spark$rapids$GpuDataset$$getBatchMapper$1.apply(GpuDataset.scala:516)
	at ml.dmlc.xgboost4j.scala.spark.rapids.GpuDataset$$anonfun$ml$dmlc$xgboost4j$scala$spark$rapids$GpuDataset$$getBatchMapper$1.apply(GpuDataset.scala:515)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
,Some(org.apache.spark.ThrowableSerializationWrapper@33ceaf97),Vector(AccumulableInfo(2,None,Some(3075),None,false,true,None), AccumulableInfo(4,None,Some(0),None,false,true,None)),Vector(LongAccumulator(id: 2, name: Some(internal.metrics.executorRunTime), value: 3075), LongAccumulator(id: 4, name: Some(internal.metrics.resultSize), value: 0))), stopping SparkContext
20/06/05 13:38:18 INFO SparkUI: Stopped Spark web UI at http://spark-1591364282553-driver-svc.spark-rak.svc:4040
20/06/05 13:38:18 INFO DAGScheduler: Job 0 failed: foreachPartition at XGBoost.scala:686, took 5.377021 s
20/06/05 13:38:18 INFO DAGScheduler: ResultStage 0 (foreachPartition at XGBoost.scala:686) failed in 5.340 s due to Stage cancelled because SparkContext was shut down
20/06/05 13:38:18 ERROR RabitTracker: Uncaught exception thrown by worker:
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:932)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:930)
	at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
	at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:930)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2128)
	at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
	at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2041)
	at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
	at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
	at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply$mcV$sp(SparkParallelismTracker.scala:197)
	at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:197)
	at org.apache.spark.TaskFailedListener$$anon$1$$anonfun$run$1.apply(SparkParallelismTracker.scala:197)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at org.apache.spark.TaskFailedListener$$anon$1.run(SparkParallelismTracker.scala:196)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:935)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:933)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:933)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributedForGpuDataset$1$$anon$1.run(XGBoost.scala:686)
20/06/05 13:38:18 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
20/06/05 13:38:18 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
20/06/05 13:38:18 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
20/06/05 13:38:18 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 10.45.0.23, executor 1: java.lang.RuntimeException (Error running cudaGetDevice after allocation) [duplicate 1]
20/06/05 13:38:18 ERROR Utils: Uncaught exception in thread task-result-getter-1
org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
	at org.apache.spark.scheduler.TaskSchedulerImpl.handleFailedTask(TaskSchedulerImpl.scala:623)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:150)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4$$anonfun$run$2.apply(TaskResultGetter.scala:132)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4$$anonfun$run$2.apply(TaskResultGetter.scala:132)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4.run(TaskResultGetter.scala:132)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "task-result-getter-1" java.lang.Error: org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Could not find CoarseGrainedScheduler.
	at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
	at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:140)
	at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:187)
	at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:528)
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.reviveOffers(CoarseGrainedSchedulerBackend.scala:449)
	at org.apache.spark.scheduler.TaskSchedulerImpl.handleFailedTask(TaskSchedulerImpl.scala:623)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4$$anonfun$run$2.apply$mcV$sp(TaskResultGetter.scala:150)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4$$anonfun$run$2.apply(TaskResultGetter.scala:132)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4$$anonfun$run$2.apply(TaskResultGetter.scala:132)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
	at org.apache.spark.scheduler.TaskResultGetter$$anon$4.run(TaskResultGetter.scala:132)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	... 2 more
20/06/05 13:38:18 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/06/05 13:38:18 INFO MemoryStore: MemoryStore cleared
20/06/05 13:38:18 INFO BlockManager: BlockManager stopped
20/06/05 13:38:18 INFO BlockManagerMaster: BlockManagerMaster stopped
20/06/05 13:38:18 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/06/05 13:38:18 INFO SparkContext: Successfully stopped SparkContext
20/06/05 13:38:23 INFO RabitTracker$TrackerProcessLogger: Tracker Process ends with exit code 143
20/06/05 13:38:23 INFO RabitTracker: Tracker Process ends with exit code 143
20/06/05 13:38:23 INFO XGBoostSpark: GpuDataset Rabit returns with exit code 143
Exception in thread "main" ml.dmlc.xgboost4j.java.XGBoostError: XGBoostModel training failed
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.ml$dmlc$xgboost4j$scala$spark$XGBoost$$postTrackerReturnProcessing(XGBoost.scala:886)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributedForGpuDataset$1.apply(XGBoost.scala:693)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$$anonfun$trainDistributedForGpuDataset$1.apply(XGBoost.scala:675)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.immutable.List.map(List.scala:296)
	at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainDistributedForGpuDataset(XGBoost.scala:674)
	at ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier.fit(XGBoostClassifier.scala:223)
	at ai.rapids.spark.examples.mortgage.GPUMain$$anonfun$6.apply(GPUMain.scala:74)
	at ai.rapids.spark.examples.mortgage.GPUMain$$anonfun$6.apply(GPUMain.scala:74)
	at ai.rapids.spark.examples.utility.Benchmark.time(Benchmark.scala:28)
	at ai.rapids.spark.examples.mortgage.GPUMain$.main(GPUMain.scala:73)
	at ai.rapids.spark.examples.mortgage.GPUMain.main(GPUMain.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/06/05 13:38:23 INFO ShutdownHookManager: Shutdown hook called
20/06/05 13:38:23 INFO ShutdownHookManager: Deleting directory /var/data/spark-b49dff72-85e4-4719-bb58-491ea8e8ab0b/spark-0ba4cc2e-6ce0-47ca-b5eb-ed7d16a89fec
20/06/05 13:38:23 INFO ShutdownHookManager: Deleting directory /tmp/spark-8cc0a04c-b169-489b-8ac9-03b64a995e37

Here are my submission variables:

export SPARK_HOME=`pwd`
export DATA_PATH=/rapids-spark/xgboost4j_spark/data 
export JARS_PATH=/rapids-spark/xgboost4j_spark/jars
export SPARK_MASTER=k8s://https://k8s-ip
export TEMPLATE_PATH=$SPARK_HOME/gpu_executor_template.yaml
export SPARK_DOCKER_IMAGE=rakshithvasudev/spark-py-operator
export SPARK_DOCKER_TAG=rapids-metrics-2.4.3-cu10
export NAMESPACE=spark-rak
export K8S_ACCOUNT=spark
export SPARK_DEPLOY_MODE=cluster
export SPARK_NUM_EXECUTORS=1
export SPARK_DRIVER_MEMORY=4g
export SPARK_EXECUTOR_MEMORY=8g
export EXAMPLE_CLASS=ai.rapids.spark.examples.mortgage.GPUMain
export JAR_EXAMPLE=${JARS_PATH}/sample_xgboost_apps-0.1.5-jar-with-dependencies.jar
export TREE_METHOD=gpu_hist



${SPARK_HOME}/bin/spark-submit                                                          \
  --master ${SPARK_MASTER}                                                              \
  --deploy-mode ${SPARK_DEPLOY_MODE}                                                    \
  --class ${EXAMPLE_CLASS}                                                              \
  --conf spark.executor.instances=${SPARK_NUM_EXECUTORS}                                \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=${K8S_ACCOUNT}         \
  --conf spark.kubernetes.container.image=${SPARK_DOCKER_IMAGE}:${SPARK_DOCKER_TAG}     \
  --conf spark.kubernetes.driver.podTemplateFile=${TEMPLATE_PATH}                       \
  --conf spark.kubernetes.executor.podTemplateFile=${TEMPLATE_PATH}                     \
  --conf spark.kubernetes.namespace=$NAMESPACE                                          \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-4103da68-1e6b-4c1a-8b99-a2731ae574e6.options.claimName=rapids-pvc \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-4103da68-1e6b-4c1a-8b99-a2731ae574e6.options.claimName=rapids-pvc \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-4103da68-1e6b-4c1a-8b99-a2731ae574e6.mount.path=/rapids-spark \
  --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-4103da68-1e6b-4c1a-8b99-a2731ae574e6.mount.path=/rapids-spark \
  ${JAR_EXAMPLE}                                                                        \
  -trainDataPath=${DATA_PATH}/mortgage/csv/train/mortgage_train_merged.csv              \
  -evalDataPath=${DATA_PATH}/mortgage/csv/test/mortgage_eval_merged.csv                 \
  -format=csv                                                                           \
  -numWorkers=${SPARK_NUM_EXECUTORS}                                                    \
  -treeMethod=${TREE_METHOD}                                                            \
  -numRound=100                                                                         \
  -maxDepth=8
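
For context, the gpu_executor_template.yaml referenced above is meant to request a GPU for each pod. A minimal template of that shape looks like the sketch below (illustrative only, not my exact file; the container name and limit value are placeholders):

# Partial pod spec used as a Spark pod template; Spark fills in or
# overrides most fields, the GPU resource limit is what matters here.
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: executor            # placeholder container name
      resources:
        limits:
          nvidia.com/gpu: 1     # one GPU per executor pod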

Any help would be appreciated.
