[QST] SparkException ERROR ContextCleaner: Error cleaning broadcast #5328

unigrav-tech · 2022-04-22T03:34:02Z

unigrav-tech
Apr 22, 2022

Hi, it reports many errors when running spark with rapids on data generated by TPC-DS.
And it is OK when running spark without rapids.
Any help will be appreciated.

Env
32 cores, 256GB memory, Nvidia RTX5000
NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6
spark-3.2.1-bin-hadoop2.7
cudf-22.02.0-cuda11.jar
rapids-4-spark_2.12-22.02.0.jar
spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar

2.start spark standalone ( 1 master and 3 works on the same machine)
sh sbin/start-master.sh
sh sbin/start-worker.sh spark://10.1.164.41:7077 -c 8 -m 64G

open spark-shell

$SPARK_HOME/bin/spark-shell --jars ${SPARK_SQL_PERF_JAR},${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}
--master spark://10.1.164.41:7077
--deploy-mode client
--executor-memory 4G
--num-executors 2
--executor-cores 2
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.rapids.memory.pinnedPool.size=2G
--conf spark.locality.wait=0s
--conf spark.sql.files.maxPartitionBytes=512m
--conf spark.plugins=com.nvidia.spark.SQLPlugin

perf testing (execute the script on spark shell)

import com.databricks.spark.sql.perf.tpcds.TPCDS

// Note: Declare "sqlContext" for Spark 2.x version
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val tpcds = new TPCDS (sqlContext = sqlContext)
// Set:
val dataSize = "10g"
val databaseName = "tpcds" + dataSize // name of database with TPCDS data.
sql(s"use $databaseName")
val resultLocation = "/home/spark/result/tpcds_results_gpu" + dataSize // place to write results
val iterations = 1 // how many iterations of queries to run.
val queries = tpcds.tpcds2_4Queries // queries to run.
val timeout = 46060 // timeout, in seconds.
// Run:
val experiment = tpcds.runExperiment(
queries,
iterations = iterations,
resultLocation = resultLocation,
forkThread = true)
experiment.waitForFinish(timeout)

error like:
`
22/04/22 00:35:27 ERROR ContextCleaner: Error cleaning broadcast 1879
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:194)
at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:351)
at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:78)
at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:254)
at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3(ContextCleaner.scala:204)
at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$3$adapted(ContextCleaner.scala:195)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.ContextCleaner.$anonfun$keepCleaning$1(ContextCleaner.scala:195)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1404)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:189)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:79)
Caused by: java.lang.RuntimeException: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
at org.apache.spark.rpc.netty.Dispatcher.postRemoteMessage(Dispatcher.scala:136)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
```
 at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:209)
 at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
 at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
 at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
 at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
 at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
 at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
 at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
 at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
 at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
 at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.lang.Thread.run(Thread.java:748)
```

22/04/22 00:35:27 WARN BlockManagerMaster: Failed to remove broadcast 1881 with removeFromMaster = true - org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:176)
at org.apache.spark.rpc.netty.Dispatcher.postRemoteMessage(Dispatcher.scala:136)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:684)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:163)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
`

Answered by tgravescs

Apr 22, 2022

It sounds like your environment is not setup properly. it looks like you are not properly using Spark GPU scheduling and executors are trying to use the same GPU. Alternatively if you didn't want to use GPU scheduling you could set Gpus up in process exclusive mode. But since you are using standalone its probably easiest just to configure it to do GPU scheduling.

2.start spark standalone ( 1 master and 3 works on the same machine)

Is there a reason you are using 3 workers on the same machine? You should just use a single worker since you only have 1 GPU.

please see instructions here: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-clu…

View full answer

nvliyuan · 2022-04-22T04:38:32Z

nvliyuan
Apr 22, 2022
Collaborator

// Note: Declare "sqlContext" for Spark 2.x version

what is the spark version? can you please try spark-3.2.1-bin-hadoop3.3?
it seems the executor fails, can you please also share the executor log?

0 replies

unigrav-tech · 2022-04-22T11:28:45Z

unigrav-tech
Apr 22, 2022
Author

what is the spark version? can you please try spark-3.2.1-bin-hadoop3.3? it seems the executor fails, can you please also share the executor log?

spark version is 3.2.1

Only spark is using GPUs. The memory size of GPU is 16GB, and we have 2GPU on the machine.

the log for some executors:
1. less than the minimum allocation

22/04/22 19:01:10 INFO RapidsExecutorPlugin: RAPIDS Accelerator build: {version=22.02.0, user=, url=https://github.com/NVIDIA/spark-rapids.git, date=2022-02-14T10:37:11Z, revision=a32ec69e67ff0f8adf85ab6b2665f6a1c751dac0, cudf_version=22.02.0, branch=HEAD}
22/04/22 19:01:10 INFO RapidsExecutorPlugin: cudf build: {version=22.02.0, user=, date=2022-02-06T12:32:11Z, revision=774d859fef2cb242dd8314d50b9c1e038468e266, branch=HEAD}
22/04/22 19:01:11 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin
22/04/22 19:01:17 ERROR RapidsExecutorPlugin: Exception in the executor plugin
java.lang.IllegalArgumentException: The pool allocation of -564.0625 MB (calculated from spark.rapids.memory.gpu.allocFraction (=1.0) and 75.9375 MB free memory) was less than the minimum allocation of 806.146484375 (calculated from spark.rapids.memory.gpu.minAllocFraction (=0.05) and 16122.9375 MB total memory)
at com.nvidia.spark.rapids.GpuDeviceManager$.computeRmmPoolSize(GpuDeviceManager.scala:197)
at com.nvidia.spark.rapids.GpuDeviceManager$.initializeRmm(GpuDeviceManager.scala:233)
at com.nvidia.spark.rapids.GpuDeviceManager$.initializeMemory(GpuDeviceManager.scala:328)
at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:137)
at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:219)
at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:125)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.internal.plugin.ExecutorPluginContainer.(PluginContainer.scala:113)
at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:211)
at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:199)
at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:253)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:231)
at org.apache.spark.executor.Executor.(Executor.scala:253)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:169)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/04/22 19:01:17 INFO MemoryStore: MemoryStore cleared
22/04/22 19:01:17 INFO BlockManager: BlockManager stopped
22/04/22 19:04:04 ERROR CoarseGrainedExecutorBackend: RECE

for other executors:
2. Could not find a single GPU to use

22/04/22 19:10:11 INFO RapidsExecutorPlugin: RAPIDS Accelerator build: {version=22.02.0, user=, url=https://github.com/NVIDIA/spark-rapids.git, date=2022-02-14T10:37:11Z, revision=a32ec69e67ff0f8adf85ab6b2665f6a1c751dac0, cudf_version=22.02.0, branch=HEAD}
22/04/22 19:10:11 INFO RapidsExecutorPlugin: cudf build: {version=22.02.0, user=, date=2022-02-06T12:32:11Z, revision=774d859fef2cb242dd8314d50b9c1e038468e266, branch=HEAD}
22/04/22 19:10:12 INFO RapidsExecutorPlugin: Initializing memory from Executor Plugin
22/04/22 19:10:18 INFO GpuDeviceManager: Will not use GPU 0 because of ai.rapids.cudf.CudaException: out of memory
22/04/22 19:10:18 INFO GpuDeviceManager: Will not use GPU 1 because of ai.rapids.cudf.CudaException: out of memory
22/04/22 19:10:18 INFO GpuDeviceManager: Will not use GPU 0 because of ai.rapids.cudf.CudaException: out of memory
22/04/22 19:10:18 INFO GpuDeviceManager: Will not use GPU 1 because of ai.rapids.cudf.CudaException: out of memory
22/04/22 19:10:18 ERROR RapidsExecutorPlugin: Exception in the executor plugin
java.lang.IllegalStateException: Could not find a single GPU to use
at com.nvidia.spark.rapids.GpuDeviceManager$.findGpuAndAcquire(GpuDeviceManager.scala:90)
at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeMemory$1(GpuDeviceManager.scala:327)
at scala.runtime.java8.JFunction0$mcI$sp.apply(JFunction0$mcI$sp.java:23)
at scala.Option.getOrElse(Option.scala:189)
at com.nvidia.spark.rapids.GpuDeviceManager$.initializeMemory(GpuDeviceManager.scala:327)
at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:137)
at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:219)
at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:125)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.internal.plugin.ExecutorPluginContainer.(PluginContainer.scala:113)
at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:211)
at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:199)
at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:253)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:231)
at org.apache.spark.executor.Executor.(Executor.scala:253)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:169)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/04/22 19:10:18 INFO MemoryStore: MemoryStore cleared
22/04/22 19:10:18 INFO BlockManager: BlockManager stopped

3. nvidia-smi

0 replies

tgravescs · 2022-04-22T12:50:29Z

tgravescs
Apr 22, 2022
Maintainer

It sounds like your environment is not setup properly. it looks like you are not properly using Spark GPU scheduling and executors are trying to use the same GPU. Alternatively if you didn't want to use GPU scheduling you could set Gpus up in process exclusive mode. But since you are using standalone its probably easiest just to configure it to do GPU scheduling.

2.start spark standalone ( 1 master and 3 works on the same machine)

Is there a reason you are using 3 workers on the same machine? You should just use a single worker since you only have 1 GPU.

please see instructions here: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster

Specifically worker setup section:

Setup worker configs on each node
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
Edit $SPARK_HOME/conf/spark-env.sh and add any worker options. The example below sets the number of GPUs per worker to 1 and points to the discovery script. Change this for your setup.
SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh"

If you setup the workers to see the GPUs then you can request them with your spark-shell command line by specifying the parameters:

  --conf spark.executor.resource.gpu.amount=1 \
   --conf spark.task.resource.gpu.amount=0.25 \

With the above setup you should be able to see the GPUs resources in the Spark Master UI.

0 replies

unigrav-tech · 2022-04-25T05:48:15Z

unigrav-tech
Apr 25, 2022
Author

It works well with a single worker. (standalone mode)

0 replies

tgravescs · 2022-04-26T13:17:56Z

tgravescs
Apr 26, 2022
Maintainer

thanks for the confirmation, I'm going to close this then. If you have more questions, you can open another issue or use our discussions board: https://github.com/NVIDIA/spark-rapids/discussions

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QST] SparkException ERROR ContextCleaner: Error cleaning broadcast #5328

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

[QST] SparkException ERROR ContextCleaner: Error cleaning broadcast #5328

unigrav-tech Apr 22, 2022

Replies: 5 comments

nvliyuan Apr 22, 2022 Collaborator

unigrav-tech Apr 22, 2022 Author

tgravescs Apr 22, 2022 Maintainer

unigrav-tech Apr 25, 2022 Author

tgravescs Apr 26, 2022 Maintainer

unigrav-tech
Apr 22, 2022

nvliyuan
Apr 22, 2022
Collaborator

unigrav-tech
Apr 22, 2022
Author

tgravescs
Apr 22, 2022
Maintainer

unigrav-tech
Apr 25, 2022
Author

tgravescs
Apr 26, 2022
Maintainer