
[Bug] [Worker] Failed to submit Spark task in cluster mode #16987

Open
2 of 3 tasks
monchickey opened this issue Jan 24, 2025 · 2 comments
Labels
question Further information is requested

Comments

@monchickey
Search before asking

  • I had searched in the issues and found no similar issues.

What happened

DolphinScheduler version: 3.2.2
Deployment: pseudo-cluster
Spark is deployed in a standalone cluster, version: 3.5.4
Resource files are stored using MinIO S3
The configuration changes involve api-server/conf/common.properties and worker-server/conf/common.properties; the main changes are as follows:

resource.storage.type=S3
resource.storage.upload.base.path=/dolphinscheduler
resource.aws.access.key.id=<minio access key>
resource.aws.secret.access.key=<minio secret key>
resource.aws.region=cn-north-1
resource.aws.s3.bucket.name=dolphinscheduler
resource.aws.s3.endpoint=http://<ip>:9000
resource.hdfs.root.user=root
resource.hdfs.fs.defaultFS=s3a://dolphinscheduler

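In cluster deploy mode, the Spark node that ends up hosting the driver must also be able to reach MinIO, so the Spark side typically needs its own s3a settings in addition to the DolphinScheduler ones above. A hedged sketch of what that could look like in spark-defaults.conf, assuming the hadoop-aws and AWS SDK jars are on the Spark classpath (these property values are placeholders, not taken from this issue):

```properties
# spark-defaults.conf on every Spark node (sketch; endpoint and keys are placeholders)
spark.hadoop.fs.s3a.endpoint                http://<ip>:9000
spark.hadoop.fs.s3a.access.key              <minio access key>
spark.hadoop.fs.s3a.secret.key              <minio secret key>
spark.hadoop.fs.s3a.path.style.access       true
spark.hadoop.fs.s3a.connection.ssl.enabled  false
```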
The rest of the configuration is kept at the defaults. After starting the services, the jar file can be uploaded normally.
Then, in the workflow, I select the SPARK component, choose the jar package uploaded to MinIO, and set the deploy mode to cluster.
When I run the workflow instance, the output log (attached) is as follows:

1737699046243.log

The important error information is:

[INFO] 2025-01-24 13:53:34.674 +0800 - *********************************  Execute task instance  *************************************
[INFO] 2025-01-24 13:53:34.675 +0800 - ***********************************************************************************************
[INFO] 2025-01-24 13:53:34.677 +0800 - Final Shell file is: 
[INFO] 2025-01-24 13:53:34.677 +0800 - ****************************** Script Content *****************************************************************
[INFO] 2025-01-24 13:53:34.677 +0800 - #!/bin/bash
BASEDIR=$(cd `dirname $0`; pwd)
cd $BASEDIR
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3
${SPARK_HOME}/bin/spark-submit --master spark://192.168.11.17:7077 --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=512M --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2G /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
[INFO] 2025-01-24 13:53:34.678 +0800 - ****************************** Script Content *****************************************************************
[INFO] 2025-01-24 13:53:34.678 +0800 - Executing shell command : sudo -u default -i /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/6_6.sh
[INFO] 2025-01-24 13:53:34.687 +0800 - process start, process id is: 172698
[INFO] 2025-01-24 13:53:37.688 +0800 -  -> 
	25/01/24 13:53:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	25/01/24 13:53:37 INFO SecurityManager: Changing view acls to: default
	25/01/24 13:53:37 INFO SecurityManager: Changing modify acls to: default
	25/01/24 13:53:37 INFO SecurityManager: Changing view acls groups to: 
	25/01/24 13:53:37 INFO SecurityManager: Changing modify acls groups to: 
	25/01/24 13:53:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: default; groups with view permissions: EMPTY; users with modify permissions: default; groups with modify permissions: EMPTY
[INFO] 2025-01-24 13:53:38.691 +0800 -  -> 
	25/01/24 13:53:37 INFO Utils: Successfully started service 'driverClient' on port 39639.
	25/01/24 13:53:37 INFO TransportClientFactory: Successfully created connection to /192.168.11.17:7077 after 57 ms (0 ms spent in bootstraps)
	25/01/24 13:53:38 INFO ClientEndpoint: ... waiting before polling master for driver state
	25/01/24 13:53:38 INFO ClientEndpoint: Driver successfully submitted as driver-20250124135338-0056
[INFO] 2025-01-24 13:53:43.693 +0800 -  -> 
	25/01/24 13:53:43 INFO ClientEndpoint: State of driver-20250124135338-0056 is ERROR
	25/01/24 13:53:43 ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
	java.nio.file.NoSuchFileException: /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
		at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
		at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
		at java.nio.file.Files.copy(Files.java:1274)
		at org.apache.spark.util.Utils$.copyRecursive(Utils.scala:681)
		at org.apache.spark.util.Utils$.copyFile(Utils.scala:652)
		at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:725)
		at org.apache.spark.util.Utils$.fetchFile(Utils.scala:467)
		at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:162)
		at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:179)
		at org.apache.spark.deploy.worker.DriverRunner$$anon$2.run(DriverRunner.scala:99)
	25/01/24 13:53:43 INFO ShutdownHookManager: Shutdown hook called
	25/01/24 13:53:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-2af4f41d-c583-4698-9d8e-546a656bcf17
[INFO] 2025-01-24 13:53:43.695 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6, processId:172698 ,exitStatusCode:255 ,processWaitForStatus:true ,processExitValue:255
[INFO] 2025-01-24 13:53:43.697 +0800 - Start finding appId in /opt/apache-dolphinscheduler-3.2.2-bin/worker-server/logs/20250124/131329769571008/2/6/6.log, fetch way: log 
[INFO] 2025-01-24 13:53:43.698 +0800 - 
***********************************************************************************************
[INFO] 2025-01-24 13:53:43.699 +0800 - *********************************  Finalize task instance  ************************************
[INFO] 2025-01-24 13:53:43.699 +0800 - ***********************************************************************************************

From the error message, we can see that although the jar package on MinIO was selected when configuring the workflow, DolphinScheduler still passed the local temporary path to spark-submit at runtime. In cluster mode the Spark driver runs on a remote worker, so it cannot read the jar at that local path, and the task fails.
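One workaround, given the behavior above, would be to bypass the locally rendered path and point spark-submit directly at the jar in MinIO via an s3a:// URL, so the driver on the remote worker can fetch it itself. A sketch only: the bucket path below is hypothetical, and it assumes the Spark nodes have hadoop-aws plus s3a credentials configured:

```shell
# Hypothetical resource path in the dolphinscheduler bucket on MinIO.
# With s3a configured on the Spark nodes, the remote driver downloads the jar itself.
${SPARK_HOME}/bin/spark-submit \
  --master spark://192.168.11.17:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.JavaSparkPi \
  --conf spark.driver.memory=512M \
  --conf spark.executor.instances=2 \
  s3a://dolphinscheduler/dolphinscheduler/default/resources/spark-examples_2.12-3.5.4.jar
```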

What you expected to happen

Tasks can be submitted and run normally.

How to reproduce

You can reproduce it by following the steps above.

Anything else

The problem occurs whenever the DolphinScheduler worker and the Spark driver are not running on the same node.
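That observation suggests another way to sidestep the path mismatch: client deploy mode, where the driver starts on the submitting (DolphinScheduler worker) host, so the locally downloaded jar path is visible to it. A sketch, with the execution path elided as in the logs above:

```shell
# Client mode keeps the driver on the DolphinScheduler worker host,
# so the local jar path rendered into the script remains valid.
${SPARK_HOME}/bin/spark-submit \
  --master spark://192.168.11.17:7077 \
  --deploy-mode client \
  --class org.apache.spark.examples.JavaSparkPi \
  /tmp/dolphinscheduler/exec/.../spark-examples_2.12-3.5.4.jar
```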

Version

3.2.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@monchickey added the bug and Waiting for reply labels on Jan 24, 2025
@monchickey changed the title from "[Bug] [All Model] Failed to submit Spark task in cluster mode" to "[Bug] [All Module] Failed to submit Spark task in cluster mode" on Jan 24, 2025
@monchickey changed the title from "[Bug] [All Module] Failed to submit Spark task in cluster mode" to "[Bug] [Worker] Failed to submit Spark task in cluster mode" on Jan 24, 2025
@SbloodyS added the question label and removed the bug and Waiting for reply labels on Jan 24, 2025
@SbloodyS
Member

You should put spark-examples_2.12-3.5.4.jar into the storage center and reference it in the task through a resource file if you want to use cluster deploy mode.

@monchickey (Author) commented Jan 24, 2025

@SbloodyS I just tried it and still get the same error: the Main Package field is required, and the selection under Resources doesn't seem to take effect.
I uploaded the package via Resources -> Upload Files, then selected it as both the Main Package and under Resources. Is this correct?
