
[Bug] [Worker] Failed to submit Spark task in cluster mode #16987

Open
2 of 3 tasks
monchickey opened this issue Jan 24, 2025 · 2 comments
Labels
question Further information is requested

Comments

@monchickey
Search before asking

  • I had searched in the issues and found no similar issues.

What happened

DolphinScheduler version: 3.2.2
Deployment: pseudo-cluster
Spark is deployed in a standalone cluster, version: 3.5.4
Resource files are stored using MinIO S3
The configuration changes involve api-server/conf/common.properties and worker-server/conf/common.properties; the main changes are as follows:

resource.storage.type=S3
resource.storage.upload.base.path=/dolphinscheduler
resource.aws.access.key.id=<minio access key>
resource.aws.secret.access.key=<minio secret key>
resource.aws.region=cn-north-1
resource.aws.s3.bucket.name=dolphinscheduler
resource.aws.s3.endpoint=http://<ip>:9000
resource.hdfs.root.user=root
resource.hdfs.fs.defaultFS=s3a://dolphinscheduler

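In cluster deploy mode, the Spark node that ends up hosting the driver must also be able to reach MinIO, so the Spark side typically needs its own s3a settings in addition to the DolphinScheduler ones above. A hedged sketch of what that could look like in spark-defaults.conf, assuming the hadoop-aws and AWS SDK jars are on the Spark classpath (these property values are placeholders, not taken from this issue):

```properties
# spark-defaults.conf on every Spark node (sketch; endpoint and keys are placeholders)
spark.hadoop.fs.s3a.endpoint                http://<ip>:9000
spark.hadoop.fs.s3a.access.key              <minio access key>
spark.hadoop.fs.s3a.secret.key              <minio secret key>
spark.hadoop.fs.s3a.path.style.access       true
spark.hadoop.fs.s3a.connection.ssl.enabled  false
```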
The rest of the configuration is kept at the defaults. After starting the services, the jar file can be uploaded normally.
Then, in the workflow, I select the SPARK component, choose the jar package uploaded to MinIO, and set the deploy mode to cluster.
When I run the workflow instance, the output log (attached) is as follows:

1737699046243.log

The important error information is:

[INFO] 2025-01-24 13:53:34.674 +0800 - *********************************  Execute task instance  *************************************
[INFO] 2025-01-24 13:53:34.675 +0800 - ***********************************************************************************************
[INFO] 2025-01-24 13:53:34.677 +0800 - Final Shell file is: 
[INFO] 2025-01-24 13:53:34.677 +0800 - ****************************** Script Content *****************************************************************
[INFO] 2025-01-24 13:53:34.677 +0800 - #!/bin/bash
BASEDIR=$(cd `dirname $0`; pwd)
cd $BASEDIR
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3
${SPARK_HOME}/bin/spark-submit --master spark://192.168.11.17:7077 --deploy-mode cluster --class org.apache.spark.examples.JavaSparkPi --conf spark.driver.cores=1 --conf spark.driver.memory=512M --conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2G /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
[INFO] 2025-01-24 13:53:34.678 +0800 - ****************************** Script Content *****************************************************************
[INFO] 2025-01-24 13:53:34.678 +0800 - Executing shell command : sudo -u default -i /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/6_6.sh
[INFO] 2025-01-24 13:53:34.687 +0800 - process start, process id is: 172698
[INFO] 2025-01-24 13:53:37.688 +0800 -  -> 
	25/01/24 13:53:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
	25/01/24 13:53:37 INFO SecurityManager: Changing view acls to: default
	25/01/24 13:53:37 INFO SecurityManager: Changing modify acls to: default
	25/01/24 13:53:37 INFO SecurityManager: Changing view acls groups to: 
	25/01/24 13:53:37 INFO SecurityManager: Changing modify acls groups to: 
	25/01/24 13:53:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: default; groups with view permissions: EMPTY; users with modify permissions: default; groups with modify permissions: EMPTY
[INFO] 2025-01-24 13:53:38.691 +0800 -  -> 
	25/01/24 13:53:37 INFO Utils: Successfully started service 'driverClient' on port 39639.
	25/01/24 13:53:37 INFO TransportClientFactory: Successfully created connection to /192.168.11.17:7077 after 57 ms (0 ms spent in bootstraps)
	25/01/24 13:53:38 INFO ClientEndpoint: ... waiting before polling master for driver state
	25/01/24 13:53:38 INFO ClientEndpoint: Driver successfully submitted as driver-20250124135338-0056
[INFO] 2025-01-24 13:53:43.693 +0800 -  -> 
	25/01/24 13:53:43 INFO ClientEndpoint: State of driver-20250124135338-0056 is ERROR
	25/01/24 13:53:43 ERROR ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
	java.nio.file.NoSuchFileException: /tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6/spark-examples_2.12-3.5.4.jar
		at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
		at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
		at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
		at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
		at java.nio.file.Files.copy(Files.java:1274)
		at org.apache.spark.util.Utils$.copyRecursive(Utils.scala:681)
		at org.apache.spark.util.Utils$.copyFile(Utils.scala:652)
		at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:725)
		at org.apache.spark.util.Utils$.fetchFile(Utils.scala:467)
		at org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:162)
		at org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:179)
		at org.apache.spark.deploy.worker.DriverRunner$$anon$2.run(DriverRunner.scala:99)
	25/01/24 13:53:43 INFO ShutdownHookManager: Shutdown hook called
	25/01/24 13:53:43 INFO ShutdownHookManager: Deleting directory /tmp/spark-2af4f41d-c583-4698-9d8e-546a656bcf17
[INFO] 2025-01-24 13:53:43.695 +0800 - process has exited. execute path:/tmp/dolphinscheduler/exec/process/default/131329535157952/131329769571008_2/6/6, processId:172698 ,exitStatusCode:255 ,processWaitForStatus:true ,processExitValue:255
[INFO] 2025-01-24 13:53:43.697 +0800 - Start finding appId in /opt/apache-dolphinscheduler-3.2.2-bin/worker-server/logs/20250124/131329769571008/2/6/6.log, fetch way: log 
[INFO] 2025-01-24 13:53:43.698 +0800 - 
***********************************************************************************************
[INFO] 2025-01-24 13:53:43.699 +0800 - *********************************  Finalize task instance  ************************************
[INFO] 2025-01-24 13:53:43.699 +0800 - ***********************************************************************************************

From the error message, we can see that although the jar package on MinIO was selected when configuring the workflow, DolphinScheduler still passed the local temporary path to spark-submit at runtime. In cluster mode the Spark driver runs on a remote worker, so it cannot read the jar at that local path, and the task fails.
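One workaround, given the behavior above, would be to bypass the locally rendered path and point spark-submit directly at the jar in MinIO via an s3a:// URL, so the driver on the remote worker can fetch it itself. A sketch only: the bucket path below is hypothetical, and it assumes the Spark nodes have hadoop-aws plus s3a credentials configured:

```shell
# Hypothetical resource path in the dolphinscheduler bucket on MinIO.
# With s3a configured on the Spark nodes, the remote driver downloads the jar itself.
${SPARK_HOME}/bin/spark-submit \
  --master spark://192.168.11.17:7077 \
  --deploy-mode cluster \
  --class org.apache.spark.examples.JavaSparkPi \
  --conf spark.driver.memory=512M \
  --conf spark.executor.instances=2 \
  s3a://dolphinscheduler/dolphinscheduler/default/resources/spark-examples_2.12-3.5.4.jar
```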

What you expected to happen

Tasks can be submitted and run normally.

How to reproduce

You can reproduce it by following the steps above.

Anything else

The problem occurs whenever the DolphinScheduler worker and the Spark driver are not running on the same node.
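That observation suggests another way to sidestep the path mismatch: client deploy mode, where the driver starts on the submitting (DolphinScheduler worker) host, so the locally downloaded jar path is visible to it. A sketch, with the execution path elided as in the logs above:

```shell
# Client mode keeps the driver on the DolphinScheduler worker host,
# so the local jar path rendered into the script remains valid.
${SPARK_HOME}/bin/spark-submit \
  --master spark://192.168.11.17:7077 \
  --deploy-mode client \
  --class org.apache.spark.examples.JavaSparkPi \
  /tmp/dolphinscheduler/exec/.../spark-examples_2.12-3.5.4.jar
```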

Version

3.2.x

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@monchickey added the bug and Waiting for reply labels on Jan 24, 2025
@monchickey changed the title from "[Bug] [All Model] Failed to submit Spark task in cluster mode" to "[Bug] [All Module] Failed to submit Spark task in cluster mode" on Jan 24, 2025
@monchickey changed the title from "[Bug] [All Module] Failed to submit Spark task in cluster mode" to "[Bug] [Worker] Failed to submit Spark task in cluster mode" on Jan 24, 2025
@SbloodyS added the question label and removed the bug and Waiting for reply labels on Jan 24, 2025
@SbloodyS
Member

You should put spark-examples_2.12-3.5.4.jar into the storage center and reference it in the task through a resource file if you want to use cluster deploy mode.

@monchickey (Author) commented Jan 24, 2025

@SbloodyS I just tried it and still get the same error: the Main Package field is required, and the selection under Resources doesn't seem to take effect.
I uploaded the package via Resources -> Upload Files, then selected it as both the Main Package and under Resources. Is this correct?
