When I submit it to my Spark cluster, there is no error as long as the executor IPs are all different from each other.
But when two executors get the same IP, the program crashes at TFSparkNode.run(). I found that in
nodeRDD.foreachPartition(TFSparkNode.run(map_fun,
                                         tf_args,
                                         cluster_meta,
                                         tensorboard,
                                         log_dir,
                                         queues,
                                         background=(input_mode == InputMode.SPARK)))
only two of the elements are processed, by two executors with different IPs, while the other two executors, which share an IP with the processing ones, fail in util.read_executor_id():
def read_executor_id():
    """Read worker id from a local file in the executor's current working directory"""
    logger.info("read_executor_id os.listdir('./') is {0}".format(os.listdir('./')))
    logger.info("read_executor_id os.path.isfile(executor_id) {0}".format(os.path.isfile("executor_id")))
    if os.path.isfile("executor_id"):
        with open("executor_id", "r") as f:
            return int(f.read())
    else:
        msg = "No executor_id file found on this node, please ensure that:\n" + \
              "1. Spark num_executors matches TensorFlow cluster_size\n" + \
              "2. Spark tasks per executor is 1\n" + \
              "3. Spark dynamic allocation is disabled\n" + \
              "4. There are no other root-cause exceptions on other nodes\n"
        raise Exception(msg)
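For context, the counterpart of this function writes the executor_id file from within the map function that runs on each reserved executor. The sketch below is my paraphrase of that flow, not verbatim library code: an executor that never received one of the nodeRDD elements never writes the file, so a later read on that executor raises exactly the exception above.

# Paraphrased sketch (assumption: mirrors the writer in TensorFlowOnSpark's util
# module): each executor that processes a nodeRDD element records its id in its
# own YARN container working directory.
def write_executor_id(num):
    """Write the executor id into a local file in the current working directory."""
    with open("executor_id", "w") as f:
        f.write(str(num))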
Yes, unfortunately, the system expects that each executor resides on its own host. This was due to early constraints w.r.t. GPU scheduling, which had to be done by the application (before Spark added GPU scheduling itself).
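A common workaround is to size each executor so that YARN can only place one per host. A minimal sketch, assuming 8-core worker nodes; the core and memory figures are illustrative, not taken from this issue:

# Sketch: request executors large enough that YARN schedules at most one per host.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("tfos_one_executor_per_host")
    .set("spark.executor.instances", "4")           # must equal --cluster_size
    .set("spark.executor.cores", "8")               # claim the whole node's cores
    .set("spark.executor.memory", "24g")
    .set("spark.task.cpus", "8")                    # 1 TF task per executor
    .set("spark.dynamicAllocation.enabled", "false")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()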
When I run in YARN mode according to https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN:
echo $KRYLOV_WF_HOME
echo $KRYLOV_WF_TASK_NAME/$1
export PYTHON_ROOT=./Python
export LD_LIBRARY_PATH=${PATH}
export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/python3
export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python3"
export PATH=${PYTHON_ROOT}/bin/:$PATH
export SPARK_WORKER_INSTANCES=4
export CORES_PER_WORKER=2
export TOTAL_CORES=$((${CORES_PER_WORKER}*${SPARK_WORKER_INSTANCES}))
/apache/releases/spark-3.1.1.0.9.0-bin-ebay/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--queue hdlq-business-ads-guidance-high-mem \
--num-executors ${SPARK_WORKER_INSTANCES} \
--executor-cores ${CORES_PER_WORKER} \
--conf spark.cores.max=${TOTAL_CORES} \
--conf spark.task.cpus=${CORES_PER_WORKER} \
--executor-memory 24G \
--archives "hdfs://user/tfos/Python.zip#Python" \
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
--conf spark.executorEnv.CLASSPATH=$(hadoop classpath --glob) \
--py-files $KRYLOV_WF_HOME/src/$KRYLOV_WF_TASK_NAME/TensorFlowOnSpark/tfspark.zip \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.yarn.maxAppAttempts=1 \
$KRYLOV_WF_HOME/src/$KRYLOV_WF_TASK_NAME/$1 \
--images_labels "hdfs/mnist/csv/csv/train/" \
--model_dir "./mnist_model" \
--export_dir "./mnist_export" \
--cluster_size ${SPARK_WORKER_INSTANCES}
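Before starting the cluster, the preconditions listed in the exception message can be asserted programmatically against the live Spark configuration. A small sketch; the defaults and wording are mine, not library code:

# Sketch: validate the TensorFlowOnSpark preconditions against the SparkContext.
def check_tfos_preconditions(sc, cluster_size):
    conf = sc.getConf()
    num_executors = int(conf.get("spark.executor.instances", "1"))
    executor_cores = int(conf.get("spark.executor.cores", "1"))
    task_cpus = int(conf.get("spark.task.cpus", "1"))
    dyn_alloc = conf.get("spark.dynamicAllocation.enabled", "false")

    assert num_executors == cluster_size, "num_executors must match cluster_size"
    assert executor_cores == task_cpus, "each executor should run exactly 1 task"
    assert dyn_alloc == "false", "dynamic allocation must be disabled"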
Below I pasted the detailed info:
2023-09-17 19:26:06,020 INFO (MainThread-736679) sorted_cluster_info : [{'executor_id': 0, 'host': '10.97.210.22', 'job_name': 'chief', 'task_index': 0, 'port': 37505, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-lh_bcdpi/listener-i_exgo0d', 'authkey': b'\x1d\xe1\xe21L5O\x1a\xb2\x14e3\x96\xc2\x02\x7f'}, {'executor_id': 1, 'host': '10.97.210.22', 'job_name': 'worker', 'task_index': 0, 'port': 36295, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-x86briu5/listener-lcfs6er9', 'authkey': b'\x9b\x838\xac2\xdfJ\\xa5\xd8\xd6Q\xf8e\xc8O'}, {'executor_id': 2, 'host': '10.183.5.149', 'job_name': 'worker', 'task_index': 1, 'port': 34805, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-gf6tu6eb/listener-sbphcavb', 'authkey': b'\xbfo\xc1_\x8d\xb9F%\xb7\xe5\xfa%\xd0\x9a\x18K'}, {'executor_id': 3, 'host': '10.183.5.149', 'job_name': 'worker', 'task_index': 2, 'port': 44817, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-v2ai241q/listener-ybhpgpqi', 'authkey': b'\xdd\xf8\xd6[\xb9\xdfH\xaf\xa6M\xe4aP\xa4\xa7\xc7'}]
2023-09-17 19:26:06,020 INFO (MainThread-736679) node: {'executor_id': 0, 'host': '10.97.210.22', 'job_name': 'chief', 'task_index': 0, 'port': 37505, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-lh_bcdpi/listener-i_exgo0d', 'authkey': b'\x1d\xe1\xe21L5O\x1a\xb2\x14e3\x96\xc2\x02\x7f'} : last_executor_id : -1
2023-09-17 19:26:06,020 INFO (MainThread-736679) node: {'executor_id': 1, 'host': '10.97.210.22', 'job_name': 'worker', 'task_index': 0, 'port': 36295, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-x86briu5/listener-lcfs6er9', 'authkey': b'\x9b\x838\xac2\xdfJ\\xa5\xd8\xd6Q\xf8e\xc8O'} : last_executor_id : 0
2023-09-17 19:26:06,020 INFO (MainThread-736679) node: {'executor_id': 2, 'host': '10.183.5.149', 'job_name': 'worker', 'task_index': 1, 'port': 34805, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-gf6tu6eb/listener-sbphcavb', 'authkey': b'\xbfo\xc1_\x8d\xb9F%\xb7\xe5\xfa%\xd0\x9a\x18K'} : last_executor_id : 1
2023-09-17 19:26:06,020 INFO (MainThread-736679) node: {'executor_id': 3, 'host': '10.183.5.149', 'job_name': 'worker', 'task_index': 2, 'port': 44817, 'tb_pid': 0, 'tb_port': 0, 'addr': '/tmp/pymp-v2ai241q/listener-ybhpgpqi', 'authkey': b'\xdd\xf8\xd6[\xb9\xdfH\xaf\xa6M\xe4aP\xa4\xa7\xc7'} : last_executor_id : 2
2023-09-17 19:26:06,020 INFO (MainThread-736679) export TF_CONFIG: {"cluster": {"chief": ["10.97.210.22:37505"], "worker": ["10.97.210.22:36295", "10.183.5.149:34805", "10.183.5.149:44817"]}, "task": {"type": "chief", "index": 0}, "environment": "cloud"}
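From that sorted_cluster_info it is visible that executors 0/1 and 2/3 pair up on the same hosts. A quick hypothetical check (not part of TensorFlowOnSpark) that flags this condition from such a cluster description:

# Hypothetical helper: detect hosts carrying more than one executor, which
# violates the one-executor-per-host expectation discussed above.
from collections import Counter

cluster_info = [
    {"executor_id": 0, "host": "10.97.210.22"},
    {"executor_id": 1, "host": "10.97.210.22"},
    {"executor_id": 2, "host": "10.183.5.149"},
    {"executor_id": 3, "host": "10.183.5.149"},
]

host_counts = Counter(node["host"] for node in cluster_info)
shared = {host: n for host, n in host_counts.items() if n > 1}
if shared:
    print("Executors sharing a host:", shared)  # both IPs appear twice here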