
Incorrect Job ID Extraction on Clusters with Custom Slurm Output #265

Open
StephenRebelSSC opened this issue Aug 12, 2024 · 2 comments

@StephenRebelSSC

Description:
When running a 'SlurmPipelineExecutor' pipeline on my HPC cluster, I encounter dependency issues that result in a failed execution. The problem arises during the stats collection step after a stage has completed (observed during initial testing).

The issue originates from the 'launch_slurm_job' function in 'src/datatrove/executor/slurm.py':

def launch_slurm_job(launch_file_contents, *args):
    """
        Small helper function to save a sbatch script and call it.
    Args:
        launch_file_contents: Contents of the sbatch script
        *args: any other arguments to pass to the sbatch command

    Returns: the id of the launched slurm job

    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
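        # the last whitespace-separated token of sbatch's output is assumed to be the job ID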
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]

The issue arises because my cluster's job submission output is of the following form:
Submitted batch job <job_id> on cluster <cluster_name>

In this format, the 'launch_slurm_job' function incorrectly reads the cluster name as the job ID, leading to errors in subsequent stages that depend on the correct job ID.
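For illustration, here is how the current extraction behaves on the two output formats (the job ID and cluster name below are made up):

standard = "Submitted batch job 123456"
custom = "Submitted batch job 123456 on cluster my_cluster"

# current behaviour: take the last whitespace-separated token
print(standard.split()[-1])  # "123456" -> correct job ID
print(custom.split()[-1])    # "my_cluster" -> the cluster name, not the job ID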

Reproducibility:

  1. Set up a pipeline using 'SlurmPipelineExecutor'.
  2. Run the pipeline on a cluster whose job submission output format resembles the one described above.
  3. Observe the dependency error and failed execution.

Proposed Solution:
To handle clusters with non-standard job submission output formats, I propose adding an argument or flag to the 'SlurmPipelineExecutor' class. This would allow users to specify the correct position of the job ID in the submission output.

In my case, I temporarily worked around the issue by hardcoding the function to extract the job ID from the 4th whitespace-separated element of the sbatch output.
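For reference, a possible fix could look roughly like the sketch below. The 'job_id_position' argument is a hypothetical name for the flag proposed above, and the regex fallback is only a suggestion, not existing datatrove behaviour:

import re
import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args, job_id_position=-1):
    """Save an sbatch script, submit it, and return the job ID.

    job_id_position: index of the job ID in the whitespace-split sbatch
        output (defaults to the last token, matching the current behaviour).
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        output = subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8")
        # prefer matching the first integer after "Submitted batch job",
        # which covers both the standard and the custom output format
        match = re.search(r"Submitted batch job (\d+)", output)
        if match:
            return match.group(1)
        # fall back to positional extraction for any other output format
        return output.split()[job_id_position]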

Additional Information:

  • Environment: HPC cluster with custom Slurm job submission output
  • Version: Most recent version of datatrove installed from source.
  • Impact: This issue will affect users on systems with similar output formats.

Please let me know if further details or a pull request would be helpful.

@guipenedo
Collaborator

Hi! Would you be willing to PR this change?

@StephenRebelSSC
Author

Yeah for sure.
