
Incorrect Job ID Extraction on Clusters with Custom Slurm Output #265

Open
StephenRebelSSC opened this issue Aug 12, 2024 · 2 comments

@StephenRebelSSC

Description:
When running a 'SlurmPipelineExecutor' pipeline on my HPC cluster, I encounter dependency issues that result in a failed execution. The problem arises during the stats collection step after a stage has completed (observed during initial testing).

The issue originates from the 'launch_slurm_job' function in 'src/datatrove/executor/slurm.py':

def launch_slurm_job(launch_file_contents, *args):
    """
        Small helper function to save a sbatch script and call it.
    Args:
        launch_file_contents: Contents of the sbatch script
        *args: any other arguments to pass to the sbatch command

    Returns: the id of the launched slurm job

    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
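        # the last whitespace-separated token of sbatch's output is assumed to be the job ID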
        return subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8").split()[-1]

The issue arises because my cluster's job submission output is of the following form:
Submitted batch job <job_id> on cluster <cluster_name>

In this format, the 'launch_slurm_job' function incorrectly reads the cluster name as the job ID, leading to errors in subsequent stages that depend on the correct job ID.
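For illustration, here is how the current extraction behaves on the two output formats (the job ID and cluster name below are made up):

standard = "Submitted batch job 123456"
custom = "Submitted batch job 123456 on cluster my_cluster"

# current behaviour: take the last whitespace-separated token
print(standard.split()[-1])  # "123456" -> correct job ID
print(custom.split()[-1])    # "my_cluster" -> the cluster name, not the job ID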

Reproducibility:

  1. Set up a pipeline using 'SlurmPipelineExecutor'.
  2. Run the pipeline on a cluster whose job submission output format resembles the one described above.
  3. Observe the dependency error and failed execution.

Proposed Solution:
To handle clusters with non-standard job submission output formats, I propose adding an argument or flag to the 'SlurmPipelineExecutor' class. This would allow users to specify the correct position of the job ID in the submission output.

In my case, I temporarily worked around the issue by hardcoding the function to extract the job ID from the 4th whitespace-separated element of the sbatch output.
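For reference, a possible fix could look roughly like the sketch below. The 'job_id_position' argument is a hypothetical name for the flag proposed above, and the regex fallback is only a suggestion, not existing datatrove behaviour:

import re
import subprocess
import tempfile


def launch_slurm_job(launch_file_contents, *args, job_id_position=-1):
    """Save an sbatch script, submit it, and return the job ID.

    job_id_position: index of the job ID in the whitespace-split sbatch
        output (defaults to the last token, matching the current behaviour).
    """
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(launch_file_contents)
        f.flush()
        output = subprocess.check_output(["sbatch", *args, f.name]).decode("utf-8")
        # prefer matching the first integer after "Submitted batch job",
        # which covers both the standard and the custom output format
        match = re.search(r"Submitted batch job (\d+)", output)
        if match:
            return match.group(1)
        # fall back to positional extraction for any other output format
        return output.split()[job_id_position]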

Additional Information:

  • Environment: HPC cluster with custom Slurm job submission output
  • Version: Most recent version of datatrove installed from source.
  • Impact: This issue will affect users on systems with similar output formats.

Please let me know if further details or a pull request would be helpful.

@guipenedo
Collaborator

Hi! Would you be willing to PR this change?

@StephenRebelSSC
Author

Yeah for sure.
