Unifying parallel information for both codes and workflow engines #1881

tomdemeyere · 2024-03-14T22:03:04Z

tomdemeyere
Mar 14, 2024

The Problem

Currently, there is no unified approach to specifying parallelization information for every code available in Quacc. Each code has its own way of specifying parallelization information, which can be tedious and potentially confusing for users.
Some workflow engines might require additional information about resources for each job. This seems to be the direction that Parsl is taking (https://parsl.readthedocs.io/en/stable/userguide/mpi_apps.html). When using Parsl with MPI, users now need to specify additional kwargs for each job to define resources such as the number of tasks or nodes. Parsl then uses this information to build the parallel command.

The reason Parsl has to do this is likely due to the pilot job model: Parsl must manage how tasks are deployed. This requirement is not needed when adhering to the condition of 1 task = 1 Slurm job, as it is assumed that the user will correctly attribute resources. In the pilot job model, multiple task run in the same slurm job, when using srun, users should never have to worry about this because even if they oversubscribe their Slurm jobs, srun will gracefully wait when everything is full. In contrast, mpiexec/mpirun will simply oversubscribe everything, leading to suboptimal performance. Hence the need for a new mechanism from Parsl's side.

The Proposal

In Quacc, in the job decorator, or as a kwarg, provide a unique way to specify resources, such as the number of nodes and tasks, as well as the number of CPUs per task. Both the code and the workflow engine (if needed) will receive this information and deal with it as they need it. This way, users only need to specify the resources once.

This also means that we go away from ASE's parallel_info which in itself isn't such a big problem because it seems that no one know how to use it anyway... 😅

Andrew-S-Rosen · 2024-03-15T13:40:43Z

Andrew-S-Rosen
Mar 15, 2024
Maintainer

Thank you for raising this comment! I think you will probably find my response unsatisfactory, but nonetheless this is my opinion on the matter.

Currently, there is no unified approach to specifying parallelization information for every code available in Quacc. Each code has its own way of specifying parallelization information, which can be tedious and potentially confusing for users.

I agree that this is frustrating and not intuitive. A lot of this is due to poor organization on the ASE side of things, where we don't have a unified approach to handling this.

The reason Parsl has to do this is likely due to the pilot job model: Parsl must manage how tasks are deployed.

Correct. I encourage you to check out this recent discussion about a cleaner approach for this. It's not so much that Parsl needs a new mechanism; the old mechanism via HTEX works fine. Rather, the old mechanism is just confusing.

Even if one were to use the new MPI-based features in Parsl, the user (or quacc) still has to define an srun command themselves. This is because the PythonApp is making subprocess calls to an MPI-based executable; Parsl does not construct the srun command for us.

That said, I view this as a separate matter. The first point is about the execution command, including any parallelization flags (e.g. srun -N 1 blah blah) coming from ASE. The second point is about resource management, which is fully in the purview of the workflow orchestrator.

In Quacc, in the job decorator, or as a kwarg, provide a unique way to specify resources, such as the number of nodes and tasks, as well as the number of CPUs per task. Both the code and the workflow engine (if needed) will receive this information and deal with it as they need it. This way, users only need to specify the resources once.

For most of the calculators that don't use this hacky parallel_info business, the users only have to specify the resource once: as an environment variable. This is no different than normal ASE. Espresso and ONETEP just happen to be the exceptions right now since they are the only ones using this undesirable parallel_info nonsense. Really, what needs to be done here is that we need to first solve this upstream in ASE. That is the only robust way of approaching this. I have had an open issue about this for quite some time, and in my opinion, it is the most important blocker for a new release since it is the main thing making the new profile approach not sufficiently flexible.

As for setting things like Parsl commands, it would be a logistical challenge. Quacc supports several workflow orchestration tools, and they all behave very differently depending on a person's compute architecture and computing needs. I am hesitant to get into the business of having quacc interact with the workflow orchestration utilities for this reason. Even if we wanted to go down this road, it is not immediately clear to me how one might achieve this. The job decorators in quacc have minimal logic of their own (by design). They predominantly are an alias for the workflow orchestration tool's decorator. And for some workflow engines, like Dask/Parsl/Prefect, the configuration details can be specified before quacc is ever run.

Summary:

We need a unified approach in ASE. Ideally, users should be able to specify the parallelization info via an environment variable like the older calculators. Then we can have a quacc setting for each calculator that is internally consistent, very similar to what is currently done with the VASP_PARALLEL_CMD setting. Ideally, every MPI-based calculator would have an analogous quacc setting, and that's all that would be needed.
For the workflow engines, there is simply too much variability in what users may want. There is no one-size-fits-all approach. For Parsl specifically, there is a need for improved syntax to make things clearer for end users. That is underway.

11 replies

Andrew-S-Rosen Mar 15, 2024
Maintainer

MPI App from Parsl create the command that has the nodelist mpirun -np ... -nodes <...> for each jobs, supposedly correctly.

Perhaps I simply need to see this MPI App in action. My point of confusion was coming in because the binary is only ever called when the Python script in the PythonApp does a subprocess call. My naive understanding of the MPI App was that Parsl is calling mpirun on the App itself, and we don't want mpirun being called on the Python script --- we want mpirun to be prepended to the binary in the ASE-based subprocess call. But perhaps I am misunderstanding this situation? It's not obvious to me how Parsl would ever know anything about the subprocess call since that exists solely in user space.

Apologies if I'm still misunderstanding this. I am open to improvements if they are general. I may need to see a minimal example to better appreciate this.

tomdemeyere Mar 16, 2024
Author

import parsl
from parsl import python_app
from parsl.app.app import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SingleNodeLauncher
from parsl.providers import SlurmProvider

worker_init = """
source /mnt/lustre/a2fs-work3/work/e89/e89/td5g20/.scratch_bashrc
conda activate mace
"""

config = Config(
    executors=[
        HighThroughputExecutor(
            label="test",
            enable_mpi_mode=True,
            mpi_launcher="srun",
            max_workers=1,
            cores_per_worker=1,
            provider=SlurmProvider(
                account="e89-soto",
                qos="short",
                worker_init=worker_init,
                walltime="00:01:00",
                nodes_per_block=2,
                cores_per_node=1,
                partition="standard",
                init_blocks=0,
                max_blocks=1,
                launcher=SingleNodeLauncher(),
            ),
        ),
    ],
)

parsl.load(config)

resource_specification = {
    "num_nodes": 2,
    "ranks_per_node": 1,
    "num_ranks": 1,
}


@python_app
def mpi_test(parsl_resource_specification={}):
    import os

    return os.environ["PARSL_SRUN_PREFIX"]

results = mpi_test(parsl_resource_specification=resource_specification).result()

print(results)

That's the smallest example that I could come up with, in my case this prints: srun --ntasks 1 --ntasks-per-node 1 --nodelist nid002628,nid002650 --nodes 2

Notice the --nodelist nid002628,nid002650 here, Parsl is now managing that. In the srun case we don't really care since the command is normally distributing each task to available nodes correctly. Additionally if no nodes are available srun will wait, practically speaking acting as a task manager. mpirun/mpiexec or other exotic launchers will simply launch all tasks on the mother node (I am not sure of the exact behaviour here) and not wait if there are too many tasks. Providing the resource_specification to Parsl as done in the example above avoid this because the task management is now done by Parsl. In this case even if one uses mpirun, Parsl will manage which task is going to which nodes, if there are too many of them, it will wait. This is possible because Parsl now know the size of each python_app (note that switching to @job does not work, I still have to investigate that)

How would this link to Quacc? The prefix that I print in the example above should be used as the prefix to build the actual command that ASE will run. In my understanding, that is the only requirement here.

Hence this proposal to come up with a way to unify the parallelisation_info for different softwares and workflow engines.

Andrew-S-Rosen Mar 17, 2024
Maintainer

Thanks for the detailed writeup. This was very helpful to understand the scenario a bit more. I also understand the issue with solely relying on srun to distribute the load correctly. As a sidenote: wow, this mechanism in Parsl is a bit gnarly. 😅 It took me a while to realize that this parsl_resouce_specification is some magic that Parsl uses. Having Parsl rely on the user passing a custom kwarg to their function seems... messy at best.

As for how to handle this in quacc, I think there is room to improve things but there's a few points of note. The first is that most calculators in quacc aren't based on the new GenericFileIOCalculator class and therefore don't support parallel_info, so any decision made here has to be independent of that. I also am hoping that this parallelization mechanism is improved upstream in ASE, but that's a side comment. The second point is that we want to make sure that any feature added to quacc is applicable (or at least extensible) beyond the details of a specific workflow management tool. If both of these points can be met, then I think there is room for such a proposal. The latter part seems like quite a challenge, in part because of how many options there are. For instance, one might not even be using an HPC system but rather cloud compute where these concepts become blurry (as one example). But perhaps there is still a way forward.

tomdemeyere Mar 17, 2024
Author

Thinking about it, these are the specifications that I had in mind:

The new parameter would be passed to the job decorator
Probably should be a dict, with only a few recognised keys
It is optional
The ASE command to be run will directly be constructed from this parameter
If the workflow engine being used needs it (only Parsl that I know of) we pass it using the if statement already in place inside the job decorator. Otherwise we skip, nothing changed.

In this way:

Everything is optional.
Other workflow engines are not affected.
There is an unified way to specify parallelism

I am not very aware about cloud computing, but at least for my (probably narrow) view this would be water tight.

I agree about Parsl kwargs, not very ideal…

Andrew-S-Rosen Mar 17, 2024
Maintainer

At minimum, it seems we need to update the @job (and @subflow?) decorator definition for Parsl because the user is never able to pass in the parsl_resource_specification. Even just having @job accept parsl_resource_specification as a keyword argument, which is not passed to @python_app but is injected into the function signature would be wise. This is independent of any particular aspect of quacc --- it's generally a missing piece of info we have no way to propagate.

There is the second aspect you propose, which is related to construction of the ASE command. Some calculators take parallel_info; others take a string, and other still accept environment variables. Is there a mechanism to "unify" these two points if ASE itself doesn't have a unified mechanism for handling parallelization details? That's where I'm a bit stuck on. If all the calculators took parallel_info (or some common interface), then that would be a lot easier.

hironori-kondo · 2024-11-08T21:13:03Z

hironori-kondo
Nov 8, 2024

Reviving this discussion with a clarification. From what I can see, ASE has done away with parallel_info, and quacc directly builds an EspressoProfile based on the quacc settings—so we should now be changing ESPRESSO_PARALLEL_CMD between jobs (if they require different parallelization settings)?

1 reply

Andrew-S-Rosen Nov 9, 2024
Maintainer

@hironori-kondo that is correct.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unifying parallel information for both codes and workflow engines #1881

{{title}}

Replies: 2 comments 12 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Unifying parallel information for both codes and workflow engines #1881

tomdemeyere Mar 14, 2024

The Problem

The Proposal

Replies: 2 comments · 12 replies

Andrew-S-Rosen Mar 15, 2024 Maintainer

Andrew-S-Rosen Mar 15, 2024 Maintainer

tomdemeyere Mar 16, 2024 Author

Andrew-S-Rosen Mar 17, 2024 Maintainer

tomdemeyere Mar 17, 2024 Author

Andrew-S-Rosen Mar 17, 2024 Maintainer

hironori-kondo Nov 8, 2024

Andrew-S-Rosen Nov 9, 2024 Maintainer

tomdemeyere
Mar 14, 2024

Replies: 2 comments 12 replies

Andrew-S-Rosen
Mar 15, 2024
Maintainer

Andrew-S-Rosen Mar 15, 2024
Maintainer

tomdemeyere Mar 16, 2024
Author

Andrew-S-Rosen Mar 17, 2024
Maintainer

tomdemeyere Mar 17, 2024
Author

Andrew-S-Rosen Mar 17, 2024
Maintainer

hironori-kondo
Nov 8, 2024

Andrew-S-Rosen Nov 9, 2024
Maintainer