Engine: Fix bug introduced when refactoring upload_calculation
#6348
Note that this is a critical bug that is currently on |
Thanks @sphuber! @DrFedro also reported an issue related to this to me, i.e. that the
Can confirm that the changes in this PR fix that issue. |
So he is running of the |
Yes, it seems so, and reverting to v2.5.1 fixed the issue.
I still want to have a proper look at the code, so I can also make sure I understand the changes in 6898ff4. Should have time for this tomorrow or Friday. |
Looking into this some more, won't the following line cause similar woes? I was playing around with the following code:

import os
import pathlib
import shutil
from tempfile import TemporaryDirectory

from aiida import orm, load_profile

load_profile()

localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'
folder_data = orm.FolderData(tree=pseudo_path)
shutil.rmtree(remote_workdir, ignore_errors=True)

def copy_local(transport, folder_data):
    with TemporaryDirectory() as tmpdir:
        dirpath = pathlib.Path(tmpdir)
        data_node = folder_data
        filepath_target = (dirpath / 'pseudo').resolve().absolute()
        filepath_target.parent.mkdir(parents=True, exist_ok=True)
        data_node.base.repository.copy_tree(filepath_target, 'pseudo')
        transport.put(f'{dirpath}/*', transport.getcwd())

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    copy_local(transport, folder_data)
    transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo')

The code above will give the following directory tree:
But switching the order of the |
Sure, but that is because you are calling the following transport.copy(os.path.join(pseudo_path, 'pseudo'), 'pseudo') And that is saying copy the contents of the source
So you are globbing the contents of So I don't think there is a regression in the behavior of |
I don't think there is a regression, I was just wondering if we should make a similar change for I rewrote the example to rely on the functions in the

from logging import LoggerAdapter
import shutil

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.engine.daemon.execmanager import _copy_remote_files, _copy_local_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))
localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
pseudo_path = '/Users/mbercx/project/core/jupyter/data'
shutil.rmtree(remote_workdir, ignore_errors=True)
folder_data = orm.FolderData(tree=pseudo_path)
folder_data.store()

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())

Critical of course are the

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo/*', 'pseudo')

Here, all is well, since I use the glob If we remove the glob, and invert the

local_copy_list_item = (folder_data.uuid, 'pseudo', 'pseudo')
remote_copy_list_item = (localhost.uuid, '/Users/mbercx/project/core/jupyter/data/pseudo', 'pseudo')

with localhost.get_transport() as transport:
    transport.mkdir(remote_workdir)
    transport.chdir(remote_workdir)
    _copy_remote_files(logger, random_calc_job, localhost, transport, [remote_copy_list_item], ())
    _copy_local_files(logger, random_calc_job, transport, None, [local_copy_list_item])

the behavior is different, i.e. there is no nested |
Since |
Sure, but they are actually used by the user, albeit it indirectly. The I need to try and find some time to try the example scripts on the latest release, to see what the original behavior was. I think the current changes on |
TLDR: The solution of this PR is wrong. Not even so much for the discrepancy in behavior of local and remote copy lists, but really because the implementation of This is a tricky one. The first question is what the behavior of The problem is really due to a detail of the original implementation of
The implementation did not literally copy the contents of each sequentially to the working directory. Rather, it would copy the instructions of the In the new implementation, this was changed, where the 3 copying steps are directly copied to the remote working dir, and the In principle, getting rid of the "hack" of merging |
LocalTransport
: Accept existing directories in puttree
upload_calculation
Scratch that... it is still not quite that simple 😭 |
@mbercx could you give this another review please? The behavior of the |
Thanks @sphuber! I think the critical question is indeed what the behavior of `Transport.put` and `Transport.copy` should be. I think it'd be hard to change their behavior from what you describe above, so I agree it makes sense to keep `cp`-like behavior.
I had a closer look at the code and did some more field testing on the behavior of the various `FileCopyOperation`s. I left two comments so far, of which the double comment on line 364 re copying the contents of a `FileType.DIRECTORY` is the most critical.
# Now copy the contents of the temporary folder to the remote working directory using the transport
for filepath in dirpath.iterdir():
    transport.put(str(filepath), filepath.name)
transport.makedirs(str(pathlib.Path(target).parent), ignore_existing=True)
Just a note that because of line 359 and this one, the copy behaviour of `local_copy_list` is not the same as `cp`, which would simply fail in case the parent folder that you are trying to copy into doesn't exist.
I wonder if this was also the previous behavior of `local_copy_list`. The QE plugin made the `pseudo` directory in the sandbox folder exactly because otherwise the copy command would fail, I assume.
Finally, `remote_copy_list` does fail when trying to copy files into a parent folder that doesn't exist.
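The difference described in this comment can be reproduced locally without AiiDA. The sketch below uses only the standard library and is not the `Transport` API: a plain copy into a missing parent directory fails the way `cp` does, whereas creating the parent first, as the `makedirs` call in the diff does, lets the copy succeed. The helper names and file names are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_like_cp(source: Path, target: Path) -> None:
    """Copy a file the way plain ``cp`` does: the parent of ``target`` must already exist."""
    shutil.copyfile(source, target)


def copy_creating_parents(source: Path, target: Path) -> None:
    """Copy a file after creating any missing parent directories of ``target``."""
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)


with tempfile.TemporaryDirectory() as tmpdir:
    base = Path(tmpdir)
    source = base / 'Si.upf'
    source.write_text('pseudo')
    # The parent directories ``workdir/pseudo`` do not exist yet.
    target = base / 'workdir' / 'pseudo' / 'Si.upf'

    try:
        copy_like_cp(source, target)
        cp_like_failed = False
    except FileNotFoundError:
        cp_like_failed = True

    copy_creating_parents(source, target)
    created = target.is_file()
```

Whether the engine should create missing parents or fail like `cp` is exactly the design question raised here; the sketch only makes the two options concrete.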
@@ -360,15 +361,14 @@ def _copy_local_files(logger, node, transport, inputs, local_copy_list):
    if data_node.base.repository.get_object(filename_source).file_type == FileType.DIRECTORY:
        # If the source object is a directory, we copy its entire contents
        data_node.base.repository.copy_tree(filepath_target, filename_source)
        transport.put(f'{dirpath}/*', target or '.')
Similar to my comment below, I was wondering if this means that the `local_copy_list` behavior is once again different from `cp`. Funny enough, it does seem similar when using `-r` and adding a forward slash after the source directory:
❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo ./pseudo; tree
.
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf
3 directories, 2 files
❯ rm -rf *; mkdir pseudo; cp -r ../test_qe/pseudo/ ./pseudo; tree
.
└── pseudo
    ├── Ba.upf
    └── Si.upf
2 directories, 2 files
Kind of similar to `rsync`, I guess.
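The two shell outcomes can be mimicked in plain Python. This is a minimal sketch, assuming Python 3.8+ for `dirs_exist_ok`; `copy_directory` and its `contents_only` flag are hypothetical names standing in for the "with or without trailing slash" distinction observed in the shell session, not anything from AiiDA.

```python
import shutil
import tempfile
from pathlib import Path


def copy_directory(source: Path, target: Path, contents_only: bool) -> None:
    """Hypothetical helper mimicking ``cp -r source target`` for an existing ``target``.

    With ``contents_only=False`` the source directory itself is placed inside ``target``
    (the ``cp -r pseudo ./pseudo`` case); with ``contents_only=True`` only its contents
    are merged into ``target`` (the ``cp -r pseudo/ ./pseudo`` case shown in the session).
    """
    destination = target if contents_only else target / source.name
    shutil.copytree(source, destination, dirs_exist_ok=True)


def relative_files(root: Path) -> list:
    """Return the sorted POSIX-style relative paths of all files below ``root``."""
    return sorted(path.relative_to(root).as_posix() for path in root.rglob('*') if path.is_file())


with tempfile.TemporaryDirectory() as tmpdir:
    base = Path(tmpdir)
    source = base / 'test_qe' / 'pseudo'
    source.mkdir(parents=True)
    for name in ('Ba.upf', 'Si.upf'):
        (source / name).write_text('pseudo')

    # Case 1: the target directory exists and the source directory is nested inside it.
    nested_root = base / 'nested'
    (nested_root / 'pseudo').mkdir(parents=True)
    copy_directory(source, nested_root / 'pseudo', contents_only=False)
    nested = relative_files(nested_root)

    # Case 2: only the contents of the source directory are copied into the target.
    flat_root = base / 'flat'
    (flat_root / 'pseudo').mkdir(parents=True)
    copy_directory(source, flat_root / 'pseudo', contents_only=True)
    flat = relative_files(flat_root)
```

Here `nested` contains `pseudo/pseudo/Ba.upf` and `pseudo/pseudo/Si.upf`, while `flat` contains `pseudo/Ba.upf` and `pseudo/Si.upf`, matching the two `tree` listings.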
Hmm, actually, after having a closer look, the behaviour seems different than I expected. With the following code:

import shutil
from logging import LoggerAdapter

from aiida import orm, load_profile
from aiida.common import AIIDA_LOGGER
from aiida.common.folders import SandboxFolder
from aiida.engine.daemon.execmanager import _copy_local_files, _copy_sandbox_files

load_profile()

random_calc_job = orm.load_node(36)
logger = LoggerAdapter(logger=AIIDA_LOGGER.getChild('execmanager'))
localhost = orm.load_computer('localhost')
remote_workdir = '/Users/mbercx/project/core/jupyter/workdir'
test_path = '/Users/mbercx/project/core/jupyter/test_qe'
shutil.rmtree(remote_workdir, ignore_errors=True)
folder_data = orm.FolderData(tree=test_path)
folder_data.store()

local_copy_list = [
    (folder_data.uuid, 'pseudo', 'pseudo'),
]

with SandboxFolder() as sandbox_folder:
    sandbox_folder.get_subfolder('pseudo', create=True)
    with localhost.get_transport() as transport:
        transport.mkdir(remote_workdir)
        transport.chdir(remote_workdir)
        _copy_sandbox_files(logger, random_calc_job, transport, sandbox_folder)
        _copy_local_files(logger, random_calc_job, transport, None, local_copy_list)

and the contents of `test_qe`:
test_qe
└── pseudo
    ├── Ba.upf
    └── Si.upf
I indeed get a nested folder:
workdir/
└── pseudo
    └── pseudo
        ├── Ba.upf
        └── Si.upf
Not creating the `pseudo` folder in the sandbox leads to the non-nested result. However, if I make the `local_copy_list`:
local_copy_list = [
    (folder_data.uuid, 'pseudo', '.'),
]
Then the `workdir` becomes:
workdir/
├── Ba.upf
└── Si.upf
Is that what we want? Now we are really doing the `cp -r pseudo/ .` (with forward slash) option, which means copy the contents of the directory to the target path.
@sphuber just a note: I'm leaving on holiday tomorrow until the 20th, so will most likely not have time to review until after that... I agree the release should come out soon though. Maybe @giovannipizzi (due to his experience) or @khsrali (due to the fact that he's working on transports) can get involved in the review? I think the discrepancy between The fact that |
If and only if the target directory already exists right? Otherwise it just copies as is. The problem here is indeed really the fact that the old implementation did not go directly through the |
Not sure if that's true, see my example near the end of #6348 (comment) |
Just a note that I also stumbled on this behavior now, and installing from the PR here fixes it. So I'll try to dedicate some time to reviewing the code and reading the discussion, maybe that could be helpful for merging. |
I tested this PR for a while with aiida-quantumespresso, and so far I have not found any issues in the code behavior. I confirm that the main branch is completely broken. Can you add a test that covers this case before merging?
In 6898ff4 the processing of the `local_copy_list` in the `upload_calculation` method was changed. Originally, the files specified by the `local_copy_list` were first copied into the `SandboxFolder` before its contents were copied to the working directory using the transport. The commit allowed the order in which the local and sandbox files were copied to be changed, so the local files were no longer copied through the sandbox. Rather, they were copied to a temporary directory on disk, which was then copied over using the transport. The problem is that if the latter copied over a directory that had already been created by the copying of the sandbox, an exception would be raised.
This reverts commit 424027f.
In 6898ff4 the processing of the `local_copy_list` in the `upload_calculation` method was changed. Originally, the files specified by the `local_copy_list` were first copied into the `SandboxFolder` before its contents were copied to the working directory using the transport. The commit allowed the order in which the local and sandbox files were copied to be changed, so the local files were no longer copied through the sandbox. Rather, they were copied to a temporary directory on disk, which was then copied over using the transport. The problem is that if the latter copied over a directory that had already been created by the copying of the sandbox, an exception would be raised.

For example, if the sandbox contained the directory `folder` and the `local_copy_list` contained the item `(_, 'file', './folder/file')`, this would work just fine in the original implementation, as the `file` would be written to the `folder` in the remote working directory. The current implementation, however, would write the file content to `folder/file` in a local temporary directory, and then iterate over the top-level directories and copy them over. Essentially it would be doing:

    Transport.put('/tmpdir/folder', 'folder')

but since `folder` would already exist in the remote working directory, the local folder would be _nested_ and so the final path on the remote would be `/workingdir/folder/folder/file`.

The correct approach is to copy each item of the `local_copy_list` from the local temporary directory _individually_ using the `Transport.put` interface and not iterate over the top-level entries of the temporary directory at the end.
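The nesting described in this commit message can be illustrated with a self-contained sketch. The `put` function below is a stand-in with `cp -r`-like semantics written for this example; it is not the real `Transport.put`, and `upload` and its `per_item` flag are likewise hypothetical. The directory names follow the example in the commit message.

```python
import shutil
import tempfile
from pathlib import Path


def put(source: Path, target: Path) -> None:
    """Stand-in for a ``cp -r``-like transport call (not the real ``Transport.put``)."""
    if source.is_dir():
        # An existing target directory receives the source directory as a child.
        destination = target / source.name if target.is_dir() else target
        shutil.copytree(source, destination)
    else:
        shutil.copyfile(source, target)


def files_below(root: Path) -> list:
    """Return the sorted POSIX-style relative paths of all files below ``root``."""
    return sorted(path.relative_to(root).as_posix() for path in root.rglob('*') if path.is_file())


def upload(per_item: bool) -> list:
    """Copy ``folder/file`` from a staging directory into a working directory."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        workdir = base / 'workingdir'
        # ``folder`` already exists because the sandbox was copied first.
        (workdir / 'folder').mkdir(parents=True)

        staging = base / 'tmpdir'
        (staging / 'folder').mkdir(parents=True)
        (staging / 'folder' / 'file').write_text('content')

        if per_item:
            # Copy the single item directly to its final target path.
            put(staging / 'folder' / 'file', workdir / 'folder' / 'file')
        else:
            # Iterate over the top-level entries of the staging directory.
            for entry in staging.iterdir():
                put(entry, workdir / entry.name)

        return files_below(workdir)


nested = upload(per_item=False)
direct = upload(per_item=True)
```

With the top-level iteration the file ends up at `folder/folder/file`, while copying the item individually places it at `folder/file`.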
@sphuber as discussed, I've written a bunch of tests to check the edge cases I've been studying, first based on the latest release tag (v2.5.1): https://github.com/mbercx/aiida-core/blob/test/upload-calc/tests/engine/daemon/test_execmanager.py These should all pass for the previous implementation, see https://github.com/mbercx/aiida-core/actions/runs/9285274053/job/25549435202 I've then added the same tests as a commit on top of this PR: Here, 3 tests fail: https://github.com/mbercx/aiida-core/actions/runs/9285404390/job/25549825351
So the main point now is that the current copying behaviour of the PR does not preserve that of v2.5.1. I'll see if I can fix that first. |
Ok, I've extended the number of tests, added tests for the SSH transport, and managed to get it down to one failure (for SSH): https://github.com/mbercx/aiida-core/actions/runs/9345893736/job/25719640192 See all changes here: mbercx@fa0e9c5 It seems the issue is that the local and SSH transport return different errors when trying to remotely copy a file to a directory that doesn't yet exist. The local transport returns a
The SSH transport returns an OSError instead: aiida-core/src/aiida/transports/plugins/ssh.py Lines 1191 to 1195 in acec0c1
For some reason, the engine deals with these two errors differently: aiida-core/src/aiida/engine/daemon/execmanager.py Lines 292 to 304 in acec0c1
I'm now trying to figure out why these two are treated differently. EDIT: the reason is explained here: 101a8d6 So similar to the source file not existing, if the target path is a file in a directory that doesn't exist, the failure will not be transient and hence we should have the SSH transport return a |
The What we should do is check for the

try:
    os.remove(filename)
except OSError as e:
    if e.errno == errno.ENOENT:
        print('File does not exist')

See this PEP as well for details. If the reason the SshTransport is raising is that the file is missing, we should have the transport also raise the more specific |
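As a rough sketch of the `errno` check suggested here, using only the standard library: `is_missing_path` is a hypothetical helper name, and the check only works when the exception actually carries an `errno`. An `OSError` constructed from a message alone has `errno` set to `None`, so it cannot be classified this way.

```python
import errno
import os
import tempfile


def is_missing_path(exception: OSError) -> bool:
    """Return whether ``exception`` reports a missing file or directory (``ENOENT``)."""
    return exception.errno == errno.ENOENT


with tempfile.TemporaryDirectory() as tmpdir:
    try:
        os.remove(os.path.join(tmpdir, 'missing.txt'))
    except OSError as exception:
        # Exceptions raised by OS calls carry the errno of the failed call.
        from_os_call = is_missing_path(exception)

# An ``OSError`` built from a message alone carries no ``errno`` to inspect.
from_message_only = is_missing_path(OSError('copy command failed'))

# Passing an errno makes Python pick the matching subclass (PEP 3151).
built_with_errno = OSError(errno.ENOENT, os.strerror(errno.ENOENT))
```

In this sketch `built_with_errno` is an instance of `FileNotFoundError`, so code that raises with an explicit `errno.ENOENT` gets the more specific exception type without extra work.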
Thanks @sphuber! Unfortunately, the Instead I've added a check to see if the parent destination directory exists in the mbercx@011f5f7#diff-1284fda2c2014c8c418475c99a82308d82ab96cb33e6c49d978c0031f3b1ef3d It's more patchy for sure, but at least my tests now all pass locally. 😅 I'm praying they also pass on GitHub: https://github.com/mbercx/aiida-core/actions/runs/9347879530/job/25725810490 |
Also double-checked my branch adding these tests on v2.5.1, and can confirm they all pass, save for the same error discussed above. |
My suggestion for proceeding is to split up the PR in 3 commits:
Superseded by #6447 |