-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LibOS,PAL] Emulate file-backed mmap via PAL read/write APIs #1818
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)
libos/test/ltp/manifest.template
line 20 at r1 (raw file):
# many LTP multi-process tests rely on shared-memory IPC via `mmap(MAP_SHARED, </dev/shm fd>)` { type = "untrusted_shm", path = "/dev/shm", uri = "dev:/dev/shm" },
This is taken from #1718. So I'd say that that PR is the prerequisite of this one.
This is required because LTP uses shared file-backed mmaps without explicit msyncs or munmaps (recall that Gramine requires explicit points like msync()
and munmap()
to synchronize mappings with file contents).
pal/include/pal/pal.h
line 426 at r1 (raw file):
* `PalStreamRead` and `PalStreamWrite`. Use `PalVirtualMemoryFree` to unmap this mapping. */ int PalStreamMap(PAL_HANDLE handle, void* addr, pal_prot_flags_t prot, uint64_t offset,
We may want to rename this to PalDeviceMap()
pal/src/host/linux-sgx/pal_files.c
line 391 at r1 (raw file):
} return ret; }
The biggest win of this PR: removing the super-complicated file_map()
callback from Linux-SGX PAL.
Jenkins, retest Jenkins-SGX-20.04-musl please. Error is in the signing test:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 21 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)
a discussion (no related file):
it should be noted that this PR introduces a major regression: writable MAP_SHARED
mappings that are actually shared between processes will no longer work correctly
libos/src/bookkeep/libos_vma.c
line 1558 at r2 (raw file):
/* This helper function is to refresh access protections on the VMA pages of a given file handle on * file-extend operations (currently only `ftruncate`). */ int prot_refresh_mmaped_from_file_handle(struct libos_handle* hdl) {
this iterates over all VMAs we have in memory, not just VMAs belonging to this file; once we fix write
to also call this function, this is likely to cause significant performance problems
libos/src/fs/libos_fs_util.c
line 255 at r2 (raw file):
} int generic_truncate(struct libos_handle* hdl, file_off_t size) {
calling the new hook only in truncate
is not enough — file length can also be extended by a write (indeed, it is the more common case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 21 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mwkmwkmwk)
libos/src/fs/libos_fs_util.c
line 255 at r2 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
calling the new hook only in
truncate
is not enough — file length can also be extended by a write (indeed, it is the more common case)
I agree, but we don't have this semantics even in the current implementation. So I think this PR is strictly better than what we currently have.
Also, looking at our tests and example apps, seems like no app uses such a strange pattern of mmap()
+ write(...extend...)
. Or do you have some other scenario in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 21 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mwkmwkmwk, and @oshogbo)
a discussion (no related file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
it should be noted that this PR introduces a major regression: writable
MAP_SHARED
mappings that are actually shared between processes will no longer work correctly
So, do we actually care about this? Previously this worked only on non-SGX PAL, right?
CC: @dimakuv, @kailun-qin, @oshogbo
libos/src/fs/libos_fs_util.c
line 255 at r2 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I agree, but we don't have this semantics even in the current implementation. So I think this PR is strictly better than what we currently have.
Also, looking at our tests and example apps, seems like no app uses such a strange pattern of
mmap()
+write(...extend...)
. Or do you have some other scenario in mind?
Hmm, but this shouldn't be hard to add, I think? Or maybe you're worrying about the performance if we add it to each write()
? If it's the latter, then I think it's possible to optimize it by storing more metadata so that the overhead is negligible for files which don't have any mappings.
libos/test/ltp/manifest.template
line 20 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This is taken from #1718. So I'd say that that PR is the prerequisite of this one.
This is required because LTP uses shared file-backed mmaps without explicit msyncs or munmaps (recall that Gramine requires explicit points like
msync()
andmunmap()
to synchronize mappings with file contents).
#1718 was merged, could you rebase? Current merge base doesn't contain it.
739b2e1
to
074b34e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 21 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)
a discussion (no related file):
Previously, mkow (Michał Kowalczyk) wrote…
So, do we actually care about this? Previously this worked only on non-SGX PAL, right?
CC: @dimakuv, @kailun-qin, @oshogbo
Yes, previously it worked only on non-SGX PAL (i.e. in gramine-direct
).
I would say that this was a bad experience for everyone involved:
- Users were surprised that something that works on
gramine-direct
suddenly doesn't work ongramine-sgx
, for no good reason. - Developers were annoyed to not being able to debug mmap-related issues under
gramine-direct
.
To me, this "major regression" is actually a win. It makes mmap behavior sane and uniform.
-- commits
line 26 at r2:
TODO:
file-growing `ftruncate()` on
->
file-growing `write()` and `ftruncate()` on
libos/src/bookkeep/libos_vma.c
line 1558 at r2 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
this iterates over all VMAs we have in memory, not just VMAs belonging to this file; once we fix
write
to also call this function, this is likely to cause significant performance problems
To make the claim of @mwkmwkmwk more specific:
- The
dump_vmas()
function indeed iterates over all VMAs. - However, this function calls a
vma_filter_needs_prot_refresh()
callback on each VMA, to verify that this VMA belongs to this file (hdl
). - Unfortunately, we don't maintain a per-handle list of VMAs over which we could iterate. I guess this was the crux of @mwkmwkmwk's comment.
So, I generally agree with @mwkmwkmwk, but I don't see a simple way around this O(n) logic.
Btw, I added write()
logic now.
libos/src/fs/libos_fs_util.c
line 255 at r2 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Hmm, but this shouldn't be hard to add, I think? Or maybe you're worrying about the performance if we add it to each
write()
? If it's the latter, then I think it's possible to optimize it by storing more metadata so that the overhead is negligible for files which don't have any mappings.
Done.
I forgot that we already have this metadata -- hdl->inode->num_mmapped
. I'm using that one now.
libos/test/ltp/manifest.template
line 20 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
#1718 was merged, could you rebase? Current merge base doesn't contain it.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 21 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)
a discussion (no related file):
Did a quick perf test on my (old-ish) machine.
Runtimes of the full FS pytest suite (in Gramine debug mode, non-EDMM):
-
Without PR
- direct: 4.41 seconds
- sgx: 84.12 seconds
-
With PR
- direct: 4.63 seconds
- sgx: 87.66 seconds
Results look reasonable to me. I guess we'll need to do a few more perf tests (in release mode, with EDMM, on different machines) before being ready for merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 10 of 19 files at r1, 6 of 10 files at r3, all commit messages.
Reviewable status: 16 of 21 files reviewed, 12 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mwkmwkmwk, and @oshogbo)
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Yes, previously it worked only on non-SGX PAL (i.e. in
gramine-direct
).I would say that this was a bad experience for everyone involved:
- Users were surprised that something that works on
gramine-direct
suddenly doesn't work ongramine-sgx
, for no good reason.- Developers were annoyed to not being able to debug mmap-related issues under
gramine-direct
.To me, this "major regression" is actually a win. It makes mmap behavior sane and uniform.
Ok, makes sense to me.
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Did a quick perf test on my (old-ish) machine.
Runtimes of the full FS pytest suite (in Gramine debug mode, non-EDMM):
Without PR
- direct: 4.41 seconds
- sgx: 84.12 seconds
With PR
- direct: 4.63 seconds
- sgx: 87.66 seconds
Results look reasonable to me. I guess we'll need to do a few more perf tests (in release mode, with EDMM, on different machines) before being ready for merge.
Is the slowdown caused by some specific test(s), or distributed equally? 5% is noticable, I think it should be less (or rather: I don't see why it has to be that much). OTOH, our tests are not designed to be benchmarks, so, could you try running something more suitable?
libos/include/libos_vma.h
line 136 at r3 (raw file):
int reload_mmaped_from_file_handle(struct libos_handle* hdl); /* Refresh page protections of file mappings of `hdl` */
Please add a short remark why this is needed in the first place, which is: when the file size has changed.
libos/src/fs/libos_fs_util.c
line 152 at r3 (raw file):
return pal_to_unix_errno(ret); size_t to_read_size = size;
Sounds like Yoda, I assume you meant size_to_read
?
Code quote:
to_read_size
libos/src/fs/libos_fs_util.c
line 152 at r3 (raw file):
return pal_to_unix_errno(ret); size_t to_read_size = size;
I think Linux rounds the read size (or rather: map size) up to a whole page. Probably nothing relies on this, but IMO better to emulate this exactly. Right now we're leaving the padding with zeros instead of file contents.
We probably should also add a test for this? (ideally, just a subtest in one of the existing ones)
libos/src/fs/libos_fs_util.c
line 273 at r3 (raw file):
ret = prot_refresh_mmaped_from_file_handle(hdl); if (ret < 0) { log_error("refresh of page protections of mmapped regions of file failed: %s",
Suggestion:
refreshing page protections
pal/include/pal/pal.h
line 426 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
We may want to rename this to
PalDeviceMap()
Could you do that in this PR? There aren't that many places left using it.
pal/src/pal_rtld.c
line 555 at r3 (raw file):
goto out; } ret = _PalStreamRead(handle, c->map_off, map_size, map_addr);
ditto (rounding up the read size)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 10 of 27 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)
a discussion (no related file):
Previously, mkow (Michał Kowalczyk) wrote…
Ok, makes sense to me.
Done from my side
a discussion (no related file):
Previously, mkow (Michał Kowalczyk) wrote…
Is the slowdown caused by some specific test(s), or distributed equally? 5% is noticable, I think it should be less (or rather: I don't see why it has to be that much). OTOH, our tests are not designed to be benchmarks, so, could you try running something more suitable?
Ok, here are the new results. I had a couple bugs/issues in my previous quick perf test; now I think I fixed all those issues.
- First issue was that I ran in
debug
mode. Now I run inrelease
mode. - Second issue (main one) was that I didn't compare the
libos/test/fs/
pytest apples-to-apples. That's because this PR enables several tests which were previously (on master) disabled. So now I have a pre-requisite -- see below -- to compare apples-to-apples. - Third issue that I did the test only on one machine, without EDMM. Now I also show the results with EDMM.
Pre-requisite
We compare with current master for this PR (commit cbd30b7).
With this PR, to compare apples-to-apples (same total number of FS tests), need to revert Pytest files:
git checkout cbd30b795098cd459a3e2165e9cf56f99cf2dc65 -- \
libos/test/fs/test_fs.py \
libos/test/fs/test_enc.py \
libos/test/fs/test_tmpfs.py
old non-EDMM machine (fast hard disk)
-
without PR
- direct: 3.79 seconds
- sgx: 69.75 seconds
-
with PR
- direct: 4.01 seconds
- sgx: 69.78 seconds
new EDMM machine (slow hard disk)
-
without PR
- direct: 9.35 seconds
- sgx (with EDMM=1): 17.04s seconds
-
with PR
- direct: 9.38s seconds
- sgx (with EDMM=1): 17.06s seconds
Conclusion: Things make sense now. First, gramine-direct
is a bit slower now because it doesn't use the "cheaty" mmap(..., file-fd, ...)
but always uses a generic emulate-mmap-via-read-write. Second, gramine-sgx
has exact same performance, because nothing really changed in this PR for the SGX case (except those IF checks on write/truncate).
libos/include/libos_vma.h
line 136 at r3 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Please add a short remark why this is needed in the first place, which is: when the file size has changed.
Done.
libos/src/fs/libos_fs_util.c
line 152 at r3 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Sounds like Yoda, I assume you meant
size_to_read
?
Done.
libos/src/fs/libos_fs_util.c
line 152 at r3 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
I think Linux rounds the read size (or rather: map size) up to a whole page. Probably nothing relies on this, but IMO better to emulate this exactly. Right now we're leaving the padding with zeros instead of file contents.
We probably should also add a test for this? (ideally, just a subtest in one of the existing ones)
I don't understand what you mean.
Please note that this callback function gets 4KB-aligned addr
and size
. I added asserts now to make it explicit.
Do these asserts clear the picture now? We (1) allocate size
zero-filled bytes, aligned to a whole page, then we (2) read as much as possible into these allocated bytes from the underlying file, and finally we (3) PROT_NONE those pages that do not cover the read-from-file bytes (whole pages).
So in a general case, it ends up like this:
size / 4KB
whole pages allocated, with proper permissions (at least Read):
a. First N pages contain the contents of the file;
b. The last page where file contents stop (if any) gets padded with zeros;- Last
size / 4KB - N
pages have PROT_NONE.
This is exactly how Linux works. In fact, I believe we have enough tests for this, because that's what Glibc and many apps rely on. But if you still want a sub-test, what should it be?
libos/src/fs/libos_fs_util.c
line 273 at r3 (raw file):
ret = prot_refresh_mmaped_from_file_handle(hdl); if (ret < 0) { log_error("refresh of page protections of mmapped regions of file failed: %s",
Done.
pal/include/pal/pal.h
line 426 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Could you do that in this PR? There aren't that many places left using it.
Done.
pal/src/pal_rtld.c
line 555 at r3 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto (rounding up the read size)
ditto, I don't understand the comment.
Similar to another place, I added the asserts to make it explicit.
Please note that implementations of _PalStreamRead()
won't read past the end of the file. For example, for trusted files (which the PAL entrypoint must be), we have this guardrail:
gramine/pal/src/host/linux-sgx/pal_files.c
Line 192 in cbd30b7
off_t end = MIN(offset + count, handle->file.size); |
Also note that below (I put a "mark" comment) our code explicitly zeroes-out the unused/unread parts of the allocated memory region.
pal/src/pal_rtld.c
line 612 at r3 (raw file):
* segment allocated via _PalVirtualMemoryAlloc() is already zeroed out) */ if (ALLOC_ALIGN_UP(c->data_end) > c->data_end) memset((void*)c->data_end, 0, ALLOC_ALIGN_UP(c->data_end) - c->data_end);
mark (see comment above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 19 files at r1, 1 of 10 files at r3, 15 of 15 files at r4, all commit messages.
Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mwkmwkmwk, and @oshogbo)
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Ok, here are the new results. I had a couple bugs/issues in my previous quick perf test; now I think I fixed all those issues.
- First issue was that I ran in
debug
mode. Now I run inrelease
mode.- Second issue (main one) was that I didn't compare the
libos/test/fs/
pytest apples-to-apples. That's because this PR enables several tests which were previously (on master) disabled. So now I have a pre-requisite -- see below -- to compare apples-to-apples.- Third issue that I did the test only on one machine, without EDMM. Now I also show the results with EDMM.
Pre-requisite
We compare with current master for this PR (commit cbd30b7).
With this PR, to compare apples-to-apples (same total number of FS tests), need to revert Pytest files:
git checkout cbd30b795098cd459a3e2165e9cf56f99cf2dc65 -- \ libos/test/fs/test_fs.py \ libos/test/fs/test_enc.py \ libos/test/fs/test_tmpfs.py
old non-EDMM machine (fast hard disk)
without PR
- direct: 3.79 seconds
- sgx: 69.75 seconds
with PR
- direct: 4.01 seconds
- sgx: 69.78 seconds
new EDMM machine (slow hard disk)
without PR
- direct: 9.35 seconds
- sgx (with EDMM=1): 17.04s seconds
with PR
- direct: 9.38s seconds
- sgx (with EDMM=1): 17.06s seconds
Conclusion: Things make sense now. First,
gramine-direct
is a bit slower now because it doesn't use the "cheaty"mmap(..., file-fd, ...)
but always uses a generic emulate-mmap-via-read-write. Second,gramine-sgx
has exact same performance, because nothing really changed in this PR for the SGX case (except those IF checks on write/truncate).
Ok, now it makes sense, so I'm resolving :)
libos/src/bookkeep/libos_vma.c
line 1558 at r2 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
To make the claim of @mwkmwkmwk more specific:
- The
dump_vmas()
function indeed iterates over all VMAs.- However, this function calls a
vma_filter_needs_prot_refresh()
callback on each VMA, to verify that this VMA belongs to this file (hdl
).- Unfortunately, we don't maintain a per-handle list of VMAs over which we could iterate. I guess this was the crux of @mwkmwkmwk's comment.
So, I generally agree with @mwkmwkmwk, but I don't see a simple way around this O(n) logic.
Btw, I added
write()
logic now.
One note here: in revision r4
of this PR it doesn't trigger on all write()
calls, only on writes to files with mappings, so it should be reasonably fast (also, see the top-level discussion about performance).
libos/src/bookkeep/libos_vma.c
line 1530 at r4 (raw file):
size_to_prot = 0; } else { if (file_size - vma_info->file_offset > vma_info->length) {
ditto (see below)
libos/src/bookkeep/libos_vma.c
line 1548 at r4 (raw file):
if (vma_info->length - size_to_prot) { ret = PalVirtualMemoryProtect(vma_info->addr + size_to_prot, vma_info->length - size_to_prot, /*prot=*/0);
Hmm, doesn't it break some invariants that vma_info
holds different protection information than the real memory? What e.g. about EFAULT
checks in syscalls?
libos/src/bookkeep/libos_vma.c
line 1599 at r4 (raw file):
/* NOTE: Unfortunately there's a data race here: the memory can be unmapped, or remapped, by * another thread by the time we get to `msync`. */ lock(&file->inode->lock);
Are we guaranteed here that file->inode
exists?
/* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */
struct libos_inode* inode;
I'm always confused by that, I'm not sure when it exists and when it doesn't...
libos/src/bookkeep/libos_vma.c
line 1608 at r4 (raw file):
size_to_msync = 0; } else { if (file_size - vma_info->file_offset > vma_info->length) {
That's a very long way to write MIN()
;)
libos/src/bookkeep/libos_vma.c
line 1609 at r4 (raw file):
} else { if (file_size - vma_info->file_offset > vma_info->length) { /* file size exceeds the mmapped part in VMA, all VMA is synced */
Suggestion:
the whole VMA is synced */
libos/src/fs/libos_fs_util.c
line 152 at r3 (raw file):
Ah, I thought this function has the same semantic as mmap()
: (quoting man
)
The address addr must be a multiple of the page size (but length need not be).
Our callback actually doesn't have this semantics, so there's nothing wrong here.
libos/src/fs/chroot/encrypted.c
line 496 at r4 (raw file):
if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) { /* If there are any mappings for the file, this will refresh their access protections. */
ditto (see below)
libos/src/fs/chroot/encrypted.c
line 504 at r4 (raw file):
} /* If there are any MAP_SHARED mappings for the file, this will read data from `enc`. */
ditto
libos/src/fs/chroot/encrypted.c
line 537 at r4 (raw file):
if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) { /* If there are any mappings for the file, this will refresh their access protections. */
Suggestion:
/* There are mappings for the file, refresh their access protections. */
libos/src/fs/chroot/fs.c
line 247 at r4 (raw file):
if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) { /* If there are any mappings for the file, this will refresh their access protections. */
Suggestion:
/* There are mappings for the file, refresh their access protections. */
libos/src/fs/chroot/fs.c
line 255 at r4 (raw file):
} /* If there are any MAP_SHARED mappings for the file, this will read data from `hdl`. */
ditto
libos/src/fs/tmpfs/fs.c
line 302 at r4 (raw file):
Can you read this without the lock? struct libos_inode
documentation says:
The fields in this structure are protected by
lock
, with the exception of fields that do not change (type
,mount
,fs
).
pal/src/pal_rtld.c
line 555 at r3 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
ditto, I don't understand the comment.
Similar to another place, I added the asserts to make it explicit.
Please note that implementations of
_PalStreamRead()
won't read past the end of the file. For example, for trusted files (which the PAL entrypoint must be), we have this guardrail:gramine/pal/src/host/linux-sgx/pal_files.c
Line 192 in cbd30b7
off_t end = MIN(offset + count, handle->file.size); Also note that below (I put a "mark" comment) our code explicitly zeroes-out the unused/unread parts of the allocated memory region.
Ah, I didn't notice that c
doesn't contain raw ELF values, but already-processed ones (i.e. aligned by the code above). Raw ELF load commands don't have to be fully aligned to memory pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 22 of 27 files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)
libos/src/bookkeep/libos_vma.c
line 1530 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto (see below)
Done.
libos/src/bookkeep/libos_vma.c
line 1548 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Hmm, doesn't it break some invariants that
vma_info
holds different protection information than the real memory? What e.g. aboutEFAULT
checks in syscalls?
Hm, there good point.
I'm afraid you're correct but there's no easy way to fix this. Our VMA subsystem is not ready for such stuff. What we would need is to be able to split and merge adjacent VMAs on these callback functions. VMA
splitting logic exists in our code, but merging doesn't exist, see e.g. this comment:
gramine/libos/src/bookkeep/libos_vma.c
Lines 99 to 104 in 42ba480
/* | |
* "vma_tree" holds all vmas with the assumption that no 2 overlap (though they could be adjacent). | |
* Currently we do not merge similar adjacent vmas - if we ever start doing it, this code needs | |
* to be revisited as there might be some optimizations that would break due to it. | |
*/ | |
static struct avl_tree vma_tree = {.cmp = vma_tree_cmp}; |
Interestingly, we have been having this same issue for ages in our checkpoint-and-restore logic:
gramine/libos/src/bookkeep/libos_vma.c
Lines 1582 to 1605 in 42ba480
/* Send file-backed memory region. */ | |
uint64_t file_size = 0; | |
int ret = get_file_size(vma->file, &file_size); | |
if (ret < 0) | |
return ret; | |
/* Access beyond the last file-backed page will cause SIGBUS. For reducing fork | |
* latency, we send only those memory contents of VMA that are backed by the file, | |
* round up to pages. Rest of VMA memory region will be inaccessible in the child | |
* process. */ | |
size_t send_size = vma->length; | |
if (vma->file_offset + vma->length > file_size) { | |
send_size = file_size > vma->file_offset ? file_size - vma->file_offset : 0; | |
send_size = ALLOC_ALIGN_UP(send_size); | |
} | |
/* It may happen that the whole file-backed memory is beyond the file size (e.g., | |
* the file was truncated after the memory was allocated). In this case we consider | |
* the whole memory region to be inaccessible. */ | |
if (send_size > 0) { | |
struct libos_mem_entry* mem; | |
DO_CP_SIZE(memory, vma->addr, send_size, &mem); | |
mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0); | |
} |
TLDR:
- Your issue is legit.
- We can easily split VMAs into two: one with normal protections, one with NONE protections.
- But we cannot merge those VMAs back together (not implemented) when the
write()
ortruncate()
grows the file further. - So we may end up with hundreds of adjacent VMAs, slowing down everything.
- We had the same issue in restored VMAs (for opened files in child processes); we didn't observe any issues because of that in real-world apps.
So given the above, I think we should postpone fixing this issue (for the time when we are ready to modify VMA bookkeeping code). I wrote a corresponding FIXME comment here.
libos/src/bookkeep/libos_vma.c
line 1599 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Are we guaranteed here that
file->inode
exists?/* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */ struct libos_inode* inode;I'm always confused by that, I'm not sure when it exists and when it doesn't...
First of all, you're looking at the wrong inode
field. What you linked is the dentry field:
gramine/libos/include/libos_fs.h
Lines 218 to 220 in 42ba480
struct libos_dentry { | |
/* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */ | |
struct libos_inode* inode; |
But here we're using the inode
field of the LibOS handle:
gramine/libos/include/libos_handle.h
Lines 170 to 179 in 42ba480
/* | |
* Inode associated with this handle. Currently optional, and only for the use of underlying | |
* filesystem (see `libos_inode` in `libos_fs.h`). Eventually, should replace `dentry` fields. | |
* | |
* This field does not change, so reading it does not require holding `lock`. | |
* | |
* When taking locks for both handle and inode (`hdl->lock` and `hdl->inode->lock`), you should | |
* lock the *inode* first. | |
*/ | |
struct libos_inode* inode; |
If the handle has an inode
, then it always has this inode. And we are guaranteed to have an inode
here, because file
handle is an actual file in this function -- we know this because this function is called from write()
and truncate()
callbacks in the FS subsystem.
libos/src/bookkeep/libos_vma.c
line 1608 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
That's a very long way to write
MIN()
;)
But that was on purpose, to have explicit comments what and why we're doing. Do you want me to remove these comments? I like them.
libos/src/bookkeep/libos_vma.c
line 1609 at r4 (raw file):
} else { if (file_size - vma_info->file_offset > vma_info->length) { /* file size exceeds the mmapped part in VMA, all VMA is synced */
Done. (2x)
libos/src/fs/chroot/encrypted.c
line 496 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto (see below)
Done.
libos/src/fs/chroot/encrypted.c
line 504 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto
Done. But here had to rephrase a bit, to make clear it will happen for MAP_SHARED mappings only
libos/src/fs/chroot/encrypted.c
line 537 at r4 (raw file):
if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) { /* If there are any mappings for the file, this will refresh their access protections. */
Done.
libos/src/fs/chroot/fs.c
line 247 at r4 (raw file):
if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) { /* If there are any mappings for the file, this will refresh their access protections. */
Done.
libos/src/fs/chroot/fs.c
line 255 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto
Done.
libos/src/fs/tmpfs/fs.c
line 302 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Can you read this without the lock?
struct libos_inode
documentation says:The fields in this structure are protected by
lock
, with the exception of fields that do not change (type
,mount
,fs
).
Done. Oops, good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 22 of 27 files reviewed, 15 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
a discussion (no related file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Done from my side
very well then
libos/src/bookkeep/libos_vma.c
line 1558 at r2 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
One note here: in revision
r4
of this PR it doesn't trigger on allwrite()
calls, only on writes to files with mappings, so it should be reasonably fast (also, see the top-level discussion about performance).
I still think we should keep all VMAs of a given libos_handle
on a linked list; however, I'm fine with this being fixed in a separate PR
libos/src/bookkeep/libos_vma.c
line 1548 at r4 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Hm, there good point.
I'm afraid you're correct but there's no easy way to fix this. Our VMA subsystem is not ready for such stuff. What we would need is to be able to split and merge adjacent VMAs on these callback functions. VMA
splitting logic exists in our code, but merging doesn't exist, see e.g. this comment:gramine/libos/src/bookkeep/libos_vma.c
Lines 99 to 104 in 42ba480
/* * "vma_tree" holds all vmas with the assumption that no 2 overlap (though they could be adjacent). * Currently we do not merge similar adjacent vmas - if we ever start doing it, this code needs * to be revisited as there might be some optimizations that would break due to it. */ static struct avl_tree vma_tree = {.cmp = vma_tree_cmp}; Interestingly, we have been having this same issue for ages in our checkpoint-and-restore logic:
gramine/libos/src/bookkeep/libos_vma.c
Lines 1582 to 1605 in 42ba480
/* Send file-backed memory region. */ uint64_t file_size = 0; int ret = get_file_size(vma->file, &file_size); if (ret < 0) return ret; /* Access beyond the last file-backed page will cause SIGBUS. For reducing fork * latency, we send only those memory contents of VMA that are backed by the file, * round up to pages. Rest of VMA memory region will be inaccessible in the child * process. */ size_t send_size = vma->length; if (vma->file_offset + vma->length > file_size) { send_size = file_size > vma->file_offset ? file_size - vma->file_offset : 0; send_size = ALLOC_ALIGN_UP(send_size); } /* It may happen that the whole file-backed memory is beyond the file size (e.g., * the file was truncated after the memory was allocated). In this case we consider * the whole memory region to be inaccessible. */ if (send_size > 0) { struct libos_mem_entry* mem; DO_CP_SIZE(memory, vma->addr, send_size, &mem); mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0); } TLDR:
- Your issue is legit.
- We can easily split VMAs into two: one with normal protections, one with NONE protections.
- But we cannot merge those VMAs back together (not implemented) when the
write()
ortruncate()
grows the file further.- So we may end up with hundreds of adjacent VMAs, slowing down everything.
- We had the same issue in restored VMAs (for opened files in child processes); we didn't observe any issues because of that in real-world apps.
So given the above, I think we should postpone fixing this issue (for the time when we are ready to modify VMA bookkeeping code). I wrote a corresponding FIXME comment here.
I'd propose a somewhat different handling of this.
I think we should commit to the model of "logical" split within a single VMA: some prefix of the VMA is actually active (and has valid pages), while the rest is unmapped (and returns SIGBUS
). This is, in fact, what our code already expects in some places — see memfault_upcall
in libos_signal.c
, for one.
What I would propose is:
- have a
valid_end
(exact name TBD) field added tolibos_vma
andlibos_vma_info
, which is the boundary between valid area andSIGBUS
-producing area - have a function in
libos_vma.c
that updates that field and callsPalVirtualMemoryProtect
as appropriate to enforce it - check this field in
is_in_adjacent_user_vmas
so that syscalls return-EFAULT
as appropriate
This should ensure correct semantics without too much suffering with split_vma
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r5, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)
libos/src/bookkeep/libos_vma.c
line 1599 at r4 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
First of all, you're looking at the wrong
inode
field. What you linked is the dentry field:gramine/libos/include/libos_fs.h
Lines 218 to 220 in 42ba480
struct libos_dentry { /* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */ struct libos_inode* inode; But here we're using the
inode
field of the LibOS handle:gramine/libos/include/libos_handle.h
Lines 170 to 179 in 42ba480
/* * Inode associated with this handle. Currently optional, and only for the use of underlying * filesystem (see `libos_inode` in `libos_fs.h`). Eventually, should replace `dentry` fields. * * This field does not change, so reading it does not require holding `lock`. * * When taking locks for both handle and inode (`hdl->lock` and `hdl->inode->lock`), you should * lock the *inode* first. */ struct libos_inode* inode; If the handle has an
inode
, then it always has this inode. And we are guaranteed to have aninode
here, becausefile
handle is an actual file in this function -- we know this because this function is called fromwrite()
andtruncate()
callbacks in the FS subsystem.
Ok, reading this won't be needed after implementing @mwkmwkmwk's suggestion from the discussion above (we'll be able to get this info from the vma).
libos/src/bookkeep/libos_vma.c
line 1608 at r4 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
But that was on purpose, to have explicit comments what and why we're doing. Do you want me to remove these comments? I like them.
Seems overly verbose for a trivial thing to me, but won't block on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mwkmwkmwk)
libos/src/bookkeep/libos_vma.c
line 1558 at r2 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
I still think we should keep all VMAs of a given
libos_handle
on a linked list; however, I'm fine with this being fixed in a separate PR
Done. Created an issue on this: #1917
libos/src/bookkeep/libos_vma.c
line 1548 at r4 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
I'd propose a somewhat different handling of this.
I think we should commit to the model of "logical" split within a single VMA: some prefix of the VMA is actually active (and has valid pages), while the rest is unmapped (and returns
SIGBUS
). This is, in fact, what our code already expects in some places — seememfault_upcall
inlibos_signal.c
, for one.What I would propose is:
- have a
valid_end
(exact name TBD) field added tolibos_vma
andlibos_vma_info
, which is the boundary between valid area andSIGBUS
-producing area- have a function in
libos_vma.c
that updates that field and callsPalVirtualMemoryProtect
as appropriate to enforce it- check this field in
is_in_adjacent_user_vmas
so that syscalls return-EFAULT
as appropriateThis should ensure correct semantics without too much suffering with
split_vma
.
I like the proposal from @mwkmwkmwk. @mkow @kailun-qin What do you think?
If there is no hard no to this proposal, I'll cook a PR for this. That new PR will become a blocker for this PR (which I'll need to rebase on top of that one).
libos/src/bookkeep/libos_vma.c
line 1608 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Seems overly verbose for a trivial thing to me, but won't block on this.
Let's keep it like this unless some other reviewer asks to modify this to MIN()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)
libos/src/bookkeep/libos_vma.c
line 1548 at r4 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
I like the proposal from @mwkmwkmwk. @mkow @kailun-qin What do you think?
If there is no hard no to this proposal, I'll cook a PR for this. That new PR will become a blocker for this PR (which I'll need to rebase on top of that one).
+1 from my side.
libos/src/bookkeep/libos_vma.c
line 1608 at r4 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Let's keep it like this unless some other reviewer asks to modify this to
MIN()
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)
libos/src/bookkeep/libos_vma.c
line 1548 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
+1 from my side.
+1, sounds great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 18 of 31 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @mkow, and @mwkmwkmwk)
libos/src/bookkeep/libos_vma.c
line 1548 at r4 (raw file):
Previously, kailun-qin (Kailun Qin) wrote…
+1, sounds great
Done. Added as a fixup commit to this PR, because the two changes (the one in this PR and the one @mwkmwkmwk suggested) are too tied together. Hope it's fine (I tried but didn't find a good way to create a prerequisite PR with this valid_end
/valid_length
). But now the code is really much easier to follow, IMHO.
libos/src/bookkeep/libos_vma.c
line 1599 at r4 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Ok, reading this won't be needed after implementing @mwkmwkmwk's suggestion from the discussion above (we'll be able to get this info from the vma).
Done.
libos/src/bookkeep/libos_vma.c
line 1374 at r5 (raw file):
} static bool vma_filter_needs_msync(struct libos_vma* vma, void* arg) {
FYI: I moved vma_filter_needs_msync()
helper function closer to its usage. This confuses Reviewable a bit, so if requested, I can move this function back (and do the code-move in a separate PR later).
Jenkins, retest Jenkins-SGX-EDMM please |
Jenkins, retest Jenkins-SGX-EDMM please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 13 of 13 files at r6, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)
libos/include/libos_vma.h
line 103 at r6 (raw file):
Looks up VMA that contains
addr
and if found, updatesvma->valid_length
.
That's a bit weird semantics IMO. It allows addr
to point anywhere within the vma, so it's not intuitive whether the length is the length from addr
or from the beginning of the vma.
Maybe force addr
to always point to the beginning of the mapping?
libos/src/libos_rtld.c
line 274 at r6 (raw file):
return ret; } if (valid_size != map_size) {
Why is this correct? How does this work with large .bss sections? (see the if
below)
libos/src/bookkeep/libos_signal.c
line 365 at r6 (raw file):
/* * If the mapping exceeds end of a file then return a SIGBUS. Actually, the host * environment would most probably raise SIGBUS anyway, but Gramine PAL API fuses both
Not on the SGX PAL? There it's just a SIGSEGV
there? (and it's SGX which is our "most probable" setup)
libos/src/bookkeep/libos_vma.c
line 176 at r6 (raw file):
* it's called with the VMA lock held. * * Returns whether the traversed range was continuously covered by VMAs. This is useful for
This needs an update, a region can now be covered by VMAs but the function can return false if some end != valid_end
.
Code quote:
Returns whether the traversed range was continuously covered by VMAs.
libos/src/bookkeep/libos_vma.c
line 194 at r6 (raw file):
assert(vma->begin <= begin); struct libos_vma* prev = NULL; bool is_continuous = true;
Why is this always true? (this plus the assert above)
I don't see why is that.
libos/src/bookkeep/libos_vma.c
line 832 at r6 (raw file):
new_vma->begin = (uintptr_t)addr; new_vma->end = new_vma->begin + length; new_vma->valid_end = new_vma->begin + length;
Hmm, why ignoring file
here? valid_end
is potentially incorrect now.
Update: I see, it's updated after the real mmap? Could you add a comment here explaining that it's temporarily wrong and should be updated after the real mmap call?
libos/src/bookkeep/libos_vma.c
line 1126 at r6 (raw file):
out_found: new_vma->end = max_addr; new_vma->valid_end = max_addr;
ditto
libos/src/bookkeep/libos_vma.c
line 1418 at r6 (raw file):
spinlock_lock(&vma_tree_lock); bool is_continuous = _traverse_vmas_in_range(begin, end, madvise_dontneed_visitor, &ctx);
This has different semantics now regarding -ENOMEM
on the region after the file contents. Which one is the one Linux follows?
libos/src/bookkeep/libos_vma.c
line 1537 at r6 (raw file):
}; static bool vma_filter_needs_prot_refresh(struct libos_vma* vma, void* _args) {
This isn't just a "filter" anymore, it updates the traversed VMAs.
Code quote:
vma_filter
libos/src/bookkeep/libos_vma.c
line 1553 at r6 (raw file):
return false; /* by default, assume that file got smaller than the offset from which VMA is mapped, all VMA is
What do you mean by "by default assume [XYZ]"? This code has to handle all possible cases? What is a "default" in this context?
libos/src/bookkeep/libos_vma.c
line 1734 at r6 (raw file):
* * Access beyond the last file-backed page (reflected via vma->valid_length) will * cause SIGBUS. For reducing fork latency, we send only those memory contents of
Hmm, I think this sentence doesn't make sense now? The rest of the region is inaccessible anyways, there's nothing to send from it.
Code quote:
For reducing fork latency
libos/src/sys/libos_mmap.c
line 198 at r6 (raw file):
uintptr_t begin = MAX((uintptr_t)addr, (uintptr_t)vma->addr); uintptr_t end = MIN((uintptr_t)vma->addr + vma->length, (uintptr_t)addr + length);
Accidental change?
libos/src/sys/libos_mmap.c
line 265 at r6 (raw file):
if (update_valid_length_ret < 0) { log_error("[mmap] Failed to update valid length to %lu of bookkeeped memory %p-%p!", valid_length, addr, (char*)addr + length);
wrong alignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @mkow, and @mwkmwkmwk)
libos/include/libos_vma.h
line 103 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Looks up VMA that contains
addr
and if found, updatesvma->valid_length
.That's a bit weird semantics IMO. It allows
addr
to point anywhere within the vma, so it's not intuitive whether the length is the length fromaddr
or from the beginning of the vma.
Maybe forceaddr
to always point to the beginning of the mapping?
Done.
libos/src/libos_rtld.c
line 274 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Why is this correct? How does this work with large .bss sections? (see the
if
below)
This is correct.
map_size = c->map_end - c->start
(see several lines above). Here, c->start
is the (page-aligned) offset at which the allocation of the segment should start. And c->map_end
is the (page-aligned) end offset at which the allocation of the segment should stop, see
gramine/libos/src/libos_rtld.c
Lines 86 to 115 in e95a3a1
struct loadcmd { | |
/* | |
* Load command for a single segment. The following properties are true: | |
* | |
* - start <= data_end <= map_end <= alloc_end | |
* - start, map_end, alloc_end are page-aligned | |
* - map_off is page-aligned | |
* | |
* The addresses are not relocated (i.e. you need to add l_base_diff to them). | |
*/ | |
/* Start of memory area */ | |
elf_addr_t start; | |
/* End of file data (data_end .. alloc_end should be zeroed out) */ | |
elf_addr_t data_end; | |
/* End of mapped file data (data_end rounded up to page size, so that we can mmap | |
* start .. map_end) */ | |
elf_addr_t map_end; | |
/* End of memory area */ | |
elf_addr_t alloc_end; | |
/* Offset from the beginning of file at which the first byte of the segment resides */ | |
elf_off_t map_off; | |
/* Permissions for memory area */ | |
int prot; | |
}; |
As for BSS sections, they end up in the memory region defined by [c->map_end, c->alloc_end)
. See
gramine/libos/src/libos_rtld.c
Lines 298 to 315 in e95a3a1
/* Allocate extra pages after the mapped area. */ | |
if (c->map_end < c->alloc_end) { | |
void* zero_page_start = (void*)(c->map_end + base_diff); | |
size_t zero_page_size = c->alloc_end - c->map_end; | |
int zero_map_flags = MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS; | |
pal_prot_flags_t zero_pal_prot = LINUX_PROT_TO_PAL(c->prot, zero_map_flags); | |
if ((ret = bkeep_mmap_fixed(zero_page_start, zero_page_size, c->prot, zero_map_flags, | |
/*file=*/NULL, /*offset=*/0, /*comment=*/NULL)) < 0) { | |
log_debug("cannot bookkeep address of zero-fill pages"); | |
return ret; | |
} | |
if ((ret = PalVirtualMemoryAlloc(zero_page_start, zero_page_size, zero_pal_prot)) < 0) { | |
log_debug("cannot map zero-fill pages"); | |
return pal_to_unix_errno(ret); | |
} | |
} |
So I could relax this check to if (valid_size > map_size) {fail}
. But for correctly-formatted ELF files, there's no need to relax. And I don't think we want to care about wrongly formatted ELFs.
libos/src/bookkeep/libos_signal.c
line 365 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Not on the SGX PAL? There it's just a
SIGSEGV
there? (and it's SGX which is our "most probable" setup)
You're right. But it's not only about SGX PAL, the same happens in direct PAL. Because we emulate this "mapping exceeds end of a file" (which in Linux is a SIGBUS condition) with the "exceeds-end-of-file part of mapping has NONE permissions" (which is a SIGSEGV condition).
So my comment is plain wrong. I guess I'll just revert to the previous comment, short and precise.
libos/src/bookkeep/libos_vma.c
line 176 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
This needs an update, a region can now be covered by VMAs but the function can return false if some
end != valid_end
.
Done.
libos/src/bookkeep/libos_vma.c
line 194 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Why is this always true? (this plus the assert above)
I don't see why is that.
_lookup_vma(begin)
is guaranteed to return a VMA which address is lower than/equal to the begin
argument.
This looked up VMA may be completely below than the [begin, end)
region, but this bad case is covered by the IF check above.
So the only result at this point in the code is that we have a VMA which contains the begin
address (the VMA "covers" it). Thus assert()
is trivially true.
So I changed this line to a more obvious assert()
+ true
. I can revert back -- there is no difference in functionality -- but I think my version shows a clearer intent of this code.
libos/src/bookkeep/libos_vma.c
line 832 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Hmm, why ignoring
file
here?valid_end
is potentially incorrect now.Update: I see, it's updated after the real mmap? Could you add a comment here explaining that it's temporarily wrong and should be updated after the real mmap call?
Done.
libos/src/bookkeep/libos_vma.c
line 1126 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto
Done.
libos/src/bookkeep/libos_vma.c
line 1418 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
This has different semantics now regarding
-ENOMEM
on the region after the file contents. Which one is the one Linux follows?
TODO:
- Add a new LibOS test that tries to access a SIGBUS part of file-backed VMA and check
- what happens on e.g.
write(<all-vma>)
- what happens on e.g.
write(<no-sigbus-part-of-memory>)
- what happens on e.g.
write(<only-sigbus-part-of-memory>)
- what happens on
madvise(MADV_DONTNEED, <all-vma>)
- what happens on
madvise(MADV_DONTNEED, <no-sigbus-part-of-memory>)
- what happens on
madvise(MADV_DONTNEED, <only-sigbus-part-of-memory>)
libos/src/bookkeep/libos_vma.c
line 1537 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
This isn't just a "filter" anymore, it updates the traversed VMAs.
Done.
Wasn't sure how to best rename, so just removed the _filter_
part. But it still doesn't reflect that the vma->valid_end
is modified here. Any suggestions are welcome.
libos/src/bookkeep/libos_vma.c
line 1553 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
What do you mean by "by default assume [XYZ]"? This code has to handle all possible cases? What is a "default" in this context?
This "by default assume" is meant purely to explain the valid_length = 0;
initialization. I added an empty line to make it clear that this comment is for the next line (var init) only.
Previously I had a cascade of IFs to make it even more explicit. But it looked ugly, so I did this "init variable to some case and get rid of one level of IFs".
libos/src/bookkeep/libos_vma.c
line 1734 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Hmm, I think this sentence doesn't make sense now? The rest of the region is inaccessible anyways, there's nothing to send from it.
Done, I removed the "reduce fork latency" phrase. The rest of the sentence/paragraph still holds, though I guess it's obvious. I can remove the rest of the paragraph, if you insist.
libos/src/sys/libos_mmap.c
line 198 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Accidental change?
Done.
libos/src/sys/libos_mmap.c
line 265 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
wrong alignment
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimakuv: Didn't you forget to push?
Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)
1de768e
to
809d603
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mkow That was on purpose. I wanted to push, but realized that I need more code. Now I pushed.
Reviewable status: 26 of 36 files reviewed, 17 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @mkow, and @mwkmwkmwk)
-- commits
line 41 at r7:
Will do: add new LibOS test mmap_file_sigbus
.
libos/src/bookkeep/libos_vma.c
line 176 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Done.
Done again: I reverted the changes because apparently that's how Linux works: Linux doesn't treat those SIGBUS-inducing parts of file-backed VMAs as non-contiguous regions. So the function's code doesn't need to be changed. See also my newly added LibOS test.
libos/src/bookkeep/libos_vma.c
line 194 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
_lookup_vma(begin)
is guaranteed to return a VMA which address is lower than/equal to thebegin
argument.This looked up VMA may be completely below than the
[begin, end)
region, but this bad case is covered by the IF check above.So the only result at this point in the code is that we have a VMA which contains the
begin
address (the VMA "covers" it). Thusassert()
is trivially true.So I changed this line to a more obvious
assert()
+true
. I can revert back -- there is no difference in functionality -- but I think my version shows a clearer intent of this code.
Done, this is irrelevant now since I reverted this whole function.
libos/src/bookkeep/libos_vma.c
line 1418 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
TODO:
- Add a new LibOS test that tries to access a SIGBUS part of file-backed VMA and check
- what happens on e.g.
write(<all-vma>)
- what happens on e.g.
write(<no-sigbus-part-of-memory>)
- what happens on e.g.
write(<only-sigbus-part-of-memory>)
- what happens on
madvise(MADV_DONTNEED, <all-vma>)
- what happens on
madvise(MADV_DONTNEED, <no-sigbus-part-of-memory>)
- what happens on
madvise(MADV_DONTNEED, <only-sigbus-part-of-memory>)
Done, see new test mmap_file_sigbus.c
libos/test/regression/mmap_file.c
line 4 at r7 (raw file):
* TODO: - Hans, get ze flammenwerfer... * - Now we have mmap_file_sigbus.c which largely supersedes this test, so can burn this one * with ze flammenwerfer
FYI: Currently kept this test, because more tests is better? Not sure if we want to burn this test, or keep it afloat for some time (until we're sure other tests completely cover this test).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 26 of 36 files reviewed, 18 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
libos/test/regression/mmap_file_sigbus.c
line 12 at r10 (raw file):
* * - Behavior when using the mmapped regions as a buffer to a syscall * - specifying the whole 2-page region succeeds (writes 1 page) (yes, that's how Linux works)
...and that's what I don't know how to support in Gramine, see inline comments in that sub-test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 26 of 36 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
libos/src/bookkeep/libos_vma.c
line 184 at r10 (raw file):
* `write(partially-valid-vma)` Linux does not return -EFAULT but instead uses the buffer until * the first invalid address. This behavior is too cumbersome to implement in Gramine + SGX, * thus on `write(partially-valid-vma)` Gramine immediately returns -EFAULT.
Enjoy: https://yarchive.net/comp/linux/partial_reads_writes.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 10 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: all files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)
libos/src/libos_rtld.c
line 274 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This is correct.
map_size = c->map_end - c->start
(see several lines above). Here,c->start
is the (page-aligned) offset at which the allocation of the segment should start. Andc->map_end
is the (page-aligned) end offset at which the allocation of the segment should stop, seegramine/libos/src/libos_rtld.c
Lines 86 to 115 in e95a3a1
struct loadcmd { /* * Load command for a single segment. The following properties are true: * * - start <= data_end <= map_end <= alloc_end * - start, map_end, alloc_end are page-aligned * - map_off is page-aligned * * The addresses are not relocated (i.e. you need to add l_base_diff to them). */ /* Start of memory area */ elf_addr_t start; /* End of file data (data_end .. alloc_end should be zeroed out) */ elf_addr_t data_end; /* End of mapped file data (data_end rounded up to page size, so that we can mmap * start .. map_end) */ elf_addr_t map_end; /* End of memory area */ elf_addr_t alloc_end; /* Offset from the beginning of file at which the first byte of the segment resides */ elf_off_t map_off; /* Permissions for memory area */ int prot; }; As for BSS sections, they end up in the memory region defined by
[c->map_end, c->alloc_end)
. Seegramine/libos/src/libos_rtld.c
Lines 298 to 315 in e95a3a1
/* Allocate extra pages after the mapped area. */ if (c->map_end < c->alloc_end) { void* zero_page_start = (void*)(c->map_end + base_diff); size_t zero_page_size = c->alloc_end - c->map_end; int zero_map_flags = MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS; pal_prot_flags_t zero_pal_prot = LINUX_PROT_TO_PAL(c->prot, zero_map_flags); if ((ret = bkeep_mmap_fixed(zero_page_start, zero_page_size, c->prot, zero_map_flags, /*file=*/NULL, /*offset=*/0, /*comment=*/NULL)) < 0) { log_debug("cannot bookkeep address of zero-fill pages"); return ret; } if ((ret = PalVirtualMemoryAlloc(zero_page_start, zero_page_size, zero_pal_prot)) < 0) { log_debug("cannot map zero-fill pages"); return pal_to_unix_errno(ret); } }
So I could relax this check to
if (valid_size > map_size) {fail}
. But for correctly-formatted ELF files, there's no need to relax. And I don't think we want to care about wrongly formatted ELFs.
Oh, wait, there's a separate alloc_end
, I missed that, sorry. It's ok as it is then.
libos/src/bookkeep/libos_vma.c
line 194 at r6 (raw file):
_lookup_vma(begin)
is guaranteed to return a VMA which address is lower than/equal to thebegin
argument.
Nope, the other way around:
/* Returns the vma that contains `addr`. If there is no such vma, returns the closest vma with
* higher address. */
static struct libos_vma* _lookup_vma(uintptr_t addr) {
libos/src/bookkeep/libos_vma.c
line 1537 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Done.
Wasn't sure how to best rename, so just removed the
_filter_
part. But it still doesn't reflect that thevma->valid_end
is modified here. Any suggestions are welcome.
Maybe vma_update_valid_ends
? And just describe the return value in the comment?
libos/src/bookkeep/libos_vma.c
line 1553 at r6 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
This "by default assume" is meant purely to explain the
valid_length = 0;
initialization. I added an empty line to make it clear that this comment is for the next line (var init) only.Previously I had a cascade of IFs to make it even more explicit. But it looked ugly, so I did this "init variable to some case and get rid of one level of IFs".
Maybe put it into else
of the top-level if
then? Then the comment would be less confusing :)
libos/src/bookkeep/libos_vma.c
line 179 at r8 (raw file):
* emulating errors in memory management syscalls. */ // TODO: Probably other VMA functions could make use of this helper.
Why removing this? Aren't there any functions which could use it?
libos/src/bookkeep/libos_vma.c
line 1754 at r10 (raw file):
* Send file-backed memory region. * * Access beyond the last file-backed page (reflected via vma->valid_length) will
we're emulating this behavior here, it wouldn't actually happen if you tried it
Suggestion:
should
libos/test/regression/mmap_file.c
line 4 at r7 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
FYI: Currently kept this test, because more tests is better? Not sure if we want to burn this test, or keep it afloat for some time (until we're sure other tests completely cover this test).
What is missing from the other tests? More tests isn't always better, they take CI time, need maintenance and if they are of poor quality then they cause false bug reports. So, if they aren't testing anything and contain UBs like this one, then I'd rather get rid of them.
libos/test/regression/mmap_file_sigbus.c
line 20 at r10 (raw file):
* - madvise(MADV_DONTNEED) on the second 1-page region succeeds (yes, that's how Linux works) * - accessing the first 1-page region succeeds * - accessing the second 1-page region results in SIGBUS
by "accessing" you also mean madvise? or accessing from usermode after madvise was called on the region?
libos/test/regression/mmap_file_sigbus.c
line 36 at r10 (raw file):
#include <sys/types.h> #include <ucontext.h> #include <unistd.h>
Please sort.
libos/test/regression/mmap_file_sigbus.c
line 40 at r10 (raw file):
#include "common.h" #define TEST_DIR "tmp"
Suggestion:
TEST_DIR "tmp"
libos/test/regression/mmap_file_sigbus.c
line 59 at r10 (raw file):
"ret\n" "ret:\n" "ret\n"
Suggestion:
"ret:\n"
"ret\n"
libos/test/regression/mmap_file_sigbus.c
line 113 at r10 (raw file):
x = mem_read(addr_page1); if (x == 0xdeadbeef) errx(1, "unexpected value in successful read from file-backed mmap region: %lx", x);
But this value means that the read wasn't successful?
Code quote:
in successful read
libos/test/regression/mmap_file_sigbus.c
line 116 at r10 (raw file):
x = mem_read(addr_page2); if (x != 0xdeadbeef) errx(1, "unexpected value in failing read from file-backed mmap region: %lx", x);
ditto
libos/test/regression/mmap_file_sigbus.c
line 117 at r10 (raw file):
if (x != 0xdeadbeef) errx(1, "unexpected value in failing read from file-backed mmap region: %lx", x); if (g_sigbus_triggered != 1)
Please check whether it's 0 after the first call.
libos/test/regression/mmap_file_sigbus.c
line 139 at r10 (raw file):
if (ret != (ssize_t)page_size) errx(1, "write(2 pages): expected 1-page write, got ret=%ld, errno=%d", ret, errno); #endif
I think that uncommenting this #ifdef
won't produce a working test, because a seek is missing? Or maybe #else
?
libos/test/regression/mmap_file_sigbus.c
line 161 at r10 (raw file):
x = mem_read(addr_page1); if (x == 0xdeadbeef) errx(1, "unexpected value in successful read after madvise(MADV_DONTNEED): %lx", x);
ditto x2
libos/test/regression/mmap_file_sigbus.c
line 165 at r10 (raw file):
if (x != 0xdeadbeef) errx(1, "unexpected value in failing read after madvise(MADV_DONTNEED): %lx", x); if (g_sigbus_triggered != 1)
Please check whether it's 0 after the first call.
libos/test/regression/test_libos.py
line 808 at r10 (raw file):
f.truncate(os.sysconf("SC_PAGE_SIZE")) write_path = 'tmp/__mmaptestfilewrite__'
Could you pass the filenames as arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 20 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
libos/src/bookkeep/libos_vma.c
line 194 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
_lookup_vma(begin)
is guaranteed to return a VMA which address is lower than/equal to thebegin
argument.Nope, the other way around:
/* Returns the vma that contains `addr`. If there is no such vma, returns the closest vma with * higher address. */ static struct libos_vma* _lookup_vma(uintptr_t addr) {
Wow, my bad, I read the comment incorrectly and though that _lookup_vma()
returns closes vma with a lower address. But it's actually the higher address. Then my change was wrong, but surprisingly it still worked on our tests...
Anyway, I reverted all that, so not relevant anymore.
libos/src/bookkeep/libos_vma.c
line 1537 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Maybe
vma_update_valid_ends
? And just describe the return value in the comment?
Done.
libos/src/bookkeep/libos_vma.c
line 1553 at r6 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Maybe put it into
else
of the top-levelif
then? Then the comment would be less confusing :)
Done.
libos/src/bookkeep/libos_vma.c
line 179 at r8 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Why removing this? Aren't there any functions which could use it?
Done. I reverted this. I quickly looked through this file and didn't find any functions that would clearly benefit from this func. But on the other hand, I didn't do any thorough analysis (maybe if some func is refactored to be compatible with using this func, then it would work). So let's keep the comment until someone does a proper deep analysis.
libos/src/bookkeep/libos_vma.c
line 184 at r10 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Enjoy: https://yarchive.net/comp/linux/partial_reads_writes.html
Such a good link that I also added it elsewhere (in the LibOS regression test)
libos/src/bookkeep/libos_vma.c
line 1754 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
we're emulating this behavior here, it wouldn't actually happen if you tried it
Done.
libos/test/regression/mmap_file.c
line 4 at r7 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
What is missing from the other tests? More tests isn't always better, they take CI time, need maintenance and if they are of poor quality then they cause false bug reports. So, if they aren't testing anything and contain UBs like this one, then I'd rather get rid of them.
Done.
The only part that we could miss a bit was the in-child test. So I modified my new mmap_file_sigbus
test to also check this. This old and full of UBs test is removed.
libos/test/regression/mmap_file_sigbus.c
line 20 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
by "accessing" you also mean madvise? or accessing from usermode after madvise was called on the region?
Done.
libos/test/regression/mmap_file_sigbus.c
line 36 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Please sort.
Done. Also removed some unneeded headers.
libos/test/regression/mmap_file_sigbus.c
line 59 at r10 (raw file):
"ret\n" "ret:\n" "ret\n"
Done.
libos/test/regression/mmap_file_sigbus.c
line 113 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
But this value means that the read wasn't successful?
Done. My bad phrasing, but the meaning was correct: "Read that was supposed to be successful unexpectedly returned a value that is reserved for invalid accesses". I rephrased.
libos/test/regression/mmap_file_sigbus.c
line 116 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto
Done.
libos/test/regression/mmap_file_sigbus.c
line 117 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Please check whether it's 0 after the first call.
Done.
libos/test/regression/mmap_file_sigbus.c
line 139 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
I think that uncommenting this
#ifdef
won't produce a working test, because a seek is missing? Or maybe#else
?
Not sure I understand your comment. We are writing to write_fd
, which is a separate file: created plus truncated by the open()
couple lines above.
The contents of this file are irrelevant, it's just a temporary destination for writing. We only use it as a way to use addr_page1
and addr_page2
(backed by another read-only file) as source buffers.
We don't care if we always append to write_fd
or we perform lseek to rewind. This file will be removed afterwards anyway (see unlink()
below).
libos/test/regression/mmap_file_sigbus.c
line 161 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
ditto x2
Done.
libos/test/regression/mmap_file_sigbus.c
line 165 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Please check whether it's 0 after the first call.
Done.
libos/test/regression/test_libos.py
line 808 at r10 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Could you pass the filenames as arguments?
Done. Good idea.
0595e70
to
99e3003
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 20 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
libos/test/regression/mmap_file_sigbus.c
line 40 at r10 (raw file):
#include "common.h" #define TEST_DIR "tmp"
Done. Not relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 19 files at r1, 6 of 15 files at r4, 9 of 13 files at r6, 3 of 10 files at r9, 1 of 8 files at r11.
Reviewable status: 29 of 36 files reviewed, 21 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
libos/src/fs/tmpfs/fs.c
line 278 at r11 (raw file):
if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) { /* There are mappings for the file, refresh their access protections. */
this code sequence (and the one in truncate
) comes up a lot; do you think we could deduplicate them?
pal/src/pal_rtld.c
line 558 at r11 (raw file):
goto out; } ret = _PalStreamRead(handle, c->map_off, map_size, map_addr);
this should verify the whole thing was actually read
99e3003
to
7c4ce91
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 21 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
libos/src/fs/tmpfs/fs.c
line 278 at r11 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
this code sequence (and the one in
truncate
) comes up a lot; do you think we could deduplicate them?
Good idea, but can we do it in a follow-up PR or at least when all other comments in this PR will be resolved? This change will move quite a lot of code around, which will complicate the review process.
pal/src/pal_rtld.c
line 558 at r11 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
this should verify the whole thing was actually read
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
libos/src/fs/tmpfs/fs.c
line 278 at r11 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Good idea, but can we do it in a follow-up PR or at least when all other comments in this PR will be resolved? This change will move quite a lot of code around, which will complicate the review process.
so be it
pal/src/pal_rtld.c
line 105 at r13 (raw file):
size_t got = 0; while (got < count) { int64_t ret = _PalStreamRead(handle, offset, count - got, (char*)buf + got);
should be offset + got
pal/src/pal_rtld.c
line 594 at r13 (raw file):
continue; ret = _PalVirtualMemoryAlloc((void*)c->map_end, c->alloc_end - c->map_end, c->prot);
this is no longer needed with this approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
libos/src/fs/tmpfs/fs.c
line 278 at r11 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
so be it
#1931 -- not to forget. Let me put a blocking comment so I'll do it after e.g. rebase.
pal/src/pal_rtld.c
line 105 at r13 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
should be
offset + got
Done. Oops, good catch.
pal/src/pal_rtld.c
line 594 at r13 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
this is no longer needed with this approach
Why are you saying this? That's for the BSS part.
Above we mmapped only [c->start, c->map_end)
(for data and code). Now we also mmap [c->map_end, c->alloc_end)
which is used for BSS.
We could refactor this code to only have one _PalVirtualMemoryAlloc()
, but then we'll lose this clarify of data+code vs BSS. Additionally, we'll need to play with page protections more...
( Same question was raised in another PR: https://reviewable.io/reviews/gramineproject/gramine/1820#-O0P3jrN6viIx-pY8-sV )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 23 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
pal/src/pal_rtld.c
line 574 at r13 (raw file):
goto out; } ret = pal_stream_read_exact(handle, c->map_off, map_size, map_addr);
Just realized that this conflicts with #1820. I think I should first merge #1820 and then get rid of this pal_stream_read_exact()
, it won't be needed (we'll use memcpy()
instead).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)
pal/src/pal_rtld.c
line 574 at r13 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Just realized that this conflicts with #1820. I think I should first merge #1820 and then get rid of this
pal_stream_read_exact()
, it won't be needed (we'll usememcpy()
instead).
sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)
pal/src/pal_rtld.c
line 594 at r13 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Why are you saying this? That's for the BSS part.
Above we mmapped only
[c->start, c->map_end)
(for data and code). Now we also mmap[c->map_end, c->alloc_end)
which is used for BSS.We could refactor this code to only have one
_PalVirtualMemoryAlloc()
, but then we'll lose this clarify of data+code vs BSS. Additionally, we'll need to play with page protections more...( Same question was raised in another PR: https://reviewable.io/reviews/gramineproject/gramine/1820#-O0P3jrN6viIx-pY8-sV )
Ok, this is resolved in another PR. This issue will go away when we rebase this PR on top of #1820.
a0c58ef
to
4f150ac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 22 of 36 files reviewed, 18 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv, @mkow, and @mwkmwkmwk)
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
TODO:
file-growing `ftruncate()` on
->
file-growing `write()` and `ftruncate()` on
Done
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Will do: add new LibOS test
mmap_file_sigbus
.
Done
pal/src/pal_rtld.c
line 574 at r13 (raw file):
Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…
sounds good
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 8 files at r11, 1 of 2 files at r13, 12 of 12 files at r15, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
libos/src/fs/tmpfs/fs.c
line 278 at r11 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
#1931 -- not to forget. Let me put a blocking comment so I'll do it after e.g. rebase.
@dimakuv: That's the last blocking discussion here, right? Do you plan to create the PR now, or somewhen later, after merging?
libos/test/regression/mmap_file_sigbus.c
line 139 at r10 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Not sure I understand your comment. We are writing to
write_fd
, which is a separate file: created plus truncated by theopen()
couple lines above.The contents of this file are irrelevant, it's just a temporary destination for writing. We only use it as a way to use
addr_page1
andaddr_page2
(backed by another read-only file) as source buffers.We don't care if we always append to
write_fd
or we perform lseek to rewind. This file will be removed afterwards anyway (seeunlink()
below).
Ah, sorry, I mixed up reading and writing and somehow looked at this transfer in the opposite direction. Everything looks good.
libos/test/regression/mmap_file_sigbus.c
line 158 at r15 (raw file):
if (argc != 4) errx(1, "Usage: %s <path1: read-only file> <path2:write-only file> <fork?>", argv[0]);
One time you put a space after the colon and the other time you didn't.
libos/test/regression/mmap_file_sigbus.c
line 158 at r15 (raw file):
if (argc != 4) errx(1, "Usage: %s <path1: read-only file> <path2:write-only file> <fork?>", argv[0]);
Suggestion:
<fork|nofork>
libos/test/regression/mmap_file_sigbus.c
line 162 at r15 (raw file):
const char* path1 = argv[1]; const char* path2 = argv[2]; bool do_fork = strcmp(argv[3], "fork") == 0;
Please don't accept random garbage here, it's too easy to introduce unnoticed bugs this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 19 files at r1, 2 of 10 files at r3, 1 of 8 files at r11, 12 of 12 files at r15.
Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 35 of 36 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)
libos/src/fs/tmpfs/fs.c
line 278 at r11 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
@dimakuv: That's the last blocking discussion here, right? Do you plan to create the PR now, or somewhen later, after merging?
Done #1956
libos/test/regression/mmap_file_sigbus.c
line 158 at r15 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
One time you put a space after the colon and the other time you didn't.
Done.
libos/test/regression/mmap_file_sigbus.c
line 158 at r15 (raw file):
if (argc != 4) errx(1, "Usage: %s <path1: read-only file> <path2:write-only file> <fork?>", argv[0]);
Done.
libos/test/regression/mmap_file_sigbus.c
line 162 at r15 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Please don't accept random garbage here, it's too easy to introduce unnoticed bugs this way.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners
68ac11a
to
2b40ea7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved
-- commits
line 46 at r17:
@dimakuv: FYI: I changed that because the old version was ambiguous (it could also mean that "there are currently just a few FS tests in general", not that you added/enabled some in this PR).
Code quote:
A few more FS tests are enabled on SGX now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 19 files at r1, 2 of 10 files at r3, 6 of 15 files at r4, 5 of 13 files at r6, 3 of 10 files at r9, 2 of 8 files at r11, 12 of 12 files at r15, 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @dimakuv)
-- commits
line 17 at r17:
But it's PalDeviceMap()
?
Code quote:
Only `shm` FS still uses `PalStreamMap()`
Previously, the `chroot` FS (plain host-backed files) used the `PalStreamMap()` PAL API for file-backed mmap. This had three problems: 1) need to implement a non-trivial `map` callback in PALs; 2) discrepancy between `map` implementations in different PALs; 3) hard to debug file-mmap bugs because they only reproduced on SGX (`gramine-sgx`) PAL and not on Linux (`gramine-direct`) PAL. Note that other FSes already used emulated file-backed mmap: `tmpfs` and `encrypted` FSes emulate such mmaps via `PalStreamRead()` and `PalStreamWrite()`. This commit switches `chroot` FS to use emulated file-backed mmap. This way, `chroot` becomes similar in implementation to `tmpfs` and `encrypted` FSes. Only `shm` FS still uses `PalDeviceMap()` (previously called `PalStreamMap()`) because devices with shared memory have non-standard semantics of mmaps. Corresponding `file_map()` functions in PAL are removed. In this commit, we also introduce the model of "logical" split within a single VMA: some prefix of the VMA is accessible (has valid pages), while the rest is unmapped (returns SIGBUS). Only file-backed VMAs are split in this way (anonymous-memory VMAs can't be in "unmapped" state). This logical split is achieved via a new `vma->valid_length` field. The switch to emulated mmap uncovered several bugs: - Underlying file may be shorter than the requested mmap size. In this case access beyond the last file-backed page must cause SIGBUS. Previously this semantics worked only on `gramine-direct` and wasn't implemented on `gramine-sgx` (even with EDMM). - As a consequence of the semantics above, file-growing `write()` and `ftruncate()` on already-mmapped file must make newly extended file contents accessible. Previously it didn't work on `gramine-sgx` (with EDMM), now it is resolved via `prot_refresh_mmaped_from_file_handle()` call. - `msync()` must update file contents with the mmapped-in-process contents, but only those parts that do not exceed the file size. Previously there was a bug that msync'ed even the exceeding parts. - Applications expect `msync(MS_ASYNC)` to update file contents before the next app access to the file. Gramine instead ignored such requests, leading to accessing stale contents. We fix this bug by treating `MS_ASYNC` the same way as `MS_SYNC`. This bug was detected on LTP test `msync01`. A few more FS tests are enabled on SGX now. Generally, `gramine-sgx` now supports shared file-backed mappings, i.e. `mmap(MAP_SHARED)`. New LibOS test `mmap_file_sigbus` is added; old bad `mmap_file` test is removed. Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
2b40ea7
to
4a5c4a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @kailun-qin)
Previously, kailun-qin (Kailun Qin) wrote…
But it's
PalDeviceMap()
?
Done. Good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved
Description of the changes
Previously, the
chroot
FS (plain host-backed files) used thePalStreamMap()
PAL API for file-backed mmap. This had three problems: 1) need to implement a non-trivialmap
callback in PALs; 2) discrepancy betweenmap
implementations in different PALs; 3) hard to debug file-mmap bugs because they only reproduced on SGX (gramine-sgx
) PAL and not on Linux (gramine-direct
) PAL.Note that other FSes already used emulated file-backed mmap:
tmpfs
andencrypted
emulate such mmaps viaPalStreamRead()
andPalStreamWrite()
.This PR switches
chroot
FS to use emulated file-backed mmap. This way,chroot
becomes similar in implementation totmpfs
andencrypted
FSes. Onlyshm
FS still usesPalStreamMap()
because devices with shared memory have non-standard semantics of mmaps. Correspondingfile_map()
functions in PAL are removed.This switch to emulated mmap uncovered several bugs:
gramine-direct
and wasn't implemented ongramine-sgx
(with EDMM).ftruncate()
on already-mmapped file must make newly extended file contents accessible. Previously it didn't work ongramine-sgx
(with EDMM), now it is resolved viaprot_refresh_mmaped_from_file_handle()
call.msync()
must update file contents with the mmapped-in-process contents, but only those parts that do not exceed the file size. Previously there was a bug that msync'ed even the exceeding parts.msync(MS_ASYNC)
to update file contents before the next app access to the file. Gramine instead ignored such requests, leading to accessing stale contents. We fix this bug by treatingMS_ASYNC
the same way asMS_SYNC
. This bug was detected on LTP testmsync01
.Several FS tests are enabled on SGX now. Generally,
gramine-sgx
now supports shared file-backed mappings, i.e.mmap(MAP_SHARED)
.Extracted from #1812.
How to test this PR?
Several new tests are enabled. Maybe some new LTP tests can be enabled (didn't check).
This change is