Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LibOS,PAL] Emulate file-backed mmap via PAL read/write APIs #1818

Merged
merged 1 commit into from
Jul 29, 2024

Conversation

dimakuv
Copy link

@dimakuv dimakuv commented Mar 14, 2024

Description of the changes

Previously, the chroot FS (plain host-backed files) used the PalStreamMap() PAL API for file-backed mmap. This had three problems: 1) need to implement a non-trivial map callback in PALs; 2) discrepancy between map implementations in different PALs; 3) hard to debug file-mmap bugs because they only reproduced on SGX (gramine-sgx) PAL and not on Linux (gramine-direct) PAL.

Note that other FSes already used emulated file-backed mmap: tmpfs and encrypted emulate such mmaps via PalStreamRead() and PalStreamWrite().

This PR switches chroot FS to use emulated file-backed mmap. This way, chroot becomes similar in implementation to tmpfs and encrypted FSes. Only shm FS still uses PalStreamMap() because devices with shared memory have non-standard semantics of mmaps. Corresponding file_map() functions in PAL are removed.

This switch to emulated mmap uncovered several bugs:

  • Underlying file may be shorter than the requested mmap size. In this case access beyond the last file-backed page must cause SIGBUS. Previously this semantics worked only on gramine-direct and wasn't implemented on gramine-sgx (with EDMM).
  • As a consequence of the semantics above, file-growing ftruncate() on already-mmapped file must make newly extended file contents accessible. Previously it didn't work on gramine-sgx (with EDMM), now it is resolved via prot_refresh_mmaped_from_file_handle() call.
  • msync() must update file contents with the mmapped-in-process contents, but only those parts that do not exceed the file size. Previously there was a bug that msync'ed even the exceeding parts.
  • Applications expect msync(MS_ASYNC) to update file contents before the next app access to the file. Gramine instead ignored such requests, leading to accessing stale contents. We fix this bug by treating MS_ASYNC the same way as MS_SYNC. This bug was detected on LTP test msync01.

Several FS tests are enabled on SGX now. Generally, gramine-sgx now supports shared file-backed mappings, i.e. mmap(MAP_SHARED).

Extracted from #1812.

How to test this PR?

Several new tests are enabled. Maybe some new LTP tests can be enabled (didn't check).


This change is Reviewable

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel)


libos/test/ltp/manifest.template line 20 at r1 (raw file):


  # many LTP multi-process tests rely on shared-memory IPC via `mmap(MAP_SHARED, </dev/shm fd>)`
  { type = "untrusted_shm", path = "/dev/shm", uri = "dev:/dev/shm" },

This is taken from #1718. So I'd say that that PR is the prerequisite of this one.

This is required because LTP uses shared file-backed mmaps without explicit msyncs or munmaps (recall that Gramine requires explicit points like msync() and munmap() to synchronize mappings with file contents).


pal/include/pal/pal.h line 426 at r1 (raw file):

 * `PalStreamRead` and `PalStreamWrite`. Use `PalVirtualMemoryFree` to unmap this mapping.
 */
int PalStreamMap(PAL_HANDLE handle, void* addr, pal_prot_flags_t prot, uint64_t offset,

We may want to rename this to PalDeviceMap()


pal/src/host/linux-sgx/pal_files.c line 391 at r1 (raw file):

    }
    return ret;
}

The biggest win of this PR: removing the super-complicated file_map() callback from Linux-SGX PAL.

@dimakuv
Copy link
Author

dimakuv commented Mar 15, 2024

Jenkins, retest Jenkins-SGX-20.04-musl please. Error is in the signing test:

tests.test_sgx_sign.test_sign_from_pem_path (from pytest)

Error Message
cryptography.exceptions.InvalidSignature

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 21 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv)

a discussion (no related file):
it should be noted that this PR introduces a major regression: writable MAP_SHARED mappings that are actually shared between processes will no longer work correctly



libos/src/bookkeep/libos_vma.c line 1558 at r2 (raw file):

/* This helper function is to refresh access protections on the VMA pages of a given file handle on
 * file-extend operations (currently only `ftruncate`). */
int prot_refresh_mmaped_from_file_handle(struct libos_handle* hdl) {

this iterates over all VMAs we have in memory, not just VMAs belonging to this file; once we fix write to also call this function, this is likely to cause significant performance problems


libos/src/fs/libos_fs_util.c line 255 at r2 (raw file):

}

int generic_truncate(struct libos_handle* hdl, file_off_t size) {

calling the new hook only in truncate is not enough — file length can also be extended by a write (indeed, it is the more common case)

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 21 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mwkmwkmwk)


libos/src/fs/libos_fs_util.c line 255 at r2 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

calling the new hook only in truncate is not enough — file length can also be extended by a write (indeed, it is the more common case)

I agree, but we don't have this semantics even in the current implementation. So I think this PR is strictly better than what we currently have.

Also, looking at our tests and example apps, seems like no app uses such a strange pattern of mmap() + write(...extend...). Or do you have some other scenario in mind?

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 21 files reviewed, 6 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mwkmwkmwk, and @oshogbo)

a discussion (no related file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

it should be noted that this PR introduces a major regression: writable MAP_SHARED mappings that are actually shared between processes will no longer work correctly

So, do we actually care about this? Previously this worked only on non-SGX PAL, right?

CC: @dimakuv, @kailun-qin, @oshogbo



libos/src/fs/libos_fs_util.c line 255 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I agree, but we don't have this semantics even in the current implementation. So I think this PR is strictly better than what we currently have.

Also, looking at our tests and example apps, seems like no app uses such a strange pattern of mmap() + write(...extend...). Or do you have some other scenario in mind?

Hmm, but this shouldn't be hard to add, I think? Or maybe you're worrying about the performance if we add it to each write()? If it's the latter, then I think it's possible to optimize it by storing more metadata so that the overhead is negligible for files which don't have any mappings.


libos/test/ltp/manifest.template line 20 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This is taken from #1718. So I'd say that that PR is the prerequisite of this one.

This is required because LTP uses shared file-backed mmaps without explicit msyncs or munmaps (recall that Gramine requires explicit points like msync() and munmap() to synchronize mappings with file contents).

#1718 was merged, could you rebase? Current merge base doesn't contain it.

@dimakuv dimakuv force-pushed the dimakuv/chroot-uses-generic-mmap branch from 739b2e1 to 074b34e Compare June 8, 2024 18:56
Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 21 files reviewed, 7 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)

a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

So, do we actually care about this? Previously this worked only on non-SGX PAL, right?

CC: @dimakuv, @kailun-qin, @oshogbo

Yes, previously it worked only on non-SGX PAL (i.e. in gramine-direct).

I would say that this was a bad experience for everyone involved:

  • Users were surprised that something that works on gramine-direct suddenly doesn't work on gramine-sgx, for no good reason.
  • Developers were annoyed to not being able to debug mmap-related issues under gramine-direct.

To me, this "major regression" is actually a win. It makes mmap behavior sane and uniform.



-- commits line 26 at r2:
TODO:

file-growing `ftruncate()` on

->

file-growing `write()` and `ftruncate()` on

libos/src/bookkeep/libos_vma.c line 1558 at r2 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

this iterates over all VMAs we have in memory, not just VMAs belonging to this file; once we fix write to also call this function, this is likely to cause significant performance problems

To make the claim of @mwkmwkmwk more specific:

  • The dump_vmas() function indeed iterates over all VMAs.
  • However, this function calls a vma_filter_needs_prot_refresh() callback on each VMA, to verify that this VMA belongs to this file (hdl).
  • Unfortunately, we don't maintain a per-handle list of VMAs over which we could iterate. I guess this was the crux of @mwkmwkmwk's comment.

So, I generally agree with @mwkmwkmwk, but I don't see a simple way around this O(n) logic.

Btw, I added write() logic now.


libos/src/fs/libos_fs_util.c line 255 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Hmm, but this shouldn't be hard to add, I think? Or maybe you're worrying about the performance if we add it to each write()? If it's the latter, then I think it's possible to optimize it by storing more metadata so that the overhead is negligible for files which don't have any mappings.

Done.

I forgot that we already have this metadata -- hdl->inode->num_mmapped. I'm using that one now.


libos/test/ltp/manifest.template line 20 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

#1718 was merged, could you rebase? Current merge base doesn't contain it.

Done.

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 21 files reviewed, 8 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)

a discussion (no related file):
Did a quick perf test on my (old-ish) machine.

Runtimes of the full FS pytest suite (in Gramine debug mode, non-EDMM):

  • Without PR

    • direct: 4.41 seconds
    • sgx: 84.12 seconds
  • With PR

    • direct: 4.63 seconds
    • sgx: 87.66 seconds

Results look reasonable to me. I guess we'll need to do a few more perf tests (in release mode, with EDMM, on different machines) before being ready for merge.


Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 19 files at r1, 6 of 10 files at r3, all commit messages.
Reviewable status: 16 of 21 files reviewed, 12 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mwkmwkmwk, and @oshogbo)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Yes, previously it worked only on non-SGX PAL (i.e. in gramine-direct).

I would say that this was a bad experience for everyone involved:

  • Users were surprised that something that works on gramine-direct suddenly doesn't work on gramine-sgx, for no good reason.
  • Developers were annoyed to not being able to debug mmap-related issues under gramine-direct.

To me, this "major regression" is actually a win. It makes mmap behavior sane and uniform.

Ok, makes sense to me.


a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Did a quick perf test on my (old-ish) machine.

Runtimes of the full FS pytest suite (in Gramine debug mode, non-EDMM):

  • Without PR

    • direct: 4.41 seconds
    • sgx: 84.12 seconds
  • With PR

    • direct: 4.63 seconds
    • sgx: 87.66 seconds

Results look reasonable to me. I guess we'll need to do a few more perf tests (in release mode, with EDMM, on different machines) before being ready for merge.

Is the slowdown caused by some specific test(s), or distributed equally? 5% is noticable, I think it should be less (or rather: I don't see why it has to be that much). OTOH, our tests are not designed to be benchmarks, so, could you try running something more suitable?



libos/include/libos_vma.h line 136 at r3 (raw file):

int reload_mmaped_from_file_handle(struct libos_handle* hdl);

/* Refresh page protections of file mappings of `hdl` */

Please add a short remark why this is needed in the first place, which is: when the file size has changed.


libos/src/fs/libos_fs_util.c line 152 at r3 (raw file):

        return pal_to_unix_errno(ret);

    size_t to_read_size = size;

Sounds like Yoda, I assume you meant size_to_read?

Code quote:

to_read_size

libos/src/fs/libos_fs_util.c line 152 at r3 (raw file):

        return pal_to_unix_errno(ret);

    size_t to_read_size = size;

I think Linux rounds the read size (or rather: map size) up to a whole page. Probably nothing relies on this, but IMO better to emulate this exactly. Right now we're leaving the padding with zeros instead of file contents.
We probably should also add a test for this? (ideally, just a subtest in one of the existing ones)


libos/src/fs/libos_fs_util.c line 273 at r3 (raw file):

        ret = prot_refresh_mmaped_from_file_handle(hdl);
        if (ret < 0) {
            log_error("refresh of page protections of mmapped regions of file failed: %s",

Suggestion:

refreshing page protections

pal/include/pal/pal.h line 426 at r1 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

We may want to rename this to PalDeviceMap()

Could you do that in this PR? There aren't that many places left using it.


pal/src/pal_rtld.c line 555 at r3 (raw file):

            goto out;
        }
        ret = _PalStreamRead(handle, c->map_off, map_size, map_addr);

ditto (rounding up the read size)

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 10 of 27 files reviewed, 13 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)

a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

Ok, makes sense to me.

Done from my side


a discussion (no related file):

Previously, mkow (Michał Kowalczyk) wrote…

Is the slowdown caused by some specific test(s), or distributed equally? 5% is noticable, I think it should be less (or rather: I don't see why it has to be that much). OTOH, our tests are not designed to be benchmarks, so, could you try running something more suitable?

Ok, here are the new results. I had a couple bugs/issues in my previous quick perf test; now I think I fixed all those issues.

  1. First issue was that I ran in debug mode. Now I run in release mode.
  2. Second issue (main one) was that I didn't compare the libos/test/fs/ pytest apples-to-apples. That's because this PR enables several tests which were previously (on master) disabled. So now I have a pre-requisite -- see below -- to compare apples-to-apples.
  3. Third issue that I did the test only on one machine, without EDMM. Now I also show the results with EDMM.

Pre-requisite

We compare with current master for this PR (commit cbd30b7).

With this PR, to compare apples-to-apples (same total number of FS tests), need to revert Pytest files:

git checkout cbd30b795098cd459a3e2165e9cf56f99cf2dc65 -- \
    libos/test/fs/test_fs.py \
	libos/test/fs/test_enc.py \
	libos/test/fs/test_tmpfs.py

old non-EDMM machine (fast hard disk)

  • without PR

    • direct: 3.79 seconds
    • sgx: 69.75 seconds
  • with PR

    • direct: 4.01 seconds
    • sgx: 69.78 seconds

new EDMM machine (slow hard disk)

  • without PR

    • direct: 9.35 seconds
    • sgx (with EDMM=1): 17.04s seconds
  • with PR

    • direct: 9.38s seconds
    • sgx (with EDMM=1): 17.06s seconds

Conclusion: Things make sense now. First, gramine-direct is a bit slower now because it doesn't use the "cheaty" mmap(..., file-fd, ...) but always uses a generic emulate-mmap-via-read-write. Second, gramine-sgx has exact same performance, because nothing really changed in this PR for the SGX case (except those IF checks on write/truncate).



libos/include/libos_vma.h line 136 at r3 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Please add a short remark why this is needed in the first place, which is: when the file size has changed.

Done.


libos/src/fs/libos_fs_util.c line 152 at r3 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Sounds like Yoda, I assume you meant size_to_read?

Done.


libos/src/fs/libos_fs_util.c line 152 at r3 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

I think Linux rounds the read size (or rather: map size) up to a whole page. Probably nothing relies on this, but IMO better to emulate this exactly. Right now we're leaving the padding with zeros instead of file contents.
We probably should also add a test for this? (ideally, just a subtest in one of the existing ones)

I don't understand what you mean.

Please note that this callback function gets 4KB-aligned addr and size. I added asserts now to make it explicit.

Do these asserts clear the picture now? We (1) allocate size zero-filled bytes, aligned to a whole page, then we (2) read as much as possible into these allocated bytes from the underlying file, and finally we (3) PROT_NONE those pages that do not cover the read-from-file bytes (whole pages).

So in a general case, it ends up like this:

  1. size / 4KB whole pages allocated, with proper permissions (at least Read):
    a. First N pages contain the contents of the file;
    b. The last page where file contents stop (if any) gets padded with zeros;
  2. Last size / 4KB - N pages have PROT_NONE.

This is exactly how Linux works. In fact, I believe we have enough tests for this, because that's what Glibc and many apps rely on. But if you still want a sub-test, what should it be?


libos/src/fs/libos_fs_util.c line 273 at r3 (raw file):

        ret = prot_refresh_mmaped_from_file_handle(hdl);
        if (ret < 0) {
            log_error("refresh of page protections of mmapped regions of file failed: %s",

Done.


pal/include/pal/pal.h line 426 at r1 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Could you do that in this PR? There aren't that many places left using it.

Done.


pal/src/pal_rtld.c line 555 at r3 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto (rounding up the read size)

ditto, I don't understand the comment.

Similar to another place, I added the asserts to make it explicit.

Please note that implementations of _PalStreamRead() won't read past the end of the file. For example, for trusted files (which the PAL entrypoint must be), we have this guardrail:

off_t end = MIN(offset + count, handle->file.size);

Also note that below (I put a "mark" comment) our code explicitly zeroes-out the unused/unread parts of the allocated memory region.


pal/src/pal_rtld.c line 612 at r3 (raw file):

         * segment allocated via _PalVirtualMemoryAlloc() is already zeroed out) */
        if (ALLOC_ALIGN_UP(c->data_end) > c->data_end)
            memset((void*)c->data_end, 0, ALLOC_ALIGN_UP(c->data_end) - c->data_end);

mark (see comment above)

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 19 files at r1, 1 of 10 files at r3, 15 of 15 files at r4, all commit messages.
Reviewable status: all files reviewed, 17 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mwkmwkmwk, and @oshogbo)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Ok, here are the new results. I had a couple bugs/issues in my previous quick perf test; now I think I fixed all those issues.

  1. First issue was that I ran in debug mode. Now I run in release mode.
  2. Second issue (main one) was that I didn't compare the libos/test/fs/ pytest apples-to-apples. That's because this PR enables several tests which were previously (on master) disabled. So now I have a pre-requisite -- see below -- to compare apples-to-apples.
  3. Third issue that I did the test only on one machine, without EDMM. Now I also show the results with EDMM.

Pre-requisite

We compare with current master for this PR (commit cbd30b7).

With this PR, to compare apples-to-apples (same total number of FS tests), need to revert Pytest files:

git checkout cbd30b795098cd459a3e2165e9cf56f99cf2dc65 -- \
    libos/test/fs/test_fs.py \
	libos/test/fs/test_enc.py \
	libos/test/fs/test_tmpfs.py

old non-EDMM machine (fast hard disk)

  • without PR

    • direct: 3.79 seconds
    • sgx: 69.75 seconds
  • with PR

    • direct: 4.01 seconds
    • sgx: 69.78 seconds

new EDMM machine (slow hard disk)

  • without PR

    • direct: 9.35 seconds
    • sgx (with EDMM=1): 17.04s seconds
  • with PR

    • direct: 9.38s seconds
    • sgx (with EDMM=1): 17.06s seconds

Conclusion: Things make sense now. First, gramine-direct is a bit slower now because it doesn't use the "cheaty" mmap(..., file-fd, ...) but always uses a generic emulate-mmap-via-read-write. Second, gramine-sgx has exact same performance, because nothing really changed in this PR for the SGX case (except those IF checks on write/truncate).

Ok, now it makes sense, so I'm resolving :)



libos/src/bookkeep/libos_vma.c line 1558 at r2 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

To make the claim of @mwkmwkmwk more specific:

  • The dump_vmas() function indeed iterates over all VMAs.
  • However, this function calls a vma_filter_needs_prot_refresh() callback on each VMA, to verify that this VMA belongs to this file (hdl).
  • Unfortunately, we don't maintain a per-handle list of VMAs over which we could iterate. I guess this was the crux of @mwkmwkmwk's comment.

So, I generally agree with @mwkmwkmwk, but I don't see a simple way around this O(n) logic.

Btw, I added write() logic now.

One note here: in revision r4 of this PR it doesn't trigger on all write() calls, only on writes to files with mappings, so it should be reasonably fast (also, see the top-level discussion about performance).


libos/src/bookkeep/libos_vma.c line 1530 at r4 (raw file):

        size_to_prot = 0;
    } else {
        if (file_size - vma_info->file_offset > vma_info->length) {

ditto (see below)


libos/src/bookkeep/libos_vma.c line 1548 at r4 (raw file):

    if (vma_info->length - size_to_prot) {
        ret = PalVirtualMemoryProtect(vma_info->addr + size_to_prot,
                                      vma_info->length - size_to_prot, /*prot=*/0);

Hmm, doesn't it break some invariants that vma_info holds different protection information than the real memory? What e.g. about EFAULT checks in syscalls?


libos/src/bookkeep/libos_vma.c line 1599 at r4 (raw file):

        /* NOTE: Unfortunately there's a data race here: the memory can be unmapped, or remapped, by
         * another thread by the time we get to `msync`. */
        lock(&file->inode->lock);

Are we guaranteed here that file->inode exists?

    /* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */
    struct libos_inode* inode;

I'm always confused by that, I'm not sure when it exists and when it doesn't...


libos/src/bookkeep/libos_vma.c line 1608 at r4 (raw file):

            size_to_msync = 0;
        } else {
            if (file_size - vma_info->file_offset > vma_info->length) {

That's a very long way to write MIN() ;)


libos/src/bookkeep/libos_vma.c line 1609 at r4 (raw file):

        } else {
            if (file_size - vma_info->file_offset > vma_info->length) {
                /* file size exceeds the mmapped part in VMA, all VMA is synced */

Suggestion:

the whole VMA is synced */

libos/src/fs/libos_fs_util.c line 152 at r3 (raw file):
Ah, I thought this function has the same semantic as mmap(): (quoting man)

The address addr must be a multiple of the page size (but length need not be).

Our callback actually doesn't have this semantics, so there's nothing wrong here.


libos/src/fs/chroot/encrypted.c line 496 at r4 (raw file):

    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) {
        /* If there are any mappings for the file, this will refresh their access protections. */

ditto (see below)


libos/src/fs/chroot/encrypted.c line 504 at r4 (raw file):

        }

        /* If there are any MAP_SHARED mappings for the file, this will read data from `enc`. */

ditto


libos/src/fs/chroot/encrypted.c line 537 at r4 (raw file):

    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) {
        /* If there are any mappings for the file, this will refresh their access protections. */

Suggestion:

/* There are mappings for the file, refresh their access protections. */

libos/src/fs/chroot/fs.c line 247 at r4 (raw file):

    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) {
        /* If there are any mappings for the file, this will refresh their access protections. */

Suggestion:

/* There are mappings for the file, refresh their access protections. */

libos/src/fs/chroot/fs.c line 255 at r4 (raw file):

        }

        /* If there are any MAP_SHARED mappings for the file, this will read data from `hdl`. */

ditto


libos/src/fs/tmpfs/fs.c line 302 at r4 (raw file):
Can you read this without the lock? struct libos_inode documentation says:

The fields in this structure are protected by lock, with the exception of fields that do not change (type, mount, fs).


pal/src/pal_rtld.c line 555 at r3 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

ditto, I don't understand the comment.

Similar to another place, I added the asserts to make it explicit.

Please note that implementations of _PalStreamRead() won't read past the end of the file. For example, for trusted files (which the PAL entrypoint must be), we have this guardrail:

off_t end = MIN(offset + count, handle->file.size);

Also note that below (I put a "mark" comment) our code explicitly zeroes-out the unused/unread parts of the allocated memory region.

Ah, I didn't notice that c doesn't contain raw ELF values, but already-processed ones (i.e. aligned by the code above). Raw ELF load commands don't have to be fully aligned to memory pages.

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 22 of 27 files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @kailun-qin, @mkow, @mwkmwkmwk, and @oshogbo)


libos/src/bookkeep/libos_vma.c line 1530 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto (see below)

Done.


libos/src/bookkeep/libos_vma.c line 1548 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Hmm, doesn't it break some invariants that vma_info holds different protection information than the real memory? What e.g. about EFAULT checks in syscalls?

Hm, there good point.

I'm afraid you're correct but there's no easy way to fix this. Our VMA subsystem is not ready for such stuff. What we would need is to be able to split and merge adjacent VMAs on these callback functions. VMA
splitting logic exists in our code, but merging doesn't exist, see e.g. this comment:

/*
* "vma_tree" holds all vmas with the assumption that no 2 overlap (though they could be adjacent).
* Currently we do not merge similar adjacent vmas - if we ever start doing it, this code needs
* to be revisited as there might be some optimizations that would break due to it.
*/
static struct avl_tree vma_tree = {.cmp = vma_tree_cmp};

Interestingly, we have been having this same issue for ages in our checkpoint-and-restore logic:

/* Send file-backed memory region. */
uint64_t file_size = 0;
int ret = get_file_size(vma->file, &file_size);
if (ret < 0)
return ret;
/* Access beyond the last file-backed page will cause SIGBUS. For reducing fork
* latency, we send only those memory contents of VMA that are backed by the file,
* round up to pages. Rest of VMA memory region will be inaccessible in the child
* process. */
size_t send_size = vma->length;
if (vma->file_offset + vma->length > file_size) {
send_size = file_size > vma->file_offset ? file_size - vma->file_offset : 0;
send_size = ALLOC_ALIGN_UP(send_size);
}
/* It may happen that the whole file-backed memory is beyond the file size (e.g.,
* the file was truncated after the memory was allocated). In this case we consider
* the whole memory region to be inaccessible. */
if (send_size > 0) {
struct libos_mem_entry* mem;
DO_CP_SIZE(memory, vma->addr, send_size, &mem);
mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0);
}

TLDR:

  1. Your issue is legit.
  2. We can easily split VMAs into two: one with normal protections, one with NONE protections.
  3. But we cannot merge those VMAs back together (not implemented) when the write() or truncate() grows the file further.
  4. So we may end up with hundreds of adjacent VMAs, slowing down everything.
  5. We had the same issue in restored VMAs (for opened files in child processes); we didn't observe any issues because of that in real-world apps.

So given the above, I think we should postpone fixing this issue (for the time when we are ready to modify VMA bookkeeping code). I wrote a corresponding FIXME comment here.


libos/src/bookkeep/libos_vma.c line 1599 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Are we guaranteed here that file->inode exists?

    /* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */
    struct libos_inode* inode;

I'm always confused by that, I'm not sure when it exists and when it doesn't...

First of all, you're looking at the wrong inode field. What you linked is the dentry field:

struct libos_dentry {
/* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */
struct libos_inode* inode;

But here we're using the inode field of the LibOS handle:

/*
* Inode associated with this handle. Currently optional, and only for the use of underlying
* filesystem (see `libos_inode` in `libos_fs.h`). Eventually, should replace `dentry` fields.
*
* This field does not change, so reading it does not require holding `lock`.
*
* When taking locks for both handle and inode (`hdl->lock` and `hdl->inode->lock`), you should
* lock the *inode* first.
*/
struct libos_inode* inode;

If the handle has an inode, then it always has this inode. And we are guaranteed to have an inode here, because file handle is an actual file in this function -- we know this because this function is called from write() and truncate() callbacks in the FS subsystem.


libos/src/bookkeep/libos_vma.c line 1608 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

That's a very long way to write MIN() ;)

But that was on purpose, to have explicit comments what and why we're doing. Do you want me to remove these comments? I like them.


libos/src/bookkeep/libos_vma.c line 1609 at r4 (raw file):

        } else {
            if (file_size - vma_info->file_offset > vma_info->length) {
                /* file size exceeds the mmapped part in VMA, all VMA is synced */

Done. (2x)


libos/src/fs/chroot/encrypted.c line 496 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto (see below)

Done.


libos/src/fs/chroot/encrypted.c line 504 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto

Done. But here had to rephrase a bit, to make clear it will happen for MAP_SHARED mappings only


libos/src/fs/chroot/encrypted.c line 537 at r4 (raw file):

    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) {
        /* If there are any mappings for the file, this will refresh their access protections. */

Done.


libos/src/fs/chroot/fs.c line 247 at r4 (raw file):

    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) {
        /* If there are any mappings for the file, this will refresh their access protections. */

Done.


libos/src/fs/chroot/fs.c line 255 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto

Done.


libos/src/fs/tmpfs/fs.c line 302 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Can you read this without the lock? struct libos_inode documentation says:

The fields in this structure are protected by lock, with the exception of fields that do not change (type, mount, fs).

Done. Oops, good catch.

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 22 of 27 files reviewed, 15 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)

a discussion (no related file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Done from my side

very well then



libos/src/bookkeep/libos_vma.c line 1558 at r2 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

One note here: in revision r4 of this PR it doesn't trigger on all write() calls, only on writes to files with mappings, so it should be reasonably fast (also, see the top-level discussion about performance).

I still think we should keep all VMAs of a given libos_handle on a linked list; however, I'm fine with this being fixed in a separate PR


libos/src/bookkeep/libos_vma.c line 1548 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Hm, there good point.

I'm afraid you're correct but there's no easy way to fix this. Our VMA subsystem is not ready for such stuff. What we would need is to be able to split and merge adjacent VMAs on these callback functions. VMA
splitting logic exists in our code, but merging doesn't exist, see e.g. this comment:

/*
* "vma_tree" holds all vmas with the assumption that no 2 overlap (though they could be adjacent).
* Currently we do not merge similar adjacent vmas - if we ever start doing it, this code needs
* to be revisited as there might be some optimizations that would break due to it.
*/
static struct avl_tree vma_tree = {.cmp = vma_tree_cmp};

Interestingly, we have been having this same issue for ages in our checkpoint-and-restore logic:

/* Send file-backed memory region. */
uint64_t file_size = 0;
int ret = get_file_size(vma->file, &file_size);
if (ret < 0)
return ret;
/* Access beyond the last file-backed page will cause SIGBUS. For reducing fork
* latency, we send only those memory contents of VMA that are backed by the file,
* round up to pages. Rest of VMA memory region will be inaccessible in the child
* process. */
size_t send_size = vma->length;
if (vma->file_offset + vma->length > file_size) {
send_size = file_size > vma->file_offset ? file_size - vma->file_offset : 0;
send_size = ALLOC_ALIGN_UP(send_size);
}
/* It may happen that the whole file-backed memory is beyond the file size (e.g.,
* the file was truncated after the memory was allocated). In this case we consider
* the whole memory region to be inaccessible. */
if (send_size > 0) {
struct libos_mem_entry* mem;
DO_CP_SIZE(memory, vma->addr, send_size, &mem);
mem->prot = LINUX_PROT_TO_PAL(vma->prot, /*map_flags=*/0);
}

TLDR:

  1. Your issue is legit.
  2. We can easily split VMAs into two: one with normal protections, one with NONE protections.
  3. But we cannot merge those VMAs back together (not implemented) when the write() or truncate() grows the file further.
  4. So we may end up with hundreds of adjacent VMAs, slowing down everything.
  5. We had the same issue in restored VMAs (for opened files in child processes); we didn't observe any issues because of that in real-world apps.

So given the above, I think we should postpone fixing this issue (for the time when we are ready to modify VMA bookkeeping code). I wrote a corresponding FIXME comment here.

I'd propose a somewhat different handling of this.

I think we should commit to the model of "logical" split within a single VMA: some prefix of the VMA is actually active (and has valid pages), while the rest is unmapped (and returns SIGBUS). This is, in fact, what our code already expects in some places — see memfault_upcall in libos_signal.c, for one.

What I would propose is:

  • have a valid_end (exact name TBD) field added to libos_vma and libos_vma_info, which is the boundary between valid area and SIGBUS-producing area
  • have a function in libos_vma.c that updates that field and calls PalVirtualMemoryProtect as appropriate to enforce it
  • check this field in is_in_adjacent_user_vmas so that syscalls return -EFAULT as appropriate

This should ensure correct semantics without too much suffering with split_vma.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r5, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 1599 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

First of all, you're looking at the wrong inode field. What you linked is the dentry field:

struct libos_dentry {
/* Inode associated with this dentry, or NULL. Protected by `g_dcache_lock`. */
struct libos_inode* inode;

But here we're using the inode field of the LibOS handle:

/*
* Inode associated with this handle. Currently optional, and only for the use of underlying
* filesystem (see `libos_inode` in `libos_fs.h`). Eventually, should replace `dentry` fields.
*
* This field does not change, so reading it does not require holding `lock`.
*
* When taking locks for both handle and inode (`hdl->lock` and `hdl->inode->lock`), you should
* lock the *inode* first.
*/
struct libos_inode* inode;

If the handle has an inode, then it always has this inode. And we are guaranteed to have an inode here, because file handle is an actual file in this function -- we know this because this function is called from write() and truncate() callbacks in the FS subsystem.

Ok, reading this won't be needed after implementing @mwkmwkmwk's suggestion from the discussion above (we'll be able to get this info from the vma).


libos/src/bookkeep/libos_vma.c line 1608 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

But that was on purpose, to have explicit comments what and why we're doing. Do you want me to remove these comments? I like them.

Seems overly verbose for a trivial thing to me, but won't block on this.

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 1558 at r2 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

I still think we should keep all VMAs of a given libos_handle on a linked list; however, I'm fine with this being fixed in a separate PR

Done. Created an issue on this: #1917


libos/src/bookkeep/libos_vma.c line 1548 at r4 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

I'd propose a somewhat different handling of this.

I think we should commit to the model of "logical" split within a single VMA: some prefix of the VMA is actually active (and has valid pages), while the rest is unmapped (and returns SIGBUS). This is, in fact, what our code already expects in some places — see memfault_upcall in libos_signal.c, for one.

What I would propose is:

  • have a valid_end (exact name TBD) field added to libos_vma and libos_vma_info, which is the boundary between valid area and SIGBUS-producing area
  • have a function in libos_vma.c that updates that field and calls PalVirtualMemoryProtect as appropriate to enforce it
  • check this field in is_in_adjacent_user_vmas so that syscalls return -EFAULT as appropriate

This should ensure correct semantics without too much suffering with split_vma.

I like the proposal from @mwkmwkmwk. @mkow @kailun-qin What do you think?

If there is no hard no to this proposal, I'll cook a PR for this. That new PR will become a blocker for this PR (which I'll need to rebase on top of that one).


libos/src/bookkeep/libos_vma.c line 1608 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Seems overly verbose for a trivial thing to me, but won't block on this.

Let's keep it like this unless some other reviewer asks to modify this to MIN()

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 1548 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

I like the proposal from @mwkmwkmwk. @mkow @kailun-qin What do you think?

If there is no hard no to this proposal, I'll cook a PR for this. That new PR will become a blocker for this PR (which I'll need to rebase on top of that one).

+1 from my side.


libos/src/bookkeep/libos_vma.c line 1608 at r4 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Let's keep it like this unless some other reviewer asks to modify this to MIN()

Sure

Copy link
Contributor

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 1548 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

+1 from my side.

+1, sounds great

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 18 of 31 files reviewed, 5 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @mkow, and @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 1548 at r4 (raw file):

Previously, kailun-qin (Kailun Qin) wrote…

+1, sounds great

Done. Added as a fixup commit to this PR, because the two changes (the one in this PR and the one @mwkmwkmwk suggested) are too tied together. Hope it's fine (I tried but didn't find a good way to create a prerequisite PR with this valid_end/valid_length). But now the code is really much easier to follow, IMHO.


libos/src/bookkeep/libos_vma.c line 1599 at r4 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Ok, reading this won't be needed after implementing @mwkmwkmwk's suggestion from the discussion above (we'll be able to get this info from the vma).

Done.


libos/src/bookkeep/libos_vma.c line 1374 at r5 (raw file):

}

static bool vma_filter_needs_msync(struct libos_vma* vma, void* arg) {

FYI: I moved vma_filter_needs_msync() helper function closer to its usage. This confuses Reviewable a bit, so if requested, I can move this function back (and do the code-move in a separate PR later).

@dimakuv
Copy link
Author

dimakuv commented Jun 24, 2024

Jenkins, retest Jenkins-SGX-EDMM please

@dimakuv
Copy link
Author

dimakuv commented Jun 24, 2024

Jenkins, retest Jenkins-SGX-EDMM please

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 13 of 13 files at r6, all commit messages.
Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)


libos/include/libos_vma.h line 103 at r6 (raw file):

Looks up VMA that contains addr and if found, updates vma->valid_length.

That's a bit weird semantics IMO. It allows addr to point anywhere within the vma, so it's not intuitive whether the length is the length from addr or from the beginning of the vma.
Maybe force addr to always point to the beginning of the mapping?


libos/src/libos_rtld.c line 274 at r6 (raw file):

            return ret;
        }
        if (valid_size != map_size) {

Why is this correct? How does this work with large .bss sections? (see the if below)


libos/src/bookkeep/libos_signal.c line 365 at r6 (raw file):

            /*
             * If the mapping exceeds end of a file then return a SIGBUS. Actually, the host
             * environment would most probably raise SIGBUS anyway, but Gramine PAL API fuses both

Not on the SGX PAL? There it's just a SIGSEGV there? (and it's SGX which is our "most probable" setup)


libos/src/bookkeep/libos_vma.c line 176 at r6 (raw file):

 * it's called with the VMA lock held.
 *
 * Returns whether the traversed range was continuously covered by VMAs. This is useful for

This needs an update, a region can now be covered by VMAs but the function can return false if some end != valid_end.

Code quote:

Returns whether the traversed range was continuously covered by VMAs.

libos/src/bookkeep/libos_vma.c line 194 at r6 (raw file):

    assert(vma->begin <= begin);
    struct libos_vma* prev = NULL;
    bool is_continuous = true;

Why is this always true? (this plus the assert above)
I don't see why is that.


libos/src/bookkeep/libos_vma.c line 832 at r6 (raw file):

    new_vma->begin     = (uintptr_t)addr;
    new_vma->end       = new_vma->begin + length;
    new_vma->valid_end = new_vma->begin + length;

Hmm, why ignoring file here? valid_end is potentially incorrect now.

Update: I see, it's updated after the real mmap? Could you add a comment here explaining that it's temporarily wrong and should be updated after the real mmap call?


libos/src/bookkeep/libos_vma.c line 1126 at r6 (raw file):

out_found:
    new_vma->end       = max_addr;
    new_vma->valid_end = max_addr;

ditto


libos/src/bookkeep/libos_vma.c line 1418 at r6 (raw file):

    spinlock_lock(&vma_tree_lock);
    bool is_continuous = _traverse_vmas_in_range(begin, end, madvise_dontneed_visitor, &ctx);

This has different semantics now regarding -ENOMEM on the region after the file contents. Which one is the one Linux follows?


libos/src/bookkeep/libos_vma.c line 1537 at r6 (raw file):

};

static bool vma_filter_needs_prot_refresh(struct libos_vma* vma, void* _args) {

This isn't just a "filter" anymore, it updates the traversed VMAs.

Code quote:

vma_filter

libos/src/bookkeep/libos_vma.c line 1553 at r6 (raw file):

        return false;

    /* by default, assume that file got smaller than the offset from which VMA is mapped, all VMA is

What do you mean by "by default assume [XYZ]"? This code has to handle all possible cases? What is a "default" in this context?


libos/src/bookkeep/libos_vma.c line 1734 at r6 (raw file):

                 *
                 * Access beyond the last file-backed page (reflected via vma->valid_length) will
                 * cause SIGBUS. For reducing fork latency, we send only those memory contents of

Hmm, I think this sentence doesn't make sense now? The rest of the region is inaccessible anyways, there's nothing to send from it.

Code quote:

For reducing fork latency

libos/src/sys/libos_mmap.c line 198 at r6 (raw file):

                uintptr_t begin = MAX((uintptr_t)addr, (uintptr_t)vma->addr);
                uintptr_t end = MIN((uintptr_t)vma->addr + vma->length,
                                    (uintptr_t)addr + length);

Accidental change?


libos/src/sys/libos_mmap.c line 265 at r6 (raw file):

            if (update_valid_length_ret < 0) {
                log_error("[mmap] Failed to update valid length to %lu of bookkeeped memory %p-%p!",
                        valid_length, addr, (char*)addr + length);

wrong alignment

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @mkow, and @mwkmwkmwk)


libos/include/libos_vma.h line 103 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Looks up VMA that contains addr and if found, updates vma->valid_length.

That's a bit weird semantics IMO. It allows addr to point anywhere within the vma, so it's not intuitive whether the length is the length from addr or from the beginning of the vma.
Maybe force addr to always point to the beginning of the mapping?

Done.


libos/src/libos_rtld.c line 274 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Why is this correct? How does this work with large .bss sections? (see the if below)

This is correct.

map_size = c->map_end - c->start (see several lines above). Here, c->start is the (page-aligned) offset at which the allocation of the segment should start. And c->map_end is the (page-aligned) end offset at which the allocation of the segment should stop, see

struct loadcmd {
/*
* Load command for a single segment. The following properties are true:
*
* - start <= data_end <= map_end <= alloc_end
* - start, map_end, alloc_end are page-aligned
* - map_off is page-aligned
*
* The addresses are not relocated (i.e. you need to add l_base_diff to them).
*/
/* Start of memory area */
elf_addr_t start;
/* End of file data (data_end .. alloc_end should be zeroed out) */
elf_addr_t data_end;
/* End of mapped file data (data_end rounded up to page size, so that we can mmap
* start .. map_end) */
elf_addr_t map_end;
/* End of memory area */
elf_addr_t alloc_end;
/* Offset from the beginning of file at which the first byte of the segment resides */
elf_off_t map_off;
/* Permissions for memory area */
int prot;
};

As for BSS sections, they end up in the memory region defined by [c->map_end, c->alloc_end). See

/* Allocate extra pages after the mapped area. */
if (c->map_end < c->alloc_end) {
void* zero_page_start = (void*)(c->map_end + base_diff);
size_t zero_page_size = c->alloc_end - c->map_end;
int zero_map_flags = MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS;
pal_prot_flags_t zero_pal_prot = LINUX_PROT_TO_PAL(c->prot, zero_map_flags);
if ((ret = bkeep_mmap_fixed(zero_page_start, zero_page_size, c->prot, zero_map_flags,
/*file=*/NULL, /*offset=*/0, /*comment=*/NULL)) < 0) {
log_debug("cannot bookkeep address of zero-fill pages");
return ret;
}
if ((ret = PalVirtualMemoryAlloc(zero_page_start, zero_page_size, zero_pal_prot)) < 0) {
log_debug("cannot map zero-fill pages");
return pal_to_unix_errno(ret);
}
}


So I could relax this check to if (valid_size > map_size) {fail}. But for correctly-formatted ELF files, there's no need to relax. And I don't think we want to care about wrongly formatted ELFs.


libos/src/bookkeep/libos_signal.c line 365 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Not on the SGX PAL? There it's just a SIGSEGV there? (and it's SGX which is our "most probable" setup)

You're right. But it's not only about SGX PAL, the same happens in direct PAL. Because we emulate this "mapping exceeds end of a file" (which in Linux is a SIGBUS condition) with the "exceeds-end-of-file part of mapping has NONE permissions" (which is a SIGSEGV condition).

So my comment is plain wrong. I guess I'll just revert to the previous comment, short and precise.


libos/src/bookkeep/libos_vma.c line 176 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

This needs an update, a region can now be covered by VMAs but the function can return false if some end != valid_end.

Done.


libos/src/bookkeep/libos_vma.c line 194 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Why is this always true? (this plus the assert above)
I don't see why is that.

_lookup_vma(begin) is guaranteed to return a VMA which address is lower than/equal to the begin argument.

This looked up VMA may be completely below than the [begin, end) region, but this bad case is covered by the IF check above.

So the only result at this point in the code is that we have a VMA which contains the begin address (the VMA "covers" it). Thus assert() is trivially true.

So I changed this line to a more obvious assert() + true. I can revert back -- there is no difference in functionality -- but I think my version shows a clearer intent of this code.


libos/src/bookkeep/libos_vma.c line 832 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Hmm, why ignoring file here? valid_end is potentially incorrect now.

Update: I see, it's updated after the real mmap? Could you add a comment here explaining that it's temporarily wrong and should be updated after the real mmap call?

Done.


libos/src/bookkeep/libos_vma.c line 1126 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto

Done.


libos/src/bookkeep/libos_vma.c line 1418 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

This has different semantics now regarding -ENOMEM on the region after the file contents. Which one is the one Linux follows?

TODO:

  • Add a new LibOS test that tries to access a SIGBUS part of file-backed VMA and check
  • what happens on e.g. write(<all-vma>)
  • what happens on e.g. write(<no-sigbus-part-of-memory>)
  • what happens on e.g. write(<only-sigbus-part-of-memory>)
  • what happens on madvise(MADV_DONTNEED, <all-vma>)
  • what happens on madvise(MADV_DONTNEED, <no-sigbus-part-of-memory>)
  • what happens on madvise(MADV_DONTNEED, <only-sigbus-part-of-memory>)

libos/src/bookkeep/libos_vma.c line 1537 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

This isn't just a "filter" anymore, it updates the traversed VMAs.

Done.

Wasn't sure how to best rename, so just removed the _filter_ part. But it still doesn't reflect that the vma->valid_end is modified here. Any suggestions are welcome.


libos/src/bookkeep/libos_vma.c line 1553 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

What do you mean by "by default assume [XYZ]"? This code has to handle all possible cases? What is a "default" in this context?

This "by default assume" is meant purely to explain the valid_length = 0; initialization. I added an empty line to make it clear that this comment is for the next line (var init) only.

Previously I had a cascade of IFs to make it even more explicit. But it looked ugly, so I did this "init variable to some case and get rid of one level of IFs".


libos/src/bookkeep/libos_vma.c line 1734 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Hmm, I think this sentence doesn't make sense now? The rest of the region is inaccessible anyways, there's nothing to send from it.

Done, I removed the "reduce fork latency" phrase. The rest of the sentence/paragraph still holds, though I guess it's obvious. I can remove the rest of the paragraph, if you insist.


libos/src/sys/libos_mmap.c line 198 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Accidental change?

Done.


libos/src/sys/libos_mmap.c line 265 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

wrong alignment

Done.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimakuv: Didn't you forget to push?

Reviewable status: all files reviewed, 16 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)

@dimakuv dimakuv force-pushed the dimakuv/chroot-uses-generic-mmap branch 2 times, most recently from 1de768e to 809d603 Compare June 26, 2024 12:56
Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkow That was on purpose. I wanted to push, but realized that I need more code. Now I pushed.

Reviewable status: 26 of 36 files reviewed, 17 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv, @mkow, and @mwkmwkmwk)


-- commits line 41 at r7:
Will do: add new LibOS test mmap_file_sigbus.


libos/src/bookkeep/libos_vma.c line 176 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Done.

Done again: I reverted the changes because apparently that's how Linux works: Linux doesn't treat those SIGBUS-inducing parts of file-backed VMAs as non-contiguous regions. So the function's code doesn't need to be changed. See also my newly added LibOS test.


libos/src/bookkeep/libos_vma.c line 194 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

_lookup_vma(begin) is guaranteed to return a VMA which address is lower than/equal to the begin argument.

This looked up VMA may be completely below than the [begin, end) region, but this bad case is covered by the IF check above.

So the only result at this point in the code is that we have a VMA which contains the begin address (the VMA "covers" it). Thus assert() is trivially true.

So I changed this line to a more obvious assert() + true. I can revert back -- there is no difference in functionality -- but I think my version shows a clearer intent of this code.

Done, this is irrelevant now since I reverted this whole function.


libos/src/bookkeep/libos_vma.c line 1418 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

TODO:

  • Add a new LibOS test that tries to access a SIGBUS part of file-backed VMA and check
  • what happens on e.g. write(<all-vma>)
  • what happens on e.g. write(<no-sigbus-part-of-memory>)
  • what happens on e.g. write(<only-sigbus-part-of-memory>)
  • what happens on madvise(MADV_DONTNEED, <all-vma>)
  • what happens on madvise(MADV_DONTNEED, <no-sigbus-part-of-memory>)
  • what happens on madvise(MADV_DONTNEED, <only-sigbus-part-of-memory>)

Done, see new test mmap_file_sigbus.c


libos/test/regression/mmap_file.c line 4 at r7 (raw file):

 * TODO: - Hans, get ze flammenwerfer...
 *       - Now we have mmap_file_sigbus.c which largely supersedes this test, so can burn this one
 *         with ze flammenwerfer

FYI: Currently kept this test, because more tests is better? Not sure if we want to burn this test, or keep it afloat for some time (until we're sure other tests completely cover this test).

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 26 of 36 files reviewed, 18 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


libos/test/regression/mmap_file_sigbus.c line 12 at r10 (raw file):

 *
 *   - Behavior when using the mmapped regions as a buffer to a syscall
 *     - specifying the whole 2-page region succeeds (writes 1 page) (yes, that's how Linux works)

...and that's what I don't know how to support in Gramine, see inline comments in that sub-test

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 26 of 36 files reviewed, 19 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 184 at r10 (raw file):

 *     `write(partially-valid-vma)` Linux does not return -EFAULT but instead uses the buffer until
 *     the first invalid address. This behavior is too cumbersome to implement in Gramine + SGX,
 *     thus on `write(partially-valid-vma)` Gramine immediately returns -EFAULT.

Enjoy: https://yarchive.net/comp/linux/partial_reads_writes.html

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 10 files at r9, 2 of 2 files at r10, all commit messages.
Reviewable status: all files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mwkmwkmwk)


libos/src/libos_rtld.c line 274 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This is correct.

map_size = c->map_end - c->start (see several lines above). Here, c->start is the (page-aligned) offset at which the allocation of the segment should start. And c->map_end is the (page-aligned) end offset at which the allocation of the segment should stop, see

struct loadcmd {
/*
* Load command for a single segment. The following properties are true:
*
* - start <= data_end <= map_end <= alloc_end
* - start, map_end, alloc_end are page-aligned
* - map_off is page-aligned
*
* The addresses are not relocated (i.e. you need to add l_base_diff to them).
*/
/* Start of memory area */
elf_addr_t start;
/* End of file data (data_end .. alloc_end should be zeroed out) */
elf_addr_t data_end;
/* End of mapped file data (data_end rounded up to page size, so that we can mmap
* start .. map_end) */
elf_addr_t map_end;
/* End of memory area */
elf_addr_t alloc_end;
/* Offset from the beginning of file at which the first byte of the segment resides */
elf_off_t map_off;
/* Permissions for memory area */
int prot;
};

As for BSS sections, they end up in the memory region defined by [c->map_end, c->alloc_end). See

/* Allocate extra pages after the mapped area. */
if (c->map_end < c->alloc_end) {
void* zero_page_start = (void*)(c->map_end + base_diff);
size_t zero_page_size = c->alloc_end - c->map_end;
int zero_map_flags = MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS;
pal_prot_flags_t zero_pal_prot = LINUX_PROT_TO_PAL(c->prot, zero_map_flags);
if ((ret = bkeep_mmap_fixed(zero_page_start, zero_page_size, c->prot, zero_map_flags,
/*file=*/NULL, /*offset=*/0, /*comment=*/NULL)) < 0) {
log_debug("cannot bookkeep address of zero-fill pages");
return ret;
}
if ((ret = PalVirtualMemoryAlloc(zero_page_start, zero_page_size, zero_pal_prot)) < 0) {
log_debug("cannot map zero-fill pages");
return pal_to_unix_errno(ret);
}
}


So I could relax this check to if (valid_size > map_size) {fail}. But for correctly-formatted ELF files, there's no need to relax. And I don't think we want to care about wrongly formatted ELFs.

Oh, wait, there's a separate alloc_end, I missed that, sorry. It's ok as it is then.


libos/src/bookkeep/libos_vma.c line 194 at r6 (raw file):

_lookup_vma(begin) is guaranteed to return a VMA which address is lower than/equal to the begin argument.

Nope, the other way around:

/* Returns the vma that contains `addr`. If there is no such vma, returns the closest vma with
 * higher address. */
static struct libos_vma* _lookup_vma(uintptr_t addr) {

libos/src/bookkeep/libos_vma.c line 1537 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Done.

Wasn't sure how to best rename, so just removed the _filter_ part. But it still doesn't reflect that the vma->valid_end is modified here. Any suggestions are welcome.

Maybe vma_update_valid_ends? And just describe the return value in the comment?


libos/src/bookkeep/libos_vma.c line 1553 at r6 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

This "by default assume" is meant purely to explain the valid_length = 0; initialization. I added an empty line to make it clear that this comment is for the next line (var init) only.

Previously I had a cascade of IFs to make it even more explicit. But it looked ugly, so I did this "init variable to some case and get rid of one level of IFs".

Maybe put it into else of the top-level if then? Then the comment would be less confusing :)


libos/src/bookkeep/libos_vma.c line 179 at r8 (raw file):

 * emulating errors in memory management syscalls.
 */
// TODO: Probably other VMA functions could make use of this helper.

Why removing this? Aren't there any functions which could use it?


libos/src/bookkeep/libos_vma.c line 1754 at r10 (raw file):

                 * Send file-backed memory region.
                 *
                 * Access beyond the last file-backed page (reflected via vma->valid_length) will

we're emulating this behavior here, it wouldn't actually happen if you tried it

Suggestion:

should

libos/test/regression/mmap_file.c line 4 at r7 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

FYI: Currently kept this test, because more tests is better? Not sure if we want to burn this test, or keep it afloat for some time (until we're sure other tests completely cover this test).

What is missing from the other tests? More tests isn't always better, they take CI time, need maintenance and if they are of poor quality then they cause false bug reports. So, if they aren't testing anything and contain UBs like this one, then I'd rather get rid of them.


libos/test/regression/mmap_file_sigbus.c line 20 at r10 (raw file):

 *     - madvise(MADV_DONTNEED) on the second 1-page region succeeds (yes, that's how Linux works)
 *     - accessing the first 1-page region succeeds
 *     - accessing the second 1-page region results in SIGBUS

by "accessing" you also mean madvise? or accessing from usermode after madvise was called on the region?


libos/test/regression/mmap_file_sigbus.c line 36 at r10 (raw file):

#include <sys/types.h>
#include <ucontext.h>
#include <unistd.h>

Please sort.


libos/test/regression/mmap_file_sigbus.c line 40 at r10 (raw file):

#include "common.h"

#define TEST_DIR  "tmp"

Suggestion:

TEST_DIR "tmp"

libos/test/regression/mmap_file_sigbus.c line 59 at r10 (raw file):

    "ret\n"
"ret:\n"
    "ret\n"

Suggestion:

"ret:\n"
    "ret\n"

libos/test/regression/mmap_file_sigbus.c line 113 at r10 (raw file):

    x = mem_read(addr_page1);
    if (x == 0xdeadbeef)
        errx(1, "unexpected value in successful read from file-backed mmap region: %lx", x);

But this value means that the read wasn't successful?

Code quote:

in successful read

libos/test/regression/mmap_file_sigbus.c line 116 at r10 (raw file):

    x = mem_read(addr_page2);
    if (x != 0xdeadbeef)
        errx(1, "unexpected value in failing read from file-backed mmap region: %lx", x);

ditto


libos/test/regression/mmap_file_sigbus.c line 117 at r10 (raw file):

    if (x != 0xdeadbeef)
        errx(1, "unexpected value in failing read from file-backed mmap region: %lx", x);
    if (g_sigbus_triggered != 1)

Please check whether it's 0 after the first call.


libos/test/regression/mmap_file_sigbus.c line 139 at r10 (raw file):

    if (ret != (ssize_t)page_size)
        errx(1, "write(2 pages): expected 1-page write, got ret=%ld, errno=%d", ret, errno);
#endif

I think that uncommenting this #ifdef won't produce a working test, because a seek is missing? Or maybe #else?


libos/test/regression/mmap_file_sigbus.c line 161 at r10 (raw file):

    x = mem_read(addr_page1);
    if (x == 0xdeadbeef)
        errx(1, "unexpected value in successful read after madvise(MADV_DONTNEED): %lx", x);

ditto x2


libos/test/regression/mmap_file_sigbus.c line 165 at r10 (raw file):

    if (x != 0xdeadbeef)
        errx(1, "unexpected value in failing read after madvise(MADV_DONTNEED): %lx", x);
    if (g_sigbus_triggered != 1)

Please check whether it's 0 after the first call.


libos/test/regression/test_libos.py line 808 at r10 (raw file):

                f.truncate(os.sysconf("SC_PAGE_SIZE"))

        write_path = 'tmp/__mmaptestfilewrite__'

Could you pass the filenames as arguments?

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 20 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


libos/src/bookkeep/libos_vma.c line 194 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

_lookup_vma(begin) is guaranteed to return a VMA which address is lower than/equal to the begin argument.

Nope, the other way around:

/* Returns the vma that contains `addr`. If there is no such vma, returns the closest vma with
 * higher address. */
static struct libos_vma* _lookup_vma(uintptr_t addr) {

Wow, my bad, I read the comment incorrectly and though that _lookup_vma() returns closes vma with a lower address. But it's actually the higher address. Then my change was wrong, but surprisingly it still worked on our tests...

Anyway, I reverted all that, so not relevant anymore.


libos/src/bookkeep/libos_vma.c line 1537 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Maybe vma_update_valid_ends? And just describe the return value in the comment?

Done.


libos/src/bookkeep/libos_vma.c line 1553 at r6 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Maybe put it into else of the top-level if then? Then the comment would be less confusing :)

Done.


libos/src/bookkeep/libos_vma.c line 179 at r8 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Why removing this? Aren't there any functions which could use it?

Done. I reverted this. I quickly looked through this file and didn't find any functions that would clearly benefit from this func. But on the other hand, I didn't do any thorough analysis (maybe if some func is refactored to be compatible with using this func, then it would work). So let's keep the comment until someone does a proper deep analysis.


libos/src/bookkeep/libos_vma.c line 184 at r10 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Enjoy: https://yarchive.net/comp/linux/partial_reads_writes.html

Such a good link that I also added it elsewhere (in the LibOS regression test)


libos/src/bookkeep/libos_vma.c line 1754 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

we're emulating this behavior here, it wouldn't actually happen if you tried it

Done.


libos/test/regression/mmap_file.c line 4 at r7 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

What is missing from the other tests? More tests isn't always better, they take CI time, need maintenance and if they are of poor quality then they cause false bug reports. So, if they aren't testing anything and contain UBs like this one, then I'd rather get rid of them.

Done.

The only part that we could miss a bit was the in-child test. So I modified my new mmap_file_sigbus test to also check this. This old and full of UBs test is removed.


libos/test/regression/mmap_file_sigbus.c line 20 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

by "accessing" you also mean madvise? or accessing from usermode after madvise was called on the region?

Done.


libos/test/regression/mmap_file_sigbus.c line 36 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Please sort.

Done. Also removed some unneeded headers.


libos/test/regression/mmap_file_sigbus.c line 59 at r10 (raw file):

    "ret\n"
"ret:\n"
    "ret\n"

Done.


libos/test/regression/mmap_file_sigbus.c line 113 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

But this value means that the read wasn't successful?

Done. My bad phrasing, but the meaning was correct: "Read that was supposed to be successful unexpectedly returned a value that is reserved for invalid accesses". I rephrased.


libos/test/regression/mmap_file_sigbus.c line 116 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto

Done.


libos/test/regression/mmap_file_sigbus.c line 117 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Please check whether it's 0 after the first call.

Done.


libos/test/regression/mmap_file_sigbus.c line 139 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

I think that uncommenting this #ifdef won't produce a working test, because a seek is missing? Or maybe #else?

Not sure I understand your comment. We are writing to write_fd, which is a separate file: created plus truncated by the open() couple lines above.

The contents of this file are irrelevant, it's just a temporary destination for writing. We only use it as a way to use addr_page1 and addr_page2 (backed by another read-only file) as source buffers.

We don't care if we always append to write_fd or we perform lseek to rewind. This file will be removed afterwards anyway (see unlink() below).


libos/test/regression/mmap_file_sigbus.c line 161 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

ditto x2

Done.


libos/test/regression/mmap_file_sigbus.c line 165 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Please check whether it's 0 after the first call.

Done.


libos/test/regression/test_libos.py line 808 at r10 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Could you pass the filenames as arguments?

Done. Good idea.

@dimakuv dimakuv force-pushed the dimakuv/chroot-uses-generic-mmap branch from 0595e70 to 99e3003 Compare June 27, 2024 09:10
Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 20 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


libos/test/regression/mmap_file_sigbus.c line 40 at r10 (raw file):

#include "common.h"

#define TEST_DIR  "tmp"

Done. Not relevant.

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 19 files at r1, 6 of 15 files at r4, 9 of 13 files at r6, 3 of 10 files at r9, 1 of 8 files at r11.
Reviewable status: 29 of 36 files reviewed, 21 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


libos/src/fs/tmpfs/fs.c line 278 at r11 (raw file):

    if (__atomic_load_n(&hdl->inode->num_mmapped, __ATOMIC_ACQUIRE) != 0) {
        /* There are mappings for the file, refresh their access protections. */

this code sequence (and the one in truncate) comes up a lot; do you think we could deduplicate them?


pal/src/pal_rtld.c line 558 at r11 (raw file):

            goto out;
        }
        ret = _PalStreamRead(handle, c->map_off, map_size, map_addr);

this should verify the whole thing was actually read

@dimakuv dimakuv force-pushed the dimakuv/chroot-uses-generic-mmap branch from 99e3003 to 7c4ce91 Compare June 27, 2024 12:53
Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 21 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


libos/src/fs/tmpfs/fs.c line 278 at r11 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

this code sequence (and the one in truncate) comes up a lot; do you think we could deduplicate them?

Good idea, but can we do it in a follow-up PR or at least when all other comments in this PR will be resolved? This change will move quite a lot of code around, which will complicate the review process.


pal/src/pal_rtld.c line 558 at r11 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

this should verify the whole thing was actually read

Done.

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


libos/src/fs/tmpfs/fs.c line 278 at r11 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Good idea, but can we do it in a follow-up PR or at least when all other comments in this PR will be resolved? This change will move quite a lot of code around, which will complicate the review process.

so be it


pal/src/pal_rtld.c line 105 at r13 (raw file):

    size_t got = 0;
    while (got < count) {
        int64_t ret = _PalStreamRead(handle, offset, count - got, (char*)buf + got);

should be offset + got


pal/src/pal_rtld.c line 594 at r13 (raw file):

            continue;

        ret = _PalVirtualMemoryAlloc((void*)c->map_end, c->alloc_end - c->map_end, c->prot);

this is no longer needed with this approach

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


libos/src/fs/tmpfs/fs.c line 278 at r11 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

so be it

#1931 -- not to forget. Let me put a blocking comment so I'll do it after e.g. rebase.


pal/src/pal_rtld.c line 105 at r13 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

should be offset + got

Done. Oops, good catch.


pal/src/pal_rtld.c line 594 at r13 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

this is no longer needed with this approach

Why are you saying this? That's for the BSS part.

Above we mmapped only [c->start, c->map_end) (for data and code). Now we also mmap [c->map_end, c->alloc_end) which is used for BSS.

We could refactor this code to only have one _PalVirtualMemoryAlloc(), but then we'll lose this clarify of data+code vs BSS. Additionally, we'll need to play with page protections more...

( Same question was raised in another PR: https://reviewable.io/reviews/gramineproject/gramine/1820#-O0P3jrN6viIx-pY8-sV )

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 23 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


pal/src/pal_rtld.c line 574 at r13 (raw file):

            goto out;
        }
        ret = pal_stream_read_exact(handle, c->map_off, map_size, map_addr);

Just realized that this conflicts with #1820. I think I should first merge #1820 and then get rid of this pal_stream_read_exact(), it won't be needed (we'll use memcpy() instead).

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @dimakuv and @mkow)


pal/src/pal_rtld.c line 574 at r13 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Just realized that this conflicts with #1820. I think I should first merge #1820 and then get rid of this pal_stream_read_exact(), it won't be needed (we'll use memcpy() instead).

sounds good

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 28 of 36 files reviewed, 22 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow and @mwkmwkmwk)


pal/src/pal_rtld.c line 594 at r13 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Why are you saying this? That's for the BSS part.

Above we mmapped only [c->start, c->map_end) (for data and code). Now we also mmap [c->map_end, c->alloc_end) which is used for BSS.

We could refactor this code to only have one _PalVirtualMemoryAlloc(), but then we'll lose this clarify of data+code vs BSS. Additionally, we'll need to play with page protections more...

( Same question was raised in another PR: https://reviewable.io/reviews/gramineproject/gramine/1820#-O0P3jrN6viIx-pY8-sV )

Ok, this is resolved in another PR. This issue will go away when we rebase this PR on top of #1820.

@dimakuv dimakuv force-pushed the dimakuv/chroot-uses-generic-mmap branch from a0c58ef to 4f150ac Compare July 25, 2024 06:54
Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 22 of 36 files reviewed, 18 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv, @mkow, and @mwkmwkmwk)


-- commits line 26 at r2:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

TODO:

file-growing `ftruncate()` on

->

file-growing `write()` and `ftruncate()` on

Done


-- commits line 41 at r7:

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Will do: add new LibOS test mmap_file_sigbus.

Done


pal/src/pal_rtld.c line 574 at r13 (raw file):

Previously, mwkmwkmwk (Marcelina Kościelnicka) wrote…

sounds good

Done

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 8 files at r11, 1 of 2 files at r13, 12 of 12 files at r15, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)


libos/src/fs/tmpfs/fs.c line 278 at r11 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

#1931 -- not to forget. Let me put a blocking comment so I'll do it after e.g. rebase.

@dimakuv: That's the last blocking discussion here, right? Do you plan to create the PR now, or somewhen later, after merging?


libos/test/regression/mmap_file_sigbus.c line 139 at r10 (raw file):

Previously, dimakuv (Dmitrii Kuvaiskii) wrote…

Not sure I understand your comment. We are writing to write_fd, which is a separate file: created plus truncated by the open() couple lines above.

The contents of this file are irrelevant, it's just a temporary destination for writing. We only use it as a way to use addr_page1 and addr_page2 (backed by another read-only file) as source buffers.

We don't care if we always append to write_fd or we perform lseek to rewind. This file will be removed afterwards anyway (see unlink() below).

Ah, sorry, I mixed up reading and writing and somehow looked at this transfer in the opposite direction. Everything looks good.


libos/test/regression/mmap_file_sigbus.c line 158 at r15 (raw file):

    if (argc != 4)
        errx(1, "Usage: %s <path1: read-only file> <path2:write-only file> <fork?>", argv[0]);

One time you put a space after the colon and the other time you didn't.


libos/test/regression/mmap_file_sigbus.c line 158 at r15 (raw file):

    if (argc != 4)
        errx(1, "Usage: %s <path1: read-only file> <path2:write-only file> <fork?>", argv[0]);

Suggestion:

<fork|nofork>

libos/test/regression/mmap_file_sigbus.c line 162 at r15 (raw file):

    const char* path1 = argv[1];
    const char* path2 = argv[2];
    bool do_fork = strcmp(argv[3], "fork") == 0;

Please don't accept random garbage here, it's too easy to introduce unnoticed bugs this way.

Copy link
Contributor

@mwkmwkmwk mwkmwkmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 19 files at r1, 2 of 10 files at r3, 1 of 8 files at r11, 12 of 12 files at r15.
Reviewable status: all files reviewed, 4 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)

Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 35 of 36 files reviewed, 3 unresolved discussions, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel), "fixup! " found in commit messages' one-liners (waiting on @mkow)


libos/src/fs/tmpfs/fs.c line 278 at r11 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

@dimakuv: That's the last blocking discussion here, right? Do you plan to create the PR now, or somewhen later, after merging?

Done #1956


libos/test/regression/mmap_file_sigbus.c line 158 at r15 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

One time you put a space after the colon and the other time you didn't.

Done.


libos/test/regression/mmap_file_sigbus.c line 158 at r15 (raw file):

    if (argc != 4)
        errx(1, "Usage: %s <path1: read-only file> <path2:write-only file> <fork?>", argv[0]);

Done.


libos/test/regression/mmap_file_sigbus.c line 162 at r15 (raw file):

Previously, mkow (Michał Kowalczyk) wrote…

Please don't accept random garbage here, it's too easy to introduce unnoticed bugs this way.

Done.

Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, all discussions resolved, "fixup! " found in commit messages' one-liners

@mkow mkow force-pushed the dimakuv/chroot-uses-generic-mmap branch from 68ac11a to 2b40ea7 Compare July 28, 2024 13:22
Copy link
Member

@mkow mkow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved


-- commits line 46 at r17:
@dimakuv: FYI: I changed that because the old version was ambiguous (it could also mean that "there are currently just a few FS tests in general", not that you added/enabled some in this PR).

Code quote:

A few more FS tests are enabled on SGX now.

Copy link
Contributor

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 19 files at r1, 2 of 10 files at r3, 6 of 15 files at r4, 5 of 13 files at r6, 3 of 10 files at r9, 2 of 8 files at r11, 12 of 12 files at r15, 1 of 1 files at r16, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @dimakuv)


-- commits line 17 at r17:
But it's PalDeviceMap()?

Code quote:

Only `shm` FS still uses `PalStreamMap()`

Previously, the `chroot` FS (plain host-backed files) used the
`PalStreamMap()` PAL API for file-backed mmap. This had three problems:
1) need to implement a non-trivial `map` callback in PALs;
2) discrepancy between `map` implementations in different PALs;
3) hard to debug file-mmap bugs because they only reproduced on SGX
   (`gramine-sgx`) PAL and not on Linux (`gramine-direct`) PAL.

Note that other FSes already used emulated file-backed mmap: `tmpfs` and
`encrypted` FSes emulate such mmaps via `PalStreamRead()` and
`PalStreamWrite()`.

This commit switches `chroot` FS to use emulated file-backed mmap. This
way, `chroot` becomes similar in implementation to `tmpfs` and
`encrypted` FSes. Only `shm` FS still uses `PalDeviceMap()` (previously
called `PalStreamMap()`) because devices with shared memory have
non-standard semantics of mmaps.  Corresponding `file_map()` functions
in PAL are removed.

In this commit, we also introduce the model of "logical" split within a
single VMA: some prefix of the VMA is accessible (has valid pages),
while the rest is unmapped (returns SIGBUS). Only file-backed VMAs are
split in this way (anonymous-memory VMAs can't be in "unmapped" state).
This logical split is achieved via a new `vma->valid_length` field.

The switch to emulated mmap uncovered several bugs:
- Underlying file may be shorter than the requested mmap size. In this
  case access beyond the last file-backed page must cause SIGBUS.
  Previously this semantics worked only on `gramine-direct` and wasn't
  implemented on `gramine-sgx` (even with EDMM).
- As a consequence of the semantics above, file-growing `write()` and
  `ftruncate()` on already-mmapped file must make newly extended file
  contents accessible. Previously it didn't work on `gramine-sgx` (with
  EDMM), now it is resolved via `prot_refresh_mmaped_from_file_handle()`
  call.
- `msync()` must update file contents with the mmapped-in-process
  contents, but only those parts that do not exceed the file size.
  Previously there was a bug that msync'ed even the exceeding parts.
- Applications expect `msync(MS_ASYNC)` to update file contents before
  the next app access to the file. Gramine instead ignored such
  requests, leading to accessing stale contents. We fix this bug by
  treating `MS_ASYNC` the same way as `MS_SYNC`. This bug was detected
  on LTP test `msync01`.

A few more FS tests are enabled on SGX now. Generally, `gramine-sgx` now
supports shared file-backed mappings, i.e. `mmap(MAP_SHARED)`. New LibOS
test `mmap_file_sigbus` is added; old bad `mmap_file` test is removed.

Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
@dimakuv dimakuv force-pushed the dimakuv/chroot-uses-generic-mmap branch from 2b40ea7 to 4a5c4a1 Compare July 29, 2024 06:26
Copy link
Author

@dimakuv dimakuv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @kailun-qin)


-- commits line 17 at r17:

Previously, kailun-qin (Kailun Qin) wrote…

But it's PalDeviceMap()?

Done. Good catch.

Copy link
Contributor

@kailun-qin kailun-qin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@dimakuv dimakuv merged commit 4a5c4a1 into master Jul 29, 2024
18 checks passed
@dimakuv dimakuv deleted the dimakuv/chroot-uses-generic-mmap branch July 29, 2024 09:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants