Reimplement readv to deal with short reads #12674

yamt · 2024-07-11T13:27:27Z

Summary

The original implementation seems to have an assumption that the file is a regular file.

This commit re-implements it so that it can deal with tty, sockets, etc.

Impact

Testing


acassis · 2024-07-11T13:43:08Z

@yamt nice! Is there some example showing the failure? Should it be added to apps/testing/ ?

Another question: did you run ostest and LTS locally to confirm there are no known side effects?

xiaoxiang781216 · 2024-07-11T14:12:18Z

libs/libc/uio/lib_readv.c

-                {
-                  return ntotal;
-                }
+  buffer = malloc(total_size);


could we avoid malloc

i can't think of any trivial ways.
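For readers following along, the approach under review can be sketched roughly as follows. This is a minimal illustration pieced together from the diff fragments quoted in this thread (the `iovcnt == 1` shortcut, `malloc(total_size)`, and a single `read()`), not the actual patch. The name `readv_one_shot` is hypothetical, and cancellation-point cleanup, which the PR addresses in a separate commit, is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* readv_one_shot() is a hypothetical name used only for this sketch. */

ssize_t readv_one_shot(int fd, const struct iovec *iov, int iovcnt)
{
  size_t total = 0;
  size_t left;
  ssize_t nread;
  char *buffer;
  char *src;
  int saved_errno;
  int i;

  if (iovcnt <= 0)
    {
      errno = EINVAL;
      return -1;
    }

  /* A single iovec needs no intermediate buffer. */

  if (iovcnt == 1)
    {
      return read(fd, iov->iov_base, iov->iov_len);
    }

  for (i = 0; i < iovcnt; i++)
    {
      total += iov[i].iov_len; /* Overflow check omitted for brevity */
    }

  if (total == 0)
    {
      return 0;
    }

  buffer = malloc(total);
  if (buffer == NULL)
    {
      errno = ENOMEM;
      return -1;
    }

  /* Exactly one read(): a short result is passed through to the caller. */

  nread = read(fd, buffer, total);
  saved_errno = errno;

  if (nread > 0)
    {
      /* Scatter the contiguous data into the caller's blocks, in order. */

      src = buffer;
      left = (size_t)nread;
      for (i = 0; i < iovcnt && left > 0; i++)
        {
          size_t n = iov[i].iov_len < left ? iov[i].iov_len : left;

          memcpy(iov[i].iov_base, src, n);
          src += n;
          left -= n;
        }

      assert(left == 0); /* read() never returns more than requested */
    }

  free(buffer);
  errno = saved_errno;
  return nread;
}
```

The cost is one allocation and one copy per call. In exchange, the underlying file sees a single read request regardless of how the caller split its buffers.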

The only way to avoid malloc is to read directly into each iov and avoid the copy. In other words: iterate iov array. For each iov, if iov_len > 0, read() iov_len bytes into that iov. If any read() returns less than iov_len bytes, break from loop. Keep running counter of total bytes read.

I don't know if this is permissible under POSIX but it (1) avoids malloc, and (2) avoids copying. Even if there is added overhead from multiple calls to read(), it might be better than overhead of malloc(), free(), and copying.

it doesn't work. please read the discussions in this PR.

xiaoxiang781216 · 2024-07-11T14:13:02Z

libs/libc/uio/lib_readv.c

-            {
-              /* NOTE:  read() is a cancellation point */
-
-              nread = read(fildes, buffer, remaining);


why we can't read the data piece by piece?

because, in general, you should not keep reading after a short read.

yes, we can break out by returning the length we really got so far.

BTW, I don't get the problem from your commit message. could you explain more?

consider that you readv() on a tty, using a huge buffer, say, 1024 bytes.
if you got some data, say, 1 byte, readv() should return the 1 byte without waiting for the following 1023 bytes.
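The expected behavior can be observed on a POSIX host without a tty. The sketch below uses a pipe as a stand-in (an assumption made for the demo; the thread itself talks about a tty) and the host's own `readv()`. The function name `short_readv_demo` is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Makes 1 byte available, then asks for 1024 bytes split over two blocks.
 * Returns whatever readv() reports.
 */

ssize_t short_readv_demo(void)
{
  int fds[2];
  char head[512];
  char tail[512];
  struct iovec iov[2];
  ssize_t n;
  int rc;

  rc = pipe(fds);
  assert(rc == 0);

  n = write(fds[1], "x", 1);
  assert(n == 1);

  iov[0].iov_base = head;
  iov[0].iov_len = sizeof(head);
  iov[1].iov_base = tail;
  iov[1].iov_len = sizeof(tail);

  /* Returns as soon as some data is available; it does not wait for the
   * remaining 1023 bytes.
   */

  n = readv(fds[0], iov, 2);

  close(fds[0]);
  close(fds[1]);
  return n;
}
```

On such a host the call returns 1 immediately, which is the short-read behavior described above.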

Sounds like struct file_operations read and write methods should actually operate with vectors (the same as file_read and file_write), but such a change would basically require rewriting thousands of lines of code...

@yamt @xiaoxiang781216 how does Linux, QNX, MacOS, etc implement it? Although the specification doesn't require some assumptions, it is important to follow what other OSes are doing, to avoid breaking applications when porting it to NuttX

Linux improved its driver model by changing the argument from void * to iovec *. The official approach is to extend file_operation_s to support iovec, as Linux does and as NuttX sock_intf_s does (#2959).

no.

these tricks are not thread-safe.

we should preserve packet boundary for some kind of fds like datagram socket.

the socket layer already handles this case via #2959. But file_operation_s doesn't have a similar patch, so it's hard to modify readv to use this new capability of the socket layer.

but adding malloc and trying to avoid memory leak with nested cancellation point is also conceptually wrong.

it's just inefficient.
it's far better than the original implementation, which is broken in a user-visible way.

I do not have a better solution right now, and I just noticed that writev also seems to suffer from the same problem.

right. writev/preadv/pwritev have the same problem.

if someone plans to do the ideal solution (ie. make ~everything iovec-based) anytime soon, it's great.
otherwise, i'd suggest to merge this PR for now.

yamt · 2024-07-11T14:13:25Z

@yamt nice! Is there some example that showing the failure? Should it be added to apps/testing/ ?

micropython repl (with micropython/micropython#13676) on toywasm and wamr.

Another question: did you run ostest and LTS locally to confirm there is not known side effects?

no.

yamt · 2024-07-12T08:23:03Z

The original implementation seems to have an assumption that the file is a regular file.

well, the original implementation was actually broken for regular files as well wrt read/write atomicity.

hartmannathan · 2024-07-21T21:45:02Z

libs/libc/uio/lib_readv.c

+  if (iovcnt == 1)
+    {
+      return read(fildes, iov->iov_base, iov->iov_len);
+    }


Is this optimization necessary? Would it be better to remove it and use common logic below?

the "common logic" below involves malloc. it's better to avoid it when easy.

@yamt Agreed, but if we can eliminate malloc (see other feedback here) then this optimization can be eliminated also.

@yamt Agreed, but if we can eliminate malloc (see other feedback here) then this optimization can be eliminated also.

I see why iovec-based multiple reads could be a problem (e.g., read data available == size of iovec, then we get stuck in blocked read). Ok, I am fine to merge this PR with malloc, and maybe someone can solve how to remove malloc later.

I see why iovec-based multiple reads could be a problem (e.g., read data available == size of iovec, then we get stuck in blocked read). Ok, I am fine to merge this PR with malloc, and maybe someone can solve how to remove malloc later.

i guess that tricks involving multiple read() calls for iovcnt>1 are all broken in one way or another.
implementing readv() on the top of read() is a design mistake. it should be the opposite.
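The inverted layering is simple to express: once a vector read is the primitive, a scalar read is just the one-element case. A minimal sketch, assuming the host's `readv()` as the primitive; `read_via_readv` is a hypothetical name, not a NuttX function.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Scalar read expressed as a vector read with a single block. */

ssize_t read_via_readv(int fd, void *buf, size_t nbytes)
{
  struct iovec iov;

  assert(buf != NULL || nbytes == 0);

  iov.iov_base = buf;
  iov.iov_len = nbytes;
  return readv(fd, &iov, 1);
}
```

No allocation or copying is needed in this direction, whereas building `readv()` from `read()` forces a choice between multiple calls and a temporary buffer.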

hartmannathan · 2024-07-21T21:45:56Z

libs/libc/uio/lib_readv.c

-               * buffer reads.
-               */
+  nread = read(fildes, buffer, total_size);
+  if (nread == -1)


Would it be better to do "if (nread < 0)" (defensive coding)?

if you want that way, we should return -1 instead of nread i guess.
anyway, it isn't what this PR is about.

hartmannathan · 2024-07-21T22:35:59Z

libs/libc/uio/lib_readv.c

-                {
-                  return ntotal;
-                }
+  buffer = malloc(total_size);


The only way to avoid malloc is to read directly into each iov and avoid the copy. In other words: iterate iov array. For each iov, if iov_len > 0, read() iov_len bytes into that iov. If any read() returns less than iov_len bytes, break from loop. Keep running counter of total bytes read.

I don't know if this is permissible under POSIX but it (1) avoids malloc, and (2) avoids copying. Even if there is added overhead from multiple calls to read(), it might be better than overhead of malloc(), free(), and copying.
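The loop described in this comment can be sketched as below, together with a small demo of the case raised elsewhere in the thread: when the available data exactly fills the first iovec, the loop issues a second `read()`. The demo uses a non-blocking pipe so that the extra call fails with EAGAIN instead of blocking. `readv_per_iov`, `g_read_calls`, and `per_iov_demo` are hypothetical names, and the call counter exists only for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Illustration only: number of read() calls made by the last invocation. */

static int g_read_calls;

ssize_t readv_per_iov(int fd, const struct iovec *iov, int iovcnt)
{
  ssize_t ntotal = 0;
  int i;

  g_read_calls = 0;
  for (i = 0; i < iovcnt; i++)
    {
      ssize_t n;

      if (iov[i].iov_len == 0)
        {
          continue;
        }

      n = read(fd, iov[i].iov_base, iov[i].iov_len);
      g_read_calls++;
      if (n < 0)
        {
          /* Report data already delivered rather than the error. */

          return ntotal > 0 ? ntotal : -1;
        }

      ntotal += n;
      if ((size_t)n < iov[i].iov_len)
        {
          break; /* Short read: stop here */
        }
    }

  return ntotal;
}

/* 4 bytes pending, two 4-byte blocks: returns how many read() calls ran. */

int per_iov_demo(void)
{
  int fds[2];
  char a[4];
  char b[4];
  struct iovec iov[2];
  ssize_t n;
  int rc;

  rc = pipe(fds);
  assert(rc == 0);
  rc = fcntl(fds[0], F_SETFL, O_NONBLOCK);
  assert(rc == 0);
  n = write(fds[1], "abcd", 4);
  assert(n == 4);

  iov[0].iov_base = a;
  iov[0].iov_len = sizeof(a);
  iov[1].iov_base = b;
  iov[1].iov_len = sizeof(b);

  n = readv_per_iov(fds[0], iov, 2);
  assert(n == 4);

  close(fds[0]);
  close(fds[1]);
  return g_read_calls;
}
```

`per_iov_demo()` reports two `read()` calls for four bytes of data. On a blocking tty or socket the second call would wait for input that may never come, even though the first call already had data to return.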

yamt · 2024-08-15T07:32:03Z

can we make a decision?

yamt · 2024-09-17T01:06:17Z

can we make a decision?

ping

xiaoxiang781216 · 2024-09-17T03:46:24Z

can we make a decision?

I would prefer to just read the first memory block, which conform the spec and don't need malloc.

yamt · 2024-09-17T03:56:48Z

can we make a decision?

I would prefer to just read the first memory block, which conform the spec and don't need malloc.

it doesn't conform the spec.

xiaoxiang781216 · 2024-09-17T04:02:51Z

can we make a decision?

I would prefer to just read the first memory block, which conform the spec and don't need malloc.

it doesn't conform the spec.

Why?

yamt · 2024-09-17T04:13:41Z

can we make a decision?

I would prefer to just read the first memory block, which conform the spec and don't need malloc.

it doesn't conform the spec.

Why?

consider:

file types which are usually not interruptible. eg. regular files
file types which are supposed to preserve data boundaries. eg. udp sockets
file types which are supposed to preserve read/write atomicity. eg. regular files
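The data-boundary case can be reproduced on a host system. The sketch below assumes a Linux-like host with AF_UNIX datagram sockets standing in for udp (an assumption made for the demo). It compares reading only the first block with a real vector read. Both function names are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends an 8-byte datagram followed by a 4-byte one, then reads twice with
 * a 4-byte buffer. Returns 1 if the tail of the first datagram was lost.
 */

int first_block_read_loses_tail(void)
{
  int sv[2];
  char first[4];
  char second[4];
  ssize_t n;
  int rc;

  rc = socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
  assert(rc == 0);

  n = write(sv[1], "01234567", 8);
  assert(n == 8);
  n = write(sv[1], "next", 4);
  assert(n == 4);

  n = read(sv[0], first, sizeof(first));   /* Gets the first 4 bytes */
  assert(n == 4);
  n = read(sv[0], second, sizeof(second)); /* Gets the next datagram */
  assert(n == 4);

  close(sv[0]);
  close(sv[1]);
  return memcmp(second, "4567", 4) != 0;
}

/* Same 8-byte datagram, received with a two-block readv(). */

ssize_t readv_whole_datagram(void)
{
  int sv[2];
  char a[4];
  char b[4];
  struct iovec iov[2];
  ssize_t n;
  int rc;

  rc = socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
  assert(rc == 0);

  n = write(sv[1], "01234567", 8);
  assert(n == 8);

  iov[0].iov_base = a;
  iov[0].iov_len = sizeof(a);
  iov[1].iov_base = b;
  iov[1].iov_len = sizeof(b);

  n = readv(sv[0], iov, 2);

  close(sv[0]);
  close(sv[1]);
  return n;
}
```

With the first-block-only read, the remaining four bytes of the datagram are discarded and the caller cannot recover them. The two-block `readv()` receives all eight bytes in one call.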

xiaoxiang781216 · 2024-09-17T04:21:50Z

can we make a decision?

I would prefer to just read the first memory block, which conform the spec and don't need malloc.

it doesn't conform the spec.

Why?

consider:

file types which are usually not interruptible. eg. regular files

if spec enforce the implementation must read all available data, please point out the statement.

file types which are supposed to preserve data boundaries. eg. udp sockets

since udp already support sendmsg, the right fix is mapping writev to sendmsg by checking the fd type is socket.

file types which are supposed to preserve read/write atomicity. eg. regular files

it's caller responsibility to ensure the atomicity if readv/writev just can finish the partial job.

xiaoxiang781216 · 2024-09-17T04:26:21Z

It's better to take different actions based on the fd type before we add readv/writev to file_operation:

forward to sendmsg/recvmsg for socket type
read/write as much as possible for regular file
read/write the first block for all other type

of course, it may need to move readv/writev from libc to fs/vfs to do this type of dispatch.
BTW, to simplify readv/writev dispatch, we can add readv/writev callback to file_operation and provide the default readv/writev implementation if the driver or file system doesn't provide one.

yamt · 2024-09-17T04:49:06Z

can we make a decision?

I would prefer to just read the first memory block, which conform the spec and don't need malloc.

it doesn't conform the spec.

Why?

consider:

file types which are usually not interruptible. eg. regular files

if spec enforce the implementation must read all available data, please point out the statement.

i don't know where/if it's explicitly specified in the spec.
it's the traditional common behavior which many applications rely on.

file types which are supposed to preserve data boundaries. eg. udp sockets

since udp already support sendmsg, the right fix is mapping writev to sendmsg by checking the fd type is socket.

udp is just an example.
eg. there can be such character devices.

file types which are supposed to preserve read/write atomicity. eg. regular files

it's caller responsibility to ensure the atomicity if readv/writev just can finish the partial job.

no.
search "atomic" in https://pubs.opengroup.org/onlinepubs/009604599/functions/read.html.

yamt · 2024-09-17T04:54:00Z

It's better to do the different action base on the fd type before we add readv/writev to file_operation:
1. forward to sendmsg/recvmsg for socket type

2. read/write as much as possible for regular file

3. read/write the first block for all other type
of course, it may need to move readv/writev from libc to fs/vfs to do this type of dispatch. BTW, to simplify readv/writev dispatch, we can add readv/writev callback to file_operation and provide the default readv/writev implementation if the driver or file system doesn't provide one.

see #12674 (comment)

yamt · 2024-09-17T05:09:52Z

i don't think it's a good idea to block an obvious fix like this for months just saying "there can be a better solution" w/o actually providing any better solutions.

if this PR was merged two months ago, i might even have implemented the ideal iov-based solution at least partly in the two months.

xiaoxiang781216 · 2024-09-17T10:08:19Z

can we make a decision?

I would prefer to just read the first memory block, which conform the spec and don't need malloc.

it doesn't conform the spec.

Why?

consider:

file types which are usually not interruptible. eg. regular files

if spec enforce the implementation must read all available data, please point out the statement.

i don't know where/if it's explicitly specified in the spec. it's the traditional common behavior which many applications rely on.

But the spec explicitly allows returning a short length. The caller always needs to handle a short return correctly.
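For reference, handling a short return on the caller side usually means a retry loop like the sketch below. It assumes a byte stream where waiting for the rest of the data is what the caller wants, which is not true for an interactive tty. `readv_full` is a hypothetical name, EINTR handling is omitted, and the function modifies the caller's iovec array as it advances.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Repeats readv() until every block is full, EOF, or an error. */

ssize_t readv_full(int fd, struct iovec *iov, int iovcnt)
{
  ssize_t ntotal = 0;

  while (iovcnt > 0)
    {
      size_t left;
      ssize_t n;

      n = readv(fd, iov, iovcnt);
      if (n < 0)
        {
          return ntotal > 0 ? ntotal : -1;
        }

      if (n == 0)
        {
          break; /* EOF, or only empty blocks remain */
        }

      ntotal += n;

      /* Skip the blocks that are now completely filled. */

      left = (size_t)n;
      while (iovcnt > 0 && left >= iov->iov_len)
        {
          left -= iov->iov_len;
          iov++;
          iovcnt--;
        }

      assert(iovcnt > 0 || left == 0);

      /* Advance into the partially filled block, if any. */

      if (iovcnt > 0)
        {
          iov->iov_base = (char *)iov->iov_base + left;
          iov->iov_len -= left;
        }
    }

  return ntotal;
}
```

Such a loop recovers from short returns on byte streams, but it cannot restore a datagram boundary or make several reads of a regular file atomic, which are the cases raised earlier in the thread.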

i don't think it's a good idea to block an obvious fix like this for months just saying "there can be a better solution" w/o actually providing any better solutions.

if this PR was merged two months ago, i might even have implemented the ideal iov-based solution at least partly in the two months.

Since this solution isn't perfect, especially the allocation happens in the read/write path, I hesitate to merge this change, but other maintainers could merge it if they think it's OK.

yamt · 2024-09-17T10:14:27Z

file types which are usually not interruptible. eg. regular files

if spec enforce the implementation must read all available data, please point out the statement.

i don't know where/if it's explicitly specified in the spec. it's the traditional common behavior which many applications rely on.

But the spec explicitly allows returning a short length. The caller always needs to handle a short return correctly.

whatever the standard allows, i don't think it's a good idea to break the semantics which real applications have been relying on for decades.

currently, nuttx implements readv/writev on the top of read/write. while it might work for the simplest cases, it's broken by design. for example, it's impossible to make it work correctly for files which need to preserve data boundaries without allocating a single contiguous buffer. (udp socket, some character devices, etc)

this change is a start of the migration to a better design. that is, implement read/write on the top of readv/writev.

to avoid a single huge change, following things will NOT be done in this commit:

* fix actual bugs caused by the original readv-based-on-read design. (cf. apache#12674)
* adapt filesystems/drivers to actually benefit from the new interface. (except a few trivial examples)
* eventually retire the old interface.
* retire read/write syscalls. implement them in libc instead.
* pread/pwrite/preadv/pwritev (except the introduction of struct uio, which is a preparation to back these variations with the new interface.)

This would fix readv/writev issues mentioned in apache#12674. (only for this specific driver though. with this approach, we basically have to fix every single driver and filesystem.) Lightly tested on the serial console, using micropython REPL on toywasm with esp32s3-devkit:toywasm, which used to suffer from the readv issue.

xiaoxiang781216 · 2024-11-22T11:13:15Z

since the better approach is merged(#13498), let's close this pr.

yamt · 2024-11-25T11:28:47Z

since the better approach is merged(#13498), let's close this pr.

#13498 itself doesn't fix the problem i wanted to fix with this PR.

#14898 does. but it's still open.


Reimplement readv to deal with short reads

c9ca374

The original implementation seems to have an assumption that the file is a regular file. This commit re-implements it so that it can deal with tty, sockets, etc.

xiaoxiang781216 reviewed Jul 11, 2024

readv: fix a leak on cancellation

12cb439

hartmannathan reviewed Jul 21, 2024

yamt mentioned this pull request Oct 18, 2024

move readv/writev to the kernel #13498

Merged

yamt mentioned this pull request Nov 22, 2024

drivers/serial/serial.c: adapt to the iovec-based api #14898

Open

xiaoxiang781216 closed this Nov 22, 2024

yamt reopened this Nov 25, 2024

github-actions bot added the Area: OS Components and Size: S labels Nov 25, 2024
