Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a MultipartEntity definition for vibe.inet.webform. #2

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

vnayar
Copy link

@vnayar vnayar commented Mar 14, 2024

This is an initial attempt to implement a standards-compliant multipart/form-data request.

See vibe-d/vibe-http#32

The initial commit here is preliminary and will require iteration, due to a lack of familiarity on the author's part of Vibe's internal conventions and patterns.

@s-ludwig
Copy link
Member

One thing that is sub optimal about the way of defining the whole multipart entity up-front is that it inevitably requires a number of dynamic memory allocations. In high-throughput scenarios, this has repeatedly turned out to be a serious bottleneck or resulting in pseudo memory leaks with the GC implementation.

A way to mitigate this would be to change the entity/part types from class to template structs with appropriate factory functions:

auto multiPartEntity(HEADER_MAP, PARTS)(HEADER_MAP headers, PARTS parts) {
    return MultipartEntity!(HEADER_MAP, PARTS)(headers, parts);
}

auto multiPartEntity(PARTS)(string content_type, PARTS parts) {
    auto headers = only(tuple("Content-Type", content_type));
    return MultipartEntity!(typeof(headers), PARTS)(headers, parts);
}

auto formPart(string name, string value) {
    auto headers = only(...);
    return MultipartEntityPart!(typeof(headers), string)(headers, value);
}

auto formPart(string name, NativePath file_path) {
    // ...
}

struct MultipartEntity(HEADER_MAP, PARTS) {
    HEADER_MAP headers;
    PARTS parts;
    string boundary;
}

struct MultipartEntityPart(HEADER_MAP, VALUE) {
    HEADER_MAP headers;
    VALUE value;
}

* See_Also: https://datatracker.ietf.org/doc/html/rfc2046#section-5.1
* See_Also: https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
* See_Also: https://datatracker.ietf.org/doc/html/rfc2388
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small code style issue - the multi-line comment style used in the code base is omitting the " * " prefix for the intermediate lines and instead uses a single tab indentation. Also, the first paragraph should just be a single short summary sentence, because it will be listed in the "description" column of the overview tables of the API documentation.

@vnayar
Copy link
Author

vnayar commented Mar 20, 2024

One thing that is sub optimal about the way of defining the whole multipart entity up-front is that it inevitably requires a number of dynamic memory allocations. In high-throughput scenarios, this has repeatedly turned out to be a serious bottleneck or resulting in pseudo memory leaks with the GC implementation.

I wasn't fully aware of that. Originally I was hoping the use of classes would help make it easier to pass the object by reference and avoid copies, but wasn't aware that dynamic allocations are so much more expensive. What is it about the GC that makes them so costly?

Also, why use templates for the attribute types?

struct MultipartEntityPart(HEADER_MAP, VALUE) {
    HEADER_MAP headers;
    VALUE value;
}

Don't concrete types make it easier to utilize the values when writing them to a connection, such as in vibe-http, client.d?
https://github.com/vnayar/vibe-http/blob/multipart-form-data-entities-rfc-2388/source/vibe/http/client.d#L909

Maybe I'm confused about the actual status of types like InetHeaders and InputStream. These are the types utilized within vibe, right?

The MultipartEntityPart body type of SumType!(string, InterfaceProxy!InputStream) was selected to permit a body to be chosen either by giving a fixed value (a common case with <input> and <select> tags in HTML forms, but also to chose an available file, or to choose data that a service already has in memory (e.g. a server calling an API endpoint, and submitting data that it got from elsewhere, such as by extracting an email or generating a file in memory using a library.)

@s-ludwig
Copy link
Member

One thing that is sub optimal about the way of defining the whole multipart entity up-front is that it inevitably requires a number of dynamic memory allocations. In high-throughput scenarios, this has repeatedly turned out to be a serious bottleneck or resulting in pseudo memory leaks with the GC implementation.

I wasn't fully aware of that. Originally I was hoping the use of classes would help make it easier to pass the object by reference and avoid copies, but wasn't aware that dynamic allocations are so much more expensive. What is it about the GC that makes them so costly?

There are two escalation stages with the current GC. The first one happens because the collection stage uses a stop-the-world approach and allocations require taking a global GC lock. In situations with high GC pressure or usage frequency, this can nullify any multi-threading performance gains.

The second stage happens when for some reason (probably memory pool fragmentation, but maybe also due to the way the GC selects a pool for a new allocation, I haven't looked close enough into the implementation) the GC cannot consistently reuse memory for new allocations. This then leads to infinite grown of allocated pools, which is a de-facto memory leak. I've observed this being triggered both in vibe.d web application contexts and in our photo application as soon as rapid allocations for more than a few bytes happen from multiple threads at once. This happens rather rarely, but when it does it's hard to track down.

Also, why use templates for the attribute types?

The idea is that it allows the user to avoid any heap/GC allocations whenever possible. So for example a static array or an only() range can be passed instead of a dynamic array. This can also be used on combinations, such as allocating default headers once in a dynamic array and then doing .join(only("Some-Header", value)) to dynamically append a header without having to allocate and without passing a lot of by-value memory.

Don't concrete types make it easier to utilize the values when writing them to a connection, such as in vibe-http, client.d? https://github.com/vnayar/vibe-http/blob/multipart-form-data-entities-rfc-2388/source/vibe/http/client.d#L909

Maybe I'm confused about the actual status of types like InetHeaders and InputStream. These are the types utilized within vibe, right?

The MultipartEntityPart body type of SumType!(string, InterfaceProxy!InputStream) was selected to permit a body to be chosen either by giving a fixed value (a common case with <input> and <select> tags in HTML forms, but also to chose an available file, or to choose data that a service already has in memory (e.g. a server calling an API endpoint, and submitting data that it got from elsewhere, such as by extracting an email or generating a file in memory using a library.)

I forgot to add the proper template constraints in my example. For HEADER_MAP this would be something like isStringMap!HEADER_MAP and for PARTS it could be isInputRange!PARTS && isMultipartBodyType(ElementType!PARTS). isMultipartBodyType could then accept string, const(ubyte)[], an isInputStream!T type, or a SumType!(...) of such types (possibly also Algebraic or TaggedUnion, if there is a common interface). Since both have an input range interface, the code to handle them shouldn't be much different from the code with concrete types.

Without an example this does become a bit more complicated to understand from the user's point of view, unfortunately, but having the high-level overloads for the common cases should mitigate that hopefully.

@vnayar
Copy link
Author

vnayar commented May 27, 2024

I forgot to add the proper template constraints in my example. For HEADER_MAP this would be something like isStringMap!HEADER_MAP and for PARTS it could be isInputRange!PARTS && isMultipartBodyType(ElementType!PARTS). isMultipartBodyType could then accept string, const(ubyte)[], an isInputStream!T type, or a SumType!(...) of such types (possibly also Algebraic or TaggedUnion, if there is a common interface). Since both have an input range interface, the code to handle them shouldn't be much different from the code with concrete types.

In this particular case, I believe we won't be able to treat the parts as a range interface, because they lack a common element type. For example, a form might consist of 3 fields, a text input, a checkbox, and a multi-file upload. Thus a statement like isMultipartBodyType(ElementType!PARTS) wouldn't work, because there is no single element type for PARTS. I don't know if it's possible to create at compile-time a new SumType from the original part types. Is that what is needed? Perhaps as a stop-gap, we can simply make the returned type a SumType of all the types that can be returned by a helper function like auto formPart(string name, NativePath file_path) or auto formPart(string name, string value).

According to the spec, the default Content-Type of a multipart/form-data field is text/plain. So checkboxes, options, as well as text fields should all use text/plain (although technically they don't have to) https://datatracker.ietf.org/doc/html/rfc2388#section-3 File upload seems to be the only mentioned case where the Content-Type is explicitly set, and it can be something like application/pdf or multipart/mixed and represent a set of files.

It's a bit of a shame that there's no MIME RFC2046 library that already exists that could be used for this purpose. (Hmm, perhaps I should begin there?)

source/vibe/inet/webform.d Outdated Show resolved Hide resolved
@s-ludwig
Copy link
Member

s-ludwig commented Jun 3, 2024

Perhaps as a stop-gap, we can simply make the returned type a SumType of all the types that can be returned by a helper function like auto formPart(string name, NativePath file_path) or auto formPart(string name, string value)

Yeah, I'd say it makes sense to go from there. For the internal representation we should have that anyway, as Variant should really be a last-resort solution due to its less than ideal overall characteristics (GC allocations, virtual function calls, no nothrow/@safe).

I'd say that with changing to SumType and making the tests pass we can merge this and make further API refinements as separate steps.

@vnayar
Copy link
Author

vnayar commented Jun 4, 2024

Perhaps as a stop-gap, we can simply make the returned type a SumType of all the types that can be returned by a helper function like auto formPart(string name, NativePath file_path) or auto formPart(string name, string value)

Yeah, I'd say it makes sense to go from there. For the internal representation we should have that anyway, as Variant should really be a last-resort solution due to its less than ideal overall characteristics (GC allocations, virtual function calls, no nothrow/@safe).

I'd say that with changing to SumType and making the tests pass we can merge this and make further API refinements as separate steps.

I ran into a problem with this approach. Because the Header type is templated, and the stream type that a file can be in is also templated (inside FormFile(InputStreamT)), there's no way to make a single SumType, because to do so would require knowing the HeaderT and InputStreamT values in advance.

The only way I can think of to make this work is to provide the HeaderT and InputStreamT types as template-parameters to MultipartEntity, and this would force all the different parts of the multipart form to use the same header/stream types. Only Variant permits mixing and matching of header/stream types.

However, because the HeaderT and InputStreamT is sometimes determined by factory methods, and mixing would no longer be allowed, it would essentially mean only a single type is permitted, and there's no point offering it as a template-parameter. At this point, one might as well hard-code InetHeaderMap for HeaderT and InterfaceProxy!InputStream for InputStreamT.

How important is it to offer arbitrary range types in the interface? If it's very important, then maybe Variant should be kept. If it's not important, then the header and file stream types should probably be locked in and the body could then be SumType.

@vnayar
Copy link
Author

vnayar commented Jun 19, 2024

@s-ludwig So what do you think about the problem of using SumType?

In a nutshell, if I have a list/range of Multiparts (which can be the body of a multipart entity), then I need a type for that variable. However, if I provide a type, then it has to have a concrete header and body type. But this means it cannot support ranges, which themselves have no concrete type. E.g. you can't have a concrete range of different types of ranges. You would either have to lock-down supported range types, like InterfaceProxy!InputStream, or use Variant. Which is better?

@s-ludwig
Copy link
Member

Sorry for replying late, I actually started to write a reply two weeks ago but then got sidetracked. But now actually, after re-reading everything, I think that what I meant in my first comment was actually to have the PARTS template parameter to be variadic and just forgot about it when you brought up the range type issue and didn't see it in the code.

So it would look like this: auto multiPartEntity(HEADER_MAP, PARTS...)(HEADER_MAP headers, PARTS parts)

Parts can then be given as multiple individually typed arguments. Each argument could be a range of parts, or a single part. The user could choose the range element type to be whatever is needed. Although it may not actually be necessary at all, we could also support using SumType as range elements and match accordingly, so that the user has maximum flexibility here to represent the parts list.

@vnayar
Copy link
Author

vnayar commented Jun 24, 2024

Sorry for replying late, I actually started to write a reply two weeks ago but then got sidetracked. But now actually, after re-reading everything, I think that what I meant in my first comment was actually to have the PARTS template parameter to be variadic and just forgot about it when you brought up the range type issue and didn't see it in the code.

So it would look like this: auto multiPartEntity(HEADER_MAP, PARTS...)(HEADER_MAP headers, PARTS parts)

Parts can then be given as multiple individually typed arguments. Each argument could be a range of parts, or a single part. The user could choose the range element type to be whatever is needed. Although it may not actually be necessary at all, we could also support using SumType as range elements and match accordingly, so that the user has maximum flexibility here to represent the parts list.

I originally had tried something like this, and the code is still commented out, but it would look something like this:

/// Returns a MultipartEntity with Content-Type "multipart/form-data" from parts as a compile-time sequence.
auto multipartFormData(MultipartR...)(MultipartR parts) {
	string boundaryStr = createBoundaryString();
	auto headers = only(
			tuple("Content-Type", "multipart/form-data; boundary=" ~ boundaryStr));
	return MultipartEntity!(typeof(headers))(headers: headers, bodyValue: parts, boundary: boundaryStr);
}

In this case, the compiler expands the variadic parameter parts, and you end up with an error like this:

source/vibe/inet/webform.d(794,42): Error: template `vibe.inet.webform.MultipartEntity!(OnlyResult!(Tuple!(string, string))).MultipartEntity.__ctor` is not callable using argument types `!()(OnlyResult!(Tuple!(string, string)), MultipartEntity!(OnlyResult!(Tuple!(string, string))), MultipartEntity!(OnlyResult!(Tuple!(string, string), Tuple!(string, string), Tuple!(string, string))), MultipartEntity!(OnlyResult!(Tuple!(string, string), Tuple!(string, string))), string)`

Shortening the header types as H and body type as B, the error is:

source/vibe/inet/webform.d(794,42): Error: template `vibe.inet.webform.MultipartEntity!(H).MultipartEntity.__ctor` is not callable using argument types `!()(H, B1, B2, B3, string)`

In essence, the same problem remains. A concrete type that can fit all the diverse body types is still needed. And those body types might wrap a file or other arbitrary range. Thus, the body variable, if it does not wish to use Variant, would still need a SumType that encompasses all the possible body content types that are permitted: string, ranges, inputstreams, etc.

Even Variant doesn't solve the problem, because when sending the actual form on an HTTP connection, you need a way to write the contents of the multipart parts, and without a type, that's not really an option. The only ideas I have right now would be to support InputProxy!InputStream and InputRangeObject, these are at least concrete types one can compile against.

@vnayar
Copy link
Author

vnayar commented Jun 24, 2024

I added a new commit that hopefully makes things easier to reason about this. I got rid of Variant and replaced it with SumType, and the code compiles with the minimal tests running as well. To do this, I simply got rid of all the template parameters.

Starting from something that's working, maybe it would be easier to specify, one at a time, what template parameters should exist.

@s-ludwig
Copy link
Member

s-ludwig commented Jun 24, 2024

return MultipartEntity!(typeof(headers))(headers: headers, bodyValue: parts, boundary: boundaryStr);

This should be something like this:

return MultipartEntity!(typeof(headers), MultipartR)(headers: headers, bodyValue: parts, boundary: boundaryStr);

The named argument initialization probably cannot be used, yet, due to backwards compatibility and I'm not sure how well it plays with tuple typed fields, but in any case, the MultipartR types need to be explicitly passed to the MultipartEntity template type.

edit: The MultiPartEntity declaration would look like this:

struct MultipartEntity(HEADER_MAP, PARTS...) {
    HEADER_MAP headers;
    PARTS parts;
    string boundary;
}

@vnayar
Copy link
Author

vnayar commented Jul 11, 2024

edit: The MultiPartEntity declaration would look like this:

struct MultipartEntity(HEADER_MAP, PARTS...) {
    HEADER_MAP headers;
    PARTS parts;
    string boundary;
}

Wouldn't that make it impossible to declare an object in practice? Imagine trying to a assign to parts. I could try something like parts = [ MultiPartEntity!(InetHeaderMap, string), MultiPartEntity!(InetHeaderMap, File) ], but these two items contain no common type, there could be no type implied for parts or even the literal value being assigned to it.

I don't think this level of desired flexibility can be accomplished without Variant. One either needs to sharply limit the types that a MultiPartEntity body can have, or use Variant. Variant is also only a half-solution, because one still needs to know how to write the contents of the body to a Stream. So it does matter whether it's a File, Stream, etc.

@s-ludwig
Copy link
Member

The nice thing is that the user can prepare the parts however it fits best:

// functional style:

auto entity = multiPartEntity(header,
    formPart("user", "peter"),
    formPart("file", NativePath("foo.dat"));

auto entity = multiPartEntity(header,
    formPart("user", "peter"),
    paths.map!(p => formPart("file", p)));

// imperative style:

typeof(formPart("", NativePart.init))[] files;
foreach (path; paths)
    files ~= formPart("file", path);
auto entity = multiPartEntity(header,
    formPart("user", "peter"),
    files);

alias MyPartTypes = SumType!(typeof(formPart("", "")), typeof(formPart("", NativePath.init));
MyPartTypes[] parts;
parts ~= MyPartTypes(formPart("foo", "bar"));
parts ~= MyPartTypes(formPart("foo", NativePath("bar.dat")));
auto entity = multiPartEntity(header, parts);

The necessity for typeof when going the imperative route isn't very nice, but we could pre-define some type aliases for the two simplest cases, if that style of initialization is deemed important enough. However, I think that just using separate parameters for different types will work for almost all practical use cases, so that the necessity for SumType (or Variant) should be quite a niche case in practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants