Fails on duplicate Content-Type headers #29

Open
jonbarrow opened this issue Jan 5, 2025 · 9 comments

@jonbarrow

jonbarrow commented Jan 5, 2025

According to the spec, an HTTP message may contain headers with the same name. In those cases the values of the headers are concatenated into a comma-separated list (https://datatracker.ietf.org/doc/html/rfc9110#section-5.2):

When a field name is repeated within a section, its combined field value consists of the list of corresponding field line values within that section, concatenated in order, with each field line value separated by a comma.

For example, this section:

Example-Field: Foo, Bar
Example-Field: Baz

contains two field lines, both with the field name "Example-Field". The first field line has a field line value of "Foo, Bar", while the second field line value is "Baz". The field value for "Example-Field" is the list "Foo, Bar, Baz".
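As a rough illustration of the combining rule quoted above, here is a minimal sketch. `combineFieldLines` is a hypothetical helper name used only for this example, not something from content-type or any other library:

```javascript
// Combine repeated field lines into one field value per name, in order,
// joined with ", " as described in the quoted RFC 9110 section 5.2 text.
// `combineFieldLines` is a hypothetical helper, not a library API.
function combineFieldLines(lines) {
  const combined = new Map();
  for (const [name, value] of lines) {
    const key = name.toLowerCase(); // field names are case-insensitive
    combined.set(key, combined.has(key) ? `${combined.get(key)}, ${value}` : value);
  }
  return combined;
}

const fields = combineFieldLines([
  ['Example-Field', 'Foo, Bar'],
  ['Example-Field', 'Baz'],
]);
console.log(fields.get('example-field')); // "Foo, Bar, Baz"
```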

However, content-type fails to parse values when they are concatenated like this. Basic repro:

const contentType = require('content-type');

let type1;
let type2;

try {
	type1 = contentType.parse('application/x-www-form-urlencoded');
	type2 = contentType.parse('application/x-www-form-urlencoded, application/x-www-form-urlencoded');
} catch {} // Ignore errors

console.log(type1); // ContentType { parameters: [Object: null prototype] {}, type: 'application/x-www-form-urlencoded' }
console.log(type2); // undefined. This threw "invalid media type"

While uncommon, this situation can happen and is perfectly within the spec. One common way it arises is with Cloudflare Workers/Snippets: if a request containing a duplicate header is processed by a Worker/Snippet, Cloudflare automatically converts the duplicates into a comma-separated list, which then fails to parse here.

This is caused by the following regex only allowing a single value for the header:

var TYPE_REGEXP = /^[!#$%&'*+.^_`|~0-9A-Za-z-]+\/[!#$%&'*+.^_`|~0-9A-Za-z-]+$/
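To make that concrete, here is a small check of the regex in isolation. It assumes the whole header value reaches this test unchanged, which should be the case when the value has no `;` parameters; I have not traced every code path in the library:

```javascript
// The type regex quoted above, copied verbatim from the issue.
const TYPE_REGEXP = /^[!#$%&'*+.^_`|~0-9A-Za-z-]+\/[!#$%&'*+.^_`|~0-9A-Za-z-]+$/;

const single = 'application/x-www-form-urlencoded';
const combined = 'application/x-www-form-urlencoded, application/x-www-form-urlencoded';

// Neither "," nor " " is in the allowed character class, and the pattern is
// anchored at both ends, so a combined value cannot match.
console.log(TYPE_REGEXP.test(single));   // true
console.log(TYPE_REGEXP.test(combined)); // false
```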

This issue is related to several other issues, which I will link here after they are made, since I'm not sure who should have the responsibility of ensuring the header is split correctly

@jonbarrow jonbarrow changed the title On duplicate Content-Type headers Fails on duplicate Content-Type headers Jan 5, 2025
@wesleytodd
Member

wesleytodd commented Jan 8, 2025

Ok, doing a little bit of initial digging. The previously referenced RFC in the readme and code is superseded by this one and the old one does not contain this language afaict. Seems likely that this implementation just pre-dates the official handling of this in the spec.

I think the problem goes a bit deeper than JUST parsing this for multiple values, though, since a following section about handling multiple Content-Type values calls out this:

This means that, aside from the well-known exception noted below, a sender MUST NOT generate multiple field lines with the same name in a message (whether in the headers or trailers) or append a field line when a field line of the same name already exists in the message, unless that field's definition allows multiple field line values to be recombined as a comma-separated list

And the content-type header is defined here: https://www.rfc-editor.org/rfc/rfc9110.html#field.content-type

Although Content-Type is defined as a singleton field, it is sometimes incorrectly generated multiple times, resulting in a combined field value that appears to be a list. Recipients often attempt to handle this error by using the last syntactically valid member of the list, leading to potential interoperability and security issues if different implementations have different error handling behaviors.

This leads me to think that while our handling is slightly incorrect, it could also be a security risk if we do anything other than consider this completely invalid behavior and throw/bail on it (see the paragraph above this one on how mime-sniffing can open issues).

I wonder if maybe @jasnell might be a good person to help us on this as someone who both deeply knows the domain and also works at CF. I am inclined to say we should do one of two things (depending on James' input):

  1. Refuse to parse this as it is invalid
  2. Parse only the "first" entry (which the spec seems to imply is the way to avoid the problem it calls out?)
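A tiny sketch of why the choice of member matters: two recipients that pick different members of the same combined value end up with different media types. The header value is made up for illustration, and the naive split assumes there are no commas inside quoted parameters:

```javascript
// Naive split for illustration; assumes no commas inside quoted parameters.
const combined = 'text/html, application/json';
const members = combined.split(',').map((m) => m.trim());

const first = members[0];
const last = members[members.length - 1];

console.log(first); // "text/html"
console.log(last);  // "application/json"
```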

@jonbarrow
Author

The previously referenced RFC in the readme and code is superseded by this one and the old one does not contain this language afaict. Seems likely that this implementation just pre-dates the official handling of this in the spec.

The RFC mentioned in the readme does actually make mention of multiple header fields, with Content-Type specifically being listed as a "good example" for them (https://datatracker.ietf.org/doc/html/rfc7231):

Whether the field is a single value or whether it can be a list (delimited by commas; see Section 3.2 of [RFC7230]).

If it does not use the list syntax, document how to treat messages where the field occurs multiple times (a sensible default would be to ignore the field, but this might not always be the right choice).

Note that intermediaries and software libraries might combine multiple header field instances into a single one, despite the field's definition not allowing the list syntax.  A robust format enables recipients to discover these situations (good example: "Content-Type", as the comma can only appear inside quoted strings; bad example: "Location", as a comma can occur inside a URI)

The reference link mentioned in this section, https://datatracker.ietf.org/doc/html/rfc7230, reads:

A sender MUST NOT generate multiple header fields with the same field name in a message unless either the entire field value for that header field is defined as a comma-separated list [i.e., #(values)] or the header field is a well-known exception (as noted below).

A recipient MAY combine multiple header fields with the same field name into one "field-name: field-value" pair, without changing the semantics of the message, by appending each subsequent field value to the combined field value in order, separated by a comma.  The order in which header fields with the same field name are received is therefore significant to the interpretation of the combined field value; a proxy MUST NOT change the order of these field values when forwarding a message.

So this is still something to be expected even in the older specification, it seems

Parse only the "first" entry (which it is implied in the spec to be way to avoid the problem called out in the spec?)

I think this would be the most sensible thing to do imo, but I may just be biased

and also works at CF

That would be great. Right now we have an additional CF Snippet running that will automatically convert a duplicate header into a single value by taking the first entry (same as what was proposed here), but that's not super ideal for us in the long run since CF Snippets are limited in quantity and this is eating one of our slots

@jonbarrow
Author

jonbarrow commented Jan 8, 2025

I think there might also be an argument to be made about upping the target spec, at least if the absence of certain language in the older spec is an issue. These libraries are used by body-parser and shipped with express, which is still widely used with modern clients, so it may be worth targeting the newer spec to handle these clients if there's new language. (Which is ironic, since this issue was originally found due to the client in question being a 3DS, which is by no means modern lol)

@wesleytodd
Member

I don't think it was an issue, I was just surprised it was not accounting for this since I know proxies have done this for years. Thanks for finding the section on that, I must have just missed it.

But yeah, I think we can work toward updating the references to the spec (good first PRs 😉) as well as doing the proposal here. Ideally we would do the doc references separately so they can breeze on through in case we need to go back and forth on the implementation updates. Especially since there is some risk this becomes a breaking change.

@jonchurch
Member

jonchurch commented Jan 8, 2025

tldr; a list of values in Content-Type is not spec compliant, and it's the job of the application to deal with invalid header values

Section 5.2 explains why Content-Type or any arbitrary header may be combined in the wild, but does not justify treating the result as valid. It defines a general syntax level rule for handling duplicated headers, but does not override the semantics of individual headers.


I think the quoted part of section 5.3 below is being misinterpreted in terms of what "good example" means. It does not support the idea that a list is valid for Content-Type; it is calling attention to the fact that a list is invalid here and that implementers get to decide what to do when it's encountered.

The RFC mentioned in the readme does actually make mention of multiple header fields, with Content-Type specifically being listed as a "good example" for them (datatracker.ietf.org/doc/html/rfc7231):

Whether the field is a single value or whether it can be a list (delimited by commas; see Section 3.2 of [RFC7230]).

If it does not use the list syntax, document how to treat messages where the field occurs multiple times (a sensible default would be to ignore the field, but this might not always be the right choice).

Note that intermediaries and software libraries might combine multiple header field instances into a single one, despite the field's definition not allowing the list syntax. A robust format enables recipients to discover these situations (good example: "Content-Type", as the comma can only appear inside quoted strings; bad example: "Location", as a comma can occur inside a URI)

The spec is referring to how robust media-type is as a format, and therefore how simple it is to detect these invalid list values for headers which only accept single values ("the comma can only appear inside quoted strings"). It contrasts this with Location, where a list is much harder to detect because a comma is an acceptable character in a URI.
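To show what "simple to detect" could look like in practice, here is a hedged sketch of a quote-aware splitter. `splitOutsideQuotes` is a hypothetical helper, not part of content-type, and it only tracks double quotes and backslash escapes rather than implementing the full media-type grammar:

```javascript
// Split a header value on commas that sit outside quoted strings.
// `splitOutsideQuotes` is a hypothetical helper, not part of content-type.
function splitOutsideQuotes(value) {
  const members = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < value.length; i++) {
    const ch = value[i];
    if (inQuotes && ch === '\\' && i + 1 < value.length) {
      current += ch + value[++i]; // keep quoted-pair escapes intact
    } else if (ch === '"') {
      inQuotes = !inQuotes;
      current += ch;
    } else if (ch === ',' && !inQuotes) {
      members.push(current.trim());
      current = '';
    } else {
      current += ch;
    }
  }
  members.push(current.trim());
  return members;
}

console.log(splitOutsideQuotes('text/plain; title="a, b"'));
// [ 'text/plain; title="a, b"' ]
console.log(splitOutsideQuotes('text/plain; title="a, b", text/html'));
// [ 'text/plain; title="a, b"', 'text/html' ]
```

More than one member means the value was combined from several field lines. The same trick does not work for Location, since an unquoted comma there may simply be part of the URI.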

For the called out section from rfc7230 section 3.2.2

A sender MUST NOT generate multiple header fields with the same field name in a message unless either the entire field value for that header field is defined as a comma-separated list [i.e., #(values)] or the header field is a well-known exception (as noted below).

This is typical spec speak, and how I interpret it is informed by "unless either the entire field value for that header field is defined as a comma-separated list [i.e., #(values)]".
Content-Type does not define a list as a valid value, nor is it a well-known exception. So this section on field ordering is not relevant to the issue being discussed.

content-type is making the choice to reject invalid values, which is called out as a sensible default by the spec.

Combining of multiple headers this way is common for proxies, yes, but proxies typically do it without any idea if they are creating valid values. They do it per 5.2 in order to guard downstream servers from seeing duplicate headers and being confused. The root is almost always a misconfigured client, and the proxy is just doing its best. An invalid value better represents the semantics of the transported message than duplicate headers which could cause unexpected results.

I see a few paths forward here, ordered in terms of my preference

  1. Provide a better error for this known failure case in the library, document it, and allow folks to catch and handle that as they see fit
  2. Cloudflare (and other proxies) check the header they are concatting to see if they are creating invalid values. Honestly, a proxy's job is delivery so it seems unlikely they'd reject these invalid requests by default, but dropping duplicates is an option here.
  3. Add a nonbreaking option to opt-in to parsing these invalid values
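For option 1, a rough sketch of what a clearer error might look like, written as a wrapper so it does not presume anything about the library internals. `withListError` and the `ERR_CONTENT_TYPE_LIST` code are hypothetical names, and `parse` is assumed to behave like `contentType.parse` called with a string (return a parsed object, or throw on invalid input):

```javascript
// Wrap a Content-Type parser so combined (list-like) values fail with a
// descriptive error. `parse` is assumed to behave like contentType.parse:
// it returns a parsed object or throws on invalid input.
function withListError(parse) {
  return function parseWithListError(header) {
    try {
      return parse(header);
    } catch (err) {
      // Strip quoted strings first so commas inside parameter values are ignored.
      const unquoted = String(header).replace(/"(?:[^"\\]|\\.)*"/g, '""');
      if (unquoted.includes(',')) {
        const listError = new TypeError(
          'invalid media type: Content-Type appears to be a comma-separated list'
        );
        listError.code = 'ERR_CONTENT_TYPE_LIST'; // hypothetical error code
        listError.cause = err;
        throw listError;
      }
      throw err; // unrelated failure: rethrow unchanged
    }
  };
}
```

An application could then wrap the real parser, for example `withListError(contentType.parse)`, and branch on the error code in its own error handling.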

If you're already handling these at the ingest, then that sounds sane to me and like the right approach. Fixing the client is the best approach, but since you're working with hardware I understand that's not an option.

@jonbarrow
Author

Thank you for the reply, and I apologize for my lateness. Thank you for the detailed look at the targeted spec. While I do still believe this is within spec as per the current RFC (due to section 5.2 seemingly being much more lenient in its language than the current target spec's language), that's just my interpretation

Fixing the client is the best approach, but since you're working with hardware I understand that's not an option

Yes unfortunately we don't have much control here, even though I agree this is the best option. Though to be fair even in situations outside of ours where someone may not be working with hardware like we are, fixing the client is often not an option for most people since you don't typically control the client outside of your own products

Cloudflare (and other proxies) check the header they are concatting to see if they are creating invalid values. Honestly, a proxy's job is delivery so it seems unlikely they'd reject these invalid requests by default, but dropping duplicates is an option here

I unfortunately have no pull over how Cloudflare operates in this regard. Even in our current setup, we aren't preventing the combined header from being created. Cloudflare creates it and we're retroactively deduplicating it. So I'm not sure how much of an option this actually is

If you're already handling these at the ingest, then that sounds sane to me and like the right approach

We're handling this via Cloudflare Snippets right now yes, but that's not super ideal for us since Cloudflare only provides a limited number of these Snippets and this takes up a slot. We are doing this at the Cloudflare level because the issue presents itself through body-parser which is the first thing that runs on our Express server
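For reference, the same "take the first entry" cleanup could in principle live in the Express app itself, ahead of body-parser. This is only a sketch under assumptions: `firstContentType` is a hypothetical middleware name, and it assumes the combined value arrives as a single string in `req.headers['content-type']` and that later middleware reads the header from there:

```javascript
// Hypothetical Express-style middleware: keep only the first member of a
// combined Content-Type before body parsers see it.
function firstContentType(req, res, next) {
  const header = req.headers['content-type'];
  if (typeof header === 'string') {
    // Mask quoted strings (same length) so commas inside parameter values
    // are ignored when looking for the split point.
    const masked = header.replace(/"(?:[^"\\]|\\.)*"/g, (m) => '_'.repeat(m.length));
    const comma = masked.indexOf(',');
    if (comma !== -1) {
      req.headers['content-type'] = header.slice(0, comma).trim();
    }
  }
  next();
}
```

It would need to be registered before any body parsers, e.g. `app.use(firstContentType)` ahead of `app.use(express.json())`. Note that this silently accepts a value the spec treats as invalid, so it carries the interoperability caveats discussed earlier in the thread.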

That's why I initially made several issues on various repositories: I'm aware it's up to the application to decide what to do, but I was not sure whose responsibility it actually was in this case, since there are 6 different places where this issue could potentially be handled (Cloudflare Snippets, Express/the Express server, body-parser, or any of the 3 dependencies used by body-parser)

The issues in the other repositories were closed in favor of handling this downstream, but if the interpretation is that this isn't spec compliant then I suppose those would be reopened? Or is the expectation that none of the packages handle this and it's left up to the developer? I'm genuinely asking, to be clear, since I really have no idea who would be responsible in this case and we're looking to settle on a solution for this on our side

Provide a better error for this known failure case in the library

Imo regardless of what happens, even if none of the mentioned packages fix this themselves, a better error would probably be in order just to make it more clear as to why the header is being rejected. At the very least it would save people some time having to dig through the source of multiple libraries to figure that out (plus digging through multiple versions of the same libraries, as in some of these cases GitHub does not match npm)

@wesleytodd
Member

Ok, we have a few long posts here, but to help bring it back to a decision for this library specifically, we need to choose between one of these two, I think:

  1. Better error when we refuse to parse this, then ensure that error is bubbled up to body-parser correctly
  2. Add an option to parse these, where we de-duplicate and take the first value

My vote goes to 1 since it is invalid in the first place, but I think 2 is viable because it would solve a real problem here.

@jonbarrow
Author

jonbarrow commented Jan 16, 2025

Both of those sound reasonable to me

Given that the current target spec, at least, does seem to forbid this, I agree that it probably isn't best to fix it in this repo. Rather, a clear error should be thrown to indicate why the header is being rejected, and the deduplication should probably be handled in either type-is or body-parser
