Inconsistent Behavior In Paths Starting With //
#401
Comments
Do you have more concrete examples? I'm not sure those are the best software examples to compare Addressable with. A web browser would be better IMHO. I tried https://example.com//login in Chrome and Firefox, and they both make the request to
Can you clarify this paragraph? Perhaps show some code snippets of what is happening today and how you would like Addressable to behave? |
I wasn't referring to web browsers. Browsers (or at least FF and Chrome, which I've tried) appear to loyally relay the URLs entered. (One exception though is with the Web servers, however seem to consistently overlook such double slashes. Some examples I chose at random right now (I didn't run into a single site that didn't ignore it):
Sure. Here are several. Assume for all:
examples = []
examples << Addressable::URI.parse('https://example.com//directory')
examples << Addressable::URI.parse('https://example.com/directory')
examples << Addressable::URI.parse('https://example.com//directory2')
examples << Addressable::URI.parse('https://example.com/directory2')
examples << Addressable::URI.parse('https://example.com//')
examples << Addressable::URI.parse('https://example.com/')
examples.each(&:freeze)
examples.freeze
Note: the preferred outputs above assume the gem is always, immediately translating the |
Sorry. Just re-read this. I guess I would disagree that web servers aren't a good example here, though I can look for other software. The thing with HTTP clients is that they intentionally don't tweak much in URLs because different servers might parse things differently. (I suppose this is an argument for only making the changes I listed happen during normalization.) The consequence (or perhaps the cause) of this is that HTTP clients don't have to make the conversion because HTTP servers tend to. This means that if we care about compatibility, this change would fix the issue on servers using this code, while not breaking it on clients who use this because the servers those clients connect to are prone to ignore the difference. |
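A minimal sketch of the normalization-only idea mentioned above, assuming a hypothetical helper named `collapse_leading_slashes` (it is not part of Addressable) and Ruby's standard `uri` library for parsing. It collapses a run of leading slashes only at comparison time and leaves the parsed URI itself untouched:

```ruby
require "uri"

# Hypothetical helper, not an Addressable API: collapse a run of leading
# slashes in a path down to a single "/". Slashes elsewhere are left alone.
def collapse_leading_slashes(path)
  path.sub(%r{\A/{2,}}, "/")
end

# Compare two URLs by path only, ignoring extra leading slashes.
def same_path?(url_a, url_b)
  path_a = collapse_leading_slashes(URI.parse(url_a).path)
  path_b = collapse_leading_slashes(URI.parse(url_b).path)
  path_a == path_b
end

puts same_path?("https://example.com//login", "https://example.com/login")
```

Doing this only when comparing keeps the original string available to HTTP clients that need to relay the URL exactly as received.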
Re: the file URI scheme, I think Addressable is doing the correct things, see https://en.wikipedia.org/wiki/File_URI_scheme#How_many_slashes? (until I looked this up, I didn't know that
$ irb -raddressable/uri
irb(main):001:0> Addressable::VERSION::STRING
=> "2.7.0"
irb(main):002:0> Addressable::URI.parse('file://file.txt').hostname
=> "file.txt"
Yeah, I think we should compare Addressable with other software that parses URIs. I'm sure there are popular libraries in other languages we could compare with. I don't think the web servers are great examples because they are another level of the stack. Also, when visiting a popular website we go even higher up the stack: it could be their web framework that decides the behaviour around |
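As one possible point of comparison, here is a small sketch that runs the URLs from this thread through Ruby's standard `uri` library instead of Addressable. The `components` helper is a name made up for this example; only `URI.parse`, `#host`, and `#path` come from the standard library:

```ruby
require "uri"

# Parse with Ruby's standard library (not Addressable) and report the
# components discussed in this thread. `components` is a hypothetical helper.
def components(url)
  uri = URI.parse(url)
  { host: uri.host, path: uri.path }
end

# Print the host and path that the standard library extracts for each URL.
p components("file://file.txt")
p components("https://example.com//login")
```

Running the same inputs through both parsers is a quick way to see where they agree before deciding which behaviour Addressable should adopt.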
That's both interesting and really weird (the hostname thing). I think this really comes down to how this gem ultimately prioritizes the qualities that make a library like this useful:
The "Robustness" would suggest that either the desired usage be guessed or that it be configurable.
So while I'm not even very passionate about the
This really goes back to my point before about priorities and compatibility. I just tried the I would agree that when a web framework or app logic is involved, URIs are often simply passed down to that component to handle the issue. Here's the thing though: Ruby is perhaps most often used in connection with said web frameworks and application logic. So it's hardly inconceivable that authors of such libraries would want to hand off the parsing of URIs to gems such as this and expect them to be the one to handle this nuance. It certainly would seem awkward for such authors to need to correct the output of this library before being able to use it. My original use case involved attempting to compare two URLs when one of the values was mangled by another server to have an extra
But not making a decision (as is the case now) is itself a decision. And while various tools will leave even non-conforming URIs be and pass them on dutifully as-is, if that functionality were desired,
That I agree with; it's basically what I was suggesting when saying the |
I think it's worth calling out that Addressable's |
@sporkmonger, good point. My initial post was actually about that function, ironically, but I totally forgot about it when providing the requested examples. My bad. Unfortunately though, the output of Could As for |
+1 to this issue. While conformity to RFC standards is certainly the priority, where possible libraries should strive to remain functional when the non-compliance can be safely recovered from. I encountered problems with both
|
@KelseyDH your problem does not sound related to this issue; you have a perfectly valid URI with a fragment. You could do this:
irb(main):003:0> Addressable::VERSION::STRING
=> "2.8.0"
irb(main):004:0> uri = Addressable::URI.parse 'http://localhost:4300/webapp/foo/#//controller/action?account=001-001-111&email=john%40email.com'
=> #<Addressable::URI:0x140 URI:http://localhost:4300/webapp/foo/#//controller/action?account=001-001-111&email=john%40email.com>
irb(main):005:0> uri.query
=> nil
irb(main):006:0> uri.fragment
=> "//controller/action?account=001-001-111&email=john%40email.com"
irb(main):007:0> Addressable::URI.parse(uri.fragment)
=> #<Addressable::URI:0x154 URI://controller/action?account=001-001-111&email=john%40email.com>
irb(main):008:0> Addressable::URI.parse(uri.fragment).query
=> "account=001-001-111&email=john%40email.com"
irb(main):009:0> Addressable::URI.parse(uri.fragment).query_values
=> {"account"=>"001-001-111", "email"=>"john@email.com"} |
NB: So this is really long, but I wanted to provide ample background in the hopes that this will make a decision and/or PR a lot easier. (I'm willing to do the PR if there's go-ahead.)
Background
I was just parsing a URL that was the result of an HTTP redirect, and due to a bug on that server (which I don't control), the user is redirected to a URL along the lines of `https://example.com//login`.
This posed an issue for me because I was attempting to compare the path to the simple `/login`.
It looks like this was a change made several years ago (see a14e0cb / #240) in order to comply with the standard (RFC 3986) for URI syntax, due to the ambiguity that can result in a URI without an authority.
Unexpected/Incompatible Behavior
That being said, however, silent acceptance of `//` prefixes as if they were simply `/` is exceedingly common, from web servers, to Linux filesystem paths, to `Rails` itself.
Furthermore, it's not at all uncommon to see such errors with errant `/`s in paths, precisely because they are pretty much always silently ignored and there is no impetus to fix (or even notice) them.
Standards
While -- per RFC 3986 -- `//` at the start of a path is not legal in the absence of an `authority`, the RFC is a bit fuzzy on the details.
In particular, while they say:
> When authority is not present, the path cannot begin with two slash characters ("//")

the ABNF listed in the RFC (which I've duplicated at the end of this issue) doesn't seem to allow such a path regardless of the presence/absence of an `authority` component.
On top of this, the algorithm provided in the RFC for normalizing paths (also duplicated below), converts a leading `//` to a `/`.
The Solution
Given how software interprets paths beginning with `//`, combined with the ambiguity of the relevant RFC on the matter, I would argue that there is strong justification for -- while continuing to consider paths with `//` illegal -- normalizing them into `/`, perhaps silently, if they are encountered.
At a bare minimum, it makes sense to de-duplicate `//`s in `Addressable::URI::heuristic_parse` for the `https?` schemes.
An additional argument for doing this is that not doing so can produce some confusing artifacts/inconsistencies. Among those I've noticed:
- `uri.only(:path)` is called.
- `uri.route_from(uri2)` (assuming `uri2` has the same `authority`) will throw an unexpected error.
Extra Credit (A related bug)
This may or may not be considered a different issue depending on how this issue is resolved, but in researching this issue, I also found a bug in `addressable` when `//` prefixes the path without an `authority`.
Namely, the code:
does not throw any errors because the validation checks if the authority is `nil` rather than if it is an empty string.
If you were to then evaluate `uri.authority = nil`, however, the exception is thrown.
Lastly, for what it's worth, I'm not sure if file URIs along the lines of `file://relative/path` are officially supported in the standards (I've seen them a bit before, at least), but they parse (arguably incorrectly) with `'relative'` (in this case) being included in the `host` component.
Reference (Excerpts cited above)
ABNF for URI Paths from RFC 3986
Algorithm for normalizing/resolving relative paths within a URI's path in RFC 3986
5.2.4. Remove Dot Segments
The pseudocode also refers to a "remove_dot_segments" routine for interpreting and removing the special "." and ".." complete path segments from a referenced path. This is done after the path is extracted from a reference, whether or not the path was relative, in order to remove any invalid or extraneous dot-segments prior to forming the target URI. Although there are many ways to accomplish this removal process, we describe a simple method using two string buffers.
The input buffer is initialized with the now-appended path components and the output buffer is initialized to the empty string.
While the input buffer is not empty, loop as follows:
A. If the input buffer begins with a prefix of "../" or "./", then remove that prefix from the input buffer; otherwise,
B. if the input buffer begins with a prefix of "/./" or "/.", where "." is a complete path segment, then replace that prefix with "/" in the input buffer; otherwise,
C. if the input buffer begins with a prefix of "/../" or "/..", where ".." is a complete path segment, then replace that prefix with "/" in the input buffer and remove the last segment and its preceding "/" (if any) from the output buffer; otherwise,
D. if the input buffer consists only of "." or "..", then remove that from the input buffer; otherwise,
E. move the first path segment in the input buffer to the end of the output buffer, including the initial "/" character (if any) and any subsequent characters up to, but not including, the next "/" character or the end of the input buffer.
Finally, the output buffer is returned as the result of remove_dot_segments.
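The steps quoted above can be sketched in Ruby roughly as follows. This is an illustrative transcription of steps A through E using two string buffers, not Addressable's actual implementation; the method name `remove_dot_segments` is taken from the RFC text, and everything else is an assumption of this example:

```ruby
# Illustrative transcription of RFC 3986 section 5.2.4 (not Addressable's code).
def remove_dot_segments(path)
  input = path.dup
  output = String.new

  until input.empty?
    if input.start_with?("../")                 # step A
      input = input[3..-1]
    elsif input.start_with?("./")               # step A
      input = input[2..-1]
    elsif input.start_with?("/./")              # step B: "/./" becomes "/"
      input = input[2..-1]
    elsif input == "/."                         # step B: trailing "/."
      input = "/"
    elsif input.start_with?("/../")             # step C: "/../" becomes "/"
      input = input[3..-1]
      output.sub!(%r{/?[^/]*\z}, "")            # drop last output segment
    elsif input == "/.."                        # step C: trailing "/.."
      input = "/"
      output.sub!(%r{/?[^/]*\z}, "")
    elsif input == "." || input == ".."         # step D
      input = ""
    else                                        # step E: move one segment over
      segment = input[%r{\A/?[^/]*}]
      output << segment
      input = input[segment.length..-1]
    end
  end

  output
end

puts remove_dot_segments("/a/b/c/./../../g")
puts remove_dot_segments("mid/content=5/../6")
```

With this sketch, `"/a/b/c/./../../g"` comes out as `"/a/g"` and `"mid/content=5/../6"` as `"mid/6"`, matching the worked examples in the RFC. As written here, step E moves empty segments across unchanged, so an input of `"//login"` is returned as `"//login"`; collapsing the leading slashes would need a separate step.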