Merge pull request #770 from ndw/decode-uri
566: Use fn:decode-from-uri in fn:parse-uri
ndw authored Oct 31, 2023
2 parents 7ddbe06 + a146191 commit 8649197
Showing 2 changed files with 96 additions and 87 deletions.
126 changes: 68 additions & 58 deletions specifications/xpath-functions-40/src/function-catalog.xml
@@ -27534,6 +27534,8 @@ declare function some(
any backslashes (<code>\</code>), replace them with forward
slashes (<code>/</code>).</p>

<p>Strip off the fragment identifier and any query:</p>

<p>If the <emph>string</emph> matches <code>^(.*)#([^#]*)$</code>,
the <emph>string</emph> is the first match group and the
<emph>fragment</emph> is the second match group. Otherwise,
@@ -27546,6 +27548,8 @@ declare function some(
the string is unchanged and the <emph>query</emph> is the empty
sequence.</p>
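The fragment step above can be sketched in Python as follows. This is only an illustration of the quoted regular expression, not the specification's own code: the function name `split_fragment` is hypothetical, and returning `None` when there is no `#` is an assumption modelled on the "empty sequence" wording used for the query.

```python
import re


def split_fragment(string):
    """Split off a trailing fragment using the regex quoted in the text.

    Returns (remaining_string, fragment). The fragment is None when the
    string contains no '#'; that choice is an assumption of this sketch.
    """
    match = re.match(r"^(.*)#([^#]*)$", string)
    if match:
        # Group 1 is the string without the fragment, group 2 the fragment.
        return match.group(1), match.group(2)
    return string, None
```

Because the first group is greedy, only the text after the last `#` is treated as the fragment.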

<p>Attempt to identify the scheme:</p>

<ulist>
<item>
<p>If the <emph>string</emph> matches <code>^[a-zA-Z][:|].*$</code>:</p>
@@ -27573,42 +27577,54 @@ declare function some(
</item>
</ulist>

<p>If the <emph>scheme</emph> is the empty sequence, the
<code>unc-path</code> option is <code>true</code>, and the <emph>string</emph>
matches <code>^//[^/].*$</code>, then the scheme is <code>file</code>
and the <emph>filepath</emph> is the <emph>string</emph>.
</p>
<p>Now that the scheme, if there is one, has been identified,
determine if the URI is hierarchical:</p>

<ulist>
<item>
<p>If the <emph>scheme</emph> is known to be hierarchical, or known
not to be hierarchical, then <emph>hierarchical</emph> is set accordingly.
Exactly which schemes are known to be hierarchical or
non-hierarchical is
<termref def="implementation-defined">implementation-defined</termref>.
If the implementation does not know if a <emph>scheme</emph> is or is not
hierarchical, the <emph>hierarchical</emph> setting depends on the
<emph>string</emph>. If the <emph>string</emph> is the empty string,
<emph>hierarchical</emph> is the empty sequence (<emph>i.e.</emph> not known),
otherwise <emph>hierarchical</emph> is
<code>true</code> if <emph>string</emph> begins with <code>/</code> and <code>false</code> otherwise.</p>
<code>true</code> if <emph>string</emph> begins with <code>/</code> and
<code>false</code> otherwise.</p>
</item>
</ulist>

<p>If <phrase diff="add" at="2023-07-07">the scheme is not known or is known to be <code>file</code> and</phrase>
the <emph>string</emph> matches <code>^//*([a-zA-Z]:.*)$</code>,
the <emph>authority</emph> is empty and the <emph>string</emph> is
the first match group. Otherwise, if the <emph>string</emph>
matches <code>^///*([^/]+)(/.*)?$</code> then the <emph>authority</emph>
is the first match group and the <emph>string</emph> is the second
match group. If the <emph>string</emph> does not match either
regular expression, the <emph>authority</emph> is the empty sequence
and the <emph>string</emph> is unchanged.</p>
<p>Then examine the remaining parts of the string.</p>

<p>If the <emph>string</emph> matches <code>^//*([a-zA-Z]:.*)$</code>,
<ulist>
<item>
<p>If the <emph>scheme</emph> is the empty sequence, the
<code>unc-path</code> option is <code>true</code>, and the
<emph>string</emph> matches <code>^//[^/].*$</code>, then the
scheme is <code>file</code>, the <emph>authority</emph> is
empty, and the <emph>filepath</emph> is the
<emph>string</emph>.
</p>
</item>
<item>
<p>Otherwise:</p>

<ulist>
<item>
<p>If the scheme is not known or is known to be <code>file</code>
and the <emph>string</emph> matches <code>^//*([a-zA-Z]:.*)$</code>,
the <emph>authority</emph> is empty and the <emph>string</emph> is
the first match group. Otherwise, if the <emph>string</emph>
matches <code>^///*([^/]+)(/.*)?$</code> then the <emph>authority</emph>
the first match group.</p></item>
<item><p>Otherwise, if the <emph>string</emph>
matches <code>^///*([^/]+)?(/.*)?$</code>, the <emph>authority</emph>
is the first match group and the <emph>string</emph> is the second
match group. If the <emph>string</emph> does not match either
match group.</p></item>
<item><p>Finally, if the <emph>string</emph> does not match either
regular expression, the <emph>authority</emph> is the empty sequence
and the <emph>string</emph> is unchanged.</p>
and the <emph>string</emph> is unchanged.</p></item>
</ulist>
</item>
</ulist>
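A rough Python sketch of the "Otherwise" branch above (the case where the `unc-path` rule does not apply) may help when reading the two regular expressions. The function name `split_authority` is hypothetical. Representing both "empty" and "the empty sequence" as `None` is an assumption of this sketch, as is treating a scheme of `None` as "not known".

```python
import re


def split_authority(string, scheme=None):
    """Return (authority, remaining_string) for the non-UNC case.

    None stands in for an absent authority or an unmatched group; the
    source distinguishes "empty" from "the empty sequence", which this
    sketch does not attempt to model.
    """
    if scheme is None or scheme == "file":
        # Windows-style drive letter after one or more slashes, e.g. /C:/dir
        match = re.match(r"^//*([a-zA-Z]:.*)$", string)
        if match:
            return None, match.group(1)
    # Two or more slashes, an optional authority, then an optional path.
    match = re.match(r"^///*([^/]+)?(/.*)?$", string)
    if match:
        return match.group(1), match.group(2)
    # Neither expression matched: no authority, string unchanged.
    return None, string
```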

<p>If the <emph>authority</emph> matches
<code>^(([^@]*)@)(.*)(:([^:]*))?$</code>,
@@ -27657,23 +27673,23 @@ declare function some(
<p>Similar care must be taken to match the port because an IPv6/IPvFuture
address may contain a colon.</p>

<olist>
<ulist>
<item>
<p>If the <emph>authority</emph> matches
<code>^(([^@]*)@)?(\[[^\]]*\])(:([^:]*))?$</code>,
then the <emph>port</emph> is match group 5, otherwise
then the <emph>port</emph> is match group 5.
</p>
</item>
<item>
<p>If the <emph>authority</emph> matches
<p>Otherwise, if the <emph>authority</emph> matches
<code>^(([^@]*)@)?([^:]+)(:([^:]*))?$</code>,
then the <emph>port</emph> is match group 5, otherwise
then the <emph>port</emph> is match group 5.
</p>
</item>
<item>
<p>the <emph>port</emph> is the empty sequence.</p>
<p>Otherwise, the <emph>port</emph> is the empty sequence.</p>
</item>
</olist>
</ulist>
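The port rules above translate fairly directly into code. The following Python sketch uses the two regular expressions exactly as quoted; the function name `extract_port` is hypothetical, and `None` is used here to stand in for the empty sequence.

```python
import re


def extract_port(authority):
    """Return the port substring of an authority, or None if there is none."""
    # Bracketed IPv6/IPvFuture host: colons inside [...] are not port separators.
    match = re.match(r"^(([^@]*)@)?(\[[^\]]*\])(:([^:]*))?$", authority)
    if match:
        return match.group(5)
    # Any other host: it may not contain a colon itself.
    match = re.match(r"^(([^@]*)@)?([^:]+)(:([^:]*))?$", authority)
    if match:
        return match.group(5)
    return None
```

Trying the bracketed form first is what keeps the colons inside an address such as `[::1]` from being mistaken for the port separator.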

<p>If the <code>omit-default-ports</code> option is <code>true</code>, the port
is discarded and set to the empty sequence if the port number is the same
@@ -27697,20 +27713,8 @@ declare function some(
separator</emph> and applying <emph>uri decoding</emph> on each
token.</p>

<p>Applying <emph>uri decoding</emph> replaces all occurrences of
plus (<code>+</code>) with spaces and all occurrences of
<code>%[a-fA-F0-9][a-fA-F0-9]</code> with a single character with the
codepoint represented by the two digit hexadecimal number that
follows the <code>%</code> character. In other words, <code>"A%42C"</code> becomes
<code>"ABC"</code> If there are any occurrences of <code>%</code> followed
by up to two characters that are not hexadecimal digits, they are
replaced by the character sequence <code>0xef</code>, <code>0xbf</code>, <code>0xbd</code>
(that is, <code>0xfffd</code>, the Unicode replacement character, in UTF-8).
After replacing all of the percent-escaped characters, the character sequence is
interpreted as UTF-8 to get the string. In other words <code>"A%XYC%Z%F0%9F%92%A9"</code> becomes
<code>"A&#xfffd;C&#xfffd;💩"</code>. <phrase diff="add" at="2023-07-07">If the character sequence is
not a valid sequence of UTF-8 characters, any invalid characters are replaced with the
<code>0xfffd</code>.</phrase></p>
<p>Applying <emph>uri decoding</emph> is equivalent to
calling <code>fn:decode-from-uri</code> on the string.</p>
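For readers without the definition of `fn:decode-from-uri` to hand, the following Python sketch follows the prose description that this change removes: plus signs become spaces, `%HH` escapes become bytes, malformed escapes become U+FFFD, and the result is read as UTF-8. It is one interpretation of that wording, not the normative definition. The name `uri_decode` is hypothetical, and the exact handling of malformed escapes (in particular, stopping at the next `%`) is an assumption chosen to reproduce the example in the removed text.

```python
import re

# Valid escape | malformed escape (up to two non-hex, non-% chars) | plain text
_TOKEN = re.compile(r"%([0-9A-Fa-f]{2})|%[^0-9A-Fa-f%]{0,2}|[^%]+")


def uri_decode(text):
    """Decode a URI component along the lines described in the removed prose."""
    out = bytearray()
    for match in _TOKEN.finditer(text.replace("+", " ")):
        if match.group(1) is not None:
            # A well-formed %HH escape contributes one raw byte.
            out.append(int(match.group(1), 16))
        elif match.group(0).startswith("%"):
            # A malformed escape becomes U+FFFD (assumed interpretation).
            out += "\ufffd".encode("utf-8")
        else:
            out += match.group(0).encode("utf-8")
    # Invalid UTF-8 byte sequences are also replaced with U+FFFD.
    return out.decode("utf-8", errors="replace")
```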

<p>The <emph>query separator</emph> is the value of the
<code>query-separator</code> option.
@@ -28292,20 +28296,26 @@ path with an explicit <code>file:</code> scheme.</p>
<p>The components are derived from the contents of the <code>$parts</code>
map in the following way:</p>

<p>If the <code>scheme</code> key is present in the map, the URI begins
with the value of that key. A URI is considered to be non-hierarchical
if either the <code>hierarchical</code> key is present in the
<code>$parts</code> map with the value
<code>false()</code> or if the scheme is known to be non-hierarchical.
(In other words, schemes are hierarchical by default.)</p>

<p>If the <code>scheme</code> is <code>file</code> and the <code>unc-path</code>
option is <code>true</code>, the scheme is delimited by a trailing <code>:////</code>,
otherwise, if the URI is non-hierarchical, the scheme is delimited by
a trailing <code>:</code>. For all other schemes, it is delimited by
a trailing <code>://</code>. Exactly which schemes are known to be
non-hierarchical is
<termref def="implementation-defined">implementation-defined</termref>.</p>
<p>If the <code>scheme</code> key is present in the map,
the URI begins with the value of that key. A URI is considered to be
non-hierarchical if either the <code>hierarchical</code> key
is present in the <code>$parts</code> map with the value
<code>false()</code> or if the scheme is known to be
non-hierarchical. (In other words, schemes are hierarchical by
default.)</p>

<ulist>
<item><p>If the <code>scheme</code> is
known to be non-hierarchical, it is delimited by a trailing
<code>:</code>.</p>
</item>
<item><p>Otherwise, if the <code>scheme</code> is <code>file</code> and the <code>unc-path</code>
option is <code>true</code>, the scheme is delimited by a trailing <code>:////</code>.</p>
</item>
<item><p>Otherwise, the scheme is delimited by
a trailing <code>://</code>.</p>
</item>
</ulist>
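The delimiter rules above amount to a three-way choice, sketched below in Python. The function name `scheme_delimiter` and its parameter names are hypothetical. The set of non-hierarchical schemes is the minimum list given in the accompanying change to `xpath-functions.xml`; an implementation may know more. Treating a `hierarchical` value of `False` the same as a known non-hierarchical scheme follows the preceding paragraph and is an assumption about how the first list item should be read.

```python
# Schemes the accompanying text requires to be treated as non-hierarchical.
NON_HIERARCHICAL = {"jar", "mailto", "news", "tag", "tel", "urn"}


def scheme_delimiter(scheme, hierarchical=None, unc_path=False):
    """Choose the text that follows the scheme when building a URI."""
    if hierarchical is False or scheme in NON_HIERARCHICAL:
        return ":"
    if scheme == "file" and unc_path:
        return ":////"
    # Schemes are hierarchical by default.
    return "://"
```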

<p>For simplicity of exposition, we take the
<code>userinfo</code>, <code>host</code>, and
@@ -28501,4 +28511,4 @@ path with an explicit <code>file:</code> scheme.</p>
</fos:history>
</fos:function>

</fos:functions>
</fos:functions>
57 changes: 28 additions & 29 deletions specifications/xpath-functions-40/src/xpath-functions.xml
@@ -3305,15 +3305,22 @@ It is recommended that implementers consult <bibref ref="UNICODE-TR18"/> for inf
URIs, to identify their structure, and construct URI strings
from their structured representation.</p>

<p>Some URI schemes are hierarchical and some are non-hierarchical.
Implementations must treat the following schemes as non-hierarchical:
<code>jar</code>, <code>mailto</code>, <code>news</code>, <code>tag</code>,
<code>tel</code>, and <code>urn</code>. Whether additional schemes
are known to be non-hierarchical is
<termref def="implementation-defined">implementation-defined</termref>.
If a scheme is not known to be non-hierarchical, it must be
treated as hierarchical.</p>

<?local-function-index?>

<p>The structured representation of a URI is described by the
<code>uri-structure-record</code>:</p>

<?type uri-structure-record?>



<p>The parts of this structure are:</p>

<table border="0" role="data">
@@ -3361,7 +3368,7 @@ It is recommended that implementers consult <bibref ref="UNICODE-TR18"/> for inf
<td>Parsed and unescaped path segments.</td>
</tr>
<tr>
<td>query-segments</td>
<td>query-parameters</td>
<td>Parsed and unescaped query terms</td>
</tr>
<tr>
@@ -3372,39 +3379,31 @@ It is recommended that implementers consult <bibref ref="UNICODE-TR18"/> for inf
</table>

<p>The segmented forms of the path and query parameters provide
convenient access to commonly used information. They’re represented
in the map as arrays, instead of sequences, just for the convenience
of serializing the structure.</p>
convenient access to commonly used information.</p>

<p>The path, if there is one, is tokenized on “/” characters and
each segment is unesaped. Consider the URI <code>http://example.com/path/to/a%2fb</code>. The path portion has to be returned as <code>/path/to/a%2fb</code> because
each segment is unescaped (as per the <code>fn:decode-from-uri</code> function). Consider the URI
<code>http://example.com/path/to/a%2fb</code>.
The path portion has to be returned as <code>/path/to/a%2fb</code> because
decoding the <code>%2f</code> would change the nature of the path.
The unescaped form is easily accessible from the path-segments array:</p>

<eg>[
"",
"path",
"to",
"a/b"
]</eg>
The unescaped form is easily accessible from the path-segments list:</p>

<eg>("", "path", "to", "a/b")</eg>
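The same result can be reproduced with a few lines of Python, shown here only as an illustration. The function name `path_segments` is hypothetical, and the standard library's `urllib.parse.unquote_plus` is used as a stand-in for `fn:decode-from-uri`; the two may differ on malformed escapes.

```python
from urllib.parse import unquote_plus


def path_segments(path):
    """Split a path on '/' first, then decode each segment separately."""
    return [unquote_plus(segment, errors="replace") for segment in path.split("/")]
```

Splitting before decoding is the point of the design: the escaped `%2f` survives tokenization and only becomes `/` inside its own segment.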

<p>Note that the presence or absence of a leading slash on the path
will affect whether or not the array begins with an empty string.</p>

<p>The query parameters are similarly decoded. Consider the URI:
<p>The query parameters are decoded into a map. Consider the URI:
<code>http://example.com/path?a=1&amp;b=2%264&amp;a=3</code>.
Here the decoded form in the query-segments gives quick access to
the parameter values:</p>

<eg>[
{ "key": "a",
"value": "1" },
{ "key": "b",
"value": "2&amp;4" },
{ "key": "a",
"value": "3" }
]</eg>
<p>Note that both keys and values are unescaped and that it’s an array
of maps because key values can be repeated, as seen for <code>a</code>
The decoded form in the query-parameters is the following map:</p>

<eg>{ "a": ("1", "3"),
"b": "2&amp;4"
}
</eg>
<p>Note that both keys and values are unescaped. If a key
is repeated in the query string, the map will contain a
sequence of values for that key, as seen for <code>a</code>
in this example.</p>
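A small Python sketch of building such a map is given below. The function name `query_parameters` is hypothetical, `urllib.parse.unquote_plus` stands in for `fn:decode-from-uri`, and every value is kept in a list (including single values) because Python has no direct equivalent of an XPath sequence. How tokens without an `=` are treated is not covered by this excerpt, so giving them an empty value is an assumption.

```python
from urllib.parse import unquote_plus


def query_parameters(query, separator="&"):
    """Group decoded query values by decoded key, preserving their order."""
    params = {}
    for token in query.split(separator):
        if not token:
            continue  # skip empty tokens such as those produced by "a=1&&b=2"
        # A token without '=' yields an empty value (assumption of this sketch).
        key, _, value = token.partition("=")
        params.setdefault(unquote_plus(key), []).append(unquote_plus(value))
    return params
```

For the query string of the example URI, `query_parameters("a=1&b=2%264&a=3")` groups the two values of `a` under one key and decodes `%26` to `&` in the value of `b`.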

<div3 id="func-parse-uri">