Add AsciiSet::EMPTY and boolean operators #969

Merged: 3 commits, Sep 19, 2024

@joshka (Contributor) commented Sep 19, 2024

In RFCs, the sets of characters to percent-encode are often defined as
the union of multiple sets. This change adds an EMPTY constant to
AsciiSet and implements the Add trait for AsciiSet so that sets
can be combined with the + operator. The ! negation operator is also
defined, as well as equivalent constant functions for these (union(),
complement()).

AsciiSet now derives Debug, PartialEq, and Eq so that it can be
used in tests.


Example: https://www.rfc-editor.org/rfc/rfc3986#section-2.2 defines:

   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

With the new union() method, this can be represented as:

const SUB_DELIMS: AsciiSet = AsciiSet::EMPTY.add(b'!').add(b'$') ...
const GEN_DELIMS: AsciiSet = AsciiSet::EMPTY.add(b':').add(b'/') ...
const RESERVED: AsciiSet = GEN_DELIMS.union(SUB_DELIMS);
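
To make the set algebra concrete, here is a minimal, self-contained sketch. `MiniSet` is a hypothetical stand-in for `AsciiSet` backed by a single `u128`; it is not the crate's implementation, and only mirrors the names described in this PR (`EMPTY`, `add`, `union`, `complement`, `+`, `!`). The `contains` method is added purely so the result can be inspected.

```rust
use std::ops::{Add, Not};

/// Hypothetical stand-in for `AsciiSet`: bit `b` is set when ASCII byte `b`
/// (0x00..=0x7F) is a member. Not the percent-encoding crate's real layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MiniSet {
    mask: u128,
}

impl MiniSet {
    /// The set containing no bytes.
    pub const EMPTY: MiniSet = MiniSet { mask: 0 };

    /// Returns a copy of the set with `byte` added. Panics on non-ASCII input.
    pub const fn add(self, byte: u8) -> MiniSet {
        assert!(byte < 0x80, "only ASCII bytes can be members");
        MiniSet { mask: self.mask | (1u128 << (byte as u32)) }
    }

    /// Returns true if `byte` is a member (illustration-only helper).
    pub const fn contains(self, byte: u8) -> bool {
        byte < 0x80 && (self.mask >> (byte as u32)) & 1 == 1
    }

    /// Set union, usable in `const` contexts.
    pub const fn union(self, other: MiniSet) -> MiniSet {
        MiniSet { mask: self.mask | other.mask }
    }

    /// Set complement over the 128 ASCII bytes, usable in `const` contexts.
    pub const fn complement(self) -> MiniSet {
        MiniSet { mask: !self.mask }
    }
}

/// `a + b` is the union; trait methods are not `const`, hence `union()` too.
impl Add for MiniSet {
    type Output = MiniSet;
    fn add(self, other: MiniSet) -> MiniSet {
        self.union(other)
    }
}

/// `!a` is the complement.
impl Not for MiniSet {
    type Output = MiniSet;
    fn not(self) -> MiniSet {
        self.complement()
    }
}

// The RFC 3986 sets quoted above, spelled out in full.
const GEN_DELIMS: MiniSet = MiniSet::EMPTY
    .add(b':').add(b'/').add(b'?').add(b'#').add(b'[').add(b']').add(b'@');
const SUB_DELIMS: MiniSet = MiniSet::EMPTY
    .add(b'!').add(b'$').add(b'&').add(b'\'').add(b'(').add(b')')
    .add(b'*').add(b'+').add(b',').add(b';').add(b'=');
const RESERVED: MiniSet = GEN_DELIMS.union(SUB_DELIMS);

fn main() {
    // The operator and the const fn agree.
    assert_eq!(RESERVED, GEN_DELIMS + SUB_DELIMS);
    assert!(RESERVED.contains(b'?') && RESERVED.contains(b'='));
    // The complement holds exactly the bytes that are not reserved.
    assert!((!RESERVED).contains(b'a'));
    assert!(!(!RESERVED).contains(b'?'));
    println!("{:?}", RESERVED);
}
```

Deriving `PartialEq` and `Debug` is what lets `assert_eq!` compare two sets directly, which is the testing use case the description mentions.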

Similarly, the set of characters that must be encoded is the complement of the set of allowed characters:

https://www.rfc-editor.org/rfc/rfc3986#section-2.2

URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component.
If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.

So a part like query is defined in https://www.rfc-editor.org/rfc/rfc3986#appendix-A as:

   query         = *( pchar / "/" / "?" )
   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   pct-encoded   = "%" HEXDIG HEXDIG
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

which can be translated to:

const QUERY: AsciiSet = PCHAR.add(b'/').add(b'?');
const PCHAR: AsciiSet = UNRESERVED.union(SUB_DELIMS).add(b':').add(b'@');
const UNRESERVED: AsciiSet = ...

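// which can then be used like this (utf8_percent_encode takes a &'static AsciiSet,
// so the complement is bound to a const reference; the name is illustrative):
const QUERY_ENCODE_SET: &AsciiSet = &QUERY.complement();
let encoded_query = utf8_percent_encode("foo?:@/=bar", QUERY_ENCODE_SET);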
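
As a rough illustration of the "allowed set, then complement" approach, the sketch below builds the query set from the ABNF above and encodes everything outside it. `Set`, `add_range`, and `percent_encode` are hypothetical helpers written for this example, not the crate's API. As in the translation above, `pct-encoded` is not modeled as a set member, so a literal `%` is itself encoded.

```rust
/// Hypothetical bitmask set: bit `b` is set when ASCII byte `b` is a member.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Set(u128);

impl Set {
    const EMPTY: Set = Set(0);

    const fn add(self, byte: u8) -> Set {
        assert!(byte < 0x80, "only ASCII bytes can be members");
        Set(self.0 | (1u128 << (byte as u32)))
    }

    /// Adds every byte in the inclusive range `lo..=hi` (illustration-only).
    const fn add_range(self, lo: u8, hi: u8) -> Set {
        let mut set = self;
        let mut b = lo;
        while b <= hi {
            set = set.add(b);
            b += 1;
        }
        set
    }

    const fn union(self, other: Set) -> Set {
        Set(self.0 | other.0)
    }

    const fn complement(self) -> Set {
        Set(!self.0)
    }

    fn contains(self, byte: u8) -> bool {
        byte < 0x80 && (self.0 >> (byte as u32)) & 1 == 1
    }
}

// unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
const UNRESERVED: Set = Set::EMPTY
    .add_range(b'a', b'z').add_range(b'A', b'Z').add_range(b'0', b'9')
    .add(b'-').add(b'.').add(b'_').add(b'~');
const SUB_DELIMS: Set = Set::EMPTY
    .add(b'!').add(b'$').add(b'&').add(b'\'').add(b'(').add(b')')
    .add(b'*').add(b'+').add(b',').add(b';').add(b'=');
const PCHAR: Set = UNRESERVED.union(SUB_DELIMS).add(b':').add(b'@');
const QUERY: Set = PCHAR.add(b'/').add(b'?');
// Everything that is NOT allowed in a query must be encoded.
const QUERY_ENCODE: Set = QUERY.complement();

/// Percent-encodes every byte in `encode`, plus all non-ASCII bytes.
fn percent_encode(input: &str, encode: Set) -> String {
    let mut out = String::new();
    for &byte in input.as_bytes() {
        if byte >= 0x80 || encode.contains(byte) {
            out.push_str(&format!("%{:02X}", byte));
        } else {
            out.push(byte as char);
        }
    }
    out
}

fn main() {
    // Every byte of the PR's sample input is allowed in a query.
    assert_eq!(percent_encode("foo?:@/=bar", QUERY_ENCODE), "foo?:@/=bar");
    // Space and '#' are outside the query set, so they are encoded.
    assert_eq!(percent_encode("a b#c", QUERY_ENCODE), "a%20b%23c");
}
```

Note that the sample input `"foo?:@/=bar"` consists entirely of characters allowed in a query, so encoding it against the complement of the query set leaves it unchanged.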

@joshka joshka changed the title Add AsciiSet::EMPTY and impl ops::Add for AsciiSet Add AsciiSet::EMPTY and impl Add and Not for AsciiSet Sep 19, 2024
@joshka joshka changed the title Add AsciiSet::EMPTY and impl Add and Not for AsciiSet Add AsciiSet::EMPTY and operators Sep 19, 2024
@joshka joshka changed the title Add AsciiSet::EMPTY and operators Add AsciiSet::EMPTY and boolean operators Sep 19, 2024
codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@9404ff5). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #969   +/-   ##
=======================================
  Coverage        ?   81.85%           
=======================================
  Files           ?       21           
  Lines           ?     3560           
  Branches        ?        0           
=======================================
  Hits            ?     2914           
  Misses          ?      646           
  Partials        ?        0           


@valenting valenting added this pull request to the merge queue Sep 19, 2024
Merged via the queue into servo:main with commit 5505565 Sep 19, 2024
14 checks passed
@joshka (Contributor, Author) commented Sep 19, 2024

Thanks!

@@ -77,6 +78,11 @@ const ASCII_RANGE_LEN: usize = 0x80;
const BITS_PER_CHUNK: usize = 8 * mem::size_of::<Chunk>();

impl AsciiSet {
    /// An empty set.
    pub const EMPTY: AsciiSet = AsciiSet {

@ForsakenHarmony ForsakenHarmony Sep 19, 2024

This seems like it's now inconsistent with the existing constants and functions taking &'static AsciiSet.

@joshka (Contributor, Author) Sep 20, 2024

Yeah, I wasn't 100% sure about that. I went with EMPTY being a constant on the AsciiSet as empty seems like an inherent property of a type, but the other constants seem like usages of AsciiSet. I was 70/30% on this being right, so wouldn't object to this being changed to be consistent with the other constants.

The rationale for making the constants references rather than just values all seemed odd to me. What was that necessary for?

Edit: let's discuss on #970 instead of here.
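
For readers following the value-versus-reference question, the sketch below shows how the two constant styles interact with a function that takes `&'static`. `Set` and `count_members` are hypothetical stand-ins, not the crate's types; the sketch only demonstrates language behaviour and does not say which style the crate should adopt.

```rust
/// Hypothetical stand-in for a bitmask set type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Set(u128);

impl Set {
    /// By-value associated constant, in the style of `AsciiSet::EMPTY`.
    const EMPTY: Set = Set(0);

    const fn add(self, byte: u8) -> Set {
        assert!(byte < 0x80, "only ASCII bytes can be members");
        Set(self.0 | (1u128 << (byte as u32)))
    }
}

/// Hypothetical API that requires a `&'static` set, like the functions
/// mentioned in the review comment.
fn count_members(set: &'static Set) -> u32 {
    set.0.count_ones()
}

// Style 1: the constant is itself a reference.
const SPACE_REF: &Set = &Set::EMPTY.add(b' ');
// Style 2: the constant is a plain value.
const SPACE_VAL: Set = Set::EMPTY.add(b' ');

fn main() {
    // A reference constant is passed as-is.
    assert_eq!(count_members(SPACE_REF), 1);
    // Borrowing a value constant also yields a `&'static` reference,
    // because the compiler promotes the constant to static storage.
    assert_eq!(count_members(&SPACE_VAL), 1);
    assert_eq!(count_members(&Set::EMPTY), 0);
}
```

The value style costs an explicit `&` at the call site; a set computed inline with a `const fn` call, however, still has to be bound to a `const` first before it can be borrowed as `&'static`.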
