ByteSlice should offer more functions taking u8 #12
Comments
Are you certain? Have you benchmarked it? If you search (or split, in this case) with a string containing exactly one byte, then
I think I'd be okay with these. Probably call them,
Possibly. Its focus is certainly on treating bytes as if they were UTF-8, but there are plenty of functions that don't care about UTF-8 at all (e.g., I'm generally not opposed to loading up the API with more methods, but each should definitely be considered on its own merit. I think this crate is still finding its footing with respect to what the right API should be, so more issues are welcome.
@thomcc Oh, I also meant to ask, I'm always looking to understand use cases better. Could you share more about what you're using bstr for, if possible? (As in, what high level problem are you trying to solve.) If not, no worries!
Ah, my bad 😅. I read the code instead of running benchmarks, but missed the special case that just calls find_byte. Embarrassing. In that case, disregard that one. It feels a little weird, but it's not worth adding single-byte variants of everything.
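For readers who want to see what splitting on a single byte looks like without any string-oriented API, here is a minimal std-only sketch. It does not use bstr at all, and `split_on_byte` is a hypothetical helper name, not something the crate provides.

```rust
/// Hypothetical helper (not part of bstr): split `haystack` on every
/// occurrence of the single byte `delim`, using only the standard library.
fn split_on_byte(haystack: &[u8], delim: u8) -> Vec<&[u8]> {
    // `[u8]::split` takes a predicate over elements, so no UTF-8
    // decoding is involved at any point.
    haystack.split(|&b| b == delim).collect()
}
```

Note that a trailing delimiter yields a final empty field, which matches how `[u8]::split` behaves in std.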
I see. Honestly, I'm mostly using it for a feature-rich &[u8], and the ability to more easily debug byte-strings that are output. However, what text is there is mostly at least ASCII, and there's enough of it that the debug output is generally more useful than if it were all hex. (That said, a lot of the API I'm not currently using interests me a lot. In particular, the explicit control over UTF-8 decoding is vastly more convenient than creating a one-off .chars() iterator, let alone
I'll see. I'm more likely to take #13 because it addresses similar use cases, but can actually be optimized, unlike a generic user callback.
Okay, I had some others too, which I'll file after I give them a little more thought.
@BurntSushi One thing I ran into recently is the trim family of functions cannot be used for trimming invalid UTF-8 bytes since it is not possible to distinguish between the replacement character sigil and the actual replacement character. I believe
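The ambiguity described here can be illustrated with a small std-only sketch. It uses `String::from_utf8_lossy` as a stand-in for any lossy decoder (it is not bstr's decoder), and both function names are hypothetical. After lossy decoding, an invalid trailing byte and a literal U+FFFD both show up as the same `char`; only the raw bytes tell them apart.

```rust
/// UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER.
const REPLACEMENT_BYTES: [u8; 3] = [0xEF, 0xBF, 0xBD];

/// Hypothetical helper: true only if `bytes` ends with a literal,
/// validly encoded U+FFFD (as opposed to some invalid byte sequence).
fn ends_with_literal_replacement(bytes: &[u8]) -> bool {
    bytes.ends_with(&REPLACEMENT_BYTES)
}

/// Hypothetical helper: the last character seen after lossy decoding.
/// Invalid input and a literal U+FFFD are indistinguishable here.
fn last_char_lossy(bytes: &[u8]) -> Option<char> {
    String::from_utf8_lossy(bytes).chars().next_back()
}
```

A predicate that only ever sees a `char` is in the position of `last_char_lossy`: it cannot tell which of the two cases it is looking at.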
@lopopolo Sorry, I'm having trouble parsing your comment. The first part seems to be discussing a deficiency with the trim routines while the second part seems to be referring to the return type of other routines. (Which routines?) I'm unsure how these are connected and don't otherwise quite understand the second part. The trim routines seem fairly string oriented to me. It seems a little weird to want to trim by something other than codepoints. Can you say what your use case is? If it's something reasonable, then I'm not opposed to adding
All functions that either yield
For APIs like One example API I'd like to implement is
use bstr::ByteSlice;
let buf = &b"xyz\xFF"[..];
enum State {
None,
TrimUtf8Char,
InvalidUtf8,
}
let mut state = State::None;
let chopped = buf.trim_end_with(|ch: Option<char>| match (&mut state, ch) {
(state @ State::None, None) => {
*state = State::InvalidUtf8;
false
}
(state @ State::None, Some(_)) => {
*state = State::TrimUtf8Char;
true
}
_ => false,
});
let buf = match state {
State::None => buf,
State::TrimUtf8Char => chopped,
State::InvalidUtf8 => &buf[..buf.len() - 1][..],
};
Instead, using
use std::char::REPLACEMENT_CHARACTER;
use bstr::ByteSlice;
const REPLACEMENT_CHARACTER_BYTES: [u8; 3] = [239, 191, 189];
let buf = &b"xyz\xFF"[..];
let truncate_to = if let Some((start, end, ch)) = buf.char_indices().rev().next() {
match ch {
REPLACEMENT_CHARACTER if buf[start..end] == REPLACEMENT_CHARACTER_BYTES[..] => {
// A literal Unicode replacement character is present in the
// string, so remove the entire character
start
}
// Invalid UTF-8, pop the last byte only.
REPLACEMENT_CHARACTER => buf.len() - 1,
// A valid UTF-8 character was found so remove the entire
// character.
_ => start,
}
} else {
0
};
let buf = &buf[..truncate_to][..];
I'm not sure which code I like better, but I end up not being able to use any of the
The Ruby API behaves like this:
[2.6.6] > s = "abc\xFF"
=> "abc\xFF"
[2.6.6] > s.chop
=> "abc"
[2.6.6] > s = "abc"
=> "abc"
[2.6.6] > s.chop
=> "ab"
[2.6.6] > s = "abc�"
=> "abc�"
[2.6.6] > s.chop
=> "abc"
[2.6.6] > s = "💎"
=> "💎"
[2.6.6] > s.chop
=> ""
@lopopolo Thanks for responding! There's a lot to unpack here I think... So the first thing that sticks out to me is that the As for
Yeah that makes things a little tricky, particularly since bstr uses the "substitution of maximal subparts" strategy. More generally, when I think about use cases for library routines, especially when it's about mild ergonomic improvements, I don't generally put too much weight into "implement another language's standard library API." For the most part, I think that's partially kicking the can down the road. With all that said, I actually think you're missing the simplest implementation path here. It's not to use
fn chop(slice: &[u8]) -> &[u8] {
let (cp, size) = bstr::decode_last_utf8(slice);
let cp = match cp {
Some(cp) => cp,
None => return &slice[..slice.len().saturating_sub(1)],
};
let chopped = &slice[..slice.len()-size];
if cp == '\n' && chopped.last() == Some(&b'\r') {
return &chopped[..chopped.len()-1];
}
chopped
}
Playground link with assorted test cases: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=6da060e4e9fadcfab8418e1f29a30eb2
Cases like this are why The
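For anyone who wants to check the Ruby cases listed earlier without pulling in bstr, the following is a rough std-only sketch of the same idea. It is an assumption-laden variant, not the code from the comment above: `chop_std` is a hypothetical name, and it finds the last character by probing suffixes with `std::str::from_utf8` instead of calling `bstr::decode_last_utf8`.

```rust
/// Hypothetical std-only variant of a Ruby-style `chop` on raw bytes.
fn chop_std(slice: &[u8]) -> &[u8] {
    let n = slice.len();
    // The shortest suffix (1 to 4 bytes) that is valid UTF-8 must be
    // exactly one character, so its length is the size of the last char.
    let char_len = (1..=n.min(4)).find(|&k| std::str::from_utf8(&slice[n - k..]).is_ok());
    let chopped = match char_len {
        Some(k) => &slice[..n - k],
        // No valid trailing character: drop a single byte (or nothing
        // if the input is empty).
        None => return &slice[..n.saturating_sub(1)],
    };
    // Like Ruby, treat a trailing "\r\n" as one unit.
    if slice.last() == Some(&b'\n') && chopped.last() == Some(&b'\r') {
        return &chopped[..chopped.len() - 1];
    }
    chopped
}
```

The invalid-byte case and the literal U+FFFD case take different branches here because the decision is made on the raw bytes, before any replacement character is substituted.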
Thanks for taking the time to address my concerns @BurntSushi. It appears I missed
Understood that this is what I'm trying to do with
Oh wow you are right, these APIs do let me do exactly what I want. I'm a bit ashamed for missing these. Thanks for pointing me in the right direction.
No problem! I guess they could also be routines in
I'm using bstr to help parse a few formats, some of them are text (but not necessarily utf8) and some of them are binary but contain strings.
I have a few things that I wished were present:
- The various `split` functions should have variants for passing a byte instead of just str/char. I ended up using `split_str(b"\0")` and `split_str(b"\xff")` a few times which is going to be less efficient than directly invoking memchr.
- Versions of `fields_with`/`trim_start_with`/`trim_end_with` which pass their function the byte instead, and don't bother with UTF-8 decoding.
It seems possible that you're more interested in this being useful for probably-text case than for (e.g. the emphasis is on the `str`, and not the `b`). If that's the case, sorry for this and the next bug I'm going to file!