Bug 18252: Document how unicode strings are compared for equality #42

Open
shannonpileggi opened this issue Aug 9, 2024 · 0 comments
Labels
Documentation Issues in the documentation review patch Test/review the proposed patch

Initial submission; see Bugzilla for further discussion. The provided patch needs to be reviewed.

Currently, the documentation in ?Comparison gives an excellent description of the challenges in comparing and ordering strings, but says much less about how strings are compared for equality.

The closest snippet I see is this:

Character strings can be compared with different marked encodings (see Encoding): they are translated to UTF-8 before comparison.
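
As a minimal sketch of what that sentence describes (the names `x` and `y` and the choice of latin1 are assumptions for illustration, not taken from the help page), the same character stored under two different marked encodings compares as equal even though the underlying bytes differ:

```r
# Illustrative only: x, y, bytes_x, bytes_y and same are hypothetical names.
x <- "\xe7"                # a single byte, 0xE7
Encoding(x) <- "latin1"    # mark it as Latin-1: LATIN SMALL LETTER C WITH CEDILLA
y <- "\u00e7"              # the same character, stored and marked as UTF-8

bytes_x <- charToRaw(x)    # e7
bytes_y <- charToRaw(y)    # c3 a7

# The bytes and marked encodings differ, but both are translated to UTF-8
# before comparison, so the strings compare as equal.
same <- x == y
print(same)
#> [1] TRUE
```

This covers differences in encoding only; it says nothing about strings that are both UTF-8 but use different code point sequences.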

However, it does not say whether Unicode normalization is performed when comparing Unicode strings. My understanding is that R does not perform any Unicode normalization when comparing strings, so for example:

# LATIN SMALL LETTER C WITH CEDILLA
ch1 <- "\xc3\xa7"

# LATIN SMALL LETTER C + COMBINING CEDILLA
ch2 <- "\x63\xcc\xa7"

# they look the same ...
print(c(ch1, ch2))
#> [1] "ç" "ç"

# but are not considered equal
ch1 == ch2
#> [1] FALSE

I think this behavior should be documented, e.g.:

R does not perform Unicode normalization, so canonically equivalent strings stored as different byte sequences (for example, a precomposed character versus a base letter followed by a combining mark) will not compare as equal.
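
To make the "different byte sequences" point concrete, here is a small base-R sketch that inspects the two strings from the example. It uses `\u` escapes rather than raw `\x` bytes so that it does not depend on the session's locale; that substitution, and the names `cp1`, `cp2`, `b1` and `b2`, are my own choices for illustration.

```r
# Same two strings as in the example, written with \u escapes
# (an assumption made for locale independence).
ch1 <- "\u00e7"     # LATIN SMALL LETTER C WITH CEDILLA (precomposed)
ch2 <- "c\u0327"    # LATIN SMALL LETTER C + COMBINING CEDILLA (decomposed)

# Code points: one for the precomposed form, two for the decomposed form
cp1 <- utf8ToInt(ch1)
cp2 <- utf8ToInt(ch2)
print(cp1)
#> [1] 231
print(cp2)
#> [1]  99 807

# Underlying UTF-8 bytes
b1 <- charToRaw(ch1)
b2 <- charToRaw(ch2)
print(b1)
#> [1] c3 a7
print(b2)
#> [1] 63 cc a7

# Equality follows the stored sequences, not the rendered glyph
ch1 == ch2
#> [1] FALSE
```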

I'm not sure whether R provides any user-facing APIs for normalizing unicode strings, but some packages (e.g. 'utf8') do. If there's a mechanism available for normalizing unicode strings in R, it could be worth mentioning here.
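
For reference, a hedged sketch of what normalizing before comparison could look like with the 'utf8' package mentioned above. It assumes that package is installed and relies on `utf8::utf8_normalize()`, which by default converts to NFC; the names `n1` and `n2` are hypothetical. This illustrates a possible workaround, not something base R does for you.

```r
ch1 <- "\u00e7"     # precomposed c with cedilla
ch2 <- "c\u0327"    # c followed by a combining cedilla

# Guarded so the sketch is a no-op when 'utf8' is not installed.
if (requireNamespace("utf8", quietly = TRUE)) {
  # Normalize both strings to NFC before comparing (assumed workaround)
  n1 <- utf8::utf8_normalize(ch1)
  n2 <- utf8::utf8_normalize(ch2)
  print(n1 == n2)
  #> [1] TRUE
}
```

The 'stringi' package offers `stri_trans_nfc()` for the same purpose, if that dependency is preferred.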
@shannonpileggi shannonpileggi added Documentation Issues in the documentation review patch Test/review the proposed patch Hutch 2024 Issues reserved for R Dev Day @ Hutch 2024 labels Aug 9, 2024
@hturner hturner removed the Hutch 2024 Issues reserved for R Dev Day @ Hutch 2024 label Aug 22, 2024