Bug 18252: Document how unicode strings are compared for equality #42

shannonpileggi · 2024-08-09T20:56:03Z

Bug 18252: Document how unicode strings are compared for equality

Initial submission; see Bugzilla for further discussion. The provided patch needs to be reviewed.

Currently, the documentation in ?Comparison provides an excellent description of the challenges in comparing / ordering strings, but provides less information regarding how strings are compared for equality.

The closest snippet I see is this:

Character strings can be compared with different marked encodings (see Encoding): they are translated to UTF-8 before comparison.

However, it doesn't make any mention of whether unicode normalization is performed when comparing unicode strings. My understanding is that R doesn't perform any sort of unicode normalization when comparing strings, so for example:

# LATIN SMALL LETTER C WITH CEDILLA
ch1 <- "\xc3\xa7"

# LATIN SMALL LETTER C + COMBINING CEDILLA
ch2 <- "\x63\xcc\xa7"

# they look the same ...
print(c(ch1, ch2))
#> [1] "ç" "ç"

# but are not considered equal
ch1 == ch2
#> [1] FALSE

I think this behavior should be documented, e.g.:

R does not perform unicode normalization, so different byte sequences that happen to represent the same character will not compare as equal.
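
To make the difference concrete, here is a small base-R sketch that inspects the bytes and code points behind the two strings from the example. It builds them with intToUtf8() rather than "\x" escapes (my choice, not part of the original report) so the result does not depend on the session's native encoding.

```r
# Build both strings from code points; intToUtf8() returns UTF-8 strings.
ch1 <- intToUtf8(0x00E7)             # precomposed: U+00E7
ch2 <- intToUtf8(c(0x0063, 0x0327))  # decomposed: U+0063 followed by U+0327

# The underlying byte sequences differ ...
charToRaw(ch1)
#> [1] c3 a7
charToRaw(ch2)
#> [1] 63 cc a7

# ... because the code point sequences differ.
utf8ToInt(ch1)
#> [1] 231
utf8ToInt(ch2)
#> [1]  99 807

# So the strings are neither equal nor identical.
ch1 == ch2
#> [1] FALSE
identical(ch1, ch2)
#> [1] FALSE
```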

I'm not sure whether R provides any user-facing APIs for normalizing unicode strings, but some packages (e.g. 'utf8') do. If there's a mechanism available for normalizing unicode strings in R, it could be worth mentioning here.
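
As a rough sketch of what normalizing before comparison could look like, the following uses utf8_normalize() from the 'utf8' package mentioned above. It assumes that package is installed; the helper name equal_normalized() is hypothetical and only for illustration, and this is not a claim about what the documentation should recommend.

```r
# Hypothetical helper: compare strings after Unicode normalization.
# Assumes the 'utf8' package is available (install.packages("utf8")).
equal_normalized <- function(x, y) {
  if (!requireNamespace("utf8", quietly = TRUE)) {
    stop("the 'utf8' package is required for this sketch")
  }
  # utf8_normalize() converts its input to a composed normal form,
  # so canonically equivalent sequences should map to the same string.
  utf8::utf8_normalize(x) == utf8::utf8_normalize(y)
}

ch1 <- intToUtf8(0x00E7)             # precomposed c with cedilla
ch2 <- intToUtf8(c(0x0063, 0x0327))  # c + combining cedilla

ch1 == ch2  # plain comparison: no normalization is applied

if (requireNamespace("utf8", quietly = TRUE)) {
  print(equal_normalized(ch1, ch2))  # comparison after normalization
}
```

Other packages offer similar functionality; for instance, 'stringi' provides stri_trans_nfc() for NFC normalization.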


shannonpileggi added the Documentation (Issues in the documentation), review patch (Test/review the proposed patch), and Hutch 2024 (Issues reserved for R Dev Day @ Hutch 2024) labels on Aug 9, 2024

hturner removed the Hutch 2024 (Issues reserved for R Dev Day @ Hutch 2024) label on Aug 22, 2024
