Initial submission; see Bugzilla for further discussion. The provided patch needs to be reviewed.
Currently, the documentation in ?Comparison provides an excellent description of the challenges in comparing / ordering strings, but provides less information regarding how strings are compared for equality.
The closest snippet I see is this:
Character strings can be compared with different marked encodings (see Encoding): they are translated to UTF-8 before comparison.
However, it doesn't mention whether Unicode normalization is performed when comparing Unicode strings. My understanding is that R doesn't perform any Unicode normalization when comparing strings, so for example:
```r
# LATIN SMALL LETTER C WITH CEDILLA
ch1 <- "\xc3\xa7"
# LATIN SMALL LETTER C + COMBINING CEDILLA
ch2 <- "\x63\xcc\xa7"
# they look the same ...
print(c(ch1, ch2))
#> [1] "ç" "ç"
# but are not considered equal
ch1 == ch2
#> [1] FALSE
```
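To make the difference visible, the two strings can be inspected at the code point and byte level using base R only. This is a minimal sketch that assumes a UTF-8 session; the strings are rewritten with `\u` escapes (my choice, not part of the original example) so that they do not depend on how raw `\x` bytes are interpreted in the native encoding.

```r
# Same two strings as above, written with Unicode escapes (an assumption made
# for portability; the original example uses raw \x bytes).
ch1 <- "\u00e7"   # LATIN SMALL LETTER C WITH CEDILLA (precomposed)
ch2 <- "c\u0327"  # LATIN SMALL LETTER C + COMBINING CEDILLA

# One code point versus two
utf8ToInt(ch1)
#> [1] 231
utf8ToInt(ch2)
#> [1]  99 807

# The underlying UTF-8 byte sequences differ as well, which is consistent
# with `==` reporting FALSE when no normalization is applied
same_bytes <- identical(charToRaw(ch1), charToRaw(ch2))
same_bytes
#> [1] FALSE
```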
I think this behavior should be documented; e.g.
R does not perform Unicode normalization, so different byte sequences that happen to represent the same character will not compare as equal.
I'm not sure whether R provides any user-facing APIs for normalizing Unicode strings, but some packages (e.g. 'utf8') do. If there's a mechanism available for normalizing Unicode strings in R, it could be worth mentioning here.
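As an illustration of the package route, here is a rough sketch using `utf8::utf8_normalize()` from the contributed 'utf8' package, which by default converts its input to a composed normal form (NFC). It assumes the package is installed and a UTF-8 session; the `\u` escapes and the variable names `n1` and `n2` are mine. This is not a base R mechanism, and other contributed packages (for instance `stringi::stri_trans_nfc()`) offer similar functionality.

```r
ch1 <- "\u00e7"   # precomposed c with cedilla
ch2 <- "c\u0327"  # c followed by a combining cedilla

# Guarded so the snippet is a no-op when 'utf8' is not installed
if (requireNamespace("utf8", quietly = TRUE)) {
  # Normalize both strings to the same composed form before comparing
  n1 <- utf8::utf8_normalize(ch1)
  n2 <- utf8::utf8_normalize(ch2)

  print(ch1 == ch2)  # raw comparison, no normalization
  print(n1 == n2)    # comparison after normalization
}
```

Normalizing both operands first keeps the comparison itself as a plain `==`, so the rest of the code does not need to change.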
Bug 18252: Document how unicode strings are compared for equality