Make iconv //IGNORE behavior conform to POSIX and the docs #16846

orlitzky · 2024-11-18T15:09:46Z

While working on #16840, I discovered that an old glibc bug workaround is causing our iconv() implementation to go against POSIX and its own docs.

There is a very detailed commit message on the commit that removes the workaround, but in short: the workaround ignores invalid input sequences, when what it should have been doing is ignoring sequences that have no representation in the target encoding. But the need for a workaround is also not detected/applied correctly.

(It would be nice if PHP could abstract the //IGNORE and //TRANSLIT magic away from the underlying iconv, but now that POSIX 2024 includes them, the smart thing to do is just wait.)

Afterwards, there's a test that's testing for the incorrect output. We fix & rename it.

Some iconv implementations allow you to append "//IGNORE" to the target encoding as an indication that iconv() should skip input sequences that cannot be represented in that encoding. Both GNU iconv implementations support this, and apparently so does Solaris. The behavior has even been codified in the recent POSIX 2024 standard, but not all implementations support it yet; musl, for example, does not. There are actually two types of "bad sequences" that iconv can encounter. The iconv() function translates sequences from an input encoding to an output encoding. All versions of POSIX, as well as the documentation for the various implementations, are clear that //IGNORE should only ignore valid input sequences that cannot be represented in the target encoding. If the input sequences themselves are invalid, that is an error (EILSEQ), even when //IGNORE is used. The PHP documentation for iconv is aligned with this. A priori, iconv "returns the converted string, or false on failure." And, If the string //IGNORE is appended, characters that cannot be represented in the target charset are silently discarded. Otherwise, E_NOTICE is generated and the function will return false. But again, this should only have an effect on valid input sequences. Which brings us to the problem. The current behavior of, $text = "aa\xC3\xC3\xC3\xB8aa"; var_dump(urlencode(iconv("UTF-8", "UTF-8//IGNORE", $text))); with glibc iconv is to print, string(10) "aa%C3%B8aa" even though the input $text is invalid UTF-8. The reason for this goes back to PHP bug 48147 which was intended to address a bug in glibc's iconv implementation with //IGNORE. Prior to PHP bug 52211, PHP's iconv would return part of the string if iconv failed. And the glibc bug was making it return the wrong part. So, as part of bug 48147, PHP added an ICONV_BROKEN_IGNORE test to config.m4, and added an internal workaround for ignoring EILSEQ errors mid-translation. Unfortunately there are some problems with this test and the workaround: * The test supplies an invalid input sequence to iconv() with //IGNORE, and looks for an error. As discussed above, this should always be an error. Any implementation conforming to any version of POSIX should trigger ICONV_BROKEN_IGNORE. (Tested: glibc and musl.) * The internal workaround for ICONV_BROKEN_IGNORE ignores EILSEQ errors. Again, this is not right, because you will only get EILSEQ from invalid input sequences (in POSIX conforming implementations) or when //IGNORE is NOT used (GNU implementations). * The workaround leaves "//IGNORE" in the target encoding and so does nothing to aid implementations like musl where //IGNORE is "broken" because it never worked to begin with. In short, we're always getting the workaround behavior, and the workaround behavior runs contrary to both POSIX and the PHP documentation. Invalid input sequences should never be ignored. This commit removes the workaround, but will be a breaking change for anyone using //IGNORE on invalid inputs.

The result of this test was used in PHP's iconv() function to implement a workaround, but the workaround has been removed and there are no remaining consumers of ICONV_BROKEN_IGNORE.

This test ensures that iconv() with //IGNORE will return part of the translated string when it is given invalid inputs. POSIX and the PHP documentation however agree that an error should be returned in that case: the test is making sure that we get the wrong answer. PHP's iconv has recently been updated to agree with the standard and docs. As a result, this test fails, but that's OK -- it should. The history of iconv in PHP is convoluted, so rather than just update the expected output here, I think it's best to rename the test and sever its ties to bug 48147. Referencing that bug at this point can only cause confusion.

orlitzky · 2024-11-19T03:25:23Z

It looks like libiconv has no problem with the invalid input sequences. I'm just going to give up on this for now. It's been nothing but a headache, and no one has actually complained about the current behavior.

cmb69 · 2024-11-19T11:18:26Z

Yeah, the different behavior of the iconv implementations is a headache, but there is not much we can do (unless we would drop support for anything else but libiconv).

orlitzky added 3 commits November 18, 2024 08:45

ext/iconv/config.m4: drop ICONV_BROKEN_IGNORE test

0535da4

The result of this test was used in PHP's iconv() function to implement a workaround, but the workaround has been removed and there are no remaining consumers of ICONV_BROKEN_IGNORE.

github-actions bot added Category: Build System Extension: iconv labels Nov 18, 2024

Girgias requested a review from petk November 18, 2024 15:15

orlitzky closed this Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make iconv //IGNORE behavior conform to POSIX and the docs #16846

Make iconv //IGNORE behavior conform to POSIX and the docs #16846

orlitzky commented Nov 18, 2024

orlitzky commented Nov 19, 2024

cmb69 commented Nov 19, 2024

Make iconv //IGNORE behavior conform to POSIX and the docs #16846

Make iconv //IGNORE behavior conform to POSIX and the docs #16846

Conversation

orlitzky commented Nov 18, 2024

orlitzky commented Nov 19, 2024

cmb69 commented Nov 19, 2024