Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make iconv //IGNORE behavior conform to POSIX and the docs #16846

Closed
wants to merge 3 commits into from

Conversation

orlitzky
Copy link
Contributor

While working on #16840, I discovered that an old glibc bug workaround is causing our iconv() implementation to go against POSIX and its own docs.

There is a very detailed commit message on the commit that removes the workaround, but in short: the workaround ignores invalid input sequences, when what it should have been doing is ignoring sequences that have no representation in the target encoding. But the need for a workaround is also not detected/applied correctly.

(It would be nice if PHP could abstract the //IGNORE and //TRANSLIT magic away from the underlying iconv, but now that POSIX 2024 includes them, the smart thing to do is just wait.)

Afterwards, there's a test that's testing for the incorrect output. We fix & rename it.

Some iconv implementations allow you to append "//IGNORE" to the
target encoding as an indication that iconv() should skip input
sequences that cannot be represented in that encoding. Both GNU iconv
implementations support this, and apparently so does Solaris. The
behavior has even been codified in the recent POSIX 2024 standard, but
not all implementations support it yet; musl, for example, does not.

There are actually two types of "bad sequences" that iconv can
encounter. The iconv() function translates sequences from an input
encoding to an output encoding. All versions of POSIX, as well as the
documentation for the various implementations, are clear that //IGNORE
should only ignore valid input sequences that cannot be represented in
the target encoding. If the input sequences themselves are invalid,
that is an error (EILSEQ), even when //IGNORE is used.

The PHP documentation for iconv is aligned with this. A priori, iconv
"returns the converted string, or false on failure." And,

  If the string //IGNORE is appended, characters that cannot be
  represented in the target charset are silently discarded. Otherwise,
  E_NOTICE is generated and the function will return false.

But again, this should only have an effect on valid input
sequences. Which brings us to the problem. The current behavior of,

  $text = "aa\xC3\xC3\xC3\xB8aa";
  var_dump(urlencode(iconv("UTF-8", "UTF-8//IGNORE", $text)));

with glibc iconv is to print,

  string(10) "aa%C3%B8aa"

even though the input $text is invalid UTF-8. The reason for this goes
back to PHP bug 48147 which was intended to address a bug in glibc's
iconv implementation with //IGNORE. Prior to PHP bug 52211, PHP's
iconv would return part of the string if iconv failed. And the glibc
bug was making it return the wrong part. So, as part of bug 48147, PHP
added an ICONV_BROKEN_IGNORE test to config.m4, and added an internal
workaround for ignoring EILSEQ errors mid-translation.

Unfortunately there are some problems with this test and the workaround:

  * The test supplies an invalid input sequence to iconv() with
    //IGNORE, and looks for an error. As discussed above, this should
    always be an error. Any implementation conforming to any version
    of POSIX should trigger ICONV_BROKEN_IGNORE. (Tested: glibc and
    musl.)

  * The internal workaround for ICONV_BROKEN_IGNORE ignores EILSEQ
    errors. Again, this is not right, because you will only get EILSEQ
    from invalid input sequences (in POSIX conforming implementations)
    or when //IGNORE is NOT used (GNU implementations).

  * The workaround leaves "//IGNORE" in the target encoding and so
    does nothing to aid implementations like musl where //IGNORE
    is "broken" because it never worked to begin with.

In short, we're always getting the workaround behavior, and the
workaround behavior runs contrary to both POSIX and the PHP
documentation. Invalid input sequences should never be ignored.

This commit removes the workaround, but will be a breaking change for
anyone using //IGNORE on invalid inputs.
The result of this test was used in PHP's iconv() function to
implement a workaround, but the workaround has been removed
and there are no remaining consumers of ICONV_BROKEN_IGNORE.
This test ensures that iconv() with //IGNORE will return part of the
translated string when it is given invalid inputs. POSIX and the PHP
documentation however agree that an error should be returned in that
case: the test is making sure that we get the wrong answer.

PHP's iconv has recently been updated to agree with the standard and
docs. As a result, this test fails, but that's OK -- it should. The
history of iconv in PHP is convoluted, so rather than just update the
expected output here, I think it's best to rename the test and sever
its ties to bug 48147. Referencing that bug at this point can only
cause confusion.
@orlitzky
Copy link
Contributor Author

It looks like libiconv has no problem with the invalid input sequences. I'm just going to give up on this for now. It's been nothing but a headache, and no one has actually complained about the current behavior.

@orlitzky orlitzky closed this Nov 19, 2024
@cmb69
Copy link
Member

cmb69 commented Nov 19, 2024

Yeah, the different behavior of the iconv implementations is a headache, but there is not much we can do (unless we would drop support for anything else but libiconv).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants