-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make iconv //IGNORE behavior conform to POSIX and the docs #16846
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Some iconv implementations allow you to append "//IGNORE" to the target encoding as an indication that iconv() should skip input sequences that cannot be represented in that encoding. Both GNU iconv implementations support this, and apparently so does Solaris. The behavior has even been codified in the recent POSIX 2024 standard, but not all implementations support it yet; musl, for example, does not. There are actually two types of "bad sequences" that iconv can encounter. The iconv() function translates sequences from an input encoding to an output encoding. All versions of POSIX, as well as the documentation for the various implementations, are clear that //IGNORE should only ignore valid input sequences that cannot be represented in the target encoding. If the input sequences themselves are invalid, that is an error (EILSEQ), even when //IGNORE is used. The PHP documentation for iconv is aligned with this. A priori, iconv "returns the converted string, or false on failure." And, If the string //IGNORE is appended, characters that cannot be represented in the target charset are silently discarded. Otherwise, E_NOTICE is generated and the function will return false. But again, this should only have an effect on valid input sequences. Which brings us to the problem. The current behavior of, $text = "aa\xC3\xC3\xC3\xB8aa"; var_dump(urlencode(iconv("UTF-8", "UTF-8//IGNORE", $text))); with glibc iconv is to print, string(10) "aa%C3%B8aa" even though the input $text is invalid UTF-8. The reason for this goes back to PHP bug 48147 which was intended to address a bug in glibc's iconv implementation with //IGNORE. Prior to PHP bug 52211, PHP's iconv would return part of the string if iconv failed. And the glibc bug was making it return the wrong part. So, as part of bug 48147, PHP added an ICONV_BROKEN_IGNORE test to config.m4, and added an internal workaround for ignoring EILSEQ errors mid-translation. Unfortunately there are some problems with this test and the workaround: * The test supplies an invalid input sequence to iconv() with //IGNORE, and looks for an error. As discussed above, this should always be an error. Any implementation conforming to any version of POSIX should trigger ICONV_BROKEN_IGNORE. (Tested: glibc and musl.) * The internal workaround for ICONV_BROKEN_IGNORE ignores EILSEQ errors. Again, this is not right, because you will only get EILSEQ from invalid input sequences (in POSIX conforming implementations) or when //IGNORE is NOT used (GNU implementations). * The workaround leaves "//IGNORE" in the target encoding and so does nothing to aid implementations like musl where //IGNORE is "broken" because it never worked to begin with. In short, we're always getting the workaround behavior, and the workaround behavior runs contrary to both POSIX and the PHP documentation. Invalid input sequences should never be ignored. This commit removes the workaround, but will be a breaking change for anyone using //IGNORE on invalid inputs.
The result of this test was used in PHP's iconv() function to implement a workaround, but the workaround has been removed and there are no remaining consumers of ICONV_BROKEN_IGNORE.
This test ensures that iconv() with //IGNORE will return part of the translated string when it is given invalid inputs. POSIX and the PHP documentation however agree that an error should be returned in that case: the test is making sure that we get the wrong answer. PHP's iconv has recently been updated to agree with the standard and docs. As a result, this test fails, but that's OK -- it should. The history of iconv in PHP is convoluted, so rather than just update the expected output here, I think it's best to rename the test and sever its ties to bug 48147. Referencing that bug at this point can only cause confusion.
It looks like libiconv has no problem with the invalid input sequences. I'm just going to give up on this for now. It's been nothing but a headache, and no one has actually complained about the current behavior. |
Yeah, the different behavior of the iconv implementations is a headache, but there is not much we can do (unless we would drop support for anything else but libiconv). |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
While working on #16840, I discovered that an old glibc bug workaround is causing our
iconv()
implementation to go against POSIX and its own docs.There is a very detailed commit message on the commit that removes the workaround, but in short: the workaround ignores invalid input sequences, when what it should have been doing is ignoring sequences that have no representation in the target encoding. But the need for a workaround is also not detected/applied correctly.
(It would be nice if PHP could abstract the
//IGNORE
and//TRANSLIT
magic away from the underlying iconv, but now that POSIX 2024 includes them, the smart thing to do is just wait.)Afterwards, there's a test that's testing for the incorrect output. We fix & rename it.