UTF-8 encoding error in cmudict-0.7b #5

danmysak · 2021-12-08T07:02:31Z

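The output of iconv -f UTF-8 cmudict-0.7b > /dev/null; echo $? (as suggested in this Stack Overflow answer: https://stackoverflow.com/questions/115210/how-to-check-whether-a-file-is-valid-utf-8/115262#115262) is: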

iconv: cmudict-0.7b:35733:1: cannot convert

Removing that line (35733) fixes things.
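
For readers without iconv at hand, the same kind of check can be sketched in Python. This is only an illustration and not a description of what iconv does internally: the helper name `find_invalid_utf8_lines` is made up here, and it reports 1-based line numbers with 0-based byte offsets, which may not match iconv's line:column convention.

```python
from __future__ import annotations


def find_invalid_utf8_lines(data: bytes) -> list[tuple[int, int]]:
    """Return (line_number, byte_offset) for every line that is not valid UTF-8.

    Hypothetical helper for illustration. Line numbers are 1-based; the byte
    offset is 0-based and points at the first offending byte within the line.
    """
    bad = []
    for line_number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding raises on malformed input
        except UnicodeDecodeError as err:
            bad.append((line_number, err.start))
    return bad
```

To run it against the dictionary, read the file in binary mode (`open("cmudict-0.7b", "rb").read()`) and pass the bytes to the function; an empty list means every line decoded cleanly.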

Alexir · 2021-12-08T19:46:00Z

Danilo,

I believe that iconv is doing the right thing, at least for me (using a git shell):

$ cat /proc/version
MINGW64_NT-10.0-19043 version 3.1.7-340.x86_64 ***@***.***) (gcc version 10.2.0 (GCC) ) 2021-03-26 22:17 UTC
$ iconv --version
iconv (GNU libiconv 1.16)
Copyright (C) 2000-2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Bruno Haible.
$ file cmudict-0.7b
cmudict-0.7b: ASCII text, with CRLF line terminators
$ iconv -f UTF-8 cmudict-0.7b > foo
$ diff foo cmudict-0.7b
$

iconv by default will map into whatever your local language setting is; I'm not sure, but it's likely that ASCII stays and gets supplemented by a language-appropriate code page. So it may be that no conversion is needed. In any case, I'm not sure what's different in your case.

Alex

On Wed, Dec 8, 2021 at 2:02 AM Danylo Mysak ***@***.***> wrote:

> The output of iconv -f UTF-8 cmudict-0.7b > /dev/null; echo $? (as suggested here <https://stackoverflow.com/questions/115210/how-to-check-whether-a-file-is-valid-utf-8/115262#115262>) is:
>
> iconv: cmudict-0.7b:35733:1: cannot convert

danmysak · 2021-12-08T20:17:23Z

Alex, thanks for the reply! I think it might be the case that your version of iconv simply ignores errors silently. The relevant line of the hex view of cmudict-0.7b is as follows:

000fef40: 5931 0d0a 44c9 4ac0 2020 4420 4559 3220  Y1..D.J.  D EY2

I believe the 0xc9 and 0xc0 bytes each require a valid continuation byte, which 0x4a and 0x20 cannot be (since they are in the ASCII range).
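
The byte-level argument can be checked directly on the 16 bytes from the hex dump. The sketch below uses Python's built-in strict UTF-8 decoder rather than iconv, so it illustrates the encoding rule and not the behaviour of any particular iconv build; `is_continuation` is a helper name introduced for this example. One refinement: 0xc9 is a legal lead byte that fails only because 0x4a is not a continuation byte (those have the bit pattern 10xxxxxx, i.e. 0x80 to 0xbf), whereas 0xc0 can never occur in well-formed UTF-8 at all.

```python
# The 16 bytes shown at offset 0x000fef40 in the hex dump.
chunk = bytes.fromhex("5931 0d0a 44c9 4ac0 2020 4420 4559 3220")


def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF)."""
    return byte & 0b1100_0000 == 0b1000_0000


# 0x4a ("J") and 0x20 (space) are ASCII, so neither can continue a sequence.
print(is_continuation(0x4A), is_continuation(0x20))

try:
    chunk.decode("utf-8")
except UnicodeDecodeError as err:
    # err.start is the index of the first offending byte within chunk.
    print(f"byte 0x{chunk[err.start]:02x} at index {err.start}: {err.reason}")
```

The decoder stops at the first problem it meets, so only the 0xc9 byte is reported; the 0xc0 two bytes later would be rejected as well.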

pbevin · 2024-07-24T14:40:32Z

The file encoding is CP-1252: byte C9 represents É and C0 represents À. Assuming you want the file in UTF-8, this is the right incantation:

iconv -f CP1252 -t UTF-8 cmudict-0.7b > foo
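
For reference, the same conversion can be sketched in Python, assuming (as stated above) that the file really is CP-1252. The function name `cp1252_to_utf8` is illustrative only and is not part of any tool mentioned in this thread.

```python
def cp1252_to_utf8(src_path: str, dst_path: str) -> None:
    """Re-encode a CP-1252 file as UTF-8 (hypothetical helper mirroring the iconv call)."""
    # newline="" disables newline translation, so CRLF terminators are preserved.
    with open(src_path, "r", encoding="cp1252", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# The entry from the hex dump, decoded as CP-1252 and re-encoded as UTF-8.
entry = b"D\xc9J\xc0  D EY2 "
text = entry.decode("cp1252")    # 0xC9 -> U+00C9 (É), 0xC0 -> U+00C0 (À)
print(text.encode("utf-8"))      # b'D\xc3\x89J\xc3\x80  D EY2 '
```

Python's cp1252 codec raises an error on the few byte values that CP-1252 leaves undefined (such as 0x81); whether the file contains any of those is not established in this thread.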
