UTF-8 encoding error in cmudict-0.7b #5

Open
danmysak opened this issue Dec 8, 2021 · 3 comments

Comments


danmysak commented Dec 8, 2021

The output of iconv -f UTF-8 cmudict-0.7b > /dev/null; echo $? (as suggested here) is:

iconv: cmudict-0.7b:35733:1: cannot convert

Removing the line fixes things.

Alexir (Owner) commented Dec 8, 2021 via email

danmysak (Author) commented Dec 8, 2021

Alex, thanks for the reply! I think your version of iconv may simply be ignoring errors silently. The relevant line of the hex view of cmudict-0.7b is as follows:

000fef40: 5931 0d0a 44c9 4ac0 2020 4420 4559 3220  Y1..D.J.  D EY2 

I believe the 0xc9 and 0xc0 bytes each require a valid continuation byte after them, which 0x4a and 0x20 cannot be (since they are in the ASCII range).
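For anyone who wants to reproduce the check without iconv, here is a minimal Python sketch of the same idea: try to decode each line strictly as UTF-8 and report the ones that fail. The function name `find_invalid_utf8_lines` is made up for illustration, and the sample bytes are just the 16 bytes from the hex dump above, not the whole file.

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, offset_in_line, raw_line) for each line that
    fails strict UTF-8 decoding. Line numbers are 1-based, offsets 0-based."""
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")  # strict decoding: raises on the first bad byte
        except UnicodeDecodeError as exc:
            bad.append((lineno, exc.start, line))
    return bad


# The 16 bytes shown in the hex dump (offset 0x000fef40 of cmudict-0.7b).
sample = bytes.fromhex("59310d0a44c94ac02020442045593220")

for lineno, offset, raw in find_invalid_utf8_lines(sample):
    print(f"line {lineno}, byte offset {offset}: {raw!r}")
```

On the full file you would pass in the bytes read in binary mode (for example `open("cmudict-0.7b", "rb").read()`). Line and column numbering conventions may not match what iconv prints.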


pbevin commented Jul 24, 2024

The file encoding is CP-1252: byte C9 represents É and C0 represents À. Assuming you want the file in UTF-8, this is the right incantation:

iconv -f CP1252 -t UTF-8 cmudict-0.7b > foo
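If iconv is not at hand, roughly the same conversion can be sketched in Python by decoding the bytes as CP-1252 and re-encoding them as UTF-8. This assumes, as stated above, that the file really is CP-1252; the helper name `cp1252_to_utf8` is made up for illustration.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Reinterpret CP-1252 bytes as text and re-encode that text as UTF-8.

    Python's cp1252 codec raises UnicodeDecodeError on the five byte values
    CP-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    """
    return data.decode("cp1252").encode("utf-8")


# The word from the hex dump: 'D', 0xC9, 'J', 0xC0.
word = b"D\xc9J\xc0"
converted = cp1252_to_utf8(word)
print(converted.decode("utf-8"))
```

For the real file, read it in binary mode, pass the bytes through the function, and write the result to a new file in binary mode, much as the iconv command writes to foo.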
