dsl dictionary description parse issue on Windows #1974

Closed
xiaoyifang opened this issue Nov 21, 2024 · 7 comments

Comments

@xiaoyifang
Owner

xiaoyifang commented Nov 21, 2024

JMDict Furigana, JMDict+: https://jd4gd.com/jmdictplus.html

Originally posted by @darlopvil in #1875 (comment)

The description seems to have an encoding issue.
image

@shenlebantongying
Collaborator

Not reproducible on my machine (Linux) 😅

@xiaoyifang xiaoyifang changed the title from "dsl dictionary description parse issue" to "dsl dictionary description parse issue on Windows" Nov 21, 2024
@xiaoyifang
Owner Author

Not reproducible on my machine (Linux) 😅

Restricting it to Windows.

@xiaoyifang
Owner Author

xiaoyifang commented Nov 21, 2024

The DSL code uses QTextStream to read the description from the .ann file. QTextStream will use the system's default codec, so we need to set the encoding explicitly.

Example (generated with AI):

#include <QDebug>
#include <QFile>
#include <QTextCodec>
#include <QTextStream>

// Try UTF-8 first; fall back to ISO 8859-1 if the data is not valid UTF-8.
QTextCodec* detectEncoding(const QByteArray& data) {
    QTextCodec::ConverterState state;
    QTextCodec* codec = QTextCodec::codecForName("UTF-8");
    codec->toUnicode(data.constData(), data.size(), &state);

    if (state.invalidChars > 0) {
        // Invalid UTF-8 sequences were found, so try another encoding.
        codec = QTextCodec::codecForName("ISO 8859-1");
    }

    return codec;
}

int main() {
    QFile annFile("path/to/your/file.txt");

    if (!annFile.open(QIODevice::ReadOnly | QIODevice::Text)) {
        qDebug() << "Failed to open file.";
        return -1;
    }

    // Read the raw bytes once to guess the encoding, then rewind the file
    // so the text stream starts from the beginning.
    QByteArray data = annFile.readAll();
    QTextCodec* detectedCodec = detectEncoding(data);
    annFile.seek(0);

    QTextStream annStream(&annFile);
    annStream.setCodec(detectedCodec);
    QString content = annStream.readAll();

    qDebug() << "File content:" << content;

    annFile.close();

    return 0;
}

readAll() can be replaced with readLine().

@shenlebantongying
Collaborator

By default, QTextStream tries to use one of the Unicode encodings. From the Qt 6 documentation:

By default, UTF-8 is used for reading and writing, but you can also set the encoding by calling setEncoding(). Automatic Unicode detection is also supported. When this feature is enabled (the default behavior), QTextStream will detect the UTF-8, UTF-16 or the UTF-32 BOM (Byte Order Mark) and switch to the appropriate UTF encoding when reading.

https://doc.qt.io/qt-6/qtextstream.html#details
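
To make the BOM detection described in the quote concrete, here is a minimal sketch in plain C++ (no Qt). The names `BomKind` and `detectBom` are hypothetical and exist only for this illustration; this is not Qt's implementation, just the standard BOM byte patterns checked in a safe order.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names for this sketch; not part of Qt or GoldenDict.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

// Inspect the leading bytes for a Unicode byte order mark.
// UTF-32 LE is checked before UTF-16 LE because both begin with FF FE.
inline BomKind detectBom(const std::string& bytes) {
    auto at = [&bytes](std::size_t i) {
        return static_cast<std::uint8_t>(bytes[i]);
    };
    const std::size_t n = bytes.size();

    if (n >= 4 && at(0) == 0xFF && at(1) == 0xFE && at(2) == 0x00 && at(3) == 0x00)
        return BomKind::Utf32LE;
    if (n >= 4 && at(0) == 0x00 && at(1) == 0x00 && at(2) == 0xFE && at(3) == 0xFF)
        return BomKind::Utf32BE;
    if (n >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomKind::Utf8;
    if (n >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomKind::Utf16LE;
    if (n >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomKind::Utf16BE;

    // No BOM: the caller has to fall back to a default or to a heuristic.
    return BomKind::None;
}
```

A file that starts directly with UTF-16 text and no BOM yields `BomKind::None` here, so a reader that relies only on the BOM falls back to its default encoding.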

@shenlebantongying
Collaborator

shenlebantongying commented Nov 21, 2024

The file is UTF-16 without a BOM, so we cannot reliably detect the byte order.

On my Linux system, the encoding detected for annStream is Utf8, but the text somehow displays correctly, just by accident.

The file is wrong. This is not fixable.
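
A small plain C++ sketch of why the byte order is ambiguous without a BOM: the same bytes decode to valid but different UTF-16 code units in either order. The helper `decodeUtf16Units` is a hypothetical name for this illustration and does no surrogate validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode raw bytes as UTF-16 code units in the given byte order.
// A trailing odd byte is ignored. Hypothetical helper for this sketch.
inline std::u16string decodeUtf16Units(const std::string& bytes, bool bigEndian) {
    std::u16string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned b0 = static_cast<std::uint8_t>(bytes[i]);
        const unsigned b1 = static_cast<std::uint8_t>(bytes[i + 1]);
        const unsigned unit = bigEndian ? ((b0 << 8) | b1) : ((b1 << 8) | b0);
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;
}
```

For the bytes `23 00 4E 00`, little-endian decoding gives `#N` (U+0023, U+004E), while big-endian decoding gives U+2300 and U+4E00, both of which are also valid characters. Nothing in the bytes themselves says which reading was intended.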

@shenlebantongying
Collaborator

I sent a short message to the dict author.

I don't think we can do anything here. The original code worked by accident in the original GD because QTextStream in Qt4/5 doesn't try to detect UTF-8.

@xiaoyifang
Owner Author

xiaoyifang commented Nov 22, 2024

The file is UTF-16 without a BOM, so we cannot reliably detect the byte order.

I think maybe we can:

if ( auto guessedEncoding = QStringConverter::encodingForData( { firstBytes, firstBytesSize }, '#' );
     guessedEncoding.has_value() ) {
  switch ( guessedEncoding.value() ) {
    case QStringConverter::Utf8:
      encoding = Utf8::Utf8;
      break;
    case QStringConverter::Utf16LE:
      encoding = Utf8::Utf16LE;
      break;
    case QStringConverter::Utf16BE:
      encoding = Utf8::Utf16BE;
      break;
    case QStringConverter::Utf32LE:
      encoding = Utf8::Utf32LE;
      break;
    case QStringConverter::Utf32BE:
      encoding = Utf8::Utf32BE;
      break;
    default:
      break;
  }
}
codec = QTextCodec::codecForName( getEncodingNameFor( encoding ) );
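
The snippet above passes an expected first character (`'#'`) to the detector. A rough plain C++ sketch of that idea follows, assuming the data really does start with a known ASCII character; `FirstCharGuess` and `guessFromFirstChar` are hypothetical names, and this is a simplified illustration rather than what `QStringConverter::encodingForData` actually does internally.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for this sketch; not part of Qt or GoldenDict.
enum class FirstCharGuess { Unknown, AsciiCompatible, Utf16LE, Utf16BE };

// Guess the encoding of BOM-less data from where the zero byte sits
// next to a known ASCII first character.
inline FirstCharGuess guessFromFirstChar(const std::string& bytes, char expected) {
    if (bytes.size() >= 2) {
        // "<expected> 00" looks like a little-endian UTF-16 code unit.
        if (bytes[0] == expected && bytes[1] == '\0')
            return FirstCharGuess::Utf16LE;
        // "00 <expected>" looks like a big-endian UTF-16 code unit.
        if (bytes[0] == '\0' && bytes[1] == expected)
            return FirstCharGuess::Utf16BE;
    }
    // A lone matching byte fits UTF-8 or any other ASCII-compatible encoding.
    if (!bytes.empty() && bytes[0] == expected)
        return FirstCharGuess::AsciiCompatible;
    return FirstCharGuess::Unknown;
}
```

This only helps when the first character is known in advance. If the content can begin with arbitrary text, the guess returns `Unknown` and the byte order stays undetermined.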
