Windows and Linux show different results with short snippets #44

Open
konstantinblaesi opened this issue Mar 15, 2018 · 5 comments

konstantinblaesi commented Mar 15, 2018

I know that short snippets are likely to fail language detection, but I found it confusing that the snippet

Best practices

was detected as en on Windows but failed on Linux with the error message "Failed to identify language". Do you have any idea why CLD2's behaviour is not consistent across platforms?

@konstantinblaesi konstantinblaesi changed the title Windows and Linux show different results Windows and Linux show different results with short snippets Mar 15, 2018
@dachev dachev mentioned this issue Mar 6, 2020
@kibertoad

@dachev Can you explain why there is such a difference? Is it safe to use CLD on Linux in production?

@vartemkin

Same problem. For example, "Черепашка" is detected on Windows but produces an error on Linux. Is it possible to fix this?

Owner

dachev commented Aug 30, 2024

@vartemkin can you try the latest version (2.10.0)? There is a new option called bestEffort that might help.
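
For readers who want to try this suggestion, here is a minimal sketch. It assumes the `cld` package exposes a promise-based `detect(text, options)` and accepts the `bestEffort` option named above. The `detectOrNull` helper and its injected `detect` parameter are hypothetical, introduced only so the fallback logic can be run without the native module.

```javascript
// Hypothetical helper: try a normal detection first, then retry with
// bestEffort, and return null instead of throwing if both attempts fail.
// `detect` is injected; in real use it would wrap the cld package, e.g.
//   const cld = require('cld');
//   const result = await detectOrNull((t, o) => cld.detect(t, o), 'Best practices');
// (assumed API: promise-based detect(text, options)).
async function detectOrNull(detect, text) {
  const attempts = [{}, { bestEffort: true }];
  for (const options of attempts) {
    try {
      return await detect(text, options);
    } catch (err) {
      // Short snippets may fail with "Failed to identify language";
      // fall through to the next, more permissive attempt.
    }
  }
  return null;
}
```

Returning `null` rather than throwing is a design choice for this sketch: it lets the caller treat "no language identified" as an ordinary outcome on either platform.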


vartemkin commented Aug 30, 2024

Yes, I'm already testing with that version.
I modified the C++ code to enable the verbose flag and to dump the input bytes:

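...
    if (input->httpHint.length() > 0) {
      hints.content_language_hint = input->httpHint.c_str();
    }
    // Always enable CLD2's verbose scoring output for this experiment.
    int flags = CLD2::kCLDFlagVerbose;
    if (input->bestEffort) {
      flags |= CLD2::kCLDFlagBestEffort;
    }

    // Dump the raw input bytes as decimal char values (no separator)
    // so the input can be compared across platforms.
    printf("\n");
    const char* cc = (const char*)input->bytes.c_str();
    for (int i = 0; i < input->numBytes; i++) printf("%d", cc[i]);
    printf("\n");

    CLD2::ExtDetectLanguageSummary(...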

windows 10:

$ node 1.js

-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127
<br>ScoreOneScriptSpan(Cyrl,18) ' ╨╝╨╡╨│╨░╨╝╨╕╨║╤Б '<br>
Hitbuffer[) <br>DumpHitBuffer[Cyrl, next_base/delta/distinct 2, 0, 0)<br>
Q[0]1,28463,╨╝╨╡╨│ <br>
Q[1]9,2836,╨╝╨╕╨║ <br>
<br>
Linear[) <br>DumpLinearBuffer[3)<br>
[0]1,Q=00000400,╨╝╨╡╨│<br>
[1]1,Q=0704350c,╨╝╨╡╨│<br>
[2]9,Q=07040aa9,╨╝╨╕╨║<br>
[3]18,U=00000000,   <br>
<br>
DumpChunkStart[1]<br>
[0]0
[1]3
<br>
<br>ScoreOneChunk[0..3) <br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] ru.9 bg.7 17B 3# Cyrl 36Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
<br>SharpenBoundaries<br>
<br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] ru.9 bg.7 17B 3# Cyrl 36Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
RUSSIAN (ru) (94%)

ubuntu:

ubuntu@ubuntu-desktop:~/Desktop/test$ node 1.js

-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127
<br>ScoreOneScriptSpan(Cyrl,18) ' мегамикс '<br>
Hitbuffer[) <br>DumpHitBuffer[Cyrl, next_base/delta/distinct 2, 0, 0)<br>
Q[0]1,31373,мег <br>
Q[1]9,30711,мик <br>
<br>
Linear[) <br>DumpLinearBuffer[3)<br>
[0]1,Q=00000400,мег<br>
[1]1,Q=3500151b,мег<br>
[2]9,Q=07040aa9,мик<br>
[3]18,U=00000000,   <br>
<br>
DumpChunkStart[1]<br>
[0]0
[1]3
<br>
<br>ScoreOneChunk[0..3) <br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
<br>SharpenBoundaries<br>
<br>DumpSummaryBuffer[1]<br>
[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score<br>
[0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs<br>
[1] 18 lin[3] en.0 en.0 0B 0# Zyyy 0Rd 0Rs<br>
<br>
{ reliable: false, textBytes: 18, languages: [], chunks: [] }
/home/ubuntu/Desktop/test/node_modules/cld/index.js:77
        throw new Error('Failed to identify language');
              ^

Error: Failed to identify language
    at Object.detect (/home/ubuntu/Desktop/test/node_modules/cld/index.js:77:15)
    at async main (/home/ubuntu/Desktop/test/1.js:3:16)

Node.js v20.17.0
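
The byte dumps in the two runs above are identical, which suggests both platforms pass the same input to CLD2. As a quick check, the following sketch decodes such a dump back to text. The function name `decodeSignedByteDump` is made up for this illustration. It relies on every byte being negative when read as a signed char, which holds for pure non-ASCII UTF-8 text like this input but would be ambiguous if ASCII bytes were mixed in.

```javascript
// Decode a dump produced by printf("%d", byte) over signed chars with no
// separator between values. Assumption: every value is negative, so each
// "-" marks the start of a new byte.
function decodeSignedByteDump(dump) {
  const magnitudes = dump.split('-').filter(Boolean).map(Number);
  // A signed char value -m corresponds to the unsigned byte 256 - m.
  const bytes = Uint8Array.from(magnitudes, (m) => 256 - m);
  // fatal: true makes invalid UTF-8 throw instead of being silently replaced.
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

const dump = '-48-100-48-75-48-77-48-80-48-68-48-72-48-70-47-127';
console.log(decodeSignedByteDump(dump)); // prints "Мегамикс"
```

The result is 16 bytes of valid UTF-8, consistent with the 18-byte span (text plus surrounding spaces) reported by ScoreOneScriptSpan on both platforms.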

@vartemkin

@dachev the problem starts at this line of the Linux output: [0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs
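
To compare that row with its Windows counterpart field by field, a small parser can help. The sketch below follows the column header printed in the verbose output (`[i] offset linear[chunk_start] lang.score1 lang.score2 bytesB ngrams# script rel_delta rel_score`). The function name, the field names, and the regular expression are assumptions made for this illustration, not part of CLD2.

```javascript
// Parse one DumpSummaryBuffer row from CLD2's verbose output.
// Any trailing markup such as "<br>" is ignored (the pattern has no end anchor).
const SUMMARY_ROW =
  /^\[(\d+)\] (\d+) lin\[(\d+)\] ([\w-]+)\.(\d+) ([\w-]+)\.(\d+) (\d+)B (\d+)# (\w+) (\d+)Rd (\d+)Rs/;

function parseSummaryRow(line) {
  const m = SUMMARY_ROW.exec(line.trim());
  if (!m) return null; // not a summary row
  return {
    index: Number(m[1]),
    offset: Number(m[2]),
    chunkStart: Number(m[3]),
    lang1: m[4],
    score1: Number(m[5]),
    lang2: m[6],
    score2: Number(m[7]),
    bytes: Number(m[8]),
    ngrams: Number(m[9]),
    script: m[10],
    relDelta: Number(m[11]),
    relScore: Number(m[12]),
  };
}

const windowsRow = parseSummaryRow('[0] 1 lin[0] ru.9 bg.7 17B 3# Cyrl 36Rd 100Rs<br>');
const linuxRow = parseSummaryRow('[0] 1 lin[0] un.7 tg.7 17B 3# Cyrl 0Rd 100Rs<br>');
console.log(windowsRow.lang1, windowsRow.relDelta); // ru 36
console.log(linuxRow.lang1, linuxRow.relDelta);     // un 0
```

Both rows cover the same 17 bytes and 3 n-grams in the Cyrl script. Only the top two language candidates and the rel_delta value differ, which fits the differing values already visible in the earlier Q[0] and Linear [1] lines of the two runs.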
