Error when trying to access some strings with glossary terms #10002

Closed
2 tasks done
endervad opened this issue Sep 21, 2023 · 12 comments

@endervad

Describe the issue

We've recently upgraded our self-hosted Weblate from 4.18.2 to 5.0.2. Since the upgrade, translators can't access some of the strings in components - the server throws an internal error. By 'access' I mean going to the string page (/translate/<project>/<component>/<language>/<id_or_some_other_query>) or going into Zen-mode (/zen/<project>/<component>/<language>/).
We figured from the error that it has something to do with the glossaries, specifically with strings having glossary terms in them, though not all of those strings were affected. Disabling all of the glossaries 'fixed' the issue as in it's now possible to access those problematic strings, but without glossary highlights of course.

I should also mention that this is a large project - 4m+ strings across 6k+ components. The glossaries aren't big at all - about maybe 500-1000 terms in total across all glossaries. Latest 5.x updates did bring a lot of huge performance improvements, and we're very thankful for that.

I already tried

  • I've read and searched the documentation.
  • I've searched for similar issues in this repository.

Steps to reproduce the behavior

  1. Have a huge project and some glossaries
  2. Try to access certain strings with glossary terms OR Try to open Zen-mode on a component that has such problematic strings - I don't know how to determine them

Expected behavior

The string page should successfully open OR The Zen-mode should successfully open.

Screenshots

The user sees a generic Internal Server Error message.

Exception traceback

No response

How do you run Weblate?

Docker container

Weblate versions

  • Weblate: 5.0.2
  • Django: 4.2.5
  • siphashc: 2.1
  • translate-toolkit: 3.10.1
  • lxml: 4.9.3
  • Pillow: 10.0.0
  • nh3: 0.2.14
  • python-dateutil: 2.8.2
  • social-auth-core: 4.4.2
  • social-auth-app-django: 5.3.0
  • django-crispy-forms: 2.0
  • oauthlib: 3.2.2
  • django-compressor: 4.4
  • djangorestframework: 3.14.0
  • django-filter: 23.2
  • django-appconf: 1.0.5
  • user-agents: 2.2.0
  • filelock: 3.12.4
  • rapidfuzz: 3.3.0
  • openpyxl: 3.1.2
  • celery: 5.3.4
  • django-celery-beat: 2.5.0
  • kombu: 5.3.2
  • translation-finder: 2.15
  • weblate-language-data: 2023.5
  • html2text: 2020.1.16
  • pycairo: 1.24.0
  • PyGObject: 3.46.0
  • diff-match-patch: 20230430
  • requests: 2.31.0
  • django-redis: 5.3.0
  • hiredis: 2.2.3
  • sentry-sdk: 1.31.0
  • Cython: 3.0.2
  • misaka: 2.1.1
  • GitPython: 3.1.36
  • borgbackup: 1.2.6
  • pyparsing: 3.1.1
  • ahocorasick_rs: 0.17.1
  • python-redis-lock: 4.0.0
  • charset-normalizer: 3.2.0
  • Python: 3.11.5
  • Git: 2.30.2
  • psycopg2: 2.9.7
  • phply: 1.2.6
  • ruamel.yaml: 0.17.32
  • tesserocr: 2.6.1
  • boto3: 1.28.48
  • zeep: 4.2.1
  • aeidon: 1.12
  • iniparse: 0.5
  • mysqlclient: 2.2.0
  • Mercurial: 6.5.2
  • git-svn: 2.30.2
  • git-review: 2.3.1
  • PostgreSQL server: 13.12
  • Database backends: django.db.backends.postgresql
  • Cache backends: default:RedisCache, avatar:FileBasedCache
  • Email setup: django.core.mail.backends.smtp.EmailBackend: smtp.gmail.com
  • OS encoding: filesystem=utf-8, default=utf-8
  • Celery: redis://cache:6379/1, redis://cache:6379/1, regular
  • Platform: Linux 5.10.0-25-amd64 (x86_64)

Weblate deploy checks

System check identified some issues:

WARNINGS:
?: (security.W004) You have not set a value for the SECURE_HSTS_SECONDS setting. If your entire site is served only over SSL, you may want to consider setting a value and enabling HTTP Strict Transport Security. Be sure to read the documentation first; enabling HSTS carelessly can cause serious, irreversible problems.
?: (security.W008) Your SECURE_SSL_REDIRECT setting is not set to True. Unless your site should be available over both SSL and non-SSL connections, you may want to either set this setting True or configure a load balancer or reverse-proxy server to redirect all connections to HTTPS.
?: (security.W012) SESSION_COOKIE_SECURE is not set to True. Using a secure-only session cookie makes it more difficult for network traffic sniffers to hijack user sessions.

INFOS:
?: (weblate.I021) Error collection is not set up, it is highly recommended for production use
        HINT: https://docs.weblate.org/en/weblate-5.0.2/admin/install.html#collecting-errors
?: (weblate.I028) Backups are not configured, it is highly recommended for production use
        HINT: https://docs.weblate.org/en/weblate-5.0.2/admin/backup.html

System check identified 5 issues (1 silenced).

Additional context

Admin receives an email describing the error. Here is one of the emails:
[Weblate] ERROR (EXTERNAL IP)_ Internal Server Error_ translate_ffxiv-translation_quest-040-luckmk105_04062_ru.zip

@nijel
Member

nijel commented Sep 21, 2023

Strange, I've never seen such an error. I don't think it is related to project size; the performance improvements in 5.0.x were actually tuned on a bigger project (https://weblate.eso-spolszczenie.eu/projects/eso-spolszczenie/#information).

Exception info from the e-mail:

IndexError at /translate/ffxiv-translation/quest-040-luckmk105_04062/ru/
cannot fit 'int' into an index-sized integer

/usr/local/lib/python3.11/site-packages/weblate/glossary/models.py, line 91, in get_glossary_terms
                    (start == 0 or NON_WORD_RE.match(source[start - 1]))
                                                     ^^^^^^^^^^^^^^^^^

Related code:

for _termno, start, end in automaton.find_matches_as_indexes(
    source, overlapping=True
):
    if uses_ngram or (
        (start == 0 or NON_WORD_RE.match(source[start - 1]))
        and (end >= len(source) or NON_WORD_RE.match(source[end]))
    ):
        term = source[start:end].lower()
        positions[term].append((start, end))
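
To make the boundary check in that loop easier to follow, here is a minimal standalone sketch. It is not Weblate's code: `naive_find_matches` is a hypothetical stand-in for `automaton.find_matches_as_indexes()`, and the `NON_WORD_RE` pattern is an assumption (the real definition may differ). A match is kept only when it is delimited by a non-word character or a string edge on both sides.

```python
import re
from collections import defaultdict

# Assumption: stands in for Weblate's NON_WORD_RE; the real pattern may differ.
NON_WORD_RE = re.compile(r"\W")


def naive_find_matches(terms, source):
    """Yield (term_index, start, end) for every occurrence of every term.

    Hypothetical stand-in for automaton.find_matches_as_indexes().
    """
    for termno, term in enumerate(terms):
        start = source.find(term)
        while start != -1:
            yield termno, start, start + len(term)
            start = source.find(term, start + 1)


def whole_word_positions(terms, source):
    """Keep only matches delimited by non-word characters or string edges."""
    positions = defaultdict(list)
    for _termno, start, end in naive_find_matches(terms, source):
        if (start == 0 or NON_WORD_RE.match(source[start - 1])) and (
            end >= len(source) or NON_WORD_RE.match(source[end])
        ):
            positions[source[start:end].lower()].append((start, end))
    return dict(positions)


print(whole_word_positions(["dragon", "rest"], "dragon restraints, at rest"))
# {'dragon': [(0, 6)], 'rest': [(22, 26)]}
```

The occurrence of "rest" inside "restraints" is rejected because the character after it is a word character, while the final "rest" is accepted because it ends at the string edge.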

Based on the error, it seems that start - 1 is bigger than sys.maxsize, which it certainly shouldn't be, because having a string that long in Weblate would trigger a lot of other problems.
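
For reference, this is how CPython produces that particular message: an ordinary out-of-range index gives a different IndexError than an index too large to fit the interpreter's native index type. The snippet is only an illustration with a made-up sample string; `try_index` is a hypothetical helper, not part of Weblate.

```python
import sys

source = "dragon restraints"


def try_index(text, index):
    """Return the character at index, or the IndexError message if it fails."""
    try:
        return text[index]
    except IndexError as error:
        return f"IndexError: {error}"


# An index just past the end: the usual "out of range" error.
print(try_index(source, len(source)))

# An index larger than sys.maxsize cannot be converted to a native index at all.
print(try_index(source, sys.maxsize + 1))
```

The second call reports "cannot fit 'int' into an index-sized integer", which is the message from the traceback, so the offset handed to the indexing expression was far outside any plausible string length.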

So most likely ahocorasick returns wrong offsets here. What operating system and architecture are you using?

@endervad
Author

endervad commented Sep 21, 2023

This is Docker 24.0.5 on Debian 11 (x86_64), Linux kernel version - 5.10.0-25-amd64.

@nijel
Member

nijel commented Sep 21, 2023

I've asked the ahocorasick_rs maintainer about this; it might be an issue in that library, see G-Research/ahocorasick_rs#83

@nijel
Member

nijel commented Sep 21, 2023

Can you please run the following Python script? It will output more information to diagnose this.

  1. Save it to a file
  2. Run docker compose exec -u weblate weblate weblate shell < saved_file.py
  3. Paste the output here

from itertools import chain

import ahocorasick_rs

from weblate.trans.models import Unit
from weblate.trans.models.component import prefetch_glossary_terms
from weblate.trans.util import PLURAL_SEPARATOR

unit = Unit.objects.get(pk=12519736)
parts = []
for text in unit.get_source_plurals():
    text = text.lower().strip()
    if text:
        parts.append(text)
source = PLURAL_SEPARATOR.join(parts)
project = unit.translation.component.project

prefetch_glossary_terms(project.glossaries)
terms = set(
    chain.from_iterable(glossary.glossary_sources for glossary in project.glossaries)
)

# Build automaton for efficient Aho-Corasick search
automaton = ahocorasick_rs.AhoCorasick(
    terms,
    implementation=ahocorasick_rs.Implementation.ContiguousNFA,
    store_patterns=False,
)

print("TERMS:")
print(terms)
print("STRING:")
print(repr(source))
print("MATCHES:")
print(automaton.find_matches_as_indexes(source, overlapping=True))

@nijel self-assigned this Sep 21, 2023

@nijel
Member

nijel commented Sep 21, 2023

I think the crash is caused by blank terms in a glossary (see G-Research/ahocorasick_rs#83 (comment)). The above snippet should confirm that, or there might be a different issue as well. I will fix the issue with blank terms and close this issue, as that is the most likely cause.
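
As a rough illustration of guarding against blank terms, the sketch below drops empty and whitespace-only strings before they would be handed to the automaton. This is an assumption about the general shape of such a guard, not the change made in the actual fix; `drop_blank_terms` and the sample terms are hypothetical.

```python
def drop_blank_terms(terms):
    """Return only the terms that still contain text after stripping whitespace."""
    return {term for term in terms if term.strip()}


# Hypothetical glossary sources, including two blank entries.
raw_terms = {"tiamat", "", "   ", "dragon restraints"}
clean_terms = drop_blank_terms(raw_terms)

print(sorted(clean_terms))
# ['dragon restraints', 'tiamat']
```

Filtering at this point keeps the matching loop unchanged, since the automaton never sees a pattern of zero length.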

@nijel closed this as completed in 845a544 Sep 21, 2023

@github-actions

The issue you have reported is now resolved. If you don’t feel it’s right, please follow its labels to get a clue for further steps.

  • In case you see a similar problem, please open a separate issue.
  • If you are happy with the outcome, don’t hesitate to support Weblate by making a donation.

@nijel added the "bug" label (Something is broken.) Sep 21, 2023
@nijel added this to the 5.1 milestone Sep 21, 2023

@endervad
Author

endervad commented Sep 21, 2023

Here is the output of the script:

TERMS:
set()
STRING:
'text_luckmk105_04062_securitysystem_000_005<tab><bloop> dragon restraints successfully disengaged. unable to confirm presence of specimen “tiamat.” please seek shelter immediately.'
MATCHES:
[]

And yes, we do have some blank terms in glossaries, as in w/ source but w/o translation.

@nijel
Member

nijel commented Sep 21, 2023

I guess this is with disabled glossaries, so that doesn't expose the bug...

@endervad
Author

Oh yeah, makes sense. I'll provide a new one in a bit.

@endervad
Author

The output is too big, so I've put it in a file.
output.txt

@nijel
Member

nijel commented Sep 21, 2023

Thanks, that confirms what I expected, so this issue should be fixed. You can work around it by removing empty terms from the glossary.

@endervad
Author

Gotcha, thanks for the help!
