Possible to batch input to automatic word alignment? #3

Open
eelegiap opened this issue Feb 25, 2022 · 3 comments

@eelegiap

I've been using the fully-automated level of the tool. I have about 200 sentence pairs (Spanish/English) I want to align, but it's taking forever because the language models are reloaded every time I run the alignment for a single sentence pair.

Is there a way to use the tool in a batched way, or to not load the language models over and over again during alignment? Thank you!

@BramVanroy
Owner

Hello @eelegiap. Thanks for your interest!

Unfortunately, true batching is not available as a built-in. You can, however, batch-process your sentences yourself and then create Sentences from the resulting docs with Sentence.from_parser. But since you only have 200 sentences, I recommend instead loading the parsers once, up front, so that they do not need to be reloaded for every pair. This works by passing a parser object instead of a language code to Sentence.from_text. The following should work.

from astred.aligned import AlignedSentences, Sentence
from astred.aligner import Aligner
from astred.utils import load_parser


nlp_en = load_parser("en", "stanza", is_tokenized=False, verbose=True)
nlp_es = load_parser("es", "stanza", is_tokenized=False, verbose=True)
aligner = Aligner()

your_data = [("This is a Spanish sentence.", "Esta es una oración en español."),
             ("Sorry, I do not speak Spanish", "Lo siento, no hablo español.")]

for sent_en_str, sent_es_str in your_data:
    sent_en = Sentence.from_text(sent_en_str, nlp_en)
    sent_es = Sentence.from_text(sent_es_str, nlp_es)
    aligned = AlignedSentences(sent_en, sent_es, aligner=aligner)
    # Do stuff
    for word in sent_en.no_null_words:
        print(word.text, [w.text for w in word.aligned if not w.is_null])
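
The same "load once, reuse everywhere" idea can also be sketched without any dependencies, which is handy when the loading code is spread over several functions. This is only an illustration: `get_parser` and `LOAD_COUNT` are hypothetical names, and the lambda stands in for an expensive call such as `load_parser(lang, "stanza")`.

```python
from functools import lru_cache

# Counts how often the (pretend) expensive load actually happens.
LOAD_COUNT = {"n": 0}


@lru_cache(maxsize=None)
def get_parser(lang: str):
    """Hypothetical stand-in for an expensive loader.

    In real use the body would return the parser object itself,
    for example the result of load_parser(lang, "stanza").
    """
    LOAD_COUNT["n"] += 1
    return lambda text: text.split()  # placeholder "parser"


pairs = [
    ("This is a Spanish sentence.", "Esta es una oración en español."),
    ("Sorry, I do not speak Spanish", "Lo siento, no hablo español."),
]

for sent_en_str, sent_es_str in pairs:
    # Repeated calls return the cached object instead of reloading.
    tokens_en = get_parser("en")(sent_en_str)
    tokens_es = get_parser("es")(sent_es_str)

print(LOAD_COUNT["n"])  # 2: one load per language, not one per sentence
```

Because the cache is keyed on the language code, each model is loaded at most once per process regardless of how many sentence pairs follow.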

Please let me know if you encounter any other issues!

@eelegiap
Author

Thank you, @BramVanroy, I'll try it out! One more thing -- On about 25% of my sentences, I've been getting an assertion error during the Stanza load:
assert(int(word.head) == int(head.id))
I am pretty sure that the problem is coming from line 155 of utils.py, during the Stanza initialization. I think if you add the 'mwt' (multi-word token) processor to the pipeline, it should solve the problem! (from stanfordnlp/stanza#272)
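
For illustration, a processor list that includes 'mwt' could be built as below. This is a rough sketch, not the code in utils.py: `build_processors` is a hypothetical helper, and the exact set of processors the library hardcodes is an assumption here. What Stanza itself requires is that 'mwt' comes right after 'tokenize' and before 'pos', 'lemma' and 'depparse'.

```python
def build_processors(use_mwt: bool = True) -> str:
    """Return a comma-separated Stanza processor string.

    Hypothetical helper: the processor set is assumed, not taken
    from the library's utils.py.
    """
    processors = ["tokenize"]
    if use_mwt:
        # Multi-word token expansion must run directly after tokenization.
        processors.append("mwt")
    processors += ["pos", "lemma", "depparse"]
    return ",".join(processors)


print(build_processors())       # tokenize,mwt,pos,lemma,depparse
print(build_processors(False))  # tokenize,pos,lemma,depparse

# Assumed usage, requires stanza and a downloaded Spanish model:
# import stanza
# nlp_es = stanza.Pipeline("es", processors=build_processors())
```

The `use_mwt` switch is there because not every language ships an MWT model, so the processor may need to be left out for some language codes.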

@BramVanroy
Owner

That's a good catch! It's indeed because the processors are hardcoded. I vaguely remember that MWT would cause issues but I'd have to test. I'll try to have a look this weekend.
