Inference on new corpus by trained alignments #46

juncaofish · 2019-12-03T02:35:16Z

I have already trained on a large parallel corpus to obtain word alignments. How can I use these word alignments to infer translation probabilities for a new corpus?

hieuhoang · 2019-12-03T19:09:29Z

You have to run a phrase-table extraction algorithm with the corpus and the alignments as input, e.g. steps 4, 5 and 6 of the Moses training pipeline:
http://www.statmt.org/moses/?n=FactoredTraining.HomePage
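To illustrate the kind of estimate involved, here is a minimal sketch of a maximum-likelihood lexical translation table computed from a corpus and its alignments. It is not the Moses implementation: it assumes the corpus is in `source ||| target` format, that alignments are `i-j` pairs (source index, then target index), and it ignores unaligned words. The function name `lexical_translation_probs` is made up for this example.

```python
from collections import Counter


def lexical_translation_probs(corpus_lines, alignment_lines):
    """Estimate p(target_word | source_word) by relative frequency.

    corpus_lines:    iterable of "source ||| target" strings (assumed format)
    alignment_lines: iterable of "i-j i-j ..." strings, one per sentence pair,
                     where i indexes the source and j the target (assumed format)
    """
    pair_counts = Counter()    # counts of aligned (source_word, target_word) links
    source_counts = Counter()  # counts of links leaving each source word

    for sentence_pair, alignment in zip(corpus_lines, alignment_lines):
        source_text, target_text = sentence_pair.split("|||")
        source_tokens = source_text.split()
        target_tokens = target_text.split()

        for link in alignment.split():
            i, j = (int(index) for index in link.split("-"))
            pair_counts[(source_tokens[i], target_tokens[j])] += 1
            source_counts[source_tokens[i]] += 1

    # Relative frequency: count(s, t) / count(s)
    return {
        (s, t): count / source_counts[s]
        for (s, t), count in pair_counts.items()
    }


if __name__ == "__main__":
    corpus = ["das Haus ||| the house", "Haus ||| home"]
    alignments = ["0-0 1-1", "0-0"]
    for pair, prob in sorted(lexical_translation_probs(corpus, alignments).items()):
        print(pair, prob)
```

A real phrase table additionally needs phrase extraction and scoring (steps 5 and 6), which this sketch does not attempt.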

nomadlx · 2020-01-06T02:50:25Z

You can have a look at the file force_align.py. I guess this code is used to align a new corpus using previously trained conditional probabilities.

stribizhev · 2020-01-30T15:17:34Z

Has anyone got a working demo script? Ideally for a model that supports SentencePiece tokenization.

Brucewuzhang · 2020-02-03T06:27:53Z

I rewrote the source code in pure Python (I can't share it, for certain reasons). I think anyone can implement fast_align after reading the source code. My suggestion is not to use statistical word alignment models with SentencePiece-based tokenization; in my view they are not compatible. Statistical word alignment models can still be useful, depending on your purpose.

bricksdont · 2020-02-26T10:33:53Z

It is unclear whether the original question is about

a) word-aligning a corpus with a previously trained fast_align model (nomadlx assumed this was the case)
b) obtaining translation probabilities for phrases or sentences given word translation probabilities (Hieu assumed this was the case)

If a), then you might find the following useful. To train a fast_align model:

https://gist.github.com/bricksdont/7a9ac764d874b90853eff88d53971033

and to apply a trained model:

https://gist.github.com/bricksdont/0d1718c7c3fc05714b582afe4c3b5005

bricksdont · 2020-04-01T07:30:57Z

Edge cases that can break force_align.py:

The script only works with Python 2 at the moment.
If the input corpus has lines where the source or target side is empty, such as
```
This is a test. |||
```
the process hangs indefinitely.
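One possible workaround for the empty-side case is to filter the corpus before passing it to force_align.py. The sketch below is only an illustration, assuming the `source ||| target` input format shown above; the function name `filter_parallel_lines` is hypothetical and not part of fast_align. It returns the dropped line numbers as well, since the alignment output will otherwise no longer line up with the original corpus.

```python
def filter_parallel_lines(lines):
    """Split "source ||| target" lines into usable lines and dropped line numbers.

    A line is kept only if it contains exactly one "|||" separator and both
    sides contain at least one non-whitespace character.
    Returns (kept_lines, dropped_line_numbers), with 1-based line numbers.
    """
    kept = []
    dropped = []
    for line_number, line in enumerate(lines, start=1):
        parts = line.split("|||")
        if len(parts) == 2 and all(part.strip() for part in parts):
            kept.append(line)
        else:
            dropped.append(line_number)
    return kept, dropped


if __name__ == "__main__":
    corpus = [
        "This is a test. ||| Das ist ein Test.",
        "This is a test. |||",
    ]
    kept, dropped = filter_parallel_lines(corpus)
    print(kept)
    print(dropped)
```

Keeping the dropped line numbers makes it possible to re-insert empty alignment lines afterwards, so that output line n still corresponds to input line n.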

liesun1994 · 2020-04-09T03:00:47Z

Same issue as #33.

zolastro mentioned this issue May 25, 2022

Using fast_align to train save the model in infer. #33

Closed
