MSWord moves word to upper line when correcting space error #50
Comments
This does not happen in Google Docs. |
When fixing ?? to ? ?, a new suggestion appears: ?B. can be fixed to ? B. However, there is a newline after ? which the program seems to ignore. |
It seems that the problem is that we haven't considered CARRIAGE RETURN / |
Something very strange happens that looks like a bug. With the following minimal test text:
(copy to MS Word, paste it in a new document, and copy it back from the Word file if the CR is lost) I get the following in UnicodeChecker: CR (
Now store the test text (with the CR char) in a test file, and run it through the grammar checker:
cat test.txt | ./tools/grammarcheckers/modes/smegramrelease.mod
The result is this:
Suddenly the CR char (and the newline) is placed before the two question marks. That is, the character stream has been changed somewhere in the processing. That should not happen. |
The tokeniser/analyser is fine:
cat test.txt | ./tools/grammarcheckers/modes/smegramrelease0-morph.mode
"<boarásmuvan>"
"boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0>
"boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0>
"<»>"
"»" PUNCT RIGHT <W:0.0>
"”" PUNCT RIGHT Err/Orth <W:0.0>
:
"<?>"
"?" CLB <W:0.0>
"<?>"
"?" CLB <W:0.0>
:
\n
"<B.>"
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
"." CLB <W:0.0> "<.>"
"b" Adv Sem/Time ABBR Gram/TNumAbbr Attr <W:0.0> "<B>"
"." CLB <W:0.0> "<.>"
"b" Adv Sem/Time ABBR Gram/TNumAbbr <W:0.0> "<B>"
"<Moske>"
"Moske" N Prop Sem/Plc Attr <W:0.0>
"Moske" N Prop Sem/Plc Sg Nom <W:0.0> |
The first whitespace analyser moves the chars one place:
cat test.txt | ./tools/grammarcheckers/modes/smegramrelease1-blanktag.mode
"<boarásmuvan>"
"boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0> <firstCohort>
"boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0> <firstCohort>
"<»>"
"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide>
"”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide>
:
"<?>"
"?" CLB <W:0.0>
:
\n
"<?>"
"?" CLB <W:0.0>
"<B.>"
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
"." CLB <W:0.0> "<.>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0>
"." CLB <W:0.0> "<.>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> "<B>"
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0>
"." CLB <W:0.0> "<.>"
"b" Adv Sem/Time ABBR Gram/TNumAbbr Attr <W:0.0> "<B>"
"." CLB <W:0.0> "<.>"
"b" Adv Sem/Time ABBR Gram/TNumAbbr <W:0.0> "<B>"
"<Moske>"
"Moske" N Prop Sem/Plc Attr <W:0.0> <LastCohort>
"Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort> |
And then they are moved another time by the second whitespace analyser:
"<boarásmuvan>"
"boarásmuvvat" Err/Orth-a-á V IV Ind Prs Sg1 <W:0.0> <firstCohort>
"boarásmuvvat" v1 V IV Ind Prs Sg1 <W:0.0> <firstCohort>
"<»>"
"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide>
"”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide>
:
\n
:
"<?>"
"?" CLB <W:0.0> <NoSpaceAfterPunctMark> <SpaceBeforePunctMark>
"<?>"
"?" CLB <W:0.0> <NoSpaceAfterPunctMark>
"<B.>"
"B" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> <NoSpaceAfterPunctMark>
"B" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> <NoSpaceAfterPunctMark>
"B" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> <NoSpaceAfterPunctMark>
"B" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark>
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Attr <W:0.0> <NoSpaceAfterPunctMark>
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Acc <W:0.0> <NoSpaceAfterPunctMark>
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Gen <W:0.0> <NoSpaceAfterPunctMark>
"Balphabet" N Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <NoSpaceAfterPunctMark>
"<Moske>"
"Moske" N Prop Sem/Plc Attr <W:0.0> <LastCohort>
"Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort>
So something is clearly wrong in the whitespace analysers. |
This is not fine. That should probably be
which would mean a newline occurred. There should be an initial colon before any lines with unanalysed data; anything without an initial colon, tab, or quote is ignored by divvun-suggest. This needs to be fixed in hfst-tokenise (hfst/hfst#575) and divvun-suggest (divvun/libdivvun#65). |
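To make the colon/tab/quote rule concrete, here is a minimal Python sketch of how a consumer of the stream might sort lines by their first character. The function name `classify_cg_line` and the category labels are made up for illustration; this is not the actual divvun-suggest code, only a reading of the rule stated above.

```python
def classify_cg_line(line: str) -> str:
    """Classify one line of a CG stream by its first character.

    Hypothetical helper: the labels are illustrative, not taken from
    divvun-suggest itself.
    """
    if line.startswith(":"):
        return "blank"      # unanalysed text between cohorts
    if line.startswith("\t"):
        return "reading"    # an analysis belonging to the previous wordform
    if line.startswith('"'):
        return "wordform"   # a cohort header such as "<?>"
    return "ignored"        # no colon, tab, or quote: dropped


# A bare escaped newline has no leading colon, so it falls through;
# with the colon it is recognised as unanalysed text.
print(classify_cg_line("\\n"))    # prints: ignored
print(classify_cg_line(":\\n"))   # prints: blank
```

Under this reading, a line holding only `\n` is silently lost, which would explain why the newline after the question mark seemed to be ignored.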
Does this work correctly now? I get:
$ cat ~/github/divvun/libdivvun/foo |
  ~/github/hfst/hfst/tools/src/hfst-tokenize -g tools/grammarcheckers/tokeniser-gramcheck-gt-desc.pmhfst |
  divvun-blanktag tools/grammarcheckers/analyser-gt-whitespace.hfst |
  vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/valency.bin' |
  vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/mwe-dis.bin' |
  cg-mwesplit |
  divvun-blanktag '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/analyser-gt-errorwhitespace.hfst' |
  divvun-cgspell -n 10 -b 15.000000 -w 5000.000000 -u 0.400000 -l '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/acceptor.default.hfst' -m '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/errmodel.default.hfst' |
  vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/valency-postspell.bin' |
  vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/grc-disambiguator.bin' |
  vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/spellchecker.bin' |
  vislcg3 -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/grammarchecker-release.bin' |
  divvun-suggest -g '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/generator-gramcheck-gt-norm.hfstol' -m '/home/flammie/github/giellalt/lang-sme/tools/grammarcheckers/errors.xml' -l se
"<boarásmuvan>"
"boarásmuvvat" v1 <mv> V IV Ind Prs Sg1 <W:0.0> <firstCohort> @+FMAINV &LINK &punct-aistton-right ID:1
punct-aistton-right
"<»>"
"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> &punct-aistton-right &space-before-punct-mark &LINK ID:2 R:LEFT:1
punct-aistton-right
space-before-punct-mark
"»" PUNCT RIGHT <W:0.0> <SpaceOnRightSide> "boarásmuvan”"S &punct-aistton-right &SUGGESTWF ID:2 R:LEFT:1
punct-aistton-right
"”" PUNCT RIGHT Err/Orth <W:0.0> <SpaceOnRightSide> &LINK &space-before-punct-mark ID:2 R:LEFT:1
space-before-punct-mark
:
"<?>"
"?" CLB <W:0.0> <SpaceBeforePunctMark>
"<?>"
"?" CLB <W:0.0> <LastCohortOfParagraph>
:\r\n
"<B.>"
"B" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark ID:7 R:RIGHT:8
no-space-after-punct-mark
"B" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN "B. Moske"S &no-space-after-punct-mark &SUGGESTWF ID:7 R:RIGHT:8
no-space-after-punct-mark
"Balphabet" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN &no-space-after-punct-mark ID:7 R:RIGHT:8
no-space-after-punct-mark
"Balphabet" N <NomGenSg> Sem/Sign ABBR Gram/TAbbr Sg Nom <W:0.0> <firstCohortOfParagraph> <NoSpaceAfterPunctMark> @HNOUN "B. Moske"S &no-space-after-punct-mark &SUGGESTWF ID:7 R:RIGHT:8
no-space-after-punct-mark
"<Moske>"
"Moske" N Prop Sem/Plc Sg Nom <W:0.0> <LastCohort> @HNOUN &LINK &no-space-after-punct-mark ID:8
no-space-after-punct-mark
:\r\n
$ xxd ~/github/divvun/libdivvun/foo
00000000: 626f 6172 c3a1 736d 7576 616e c2bb 203f boar..smuvan.. ?
00000010: 3f0d 0a42 2e4d 6f73 6b65 0d0a ?..B.Moske.. |
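For anyone without xxd at hand, the same check can be sketched in Python. The hex string below is copied from the dump above; the variable names are only illustrative.

```python
# Bytes copied from the xxd dump of the test file.
data = bytes.fromhex(
    "626f6172c3a1736d7576616ec2bb203f"
    "3f0d0a422e4d6f736b650d0a"
)

# Decode as UTF-8 and list the byte offsets of every CARRIAGE RETURN (0x0D).
text = data.decode("utf-8")
cr_offsets = [i for i, b in enumerate(data) if b == 0x0D]

print(repr(text))   # prints: 'boarásmuvan» ??\r\nB.Moske\r\n'
print(cr_offsets)   # prints: [17, 26]
```

Both line endings are CR LF, and the first CR sits directly after the two question marks, so this is where the CR belongs in the output stream.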
This happens when correcting "B.Moske" to "B. Moske":
The problem occurs because CR(LF) is not escaped in the various tools:
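As a rough illustration of what escaping would mean here, the following Python sketch turns a raw blank into a single colon-prefixed line, in the same shape as the `:\r\n` lines in the output above. The helper `escape_blank` is hypothetical and not taken from hfst-tokenize or libdivvun; it also assumes that only CR and LF need escaping.

```python
def escape_blank(raw: str) -> str:
    """Render unanalysed text as one colon-prefixed CG line.

    Hypothetical helper: CR and LF are written as the two-character
    sequences \\r and \\n so that neither can split the line in two.
    """
    escaped = raw.replace("\r", "\\r").replace("\n", "\\n")
    return ":" + escaped


print(escape_blank("\r\n"))   # prints: :\r\n
print(escape_blank(" "))      # prints a colon followed by one space
```

With CR handled the same way as LF, the blank stays on one line and in its original position between the cohorts.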