
Investigate possible issue with text alignment for deuterocanonical books and unusual versifications #478

Open · Enkidu93 opened this issue Sep 4, 2024 · 5 comments


Enkidu93 commented Sep 4, 2024

For more information, see #471 (and also #442 (comment)).

Let's add some tests to verify that we are handling ALL of the deuterocanonical books properly (the same amount of testing as for the standard English versification books).

@johnml1135 johnml1135 added this to Serval Nov 1, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Nov 1, 2024
@Enkidu93 Enkidu93 moved this from 🆕 New to 🏗 In progress in Serval Dec 11, 2024

johnml1135 commented Dec 23, 2024

The issue identified was that the mapping between DAN and SUS was incorrect because of a bug in the preprocess code. To really verify that all books work, there are a few main points where this could go wrong:

  • In the versification files themselves (there could be a VRS error - how do we handle it?)
  • In the ParallelCorpus algorithm mapping versifications
  • In the preprocess script (as there was here)
  • In the USFM update script when the new USFM files are created.

Where should we make tests for this?

  • First, ensure that there are unit tests for preprocess and USFM update to specifically verify that the target references are used, not the source references. These tests may already exist.
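
A minimal pytest-style sketch of the shape such a test could take; `VerseRow` and `build_parallel_rows` are hypothetical stand-ins for the real preprocess code, not its actual API:

```python
# Hypothetical sketch: VerseRow and build_parallel_rows are invented
# stand-ins, not the actual preprocess code in Machine/Serval.
from dataclasses import dataclass


@dataclass
class VerseRow:
    source_ref: str  # e.g. "SUS 1:1" under the source versification
    target_ref: str  # e.g. "DAN 13:1" under the target versification
    source_text: str
    target_text: str


def build_parallel_rows():
    # Stub for the step under test: pairing DAN/SUS content where the two
    # projects use different versifications for the same material.
    return [VerseRow("SUS 1:1", "DAN 13:1", "source text", "target text")]


def test_preprocess_emits_target_refs():
    # Guard against the bug class above: emitting the *source* reference
    # (SUS 1:1) where the *target* reference (DAN 13:1) belongs.
    emitted = [row.target_ref for row in build_parallel_rows()]
    assert emitted == ["DAN 13:1"]
```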

Then, use ParallelCorpus-level testing to verify the following (a toy sketch follows the list):

  • Note: Versification mappings are already tested in SIL.Scripture: https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/Versification.cs
  • We want to make sure we are testing improper versification mappings or mismatches between the VRS files and the SFM files, including:
    • Verses in SFM but not in vrs
    • Verses in vrs but not in SFM
    • Double mapping? Two different verses map to the same location? What do we do? Do we throw an error, use the first one, or combine the verses (i.e., Daniel 13 and BEL are both defined)?
  • Mapping between books works
  • Mapping between chapters works
  • Mapping where both the source and target have odd (non-original) versification (i.e., Target has DAN, Orig has BEL, Source has SUS - can this even happen?) - make sure it makes the trip through successfully.
  • I don't see the immediate value in testing all versifications or all deuterocanonical books - just in testing the edge cases. I am assuming that SIL.Scripture does full testing of these books.
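
A toy illustration of the odd-versification scenario above (nothing here is the real SIL.Scripture or ParallelCorpus API; both mapping tables are invented for the sketch):

```python
# Toy two-hop mapping (source vrs -> original vrs -> target vrs), standing
# in for what the real versification mapping machinery does.
SOURCE_TO_ORIG = {"SUS 1:1": "BEL 1:1"}   # invented source-side rule
ORIG_TO_TARGET = {"BEL 1:1": "DAN 13:1"}  # invented target-side rule


def map_source_to_target(ref):
    # Verses with no explicit rule pass through unchanged.
    orig = SOURCE_TO_ORIG.get(ref, ref)
    return ORIG_TO_TARGET.get(orig, orig)


def test_odd_versification_round_trip():
    # Source has SUS, original has BEL, target has DAN: the verse should
    # survive both hops and land on the target reference.
    assert map_source_to_target("SUS 1:1") == "DAN 13:1"


def test_unmapped_verse_passes_through():
    assert map_source_to_target("GEN 1:1") == "GEN 1:1"
```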

@Enkidu93, what do you think?

Enkidu93 commented:

> The issue identified was that the mapping between DAN and SUS was incorrect because of a bug in the preprocess code. To really verify that all books work, there are a few main points where this could go wrong:

> * In the versification files themselves (there could be a VRS error - how do we handle it?)

I'd be curious for Damien's thoughts, but I wouldn't worry too much about incorrect .vrs files. As the saying goes, 'garbage in, garbage out'. I don't think we can take responsibility for this, and undefined behavior is appropriate behavior when bad data is provided. This is something that's skirting the edge of the purview of Serval/Machine. If we do anything at all about this, I'd suggest just a quick validation of the .vrs file (if that's possible), throwing a meaningful error if it's just plain invalid. We simply can't account for all the possible bad .vrs files out there, and furthermore, I'm not sure how often this is an issue: it seems like Paratext should have an automatic check for this or something (?).
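
For what it's worth, even a crude syntactic check would turn "just plain invalid" into a meaningful error. A minimal sketch, assuming a simplified subset of the .vrs line grammar (verse-count lines and simple `=` mapping lines only):

```python
# Crude sketch only: recognizes a simplified subset of the .vrs format
# (verse-count lines and simple '=' mappings), not the full spec.
import re

VERSE_COUNT_LINE = re.compile(r"^[A-Z0-9]{3}( \d+:\d+)+$")
MAPPING_LINE = re.compile(r"^[A-Z0-9]{3} \d+:\d+(-\d+)? = [A-Z0-9]{3} \d+:\d+(-\d+)?$")


def validate_vrs(lines):
    for num, raw in enumerate(lines, start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not (VERSE_COUNT_LINE.match(line) or MAPPING_LINE.match(line)):
            # Fail fast with a meaningful error instead of undefined behavior.
            raise ValueError(f".vrs line {num} is not valid: {raw!r}")


validate_vrs(["GEN 1:31 2:25", "DAN 13:1 = SUS 1:1", "# comment line"])
```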

> * In the ParallelCorpus algorithm mapping versifications

> * In the preprocess script (as there was here)

What exactly was the bug?

> * In the USFM update script when the new USFM files are created.

This should be less of an issue, right, since we're not doing alignment here so much as just (potentially) changing versification? Where are you imagining an error like this would creep in?

> Where should we make tests for this?

> * First, ensure that there are unit tests for preprocess and USFM update to specifically verify that the target references are used, not the source references. These tests may already exist.

Yes, we should do this - at least for preprocessing. I'm not sure a test exists that specifically covers this.

> Then, use ParallelCorpus-level testing to verify the following:

> * Note: Versification mappings are already tested in SIL.Scripture: https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/Versification.cs

> * We want to make sure we are testing improper versification mappings or mismatches between the VRS files and the SFM files, including:
>   * Verses in SFM but not in vrs
>   * Verses in vrs but not in SFM
>   * Double mapping? Two different verses map to the same location? What do we do? Do we throw an error, use the first one, or combine the verses (i.e., Daniel 13 and BEL are both defined)?

Like I said above, I don't know if these are things we really can/should test.

> * Mapping between books works

> * Mapping between chapters works

> * Mapping where both the source and target have odd (non-original) versification (i.e., Target has DAN, Orig has BEL, Source has SUS - can this even happen?) - make sure it makes the trip through successfully.

I think these are all in scope, and hopefully, Mudi has covered some of these in his tests so far (?).

> * I don't see the immediate value in testing all versifications or all deuterocanonical books - just in testing the edge cases. I am assuming that SIL.Scripture does full testing of these books.

Yep, I agree.


ddaspit commented Jan 3, 2025

Regarding incorrect vrs files, there might be some simpler cases that we could handle automatically. For example, if the USFM contains a verse that isn't specified in the vrs file, we could automatically add it.
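
A toy illustration of that repair; the data structure is invented for the sketch (the real versification object lives in SIL.Scripture):

```python
# Invented for the sketch: max verse number declared by the vrs file for
# each (book, chapter) pair. Here, the vrs declares no verses for DAN 13.
vrs_verse_counts = {("DAN", 13): 0}


def ensure_verse(book, chapter, verse):
    # If the USFM contains a verse beyond what the vrs declares for this
    # chapter, grow the declared count instead of failing.
    key = (book, chapter)
    if verse > vrs_verse_counts.get(key, 0):
        vrs_verse_counts[key] = verse


ensure_verse("DAN", 13, 64)  # verse encountered while parsing the USFM
assert vrs_verse_counts[("DAN", 13)] == 64
```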

mudiagaobrikisil commented:

I have written tests for the following:

  • Improper versification mappings
  • Verses in vrs but not in SFM. Ensure the verse is not in the parsed Paratext project. Parsing should gracefully continue even if there's a verse in the vrs and not in the SFM.
  • Verses in SFM but not in vrs. Ensure the verse is not in the parsed Paratext project. Parsing should gracefully continue even if there's a verse in the SFM and not in the vrs (see the sketch after this list).
  • Mapping where both the source and target have odd (non-original) versification. As it stands, the test would throw an exception if there is an alignment issue with the varying versifications, which is the case for some of the books. However, I intend to modify the test to handle the sample case John mentioned (Target has DAN, Orig has BEL, Source has SUS - can this even happen?) and make the test throw an error if there is an oddity.
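
The two mismatch tests could take roughly this shape; `parse_project` is a hypothetical stand-in for parsing a Paratext project into ref-to-text rows:

```python
# Hypothetical stand-in: keep only verses present in both the USFM and the
# versification; anything else is skipped rather than raising.
def parse_project(usfm_refs, vrs_refs):
    return {ref: f"text of {ref}" for ref in usfm_refs if ref in vrs_refs}


def test_verse_in_vrs_but_not_in_sfm():
    rows = parse_project(["DAN 13:1"], ["DAN 13:1", "DAN 13:2"])
    assert "DAN 13:2" not in rows  # missing verse skipped, no exception


def test_verse_in_sfm_but_not_in_vrs():
    rows = parse_project(["DAN 13:1", "DAN 13:65"], ["DAN 13:1"])
    assert "DAN 13:65" not in rows  # unexpected verse dropped, no exception
```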

I intend to write the following tests:

  • To handle double mapping - undefined behaviour, but it shouldn't throw an exception (rough sketch after this list)

  • To ensure mapping between books works - this is essential as the bug has to do with mapping between DAN and SUS

  • To ensure mapping between chapters works
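
A rough sketch of the double-mapping test; `merge_rows` is hypothetical, and the only hard requirement asserted is that a collision must not raise:

```python
def merge_rows(mapped_pairs):
    # Undefined behaviour on collision; here the first mapping wins, but
    # the test below deliberately does not depend on that choice.
    rows = {}
    for target_ref, text in mapped_pairs:
        rows.setdefault(target_ref, text)
    return rows


def test_double_mapping_does_not_raise():
    # Both DAN 13:1 and BEL 1:1 are defined and map to the same target ref.
    pairs = [("DAN 13:1", "text from DAN 13"), ("DAN 13:1", "text from BEL 1")]
    rows = merge_rows(pairs)
    assert "DAN 13:1" in rows  # some text survives; no exception raised
```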

I also intend to verify that there are tests that check whether the target references are used for the preprocess and USFM update. If there are none, I will write the tests.

johnml1135 commented:

@ddaspit - how would we add the verse if it isn't in the VRS file? Would we just assume there is no mapping? What if there is a mapping? Should we follow some logic like "if there is no mapping used in this specific book, then add the verse"?

Also, I am assuming in general we want to be graceful and forgiving with bad VRS and USFM files. There will be other places in Serval that will analyze these issues and tell the user to fix up their stuff, but I believe that this would be a bad place to do it - I believe people are expecting it to do a "best guess" job and not crash.
