extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574
Comments
I took a deep dive into what's going on, and this is what I found. In the Paratext project, as you mentioned, the file for Judges is using "JUD" for the "\id", which is the ID for the book of Jude, not Judges. So when the project_corpus is initialized inside the [...] Saying all this, I think the best solution would be to prevent this scenario from happening by throwing an error if the "\id" tag does not match the book ID in the filename. I don't think it's practical to let the mismatch through with only a warning and still keep it from affecting the extraction process, since that would mean completely redesigning the algorithm for aligning rows, and that's probably not necessary just to handle malformed data.
@mshannon-sil I agree with your analysis. When there is a mismatch between the file name and the "\id" tag, we don't know which one is correct. We could decide to always use the book code from the file name and ignore the "\id" tag, which would fix the issue in this case, but in other cases the "\id" tag might be the correct one. The safest thing to do is to throw an exception.
@mshannon-sil can we close this issue?
Not yet. It looks like the last release of the Python package for machine.py was in October, and the fix was issued in November. @johnml1135, is there a plan for a new release of machine.py relatively soon?
Just updated to 1.4.0.
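For readers following along, the check proposed in the comments above (raise an error when the "\id" tag disagrees with the book the file is supposed to contain) can be sketched roughly as follows. This is only an illustration, not the actual machine.py fix: `read_id_tag` and `check_book_id` are hypothetical helper names, and the sketch assumes the caller has already worked out the expected book ID (for example from the file name) and read the file into a string.

```python
import re

# Matches an "\id" marker at the start of a line and captures the book code after it.
_ID_MARKER = re.compile(r"^\\id\s+([A-Za-z0-9]+)", re.MULTILINE)


def read_id_tag(usfm_text: str) -> str:
    """Return the upper-cased book code from the first "\\id" marker (hypothetical helper)."""
    # Drop a leading byte order mark so the marker still matches at the start of a line.
    match = _ID_MARKER.search(usfm_text.lstrip("\ufeff"))
    if match is None:
        raise ValueError("No \\id marker found in the USFM text")
    return match.group(1).upper()


def check_book_id(usfm_text: str, expected_book_id: str) -> None:
    """Raise ValueError if the "\\id" tag does not match the expected book ID (hypothetical helper)."""
    actual = read_id_tag(usfm_text)
    expected = expected_book_id.upper()
    if actual != expected:
        raise ValueError(
            f"Book ID mismatch: expected {expected}, but the \\id tag is {actual}"
        )
```

With the project described in this issue, calling `check_book_id` on the Judges file with an expected ID of "JDG" would raise, because the file starts with "\id JUD". Failing early like this avoids silently misaligned rows, at the cost of requiring the project to be fixed before extraction can run.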
A project (Galego_2024_10_24.save) has an invalid book ID in the "\id" tag ("\id JUD") for the book of Judges; it should be "JDG" instead of "JUD".
No error was reported for this mismatch, but the book content was not extracted into the extract file. Also, most other books in the project were not extracted into the extract file either. The extract file contained verse extracts only for GEN, JUD, and REV, even though the project has a complete NT and multiple OT books.
A warning about the book ID mistake would be helpful, and the error should not affect the extraction of the other books.