-
-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Centralized metadata constraints #339
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #339 +/- ##
==========================================
+ Coverage 43.79% 45.58% +1.79%
==========================================
Files 23 26 +3
Lines 2393 2481 +88
Branches 1326 1380 +54
==========================================
+ Hits 1048 1131 +83
- Misses 1329 1334 +5
Partials 16 16
☔ View full report in Codecov by Sentry. |
a9eaa76
to
59d7830
Compare
Inviting @mgautierfr @kelson42 to review the overall approach for centralizing all metadata checks in one place. Please look at the file Issues noticed so far:
|
@veloman-yunkan Thank you very much for your PR which is, at a first sight, really much the kind of code I was expected. Indeed regexs fall short in many cases. The case I have in mind is checking illustration entries (requires libpng). I'm not in favour of extending much the validation scenarios because this can go endlessly and my goal is to have the basic ones. But I would appreciate to have the appropriate code infrastructure so we can easily do that work later... without rearchitecturing effort. |
On my local computer, it does not compile:
|
I have mixed feelings about this. On one hand, this mainly highlights the shortcomings of such an approach but on the other hand, simple checks are better than none. Couple of comments (already identified):
Once again we fall short on setting clear goals for our tools. zimcheck's description is “zimcheck checks the quality of a ZIM file.”. Does that mean that whenever zimcheck doesn't report an issue, the ZIM is guaranteed to be valid? I join @kelson42 in thinking we want basic checks for now that we could extend in the future.
And that's it. The rest can be discussed and extended in separate tickets, raised by actual needs. Although it serves a different purpose, scraperlib now (not being used yet) enforces correct metadata with more elaborate checks (actual language code, proper PNG with correct sizes, etc) so most of what we produce shall be valid in this regard. |
This comment was marked as off-topic.
This comment was marked as off-topic.
I have open a issue #340 to discuss what to test and how. Please continue the discussion there and keep this discussion for PR review. |
I just want to emphasis again that we need this basic metadata feature pretty quickly. Thank you for having open side tickets to continue the interesting discussion and keep this PR "simple" without "insulting" the future. |
@veloman-yunkan I hope you got all the feedbacks needed, let us know othwerwise. |
With illustration metadata in the mix, in what units min and max length values should be specified - bytes or (unicode) characters?
|
|
04d2355
to
ecbd62c
Compare
A preliminary version of this PR (reflecting feedback received on the draft version) is ready. Note that the PR doesn't utilize the new |
@veloman-yunkan Thank you, just to ensure we understand us well, this is mandatory that this is used at least in zimcheck. |
@mgautierfr any feedback/review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems globally good to me.
A small comment on the static pointer thing but the global structure is good.
How do you plan to implement checkTextLanguage
?
Or it is just a sample and it have to be removed before merge ?
In this PR |
Ok. You can remove it right now (or comment it to not loose the illustration purpose) while you fix the issue with |
I added usage of this new utility in |
The regeneration itself should be easy. The question is whether I should use the current release of |
I think here this is not relevant (at least for the change we want to made). |
The PR is final (though needs at least a rebase&fixup unless other issues are spotted). Known issues:
|
@veloman-yunkan For now, I'm happy with this kind of regex error. |
Please rebase-fixup and we can merge. |
This commit is intended to demonstrate how complex checks (involving more than one metadata entries) can be added if required.
When -L|--longDesciption option was added to zimwriterfs its argument remained unused (participating in constraint checks doesn't count).
-L option of zimwriterfs was not properly registered in the call to `getopt_long()`
For some reason the PNG regex check fails on real data. But at least it uncovered the issue with the error message which reads as bellow: ``` Illustration_48x48@1 doesn't match regex: ^�PNG � .+ ```
Regexp is not the best tool for use with binary data where NUL characters may occur.
Note that metadata constraints were relaxed in order to let the zimcheck unittests pass without updating/regenerating the unittest ZIM data.
Test ZIM files (good.zim, poor.zim and bad_checksum.zim) under test/data/zimfiles were regenerated using zimwriterfs v3.1.3 and the updated script create_test_zimfiles. Metadata constraints that were relaxed in the previous commit have been restored.
Before this change, simple metadata checks were performed only if there were no missing metadata entries.
0f79d9d
to
d32037c
Compare
Done |
I think this is a problem of coherence between checks and it can be fixed later (and we probably have to update the checks). For now the structure is good and we can merge this long PR. |
This PR addresses #334 (comment)
Fixes #334
Fixes #336