Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fixes and complete refactoring #3

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

KPHIBYE
Copy link

@KPHIBYE KPHIBYE commented Jul 6, 2024

Foreword

First of all, I want to apologize for the big scale of this pull request. If you need any assistance or want me to make changes before merging, please don't hesitate to ask. I tried to preserve as much as possible of the idea behind your code while improving its structure and readability. All changes that users of the library will experience are listed in the CHANGELOG file. In this pull request, I want to state the reasoning behind certain changes I made.

Project structure

I renamed the test project to adhere to the commonly used naming scheme of ProjectName.Test. I also added a LICENSE file to your repository that contains the Apache 2.0 license to match the PackageLicenseExpression you declared in MRZ.csproj. Furthermore, I also added a CHANGELOG file that contains the changes that a user of the library will experience when upgrading to a newer version.

I removed the Parser class since it didn't offer any benefit because there isn't any state that is managed by it besides the unnecessary sum variable that could introduce a race condition if Parser.DetectFields was simultaneously called from multiple different threads. I have rewritten your helper functions to be (essentially) pure functions, which means that the code is now thread-safe.

Document class

I renamed the Format, Type and Number properties to DocumentFormat, DocumentType and DocumentNumber to express that their values refer to the document itself and not the person that is represented by the document. I added a private property for each check digit so that reading values can be decoupled from validation checking. I removed the Gender enum and changed the datatype of the Gender property to char because a user of the library most likely has their own way of encoding gender information and in my opinion, they should directly perform the mapping themselves. If a value is not present in the mrz, the default value is now an empty string instead of null because it is the default value for a missing capturing group. I am unsure if this is an improvement or deterioration, so I would kindly ask for your feedback regarding this topic.

Regex

I added names to all capturing groups, which in my opinion greatly improves the readability of all regex patterns and also enabled me to write one common Document initialization block for all mrz formats.

Tests

I found that the old regex patterns allowed the | character for DocumentType and Gender values and the seventh position of the first SDL line since in regex, [ACI] means "A single character of: A, C or I" and a | within square brackets is treated as character and not as OR. I fixed the patterns and added a unit test that checks that an exception is thrown if | is present in any position of any mrz format.

Benchmarks

I ran benchmarks on my machine to make sure that my refactored code does not perform significantly worse than your old code. As it turns out, my refactored code is 5-10x faster and allocates 5-7x less memory per execution. I highly suspect that the reason for those improvements is due to my use of the Regex.IsMatch method, which internally performs regex caching. This approach is also considered best practice.

Method mrz Mean Error StdDev Ratio Allocated Alloc Ratio
Old ABC12(...)<<<<< [73] 13.541 us 0.2645 us 0.2344 us 1.00 16.85 KB 1.00
New ABC12(...)<<<<< [73] 2.440 us 0.0198 us 0.0176 us 0.18 2.23 KB 0.13
Old CTD71(...)<<<<< [73] 13.355 us 0.2652 us 0.3630 us 1.00 16.87 KB 1.00
New CTD71(...)<<<<< [73] 2.533 us 0.0308 us 0.0288 us 0.19 2.16 KB 0.13
Old I<PRT(...)ANESS [94] 45.284 us 0.3417 us 0.3196 us 1.00 34.5 KB 1.00
New I<PRT(...)ANESS [94] 4.947 us 0.0475 us 0.0421 us 0.11 5 KB 0.14
Old I<UTO(...)<<<<< [94] 44.831 us 0.4951 us 0.4631 us 1.00 34.46 KB 1.00
New I<UTO(...)<<<<< [94] 4.553 us 0.0690 us 0.0612 us 0.10 5.08 KB 0.15
Old I<UTO(...)<<<<6 [74] 47.148 us 0.8189 us 0.7660 us 1.00 28.85 KB 1.00
New I<UTO(...)<<<<6 [74] 5.334 us 0.1061 us 0.1342 us 0.11 4.7 KB 0.16
Old IDFRA(...)209F3 [74] 33.835 us 0.4931 us 0.4371 us 1.00 20.67 KB 1.00
New IDFRA(...)209F3 [74] 5.030 us 0.0975 us 0.1398 us 0.15 4.3 KB 0.21
Old IDFRA(...)216M4 [74] 33.788 us 0.6702 us 1.3690 us 1.00 20.71 KB 1.00
New IDFRA(...)216M4 [74] 4.899 us 0.0959 us 0.0897 us 0.14 4.34 KB 0.21
Old IPUSA(...)<<<<< [94] 43.328 us 0.7424 us 0.6944 us 1.00 34.45 KB 1.00
New IPUSA(...)<<<<< [94] 4.494 us 0.0715 us 0.0634 us 0.10 4.99 KB 0.14
Old P<UTO(...)<<<10 [90] 50.648 us 0.9955 us 1.0223 us 1.00 30.24 KB 1.00
New P<UTO(...)<<<10 [90] 4.647 us 0.0924 us 0.0771 us 0.09 5.09 KB 0.17
Old V<UTO(...)<<<<< [74] 40.498 us 0.7465 us 0.9440 us 1.00 27.85 KB 1.00
New V<UTO(...)<<<<< [74] 3.712 us 0.0723 us 0.0940 us 0.09 3.95 KB 0.14
Old V<UTO(...)<<<<< [90] 44.375 us 0.7008 us 0.6212 us 1.00 28.84 KB 1.00
New V<UTO(...)<<<<< [90] 3.707 us 0.0287 us 0.0255 us 0.08 4.02 KB 0.14

NuGet (MRZ.csproj)

I added the LICNSE, CHANGELOG and docstring files to the NuGet package. This did not result in an increased package size but a decrease from 39kB to 34kB. The content of the CHANGELOG file is also automatically used as content in the PackageReleaseNotes tag for the NuGet package.

To allow for deterministic builds, please build with the -p:"ContinuousIntegrationBuild=true" flag. Packing now also produces a symbol package that should also be uploaded to nuget to enable better debugging support for users in combination with SourceLink.

Here is an example of how I would build and publish the packages:

dotnet pack $PROJECT_FOLDER/MRZ.csproj -c Release -p:"ContinuousIntegrationBuild=true"
dotnet nuget push "$BUILD_FOLDER/*.nupkg" --source https://api.nuget.org/v3/index.json --api-key $NUGET_API_KEY

The dotnet nuget push command will publish both, the code and the symbol package, if they are located in the same folder.

Thank you for this great library and again, if you need any assistance with this pull request, please don't hesitate to ask.

…tests, for detailed changes see CHANGELOG and pull request GioviQ#3
@GioviQ
Copy link
Owner

GioviQ commented Jul 10, 2024

😱
I am busy now, I will check later.

@KPHIBYE
Copy link
Author

KPHIBYE commented Jul 29, 2024

I have forgotten to ask some questions regarding special cases that are defined in the standard, but that were and are not implemented correctly, details that I noticed where I am unsure about their correctness and potential improvements.

Deviations from the standard

  1. The long document number format for TD1 documents, as outlined in Doc 9303, Part 5, 4.2.4, is currently not implemented and the document number check digit validation will almost always fail in this case.

  2. Missing parts of the date of birth as outlined in Doc 9303, Part 3, 4.8 are currently not handled at all and providing a MRZ with such a birthdate will lead to an invalid document format exception.

If all or part of the date of birth is unknown, the relevant character positions shall be completed with filler characters (<).

Inconsistencies between documents

  1. Is it correct that the document number check digit for the French ID card may be a filler character?

  2. Is it correct that the Swiss driving license and the given name of the French ID card do not permit numbers in names? All other documents do so, which is why I have changed the regex to also permit it for the Swiss driving license. I seem to have forgotten to do this for the French ID card.

  3. Is it ok that I changed the nationality field to empty for the French ID card and the Swiss driving license? I am under the impression that foreign citizens can obtain such documents and that setting doc.Nationality = doc.CountryCode would be incorrect in some rare cases, but I may be mistaken.

Proposed improvements

I thought of some features that might be useful to some people, but I am unsure if implementing them would be worth it.

  1. Add a boolean parameter to the constructor that indicates if all checksum checks should be skipped.

  2. Implementing the IEquatable interface to provide equality checks for documents.

  3. Implementing the ICloneable interface to make a copy of a document. To be honest, I don't know in which case this would be helpful, since all properties are read only. It was just a thought.

  4. Reintroducing a (static) Parser class instead of parsing in the constructor. This would enable users of the library to directly use the Document class as a data transfer object.

Please take your time and don't rush with a response. I am not in a hurry 😊

@GioviQ
Copy link
Owner

GioviQ commented Jul 30, 2024

2. Doc 9303, Part 3, 4.8

I have forgotten to ask some questions regarding special cases that are defined in the standard, but that were and are not implemented correctly, details that I noticed where I am unsure about their correctness and potential improvements.

Deviations from the standard

  1. The long document number format for TD1 documents, as outlined in Doc 9303, Part 5, 4.2.4, is currently not implemented and the document number check digit validation will almost always fail in this case.

My implementation is based on real cases and not all specifications are implemented. Consider that countries do not follow in some cases the standard

  1. Missing parts of the date of birth as outlined in Doc 9303, Part 3, 4.8 are currently not handled at all and providing a MRZ with such a birthdate will lead to an invalid document format exception.

If all or part of the date of birth is unknown, the relevant character positions shall be completed with filler characters (<).

My primary need is establish the age of a person, so if you want you can return a null date of birth instead of an excepion.

Inconsistencies between documents

  1. Is it correct that the document number check digit for the French ID card may be a filler character?

Implementation based on https://github.com/ZsBT/mrz-java/blob/master/src/main/java/com/innovatrics/mrz/records/FrenchIdCard.java

  1. Is it correct that the Swiss driving license and the given name of the French ID card do not permit numbers in names? All other documents do so, which is why I have changed the regex to also permit it for the Swiss driving license. I seem to have forgotten to do this for the French ID card.
  2. Is it ok that I changed the nationality field to empty for the French ID card and the Swiss driving license? I am under the impression that foreign citizens can obtain such documents and that setting doc.Nationality = doc.CountryCode would be incorrect in some rare cases, but I may be mistaken.

See cheminfo/mrz#2

Proposed improvements

I thought of some features that might be useful to some people, but I am unsure if implementing them would be worth it.

  1. Add a boolean parameter to the constructor that indicates if all checksum checks should be skipped.

No, because the optical reader often returns wrong readings and why should I skip checks?

  1. Implementing the IEquatable interface to provide equality checks for documents.
  2. Implementing the ICloneable interface to make a copy of a document. To be honest, I don't know in which case this would be helpful, since all properties are read only. It was just a thought.

I don't need it. My use case is to establish person age and document expiration.

  1. Reintroducing a (static) Parser class instead of parsing in the constructor. This would enable users of the library to directly use the Document class as a data transfer object.

Please take your time and don't rush with a response. I am not in a hurry 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants