Skip to content
This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

A permissive MARCReader #109

Closed
edsu opened this issue Oct 17, 2017 · 4 comments
Closed

A permissive MARCReader #109

edsu opened this issue Oct 17, 2017 · 4 comments

Comments

@edsu
Copy link
Owner

edsu commented Oct 17, 2017

Over in #89 there has been a long discussion about working with invalid MARC data in the wild. I must admit I don't work with MARC much these days, so I had no idea people were running into so many problems processing large batches of MARC records.

Currently when pymarc runs into a structural problem in a MARC record it will throw an exception (RecordLeaderInvalid, BaseAddressNotFound, BaseAddressInvalid, RecordDirectoryInvalid, NoFieldsFound) which will also cause record iteration to stop.

@anarchivist offered up a PermissiveMARCReader which he has used to process large amounts of MARC data. PermissiveMARCReader catches all the exceptions thrown by structural problems with the MARC record and moves on to the next record.

Rather than introducing a new class I suggest that a new parameter named strict be added to the MARC.MARCReader constructor. When set to True it will continue to throw these exceptions. When set to False it will catch the exceptions, log them, and move on to the next record. It may be that some of these exceptions need to be relaxed, and the invalid data interpreted in some way. But let's open new issue tickets for those situations as they come up.

Based on the conversation we've been seeing lately I think the default for strict should be set to False. The MARCReader API will be backwards compatible (code that uses pymarc won't need to change). However this will be a significant change in behavior so I think a new minor version release will be needed, v3.2.

What do folks think?

@timClicks
Copy link

Strong +1 from me. I deal with MARC21 files from repositories, so it's basically impossible for me to fix the problem at source. Its rare that I would like to stop processing completely if I'm generating an export.

@petrus-v
Copy link
Contributor

Hi ! I'd like to help on this issue, I'll create a PR in few days, do you have any contribution guide lines to know ?

@edsu
Copy link
Owner Author

edsu commented Dec 1, 2019

@petrus-v PEP-8 and including unit tests (with test data if necessary) will definitely help move PRs along.

petrus-v pushed a commit to anybox/pymarc that referenced this issue Dec 3, 2019
This allow to read a large amont of data that may contains
wrong structural records which will be ignore to give a chance
to read as much of possible of recods in a file even one is
incorect
edsu added a commit that referenced this issue Dec 5, 2019
@Wooble
Copy link
Collaborator

Wooble commented Dec 6, 2019

I don't know if there's anything that could be done to be even more permissive in more places, but I think #144 pretty much covered this, so closing...

@Wooble Wooble closed this as completed Dec 6, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants