Broken diacritics in MARC8 binary imports #713
Comments
The defective OL metadata was based on https://openlibrary.org/show-records/ia:discours00unse_b2r, an older version that predates the scan date of the current IA and Newberry image. It seems there have been rescans and updates this year. If OL reimported the record, it should be good.
I confirmed that after the fix in #655 the MarcXML gets imported correctly, but the binary MARC is still buggy, so that fix is really just a workaround for the root problem. I didn't dig into it too far, but there seems to be an awful lot of custom MARC decoding code which could be eliminated and delegated to something like PyMARC. The bulk of https://github.com/internetarchive/openlibrary/tree/master/openlibrary/catalog/marc should be obsolete. I put together a new test case based on this record, which I'll try to add to a PR even if I don't have time to fix the underlying bug.
p.s. The other thing that needs to be done is to figure out what classes of records are affected. It doesn't appear to be all diacritics, so it's likely to be just a particular form of encoding. But since this section of the code hasn't been touched in 7+ years, whatever types of records are affected, there is probably the better part of a decade's worth of bad data.
Assigning @hornc per Slack discussion since this is import/data related.
@tfmorris do you have the test case you mentioned handy? Adding specific test cases to confirm we can handle specific situations is a really good way to make improvements on our import process!
Note to self: a simple test would be to check that author and title names can be imported correctly from MARC-8 encoded binary MARC files (which is what https://openlibrary.org/show-records/ia:discours00unse_b2r shows was the source).
Update: unfortunately, there isn't a single MARC8 encoded binary MARC in the test data at https://github.com/internetarchive/openlibrary/tree/e2871f828bbdf4d30618db5a7effdf8069b1f1e5/openlibrary/catalog/marc/tests/test_data/bin_input, which is concerning.
https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.mrc is MARC8 encoded, but the direct author and title don't have accents, just the |
The MARC for this author was either https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.mrc or https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.xml. I suspect the former, based on my earlier comment about MarcXML being fixed. Note that the view at https://openlibrary.org/show-records/ia:discours00unse_b2r is broken as well, which indicates that it may use common code with the importer. As I mentioned a couple of years ago, I think the right solution is to use PyMARC rather than trying to roll our own MARC code.
To find MARC8 encoded e-acutes in binary MARC:
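The command originally posted with this comment is not preserved here. As a rough sketch of one way to do such a search, assuming the standard MARC-8 (ANSEL) convention in which the combining acute is byte 0xE2 and is written before its base letter, an e-acute shows up as the two bytes `E2 65`. The function name `find_marc8_e_acute` is hypothetical and not part of the Open Library code base.

```python
from typing import List

# Assumption: ANSEL combining acute (0xE2) precedes the base letter "e" (0x65).
MARC8_E_ACUTE = b"\xe2e"


def find_marc8_e_acute(data: bytes) -> List[int]:
    """Return the byte offset of every MARC-8 style e-acute in raw MARC data."""
    offsets = []
    start = data.find(MARC8_E_ACUTE)
    while start != -1:
        offsets.append(start)
        # Continue searching just past the match we already recorded.
        start = data.find(MARC8_E_ACUTE, start + 1)
    return offsets
```

In valid UTF-8 the byte 0xE2 is always followed by a continuation byte (0x80 to 0xBF), never by an ASCII letter, so a hit for this pattern suggests the data is not UTF-8. A usage example would be `find_marc8_e_acute(open(path, "rb").read())`, where `path` points at a local `.mrc` file.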
The binary MARC record on the source https://openlibrary.org/show-records/ia:discours00unse_b2r claimed to be MARC8, but was actually UTF-8 encoded. I have corrected the header and reuploaded the binary MARC record to the source item. To prove we handle the general case of MARC-8 encoded binary MARC, I have added some test cases in PR #2705.
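A mismatch like the one described above can be detected heuristically. The sketch below is an illustration only, not the check used in PR #2705. It relies on MARC 21 leader position 9 (a blank means MARC-8, `a` means UCS/Unicode) and on the assumption that a record containing non-ASCII bytes which decodes cleanly as UTF-8 is very probably UTF-8. All function names are hypothetical.

```python
def leader_claims_marc8(record: bytes) -> bool:
    """Leader position 9 is a blank for MARC-8 and 'a' for UCS/Unicode."""
    return record[9:10] == b" "


def looks_like_utf8(record: bytes) -> bool:
    """True if the record has non-ASCII bytes and decodes cleanly as UTF-8."""
    if all(byte < 0x80 for byte in record):
        return False  # pure ASCII is identical in both encodings
    try:
        record.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def mislabelled_as_marc8(record: bytes) -> bool:
    """Heuristic: the leader says MARC-8 but the bytes look like UTF-8."""
    return leader_claims_marc8(record) and looks_like_utf8(record)
```

This is a heuristic rather than a proof: a genuine MARC-8 record could in principle contain byte sequences that happen to be valid UTF-8, so flagged records are worth inspecting by hand before the leader is rewritten.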
Yes. But with some Italian records that's not always enough to stop worrying about encoding altogether! edsu/pymarc#114
This record was created as recently as April 2017, so the bug is evidently still present in recent software.
https://openlibrary.org/authors/OL7369682A?v=1
The catalog record is correct at both The Newberry Library and the Internet Archive, so the corruption is happening in the import process.