Broken diacritics in MARC8 binary imports #713

tfmorris · 2018-01-01T23:11:43Z

This record was created in April 2017, so the bug appears to still be present in recent software.
https://openlibrary.org/authors/OL7369682A?v=1

The catalog record is correct at both The Newberry Library and the Internet Archive, so the corruption is happening in the import process.

mekarpeles · 2018-01-02T22:13:39Z

I think this is something that @hornc may have noticed re: #655

LeadSongDog · 2018-01-04T02:24:23Z

The defective OL metadata was based on https://openlibrary.org/show-records/ia:discours00unse_b2r, an older version than the scan date of the current IA and Newberry image. It seems there have been a rescan and updates this year. If OL reimported the record, it should be good.

tfmorris · 2018-01-04T02:44:31Z

I confirmed that after the fix in #655 the MarcXML gets imported correctly, but the binary MARC is still buggy, so it is really just a workaround for the root problem.

I didn't dig into it too far, but it seems like there's an awful lot of custom MARC decoding code which could be eliminated and delegated to something like PyMARC. The bulk of https://github.com/internetarchive/openlibrary/tree/master/openlibrary/catalog/marc should be obsolete.

I put together a new test case based on this record that I'll see if I can add to a PR even if I don't have time to fix the underlying bug.

tfmorris · 2018-01-04T02:47:11Z

p.s. The other thing that needs to be done is figure out what classes of records are affected. It doesn't appear to be all diacritics, so it's likely to be just a particular form of encoding, but the fact that this section of the code hasn't been touched in 7+ years indicates that for whatever types of records are affected, there is probably the better part of a decade's worth of bad data.

xayhewalo · 2019-11-08T21:49:44Z

Assigning @hornc per slack discussion since this is import/data related.

hornc · 2019-11-11T23:28:01Z

@tfmorris do you have the test case you mentioned handy? Adding specific test cases to confirm we can handle specific situations is a really good way to make improvements on our import process!

hornc · 2019-11-11T23:31:52Z

Note to self: a simple test would check that Author and Title names are imported correctly from MARC-8 encoded binary MARC files (which is what https://openlibrary.org/show-records/ia:discours00unse_b2r shows was the source).

update: Unfortunately there isn't a single MARC8 encoded binary MARC in the test data at https://github.com/internetarchive/openlibrary/tree/e2871f828bbdf4d30618db5a7effdf8069b1f1e5/openlibrary/catalog/marc/tests/test_data/bin_input which is concerning.

Add a representative MARC8 encoded book with, e.g., French diacritics in both the Author and Title to /catalog/marc/tests

https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.mrc is MARC8 encoded, but the direct author and title don't have accents, just the 100$c and alt title + 5XX notes fields, which still might be a good place to start.
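For anyone writing such a test case, here is a rough sketch of what makes MARC-8 diacritics easy to get wrong: the combining mark comes before the base letter, whereas Unicode puts it after. The function name `decode_marc8_subset` and the six-entry table are assumptions made for illustration only. This is not the Open Library decoder, and it deliberately ignores escape sequences and the rest of the MARC-8 repertoire.

```python
import unicodedata

# Small, illustrative subset of MARC-8 (ANSEL) combining diacritics.
# A real decoder needs the full table plus escape-sequence handling.
MARC8_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis / umlaut
    0xF0: "\u0327",  # cedilla
}


def decode_marc8_subset(data: bytes) -> str:
    """Decode printable ASCII plus the combining marks listed above.

    MARC-8 stores a combining mark BEFORE its base character, so marks are
    held back until the base character has been emitted.
    """
    out = []
    pending = []  # combining marks waiting for their base character
    for byte in data:
        if byte in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[byte])
        elif 0x20 <= byte < 0x7F:
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this sketch's subset")
    if pending:
        raise ValueError("combining mark with no base character")
    # Compose base + mark pairs into single code points where possible.
    return unicodedata.normalize("NFC", "".join(out))
```

A test fixture along these lines could assert that `decode_marc8_subset(b"pr\xe2epar\xe2e")` yields "préparé", which is the kind of Author/Title check described above.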

tfmorris · 2019-11-12T00:35:36Z

The MARC for this author was either https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.mrc or https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.xml. I suspect the former based on my earlier comment about MarcXML being fixed.

Note that the view at https://openlibrary.org/show-records/ia:discours00unse_b2r is broken as well, which indicates that it may use common code with the importer.

As I mentioned a couple of years ago, I think the right solution is to use PyMARC rather than trying to roll our own MARC code.

hornc · 2019-12-08T22:51:50Z

To find MARC8 encoded e-acutes in binary MARC when looking for good test case records: fgrep 'âe' *.mrc
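The grep above works because a MARC-8 e-acute is the byte 0xE2 (combining acute) followed by an ASCII "e", and 0xE2 displays as "â" when read as Latin-1. Whether the shell command matches may depend on the terminal locale, so a byte-level search is a possible alternative. The sketch below is only an illustration; `count_marc8_e_acute` and `find_candidates` are made-up helper names, not part of the Open Library code base.

```python
from pathlib import Path

MARC8_ACUTE = 0xE2  # MARC-8 combining acute accent, stored before the base letter


def count_marc8_e_acute(data: bytes) -> int:
    """Count occurrences of the byte pair 0xE2 'e' in raw record data.

    In valid UTF-8 the byte 0xE2 is always followed by a continuation byte
    (0x80-0xBF), never by ASCII 'e', so a hit is a reasonable hint of MARC-8.
    """
    return data.count(bytes([MARC8_ACUTE]) + b"e")


def find_candidates(directory) -> dict:
    """Map each *.mrc file name in `directory` to its hit count (non-zero only)."""
    hits = {}
    for path in sorted(Path(directory).glob("*.mrc")):
        n = count_marc8_e_acute(path.read_bytes())
        if n:
            hits[path.name] = n
    return hits
```

Calling `find_candidates(".")` in a directory of binary MARC files would list the ones worth inspecting by hand as test case candidates.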

hornc · 2019-12-08T23:42:37Z

The binary MARC record on the source https://openlibrary.org/show-records/ia:discours00unse_b2r claimed to be MARC8, but was actually UTF-8 encoded. I have corrected the header and reuploaded the binary MARC record to the source item.

To prove we handle the general case of MARC-8 encoded binary MARC, I have added some test cases in PR #2705
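The mislabelling described above can be spotted mechanically, at least roughly. In MARC 21, leader position 09 declares the character coding scheme: a blank means MARC-8 and "a" means UCS/Unicode. The sketch below flags records whose leader says MARC-8 but whose non-ASCII bytes all form valid UTF-8. The function names are hypothetical and the check is only a heuristic, not how the record in this issue was actually diagnosed.

```python
def declared_encoding(record: bytes) -> str:
    """Read leader position 09: blank = MARC-8, 'a' = UCS/Unicode."""
    if len(record) < 24:
        raise ValueError("a MARC leader is 24 bytes long")
    flag = record[9:10]
    if flag == b" ":
        return "marc8"
    if flag == b"a":
        return "utf8"
    return "unknown"


def looks_like_utf8(record: bytes) -> bool:
    """True if the record has non-ASCII bytes and the whole record is valid UTF-8."""
    if record.isascii():
        return False  # pure ASCII says nothing about the encoding
    try:
        record.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_mislabelled(record: bytes) -> bool:
    """Heuristic: leader claims MARC-8 but the bytes read as UTF-8."""
    return declared_encoding(record) == "marc8" and looks_like_utf8(record)
```

Typical MARC-8 diacritics (a byte such as 0xE2 followed by an ASCII letter) are not valid UTF-8, so genuine MARC-8 records with accents should not be flagged. Running a check like this over a batch of records could help estimate which classes of records are affected.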

nemobis · 2019-12-11T08:19:48Z

> I think the right solution is to use PyMARC rather than trying to roll our own MARC code.

Yes. But with some Italian records that's not always enough to stop worrying about encoding altogether! edsu/pymarc#114

mekarpeles added importbot labels Jan 2, 2018

brad2014 added Module: Import and removed importbot labels May 9, 2019

xayhewalo added Affects: Data, Priority: 2, State: Backlogged, Type: Bug labels Nov 8, 2019

xayhewalo assigned hornc Nov 8, 2019

hornc added Theme: MARC records and removed special characters labels Nov 12, 2019

hornc changed the title ~~Broken diacritics in ImportBot created records~~ Broken diacritics in MARC8 binary imports Nov 12, 2019

hornc mentioned this issue Dec 5, 2019

Load Marygrove MARC records into Open Library #2683

Closed

3 tasks

hornc mentioned this issue Dec 8, 2019

Add tests to prove MARC8 encoded import ability. #2705

Merged

mekarpeles closed this as completed in #2705 Dec 9, 2019

tfmorris mentioned this issue Apr 23, 2020

MarcBinary(data): Should the data parameter be bytes or str? #3390

Closed

tfmorris mentioned this issue Jun 14, 2023

Refactor to use pymarc instead of custom MARC parser #7969

Open
