Broken diacritics in MARC8 binary imports #713

Closed
tfmorris opened this issue Jan 1, 2018 · 11 comments · Fixed by #2705
Labels
Affects: Data Issues that affect book/author metadata or user/account data. [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Priority: 2 Important, as time permits. [managed] Theme: MARC records Type: Bug Something isn't working. [managed]

Comments

@tfmorris
Contributor

tfmorris commented Jan 1, 2018

This record was created as recently as April 2017, so the bug still appears to be present in recent software.
https://openlibrary.org/authors/OL7369682A?v=1

The catalog record is correct at both The Newberry Library and the Internet Archive, so the corruption is happening in the import process.

@mekarpeles
Member

I think this is something that @hornc may have noticed re: #655

@LeadSongDog

The defective OL metadata was based on https://openlibrary.org/show-records/ia:discours00unse_b2r, an older version that predates the scandate of the current IA and Newberry image. It seems there have been a rescan and updates this year. If OL reimported the record, it should be good.

@tfmorris
Contributor Author

tfmorris commented Jan 4, 2018

I confirmed that after the fix in #655 the MarcXML gets imported correctly, but the binary MARC is still buggy, so that fix is really just a workaround for the root problem.

I didn't dig into it too far, but it seems like there's an awful lot of custom MARC decoding code which could be eliminated and delegated to something like PyMARC. The bulk of https://github.com/internetarchive/openlibrary/tree/master/openlibrary/catalog/marc should be obsolete.

I put together a new test case based on this record that I'll see if I can add to a PR even if I don't have time to fix the underlying bug.

@tfmorris
Contributor Author

tfmorris commented Jan 4, 2018

p.s. The other thing that needs to be done is figure out what classes of records are affected. It doesn't appear to be all diacritics, so it's likely to be just a particular form of encoding, but the fact that this section of the code hasn't been touched in 7+ years indicates that for whatever types of records are affected, there is probably the better part of a decade's worth of bad data.

@brad2014 brad2014 added Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] and removed importbot labels May 9, 2019
@xayhewalo xayhewalo added Affects: Data Issues that affect book/author metadata or user/account data. [managed] Priority: 2 Important, as time permits. [managed] State: Backlogged Type: Bug Something isn't working. [managed] labels Nov 8, 2019
@xayhewalo
Collaborator

Assigning @hornc per slack discussion since this is import/data related.

@hornc
Collaborator

hornc commented Nov 11, 2019

@tfmorris do you have the test case you mentioned handy? Adding specific test cases to confirm we can handle specific situations is a really good way to make improvements on our import process!

@hornc
Collaborator

hornc commented Nov 11, 2019

Note to self: a simple test would be to check that Author and Title names can be imported correctly from MARC-8 encoded binary MARC files (which is what https://openlibrary.org/show-records/ia:discours00unse_b2r shows was the source).

update: Unfortunately there isn't a single MARC8 encoded binary MARC in the test data at https://github.com/internetarchive/openlibrary/tree/e2871f828bbdf4d30618db5a7effdf8069b1f1e5/openlibrary/catalog/marc/tests/test_data/bin_input which is concerning.

  • Add a representative MARC8 encoded book with e.g. French diacritics in both the Author and Title to /catalog/marc/tests

https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.mrc is MARC8 encoded, but the direct author and title don't have accents, just the 100$c and alt title + 5XX notes fields, which still might be a good place to start.
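For anyone writing such a test case, it may help to see what a MARC-8 author or title field is expected to decode to. The following is a minimal sketch, not the Open Library importer and not PyMARC: `decode_marc8_subset` is a hypothetical helper that handles only ASCII base letters plus a handful of ANSEL combining marks, and it ignores escape sequences and alternate character sets entirely. The point it illustrates is that MARC-8 places a combining diacritic before its base letter, while Unicode places it after.

```python
import unicodedata

# Deliberately incomplete subset of ANSEL/MARC-8 combining marks,
# mapped to the equivalent Unicode combining characters.
COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis
    0xF0: "\u0327",  # cedilla
}


def decode_marc8_subset(data: bytes) -> str:
    """Decode ASCII text with MARC-8 combining marks (hypothetical helper)."""
    out = []
    pending = []  # marks seen but not yet attached to a base letter
    for byte in data:
        if byte in COMBINING:
            pending.append(COMBINING[byte])
        elif byte < 0x80:
            # Emit the base letter first, then the marks that preceded it.
            out.append(chr(byte))
            out.extend(pending)
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this sketch's subset")
    if pending:
        raise ValueError("combining mark with no base character")
    # Compose base + mark pairs into single precomposed characters.
    return unicodedata.normalize("NFC", "".join(out))
```

A test for the importer could assert the same kind of round trip on a real record: raw bytes such as `b"pr\xe2eliminaire"` in the binary MARC should surface as "préliminaire" in the imported title.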

@tfmorris
Contributor Author

The MARC for this author was either https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.mrc or https://archive.org/download/discours00unse_b2r/discours00unse_b2r_meta.xml. I suspect the former based on my earlier comment about MarcXML being fixed.

Note that the view at https://openlibrary.org/show-records/ia:discours00unse_b2r is broken as well, which indicates that it may use common code with the importer.

As I mentioned a couple of years ago, I think the right solution is to use PyMARC rather than trying to roll our own MARC code.

@hornc hornc changed the title Broken diacritics in ImportBot created records Broken diacritics in MARC8 binary imports Nov 12, 2019
@hornc
Collaborator

hornc commented Dec 8, 2019

To find MARC8 encoded e-acutes in binary MARC, which helps with finding good test case records: fgrep 'âe' *.mrc
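The same search can be sketched at the byte level, which avoids depending on how the terminal locale encodes 'â'. In MARC-8 the combining acute is byte 0xE2 and it precedes the base letter, so an e-acute is the two bytes 0xE2 0x65. The function names below are hypothetical, and the sketch assumes the `.mrc` files are small enough to read into memory.

```python
from pathlib import Path

# MARC-8 combining acute (0xE2) followed by ASCII "e" (0x65).
MARC8_E_ACUTE = b"\xe2e"


def has_marc8_e_acute(data: bytes) -> bool:
    """Return True if the raw bytes contain a MARC-8 style e-acute."""
    return MARC8_E_ACUTE in data


def find_candidate_records(directory: str) -> list:
    """List .mrc files under `directory` containing a MARC-8 e-acute."""
    return sorted(
        str(path)
        for path in Path(directory).glob("*.mrc")
        if has_marc8_e_acute(path.read_bytes())
    )
```

One useful property of this pattern: valid UTF-8 never contains 0xE2 followed directly by an ASCII byte, so a hit suggests the data really is MARC-8 (or at least not clean UTF-8).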

@hornc
Collaborator

hornc commented Dec 8, 2019

The binary MARC record on the source https://openlibrary.org/show-records/ia:discours00unse_b2r claimed to be MARC8, but was actually UTF-8 encoded. I have corrected the header and reuploaded the binary MARC record to the source item.

To prove we handle the general case of MARC-8 encoded binary MARC, I have added some test cases in PR #2705
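A mislabelled record of this kind can be detected heuristically. The sketch below is an illustration only, not the check used in PR #2705: it relies on the MARC 21 convention that leader position 09 is a blank for MARC-8 and `a` for UCS/Unicode, and then tests whether the bytes happen to decode as UTF-8. The function names are hypothetical.

```python
def declared_encoding(record: bytes) -> str:
    """Read the character coding scheme from leader position 09."""
    flag = record[9:10]
    if flag == b"a":
        return "utf-8"
    if flag == b" ":
        return "marc-8"
    return "unknown"


def looks_like_utf8(record: bytes) -> bool:
    """True if the record has non-ASCII bytes that form valid UTF-8."""
    if record.isascii():
        # Pure ASCII is valid in both encodings, so it proves nothing.
        return False
    try:
        record.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def mislabelled_as_marc8(record: bytes) -> bool:
    """Flag records whose leader says MARC-8 but whose bytes are UTF-8."""
    return declared_encoding(record) == "marc-8" and looks_like_utf8(record)
```

This is a heuristic rather than a proof, but it is a fairly strong one: MARC-8 combining marks such as 0xE2 are normally followed by an ASCII letter, which is not valid UTF-8, so real MARC-8 text with diacritics rarely passes the UTF-8 check.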

@nemobis

nemobis commented Dec 11, 2019

> I think the right solution is to use PyMARC rather than trying to roll our own MARC code.

Yes. But with some Italian records that's not always enough to stop worrying about encoding altogether! edsu/pymarc#114


7 participants