This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

Trouble reading Harvard Open Metadata MARC files (UTF-8 related?) #89

Open

viking2917 opened this issue Apr 25, 2016 · 28 comments

@viking2917

I am trying to use pymarc to read the Harvard Open Metadata MARC files.

Most of the files process OK, but some (for example ab.bib.14.20160401.full.mrc) produce errors. The error I am getting is:

Traceback (most recent call last):
  File "domark.py", line 21, in <module>
    for record in reader:
  File "/Library/Python/2.7/site-packages/six.py", line 535, in next
    return type(self).__next__(self)
  File "/Users/markwatkins/Sites/pharvard/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "/Users/markwatkins/Sites/pharvard/pymarc/record.py", line 307, in decode_marc
    code = subfield[0:1].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

The driver code I am using is:

#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
from pymarc import MARCReader

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)

if len(sys.argv) >= 2:
    files = [sys.argv[1]]
else:
    sys.exit('usage: domark.py <marcfile>')

for file in files:
    with open(file, 'rb') as fh:
        reader = MARCReader(fh, utf8_handling='ignore')
        for record in reader:
#            print "%s by %s" % (record.title(), record.author())
            print(record.as_json())

Other MARC processing tools (e.g. MarcEdit) seem to process the file with no issues, so I think the file is legitimate.

Am I doing something wrong? Is there an issue with pymarc, possibly UTF-8 processing related?

@Wooble
Collaborator

Wooble commented Apr 25, 2016

Can you isolate a single record that's displaying this problem?

From the traceback it appears that there's a subfield code that's not ASCII, which is forbidden by the MARC21 spec.

(If such records exist in the wild, though, pymarc should probably have a way to deal with them. This is one area where there's currently no workaround as far as I know...)

@viking2917
Author

Thank you! Working on isolating the record... it's unfortunately a massive binary file. :( But from the Python debugger, it does look like you are correct that there are Unicode chars in a subfield, e.g.

(Pdb) entry_tag
u'040'
(Pdb) subs
[u'  ', '\xc4\x81TRCLS', 'beng', 'erda', 'cTRCLS', 'dOCLCO', 'dHMY']
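
(A quick check in a Python 2 shell confirms that those stray bytes are the UTF-8 encoding of 'ā', U+0101, i.e. the subfield code byte itself carries the diacritic where a plain 'a' should be:)

>>> '\xc4\x81'.decode('utf-8')
u'\u0101'
>>> import unicodedata
>>> unicodedata.name(u'\u0101')
'LATIN SMALL LETTER A WITH MACRON'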

@viking2917
Author

Here's what the record looks like dumped to a text file by MarcEdit. It does indeed look like the 040 field has Unicode in the subfield code.

If the MARC21 spec indeed forbids this, then this issue should probably be closed, although more tolerant error handling might be helpful.

=LDR  01444cam a2200397Ii 4500
=001  014333604-5
=005  20150806114915.0
=008  150209s2015\\\\ja\a\\\\\\\\\\000\0\jpn\d
=020  \\$a9784480068163
=020  \\$a4480068163
=035  0\$aocn902996729
=040  \\$āTRCLS$beng$erda$cTRCLS$dOCLCO$dHMY
=090  \\$aNA6310$b.A53 2015
=100  1\$6880-01$aAkase, Tatsuzō,$d1946-$eauthor.
=245  10$6880-02$aEki o dezain suru :$bkarā shinsho /$cAkase Tatsuzō.
=264  \1$6880-03$aTōkyō-to Taitō-ku :$bChikuma Shobō,$c2015.
=300  \\$a254 pages :$billustrations ;$c18 cm.
=336  \\$atext$btxt$2rdacontent
=337  \\$aunmediated$bn$2rdamedia
=338  \\$avolume$bnc$2rdacarrier
=490  1\$6880-04$aChikuma shinsho ;$v1112
=650  \0$aRailroad stations$xDesign and construction.
=650  \0$aRailroad stations$vDesigns and plans.
=650  07$6880-05$aEki.$2jlabsh/4
=650  07$6880-06$aShinboru māku.$2jlabsh/4
=880  1\$6100-01$a赤瀬達三,$d1946-$eauthor.
=880  10$6245-02$a駅をデザインする :$bカラー新書 /$c赤瀬達三.
=880  \1$6264-03$a東京都台東区 :$b筑摩書房,$c2015.
=880  1\$6490-04$aちくま新書 ;$v1112
=880  07$6650-05$a駅.$2jlabsh/4
=880  07$6650-06$aシンボルマーク.$2jlabsh/4
=830  \0$6880-07$aChikuma shinsho ;$v1112.
=880  \0$6830-07$aちくま新書 ;$v1112.
=988  \\$a20150327
=049  \\$aHMYY
=906  \\$0MH

@edsu
Owner

edsu commented Apr 25, 2016

It looks like the record is coded as containing Unicode (leader position 9). I forget: why are you using

reader = MARCReader(fh, utf8_handling='ignore')
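
(For anyone who wants to check what a file claims: leader byte 9 is the character coding scheme, where 'a' means UCS/Unicode and a blank means MARC-8. A minimal sketch using plain file reads, not pymarc API, on the file named earlier in the thread; it only inspects the first record:)

with open('ab.bib.14.20160401.full.mrc', 'rb') as fh:
    leader = fh.read(24)   # every MARC record starts with a 24-byte leader
    print(leader[9])       # 'a' -> this record claims Unicode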

@viking2917
Author

I was using 'ignore' because otherwise the processing would stop when it encountered encoding difficulties. (Aside: I wonder if better error handling would be to skip offending records and keep going?) Right now, when it throws an exception, the script halts. (Caveat: I am a Python newbie and likely doing something wrong.)

@gugek
Contributor

gugek commented Apr 25, 2016

@viking2917: right, sadly the ignore parameter doesn't apply everywhere in the MARC field, and it also doesn't apply in areas where UTF-8 isn't permitted at all.

ALEPH (the ILS Harvard is on) will let you save a Unicode character to a subfield code.

I have a branch somewhere that does some of this error handling for another project.

In record.py:

for subfield in subs[1:]:
    if len(subfield) == 0:
        continue
    try:
        code = subfield[0:1].decode('ascii')
    except UnicodeDecodeError:
        if utf8_handling == 'replace':
            # transliterate the non-ASCII code byte to its closest ASCII
            # equivalent (e.g. 'ā' -> 'a'); needs: from unidecode import unidecode
            code = unidecode(subfield[0:1].decode(encoding, utf8_handling))
            # log which record and tag were repaired (needs: import logging)
            message = "tag {0}: utf8 - sf code {1}".format(entry_tag, code)
            if self['001']:
                message = "=001 {0}: ".format(self['001'].data) + message
            logging.error(message)
        else:
            raise

https://github.com/gugek/pymarc/blob/leader-handling/pymarc/record.py

@viking2917
Author

@gugek Thank you! Will give that a go.

@Wooble
Collaborator

Wooble commented Apr 25, 2016

That sounds like a reasonable patch to me (of course, it wouldn't have helped here with utf8_handling='ignore'; I can't think of a good way to use ignore here, since it implies ending up with the subfield code completely blank).

Decomposing the "ā" and throwing away the diacritic would probably do the right thing in this particular case, since I can't see how that's possibly supposed to be anything but $a... But I don't know that that's a good general solution :)
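
For what it's worth, the decomposition idea is only a couple of lines with the standard library. A minimal sketch (not part of pymarc; the unidecode call in the patch above achieves much the same effect):

import unicodedata

def ascii_subfield_code(code):
    # NFD-decompose, then drop combining marks: u'\u0101' ('a' + macron) -> u'a'
    decomposed = unicodedata.normalize('NFD', code)
    return u''.join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_subfield_code(u'\u0101'))  # prints: a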

@viking2917
Author

viking2917 commented Apr 25, 2016

Yes, 'replace' is probably the right option. I was happy to simply discard records with errors, so I was using 'ignore', but with 'replace' and this patch I seem to be able to get pretty much everything.

Thanks everyone!

@viking2917
Author

viking2917 commented Apr 26, 2016

I am not sure how aggressively the Harvard Open Metadata project is being maintained, but I've reported the issue in case someone there is actively maintaining it. (It is a treasure trove of open source book metadata...)

Thanks.

Mark

On Mon, Apr 25, 2016 at 4:44 PM, Jim Nicholls [email protected]
wrote:

Not that this helps anyone usefully progress, but I did want to point out
that this is in fact not a valid MARC record.

According to the MARC 21 Specification
https://www.loc.gov/marc/specifications/specrecstruc.html:

subfield code "The two-character combination of a delimiter followed
by a data element identifier. [...]"

delimiter "ASCII control character 1F(hex) [...]"

data element identifier "A one-character code used to identify
individual data elements within a variable field. The data element may
be any ASCII lowercase alphabetic, numeric, or graphic symbol except blank."

And according to the Character Sets and Encoding Options
https://www.loc.gov/marc/specifications/speccharintro.html section:

ASCII "[...] a 7-bit coded character set [...]"

ASCII numerics "ASCII code points 30(hex) through 39(hex)"

ASCII lowercase alphabetics "ASCII code points 61(hex) through 6F(hex)
and 70(hex) through 7A(hex)"

ASCII graphic symbols "The ASCII graphic characters other than
numerics, alphabetics, space, and delete. Code points 21(hex) through
2F(hex), 3A(hex) through 3F(hex), 40(hex), 5B(hex) through 5F(hex),
60(hex), and 7B(hex) through 7E(hex) are included."



@josepablog

josepablog commented Aug 10, 2016

I know this is an old issue, but I'm having the same problem as @viking2917 ...

I'm trying to parse the Harvard Open Metadata db, and I'm running into exceptions. Is there any patch I could apply? My code breaks, and I cannot catch the exception to skip the record.

(I do not understand what @gugek suggested)...

@Wooble
Collaborator

Wooble commented Aug 11, 2016

Can you isolate a record that displays the issue and attach it as MARC21? (I'd prefer to avoid trying to create a bad record by hand for testing, and I don't think my ILS can create one, nor can I do it programmatically in pymarc :) )

@viking2917
Author

@josepablog I did finally get around this problem. Here's how:

First I altered record.py (new file attached), to add some error handling.
record.py.txt

Instead of this driver from the github page:

from pymarc import MARCReader
with open('test/marc.dat', 'rb') as fh:
    reader = MARCReader(fh)
    for record in reader:
        print(record.title())

I changed the driver to this (basically, importing codecs and sys, and switching to utf8_handling='replace'):


#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import codecs
import sys
from pymarc import MARCReader

UTF8Writer = codecs.getwriter('utf8')
sys.stdout = UTF8Writer(sys.stdout)

if len(sys.argv) >= 2:
    files = [sys.argv[1]]
else:
    sys.exit('usage: %s <marcfile>' % sys.argv[0])

for file in files:
    with open(file, 'rb') as fh:
        reader = MARCReader(fh, utf8_handling='replace')
        for record in reader:
            print(record.as_json())

On my Mac, I needed to install a few Python packages:

sudo pip install unidecode
sudo pip install -U six

(I am not 100% sure whether I needed to install six or not. Your mileage may vary.)

I'm a python n00b and not sure this code is really production-ready, so I did not create a pull request. But it's been working for me. I only half-understand what I did, as I really don't know Python. Good luck!

@josepablog

@Wooble the file is huge! And my understanding of MARC-21 is extremely limited.

I'll give @viking2917's solution a try, and hopefully it will work...

Thank you to both!

@viking2917
Author

(Aside: I traded some emails with the good folks at Harvard, and they said something to the effect that there are in fact occasional invalid records, due to the large, distributed nature of their libraries and data. They did correct the issues I brought to their attention, but I think it's a good idea to protect against invalid data where possible.)

@josepablog

@viking2917 I downloaded the new Harvard db, and I don't have those problems if I use the utf8_handling='ignore' flag ...

@viking2917
Author

@josepablog Interesting. That flag helped me get further but didn't solve all my issues. But glad it's working for you! Perhaps something has changed in the meantime....

@josepablog

I think I declared victory too early.

PyMarc still breaks for these files:

ab.bib.11.20160805.full.mrc
ab.bib.13.20160805.full.mrc
ab.bib.14.20160805.full.mrc

Wish I knew how to isolate the record, to get some help from @edsu : )
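
One way to isolate the offenders without digging into MARC internals: split the raw file on the record terminator byte (0x1D) and try to parse each chunk on its own, saving the failures for later inspection. A rough sketch (the filename is one of those listed above; this assumes the record boundaries themselves are intact):

from pymarc import Record

with open('ab.bib.14.20160805.full.mrc', 'rb') as fh:
    chunks = fh.read().split(b'\x1d')

with open('bad_records.mrc', 'wb') as bad:
    for chunk in chunks:
        if len(chunk) < 24:        # skip empty/trailing fragments
            continue
        raw = chunk + b'\x1d'      # restore the terminator
        try:
            Record(data=raw)       # default strict handling surfaces bad bytes
        except Exception:
            bad.write(raw)         # collect unparseable records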

@edsu
Owner

edsu commented Aug 15, 2016

So it sounds like we might need a way to catch all exceptions when reading a record, and keep moving through the records?

@josepablog

josepablog commented Aug 15, 2016

@edsu Yes, I think so

This is an example of the exception I get (I'm using utf8_handling='ignore', which I don't know if it makes sense, but it reduces the number of errors):

File "//anaconda/lib/python2.7/site-packages/pymarc/reader.py", line 97, in __next__
    utf8_handling=self.utf8_handling)
  File "//anaconda/lib/python2.7/site-packages/pymarc/record.py", line 74, in __init__
    utf8_handling=utf8_handling)
  File "//anaconda/lib/python2.7/site-packages/pymarc/record.py", line 231, in decode_marc
    self.leader = marc[0:LEADER_LEN].decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 17: ordinal not in range(128)

Should I just wrap the whole thing in a try/except? Or is there anything smarter to do?

Thank you again for your help, Ed!

@Wooble
Collaborator

Wooble commented Aug 15, 2016

It's fairly annoying to have to do it yourself, since it would require calling next() manually instead of just using a for loop.

I don't think changing utf8_handling is likely to help if the problem is in the indicators or the leader. pymarc itself should probably have a way to recover from these better; personally, I'm running my fork's "leader_encoding" branch in production because our database itself has broken records. It's probably a good start, but not really something I'd want to merge into master at the moment, since it's a bit sledgehammery: it just fixes the leaders we have problems with in the obvious way and prints a warning, with no way to select strict behavior.
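
The manual version looks roughly like this (a sketch; 'records.mrc' is a placeholder filename). Because the reader has already consumed the bad record's bytes by the time the exception propagates, the next call to next() resumes at the following record:

from pymarc import MARCReader

with open('records.mrc', 'rb') as fh:
    reader = MARCReader(fh)
    while True:
        try:
            record = next(reader)
        except StopIteration:
            break                  # end of file
        except Exception as exc:
            print('skipping unreadable record: %s' % exc)
            continue
        print(record.title())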

@pbnjay

pbnjay commented Oct 16, 2017

Also having this issue with the USDA National Agricultural Library's MARC21 downloads. pymarc 2.9.2 handled these files fine on an old system.

I can share a ~60 MB file from this distribution if that helps with testing.

@edsu
Owner

edsu commented Oct 17, 2017

@pbnjay Yes, sharing the data that can be used to demonstrate the problem is key.

On another note, I myself work with MARC records only rarely now. Does it make sense to move pymarc over to the code4lib organization account here on GitHub so it can be maintained/developed without me being a bottleneck?

@Wooble
Collaborator

Wooble commented Oct 17, 2017

(If you can isolate a single problem record instead of a 60 MB file, that would probably be better, though.)

@anarchivist
Contributor

In case it's useful, I previously had bodged together a permissive reader version of pymarc.MARCReader: https://gist.github.com/anarchivist/4141681

@reeset

reeset commented Oct 17, 2017

I'm not sure how Python does character reading, but for large sets like this, I'd always recommend taking them out of MARC and putting them into XML. You can't trust the encoding bits, and most ILS systems will do wonky things when exporting large sets (that you won't see with small sets). Additionally, they violate MARC21 rules (but not the general ISO 2709 record structure), so you cannot code rules based on expected field values.

When I process directly to XML (at least in how I do it in MarcEdit), I ignore character encoding completely, processing via a binary memory stream, and sanitize characters for XML processing. This way I avoid these kinds of character issues.

The other option (and MarcEdit does this as well) when doing MARC processing is to have your stream do encoding swapping based on the record read. But that requires an algorithm that actually determines the character encoding of each record, so you can conform the reader to the individual record's encoding block and then convert the data into the encoding expected by the writer.
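
A rough sketch of that per-record inspection step (an illustration only, not MarcEdit's actual code; 'records.mrc' is a placeholder, Python has no built-in MARC-8 codec, and, as noted above, the coding-scheme byte itself can't be fully trusted):

def claimed_encoding(raw_record):
    # Leader/09: b'a' = UCS/Unicode (UTF-8); b' ' = MARC-8
    return 'utf-8' if raw_record[9:10] == b'a' else 'MARC-8'

with open('records.mrc', 'rb') as fh:
    for raw in fh.read().split(b'\x1d'):  # 0x1D is the record terminator
        if len(raw) >= 24:                # skip trailing fragments
            print(claimed_encoding(raw))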

@edsu
Owner

edsu commented Oct 17, 2017

I would still like to have a test record to play with that demonstrates the particular problem we are dealing with here. If we can't reproduce the problem it's really impossible to get enough traction to fix it.

I do like @anarchivist's idea of adding an option to pymarc.Reader. I'm going to open a new issue for that.

@pbnjay

pbnjay commented Oct 17, 2017

I essentially just commented out all the instances of .decode('ascii') in record.py and it works fine now (line 307 was my particular exception also, but I just did them all).

I uploaded two problem files here: https://www.dropbox.com/sh/f4w7nv6e5ghnpmr/AACXD4L-GGqPhbc1YexBc6iea?dl=0 Since they're from the USDA they should be public domain, but just in case I'll unshare them once you have a copy to debug with. I'm not producing these files, just converting them to XML, so it'll probably be easier for someone who knows what they're doing to isolate them.

Wooble added a commit to Wooble/pymarc that referenced this issue Jul 3, 2019
[I think this is the problem in edsu#89
and not really specific to Harvard Open Metadata]