Add encoding option to marcxml.record_to_xml #105

aaronhelton · 2017-08-08T13:00:44Z

Unless I am missing something about the marcxml functionality, the record_to_xml function appears to return text encoded in us-ascii, which causes problems when downstream systems expect utf-8. I traced this to xml.etree.ElementTree.tostring, which takes an optional encoding parameter that defaults to us-ascii. I propose adding an optional encoding parameter to marcxml.record_to_xml that is passed through to its ET.tostring call.
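A minimal sketch of the behavior described above, assuming Python 3 and using only the standard library. The hand-built element is a stand-in for the node pymarc produces, not pymarc's own code:

```python
import xml.etree.ElementTree as ET

# Hand-built element standing in for the node that record_to_xml_node
# would return (an assumption for illustration only).
node = ET.Element("subfield", {"code": "a"})
node.text = "Nouvelles-Hébrides"

# Default encoding is us-ascii: non-ASCII characters are written as
# numeric character references.
ascii_xml = ET.tostring(node)

# Explicit utf-8: non-ASCII characters are written as UTF-8 byte sequences.
utf8_xml = ET.tostring(node, encoding="utf-8")

print(ascii_xml)
print(utf8_xml)
```

Both calls return bytes; only the representation of the accented characters differs.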

In my local fork, I have made the following change:

def record_to_xml(record, quiet=False, namespace=False, encoding='us-ascii'):
  node = record_to_xml_node(record, quiet, namespace)
  return ET.tostring(node, encoding=encoding)

Without the change, my output for record_to_xml on UTF-8 strings that contain diacritics looks like this:

<record>
    <leader>          22        4500</leader>
    <datafield ind1=" " ind2=" " tag="246">
      <subfield code="a">Nouvelles-H&#233;brides, communiqu&#233;s par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
      <subfield code="b">Lois et r&#232;glements promulgu&#233;s pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et r&#233;glementer la distribution des stup&#233;fiants, amend&#233;e par le Protocole du 11 d&#233;cembre 1946</subfield>
    </datafield>
</record>

The resulting file ends up with a us-ascii encoding, which causes the record import to fail on the MARC-based system we are using.

With the change, I get output that looks like this when I pass the optional encoding:

<record>
	<leader>          22        4500</leader>
	<datafield ind1=" " ind2=" " tag="246">
		<subfield code="a">Nouvelles-Hébrides, communiqués par le gouvernement de la France et par le gouvernement du Royaume-Uni de Grande-Bretagne et d'Irlande du Nord :</subfield>
		<subfield code="b">Lois et règlements promulgués pour donner effet aux dispositions de la Convention du 13 juillet 1931 pour limiter la fabrication et réglementer la distribution des stupéfiants, amendée par le Protocole du 11 décembre 1946</subfield>
	</datafield>
</record>

I invoke as follows:

out_file.write(marcxml.record_to_xml(record, encoding='utf-8'))

And the resulting file ends up with a utf-8 encoding.

Note that I tried forcing encoding to utf-8 at each successive level beginning with the open() function and working backward to the record itself. The only thing I found that actually works is to pass an encoding parameter in this particular function. If I am missing something (obvious or not), I'd be interested in correcting my oversight.
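One way to see why setting the encoding at open() has no effect, sketched under the assumption of Python 3 with a hand-built element in place of a real record: by the time the data reaches the file, the accented characters have already been turned into ASCII character references, so a UTF-8 text file receives only ASCII bytes.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in element (hypothetical data, not a real pymarc record).
node = ET.Element("subfield", {"code": "b"})
node.text = "stupéfiants"

# Default serialization: ASCII-only bytes with character references.
ascii_bytes = ET.tostring(node)

fd, path = tempfile.mkstemp(suffix=".xml")
os.close(fd)
try:
    # Opening the file as UTF-8 text does not change what is written,
    # because the character references are plain ASCII already.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write(ascii_bytes.decode("ascii"))
    with open(path, "rb") as in_file:
        on_disk = in_file.read()
finally:
    os.remove(path)

# True when every byte on disk is in the ASCII range.
still_ascii = all(byte < 128 for byte in on_disk)
print(still_ascii)
```

A file whose bytes are all ASCII is what an encoding sniffer would be expected to label us-ascii, regardless of how the file handle was opened.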

The change looks trivial to me and preserves the default functionality, but I don't know if there are tests that depend on it.

edsu · 2017-08-08T13:09:54Z

I'm curious which MARC-based system you are using that rejected the record with the Unicode character entities.

It seems to me that utf-8 should be the default encoding for XML, so hardcoding ET.tostring(node, encoding='utf-8') should be fine.

aaronhelton · 2017-08-08T13:27:05Z

We're using Invenio, but I don't really know what the internals are doing, since I don't have access to the source code our vendor is maintaining.

It's not that the system rejected the unicode characters. It's that the output file ended up with a MIME encoding of us-ascii (as reported by the file --mime-encoding command), and the Invenio batch uploader module rejected it as not being encoded properly.

Agreed that utf-8 should be the default for XML encoding, which is why I find the documented signature of ET.tostring so strange:

xml.etree.ElementTree.tostring(element, encoding="us-ascii", method="xml", *, short_empty_elements=True)

Generates a string representation of an XML element, including all subelements. element is an Element instance. encoding is the output encoding (default is US-ASCII). Use encoding="unicode" to generate a Unicode string (otherwise, a bytestring is generated). method is either "xml", "html" or "text" (default is "xml"). short_empty_elements has the same meaning as in ElementTree.write(). Returns an (optionally) encoded string containing the XML data.
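The difference between encoding="unicode" and a byte encoding in the description above can be sketched as follows, assuming Python 3 and a hand-built element used purely for illustration:

```python
import xml.etree.ElementTree as ET

# Stand-in element (hypothetical data).
node = ET.Element("subfield", {"code": "b"})
node.text = "règlements"

# encoding="unicode" yields a str, suitable for a text-mode file.
as_text = ET.tostring(node, encoding="unicode")

# Any real encoding name yields bytes, suitable for a binary-mode file.
as_bytes = ET.tostring(node, encoding="utf-8")

print(type(as_text).__name__, type(as_bytes).__name__)
```

In Python 3 this means the return type of a function that forwards the encoding argument depends on the value passed, which callers need to account for when choosing between text and binary file modes.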

edsu · 2017-08-08T14:24:23Z

If you have time to put together a pull request for the change and an accompanying test I would be grateful.

aaronhelton · 2017-08-10T20:03:43Z

This might cover it, but I admit I am still new to writing tests and may have taken the wrong approach.

https://github.com/dag-hammarskjold-library/pymarc/tree/marcxml-encode
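For readers following along, a test for this kind of change might look roughly like the sketch below. It is not the test from the linked branch: `to_xml` is a hypothetical stand-in that mirrors the proposed signature so the example runs without pymarc installed, and Python 3 is assumed.

```python
import unittest
import xml.etree.ElementTree as ET


def to_xml(node, encoding="us-ascii"):
    """Hypothetical stand-in mirroring the proposed record_to_xml signature."""
    return ET.tostring(node, encoding=encoding)


def make_node():
    # Small element with a diacritic, standing in for a MARC subfield.
    node = ET.Element("subfield", {"code": "a"})
    node.text = "décembre"
    return node


class EncodingOptionTest(unittest.TestCase):
    def test_default_keeps_ascii_output(self):
        # The default must stay backward compatible: ASCII bytes only.
        xml = to_xml(make_node())
        self.assertIn(b"d&#233;cembre", xml)

    def test_utf8_encoding_is_passed_through(self):
        # With encoding='utf-8' the diacritic appears as UTF-8 bytes.
        xml = to_xml(make_node(), encoding="utf-8")
        self.assertIn("décembre".encode("utf-8"), xml)


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EncodingOptionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Covering both the default and the explicit utf-8 case checks that the new parameter works and that existing callers see unchanged output.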
