XML extractor: `external_file` only works when `secondary_tag` is also used #17

lukavdplas · 2024-04-25T16:50:15Z

In I-analyzer, the `external_file` option of the XML extractor is only ever used in a construction like this:

https://github.com/UUDigitalHumanitieslab/I-analyzer/blob/932bbc4fe33caa8b46754f18e4ad3a7caebcf4b8/backend/corpora/periodicals/periodicals.py#L166-L175

Here, the extractor also takes a `secondary_tag` argument, where `'match'` specifies the name of another field in the reader. We have no instances of using `external_file` on an XML extractor without a secondary tag. That would be something like:

`XML('foo', external_file={'xml_tag_toplevel': 'bar', 'xml_tag_entry': 'baz'})`

Since this is never used, it went unnoticed that such an extractor would raise a TypeError if you tried to use it. This is how the XMLReader implements external files:

https://github.com/UUDigitalHumanitieslab/ianalyzer-readers/blob/0372a6ea4a9b91a0666c1d113839fb5a26ce02a7/ianalyzer_readers/readers/xml.py#L185-L197

In human terms, it looks for the right tag as follows:

- If a specification for a `secondary_tag` is provided, search for it, select its parent, and provide that as the "entry tag" to the XML extractor for the field.
- Otherwise, provide the specification of the entry tag to the extractor (rather than the actual tag in the tree).
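
The two branches above can be sketched in a self-contained way. This is not the code from `ianalyzer_readers`: it uses the standard library's `xml.etree.ElementTree`, and `resolve_entry`, `extract`, the sample XML and the already-resolved `match_value` are all hypothetical stand-ins. The exact exception in this sketch may also differ from the `TypeError` raised by the real reader; the point is only that the extractor receives a specification where it expects an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical external file; tag names follow the example in this issue.
EXTERNAL_XML = """<bar>
  <baz><id>a1</id><foo>first</foo></baz>
  <baz><id>a2</id><foo>second</foo></baz>
</bar>"""


def resolve_entry(root, entry_spec, secondary_tag=None, match_value=None):
    """Return what gets handed to the extractor as the "entry tag"."""
    if secondary_tag is not None:
        # ElementTree has no parent pointers, so build a child -> parent map.
        parents = {child: parent for parent in root.iter() for child in parent}
        for element in root.iter(secondary_tag):
            if element.text == match_value:
                return parents[element]
        return None
    # The problematic branch: the specification itself is passed on,
    # not an element from the tree.
    return entry_spec


def extract(entry, tag):
    """Toy extractor: read the text of `tag` below the entry element."""
    return entry.find(tag).text


root = ET.fromstring(EXTERNAL_XML)
spec = {'xml_tag_toplevel': 'bar', 'xml_tag_entry': 'baz'}

# With a secondary tag, the extractor receives a real element.
with_secondary = extract(resolve_entry(root, spec, 'id', 'a2'), 'foo')

# Without one, it receives the dict and fails.
try:
    extract(resolve_entry(root, spec), 'foo')
    failed = False
except (TypeError, AttributeError):
    failed = True
```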

The bug itself is easily fixable, but it does reflect the oddness of this construction. For example:

- The "entry_tag" echoes the language of the main reader, which iterates over "entries", but external files are not iterated, so this concept doesn't really make sense here.
- The implementation of `secondary_tag` is quite different if `external_file` is set. It's possible to match with another field, instead of the metadata, as long as that field has `external_file=False`. The behaviour when `tag` is a list or `recursive=True` is also radically different.
- If you're not using the `secondary_tag` option to match with another field, there is actually no reason to open the external file at this stage, instead of during file discovery. (Hence why it was never done.)
- The toplevel tag and entry tag for the external file are the value for the `external_file` argument, but we don't have any case where these values are not the same for the entire reader. Simply `external_file=True` would be fine (provided the toplevel tag is specified somewhere else).
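
To illustrate the point about file discovery: when no matching against another field is needed, the external file can be read once while sources are collected, and its values passed along as metadata. The sketch below is only an illustration under assumptions. The `sources` function here is a standalone, hypothetical generator, and the file names and the `baz/foo` path are made up for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def sources(directory):
    """Yield (path, metadata) pairs, reading the external file up front."""
    external = ET.parse(os.path.join(directory, 'meta.xml')).getroot()
    value = external.findtext('baz/foo')
    for name in sorted(os.listdir(directory)):
        if name != 'meta.xml' and name.endswith('.xml'):
            # Every document gets the value from the external file as metadata.
            yield os.path.join(directory, name), {'foo': value}


with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, 'meta.xml'), 'w') as f:
        f.write('<bar><baz><foo>shared value</foo></baz></bar>')
    with open(os.path.join(directory, 'doc1.xml'), 'w') as f:
        f.write('<doc>text</doc>')
    found = [
        (os.path.basename(path), metadata)
        for path, metadata in sources(directory)
    ]
```

A field could then read `foo` from the metadata rather than opening the external file during extraction.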


lukavdplas · 2024-06-26T14:54:02Z

closed by #18

lukavdplas added the bug label Apr 25, 2024

lukavdplas mentioned this issue Apr 30, 2024

Feature/redesign xml reader #18

Merged

lukavdplas linked a pull request Apr 30, 2024 that will close this issue

Feature/redesign xml reader #18

Merged

lukavdplas closed this as completed Jun 26, 2024
