Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML extractor: external_file only works when secondary_tag is also used #17

Closed
lukavdplas opened this issue Apr 25, 2024 · 1 comment · Fixed by #18
Closed

XML extractor: external_file only works when secondary_tag is also used #17

lukavdplas opened this issue Apr 25, 2024 · 1 comment · Fixed by #18
Labels
bug Something isn't working

Comments

@lukavdplas
Copy link
Contributor

lukavdplas commented Apr 25, 2024

The only way in which the external_file option for the XML extractor is used in I-analyzer is in something like this:

https://github.com/UUDigitalHumanitieslab/I-analyzer/blob/932bbc4fe33caa8b46754f18e4ad3a7caebcf4b8/backend/corpora/periodicals/periodicals.py#L166-L175

Here, the extractor also contains a secondary_tag argument, where 'match' specifies the name of another field in the reader. We have no instances of using XML without a secondary tag. That would be something like:

XML('foo', external_file={'xml_tag_toplevel': 'bar', 'xml_tag_entry': 'baz'})

Since this is never used, it went unnoticed that such an extractor would raise a TypeError if you tried to use it. This is how the XMLReader implements external files:

https://github.com/UUDigitalHumanitieslab/ianalyzer-readers/blob/0372a6ea4a9b91a0666c1d113839fb5a26ce02a7/ianalyzer_readers/readers/xml.py#L185-L197

In human terms: the way it looks for the right tag is:

  • If a specification for a secondary_tag is provided, search for it, select its parent, and provide that as the "entry tag" to the XML extractor for the field.
  • Otherwise, provide the specification of the entry tag to the extractor (rather than the actual tag in the tree).

The bug itself is easily fixable, but it does reflect the oddness of this construction. For example:

  • The "entry_tag" echoes the language of the main reader which iterates over "entries", but external files are not iterated, so this concept doesn't really make sense here.
  • The implementation of secondary_tag is quite different if external_file is true. It's possible to match with another field, instead of the metadata, as long as that field has external_file=False. The behaviour when tag is a list or recursive=True is also radically different.
  • If you're not using the secondary_tag option to match with another field, there is actually no reason to open the external file at this stage, instead of during file discovery. (Hence why it was never done.)
  • The toplevel tag and entry tag for the external file are the value for the external_file argument, but we don't have any case where these values are not the same for the entire reader. Simply external_file=True would be fine (provided the toplevel tag is specified somewhere else).
@lukavdplas lukavdplas added the bug Something isn't working label Apr 25, 2024
@lukavdplas lukavdplas linked a pull request Apr 30, 2024 that will close this issue
@lukavdplas
Copy link
Contributor Author

closed by #18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant