Feature/redesign xml reader #18

lukavdplas · 2024-04-30T20:27:56Z

This is a major refactor of the XML reader. It contains breaking changes to the API for XMLReader, HTMLReader, and the XML extractor. Everything that was previously possible when extracting XML documents is still possible, but the code must be adjusted.

(CentreForDigitalHumanities/I-analyzer#1556 contains a draft for those adjustments in I-analyzer.)

The main goal is to make the interface of the reader less confusing. I focused on a few "pain points":

There were quite a few places where you would define a tag to look for, but each had its own system.
The XML extractor had a lot of options that were hard to keep track of. The interactions between these options were often unpredictable and/or buggy. This update reduces the number of arguments by making "tag chaining" much more powerful.
Some options were so hyperspecific that they were hard to understand without the specific XML datasets they were designed to work for.
Searching for tags leans more heavily on the filter syntax from BeautifulSoup, since that's a fine system already.

The core concept of searching the XML tree is to provide a chain of Tag objects. A minimal example is

XML(Tag('a')) # a child tag `<a>`

But you can get really wild here, and chain traversals in the XML tree.

XML(
	ParentTag(2), # move up two steps in the tree
	SiblingTag(role='meta'), # find a sibling with `role=meta` attribute
	lambda metadata: Tag('id', string=metadata['id']), # find an <id> tag whose contents match the metadata field 'id'
	TransformTag(find_metadata), # handle complex manoeuvre in custom function
	Tag('author'), # find child tag <author>
	Tag(re.compile(r'(first|last)Name')), # find child tag <firstName> or <lastName>
)

Creating a general system for searching for tags also means that the interface is much more powerful. Everything that was supported somewhere is now supported everywhere. For example, the code above illustrates a few things you could not do before:

select a sibling tag (secondary tag) by any means other than string content
match a tag to metadata when it's not a sibling of the target tag
use a custom transform function but don't make it the last step in the chain
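The chaining idea can be sketched without the library. The snippet below is not the implementation in ianalyzer_readers (which works on BeautifulSoup elements); it uses the standard library's ElementTree, and `descendants`, `move_up` and `run_chain` are hypothetical helpers invented for this illustration. It assumes each step maps one element to a list of elements and that the chain applies each step to every match from the previous step.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List

# A step takes one element and returns the elements it leads to.
Step = Callable[[ET.Element], Iterable[ET.Element]]


def descendants(name: str) -> Step:
    """Step that finds every descendant called `name` (recursive search)."""
    return lambda element: element.findall(f'.//{name}')


def move_up(parents: Dict[ET.Element, ET.Element], levels: int = 1) -> Step:
    """Step that moves `levels` steps up the tree via a child -> parent lookup."""
    def step(element: ET.Element) -> List[ET.Element]:
        for _ in range(levels):
            if element not in parents:
                return []  # ran out of ancestors
            element = parents[element]
        return [element]
    return step


def run_chain(start: ET.Element, *steps: Step) -> List[ET.Element]:
    """Apply each step, in order, to every element matched so far."""
    matches = [start]
    for step in steps:
        matches = [found for element in matches for found in step(element)]
    return matches


root = ET.fromstring(
    '<doc><meta><id>1</id></meta>'
    '<body><a><b>x</b></a><a><b>y</b></a></body></doc>'
)
# ElementTree has no parent pointers, so build a lookup table first.
parents = {child: parent for parent in root.iter() for child in parent}

# Two child steps: every <b> under every <a>.
texts = [el.text for el in run_chain(root, descendants('a'), descendants('b'))]

# The order of steps matters: starting from <meta>, going up first finds
# both <b> tags, while searching first finds nothing to move up from.
meta = root.find('meta')
up_then_down = [el.text for el in run_chain(meta, move_up(parents), descendants('b'))]
down_then_up = run_chain(meta, descendants('b'), move_up(parents))
```

Here `texts` and `up_then_down` both contain the contents of the two `<b>` tags, while `down_then_up` is empty, which is the kind of ordering question the explicit chain is meant to settle.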

The increased power of tag chaining is also meant to help with clarity. To illustrate, some examples from the "old" syntax:

x1 = XML(['a', 'b', 'c'], recursive=True)
# are all tags searched recursively? if not, which ones are?

x2 = XML('a', parent_level=2)
# is the traversal for parent_level done before or after selecting the <a> tag?

The many possible interactions between these options put a lot of pressure on documentation. The new version avoids this by asking the user to spell out the order of operations they have in mind. So, for example, you would now write the above code as:

# the current behaviour
x1 = XML(Tag('a', recursive=False), Tag('b', recursive=False), Tag('c'))
x2 = XML(ParentTag(2), Tag('a'))

# or alternatively:
x1_alt = XML(Tag('a'), Tag('b', recursive=False), Tag('c', recursive=False))
x2_alt = XML(Tag('a'), ParentTag(2))

These new x1 / x2 are a bit more verbose, but they are also clearer because they're more explicit about what's happening.
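To make the difference between the two readings of x1 concrete, here is a rough stand-in using the standard library's ElementTree rather than the extractor itself. The helpers `direct` and `anywhere` are hypothetical; they only mimic what `recursive=False` and the recursive default are described as doing above, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<root>'
    '<a><b><c>direct</c></b></a>'
    '<wrap><a><b><c>nested</c></b></a></wrap>'
    '</root>'
)


def direct(element, name):
    """Only direct children, comparable to recursive=False."""
    return element.findall(name)


def anywhere(element, name):
    """Any descendant, comparable to a recursive search."""
    return element.findall(f'.//{name}')


# Reading 1: only the last step is recursive.
x1 = [
    c.text
    for a in direct(root, 'a')
    for b in direct(a, 'b')
    for c in anywhere(b, 'c')
]

# Reading 2: only the first step is recursive.
x1_alt = [
    c.text
    for a in anywhere(root, 'a')
    for b in direct(a, 'b')
    for c in direct(b, 'c')
]
```

On this document the first reading misses the `<a>` inside `<wrap>` and the second finds it, so the two spellings are not interchangeable.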

Documentation and test coverage should be as complete as they were before (probably more so), but I intend to add some more documentation in a future PR before we use this in I-analyzer.

Issues closed: close #17, close #10, close #9

Breaking changes

`XMLReader`

tag_entry must now be a Tag or a callable that returns a Tag based on the document metadata.
The same applies to tag_toplevel.
If you want to use the external_file option of the XML extractor, the toplevel tag of external files should now be configured in the external_file_tag_top_level attribute rather than the extractor of each field.
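The "Tag or callable" rule can be pictured as follows. This is only a sketch: the `Tag` dataclass here is a bare stand-in and not the class from ianalyzer_readers, `resolve` is a hypothetical helper, and the metadata key `entry_tag` is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass(frozen=True)
class Tag:
    """Stand-in for the real Tag class; it only stores a tag name."""
    name: str


TagOrCallable = Union[Tag, Callable[[Dict], Tag]]


def resolve(tag: TagOrCallable, metadata: Dict) -> Tag:
    """Return the tag itself, or call it with the document metadata."""
    return tag(metadata) if callable(tag) else tag


# A fixed entry tag.
fixed_entry = Tag('entry')


# An entry tag that depends on the document metadata (hypothetical key).
def entry_from_metadata(metadata: Dict) -> Tag:
    return Tag(metadata['entry_tag'])


resolved_fixed = resolve(fixed_entry, {})
resolved_dynamic = resolve(entry_from_metadata, {'entry_tag': 'article'})
```

Either form ends up as a concrete tag once the metadata for a document is known.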

Changes to default behaviour:

If an XML extractor has external_file=True and the reader did not receive document metadata (which should supply the file), the value of the field will be None. The extractor will not be applied to the primary document.
If the provided sources are binary streams and the reader cannot find the toplevel tag, the error message will no longer look for a tag <recordID> in the tree to find a name for the source to use in the message. (The name will be None.)

`XML` extractor

tag parameter is now tags, which takes a variable number of arguments. (Instead of optionally taking a list.) Each argument must be a Tag or a callable that returns a Tag based on the document metadata. To migrate: replace strings and regular expressions with Tag objects, e.g. 'a' -> Tag('a'). Replace '.' with CurrentTag() and '..' with ParentTag().
parent_level parameter is removed. You can migrate this query using a ParentTag in the tags.
secondary_tag parameter is removed. You can migrate this query by using a SiblingTag in the tags. If you need to match a tag's content with document metadata, use a callable in the tags.
recursive parameter is removed. Searching recursively can be configured in each of the tags where applicable.
transform_soup_func is removed. You can migrate this query by using a TransformTag. To be used in TransformTag, the transformation function must return an iterable of elements.
external_file parameter is now a boolean. (The toplevel tag of the external file is now configured in the XMLReader.)

Changes to the default behaviour:

While SiblingTag is the most natural successor to secondary_tag, an element is not its own sibling, while secondary_tag did allow elements to be their own secondary tag. For the old behaviour, use a ParentTag to traverse the tree instead of a SiblingTag.
Child tags are now searched recursively by default, as this is the behaviour of soup.find_all() in BeautifulSoup. Use Tag(recursive=False) for the old behaviour.
A chain of tags, e.g. XML(Tag('a'), Tag('b')), will identify more matches than the old XML(['a', 'b']). In this example, the extractor now searches for "any b tag that is a child of any a tag", rather than "any b tag that is a child of the first a tag". For instance, in <a></a><a><b></b></a>, the old version would not have found a match, but the new version will. If you want the old behaviour, use XML(Tag('a', limit=1), Tag('b')).
When multiple=True, the extractor always returns the results as a list. There is no longer an exception when used in combination with flatten=True. To migrate extractors with multiple=True, flatten=True, add transform='\n'.join to the arguments to get the same result. There is also no longer an exception when there are no results (in that case, the extractor will return an empty list). If needed, use a transform argument to convert empty lists to None.
If an extract_soup_func is provided, that function is only called on matching tags; it will not be called with None input when there is no result. If you want to provide a value when there are no matches, use the transform argument or a Backup extractor.
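Two of the points above can be illustrated with plain Python. The first part imitates the "any a" versus "first a" matching using the standard library's ElementTree (not the extractor, and with a wrapping `<root>` added so the sample parses); the second part shows what the suggested transform functions do to a list of results. The name `none_if_empty` is made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Part 1: matching under any <a> versus only under the first <a>.
root = ET.fromstring('<root><a></a><a><b>found</b></a></root>')
all_a = root.findall('.//a')

# New behaviour: look for <b> under every matching <a>.
new_matches = [b.text for a in all_a for b in a.findall('.//b')]

# Old behaviour (comparable to limit=1): only the first <a> is considered.
old_matches = [b.text for a in all_a[:1] for b in a.findall('.//b')]

# Part 2: transforms applied to the list returned with multiple=True.
join_lines = '\n'.join  # same result as the old multiple + flatten combination


def none_if_empty(values):
    """Convert an empty result list to None, leave other lists unchanged."""
    return values or None


joined = join_lines(['first line', 'second line'])
empty_result = none_if_empty([])
kept_result = none_if_empty(['value'])
```

Functions like `join_lines` or `none_if_empty` would be passed as the extractor's transform argument.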

`FilterAttribute` extractor

This extractor is removed; filtering attributes is now supported in the XML extractor.

ianalyzer_readers/xml_tag.py

BeritJanssen

Noice!

lukavdplas added 23 commits April 25, 2024 18:52

fix typo

4181965

define XMLTag class

a97d292

add _soup_and_metadata_from_source method

445158b

implement XMLTag for tag_toplevel and tag_entry

566761c

use XMLTag in XML extractor

2ad3c03

adjust sibling tag argument

a670a3e

implement ParentTag class

efa860c

add FindParentTag class

9b45ef1

add SiblingTag class

d7f169b

clearer code for resolving tags

2fc70a0

make find_in_soup return an iterable

ad7765c

refactor _select method

c1069ee

remove sibling_tag argument to XML

2d1cdcc

replace transform_soup_func with TransformTag

1bfa2a9

use CurrentTag class instead of None for tags

173a215

use filename in logging messages

ef42f1d

clarify handling of external files

e4f00f9

update docstrings

fca8da0

unify _resolve_tag methods

8cfe5bd

renaming for clarity

b0a1084

use * for tags input in XML extractor

3519e1c

improve documentation

59d176d

more tag classes

3065d00

This was linked to issues Apr 30, 2024

XML extractor: multiple overrides parent_level #9

Closed

XML extractor: tag None overrides parent_level #10

Closed

XML extractor: external_file only works when secondary_tag is also used #17

Closed

lukavdplas added 2 commits May 1, 2024 16:21

allow no tags in XML input

33637da

fix callables for tag_entry / tag_toplevel

21b78dc

lukavdplas mentioned this pull request May 3, 2024

Update XML corpus definitions CentreForDigitalHumanities/I-analyzer#1556

Merged

lukavdplas requested a review from Meesch May 15, 2024 14:09

lukavdplas marked this pull request as ready for review May 16, 2024 14:14

BeritJanssen reviewed Jun 26, 2024

ianalyzer_readers/xml_tag.py (outdated, resolved)

remove stray sentence in docstring

64b00db

BeritJanssen approved these changes Jun 26, 2024

lukavdplas merged commit 95ecfac into feature/expand-unit-tests Jun 26, 2024
4 checks passed

lukavdplas deleted the feature/redesign-xml-reader branch June 26, 2024 14:43

This was referenced Jun 26, 2024

XML extractor: multiple overrides parent_level #9

Closed

XML extractor: tag None overrides parent_level #10

Closed

XML extractor: external_file only works when secondary_tag is also used #17

Closed

lukavdplas mentioned this pull request Oct 22, 2024

Gaps in documentation #12

Closed

lukavdplas mentioned this pull request Oct 29, 2024

Improve conditional extractors #8

Open
