Feature/redesign xml reader #18

Merged
merged 26 commits into feature/expand-unit-tests from feature/redesign-xml-reader
Jun 26, 2024

@lukavdplas lukavdplas commented Apr 30, 2024

This is a major refactor of the XML reader. It contains breaking changes to the API for XMLReader, HTMLReader, and the XML extractor. Everything that was previously possible when extracting XML documents is still possible, but the code must be adjusted.

(CentreForDigitalHumanities/I-analyzer#1556 contains a draft for those adjustments in I-analyzer.)

The main goal is to make the interface of the reader less confusing. I focused on a few "pain points":

  • There were quite a few places where you would define a tag to look for, but each had its own system.
  • The XML extractor had a lot of options that were hard to keep track of. The interactions between these options were often unpredictable and/or buggy. This update reduces the number of arguments by making "tag chaining" much more powerful.
  • Some options were so hyperspecific that they were hard to understand without the specific XML datasets they were designed to work for.
  • Searching for tags leans more heavily on the filter syntax from BeautifulSoup, since that's a fine system already.

The core concept of searching the XML tree is to provide a chain of Tag objects. A minimal example is

XML(Tag('a')) # a child tag `<a>`

But you can get really wild here, and chain traversals in the XML tree.

XML(
	ParentTag(2), # move up two steps in the tree
	SiblingTag(role='meta'), # find a sibling with `role=meta` attribute
	lambda metadata: Tag('id', string=metadata['id']), # find an <id> tag whose contents match the metadata field 'id'
	TransformTag(find_metadata), # handle complex manoeuvre in custom function
	Tag('author'), # find child tag <author>
	Tag(re.compile(r'(first|last)Name')), # find child tag <firstName> or <lastName>
)

Creating a general system for searching for tags also means that the interface is much more powerful. Everything that was supported somewhere is now supported everywhere. For example, the code above illustrates a few things you could not do before:

  • select a sibling tag (secondary tag) by any means other than string content
  • match a tag to metadata when it's not a sibling of the target tag
  • use a custom transform function but don't make it the last step in the chain

The increased power of tag chaining is also meant to help with clarity. To illustrate, some examples from the "old" syntax:

x1 = XML(['a', 'b', 'c'], recursive=True)
# are all tags searched recursively? if not, which ones are?

x2 = XML('a', parent_level=2)
# is the traversal for parent_level done before or after selecting the <a> tag?

The many possible interactions between these options put a lot of pressure on the documentation. The new version avoids this by asking the user to state the order of operations they have in mind. So, for example, you would now write the above code as:

# the current behaviour
x1 = XML(Tag('a', recursive=False), Tag('b', recursive=False), Tag('c'))
x2 = XML(ParentTag(2), Tag('a'))

# or alternatively:
x1_alt = XML(Tag('a'), Tag('b', recursive=False), Tag('c', recursive=False))
x2_alt = XML(Tag('a'), ParentTag(2))

These new x1 / x2 are a bit more verbose, but they are also clearer because they're more explicit about what's happening.
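The order really does change the result. The sketch below shows this with the standard library's `xml.etree.ElementTree` rather than the reader itself; the document, the `parents` lookup and the `up` helper are assumptions made for the illustration. Starting from an `<entry>` element, "move up two levels, then find `<a>`" and "find `<a>`, then move up two levels" select different elements.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<root>'
    '<group>'
    '<entry><a id="1"/></entry>'
    '<a id="2"/>'
    '</group>'
    '<a id="3"/>'
    '</root>'
)

# ElementTree elements have no parent pointer, so build a child -> parent lookup
parents = {child: parent for parent in root.iter() for child in parent}

def up(element, levels):
    """Hypothetical helper: move up the given number of levels in the tree."""
    for _ in range(levels):
        element = parents[element]
    return element

entry = root.find('./group/entry')

# comparable to XML(ParentTag(2), Tag('a')): go up first, then search
up_then_find = [a.get('id') for a in up(entry, 2).iter('a')]

# comparable to XML(Tag('a'), ParentTag(2)): search first, then go up
find_then_up = [up(a, 2).tag for a in entry.iter('a')]

print(up_then_find)  # ids of every <a> below <root>
print(find_then_up)  # the element two levels above the <a> inside <entry>
```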

Documentation and test coverage should be as complete as they were before (probably more so), but I intend to add some more documentation in a future PR before we use this in I-analyzer.

Issues closed: close #17 , close #10 , close #9

Breaking changes

XMLReader

  • tag_entry must now be a Tag or a callable that returns a Tag based on the document metadata.
  • The same applies to tag_toplevel.
  • If you want to use the external_file option of the XML extractor, the toplevel tag of external files should now be configured in the external_file_tag_top_level attribute rather than the extractor of each field.

Changes to default behaviour:

  • If an XML extractor has external_file=True and the reader did not receive document metadata (which should supply the file), the value of the field will be None. The extractor will not be applied to the primary document.
  • If the provided sources are binary streams and the reader cannot find the toplevel tag, the error message will no longer look for a tag <recordID> in the tree to find a name for the source to use in the message. (The name will be None.)

XML extractor

  • tag parameter is now tags, which takes a variable number of arguments. (Instead of optionally taking a list.) Each argument must be a Tag or a callable that returns a Tag based on the document metadata. To migrate: replace strings and regular expressions with Tag objects, e.g. 'a' -> Tag('a'). Replace '.' with CurrentTag() and '..' with ParentTag().
  • parent_level parameter is removed. You can migrate this query using a ParentTag in the tags.
  • secondary_tag parameter is removed. You can migrate this query by using a SiblingTag in the tags. If you need to match a tag's content with document metadata, use a callable in the tags.
  • recursive parameter is removed. Searching recursively can be configured in each of the tags where applicable.
  • transform_soup_func is removed. You can migrate this query by using a TransformTag. To be used in TransformTag, the transformation function must return an iterable of elements.
  • external_file parameter is now a boolean. (The toplevel tag of the external file is now configured in the XMLReader.)

Changes to the default behaviour:

  • While SiblingTag is the most natural successor to secondary_tag, an element is not its own sibling, while secondary_tag did allow elements to be their own secondary tag. For the old behaviour, use a ParentTag to traverse the tree instead of a SiblingTag.
  • Child tags are now searched recursively by default, as this is the behaviour of soup.find_all() in BeautifulSoup. Use Tag(recursive=False) for the old behaviour.
  • A chain of tags, e.g. XML(Tag('a'), Tag('b')) will identify more matches than the old XML(['a', 'b']). In this example, the extractor now searches for "any b tag that is a child of any a tag", rather than "any b tag that is a child of the first a tag". For instance, in <a></a><a><b></b></a>, the old version would not have found a match, but the new version will. If you want the old behaviour, use XML(Tag('a', limit=1), Tag('b'))
  • When multiple=True, the extractor always returns the results as a list. There is no longer an exception when used in combination with flatten=True. To migrate extractors with multiple=True, flatten=True, add transform='\n'.join to the arguments to get the same result. There is also no longer an exception when there are no results (in that case, the extractor will return an empty list). If needed, use a transform argument to convert empty lists to None.
  • If an extract_soup_func is provided, that function is only called on matching tags; it will not be called with None input when there is no result. If you want to provide a value when there are no matches, use the transform argument or a Backup extractor.
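Two of these changes can be illustrated without the library. The sketch below uses the standard library's `xml.etree.ElementTree` (not the BeautifulSoup-based reader) to reproduce the `<a></a><a><b></b></a>` example, wrapped in a `<root>` element so it parses as one document. `join_or_none` is a hypothetical helper showing one way a transform could combine the `'\n'.join` migration with converting empty lists to None; combining the two is an assumption, not something this PR prescribes.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><a></a><a><b>text</b></a></root>')

all_a = root.findall('a')
first_a_only = all_a[:1]  # comparable to Tag('a', limit=1)

# old behaviour: only look for <b> inside the first <a>
b_in_first_a = [b for a in first_a_only for b in a.iter('b')]

# new behaviour: look for <b> inside any <a>
b_in_any_a = [b for a in all_a for b in a.iter('b')]

print(len(b_in_first_a), len(b_in_any_a))

def join_or_none(values):
    """Hypothetical transform: join results with newlines, or return None for an empty list."""
    return '\n'.join(values) if values else None

joined = join_or_none(['first line', 'second line'])
empty = join_or_none([])
```

A function like `join_or_none` would be passed as the extractor's transform argument in place of the plain `'\n'.join`.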

FilterAttribute extractor

This extractor is removed; filtering attributes is now supported in the XML extractor.

@lukavdplas lukavdplas marked this pull request as ready for review May 16, 2024 14:14

@BeritJanssen BeritJanssen left a comment

Noice!

@lukavdplas lukavdplas merged commit 95ecfac into feature/expand-unit-tests Jun 26, 2024
4 checks passed
@lukavdplas lukavdplas deleted the feature/redesign-xml-reader branch June 26, 2024 14:43
@lukavdplas lukavdplas mentioned this pull request Oct 22, 2024