-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/redesign xml reader #18
Merged
lukavdplas
merged 26 commits into
feature/expand-unit-tests
from
feature/redesign-xml-reader
Jun 26, 2024
Merged
Feature/redesign xml reader #18
lukavdplas
merged 26 commits into
feature/expand-unit-tests
from
feature/redesign-xml-reader
Jun 26, 2024
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This was
linked to
issues
Apr 30, 2024
BeritJanssen
approved these changes
Jun 26, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noice!
This was referenced Jun 26, 2024
Closed
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is a major refactor of the XML reader. It contains breaking changes to the API for
XMLReader
,HTMLReader
, and theXML
extractor. Everything that was previously possible when extracting XML documents is still possible, but the code must be adjusted.(CentreForDigitalHumanities/I-analyzer#1556 contains a draft for those adjustments in I-analyzer.)
The main goal is to make the interface of the reader less confusing. I focused on a few "pain points":
XML
extractor had a lot of options that were difficult to oversee. The interactions between these properties was often unpredictable and/or buggy. This update reduces the number of arguments by making "tag chaining" much more powerful.The core concept of searching the XML tree is to provide a chain of
Tag
objects. A minimal example isBut you can get really wild here, and chain traversals in the XML tree.
Creating a general system for searching for tags also means that the interface is much more powerful. Everything that was supported somewhere is now supported everywhere. For example, this code above illustrates a few things you could not do before:
The increased power of tag chaining is also meant to help with clarity. To illustrate, some examples from the "old" syntax:
The many possibilities for intersection put a lot of pressure on documentation. The new version avoids this by asking the user to input what order of operations they have in mind. So, for example, you would now write the above code as:
These new
x1
/x2
are a bit more verbose, but they are also clearer because they're more explicit about what's happening.Documentation and test coverage should be as complete as it was before (probably more so), but I intend to add some more documentation in a future PR before we use this in I-analyzer.
Issues closed: close #17 , close #10 , close #9
Breaking changes
XMLReader
tag_entry
must now be aTag
or a callable that returns aTag
based on the document metadata.tag_toplevel
external_file
option of theXML
extractor, the toplevel tag of external files should now be configured in theexternal_file_tag_top_level
attribute rather than the extractor of each field.Changes to default behaviour:
XML
extractor hasexternal_file=True
and the reader did not receive document metadata (which should supply the file), the value of the field will beNone
. The extractor will not be applied to the primary document.<recordID>
in the tree to find a name for the source to use in the message. (The name will beNone
.)XML
extractortag
parameter is nowtags
, which takes a variable number of arguments. (Instead of optionally taking a list.) Each argument must be aTag
or a callable that returns aTag
based on the document metadata. To migrate: replace strings and regular expressions withTag
objects, e.g.'a'
->Tag('a')
. Replace'.'
withCurrentTag()
and'..'
withParentTag()
.parent_level
parameter is removed. You can migrate this query using aParentTag
in thetags
.secondary_tag
parameter is removed. You can migrate this query by using aSiblingTag
in thetags
. If you need to match a tag's content with document metadata, use a callable in thetags
.recursive
parameter is removed. Searching recursively can be configured in each of thetags
where applicable.transform_soup_func
is removed. You can migrate this query by using aTransformTag
. To be used inTransformTag
, the transformation function must return an iterable of elements.external_file
parameter is now a boolean. (The toplevel tag of the external file is now configured in theXMLReader
.)Changes to the default behaviour:
SiblingTag
is the most natural successor tosecondary_tag
, an element is not its own sibling, whilesecondary_tag
did allow elements to be their own secondary tag. For the old behaviour, use aParentTag
to traverse the tree instead of aSiblingTag
.soup.find_all()
in BeautifulSoup. UseTag(recursive=False)
for the old behaviour.XML(Tag('a'), Tag('b'))
will identify more matches than the oldXML(['a', 'b'])
. In this example, the extractor now searches for "anyb
tag that is a child of anya
tag", rather than "anyb
tag that is a child of the firsta
tag". For instance, in<a></a><a><b></b></a>
, the old version would not have found a match, but the new version will. If you want the old behaviour, useXML(Tag('a', limit=1), Tag('b'))
multiple=True
, the extractor always returns the results as a list. There is no longer an exception when used in combination withflatten=True
. To migrate migrate extractors withmultiple=True, flatten=True
, addtransform='\n'.join
to the arguments to get the same result. There is also no longer an exception when there are no results (in that case, the extractor will return an empty list). If needed, use atransform
argument to convert empty lists toNone
.extract_soup_func
is provided, that function is only called on matching tags; it will not be called withNone
input when there is no result. If you want to provide a value when there are no matches, use thetransform
argument or aBackup
extractor.FilterAttribute
extractorThis extractor is removed; filtering attributes is now supported in the
XML
extractor.