Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix blocks not starting on new lines (#411) #416

Merged
merged 2 commits into from
Nov 4, 2023

Conversation

zepinglee
Copy link
Contributor

@zepinglee zepinglee commented Nov 4, 2023

This PR modifies the @ marker regex so that it doesn't necessarily appear at a new line. It also passes all the tests.

Copy link
Collaborator

@MiWeiss MiWeiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. Thanks a lot for opening the PR.

I left one minor comment regarding the test above.

A more significant problem is on line 175. Currently, we discover if a new block starts with a previous one still open, which typically highlights an error in the parsed file. In such a case, we abort processing the still open block, in deliberate deviation from "strict bibtex parsing". As we only considered @.... on the start of a new line, the risk of flagging a "false positive" was negligible, but at the same time the user experience increased drastically.

With your change, we would abort the parsing of a block when facing an unexpected @.... anywhere, e.g, the following could not be parsed:

@entry{key, title="all blocks must start with @entry{"} 

To reduce this risk, I would love to keep the previous behavior and only abort the parsing of a block if a new block starts on a new line. Does that make sense?

Note:
Another problem, but one which does not have to be fixed in this PR is that biber-style comment won't be working after fixing this PR anymore, e.g. the following will likely result in an error.,

% @entry{key, 
% title="sometitle"} 

However, we technically still don't support biber comments and a clean solution is required there anyways, thus I'd leave this to #372

tests/splitter_tests/test_splitter_basic.py Outdated Show resolved Hide resolved
@zepinglee
Copy link
Contributor Author

A more significant problem is on line 175. Currently, we discover if a new block starts with a previous one still open, which typically highlights an error in the parsed file. In such a case, we abort processing the still open block, in deliberate deviation from "strict bibtex parsing". As we only considered @.... on the start of a new line, the risk of flagging a "false positive" was negligible, but at the same time the user experience increased drastically.

I don't think the _move_to_comma_or_closing_curly_bracket() method where line 175 is located matches the original BibTeX's behavior. BibTeX only accepts strictly brace-balanced string in a literal field token (in terms of the bibtex.web) even when the outer delimiters are quotes. The braces or quotation marks in a field token are not treated as escaped characters. In addition, an @ character doesn't break the field.

def _move_to_comma_or_closing_curly_bracket(
self, currently_quote_escaped=False, num_open_curls=0
) -> int:
"""Index of the end of the field, taking quote-escape into account."""
if num_open_curls > 0 and currently_quote_escaped:
raise ParserStateException(
message="Internal error in parser. "
"Found a field-value that is both quote-escaped and curly-escaped. "
"Please report this bug."
)
def _is_escaped():
return currently_quote_escaped or num_open_curls > 0
# iterate over marks until we find end of field
while True:
next_mark = self._next_mark(accept_eof=False)
# Handle "escape" characters
if next_mark.group(0) == '"' and not num_open_curls > 0:
currently_quote_escaped = not currently_quote_escaped
continue
elif next_mark.group(0) == "{" and not currently_quote_escaped:
num_open_curls += 1
continue
elif (
next_mark.group(0) == "}"
and not currently_quote_escaped
and num_open_curls > 0
):
num_open_curls -= 1
continue
# Check for end of field
elif next_mark.group(0) == "," and not _is_escaped():
self._unaccepted_mark = next_mark
return next_mark.start()
# Check for end of entry:
elif next_mark.group(0) == "}" and not _is_escaped():
self._unaccepted_mark = next_mark
return next_mark.start()
# Sanity-check: If new block is starting, we abort
elif next_mark.group(0).startswith("@"):
self._unaccepted_mark = next_mark
if currently_quote_escaped:
looking_for = '`"`'
elif num_open_curls > 0:
looking_for = "`}`"
else:
looking_for = "`,` or `}`"
raise BlockAbortedException(
abort_reason=f"Unexpected block start: `{next_mark.group(0)}`. "
f"Was still looking for field-value closing {looking_for} ",
end_index=next_mark.start() - 1,
)

In the example of @entry{key, title="all blocks must start with @entry{"}, the second quotation mark is in brace level 1 and it doesn't terminate the field token. The final right brace make the string balanced and BibTeX look for another quotation mark to end the field. This is verified by running BibTeX and the raised error Illegal end of database file. Yet @entry{key, title="all blocks must start with @entry{"}"} works fine with BibTeX and the title field is all blocks must start with @entry{"}.

I'm planning to rewrite this method and make another PR.

@MiWeiss
Copy link
Collaborator

MiWeiss commented Nov 4, 2023

Thanks for your PR update and for your long comment.

First things first:

Your argument convinced me that a plain @entry{ would be very rare, and I thus see no reason not to merge this PR anymore. We may change on this feature in the future, based on received feedback, but for now I guess merging this PR is a clear improvement of things!

Regarding the larger argument you made:

This library is deliberately not 100% consistent with bibtex, but it attempts to be somewhat more fault-tolerant: We aim to be able to handle cases when there are small mistakes in the bibtex file gently, without messing up the parsing of subsequent blocks, or even prevent parsing at all. This is a request we received numerous times reagarding bibtexparser v1. Take the following example:

@string{myKey = {myValue))

@string{anotherString = {anotherValue}}

While the intention of the writer here is pretty clear, bibtex would not be happy: Note the )) on the first line which should be }}. Technically, the braces are not balanced after the first line, thus not allowing for a new string on the third line. Previously, in bibtexparser v2, we thus aborted the parsing of a block if a line started with a "block start", e.g. a @entry{. This allowed a "fresh" start with that block. After merging this PR, we now do this even if the "block start" appears anywhere, not just on a new line.

Thus, I am very happy to review any improvement you make to the parsing (and I am sure there are plenty to be made ;-)), but please keep in mind that we take a deliberate trade-off between sticking to the pure bibtex rules, and user friendly parsing of imperfect files. If in doubt, before investing too much time, feel free to ask for mine (and anyones) comment in a seperate issue before investing too much time.

Most importantly

Thanks again for your contribution, and I am looking forward to your next PRs.

@MiWeiss MiWeiss merged commit 92d38d8 into sciunto-org:main Nov 4, 2023
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants