Fix blocks not starting on new lines (#411) #416

zepinglee · 2023-11-04T02:29:20Z

This PR modifies the @ marker regex so that it doesn't necessarily appear at a new line. It also passes all the tests.

MiWeiss

Hi. Thanks a lot for opening the PR.

I left one minor comment regarding the test above.

A more significant problem is on line 175. Currently, we discover if a new block starts with a previous one still open, which typically highlights an error in the parsed file. In such a case, we abort processing the still open block, in deliberate deviation from "strict bibtex parsing". As we only considered @.... on the start of a new line, the risk of flagging a "false positive" was negligible, but at the same time the user experience increased drastically.

With your change, we would abort the parsing of a block when facing an unexpected @.... anywhere, e.g, the following could not be parsed:

@entry{key, title="all blocks must start with @entry{"}

To reduce this risk, I would love to keep the previous behavior and only abort the parsing of a block if a new block starts on a new line. Does that make sense?

Note:
Another problem, but one which does not have to be fixed in this PR is that biber-style comment won't be working after fixing this PR anymore, e.g. the following will likely result in an error.,

% @entry{key, 
% title="sometitle"}

However, we technically still don't support biber comments and a clean solution is required there anyways, thus I'd leave this to #372

tests/splitter_tests/test_splitter_basic.py

zepinglee · 2023-11-04T15:37:06Z

A more significant problem is on line 175. Currently, we discover if a new block starts with a previous one still open, which typically highlights an error in the parsed file. In such a case, we abort processing the still open block, in deliberate deviation from "strict bibtex parsing". As we only considered @.... on the start of a new line, the risk of flagging a "false positive" was negligible, but at the same time the user experience increased drastically.

I don't think the _move_to_comma_or_closing_curly_bracket() method where line 175 is located matches the original BibTeX's behavior. BibTeX only accepts strictly brace-balanced string in a literal field token (in terms of the bibtex.web) even when the outer delimiters are quotes. The braces or quotation marks in a field token are not treated as escaped characters. In addition, an @ character doesn't break the field.

python-bibtexparser/bibtexparser/splitter.py

Lines 131 to 189 in 770eb00

    
           def _move_to_comma_or_closing_curly_bracket( 
        
               self, currently_quote_escaped=False, num_open_curls=0 
        
           ) -> int: 
        
               """Index of the end of the field, taking quote-escape into account.""" 
        
               if num_open_curls > 0 and currently_quote_escaped: 
        
                   raise ParserStateException( 
        
                       message="Internal error in parser. " 
        
                       "Found a field-value that is both quote-escaped and curly-escaped. " 
        
                       "Please report this bug." 
        
                   ) 
        
               def _is_escaped(): 
        
                   return currently_quote_escaped or num_open_curls > 0 
        
               # iterate over marks until we find end of field 
        
               while True: 
        
                   next_mark = self._next_mark(accept_eof=False) 
        
                   # Handle "escape" characters 
        
                   if next_mark.group(0) == '"' and not num_open_curls > 0: 
        
                       currently_quote_escaped = not currently_quote_escaped 
        
                       continue 
        
                   elif next_mark.group(0) == "{" and not currently_quote_escaped: 
        
                       num_open_curls += 1 
        
                       continue 
        
                   elif ( 
        
                       next_mark.group(0) == "}" 
        
                       and not currently_quote_escaped 
        
                       and num_open_curls > 0 
        
                   ): 
        
                       num_open_curls -= 1 
        
                       continue 
        
                   # Check for end of field 
        
                   elif next_mark.group(0) == "," and not _is_escaped(): 
        
                       self._unaccepted_mark = next_mark 
        
                       return next_mark.start() 
        
                   # Check for end of entry: 
        
                   elif next_mark.group(0) == "}" and not _is_escaped(): 
        
                       self._unaccepted_mark = next_mark 
        
                       return next_mark.start() 
        
                   # Sanity-check: If new block is starting, we abort 
        
                   elif next_mark.group(0).startswith("@"): 
        
                       self._unaccepted_mark = next_mark 
        
                       if currently_quote_escaped: 
        
                           looking_for = '`"`' 
        
                       elif num_open_curls > 0: 
        
                           looking_for = "`}`" 
        
                       else: 
        
                           looking_for = "`,` or `}`" 
        
                       raise BlockAbortedException( 
        
                           abort_reason=f"Unexpected block start: `{next_mark.group(0)}`. " 
        
                           f"Was still looking for field-value closing {looking_for} ", 
        
                           end_index=next_mark.start() - 1, 
        
                       )

In the example of @entry{key, title="all blocks must start with @entry{"}, the second quotation mark is in brace level 1 and it doesn't terminate the field token. The final right brace make the string balanced and BibTeX look for another quotation mark to end the field. This is verified by running BibTeX and the raised error Illegal end of database file. Yet @entry{key, title="all blocks must start with @entry{"}"} works fine with BibTeX and the title field is all blocks must start with @entry{"}.

I'm planning to rewrite this method and make another PR.

MiWeiss · 2023-11-04T20:10:05Z

Thanks for your PR update and for your long comment.

First things first:

Your argument convinced me that a plain @entry{ would be very rare, and I thus see no reason not to merge this PR anymore. We may change on this feature in the future, based on received feedback, but for now I guess merging this PR is a clear improvement of things!

Regarding the larger argument you made:

This library is deliberately not 100% consistent with bibtex, but it attempts to be somewhat more fault-tolerant: We aim to be able to handle cases when there are small mistakes in the bibtex file gently, without messing up the parsing of subsequent blocks, or even prevent parsing at all. This is a request we received numerous times reagarding bibtexparser v1. Take the following example:

@string{myKey = {myValue))

@string{anotherString = {anotherValue}}

While the intention of the writer here is pretty clear, bibtex would not be happy: Note the )) on the first line which should be }}. Technically, the braces are not balanced after the first line, thus not allowing for a new string on the third line. Previously, in bibtexparser v2, we thus aborted the parsing of a block if a line started with a "block start", e.g. a @entry{. This allowed a "fresh" start with that block. After merging this PR, we now do this even if the "block start" appears anywhere, not just on a new line.

Thus, I am very happy to review any improvement you make to the parsing (and I am sure there are plenty to be made ;-)), but please keep in mind that we take a deliberate trade-off between sticking to the pure bibtex rules, and user friendly parsing of imperfect files. If in doubt, before investing too much time, feel free to ask for mine (and anyones) comment in a seperate issue before investing too much time.

Most importantly

Thanks again for your contribution, and I am looking forward to your next PRs.

Fix blocks not starting on new lines (sciunto-org#411)

d1b8d08

MiWeiss requested changes Nov 4, 2023

View reviewed changes

tests/splitter_tests/test_splitter_basic.py Outdated Show resolved Hide resolved

This was referenced Nov 4, 2023

handling of new blocks that aren't on a new line #411

Closed

Update splitter.py #412

Closed

Add assertions of start_line and raw value

d9a3245

zepinglee force-pushed the issue#411 branch from e15129b to d9a3245 Compare November 4, 2023 16:25

MiWeiss approved these changes Nov 4, 2023

View reviewed changes

MiWeiss merged commit 92d38d8 into sciunto-org:main Nov 4, 2023
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix blocks not starting on new lines (#411) #416

Fix blocks not starting on new lines (#411) #416

zepinglee commented Nov 4, 2023 •

edited

Loading

MiWeiss left a comment •

edited

Loading

zepinglee commented Nov 4, 2023

MiWeiss commented Nov 4, 2023

Fix blocks not starting on new lines (#411) #416

Fix blocks not starting on new lines (#411) #416

Conversation

zepinglee commented Nov 4, 2023 • edited Loading

MiWeiss left a comment • edited Loading

Choose a reason for hiding this comment

zepinglee commented Nov 4, 2023

MiWeiss commented Nov 4, 2023

First things first:

Regarding the larger argument you made:

Most importantly

zepinglee commented Nov 4, 2023 •

edited

Loading

MiWeiss left a comment •

edited

Loading