
untokenize() does not round-trip for code containing line breaks (\ + \n) #125553

Open
tomasr8 opened this issue Oct 15, 2024 · 0 comments
Labels
topic-parser type-bug An unexpected behavior, bug, or error

Comments

@tomasr8
Member

tomasr8 commented Oct 15, 2024

Bug report

Bug description:

Code that contains explicit line continuations (\ followed by a newline) is not round-trip invariant:

import tokenize, io

source_code = r"""
1 + \
    2
"""

tokens = list(tokenize.generate_tokens(io.StringIO(source_code).readline))
x = tokenize.untokenize(tokens)
print(x)
# 1 +\
#     2

Notice that the space between + and \ is now missing. The current untokenizer code simply inserts a backslash when it encounters two consecutive tokens with differing rows:

cpython/Lib/tokenize.py

Lines 179 to 182 in 9c2bb7d

row_offset = row - self.prev_row
if row_offset:
    self.tokens.append("\\\n" * row_offset)
    self.prev_col = 0
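The whitespace before the backslash is not recoverable from the token stream alone: the \ + newline pair is absorbed during tokenization and never appears as a token, only in each token's line attribute. A small illustration:

```python
import io
import tokenize

source = "1 + \\\n    2\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # The "\" never shows up in tok.string; only tok.line (the raw
    # physical line) still contains it, along with the space before it.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

So add_whitespace, which only looks at row/column offsets between tokens, has nothing to copy the original spacing from.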

I think this should be fixed. The docstring of tokenize.untokenize says:

Round-trip invariant for full input:
Untokenized source will match input source exactly

To fix this, it will probably be necessary to inspect the raw line contents and count how much whitespace there is at the end of the line.
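One possible sketch of that approach: since each token carries its physical line in the line attribute, the untokenizer could recover the whitespace that precedes the trailing backslash and emit it before the "\\\n" it inserts. The helper below is hypothetical, not existing tokenize API:

```python
def continuation_prefix(physical_line):
    """Return the whitespace preceding a trailing "\\" line
    continuation, or "" if the line has no continuation."""
    stripped = physical_line.rstrip("\n")
    if not stripped.endswith("\\"):
        return ""
    body = stripped[:-1]  # drop the backslash itself
    # Whatever rstrip removes is the whitespace run before the "\".
    return body[len(body.rstrip(" \t")):]

print(repr(continuation_prefix("1 + \\\n")))  # ' '
```

add_whitespace could then append continuation_prefix(...) + "\\\n" instead of a bare "\\\n", restoring the original spacing.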

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

@tomasr8 tomasr8 added type-bug An unexpected behavior, bug, or error topic-parser labels Oct 15, 2024