
untokenize() does not round-trip for code containing line breaks (\ + \n) #125553

Open
tomasr8 opened this issue Oct 15, 2024 · 0 comments
Labels
topic-parser type-bug An unexpected behavior, bug, or error

Comments

@tomasr8
Member

tomasr8 commented Oct 15, 2024

Bug report

Bug description:

Code that contains explicit line continuations (\ followed by a newline) is not round-trip invariant:

import tokenize, io

source_code = r"""
1 + \
    2
"""

tokens = list(tokenize.generate_tokens(io.StringIO(source_code).readline))
x = tokenize.untokenize(tokens)
print(x)
# 1 +\
#     2

Notice that the space between + and \ is now missing. The current untokenizer code simply inserts a backslash when it encounters two consecutive tokens with differing rows:

cpython/Lib/tokenize.py

Lines 179 to 182 in 9c2bb7d

row_offset = row - self.prev_row
if row_offset:
    self.tokens.append("\\\n" * row_offset)
    self.prev_col = 0
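The whitespace before the backslash is not recoverable from the token stream alone: the \ + newline pair is absorbed during tokenization and never appears as a token, only in each token's line attribute. A small illustration:

```python
import io
import tokenize

source = "1 + \\\n    2\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # The "\" never shows up in tok.string; only tok.line (the raw
    # physical line) still contains it, along with the space before it.
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```

So add_whitespace, which only looks at row/column offsets between tokens, has nothing to copy the original spacing from.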

I think this should be fixed. The docstring of tokenize.untokenize says:

Round-trip invariant for full input:
Untokenized source will match input source exactly

To fix this, it will probably be necessary to inspect the raw line contents and count how much whitespace there is at the end of the line.
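One possible sketch of that approach: since each token carries its physical line in the line attribute, the untokenizer could recover the whitespace that precedes the trailing backslash and emit it before the "\\\n" it inserts. The helper below is hypothetical, not existing tokenize API:

```python
def continuation_prefix(physical_line):
    """Return the whitespace preceding a trailing "\\" line
    continuation, or "" if the line has no continuation."""
    stripped = physical_line.rstrip("\n")
    if not stripped.endswith("\\"):
        return ""
    body = stripped[:-1]  # drop the backslash itself
    # Whatever rstrip removes is the whitespace run before the "\".
    return body[len(body.rstrip(" \t")):]

print(repr(continuation_prefix("1 + \\\n")))  # ' '
```

add_whitespace could then append continuation_prefix(...) + "\\\n" instead of a bare "\\\n", restoring the original spacing.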

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

@tomasr8 tomasr8 added type-bug An unexpected behavior, bug, or error topic-parser labels Oct 15, 2024