Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

during tokenize, use UTF8 encoding on all platforms #510

Merged
merged 4 commits into from
Oct 10, 2023

Conversation

drammock
Copy link
Contributor

@drammock drammock commented Oct 9, 2023

(This is a PR-as-issue, but if I've guessed the wrong solution please feel free to close or suggest a better fix.)

An MNE-Python user who was trying to build our docs on Windows hit this error today:

Traceback (most recent call last):
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\sphinx\events.py", line 97, in emit
    results.append(listener.handler(self.app, *args))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\numpydoc\numpydoc.py", line 214, in mangle_docstrings
    report = validate(doc)
             ^^^^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\numpydoc\validate.py", line 617, in validate
    ignore_validation_comments = extract_ignore_validation_comments(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\site-packages\numpydoc\validate.py", line 145, in extract_ignore_validation_comments
    for token in tokenize.generate_tokens(file.readline):
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\tokenize.py", line 454, in _tokenize
    line = readline()
           ^^^^^^^^^^
  File "C:\Users\Carina\mambaforge\envs\mnedev\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3922: character maps to <undefined>

re-running after export PYTHONUTF8=1 resolved the issue, so I think explicitly invoking utf-8 during read should also prevent the error without requiring any user action.

Technically I think this is a backwards-incompatible change for windows users who had any non-ASCII characters in their source files if those characters are encoded differently in utf-8 than they are in the system's default codepage (which will vary with OS language settings). However, PYTHONUTF8=1 will become the effective default in 2025 with the release of Python 3.15 (so affected users will need to address this change eventually anyway).

@drammock
Copy link
Contributor Author

drammock commented Oct 9, 2023

I've now added a better test that captures the actual case we encountered. ready for review from my end.

@jarrodmillman
Copy link
Member

See #505 and #506.

Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @drammock , the test is very helpful!

@rossbar rossbar merged commit 80a8708 into numpy:main Oct 10, 2023
23 checks passed
@drammock drammock deleted the tokenize-encoding branch December 12, 2023 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants