Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEP 756: PyUnicode_Export() #3960

Merged
merged 10 commits into from
Sep 14, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/CODEOWNERS
Validating CODEOWNERS rules …
Original file line number Diff line number Diff line change
Expand Up @@ -635,6 +635,7 @@ peps/pep-0752.rst @warsaw
# ...
# peps/pep-0754.rst
# ...
peps/pep-0756.rst @vstinner
peps/pep-0789.rst @njsmith
# ...
peps/pep-0801.rst @warsaw
Expand Down
377 changes: 377 additions & 0 deletions peps/pep-0756.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,377 @@
PEP: 756
Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
Author: Victor Stinner <[email protected]>
PEP-Delegate: C API Working Group
Status: Draft
Type: Standards Track
Created: 13-Sep-2024
Python-Version: 3.14

.. highlight:: c


Abstract
========

Add functions to the limited C API version 3.14:

* ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer``
view.
* ``PyUnicode_Import()``: import a Python str object.

In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
copy is needed. See the :ref:`specification <export-complexity>` for
cases when a copy is needed.


Rationale
=========

PEP 393
-------

:pep:`393` "Flexible String Representation" changed string internals in
Python 3.3 to use three formats:

* ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff],
UCS-1, 1 byte/character.
* ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff],
UCS-2, 2 bytes/character.
* ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff],
UCS-4, 4 bytes/character.

A Python ``str`` object must always use the most compact format. For
example, a string which only contains ASCII characters must use the
UCS-1 format.

The ``PyUnicode_KIND()`` function can be used to know the format used by
a string.

One of the following functions can be used to access data:

* ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``.
* ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``.
* ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``.

To get the best performance, a C extension should have 3 code paths for
each of these 3 string native formats.

Limited C API
-------------

:pep:`393` functions such as ``PyUnicode_KIND()`` and
``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not
possible to write code specialized for UCS formats. A C extension using
the limited C API can only use less efficient code paths and string
formats.

For example, the MarkupSafe project has a C extension specialized for
UCS formats for best performance, and so cannot use the limited C
API.


Specification
=============

API
---

Add the following API to the limited C API version 3.14::

int32_t PyUnicode_Export(
PyObject *unicode,
int32_t requested_formats,
int32_t flags,
Py_buffer *view);
PyObject* PyUnicode_Import(
const void *data,
Py_ssize_t nbytes,
int32_t format);

#define PyUnicode_FORMAT_UCS1 0x01 // Py_UCS1*
#define PyUnicode_FORMAT_UCS2 0x02 // Py_UCS2*
#define PyUnicode_FORMAT_UCS4 0x04 // Py_UCS4*
#define PyUnicode_FORMAT_UTF8 0x08 // char*
#define PyUnicode_FORMAT_ASCII 0x10 // char* (ASCII string)

The ``int32_t`` type is used instead of ``int`` to have a well defined
type size and not depend on the platform or the compiler.
See `Avoid C-specific Types
<https://github.com/capi-workgroup/api-evolution/issues/10>`_ for the
longer rationale.

PyUnicode_Export()
------------------

API: ``int32_t PyUnicode_Export(PyObject *unicode, int32_t requested_formats, Py_buffer *view)``.

Export the contents of the *unicode* string in one of the *requested_formats*.

* On success, fill *view*, and return a format (greater than ``0``).
* On error, set an exception, and return ``-1``.
*view* is left unchanged.

After a successful call to ``PyUnicode_Export()``,
the *view* buffer must be released by ``PyBuffer_Release()``.
The contents of the buffer are valid until they are released.

The buffer is read-only and must not be modified.

*unicode* and *view* must not be NULL.

Available formats:

=================================== ======== ===========================
Constant Identifier Value Description
=================================== ======== ===========================
``PyUnicode_FORMAT_UCS1`` ``0x01`` UCS-1 string (``Py_UCS1*``)
vstinner marked this conversation as resolved.
Show resolved Hide resolved
``PyUnicode_FORMAT_UCS2`` ``0x02`` UCS-2 string (``Py_UCS2*``)
``PyUnicode_FORMAT_UCS4`` ``0x04`` UCS-4 string (``Py_UCS4*``)
``PyUnicode_FORMAT_UTF8`` ``0x08`` UTF-8 string (``char*``)
``PyUnicode_FORMAT_ASCII`` ``0x10`` ASCII string (``Py_UCS1*``)
=================================== ======== ===========================

UCS-2 and UCS-4 use the native byte order.

*requested_formats* can be a single format or a bitwise combination of the
formats in the table above.
On success, the returned format will be set to a single one of the requested
flags.

Note that future versions of Python may introduce additional formats.

.. _export-complexity:

Export complexity
-----------------

In general, an export has a complexity of *O*\ (1): no memory copy is
needed. There are cases when a copy is needed, *O*\ (*n*) complexity:

* If only UCS-2 is requested and the native format is UCS-1.
* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
* If only UTF-8 is requested: the string is encoded to UTF-8 at the
first call, and then the encoded UTF-8 string is cached.

To have an *O*\ (1) complexity on CPython and PyPy, it's recommended to
support these 4 formats::

(PyUnicode_FORMAT_UCS1 \
| PyUnicode_FORMAT_UCS2 \
| PyUnicode_FORMAT_UCS4 \
| PyUnicode_FORMAT_UTF8)


Py_buffer format and item size
------------------------------

``Py_buffer`` uses the following format and item size depending on the
export format:

========================== ================== ============
Export format Buffer format Item size
========================== ================== ============
``PyUnicode_FORMAT_UCS1`` ``"B"`` 1 byte
``PyUnicode_FORMAT_UCS2`` ``"H"`` 2 bytes
``PyUnicode_FORMAT_UCS4`` ``"I"`` or ``"L"`` 4 bytes
``PyUnicode_FORMAT_UTF8`` ``"B"`` 1 byte
``PyUnicode_FORMAT_ASCII`` ``"B"`` 1 byte
========================== ================== ============


PyUnicode_Import()
------------------

API: ``PyObject* PyUnicode_Import(const void *data, Py_ssize_t nbytes, int32_t format)``.

Create a Unicode string object from a buffer in a supported format.

* Return a reference to a new string object on success.
* Set an exception and return ``NULL`` on error.

*data* must not be NULL. *nbytes* must be positive or zero.

See ``PyUnicode_Export()`` for the available formats.


UTF-8 format
------------

CPython 3.14 doesn't use the UTF-8 format internally. The format is
provided for compatibility with PyPy which uses UTF-8 natively for
strings. However, in CPython, the encoded UTF-8 string is cached which
makes it convenient to be exported.

On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
formats are preferred.

ASCII format
------------

When the ``PyUnicode_FORMAT_ASCII`` format is request for export, the
``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
strings.

The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
``PyUnicode_Import()`` to validate that the string only contains ASCII
characters.


Surrogate characters and NUL characters
---------------------------------------

Surrogate characters are allowed: they can be imported and exported. For
example, the UTF-8 format uses the ``surrogatepass`` error handler.

Embedded NUL characters are allowed: they can be imported and exported.

An exported string does not end with a trailing NUL character: the
``PyUnicode_Export()`` caller must use ``Py_buffer.len`` to get the
string length.


Implementation
==============

https://github.com/python/cpython/pull/123738


Backwards Compatibility
=======================

There is no impact on the backward compatibility, only new C API
functions are added.


Open Questions
==============

* Should we guarantee that the exported buffer always ends with a NUL
character? Is it possible to implement it in *O*\ (1) complexity
in all Python implementations?
* Is it ok to allow surrogate characters?
* Should we add a flag to disallow embedded NUL characters? It would
have an *O*\ (*n*) complexity.
* Should we add a flag to disallow surrogate characters? It would
have an *O*\ (*n*) complexity.


Usage of PEP 393 C APIs
=======================

A code search on PyPI top 7,500 projects (in March 2024) shows that
there are many projects importing and exporting UCS formats with the
regular C API.

PyUnicode_FromKindAndData()
---------------------------

25 projects call ``PyUnicode_FromKindAndData()``:

* **Cython** (3.0.9)
* Levenshtein (0.25.0)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5 (5.15.10)
* PyQt6 (6.6.1)
* aiocsv (1.3.1)
* asyncpg (0.29.0)
* biopython (1.83)
* catboost (1.2.3)
* cffi (1.16.0)
* mojimoji (0.0.13)
* mwparserfromhell (0.6.6)
* numba (0.59.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* rapidfuzz (3.6.2)
* regex (2023.12.25)
* srsly (2.4.8)
* tokenizers (0.15.2)
* ujson (5.9.0)
* unicodedata2 (15.1.0)


PyUnicode_4BYTE_DATA()
----------------------

21 projects call ``PyUnicode_2BYTE_DATA()`` and/or
``PyUnicode_4BYTE_DATA()``:

* **Cython** (3.0.9)
* **MarkupSafe** (2.1.5)
* Nuitka (2.1.2)
* PyICU (2.12)
* PyICU-binary (2.7.4)
* PyQt5_sip (12.13.0)
* PyQt6_sip (13.6.0)
* biopython (1.83)
* catboost (1.2.3)
* cement (3.0.10)
* cffi (1.16.0)
* duckdb (0.10.0)
* **mypy** (1.9.0)
* **numpy** (1.26.4)
* orjson (3.9.15)
* pemja (0.4.1)
* pyahocorasick (2.0.0)
* pyjson5 (1.6.6)
* pyobjc-core (10.2)
* sip (6.8.3)
* wxPython (4.2.1)


Rejected Ideas
==============

Reject embedded NUL characters and require trailing NUL character
-----------------------------------------------------------------

In C, it's convenient to have a trailing NUL character. For example,
the ``for (; *str != 0; str++)`` loop can be used to iterate on
characters and ``strlen()`` can be used to get a string length.

The problem is that a Python ``str`` object can embed NUL characters.
Example: ``"ab\0c"``. If a string contains an embedded NUL character,
code relying on the NUL character to find the string end truncates the
string. It can lead to bugs, or even security vulnerabilities.
See a previous discussion in the issue `Change PyUnicode_AsUTF8()
to return NULL on embedded null characters
<https://github.com/python/cpython/issues/111089>`_.

Rejecting embedded NUL characters require to scan the string which has
an *O*\ (*n*) complexity.

Reject surrogate characters
---------------------------

Surrogate characters are characters in the Unicode range [U+D800;
U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python
``str`` object can contain arbitrary lone surrogate characters. Example:
``"\uDC80"``.

Rejecting surrogate characters prevents exporting a string which contains
such a character. It can be surprising and annoying since the
``PyUnicode_Export()`` caller doesn't control the string contents.

Allowing surrogate characters allows to export any string and so avoid
this issue. For example, the UTF-8 codec can be used with the
``surrogatepass`` error handler to encode and decode surrogate
characters.


Discussions
===========

* https://github.com/capi-workgroup/decisions/issues/33
* https://github.com/python/cpython/issues/119609

Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.