From d8108be4f6f73a669fd11593d5b4e189f0644de2 Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 10:15:58 +0200
Subject: [PATCH 01/10] PEP 756: PyUnicode_Export()

---
 peps/pep-0756.rst | 359 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 359 insertions(+)
 create mode 100644 peps/pep-0756.rst

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
new file mode 100644
index 00000000000..80c82815d27
--- /dev/null
+++ b/peps/pep-0756.rst
@@ -0,0 +1,359 @@
+PEP: 756
+Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
+Author: Victor Stinner
+Status: Draft
+Type: Standards Track
+Created: 13-Sep-2024
+PEP-Delegate: C API Working Group
+Python-Version: 3.14
+
+.. highlight:: c
+
+
+Abstract
+========
+
+Add functions to the limited C API version 3.14:
+
+* ``PyUnicode_Export()``: export a Python str object as a ``Py_buffer``
+  view.
+* ``PyUnicode_Import()``: import a Python str object.
+
+In general, ``PyUnicode_Export()`` has a O(1) complexity: no memory
+copy is needed. See the :ref:`specification ` for
+cases when a copy is needed.
+
+
+Rationale
+=========
+
+PEP 393
+-------
+
+:pep:`393` "Flexible String Representation" changed string internals in
+Python 3.3 to use three formats:
+
+* UCS1: Unicode range [U+0000; U+00ff], 1 byte per character
+* UCS2: Unicode range [U+0000; U+ffff], 2 bytes per character
+* UCS4: Unicode range [U+0000; U+10ffff], 4 bytes per character
+
+A Python ``str`` object must always use the most compact format. For
+example, a string which only contain ASCII characters must use the UCS1
+format.
+
+The ``PyUnicode_KIND()`` function can be used to know the format used by
+a string:
+
+* ``PyUnicode_1BYTE_KIND``: UCS1, 1 byte per character
+* ``PyUnicode_2BYTE_KIND``: UCS2, 2 bytes per character
+* ``PyUnicode_4BYTE_KIND``: UCS4, 4 bytes per character
+
+Then, one of the following function can be used to access data:
+
+* UCS1: ``PyUnicode_1BYTE_DATA()``
+* UCS2: ``PyUnicode_2BYTE_DATA()``
+* UCS4: ``PyUnicode_4BYTE_DATA()``
+
+To get best performance, a C extension should have 3 code paths for each
+of these 3 string native formats.
+
+Limited C API
+-------------
+
+PEP 393 functions such as ``PyUnicode_KIND()`` and
+``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not
+possible to write code specialized for UCS formats. A C extension using
+the limited C API can only use less efficient code paths and string
+formats.
+
+For example, the Markupsafe project has a C extension specialized for
+UCS formats for best performance, and so cannot use the limited C
+API.
+
+
+Specification
+=============
+
+API
+---
+
+Add the following API to the limited C API version 3.14::
+
+    int32_t PyUnicode_Export(
+        PyObject *unicode,
+        int32_t requested_formats,
+        int32_t flags,
+        Py_buffer *view);
+    PyObject* PyUnicode_Import(
+        const void *data,
+        Py_ssize_t nbytes,
+        int32_t format);
+
+    #define PyUnicode_FORMAT_UCS1  0x01  // Py_UCS1*
+    #define PyUnicode_FORMAT_UCS2  0x02  // Py_UCS2*
+    #define PyUnicode_FORMAT_UCS4  0x04  // Py_UCS4*
+    #define PyUnicode_FORMAT_UTF8  0x08  // char*
+    #define PyUnicode_FORMAT_ASCII 0x10  // char* (ASCII string)
+
+PyUnicode_Export()
+------------------
+
+API: ``int32_t PyUnicode_Export(PyObject *unicode, int32_t requested_formats, Py_buffer *view)``.
+
+Export the contents of the *unicode* string in one of the *requested_formats*.
+
+* On success, fill *view*, and return a format (greater than ``0``).
+* On error, set an exception, and return ``-1``.
+  *view* is left unchanged.
+
+After a successful call to ``PyUnicode_Export()``,
+the *view* buffer must be released by ``PyBuffer_Release()``.
+The contents of the buffer are valid until they are released.
+
+The buffer is read-only and must not be modified.
+
+*unicode* and *view* must not be NULL.
+
+Available formats:
+
+=================================== ======== ===========================
+Constant Identifier                 Value    Description
+=================================== ======== ===========================
+``PyUnicode_FORMAT_UCS1``           ``0x01`` UCS-1 string (``Py_UCS1*``)
+``PyUnicode_FORMAT_UCS2``           ``0x02`` UCS-2 string (``Py_UCS2*``)
+``PyUnicode_FORMAT_UCS4``           ``0x04`` UCS-4 string (``Py_UCS4*``)
+``PyUnicode_FORMAT_UTF8``           ``0x08`` UTF-8 string (``char*``)
+``PyUnicode_FORMAT_ASCII``          ``0x10`` ASCII string (``Py_UCS1*``)
+=================================== ======== ===========================
+
+UCS-2 and UCS-4 use the native byte order.
+
+*requested_formats* can be a single format or a bitwise combination of the
+formats in the table above.
+On success, the returned format will be set to a single one of the requested
+flags.
+
+Note that future versions of Python may introduce additional formats.
+
+.. _export-complexity:
+
+Export complexity
+-----------------
+
+In general, an export has a complexity of O(1): no memory copy is
+needed. There are cases when a copy is needed, O(n) complexity:
+
+* If UCS2 is requested and the native format is UCS1.
+* If UCS4 is requested and the native format is UCS1 or UCS2.
+* If UTF8 is requested: the string is encoded to UTF-8 at the first
+  call, and then the encoded UTF-8 string is cached.
+
+To have a O(1) complexity on CPython and PyPy, it's recommended to
+support these 4 formats::
+
+    (PyUnicode_FORMAT_UCS1 \
+     | PyUnicode_FORMAT_UCS2 \
+     | PyUnicode_FORMAT_UCS4 \
+     | PyUnicode_FORMAT_UTF8)
+
+
+Py_buffer format and item size
+------------------------------
+
+``Py_buffer`` uses the following format and item size depending on the
+export format:
+
+========================== ================== ============
+Export format              Buffer format      Item size
+========================== ================== ============
+``PyUnicode_FORMAT_UCS1``  ``"B"``            1 byte
+``PyUnicode_FORMAT_UCS2``  ``"H"``            2 bytes
+``PyUnicode_FORMAT_UCS4``  ``"I"`` or ``"L"`` 4 bytes
+``PyUnicode_FORMAT_UTF8``  ``"B"``            1 byte
+``PyUnicode_FORMAT_ASCII`` ``"B"``            1 byte
+========================== ================== ============
+
+
+PyUnicode_Import()
+------------------
+
+API: ``PyObject* PyUnicode_Import(const void *data, Py_ssize_t nbytes, int32_t format)``.
+
+Create a Unicode string object from a buffer in a supported format.
+
+* Return a reference to a new string object on success.
+* Set an exception and return ``NULL`` on error.
+
+*data* must not be NULL. *nbytes* must be positive or zero.
+
+See ``PyUnicode_Export()`` for the available formats.
+
+Note: The ``PyUnicode_Import()`` function is similar to
+``PyUnicode_FromKindAndData()``, but ``PyUnicode_FromKindAndData()`` is
+excluded from the limited C API.
+
+UTF-8 format
+------------
+
+CPython 3.14 doesn't use the UTF-8 format internally. The format is
+provided for compatibility with PyPy which uses UTF-8 natively for
+strings. Moreover, in CPython, the encoded UTF-8 string is cached which
+makes it convenient to be exported.
+
+On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
+formats are preferred.
+
+ASCII format
+------------
+
+When the ``PyUnicode_FORMAT_ASCII`` format is request for export, the
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin1
+strings.
+
+The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
+``PyUnicode_Import()`` to validate that the string only contains ASCII
+characters.
+
+
+Surrogate characters and NUL character
+---------------------------------------
+
+Surrogate characters are allowed: they can be imported and exported. For
+example, the UTF-8 format uses the ``surrogatepass`` error handler.
+
+Embedded NUL characters are allowed: they can be imported and exported.
+
+An exported string does not end with a trailing NUL character: the
+``PyUnicode_Export()`` caller must use ``Py_buffer.len`` to get the
+string length.
+
+
+Implementation
+==============
+
+https://github.com/python/cpython/pull/123738
+
+
+Backwards Compatibility
+=======================
+
+There is no impact on the backward compatibility, only new C API
+functions are added.
+
+
+Usage of PEP 393 C APIs
+=======================
+
+A code search on PyPI top 7,500 projects (on March 2024) shows that
+there are many projects importing and exporting UCS formats with the
+regular C API.
+
+PyUnicode_FromKindAndData()
+---------------------------
+
+25 projects call ``PyUnicode_FromKindAndData()``:
+
+* **Cython** (3.0.9)
+* Levenshtein (0.25.0)
+* PyICU (2.12)
+* PyICU-binary (2.7.4)
+* PyQt5 (5.15.10)
+* PyQt6 (6.6.1)
+* aiocsv (1.3.1)
+* asyncpg (0.29.0)
+* biopython (1.83)
+* catboost (1.2.3)
+* cffi (1.16.0)
+* mojimoji (0.0.13)
+* mwparserfromhell (0.6.6)
+* numba (0.59.0)
+* **numpy** (1.26.4)
+* orjson (3.9.15)
+* pemja (0.4.1)
+* pyahocorasick (2.0.0)
+* pyjson5 (1.6.6)
+* rapidfuzz (3.6.2)
+* regex (2023.12.25)
+* srsly (2.4.8)
+* tokenizers (0.15.2)
+* ujson (5.9.0)
+* unicodedata2 (15.1.0)
+
+
+PyUnicode_4BYTE_DATA()
+----------------------
+
+21 projects call ``PyUnicode_2BYTE_DATA()`` and/or
+``PyUnicode_4BYTE_DATA()``:
+
+* **Cython** (3.0.9)
+* **MarkupSafe** (2.1.5)
+* Nuitka (2.1.2)
+* PyICU (2.12)
+* PyICU-binary (2.7.4)
+* PyQt5_sip (12.13.0)
+* PyQt6_sip (13.6.0)
+* biopython (1.83)
+* catboost (1.2.3)
+* cement (3.0.10)
+* cffi (1.16.0)
+* duckdb (0.10.0)
+* **mypy** (1.9.0)
+* **numpy** (1.26.4)
+* orjson (3.9.15)
+* pemja (0.4.1)
+* pyahocorasick (2.0.0)
+* pyjson5 (1.6.6)
+* pyobjc-core (10.2)
+* sip (6.8.3)
+* wxPython (4.2.1)
+
+
+Rejected Ideas
+==============
+
+Reject embedded NUL character and trailing NUL character
+--------------------------------------------------------
+
+In C, it's convenient to have a trailing NUL character. For example,
+the ``for (; *str != 0; str++)`` loop can be used to iterate on
+characters and ``strlen()`` can be used to get a string length.
+
+The problem is that a Python ``str`` object can embed NUL characters.
+Example: ``"ab\0c"``. If a string contains an embedded NUL character,
+code relying on the NUL character to find the string end truncate the
+string. It can lead to bugs, or even security vulnerabilities.
+
+Rejecting embedded NUL characters require to scan the string which has
+a O(n) complexity.
+
+Reject surrogate characters
+---------------------------
+
+Surrogate characters are characters in the Unicode range [U+D800;
+U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python
+``str`` object can contain arbitrary lone surrogate characters. Example:
+``"\uDC80"``.
+
+Rejecting surogate characters prevents exporting a string which contain
+such character. It can be surprising and annoying since the
+``PyUnicode_Export()`` caller doesn't control the string content.
+
+Allowing surrogate characters allows to export any string and so avoid
+this issue. For example, the UTF-8 codec can be used with the
+``surrogatepass`` error handler to encode and decode surrogate
+characters.
+
+
+Discussions
+===========
+
+* https://github.com/capi-workgroup/decisions/issues/33
+* https://github.com/python/cpython/issues/119609
+
+Copyright
+=========
+
+This document is placed in the public domain or under the
+CC0-1.0-Universal license, whichever is more permissive.
+

From b6f757bc5f47f15369690699120b584d5cea8d1f Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 12:06:31 +0200
Subject: [PATCH 02/10] Fix header order

---
 peps/pep-0756.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 80c82815d27..7aa3ca4a7d9 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -1,10 +1,10 @@
 PEP: 756
 Title: Add PyUnicode_Export() and PyUnicode_Import() C functions
 Author: Victor Stinner
+PEP-Delegate: C API Working Group
 Status: Draft
 Type: Standards Track
 Created: 13-Sep-2024
-PEP-Delegate: C API Working Group
 Python-Version: 3.14
 
 .. highlight:: c

From 1a51c958f5d2e6d31b809359280df87d0038bee2 Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 15:09:39 +0200
Subject: [PATCH 03/10] Apply suggestions from code review

Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
---
 peps/pep-0756.rst | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 7aa3ca4a7d9..66470d44dd3 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -19,7 +19,7 @@ Add functions to the limited C API version 3.14:
   view.
 * ``PyUnicode_Import()``: import a Python str object.
 
-In general, ``PyUnicode_Export()`` has a O(1) complexity: no memory
+In general, ``PyUnicode_Export()`` has an *O*\ (1) complexity: no memory
 copy is needed. See the :ref:`specification ` for
 cases when a copy is needed.
 
@@ -48,25 +48,25 @@ a string:
 * ``PyUnicode_2BYTE_KIND``: UCS2, 2 bytes per character
 * ``PyUnicode_4BYTE_KIND``: UCS4, 4 bytes per character
 
-Then, one of the following function can be used to access data:
+Then, one of the following functions can be used to access data:
 
 * UCS1: ``PyUnicode_1BYTE_DATA()``
 * UCS2: ``PyUnicode_2BYTE_DATA()``
 * UCS4: ``PyUnicode_4BYTE_DATA()``
 
-To get best performance, a C extension should have 3 code paths for each
+To get the best performance, a C extension should have 3 code paths for each
 of these 3 string native formats.
 
 Limited C API
 -------------
 
-PEP 393 functions such as ``PyUnicode_KIND()`` and
+:pep:`393` functions such as ``PyUnicode_KIND()`` and
 ``PyUnicode_1BYTE_DATA()`` are excluded from the limited C API. It's not
 possible to write code specialized for UCS formats. A C extension using
 the limited C API can only use less efficient code paths and string
 formats.
 
-For example, the Markupsafe project has a C extension specialized for
+For example, the MarkupSafe project has a C extension specialized for
 UCS formats for best performance, and so cannot use the limited C
 API.
 
@@ -140,15 +140,15 @@ Note that future versions of Python may introduce additional formats.
 Export complexity
 -----------------
 
-In general, an export has a complexity of O(1): no memory copy is
-needed. There are cases when a copy is needed, O(n) complexity:
+In general, an export has a complexity of *O*\ (1): no memory copy is
+needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
 
 * If UCS2 is requested and the native format is UCS1.
 * If UCS4 is requested and the native format is UCS1 or UCS2.
 * If UTF8 is requested: the string is encoded to UTF-8 at the first
   call, and then the encoded UTF-8 string is cached.
 
-To have a O(1) complexity on CPython and PyPy, it's recommended to
+To have an *O*\ (1) complexity on CPython and PyPy, it's recommended to
 support these 4 formats::
 
     (PyUnicode_FORMAT_UCS1 \
@@ -207,7 +207,7 @@ ASCII format
 ------------
 
 When the ``PyUnicode_FORMAT_ASCII`` format is request for export, the
-``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin1
+``PyUnicode_FORMAT_UCS1`` export format is used for ASCII and Latin-1
 strings.
 
 The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
@@ -244,7 +244,7 @@ functions are added.
 Usage of PEP 393 C APIs
 =======================
 
-A code search on PyPI top 7,500 projects (on March 2024) shows that
+A code search on PyPI top 7,500 projects (in March 2024) shows that
 there are many projects importing and exporting UCS formats with the
 regular C API.
 
@@ -325,7 +325,7 @@ code relying on the NUL character to find the string end truncate the
 string. It can lead to bugs, or even security vulnerabilities.
 
 Rejecting embedded NUL characters require to scan the string which has
-a O(n) complexity.
+an *O*\ (*n*) complexity.
 
 Reject surrogate characters
 ---------------------------
@@ -335,9 +335,9 @@ U+DFFF]. They are disallowed by UTF codecs such as UTF-8. A Python
 ``str`` object can contain arbitrary lone surrogate characters. Example:
 ``"\uDC80"``.
 
-Rejecting surogate characters prevents exporting a string which contain
-such character. It can be surprising and annoying since the
-``PyUnicode_Export()`` caller doesn't control the string content.
+Rejecting surrogate characters prevents exporting a string which contains
+such a character. It can be surprising and annoying since the
+``PyUnicode_Export()`` caller doesn't control the string contents.
 
 Allowing surrogate characters allows to export any string and so avoid
 this issue. For example, the UTF-8 codec can be used with the

From c1e65850d5928962bd0b67b2028f83ff7a6c2d4f Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 15:13:51 +0200
Subject: [PATCH 04/10] Add hypens to UCS-n

---
 peps/pep-0756.rst | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 66470d44dd3..38ad8b86a31 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -33,26 +33,26 @@ PEP 393
 :pep:`393` "Flexible String Representation" changed string internals in
 Python 3.3 to use three formats:
 
-* UCS1: Unicode range [U+0000; U+00ff], 1 byte per character
-* UCS2: Unicode range [U+0000; U+ffff], 2 bytes per character
-* UCS4: Unicode range [U+0000; U+10ffff], 4 bytes per character
+* UCS-1: Unicode range [U+0000; U+00ff], 1 byte per character
+* UCS-2: Unicode range [U+0000; U+ffff], 2 bytes per character
+* UCS-4: Unicode range [U+0000; U+10ffff], 4 bytes per character
 
 A Python ``str`` object must always use the most compact format. For
-example, a string which only contain ASCII characters must use the UCS1
+example, a string which only contain ASCII characters must use the UCS-1
 format.
 
 The ``PyUnicode_KIND()`` function can be used to know the format used by
 a string:
 
-* ``PyUnicode_1BYTE_KIND``: UCS1, 1 byte per character
-* ``PyUnicode_2BYTE_KIND``: UCS2, 2 bytes per character
-* ``PyUnicode_4BYTE_KIND``: UCS4, 4 bytes per character
+* ``PyUnicode_1BYTE_KIND``: UCS-1, 1 byte per character
+* ``PyUnicode_2BYTE_KIND``: UCS-2, 2 bytes per character
+* ``PyUnicode_4BYTE_KIND``: UCS-4, 4 bytes per character
 
 Then, one of the following functions can be used to access data:
 
-* UCS1: ``PyUnicode_1BYTE_DATA()``
-* UCS2: ``PyUnicode_2BYTE_DATA()``
-* UCS4: ``PyUnicode_4BYTE_DATA()``
+* UCS-1: ``PyUnicode_1BYTE_DATA()``
+* UCS-2: ``PyUnicode_2BYTE_DATA()``
+* UCS-4: ``PyUnicode_4BYTE_DATA()``
 
 To get the best performance, a C extension should have 3 code paths for each
 of these 3 string native formats.
@@ -143,8 +143,8 @@ Export complexity
 In general, an export has a complexity of *O*\ (1): no memory copy is
 needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
 
-* If UCS2 is requested and the native format is UCS1.
-* If UCS4 is requested and the native format is UCS1 or UCS2.
+* If UCS-2 is requested and the native format is UCS-1.
+* If UCS-4 is requested and the native format is UCS-1 or UCS-2.
 * If UTF8 is requested: the string is encoded to UTF-8 at the first
   call, and then the encoded UTF-8 string is cached.
 

From 7baf0d0215f2b9a60d35c4ed20a7c54f05681e33 Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 15:15:03 +0200
Subject: [PATCH 05/10] a string contain => a string contains

---
 peps/pep-0756.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 38ad8b86a31..85cb329d044 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -38,7 +38,7 @@ Python 3.3 to use three formats:
 * UCS-4: Unicode range [U+0000; U+10ffff], 4 bytes per character
 
 A Python ``str`` object must always use the most compact format. For
-example, a string which only contain ASCII characters must use the UCS-1
+example, a string which only contains ASCII characters must use the UCS-1
 format.
 
 The ``PyUnicode_KIND()`` function can be used to know the format used by

From 94ea9b237a3ac0e920b3750985f46049fbd382be Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 15:15:58 +0200
Subject: [PATCH 06/10] Add hyphen to UTF-8

---
 peps/pep-0756.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 85cb329d044..6d2cf3db8d0 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -145,7 +145,7 @@ needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
 
 * If UCS-2 is requested and the native format is UCS-1.
 * If UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If UTF8 is requested: the string is encoded to UTF-8 at the first
+* If UTF-8 is requested: the string is encoded to UTF-8 at the first
   call, and then the encoded UTF-8 string is cached.
 
 To have an *O*\ (1) complexity on CPython and PyPy, it's recommended to

From fa2d06eaf4cdb0be8e61a0ff24bb08224802ee41 Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 15:16:59 +0200
Subject: [PATCH 07/10] Add myself to CODEOWNERS

---
 .github/CODEOWNERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 89aca1b8b9c..dcd54b922c2 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -635,6 +635,7 @@ peps/pep-0752.rst @warsaw
 # ...
 # peps/pep-0754.rst
 # ...
+peps/pep-0756.rst @vstinner
 peps/pep-0789.rst @njsmith
 # ...
 peps/pep-0801.rst @warsaw

From bfa62f9784e526baa79c968d730abc86538dca3c Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 15:57:00 +0200
Subject: [PATCH 08/10] Clarify title

---
 peps/pep-0756.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 6d2cf3db8d0..ea8d334b9f3 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -312,8 +312,8 @@ PyUnicode_4BYTE_DATA()
 Rejected Ideas
 ==============
 
-Reject embedded NUL character and trailing NUL character
---------------------------------------------------------
+Reject embedded NUL characters and require trailing NUL character
+-----------------------------------------------------------------
 
 In C, it's convenient to have a trailing NUL character. For example,
 the ``for (; *str != 0; str++)`` loop can be used to iterate on
@@ -321,7 +321,7 @@ characters and ``strlen()`` can be used to get a string length.
 
 The problem is that a Python ``str`` object can embed NUL characters.
 Example: ``"ab\0c"``. If a string contains an embedded NUL character,
-code relying on the NUL character to find the string end truncate the
+code relying on the NUL character to find the string end truncates the
 string. It can lead to bugs, or even security vulnerabilities.
 
 Rejecting embedded NUL characters require to scan the string which has

From 9a96b23ab506bd1e6dc96e29341e039875522097 Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 13 Sep 2024 23:50:59 +0200
Subject: [PATCH 09/10] Open Questions; rephrase PEP 393

---
 peps/pep-0756.rst | 44 ++++++++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 16 deletions(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index ea8d334b9f3..0cf490de14a 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -33,29 +33,28 @@ PEP 393
 :pep:`393` "Flexible String Representation" changed string internals in
 Python 3.3 to use three formats:
 
-* UCS-1: Unicode range [U+0000; U+00ff], 1 byte per character
-* UCS-2: Unicode range [U+0000; U+ffff], 2 bytes per character
-* UCS-4: Unicode range [U+0000; U+10ffff], 4 bytes per character
+* ``PyUnicode_1BYTE_KIND``: Unicode range [U+0000; U+00ff],
+  UCS-1, 1 byte/character.
+* ``PyUnicode_2BYTE_KIND``: Unicode range [U+0000; U+ffff],
+  UCS-2, 2 bytes/character.
+* ``PyUnicode_4BYTE_KIND``: Unicode range [U+0000; U+10ffff],
+  UCS-4, 4 bytes/character.
 
 A Python ``str`` object must always use the most compact format. For
-example, a string which only contains ASCII characters must use the UCS-1
-format.
+example, a string which only contains ASCII characters must use the
+UCS-1 format.
 
 The ``PyUnicode_KIND()`` function can be used to know the format used by
-a string:
+a string.
 
-* ``PyUnicode_1BYTE_KIND``: UCS-1, 1 byte per character
-* ``PyUnicode_2BYTE_KIND``: UCS-2, 2 bytes per character
-* ``PyUnicode_4BYTE_KIND``: UCS-4, 4 bytes per character
+One of the following functions can be used to access data:
 
-Then, one of the following functions can be used to access data:
+* ``PyUnicode_1BYTE_DATA()`` for ``PyUnicode_1BYTE_KIND``.
+* ``PyUnicode_2BYTE_DATA()`` for ``PyUnicode_2BYTE_KIND``.
+* ``PyUnicode_4BYTE_DATA()`` for ``PyUnicode_4BYTE_KIND``.
 
-* UCS-1: ``PyUnicode_1BYTE_DATA()``
-* UCS-2: ``PyUnicode_2BYTE_DATA()``
-* UCS-4: ``PyUnicode_4BYTE_DATA()``
-
-To get the best performance, a C extension should have 3 code paths for each
-of these 3 string native formats.
+To get the best performance, a C extension should have 3 code paths for
+each of these 3 string native formats.
 
 Limited C API
 -------------
@@ -241,6 +240,19 @@ There is no impact on the backward compatibility, only new C API
 functions are added.
 
 
+Open Questions
+==============
+
+* Should we guarantee that the exported buffer always ends with a NUL
+  character? Is it possible to implement it in *O*\ (1) complexity
+  in all Python implementations?
+* Is it ok to allow surrogate characters?
+* Should we add a flag to disallow embedded NUL characters? It would
+  have an *O*\ (*n*) complexity.
+* Should we add a flag to disallow surrogate characters? It would
+  have an *O*\ (*n*) complexity.
+
+
 Usage of PEP 393 C APIs
 =======================
 

From f51510c3f62f7bf4cbd3078ce02928e1b4c9371a Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Sat, 14 Sep 2024 10:51:26 +0200
Subject: [PATCH 10/10] Update

---
 peps/pep-0756.rst | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/peps/pep-0756.rst b/peps/pep-0756.rst
index 0cf490de14a..68942a04aa2 100644
--- a/peps/pep-0756.rst
+++ b/peps/pep-0756.rst
@@ -94,6 +94,12 @@ Add the following API to the limited C API version 3.14::
     #define PyUnicode_FORMAT_UTF8  0x08  // char*
     #define PyUnicode_FORMAT_ASCII 0x10  // char* (ASCII string)
 
+The ``int32_t`` type is used instead of ``int`` to have a well defined
+type size and not depend on the platform or the compiler.
+See `Avoid C-specific Types
+`_ for the
+longer rationale.
+
 PyUnicode_Export()
 ------------------
 
@@ -142,10 +148,10 @@ Export complexity
 In general, an export has a complexity of *O*\ (1): no memory copy is
 needed. There are cases when a copy is needed, *O*\ (*n*) complexity:
 
-* If UCS-2 is requested and the native format is UCS-1.
-* If UCS-4 is requested and the native format is UCS-1 or UCS-2.
-* If UTF-8 is requested: the string is encoded to UTF-8 at the first
-  call, and then the encoded UTF-8 string is cached.
+* If only UCS-2 is requested and the native format is UCS-1.
+* If only UCS-4 is requested and the native format is UCS-1 or UCS-2.
+* If only UTF-8 is requested: the string is encoded to UTF-8 at the
+  first call, and then the encoded UTF-8 string is cached.
 
 To have an *O*\ (1) complexity on CPython and PyPy, it's recommended to
 support these 4 formats::
@@ -187,16 +193,13 @@ Create a Unicode string object from a buffer in a supported format.
 
 See ``PyUnicode_Export()`` for the available formats.
 
-Note: The ``PyUnicode_Import()`` function is similar to
-``PyUnicode_FromKindAndData()``, but ``PyUnicode_FromKindAndData()`` is
-excluded from the limited C API.
 
 UTF-8 format
 ------------
 
 CPython 3.14 doesn't use the UTF-8 format internally. The format is
 provided for compatibility with PyPy which uses UTF-8 natively for
-strings. Moreover, in CPython, the encoded UTF-8 string is cached which
+strings. However, in CPython, the encoded UTF-8 string is cached which
 makes it convenient to be exported.
 
 On CPython, the UTF-8 format has the lowest priority: ASCII and UCS
@@ -214,7 +217,7 @@ The ``PyUnicode_FORMAT_ASCII`` format is mostly useful for
 characters.
 
 
-Surrogate characters and NUL character
+Surrogate characters and NUL characters
 ---------------------------------------
 
 Surrogate characters are allowed: they can be imported and exported. For
@@ -335,6 +338,9 @@ The problem is that a Python ``str`` object can embed NUL characters.
 Example: ``"ab\0c"``. If a string contains an embedded NUL character,
 code relying on the NUL character to find the string end truncates the
 string. It can lead to bugs, or even security vulnerabilities.
+See a previous discussion in the issue `Change PyUnicode_AsUTF8()
+to return NULL on embedded null characters
+`_.
 
 Rejecting embedded NUL characters require to scan the string which has
 an *O*\ (*n*) complexity.
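
Note on the embedded-NUL rationale in the last hunk (not part of the patch series): the truncation problem can be reproduced in plain C without any Python API. The sketch below is only an illustration under stated assumptions. The helper names ``length_by_terminator`` and ``count_byte`` are hypothetical, and the explicit ``len`` argument merely stands in for the role the PEP gives to ``Py_buffer.len``.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length as seen by code that scans for a NUL
   terminator, which is what strlen() does. */
static size_t length_by_terminator(const char *data)
{
    assert(data != NULL);
    return strlen(data);
}

/* Hypothetical helper: count occurrences of ch using an explicit byte
   length, the way a caller would rely on Py_buffer.len. Embedded NUL
   bytes are treated as ordinary data. */
static size_t count_byte(const char *data, size_t len, char ch)
{
    assert(data != NULL);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (data[i] == ch) {
            n++;
        }
    }
    return n;
}
```

With the PEP's own example ``"ab\0c"`` (4 characters), the terminator-based length stops at the embedded NUL, so a scan bounded by it never sees the final ``c``, while a scan bounded by the explicit length does.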