gh-125651: Fix UUID hex parsing with underscores #125652

sevdog · 2024-10-17T16:18:09Z

Adds sanitization from "digit grouping separators" in UUID parser to comply with RFC.

Issue: UUID wrongly parses hex values with underscores #125651

cpython-cla-bot · 2024-10-17T16:18:12Z

All commit authors signed the Contributor License Agreement.

bedevere-app · 2024-10-17T16:18:13Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

nineteendo · 2024-10-17T17:36:58Z

Lib/uuid.py

@@ -176,7 +176,7 @@ def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None,
                            'or int arguments must be given')
        if hex is not None:
            hex = hex.replace('urn:', '').replace('uuid:', '')
-            hex = hex.strip('{}').replace('-', '')
+            hex = hex.strip("{}").replace("-", "").replace("_", "")
            if len(hex) != 32:


Shouldn't we also check for a 0x prefix like the json module?

cpython/Lib/json/decoder.py

Line 61 in f203d1c

if len(esc) == 4 and esc[1] not in 'xX':

Suggested change

if len(hex) != 32:

if len(hex) != 32 or hex[0] == "+" or hex[1] in "xX":

Edit 1: And a plus sign?
Edit 2: The number can be surrounded by whitespace too...

Good catch, also those are not valid UUID.

However I think that it would be better to lstrip those elements, to allow providing a UUID like 0x12345678123456781234567812345678 or +12345678123456781234567812345678, but now that I wrote it down it feels strange to me.

What do you thing @nineteendo?

I would reject strings that don't match this regex: r'[0-9A-Fa-f]{32}', supporting them is a feature that needs to be discussed.

Suggested change

if len(hex) != 32:

if len(hex) != 32 or not re.match(r'[0-9A-Fa-f]{32}', hex)

I will use this check (and remove the underscore strip on the above line): the length check will easy get other errors and is faster that a regex, for everything else the regex matches any unwanted character before passing to int.

You shouldn't worry about performance in case of an error:

Suggested change

if len(hex) != 32:

if not re.fullmatch(r'[0-9A-Fa-f]{32}', hex):

nineteendo · 2024-10-18T06:53:06Z

Can you sign the Contributor License Agreement above?

sevdog · 2024-10-18T06:57:52Z

Signed, I was waiting for approval regarding the bug/solution before signing.

Now I will update code and tests with what we discussed above and then I will push them.

… to int

bedevere-app · 2024-10-18T07:50:28Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

nineteendo · 2024-10-18T07:54:05Z

Lib/test/test_uuid.py

@@ -232,9 +232,17 @@ def test_exceptions(self):
        # Badly formed hex strings.
        badvalue(lambda: self.uuid.UUID(''))
        badvalue(lambda: self.uuid.UUID('abc'))
+        badvalue(lambda: self.uuid.UUID('123_4567812345678123456781234567'))
+        badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))
+        badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))


Duplicate:

Suggested change

badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))

bedevere-app · 2024-10-18T09:03:58Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

nineteendo · 2024-10-18T09:14:37Z

Can you add a news entry: https://blurb-it.herokuapp.com/add_blurb.

sevdog · 2024-10-18T09:47:31Z

Ok, I added a simple news using the blurb_it site. Tell me if it is fine.

nineteendo · 2024-10-18T09:49:51Z

Misc/NEWS.d/next/Library/2024-10-18-09-45-34.gh-issue-125651.M0wSAS.rst

@@ -0,0 +1 @@
+Fix parsing of HEX encoded UUID string


Suggested change

Fix parsing of HEX encoded UUID string

Fix HEX parsing of :class:`uuid.UUID`.

nineteendo · 2024-10-18T09:50:54Z

Can you also add a test with unicode digits:

>>> int("\uff10", 16)
0

picnixz · 2024-10-21T15:32:58Z

I'm not fond of the regex import. We take care of delaying imports as much as possible (e.g., the time module) so I would prefer not to use it.

If possible we should compile the pattern in advance + cache the method. However for this kind of check we can just keep the length check + a check on the range of ord() (possibly combined with a .lower() to make it simpler). Or use strings.hexdigits and set intersection.

sevdog · 2024-10-21T16:24:37Z

@picnixz to avoid imports you would like something like these

# set diff: if the result is not empty there are unwanted characters
len(hex) != 32 or set(hex.lower()) - set('abcdef0123456789')
# range comparison with ord
len(hex) != 32 or not all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in hex.lower())
# range comparison with characters 
len(hex) != 32 or not all('0' <= h <= '9' or 'a' <= h <= 'f' for h in hex.lower())

picnixz · 2024-10-21T16:31:35Z

Yes, something like this. I'm not sure which one is the most efficient nor whether this warrants this kind of alternative but I feel that we shouldn't import the re module (or at least delay its import). I think the 3rd option is the best one. Or we could also skip the .lower() and add 'A' to 'F' checks (probably faster) but remember that we don't care about performances in case of a failure. (The first is the clearest though IMO)

I'd like a core dev opinion on that matter though.

sevdog · 2024-10-22T07:11:23Z

In the meantime i tried a timeit (with IPython) on the abovementioned solutions

In [6]: hex = '0123456781234567812345678z2345_6'

In [7]: %%timeit
   ...: len(hex) != 32 or set(hex.lower()) - set('abcdef0123456789')
926 ns ± 39.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [8]: %%timeit
   ...: len(hex) != 32 or not all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in hex.lower())
2.53 μs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %%timeit
   ...: len(hex) != 32 or not all('0' <= h <= '9' or 'a' <= h <= 'f' for h in hex.lower())
2.26 μs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [10]: %%timeit
    ...: len(hex) != 32 or not all('0' <= h <= '9' or 'a' <= h <= 'f' or 'A' <= h <= 'F' for h in hex)
2.25 μs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [11]: %%timeit
    ...: len(hex) != 32 or set(hex) - set('abcdefABCDEF0123456789')
1.06 μs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Set difference on my host (with Python 3.10) seems to be the fastest one (about 50% less time than the others), but a better benchmarking may performed.

I will wait for the opinion of a core developer before proceeding with further changes and tests.

picnixz · 2024-10-22T07:39:02Z

Set is probably the fastest and cleanest. You could save a few keystrokes using a literal set construction for the rhs though and check if the .lower() call is better than introducing the capital letters in the set.

I'll do more benchmarks today

sevdog · 2024-10-22T08:02:28Z

An other way to give a small improvements could be to hold the set of HEX characters in a variable.

Regarding the time difference between set difference with/without lower in the benchamrk: it is minimal and the two values overlaps if you consider error margins.
Also the test HEX string is only digits, a better benchmark should use also alfa-hex characters and maybe considering the case in which:

every A-F character is uppercase
every A-F is lowercase
there are more uppercase than lowercase
there are more lowercase than uppercase

😇

picnixz · 2024-10-22T13:26:16Z

TL;DR: the fastest is: using .lower() and set.

Here are the benchmarks (PGO build):

+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| Benchmark                        | uuid-ci_all | uuid-ci_all_ord       | uuid-ci_set          | uuid-cs_all           | uuid-cs_all_ord       | uuid-cs_set          |
+==================================+=============+=======================+======================+=======================+=======================+======================+
| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | 1.55 us     | 1.41 us: 1.10x faster | 299 ns: 5.18x faster | 1.60 us: 1.03x slower | 1.91 us: 1.23x slower | 394 ns: 3.93x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA | 1.55 us     | 1.42 us: 1.09x faster | 304 ns: 5.10x faster | 2.03 us: 1.31x slower | 1.49 us: 1.04x faster | 398 ns: 3.90x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| fcaffdeaebaecddabaacfadcdadbddcc | 1.73 us     | 1.54 us: 1.13x faster | 339 ns: 5.11x faster | 1.76 us: 1.01x slower | 2.11 us: 1.22x slower | 434 ns: 3.99x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| cFfAaAabfeaBbAbFDFbdadDcbFbEaeee | 1.71 us     | 1.53 us: 1.12x faster | 339 ns: 5.04x faster | 2.09 us: 1.22x slower | 1.97 us: 1.15x slower | 436 ns: 3.91x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| ECECEFCEEBABCADECEDDCEEAFBAFEECA | 1.68 us     | 1.55 us: 1.08x faster | 343 ns: 4.90x faster | 2.22 us: 1.32x slower | 1.56 us: 1.08x faster | 434 ns: 3.88x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 30205096390204237858385715385475 | 1.01 us     | 936 ns: 1.08x faster  | 356 ns: 2.85x faster | 999 ns: 1.02x faster  | 912 ns: 1.11x faster  | 439 ns: 2.31x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 529ed150aac236b3d6df8e6add146d41 | 1.36 us     | 1.21 us: 1.13x faster | 366 ns: 3.72x faster | 1.38 us: 1.01x slower | 1.42 us: 1.04x slower | 465 ns: 2.93x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 63C90D7D25167B86846A38E7BBDCB16D | 1.32 us     | 1.23 us: 1.08x faster | 363 ns: 3.64x faster | 1.46 us: 1.10x slower | 1.22 us: 1.08x faster | 456 ns: 2.90x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 13DcFAF6f7E64B6bd7775AccC5300FFe | 1.46 us     | 1.25 us: 1.17x faster | 366 ns: 3.99x faster | 1.61 us: 1.10x slower | 1.34 us: 1.09x faster | 469 ns: 3.11x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| Geometric mean                   | (ref)       | 1.11x faster          | 4.31x faster         | 1.12x slower          | 1.02x slower          | 3.37x faster         |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+

Here is the script I used:

import pyperf

import random
import string

random.seed(1234)

PATTERNS = [
    'a' * 32,
    'A' * 32,
    ''.join(random.choices('abcdef', k=32)),
    ''.join(random.choices('abcdefABCDEF', k=32)),
    ''.join(random.choices('ABCDEF', k=32)),
    ''.join(random.choices(string.digits, k=32)),
    ''.join(random.choices(string.digits + 'abcdef', k=32)),
    ''.join(random.choices(string.digits + 'ABCDDEF', k=32)),
    ''.join(random.choices(string.hexdigits, k=32)),
]

def cs_set(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = set(val) <= set('0123456789abcdefABCDEF')
    return pyperf.perf_counter() - t0

def ci_set(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = set(val.lower()) <= set('0123456789abcdef')
    return pyperf.perf_counter() - t0

def cs_all_ord(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all(48 <= ord(h) <= 57 or 65 <= ord(h) <= 70 or 97 <= ord(h) <= 102 for h in val)
    return pyperf.perf_counter() - t0

def ci_all_ord(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in val.lower())
    return pyperf.perf_counter() - t0

def cs_all(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all('0' <= h <= '9' or 'a' <= h <= 'f' or 'A' <= h <= 'F' for h in val)
    return pyperf.perf_counter() - t0

def ci_all(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all('0' <= h <= '9' or 'a' <= h <= 'f' for h in val.lower())
    return pyperf.perf_counter() - t0

def bench(runner, func):
    for pattern in PATTERNS:
        runner.bench_time_func(pattern, func, pattern)

def add_cmdline_args(cmd, args):
    cmd.append(args.impl)

if __name__ == '__main__':
    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
    runner.argparser.add_argument('impl', choices=['cs_set', 'ci_set', 'cs_all_ord', 'ci_all_ord', 'cs_all', 'ci_all'])
    args = runner.parse_args()
    bench(runner, globals()[args.impl])

</script>

pythongh-125651: Fix UUID hex parsing with underscores

6c7a1c5

bedevere-app bot added the awaiting review label Oct 17, 2024

bedevere-app bot mentioned this pull request Oct 17, 2024

UUID wrongly parses hex values with underscores #125651

Open

nineteendo reviewed Oct 17, 2024

View reviewed changes

nineteendo mentioned this pull request Oct 17, 2024

Python implementation of json.loads() accepts invalid unicode escapes #125660

Closed

pythongh-125651: Use regex to avoid any unallowed character to passed…

cde9e36

… to int

nineteendo reviewed Oct 18, 2024

View reviewed changes

pythongh-125651: Remove duplicated check in test

6719b85

📜🤖 Added by blurb_it.

965db01

nineteendo reviewed Oct 18, 2024

View reviewed changes

pythongh-125651: Add test for unicode sequence and adjust news message

5961212

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-125651: Fix UUID hex parsing with underscores #125652

gh-125651: Fix UUID hex parsing with underscores #125652

sevdog commented Oct 17, 2024 •

edited by bedevere-app bot

Loading

cpython-cla-bot bot commented Oct 17, 2024 •

edited

Loading

bedevere-app bot commented Oct 17, 2024

nineteendo Oct 17, 2024 •

edited

Loading

sevdog Oct 18, 2024

nineteendo Oct 18, 2024

sevdog Oct 18, 2024

nineteendo Oct 18, 2024

nineteendo commented Oct 18, 2024

sevdog commented Oct 18, 2024

bedevere-app bot commented Oct 18, 2024

nineteendo Oct 18, 2024

bedevere-app bot commented Oct 18, 2024

nineteendo commented Oct 18, 2024

sevdog commented Oct 18, 2024

nineteendo Oct 18, 2024

nineteendo commented Oct 18, 2024

picnixz commented Oct 21, 2024

sevdog commented Oct 21, 2024 •

edited

Loading

picnixz commented Oct 21, 2024 •

edited

Loading

sevdog commented Oct 22, 2024

picnixz commented Oct 22, 2024 •

edited

Loading

sevdog commented Oct 22, 2024

picnixz commented Oct 22, 2024 •

edited

Loading

	if len(hex) != 32:
	if len(hex) != 32 or hex[0] == "+" or hex[1] in "xX":

	if len(hex) != 32:
	if len(hex) != 32 or not re.match(r'[0-9A-Fa-f]{32}', hex)

	if len(hex) != 32:
	if not re.fullmatch(r'[0-9A-Fa-f]{32}', hex):

	Fix parsing of HEX encoded UUID string
	Fix HEX parsing of :class:`uuid.UUID`.

gh-125651: Fix UUID hex parsing with underscores #125652

Are you sure you want to change the base?

gh-125651: Fix UUID hex parsing with underscores #125652

Conversation

sevdog commented Oct 17, 2024 • edited by bedevere-app bot Loading

cpython-cla-bot bot commented Oct 17, 2024 • edited Loading

bedevere-app bot commented Oct 17, 2024

nineteendo Oct 17, 2024 • edited Loading

Choose a reason for hiding this comment

sevdog Oct 18, 2024

Choose a reason for hiding this comment

nineteendo Oct 18, 2024

Choose a reason for hiding this comment

sevdog Oct 18, 2024

Choose a reason for hiding this comment

nineteendo Oct 18, 2024

Choose a reason for hiding this comment

nineteendo commented Oct 18, 2024

sevdog commented Oct 18, 2024

bedevere-app bot commented Oct 18, 2024

nineteendo Oct 18, 2024

Choose a reason for hiding this comment

bedevere-app bot commented Oct 18, 2024

nineteendo commented Oct 18, 2024

sevdog commented Oct 18, 2024

nineteendo Oct 18, 2024

Choose a reason for hiding this comment

nineteendo commented Oct 18, 2024

picnixz commented Oct 21, 2024

sevdog commented Oct 21, 2024 • edited Loading

picnixz commented Oct 21, 2024 • edited Loading

sevdog commented Oct 22, 2024

picnixz commented Oct 22, 2024 • edited Loading

sevdog commented Oct 22, 2024

picnixz commented Oct 22, 2024 • edited Loading

sevdog commented Oct 17, 2024 •

edited by bedevere-app bot

Loading

cpython-cla-bot bot commented Oct 17, 2024 •

edited

Loading

nineteendo Oct 17, 2024 •

edited

Loading

sevdog commented Oct 21, 2024 •

edited

Loading

picnixz commented Oct 21, 2024 •

edited

Loading

picnixz commented Oct 22, 2024 •

edited

Loading

picnixz commented Oct 22, 2024 •

edited

Loading