Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-125651: Fix UUID hex parsing with underscores #125652

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

sevdog
Copy link

@sevdog sevdog commented Oct 17, 2024

Adds sanitization from "digit grouping separators" in UUID parser to comply with RFC.

Copy link

cpython-cla-bot bot commented Oct 17, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app
Copy link

bedevere-app bot commented Oct 17, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Lib/uuid.py Outdated
@@ -176,7 +176,7 @@ def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None,
'or int arguments must be given')
if hex is not None:
hex = hex.replace('urn:', '').replace('uuid:', '')
hex = hex.strip('{}').replace('-', '')
hex = hex.strip("{}").replace("-", "").replace("_", "")
if len(hex) != 32:
Copy link
Contributor

@nineteendo nineteendo Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also check for a 0x prefix like the json module?

if len(esc) == 4 and esc[1] not in 'xX':

Suggested change
if len(hex) != 32:
if len(hex) != 32 or hex[0] == "+" or hex[1] in "xX":

Edit 1: And a plus sign?
Edit 2: The number can be surrounded by whitespace too...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, also those are not valid UUID.

However I think that it would be better to lstrip those elements, to allow providing a UUID like 0x12345678123456781234567812345678 or +12345678123456781234567812345678, but now that I wrote it down it feels strange to me.

What do you thing @nineteendo?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would reject strings that don't match this regex: r'[0-9A-Fa-f]{32}', supporting them is a feature that needs to be discussed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(hex) != 32:
if len(hex) != 32 or not re.match(r'[0-9A-Fa-f]{32}', hex)

I will use this check (and remove the underscore strip on the above line): the length check will easy get other errors and is faster that a regex, for everything else the regex matches any unwanted character before passing to int.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't worry about performance in case of an error:

Suggested change
if len(hex) != 32:
if not re.fullmatch(r'[0-9A-Fa-f]{32}', hex):

@nineteendo
Copy link
Contributor

Can you sign the Contributor License Agreement above?

@sevdog
Copy link
Author

sevdog commented Oct 18, 2024

Signed, I was waiting for approval regarding the bug/solution before signing.

Now I will update code and tests with what we discussed above and then I will push them.

@bedevere-app
Copy link

bedevere-app bot commented Oct 18, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@@ -232,9 +232,17 @@ def test_exceptions(self):
# Badly formed hex strings.
badvalue(lambda: self.uuid.UUID(''))
badvalue(lambda: self.uuid.UUID('abc'))
badvalue(lambda: self.uuid.UUID('123_4567812345678123456781234567'))
badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))
badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate:

Suggested change
badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))

@bedevere-app
Copy link

bedevere-app bot commented Oct 18, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@nineteendo
Copy link
Contributor

Can you add a news entry: https://blurb-it.herokuapp.com/add_blurb.

@sevdog
Copy link
Author

sevdog commented Oct 18, 2024

Ok, I added a simple news using the blurb_it site. Tell me if it is fine.

@@ -0,0 +1 @@
Fix parsing of HEX encoded UUID string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Fix parsing of HEX encoded UUID string
Fix HEX parsing of :class:`uuid.UUID`.

@nineteendo
Copy link
Contributor

Can you also add a test with unicode digits:

>>> int("\uff10", 16)
0

@picnixz
Copy link
Contributor

picnixz commented Oct 21, 2024

I'm not fond of the regex import. We take care of delaying imports as much as possible (e.g., the time module) so I would prefer not to use it.

If possible we should compile the pattern in advance + cache the method. However for this kind of check we can just keep the length check + a check on the range of ord() (possibly combined with a .lower() to make it simpler). Or use strings.hexdigits and set intersection.

@sevdog
Copy link
Author

sevdog commented Oct 21, 2024

@picnixz to avoid imports you would like something like these

# set diff: if the result is not empty there are unwanted characters
len(hex) != 32 or set(hex.lower()) - set('abcdef0123456789')
# range comparison with ord
len(hex) != 32 or not all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in hex.lower())
# range comparison with characters 
len(hex) != 32 or not all('0' <= h <= '9' or 'a' <= h <= 'f' for h in hex.lower())

@picnixz
Copy link
Contributor

picnixz commented Oct 21, 2024

Yes, something like this. I'm not sure which one is the most efficient nor whether this warrants this kind of alternative but I feel that we shouldn't import the re module (or at least delay its import). I think the 3rd option is the best one. Or we could also skip the .lower() and add 'A' to 'F' checks (probably faster) but remember that we don't care about performances in case of a failure. (The first is the clearest though IMO)

I'd like a core dev opinion on that matter though.

@sevdog
Copy link
Author

sevdog commented Oct 22, 2024

In the meantime i tried a timeit (with IPython) on the abovementioned solutions

In [6]: hex = '0123456781234567812345678z2345_6'

In [7]: %%timeit
   ...: len(hex) != 32 or set(hex.lower()) - set('abcdef0123456789')
926 ns ± 39.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [8]: %%timeit
   ...: len(hex) != 32 or not all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in hex.lower())
2.53 μs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [9]: %%timeit
   ...: len(hex) != 32 or not all('0' <= h <= '9' or 'a' <= h <= 'f' for h in hex.lower())
2.26 μs ± 36.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [10]: %%timeit
    ...: len(hex) != 32 or not all('0' <= h <= '9' or 'a' <= h <= 'f' or 'A' <= h <= 'F' for h in hex)
2.25 μs ± 26.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [11]: %%timeit
    ...: len(hex) != 32 or set(hex) - set('abcdefABCDEF0123456789')
1.06 μs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Set difference on my host (with Python 3.10) seems to be the fastest one (about 50% less time than the others), but a better benchmarking may performed.

I will wait for the opinion of a core developer before proceeding with further changes and tests.

@picnixz
Copy link
Contributor

picnixz commented Oct 22, 2024

Set is probably the fastest and cleanest. You could save a few keystrokes using a literal set construction for the rhs though and check if the .lower() call is better than introducing the capital letters in the set.

I'll do more benchmarks today

@sevdog
Copy link
Author

sevdog commented Oct 22, 2024

An other way to give a small improvements could be to hold the set of HEX characters in a variable.

Regarding the time difference between set difference with/without lower in the benchamrk: it is minimal and the two values overlaps if you consider error margins.
Also the test HEX string is only digits, a better benchmark should use also alfa-hex characters and maybe considering the case in which:

  • every A-F character is uppercase
  • every A-F is lowercase
  • there are more uppercase than lowercase
  • there are more lowercase than uppercase

😇

@picnixz
Copy link
Contributor

picnixz commented Oct 22, 2024

TL;DR: the fastest is: using .lower() and set.

Here are the benchmarks (PGO build):

+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| Benchmark                        | uuid-ci_all | uuid-ci_all_ord       | uuid-ci_set          | uuid-cs_all           | uuid-cs_all_ord       | uuid-cs_set          |
+==================================+=============+=======================+======================+=======================+=======================+======================+
| aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | 1.55 us     | 1.41 us: 1.10x faster | 299 ns: 5.18x faster | 1.60 us: 1.03x slower | 1.91 us: 1.23x slower | 394 ns: 3.93x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA | 1.55 us     | 1.42 us: 1.09x faster | 304 ns: 5.10x faster | 2.03 us: 1.31x slower | 1.49 us: 1.04x faster | 398 ns: 3.90x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| fcaffdeaebaecddabaacfadcdadbddcc | 1.73 us     | 1.54 us: 1.13x faster | 339 ns: 5.11x faster | 1.76 us: 1.01x slower | 2.11 us: 1.22x slower | 434 ns: 3.99x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| cFfAaAabfeaBbAbFDFbdadDcbFbEaeee | 1.71 us     | 1.53 us: 1.12x faster | 339 ns: 5.04x faster | 2.09 us: 1.22x slower | 1.97 us: 1.15x slower | 436 ns: 3.91x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| ECECEFCEEBABCADECEDDCEEAFBAFEECA | 1.68 us     | 1.55 us: 1.08x faster | 343 ns: 4.90x faster | 2.22 us: 1.32x slower | 1.56 us: 1.08x faster | 434 ns: 3.88x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 30205096390204237858385715385475 | 1.01 us     | 936 ns: 1.08x faster  | 356 ns: 2.85x faster | 999 ns: 1.02x faster  | 912 ns: 1.11x faster  | 439 ns: 2.31x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 529ed150aac236b3d6df8e6add146d41 | 1.36 us     | 1.21 us: 1.13x faster | 366 ns: 3.72x faster | 1.38 us: 1.01x slower | 1.42 us: 1.04x slower | 465 ns: 2.93x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 63C90D7D25167B86846A38E7BBDCB16D | 1.32 us     | 1.23 us: 1.08x faster | 363 ns: 3.64x faster | 1.46 us: 1.10x slower | 1.22 us: 1.08x faster | 456 ns: 2.90x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| 13DcFAF6f7E64B6bd7775AccC5300FFe | 1.46 us     | 1.25 us: 1.17x faster | 366 ns: 3.99x faster | 1.61 us: 1.10x slower | 1.34 us: 1.09x faster | 469 ns: 3.11x faster |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+
| Geometric mean                   | (ref)       | 1.11x faster          | 4.31x faster         | 1.12x slower          | 1.02x slower          | 3.37x faster         |
+----------------------------------+-------------+-----------------------+----------------------+-----------------------+-----------------------+----------------------+

Here is the script I used:

import pyperf

import random
import string

random.seed(1234)

PATTERNS = [
    'a' * 32,
    'A' * 32,
    ''.join(random.choices('abcdef', k=32)),
    ''.join(random.choices('abcdefABCDEF', k=32)),
    ''.join(random.choices('ABCDEF', k=32)),
    ''.join(random.choices(string.digits, k=32)),
    ''.join(random.choices(string.digits + 'abcdef', k=32)),
    ''.join(random.choices(string.digits + 'ABCDDEF', k=32)),
    ''.join(random.choices(string.hexdigits, k=32)),
]

def cs_set(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = set(val) <= set('0123456789abcdefABCDEF')
    return pyperf.perf_counter() - t0

def ci_set(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = set(val.lower()) <= set('0123456789abcdef')
    return pyperf.perf_counter() - t0

def cs_all_ord(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all(48 <= ord(h) <= 57 or 65 <= ord(h) <= 70 or 97 <= ord(h) <= 102 for h in val)
    return pyperf.perf_counter() - t0

def ci_all_ord(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in val.lower())
    return pyperf.perf_counter() - t0

def cs_all(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all('0' <= h <= '9' or 'a' <= h <= 'f' or 'A' <= h <= 'F' for h in val)
    return pyperf.perf_counter() - t0

def ci_all(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all('0' <= h <= '9' or 'a' <= h <= 'f' for h in val.lower())
    return pyperf.perf_counter() - t0

def bench(runner, func):
    for pattern in PATTERNS:
        runner.bench_time_func(pattern, func, pattern)

def add_cmdline_args(cmd, args):
    cmd.append(args.impl)

if __name__ == '__main__':
    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
    runner.argparser.add_argument('impl', choices=['cs_set', 'ci_set', 'cs_all_ord', 'ci_all_ord', 'cs_all', 'ci_all'])
    args = runner.parse_args()
    bench(runner, globals()[args.impl])
</script>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants