gh-125651: Fix UUID hex parsing with underscores #125652
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Lib/uuid.py
Outdated
@@ -176,7 +176,7 @@ def __init__(self, hex=None, bytes=None, bytes_le=None, fields=None,
                             'or int arguments must be given')
         if hex is not None:
             hex = hex.replace('urn:', '').replace('uuid:', '')
-            hex = hex.strip('{}').replace('-', '')
+            hex = hex.strip("{}").replace("-", "").replace("_", "")
             if len(hex) != 32:
Shouldn't we also check for a 0x prefix like the json module?
Line 61 in f203d1c:
if len(esc) == 4 and esc[1] not in 'xX':
-if len(hex) != 32:
+if len(hex) != 32 or hex[0] == "+" or hex[1] in "xX":
Edit 1: And a plus sign?
Edit 2: The number can be surrounded by whitespace too...
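For context, the forms mentioned in this comment all slip through because the length check is followed by a call to int with base 16, which is more lenient than a UUID parser should be. A minimal sketch of that behaviour in plain Python (not the uuid code itself; the `candidates` dictionary and its labels are illustrative only):

```python
# Each string below is exactly 32 characters long, so a bare
# len(hex) != 32 check would not reject it before int() is called.
candidates = {
    "underscore": "123_4567812345678123456781234567",
    "0x prefix": "0x" + "1" * 30,
    "plus sign": "+" + "1" * 31,
    "whitespace": " " + "1" * 30 + " ",
}

for label, text in candidates.items():
    assert len(text) == 32
    # int() with base 16 accepts all of these forms without raising.
    value = int(text, 16)
    print(f"{label}: parsed as {value:#x}")
```

None of these inputs is a well-formed UUID hex string, which is why the thread moves toward validating the characters explicitly.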
Good catch, those are not valid UUIDs either.
However, I think it would be better to lstrip those elements, to allow providing a UUID like 0x12345678123456781234567812345678 or +12345678123456781234567812345678, but now that I have written it down it feels strange to me.
What do you think @nineteendo?
I would reject strings that don't match this regex: r'[0-9A-Fa-f]{32}'. Supporting them is a feature that needs to be discussed.
-if len(hex) != 32:
+if len(hex) != 32 or not re.match(r'[0-9A-Fa-f]{32}', hex):
I will use this check (and remove the underscore strip on the line above): the length check easily catches other errors and is faster than a regex; for everything else, the regex catches any unwanted character before the string is passed to int.
You shouldn't worry about performance in case of an error:
-if len(hex) != 32:
+if not re.fullmatch(r'[0-9A-Fa-f]{32}', hex):
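The difference between the two suggestions can be illustrated with a small sketch. It assumes only the standard re module and the pattern quoted in the review; the names `HEX32`, `too_long` and `trailing_newline` are made up for the example.

```python
import re

HEX32 = r'[0-9A-Fa-f]{32}'  # pattern quoted in the review

too_long = 'a' * 33
trailing_newline = 'a' * 32 + '\n'

# re.match only anchors at the start, so a valid 32-character prefix is
# enough; that is why the earlier suggestion kept the length check too.
assert re.match(HEX32, too_long) is not None
assert re.match(HEX32, trailing_newline) is not None

# re.fullmatch requires the entire string to match, so the separate
# length check becomes redundant.
assert re.fullmatch(HEX32, too_long) is None
assert re.fullmatch(HEX32, trailing_newline) is None
assert re.fullmatch(HEX32, 'a' * 32) is not None
```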
Can you sign the Contributor License Agreement above?
Signed, I was waiting for approval regarding the bug/solution before signing. Now I will update the code and tests with what we discussed above and then push them.
Lib/test/test_uuid.py
Outdated
@@ -232,9 +232,17 @@ def test_exceptions(self):
         # Badly formed hex strings.
         badvalue(lambda: self.uuid.UUID(''))
         badvalue(lambda: self.uuid.UUID('abc'))
+        badvalue(lambda: self.uuid.UUID('123_4567812345678123456781234567'))
+        badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))
+        badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))
Duplicate:
badvalue(lambda: self.uuid.UUID('123_4567812345678123456781_23456'))
Can you add a news entry: https://blurb-it.herokuapp.com/add_blurb.
OK, I added a simple news entry using the blurb_it site. Tell me if it is fine.
@@ -0,0 +1 @@
+Fix parsing of HEX encoded UUID string
-Fix parsing of HEX encoded UUID string
+Fix HEX parsing of :class:`uuid.UUID`.
Can you also add a test with unicode digits:
>>> int("\uff10", 16)
0
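To spell out why this case matters: int accepts any Unicode decimal digit, so a string of fullwidth digits passes both a length check and the int conversion, while an explicit ASCII character class rejects it. A minimal sketch, where the name `fullwidth_zeros` is illustrative and the regex is the one proposed earlier in the thread:

```python
import re

fullwidth_zeros = "\uff10" * 32  # 32 FULLWIDTH DIGIT ZERO characters

# The length check alone is satisfied...
assert len(fullwidth_zeros) == 32
# ...and int() happily converts the non-ASCII digits.
assert int(fullwidth_zeros, 16) == 0
assert fullwidth_zeros.isdigit()

# An explicit ASCII range does not match them.
assert re.fullmatch(r"[0-9A-Fa-f]{32}", fullwidth_zeros) is None
```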
I'm not fond of the regex import. We take care of delaying imports as much as possible (e.g., the time module) so I would prefer not to use it. If we do use it, we should compile the pattern in advance and cache the method. However, for this kind of check we can just keep the length check plus a check on the range of ord() (possibly combined with a .lower() to make it simpler). Or use string.hexdigits and a set intersection.
@picnixz to avoid imports, would you like something like these?
# set diff: if the result is not empty there are unwanted characters
len(hex) != 32 or set(hex.lower()) - set('abcdef0123456789')
# range comparison with ord
len(hex) != 32 or not all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in hex.lower())
# range comparison with characters
len(hex) != 32 or not all('0' <= h <= '9' or 'a' <= h <= 'f' for h in hex.lower())
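As a sanity check, the three expressions can be wrapped in functions and compared on a few inputs. This is only a sketch: the function names, the `text` parameter (used instead of `hex` to avoid shadowing the builtin) and the sample strings are assumptions made for the example, not part of the patch.

```python
HEX_LOWER = 'abcdef0123456789'

def bad_by_set(text):
    # Set difference: a non-empty result means unwanted characters.
    return len(text) != 32 or bool(set(text.lower()) - set(HEX_LOWER))

def bad_by_ord(text):
    # Range comparison on code points ('0'-'9' is 48-57, 'a'-'f' is 97-102).
    return len(text) != 32 or not all(
        48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in text.lower())

def bad_by_char(text):
    # Range comparison on the characters themselves.
    return len(text) != 32 or not all(
        '0' <= h <= '9' or 'a' <= h <= 'f' for h in text.lower())

samples = {
    'ABCDEF0123456789abcdef0123456789': False,  # well-formed
    '123_4567812345678123456781234567': True,   # underscore
    '\uff10' * 32: True,                        # fullwidth digits
    'abc': True,                                # wrong length
}

for text, expected in samples.items():
    assert bad_by_set(text) == bad_by_ord(text) == bad_by_char(text) == expected
```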
Yes, something like this. I'm not sure which one is the most efficient, nor whether this warrants this kind of alternative, but I feel that we shouldn't import the re module (or we should at least delay its import). I think the 3rd option is the best one. We could also skip the .lower() and add 'A' to 'F' checks (probably faster), but remember that we don't care about performance in case of a failure. (The first is the clearest though, IMO.) I'd like a core dev opinion on that matter though.
In the meantime I tried a
Set difference on my host (with Python 3.10) seems to be the fastest one (about 50% less time than the others), but better benchmarking could be performed. I will wait for the opinion of a core developer before proceeding with further changes and tests.
Set is probably the fastest and cleanest. You could save a few keystrokes using a literal set construction for the rhs though, and check if the .lower() call is better than introducing the capital letters in the set. I'll do more benchmarks today.
Another way to get a small improvement could be to hold the set of HEX characters in a variable. Regarding the time difference between set difference with and without lower in the benchmark: it is minimal, and the two values overlap if you consider error margins.
😇
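Holding the character set in a variable could look roughly like the following. This is a sketch only: `_HEX_CHARS` and `is_hex32` are hypothetical names, and nothing here is taken from the actual patch.

```python
# Hypothetical module-level constant, built once instead of on every call.
# Including the capital letters avoids the .lower() call.
_HEX_CHARS = frozenset('0123456789abcdefABCDEF')

def is_hex32(text):
    """Return True if text is exactly 32 ASCII hexadecimal characters."""
    # issuperset() accepts any iterable, so the string can be passed directly.
    return len(text) == 32 and _HEX_CHARS.issuperset(text)

assert is_hex32('12345678123456781234567812345678')
assert is_hex32('ABCDEFabcdef' + '0' * 20)
assert not is_hex32('123_4567812345678123456781234567')
assert not is_hex32('\uff10' * 32)
```

Whether this is measurably faster than rebuilding the set on each call is a question for the benchmarks in this thread, not something the sketch demonstrates.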
TL;DR: the fastest is: using .lower() and Here are the benchmarks (PGO build):
Here is the script I used:
import pyperf
import random
import string

random.seed(1234)

PATTERNS = [
    'a' * 32,
    'A' * 32,
    ''.join(random.choices('abcdef', k=32)),
    ''.join(random.choices('abcdefABCDEF', k=32)),
    ''.join(random.choices('ABCDEF', k=32)),
    ''.join(random.choices(string.digits, k=32)),
    ''.join(random.choices(string.digits + 'abcdef', k=32)),
    ''.join(random.choices(string.digits + 'ABCDDEF', k=32)),
    ''.join(random.choices(string.hexdigits, k=32)),
]

def cs_set(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = set(val) <= set('0123456789abcdefABCDEF')
    return pyperf.perf_counter() - t0

def ci_set(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = set(val.lower()) <= set('0123456789abcdef')
    return pyperf.perf_counter() - t0

def cs_all_ord(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all(48 <= ord(h) <= 57 or 65 <= ord(h) <= 70 or 97 <= ord(h) <= 102 for h in val)
    return pyperf.perf_counter() - t0

def ci_all_ord(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all(48 <= ord(h) <= 57 or 97 <= ord(h) <= 102 for h in val.lower())
    return pyperf.perf_counter() - t0

def cs_all(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all('0' <= h <= '9' or 'a' <= h <= 'f' or 'A' <= h <= 'F' for h in val)
    return pyperf.perf_counter() - t0

def ci_all(loops, val):
    __ = range(loops)
    t0 = pyperf.perf_counter()
    for _ in __:
        ___ = all('0' <= h <= '9' or 'a' <= h <= 'f' for h in val.lower())
    return pyperf.perf_counter() - t0

def bench(runner, func):
    for pattern in PATTERNS:
        runner.bench_time_func(pattern, func, pattern)

def add_cmdline_args(cmd, args):
    cmd.append(args.impl)

if __name__ == '__main__':
    runner = pyperf.Runner(add_cmdline_args=add_cmdline_args)
    runner.argparser.add_argument('impl', choices=['cs_set', 'ci_set', 'cs_all_ord', 'ci_all_ord', 'cs_all', 'ci_all'])
    args = runner.parse_args()
    bench(runner, globals()[args.impl])
Adds sanitization from "digit grouping separators" in UUID parser to comply with RFC.