Message 410337 - Python tracker


Author	bup
Recipients	bup
Date	2022-01-11.22:10:11
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1641939012.19.0.13776155937.issue46350@roundup.psfhosted.org>
In-reply-to

Content

The docs use the phrase "unknown escapes of ASCII letters are reserved for future use and treated as errors". That seems ambiguous enough to question why "\x", "\u", "\U", and "\N{}" escapes aren't expanded in the template parameter like they are in patterns. 
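To make the asymmetry concrete, here is a minimal sketch comparing the same escapes in a pattern and in a replacement template. The helper name `template_error` is hypothetical and exists only for this illustration; it assumes a Python version (3.7 or later) where unknown ASCII-letter escapes in templates raise `re.error`.

```python
import re

# In a pattern, the numeric escape is expanded: r"\x41" matches "A".
assert re.search(r"\x41", "A") is not None


def template_error(template):
    """Hypothetical helper: return the re.error raised when `template`
    is used as a replacement, or None if the template is accepted."""
    try:
        re.sub("a", template, "a")
    except re.error as exc:
        return exc
    return None


# The same escapes are rejected when they appear in a replacement template.
for template in (r"\x41", r"\u0041", r"\U00000041", r"\N{LATIN CAPITAL LETTER A}"):
    print(template, "->", template_error(template))
```

Escapes such as `\n` are still accepted in templates, so the rejection applies only to the letter escapes the docs call "unknown".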

Since I didn't get a response to the security report I submitted a few weeks ago about \N{} escapes, I'm cautiously assuming it's safe to bring it up here that the "unicode-escape" encoding, re, and probably everything else that uses it ignore two obvious clues that a name lookup will fail: length and the presence of invalid characters. I didn't look very hard for a definite length cap in the spec, but 255 seems more than sufficient, given that the longest name at present has 82 characters. Even something as absurd as 65535 would be preferable to the current implementations, which will keep going to the end of the input, as in:

    >>> r"\N{%s}" % ("\ufb03"*2**30)

searching for a terminating "}" and still perform a lookup of the 2**30-character name.
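A rough sketch of the kind of pre-check suggested above, written in pure Python for illustration only. The names `MAX_NAME_LENGTH`, `plausible_name`, and `guarded_lookup` are hypothetical, the cap of 255 is just the figure floated in this message, and the allowed character set assumes that names consist of ASCII letters, digits, spaces, and hyphens. This is not how CPython implements the lookup.

```python
import string
import unicodedata

# Assumed cap, taken from the suggestion in this message, not from any spec.
MAX_NAME_LENGTH = 255

# Assumed alphabet for character names (lookup is case-insensitive,
# so lowercase letters are allowed as well).
_NAME_CHARS = frozenset(string.ascii_letters + string.digits + " -")


def plausible_name(name, max_length=MAX_NAME_LENGTH):
    """Cheap rejection test: the length is checked first, so an oversized
    name is never scanned character by character."""
    return 0 < len(name) <= max_length and _NAME_CHARS.issuperset(name)


def guarded_lookup(name):
    """Look up a character by name only if the name could possibly exist."""
    if not plausible_name(name):
        raise KeyError("implausible character name (length %d)" % len(name))
    return unicodedata.lookup(name)


print(guarded_lookup("LATIN SMALL LIGATURE FFI") == "\ufb03")
print(plausible_name("\ufb03" * 1000))
```

With a check like this, a name made of a non-ASCII character repeated many times is rejected on length or on its first invalid character, without a full lookup.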

Another tangentially related "bug" (which probably deserves its own issue) is the inconsistency between group names and standard Python identifiers. The following example shows how the Python compiler normalizes (NFKC) the ligature 'ﬃ' in source code to the ASCII string "ffi", while re merely checks whether the name could be converted to an identifier:

    >>> import re
    >>> ﬃ = re.search("(?P<ﬃ>.)", "xxx")
    >>> ffi.groupdict()
    {'ﬃ': 'x'}
    >>> "\ufb03" in vars(), "\ufb03" in _
    (False, True)
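The same mismatch can be shown without relying on the interactive `_` variable. The sketch below assumes an interpreter where re accepts this group name for str patterns, as it did when this message was written; the variable names are illustrative only.

```python
import re
import unicodedata

LIGATURE = "\ufb03"  # LATIN SMALL LIGATURE FFI

# The compiler applies NFKC normalization to identifiers in source code.
nfkc = unicodedata.normalize("NFKC", LIGATURE)

# Compiling an assignment to the ligature binds the normalized name instead.
namespace = {}
exec(LIGATURE + " = 1", namespace)
compiled_names = [name for name in namespace if name != "__builtins__"]

# re only checks that the group name passes str.isidentifier().
match = re.search("(?P<" + LIGATURE + ">.)", "xxx")
group_names = list(match.groupdict())

print("NFKC form:", ascii(nfkc))
print("names bound by the compiler:", [ascii(n) for n in compiled_names])
print("isidentifier():", LIGATURE.isidentifier())
print("group names kept by re:", [ascii(n) for n in group_names])
```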

History
Date	User	Action	Args
2022-01-11 22:10:12	bup	set	recipients: + bup
2022-01-11 22:10:12	bup	set	messageid: <1641939012.19.0.13776155937.issue46350@roundup.psfhosted.org>
2022-01-11 22:10:12	bup	link	issue46350 messages
2022-01-11 22:10:11	bup	create