Message 347411 - Python tracker

Author	Jeffrey.Kintscher
Recipients	Jeffrey.Kintscher, bup, ezio.melotti, mrabarnett, serhiy.storchaka, terry.reedy
Date	2019-07-06.00:53:00
Message-id	<1562374380.59.0.512587842456.issue37367@roundup.psfhosted.org>
In-reply-to

Content
Here is the problematic code in _PyBytes_DecodeEscape in Objects/bytesobject.c:

            c = s[-1] - '0';
            if (s < end && '0' <= *s && *s <= '7') {
                c = (c<<3) + *s++ - '0';
                if (s < end && '0' <= *s && *s <= '7')
                    c = (c<<3) + *s++ - '0';
            }
            *p++ = c;

c is an int, and p is a char pointer into the new bytes object's string buffer.  For b'\407', c is correctly calculated as 263 (0x107), but the bits above the low eight are lost when the value is converted to a char and stored in the location pointed to by p.  Hence, b'\407' becomes b'\x07' when the object is created.
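
The arithmetic can be checked with a small Python model of the C loop.  This is only a sketch: decode_octal_escape is a hypothetical helper, not part of CPython, and the `& 0xFF` mask stands in for the int-to-char store, assuming the usual 8-bit char.

```python
def decode_octal_escape(digits: str) -> tuple[int, int]:
    """Model the octal branch of the C code for up to three digits.

    Returns (value accumulated in the int, value left after a one-byte store).
    Hypothetical helper for illustration only.
    """
    c = 0
    for d in digits[:3]:
        if not "0" <= d <= "7":
            break
        # Same step as the C code: c = (c << 3) + digit
        c = (c << 3) + (ord(d) - ord("0"))
    # Assumption: storing into a char keeps only the low 8 bits.
    return c, c & 0xFF


full, stored = decode_octal_escape("407")
print(full, hex(full), stored)  # 263 0x107 7
```

An escape such as \377 fits in one byte and passes through unchanged; only values above 0o377 are affected.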

IMO, this should raise "ValueError: bytes must be in range(0, 256)" instead of silently throwing away the upper bits.  I will work on a PR.
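
As a rough illustration of the proposed behavior (not the eventual PR, and written in Python rather than C), a range check before the store might look like the following.  checked_octal_escape is a hypothetical name used only for this sketch.

```python
def checked_octal_escape(digits: str) -> int:
    """Decode up to three octal digits, rejecting values that do not fit a byte.

    Hypothetical helper sketching the proposed check; not CPython code.
    """
    c = 0
    for d in digits[:3]:
        if not "0" <= d <= "7":
            break
        c = (c << 3) + (ord(d) - ord("0"))
    if c > 0xFF:
        # Raise instead of silently dropping the upper bits.
        raise ValueError("bytes must be in range(0, 256)")
    return c


print(checked_octal_escape("377"))  # 255
try:
    checked_octal_escape("407")
except ValueError as exc:
    print(exc)  # bytes must be in range(0, 256)
```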

I also took a look at how escaped hex values are handled by the same function.  It may seem at first glance that

>>> b'\x107'
b'\x107'

is returning the hex value 0x107, but in reality it is returning '\x10' as the first character and '7' as the second character.  While visually misleading, it is syntactically and semantically correct.
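
A quick way to confirm this is to inspect the individual bytes; a hex escape always consumes exactly two hex digits, so the trailing 7 is a separate literal character.

```python
data = b'\x107'

print(len(data))    # 2
print(list(data))   # [16, 55]
print(data == bytes([0x10, ord('7')]))  # True
```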

History
Date	User	Action	Args
2019-07-06 00:53:00	Jeffrey.Kintscher	set	recipients: + Jeffrey.Kintscher, terry.reedy, ezio.melotti, mrabarnett, serhiy.storchaka, bup
2019-07-06 00:53:00	Jeffrey.Kintscher	set	messageid: <1562374380.59.0.512587842456.issue37367@roundup.psfhosted.org>
2019-07-06 00:53:00	Jeffrey.Kintscher	link	issue37367 messages
2019-07-06 00:53:00	Jeffrey.Kintscher	create