classification
Title: octal escapes applied inconsistently throughout the interpreter and lib
Type: Stage: patch review
Components: Regular Expressions Versions: Python 3.9
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jeffrey.Kintscher, bup, ezio.melotti, mrabarnett, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2019-06-21 19:46 by bup, last changed 2019-07-09 03:13 by Jeffrey.Kintscher.

Pull Requests
URL Status Linked Edit
PR 14654 open Jeffrey.Kintscher, 2019-07-09 02:21
Messages (5)
msg346246 - (view) Author: Dan Snider (bup) * Date: 2019-06-21 19:46
At present, the bytecode compiler can generate 512 different unicode characters, one for each integral from the range [0-511), 512 being the total number of syntactically valid permutations of 3 octal digits preceded by a backslash. However, this does not match the regex compiler, which raises an error regardless of the input type when it encounters an an octal escape character with a decimal value greater than 255. On the other hand... the bytes literal:

>>> b'\407'

is somehow valid, and can lead to extremely difficult bugs to track down, such as this nonsense:

>>> re.compile(b'\407').search(b'\a')
<re.Match object; span=(0, 1), match=b'\x07'>

I propose that the regex parser be augmented, enabling for unicode patterns the interpretation of three character octal escapes from the range(256, 512), while the bytecode parser be adjusted to match the behavior of the regex parser, raising an error for bytes literals > b"\400", rather than truncating the 9th bit.
msg346865 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-06-28 21:04
https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
says, for \ooo, "In a bytes literal, hexadecimal and octal escapes denote the byte with the given value. In a string literal, these escapes denote a Unicode character with the given value."

I agree that sometimes truncating an invalid integer instead of always raising ValueError is strange.

>>> ord(b'\407')
7
>>> bytes((0o407,))
...
ValueError: bytes must be in range(0, 256)

I don't know is there was an intentional back-compatibility reason for this.

Without an example of re raising, I don't understand the re complaint.
msg347261 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-07-04 08:07
>>> b'\407'
b'\x07'
>>> ord(b'\407')
7

This is the object structure passed to builtin_ord():

(lldb) p *((PyBytesObject *)(c))
(PyBytesObject) $19 = {
  ob_base = {
    ob_base = {
      ob_refcnt = 4
      ob_type = 0x00000001003cb0b0
    }
    ob_size = 1
  }
  ob_shash = 8685212186264880044
  ob_sval = {
    [0] = '\a'
  }
}

If two bytes were stored (0x107), I would expect ob_sval[0] to be 7 ('\a') and ob_sval[1] to be 1 on a little endian system, but ob_sval[1] is 0:

(lldb) p (long)(unsigned char) (((PyBytesObject *)(c))->ob_sval[0])
(long) $23 = 7
(lldb) p (long)(unsigned char) (((PyBytesObject *)(c))->ob_sval[1])
(long) $24 = 0

This means the truncation to a single byte is happening when the byte string object is created.
msg347411 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-07-06 00:53
Here is the problematic code in _PyBytes_DecodeEscape in Objects/bytesobject.c:

            c = s[-1] - '0';
            if (s < end && '0' <= *s && *s <= '7') {
                c = (c<<3) + *s++ - '0';
                if (s < end && '0' <= *s && *s <= '7')
                    c = (c<<3) + *s++ - '0';
            }
            *p++ = c;

c is an int, and p is a char pointer to the new bytes object's string buffer.  For b'\407', c gets correctly calculated as 263 (0x107), but the upper bits are lost when it gets recast as a char and stored in the location pointed to by p.  Hence, b'\407' becomes b'\x07' when the object is created.

IMO, this should raise "ValueError: bytes must be in range(0, 256)" instead of silently throwing away the upper bits.  I will work on a PR.

I also took a look at how escaped hex values are handled by the same function.  It may seem at first glance that

>>> b'\x107'
b'\x107'

is returning the hex value 0x107, but in reality it is returning '\x10' as the first character and '7' as the second character.  While visually misleading, it is syntactically and semantically correct.
msg347516 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-07-09 03:13
The PR is ready for review.  It raises ValueError if the escaped octal value in a byte string is greater than 255.
History
Date User Action Args
2019-07-09 03:13:18Jeffrey.Kintschersetmessages: + msg347516
2019-07-09 02:21:57Jeffrey.Kintschersetkeywords: + patch
stage: patch review
pull_requests: + pull_request14461
2019-07-06 00:53:00Jeffrey.Kintschersetmessages: + msg347411
2019-07-04 08:07:50Jeffrey.Kintschersetnosy: + Jeffrey.Kintscher
messages: + msg347261
2019-06-28 21:04:31terry.reedysetnosy: + terry.reedy
messages: + msg346865
2019-06-21 19:49:26SilentGhostsetnosy: + ezio.melotti, serhiy.storchaka, mrabarnett

components: + Regular Expressions
versions: + Python 3.9
2019-06-21 19:46:08bupcreate