
Author twouters
Recipients gregory.p.smith, lys.nikolaou, pablogsal, twouters
Date 2021-11-16.19:45:26
Message-id <1637091926.91.0.847266821878.issue45822@roundup.psfhosted.org>
In-reply-to
Content
Py_CompileString() in Python 3.9 and later, using the PEG parser, appears to no longer honour source encoding cookies. A reduced test case:

    #include "Python.h"
    #include <stdio.h>

    const char *src = (
    "# -*- coding: Latin-1 -*-\n"
    "'''\xc3'''\n");

    int main(int argc, char **argv)
    {
        Py_Initialize();
        PyObject *res = Py_CompileString(src, "some_path", Py_file_input);
        if (res) {
            fprintf(stderr, "Compile succeeded.\n");
            return 0;
        } else {
            fprintf(stderr, "Compile failed.\n");
            PyErr_Print();
            return 1;
        }
    }

Compiling and running the resulting binary with Python 3.8 (or earlier):

    % ./encoding_bug
    Compile succeeded.

With 3.9 and PYTHONOLDPARSER=1:

    % PYTHONOLDPARSER=1 ./encoding_bug
    Compile succeeded.

With 3.9 (without the env var) or 3.10:

    % ./encoding_bug
    Compile failed.
      File "some_path", line 2
        '''�'''
             ^
    SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data
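
The error text is consistent with the literal's single byte being decoded as UTF-8 instead of the Latin-1 named in the cookie. The following Python sketch only illustrates that reading of the message; it assumes the literal's body reaches the decoder as the raw byte 0xC3 and says nothing about where in the parser this happens:

```python
# Assumption: the string literal's body reaches the decoder as the raw byte 0xC3.
raw = b"\xc3"

# Decoded per the coding cookie (Latin-1), the byte is a valid character.
as_latin1 = raw.decode("latin-1")

# Decoded as UTF-8, 0xC3 opens a two-byte sequence that never completes.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = str(exc)

print(repr(as_latin1))
print(utf8_error)
```

The UTF-8 failure reports the same byte, position and reason ("unexpected end of data") as the SyntaxError above.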

Writing the same bytes to a file and making python3.9 or python3.10 import them works fine, as does passing the bytes to compile():

    Python 3.10.0+ (heads/3.10-dirty:7bac598819, Nov 16 2021, 20:35:12) [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> b = open('encoding_bug.py', 'rb').read()
    >>> b
    b"# -*- coding: Latin-1 -*-\n'''\xc3'''\n"
    >>> import encoding_bug
    >>> encoding_bug.__doc__
    'Ã'
    >>> co = compile(b, 'some_path', 'exec')
    >>> co
    <code object <module> at 0x7f447e1b0c90, file "some_path", line 1>
    >>> co.co_consts[0]
    'Ã'
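
The same check can be run as a standalone script without a file on disk. This is a sketch of the expected behaviour, not a diagnosis: it uses tokenize.detect_encoding() to read the PEP 263 cookie and compile() on the raw bytes, and the names `namespace` and `docstring` are only illustrative:

```python
import io
import tokenize

src = b"# -*- coding: Latin-1 -*-\n'''\xc3'''\n"

# detect_encoding() reads the PEP 263 cookie from the first two lines
# and returns the normalized codec name.
encoding, _ = tokenize.detect_encoding(io.BytesIO(src).readline)

# compile() accepts bytes and honours the cookie; executing the code object
# binds the module docstring to __doc__ in the namespace.
namespace = {}
exec(compile(src, "some_path", "exec"), namespace)
docstring = namespace["__doc__"]

print(encoding, repr(docstring))
```

Both paths treat the source as Latin-1, which is the behaviour Py_CompileString() would be expected to match.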


It's just Py_CompileString() that fails. I don't understand why, and I do believe it's a regression.
History
Date                 User      Action  Args
2021-11-16 19:45:26  twouters  set     recipients: + twouters, gregory.p.smith, lys.nikolaou, pablogsal
2021-11-16 19:45:26  twouters  set     messageid: <1637091926.91.0.847266821878.issue45822@roundup.psfhosted.org>
2021-11-16 19:45:26  twouters  link    issue45822 messages
2021-11-16 19:45:26  twouters  create