
compile() doesn't ignore the source encoding when a string is passed in #48876

Closed
brettcannon opened this issue Dec 11, 2008 · 29 comments

Assignee: benjaminp
Labels: interpreter-core (Objects, Python, Grammar, and Parser dirs), type-bug (An unexpected behavior, bug, or error)

Comments

@brettcannon
Member

BPO 4626
Nosy @brettcannon, @amauryfa, @vstinner, @benjaminp
Files
  • tokenizer_ignore_cookie-4.patch
  • tok_ignore_cookie-5.patch
  • unnamed
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/benjaminp'
    closed_at = <Date 2009-03-02.23:32:33.644>
    created_at = <Date 2008-12-11.06:19:16.441>
    labels = ['interpreter-core', 'type-bug']
    title = "compile() doesn't ignore the source encoding when a string is passed in"
    updated_at = <Date 2009-03-30.19:26:09.596>
    user = 'https://github.com/brettcannon'

    bugs.python.org fields:

    activity = <Date 2009-03-30.19:26:09.596>
    actor = 'jmfauth'
    assignee = 'benjamin.peterson'
    closed = True
    closed_date = <Date 2009-03-02.23:32:33.644>
    closer = 'benjamin.peterson'
    components = ['Interpreter Core']
    creation = <Date 2008-12-11.06:19:16.441>
    creator = 'brett.cannon'
    dependencies = []
    files = ['12889', '13087', '13411']
    hgrepos = []
    issue_num = 4626
    keywords = ['patch', 'needs review']
    message_count = 29.0
    messages = ['77590', '78534', '78917', '78930', '79102', '79103', '79104', '79992', '79997', '80579', '80613', '80635', '80791', '80804', '82102', '82393', '82615', '82793', '82815', '83047', '83048', '83069', '84093', '84125', '84153', '84154', '84159', '84160', '84623']
    nosy_count = 6.0
    nosy_names = ['brett.cannon', 'sjmachin', 'amaury.forgeotdarc', 'vstinner', 'benjamin.peterson', 'jmfauth']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4626'
    versions = ['Python 3.0', 'Python 3.1']

    @brettcannon
    Member Author

    When compile() is called with a string, it is a reasonable assumption
    that the string has already been decoded. But this is not in fact the
    case, which leads to errors when trying to use non-ASCII identifiers:

     >>> source = "# coding=latin-1\n\u00c6 = '\u00c6'"
     >>> compile(source, '<test>', 'exec')
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
       File "<test>", line 2
         � = '�'
            ^
     SyntaxError: invalid character in identifier
     >>> compile(source.encode('latin-1'), '<test>', 'exec')
     <code object <module> at 0x389cc8, file "<test>", line 2>

    @brettcannon brettcannon added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Dec 11, 2008
    @amauryfa
    Member

    bpo-4742 is a similar issue:

    >>> source = b"# coding=cp1252\n\x94 = '\x94'".decode('cp1252')
    >>> compile(source, '<test>', 'exec')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<test>", line 0
    SyntaxError: unknown encoding: cp1252

    The real error here is masked; just before the exception is set, there
    is a pending
    SyntaxError: 'charmap' codec can't decode byte 0x9d in position 18:
    character maps to <undefined>

    It seems that the source's internal representation is correct UTF-8, but
    it is decoded again using the "coding=" cookie.
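
    A minimal sketch (a reconstruction, not code from any patch) of that
    double decode: the already-decoded str is re-encoded to UTF-8
    internally, and the cookie's codec is then applied to those UTF-8
    bytes, which reproduces the pending error quoted above.

    >>> source = b"# coding=cp1252\n\x94 = '\x94'".decode('cp1252')
    >>> internal = source.encode('utf-8')   # what the tokenizer works with internally
    >>> internal.decode('cp1252')           # the cookie's codec applied a second time
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 18: character maps to <undefined>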

    @brettcannon
    Member Author

    Here is what I have found out so far.
    Python/bltinmodule.c:builtin_compile takes in a PyObject, gets the
    char * representation of that object, and passes it to
    Python/pythonrun.c:Py_CompileStringFlags. Unfortunately, no other
    information is passed along in the call, including what the encoding
    happens to be, even though builtin_compile makes sure that the char*
    data is encoded using the default encoding before calling
    Py_CompileStringFlags.

    I just tried setting a PyCF flag to denote that the char* data is
    encoded using the default encoding, but Parser/tokenizer.c is not
    compiled against unicodeobject.c and thus one cannot use
    PyUnicode_GetDefaultEncoding() to know what the data is stored as.

    I'm going to try to explicitly convert to UTF-8 and see if that works.

    @brettcannon
    Member Author

    So explicitly converting to UTF-8 didn't work, or at least not as simply
    as I had hoped.

    @vstinner
    Member

    vstinner commented Jan 5, 2009

    The decode_str() function (Parser/tokenizer.c) is responsible for
    detecting the encoding using the BOM or the cookie ("coding: xxx").
    decode_str() also re-encodes the text to UTF-8 if the encoding is
    different from UTF-8. I think that we can just skip this function if
    the input text is already Unicode (UTF-8). The attached patch
    implements this idea.

    The patch introduces a new compiler flag (PyCF_IGNORE_COOKIE) and a
    new parser flag (PyPARSE_IGNORE_COOKIE). The new compiler flag is set
    by source_as_string() when the input is a PyUnicode object. "Ignore
    cookie" is maybe not the best name for this flag.

    With my patch, Brett's first example displays:
       $ ./python com2.py
       Traceback (most recent call last):
         File "com2.py", line 3, in <module>
           compile(source, '<test>', 'exec')
         File "<test>", line 2= ''
             ^
       SyntaxError: invalid character in identifier

    The error cursor is not at the right column (a bug related to bpo-2382,
    or introduced by my patch?).

    The patch changes the public API: the PyTokenizer_FromString() prototype
    changed to take a new argument. I don't like changing the public API. The
    new argument should be a bit vector (flags) instead of a single bit
    (ignore_cookie). We can avoid changing the public API by creating a
    new function (e.g. "PyTokenizer_FromUnicode" ;-)).

    There are some old PyPARSE_xxx constants in Include/parsetok.h that
    might be removed. PyPARSE_WITH_IS_KEYWORD has the value 3, which is
    strange since flags is a bit vector (changed with | and tested with &).
    But PyPARSE_WITH_IS_KEYWORD is a dead constant (wrapped in #if
    0...#endif).

    @vstinner
    Member

    vstinner commented Jan 5, 2009

    Oops, I attached the wrong file :-p

    @vstinner
    Member

    vstinner commented Jan 5, 2009

    > With my patch, Brett's first example displays:
    > ...
    > ” = '”'
    > ^
    > SyntaxError: invalid character in identifier
    >
    > The error cursor is not at the right column (a bug related to bpo-2382,
    > or introduced by my patch?).

    I tried py3k_adjust_cursor_at_syntax_error_v2.patch (issue bpo-2382) and the
    cursor is displayed at the right column:

       $ ./python com2.py
       Traceback (most recent call last):
         ...
         File "<test>", line 2= ''
           ^

    So it's not a new bug ;-)

    @vstinner
    Member

    New version of my patch: add a regression test.

    @brett.cannon: Could you review my patch?

    @brettcannon
    Member Author

    I will see when I can get to it (other stuff is taking priority). Not
    going to assign to myself quite yet in case someone wants to beat me to
    this. I won't lose track since I created the bug and have a saved query
    to look at all bugs I make.

    @vstinner
    Member

    Ping! Can anyone review my patch?

    @benjaminp
    Contributor

    I don't like the change of API to PyTokenizer_FromString. I would prefer
    another function, like a PyTokenizer_IgnoreCodingCookie() that blows up
    if parsing has already started.

    The (char *) cast in PyTokenizer_FromString is unneeded.

    You need to indent the "else" clause after you test for ignore_cookie.

    I'd like to see a test that shows that byte strings still have their
    cookies examined.

    @amauryfa
    Member

    Note that PyTokenizer_FromString is not an API function: it's not marked
    with PyAPI_FUNC, and it is not exported from the Windows DLL.

    @vstinner
    Member

    > I don't like the change of API to PyTokenizer_FromString.
    > I would prefer another function like
    > PyTokenizer_IgnoreCodingCookie()

    OK, I created a new function, PyTokenizer_FromUnicode(). I
    chose "FromUnicode" because the string is encoded as Unicode (stored as
    UTF-8, even if it's not the wchar_t* type).

    > The (char *) cast in PyTokenizer_FromString is unneeded.

    The cast on the decode_str() result? It was already present in the
    original code. I removed it in my new patch.

    > You need to indent the "else" clause after you test for
    > ignore_cookie.

    Oops, I always have problems generating a diff because my editor
    removes trailing spaces, and then I have to ignore whitespace changes to
    create the diff.

    > I'd like to see a test that shows that byte strings still have their
    > cookies examined.

    test_pep263 already has two tests using a "# coding:" header.
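
    For completeness, a hedged sketch (not taken from test_pep263 itself) of
    the behaviour those tests cover: a bytes source still has its coding
    cookie applied when it is decoded.

    >>> src = "# coding: latin-1\nname = '\u00e9'\n".encode('latin-1')
    >>> ns = {}
    >>> exec(compile(src, '<test>', 'exec'), ns)
    >>> ns['name']
    'é'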

    @benjaminp
    Contributor

    On Thu, Jan 29, 2009 at 5:13 PM, STINNER Victor <report@bugs.python.org> wrote:

    > OK, I created a new function, PyTokenizer_FromUnicode(). I
    > chose "FromUnicode" because the string is encoded as Unicode (stored as
    > UTF-8, even if it's not the wchar_t* type).

    How about PyTokenizer_FromUTF8() then?

    >> The (char *) cast in PyTokenizer_FromString is unneeded.

    > The cast on the decode_str() result? It was already present in the
    > original code. I removed it in my new patch.

    No, I was referring to this line:

    tok->encoding = (char *)PyMem_MALLOC

    @benjaminp
    Contributor

    Here's another patch. The one problem is that it causes test_coding to
    fail because coding cookies are ignored and not checked to see if they
    are the same as the encoding of the file. It seems fine to me to just
    skip this test.

    @vstinner
    Member

    @benjamin.peterson: I don't see your changes... I read both patches
    carefully and I saw:

    • source_as_string(): remove "cf.cf_flags = supplied_flags |
      PyCF_SOURCE_IS_UTF8;"
    • rename PyTokenizer_FromUnicode() to PyTokenizer_FromUTF8()
    • update NEWS

    I don't understand the change in source_as_string(). Except for that,
    it looks correct.

    > The one problem is that it causes test_coding to fail because
    > coding cookies are ignored and not checked to see if they
    > are the same as the encoding of the file.

    The test has to fail, but the error is not in the compile() patch,
    but in the test: the input file is opened as text instead of bytes. I
    propose this patch to fix the test:

    Index: Lib/test/test_coding.py
    ===================================================================
    --- Lib/test/test_coding.py (revision 69723)
    +++ Lib/test/test_coding.py (working copy)
    @@ -17,7 +17,7 @@

             path = os.path.dirname(__file__)
             filename = os.path.join(path, module_name + '.py')
    -        fp = open(filename, encoding='utf-8')
    +        fp = open(filename, 'rb')
             text = fp.read()
             fp.close()
             self.assertRaises(SyntaxError, compile, text, 
    filename, 'exec')
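
    A hedged, self-contained illustration (not the actual bad_coding
    modules) of why the test has to read bytes: with the patch, a mismatched
    cookie inside a str is ignored, so only the bytes form still raises.

    src = "# coding: ascii\nname = '\u00e9'\n"  # the cookie contradicts the content

    compile(src, '<str>', 'exec')               # accepted: the cookie is ignored for str
    try:
        compile(src.encode('utf-8'), '<bytes>', 'exec')  # cookie applied: 'é' is not ASCII
    except SyntaxError as exc:
        print('bytes form still rejected:', exc)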

    @benjaminp
    Contributor

    On Tue, Feb 17, 2009 at 6:22 PM, STINNER Victor <report@bugs.python.org> wrote:

    > I don't understand the change in source_as_string(). Except for that,
    > it looks correct.

    PyCF_SOURCE_IS_UTF8 is already set in compile().

    >> The one problem is that it causes test_coding to fail because
    >> coding cookies are ignored and not checked to see if they
    >> are the same as the encoding of the file.

    > The test has to fail, but the error is not in the compile() patch,
    > but in the test: the input file is opened as text instead of bytes. I
    > propose this patch to fix the test:

    That fix is correct, but I think it avoids what the test is meant to
    test. The test is supposed to check that compile() complains if the
    encoding of the unicode string is wrong compared to the coding cookie,
    but I think that feature is ok to not support.

    @vstinner
    Member

    > I think that feature is ok to not support.

    Yeah!

    Anyone to review and/or commit the last patch?

    @benjaminp
    Contributor

    I'll deal with it eventually.

    @benjaminp benjaminp self-assigned this Feb 27, 2009
    @benjaminp
    Contributor

    Fixed in r70112.
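
    A quick hedged check of the fixed behaviour, reusing Brett's original
    example: the cookie in a str is now ignored, so the non-ASCII
    identifier compiles and runs.

    >>> source = "# coding=latin-1\n\u00c6 = '\u00c6'"
    >>> ns = {}
    >>> exec(compile(source, '<test>', 'exec'), ns)
    >>> ns['\u00c6']
    'Æ'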

    @benjaminp
    Contributor

    Should this be backported?

    @vstinner
    Member

    vstinner commented Mar 3, 2009

    > Should this be backported?

    It's r70113 (not r70112). I see that pitrou backported the fix
    to 3.0.x. I think that's enough; 2.x doesn't need the fix.

    @jmfauth
    Mannequin

    jmfauth mannequin commented Mar 24, 2009

    I'm glad to have discovered this topic. I bumped into something similar
    when I toyed with an interactive interpreter.

    from code import InteractiveInterpreter
    
    ii = InteractiveInterpreter()
    source = ...
    ii.runsource(source)

    What should the encoding and/or the type (str, bytes) of the "source"
    string be? Taking into account the encoding of the script that contains
    this code, I have the feeling there is always something going wrong:
    either a non-ASCII char in the source (encoded in utf-8!), or the
    interactive interpreter not accepting very well a byte string
    representing a utf-8 encoded string.

    IDLE is not suffering from this. Its interactive interpreter somehow
    receives a "ucs-2 ready" string from tkinter.

    I'm a little bit confused here (win2k, winXP sp2, Python 3.0.1).

    @brettcannon
    Member Author

    On Tue, Mar 24, 2009 at 09:24, Jean-Michel Fauth <report@bugs.python.org> wrote:


    > I'm glad to have discovered this topic. I bumped into something similar
    > when I toyed with an interactive interpreter.
    >
    > from code import InteractiveInterpreter
    >
    > ii = InteractiveInterpreter()
    > source = ...
    > ii.runsource(source)
    >
    > What should the encoding and/or the type (str, bytes) of the "source"
    > string be?

    Off the top of my head it should be UTF-8. Otherwise it can probably be
    bytes as long as it has universal newlines.
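
    For compile() itself, a hedged illustration of the two accepted forms
    (already-decoded text, or UTF-8 bytes):

    >>> text = "print('\u00c6')\n"
    >>> compile(text, '<str>', 'exec')                    # already-decoded source
    <code object <module> at 0x..., file "<str>", line 1>
    >>> compile(text.encode('utf-8'), '<bytes>', 'exec')  # UTF-8 bytes; a BOM/cookie would be honoured
    <code object <module> at 0x..., file "<bytes>", line 1>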

    @jmfauth
    Mannequin

    jmfauth mannequin commented Mar 25, 2009

    While preparing some test examples to submit here, I noticed that the
    codeop.py module used by the InteractiveInterpreter does not like byte
    strings very much.

    IDLE, Python 3.0.1, winxp sp2

    >>> source = b'print(999)'
    >>> compile(source, '<in>', 'exec')
    <code object <module> at 0x00AA5CC8, file "<in>", line 1>
    >>> r = compile(source, '<in>', 'exec')
    >>> exec(r)
    999
    >>> from code import InteractiveInterpreter
    >>> ii = InteractiveInterpreter()
    >>> ii.runsource(source)
    Traceback (most recent call last):
      File "<pyshell#6>", line 1, in <module>
        ii.runsource(source)
      File "C:\Python30\lib\code.py", line 63, in runsource
        code = self.compile(source, filename, symbol)
      File "C:\Python30\lib\codeop.py", line 168, in __call__
        return _maybe_compile(self.compiler, source, filename, symbol)
      File "C:\Python30\lib\codeop.py", line 70, in _maybe_compile
        for line in source.split("\n"):
    TypeError: Type str doesn't support the buffer API
    >>> 
    >>> source = 'print(999)'
    >>> ii.runsource(source)
    999
    False

    @vstinner
    Member

    @jmfauth: Can you open a different issue for the IDLE issue?

    @jmfauth
    Mannequin

    jmfauth mannequin commented Mar 25, 2009

    Victor

    Yes, I could, but I think this is not an IDLE issue; I'm just
    using IDLE to illustrate the problem. I get the same thing when
    working from within an editor, or with the interactive interpreter
    I wrote for fun.

    Code in the editor:

    # -*- coding: cp1252 -*-
    # Python 3.0.1, winxp sp2

    from code import InteractiveInterpreter
    
    ii = InteractiveInterpreter()
    source = b'print(999)'
    ii.runsource(source)

    Output:

    >c:\python30\pythonw -u "uuu.py"
    Traceback (most recent call last):
      File "uuu.py", line 8, in <module>
        ii.runsource(source)
      File "c:\python30\lib\code.py", line 63, in runsource
        code = self.compile(source, filename, symbol)
      File "c:\python30\lib\codeop.py", line 168, in __call__
        return _maybe_compile(self.compiler, source, filename, symbol)
      File "c:\python30\lib\codeop.py", line 70, in _maybe_compile
        for line in source.split("\n"):
    TypeError: Type str doesn't support the buffer API
    >Exit code: 1

    My interactive interpreter

    >>> ---
    from code import InteractiveInterpreter
    >>> ---
    ii = InteractiveInterpreter()
    >>> ---
    source = b'print(999)'
    >>> ---
    ii.runsource(source)
    Traceback (most recent call last):
      File "<smid last command>", line 1, in <module>
      File "c:\Python30\lib\code.py", line 63, in runsource
        code = self.compile(source, filename, symbol)
      File "c:\Python30\lib\codeop.py", line 168, in __call__
        return _maybe_compile(self.compiler, source, filename, symbol)
      File "c:\Python30\lib\codeop.py", line 70, in _maybe_compile
        for line in source.split("\n"):
    TypeError: Type str doesn't support the buffer API
    >>> ---
    

    =======================

    I realised I had missed the fact that the str() function now accepts
    an encoding argument (very good). With it, the above code works.

    Code in the editor:

    # -*- coding: cp1252 -*-
    # Python 3.0.1, winxp sp2

    from code import InteractiveInterpreter
    
    ii = InteractiveInterpreter()
    source = b'print(999)'
    source = str(source, 'cp1252') #<<<<<<<<<<
    ii.runsource(source)

    Output: (ok)

    c:\python30\pythonw -u "uuu.py"
    999
    Exit code: 0

    =======================

    In a few words, my empirical understanding of the story.

    1. Things are in good shape, but Python itself and its
      components (compile(), the InteractiveInterpreter module, runsource(),
      exec(), ...) lack consistency: they all accept miscellaneous
      argument types, or they are not strict enough about which
      arguments they accept. When they accept a str, which
      encoding is accepted?

    2. Python 3 now uses Unicode for str. Of course, this is welcome,
      but the caveat is that there are now two encodings in use. Source code
      defaults to utf-8, but str in code is ucs-2/4 internally (e.g. in
      IDLE).

    3. Maybe a solution is to have an optional encoding argument, as we now
      have everywhere (e.g. str(), open()), which defaults to one encoding:

    compile(source, filename, encoding='utf-8', ...)
    (Problem: BOM, coding cookies?)

    I suspect the miscellaneous discussions one finds from people attempting
    to write a "correct" execfile() for Python 3 stem from this.

    Regards.

    @vstinner
    Member

    > Yes, I could, but I think this is not an IDLE issue
    > (...)
    > File "uuu.py", line 8, in <module>
    > ii.runsource(source)
    > (...)
    > File "c:\python30\lib\codeop.py", line 70, in _maybe_compile
    > for line in source.split("\n"):
    > TypeError: Type str doesn't support the buffer API

    compile() works as expected. Your problem is related to
    InteractiveInterpreter().runsource(source), which comes from the code
    module (used by IDLE). runsource() is not compatible with bytes; only
    the str type is accepted.

    The error comes from bytes.split(str): _maybe_compile() should use
    source.split(b'\n') if the source type is bytes, or runsource() should
    reject bytes outright.

    > source = str(source, 'cp1252') #<<<<<<<<<<
    > ii.runsource(source)
    >
    > Output: (ok)

    Yes, runsource() (only) works with the str type.
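
    A hedged sketch of the practical workaround: decode the bytes yourself
    before handing them to runsource(), which only handles str.

    from code import InteractiveInterpreter

    ii = InteractiveInterpreter()
    raw = b'print(999)'
    ii.runsource(raw.decode('cp1252'))   # or 'utf-8'; runsource() needs a str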

    > I suspect the miscellaneous discussions one finds from people attempting
    > to write a "correct" execfile() for Python 3 stem from this.

    Please see issues:

    • issue bpo-5524: execfile() removed from Python3
    • issue bpo-4628: No universal newline support for compile() when using bytes

    @jmfauth
    Mannequin

    jmfauth mannequin commented Mar 30, 2009

    Quick feedback from a Windows user.

    I made a few more tests with the freshly installed Python 3.1a1. The
    compile() function is working very well.

    As a side effect, it is now possible to write an "execfile()" without
    any problem. I tested execfile() versions reading files as text
    or as binary, with/without coding cookies, BOM, cp1252, cp850, cp437,
    utf-8 cookie, utf-8 with BOM, ....

    (Of course, taking into account and managing universal newlines.)
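
    For reference, a hedged sketch of the kind of execfile() replacement
    described above (universal-newline handling left aside): reading the
    file as bytes lets compile() apply any BOM or coding cookie itself.

    def execfile(path, globs=None, locs=None):
        with open(path, 'rb') as fp:     # bytes: compile() decodes per BOM/cookie
            source = fp.read()
        exec(compile(source, path, 'exec'), globs, locs)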

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022