3.0 distutils byte-compiling -> Syntax error: unknown encoding: cp1252 #48992

Closed
sjmachin mannequin opened this issue Dec 24, 2008 · 8 comments
Labels: stdlib (Python modules in the Lib dir), type-crash (a hard crash of the interpreter, possibly with a core dump)

Comments

@sjmachin
Mannequin

sjmachin mannequin commented Dec 24, 2008

BPO 4742
Nosy @malemburg, @amauryfa, @tarekziade
Superseder
  • bpo-4626: compile() doesn't ignore the source encoding when a string is passed in
Files
  • py3encbug2.zip
  • x9d.py
  • encoding.issue.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/tarekziade'
    closed_at = <Date 2009-01-01.12:32:40.649>
    created_at = <Date 2008-12-24.22:39:26.277>
    labels = ['library', 'type-crash']
    title = '3.0 distutils byte-compiling -> Syntax error: unknown encoding: cp1252'
    updated_at = <Date 2009-01-01.12:32:40.648>
    user = 'https://bugs.python.org/sjmachin'

    bugs.python.org fields:

    activity = <Date 2009-01-01.12:32:40.648>
    actor = 'georg.brandl'
    assignee = 'tarek'
    closed = True
    closed_date = <Date 2009-01-01.12:32:40.649>
    closer = 'georg.brandl'
    components = ['Distutils']
    creation = <Date 2008-12-24.22:39:26.277>
    creator = 'sjmachin'
    dependencies = []
    files = ['12446', '12492', '12494']
    hgrepos = []
    issue_num = 4742
    keywords = ['patch']
    message_count = 8.0
    messages = ['78273', '78275', '78518', '78522', '78524', '78525', '78528', '78535']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'sjmachin', 'amaury.forgeotdarc', 'tarek']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = None
    status = 'closed'
    superseder = '4626'
    type = 'crash'
    url = 'https://bugs.python.org/issue4742'
    versions = ['Python 3.0']

    @sjmachin
    Mannequin Author

    sjmachin mannequin commented Dec 24, 2008

    File foo3.py is the [cut-down (originally 87 KB)] output of the 2to3
    conversion tool and (coincidentally) is still valid 2.x syntax. None of
    the following reports a syntax error:
    \python26\python -c "import foo3"
    \python26\python foo3.py
    \python26\python setup.py install
    \python30\python -c "import foo3"
    \python30\python foo3.py
    However, a 3.0 install:
    \python30\python setup.py install
    produces:
    """
    [snip]
    running install_lib
    copying build\lib\foo3.py -> C:\python30\Lib\site-packages
    byte-compiling C:\python30\Lib\site-packages\foo3.py to foo3.pyc
    File "C:\python30\Lib\site-packages\foo3.py", line 0
    ### Note also "line 0" above ###
    SyntaxError: unknown encoding: cp1252
    """
    Same happens if alternative name windows-1252 is used instead of cp1252.

    NOTE: file foo3.py actually does have some non-ASCII characters (\xa0,
    \x93, \x94), in comments. Another file (bar3.py) from the same package
    contains \xb7 twice, but doesn't have the unknown encoding problem.
    There are several other files in the same package that start with
    "# -*- coding: windows-1252 -*-" (or cp1252, or even cp1251(!)) but have no
    non-ASCII characters in them. They don't get this incorrect error
    message either.
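
As a side check on the report above: both spellings of the declaration resolve to the same registered codec in current Python 3, which suggests the "unknown encoding" wording is misleading rather than literally true. The sketch below is illustrative only. `declared_codec` is a hypothetical helper that does not appear in the report, and it assumes a modern Python 3 where `tokenize.detect_encoding` and `codecs.lookup` are available.

```python
import codecs
import io
import tokenize


def declared_codec(source: bytes) -> str:
    """Return the canonical codec name for the source's coding declaration.

    Hypothetical helper for illustration; not part of the original report.
    """
    # detect_encoding() reads at most the first two lines of the source.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    # codecs.lookup() resolves aliases such as "windows-1252".
    return codecs.lookup(encoding).name


if __name__ == "__main__":
    for cookie in (b"windows-1252", b"cp1252"):
        source = b"# -*- coding: " + cookie + b" -*-\nx = 1\n"
        print(cookie.decode("ascii"), "->", declared_codec(source))
```

If both declarations map to `cp1252`, the codec itself is known to the interpreter, so the failure has to come from somewhere other than the codec lookup.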

    @sjmachin sjmachin mannequin added the stdlib Python modules in the Lib dir label Dec 24, 2008
    @sjmachin
    Mannequin Author

    sjmachin mannequin commented Dec 24, 2008

    A clue:

    >>> print(ascii(b'\xa0\x93\x94\xb7'.decode('cp1252')))
    '\xa0\u201c\u201d\xb7'

    Could be that it only happens where there's a cp1252 character that's
    not in latin1; see files x93.py and x94.py (have problem) and xa0b7.py
    (doesn't have problem).
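
The clue above can be checked mechanically. The following is a minimal sketch, assuming Python 3; `differs_from_latin1` is a hypothetical helper name introduced here for illustration. It reports which of the sample bytes decode to a different character under cp1252 than under Latin-1.

```python
def differs_from_latin1(byte_values, encoding="cp1252"):
    """Map each byte whose decoding under `encoding` differs from Latin-1
    to the character it decodes to. Hypothetical helper for illustration."""
    result = {}
    for value in byte_values:
        raw = bytes([value])
        decoded = raw.decode(encoding)
        if decoded != raw.decode("latin-1"):
            result[value] = decoded
    return result


if __name__ == "__main__":
    sample = [0xA0, 0x93, 0x94, 0xB7]  # the bytes mentioned in the report
    for value, char in differs_from_latin1(sample).items():
        print(f"0x{value:02x} -> U+{ord(char):04X}")
```

Of the four bytes, only 0x93 and 0x94 are reported, which matches the `ascii()` output quoted above: 0xa0 and 0xb7 map to the same code points in both encodings.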

    @tarekziade tarekziade mannequin self-assigned this Dec 30, 2008
    @tarekziade tarekziade mannequin added the type-crash A hard crash of the interpreter, possibly with a core dump label Dec 30, 2008
    @tarekziade
    Mannequin

    tarekziade mannequin commented Dec 30, 2008

    Here's a status:

    The problem is located in the codec that decodes the data (called by the
    compile builtin).

    It throws an error:

    *** UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in
    position 853: character maps to <undefined>

    Which is caught by compile and translated into:

    SyntaxError: unknown encoding: cp1252

    So I see two problems:

    1/ why compile raises such an error when there's a UnicodeDecodeError
    2/ why compile works well under Py2, given that 0x9d is not part of the
    cp1252 mapping

    I have written a test that reproduces the problem, and I am still
    investigating. If I can't find the cause I will ask for help on
    python-dev because I have no knowledge of the compiler internals yet.
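
To make the "0x9d is not part of the cp1252 mapping" point concrete, here is a small sketch, assuming Python 3 and its bundled codecs. `undefined_bytes` is a hypothetical helper introduced only for this illustration; it lists the byte values that a single-byte codec refuses to decode.

```python
def undefined_bytes(encoding):
    """Return the byte values (0-255) that `encoding` cannot decode.

    Hypothetical helper for illustration; intended for single-byte codecs.
    """
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


if __name__ == "__main__":
    print([hex(value) for value in undefined_bytes("cp1252")])
```

With Python's cp1252 codec, 0x9d is among the undefined bytes while 0x93 and 0x94 are not, so a UnicodeDecodeError mentioning 0x9d cannot come from decoding those two bytes directly.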

    @sjmachin
    Mannequin Author

    sjmachin mannequin commented Dec 30, 2008

    TWO POINTS:
    (1) I am not very concerned about chars like \x9d which are not valid in
    the declared encoding; I am more concerned with chars like \x93 and \x94
    which *ARE* valid in the declared encoding. Please ensure that these
    cases are included in tests.
    (2) Please check your test data and test results. I get different
    results. I have created a file x9d.py by making the minimal changes to
    x94.py. For me, this blows up on bytecompiling with *both* 3.0
    (UnicodeDecodeError, as expected) and 2.x (Syntax Error unknown encoding
    cp1252, wrong message) -- see below.

    byte-compiling C:\python30\Lib\site-packages\x9d.py to x9d.pyc
    Traceback (most recent call last):
      File "setup.py", line 5, in <module>
        py_modules = ["foo3", "bar3", "x93", "x94", "x9d", "xa0b7"]
      File "C:\python30\lib\distutils\core.py", line 149, in setup
        dist.run_commands()
      File "C:\python30\lib\distutils\dist.py", line 942, in run_commands
        self.run_command(cmd)
      File "C:\python30\lib\distutils\dist.py", line 962, in run_command
        cmd_obj.run()
      File "C:\python30\lib\distutils\command\install.py", line 571, in run
        self.run_command(cmd_name)
      File "C:\python30\lib\distutils\cmd.py", line 317, in run_command
        self.distribution.run_command(command)
      File "C:\python30\lib\distutils\dist.py", line 962, in run_command
        cmd_obj.run()
      File "C:\python30\lib\distutils\command\install_lib.py", line 91, in run
        self.byte_compile(outfiles)
      File "C:\python30\lib\distutils\command\install_lib.py", line 125, in byte_compile
        dry_run=self.dry_run)
      File "C:\python30\lib\distutils\util.py", line 520, in byte_compile
        compile(file, cfile, dfile)
      File "C:\python30\lib\py_compile.py", line 137, in compile
        codestring = f.read()
      File "C:\python30\lib\io.py", line 1724, in read
        decoder.decode(self.buffer.read(), final=True))
      File "C:\python30\lib\io.py", line 1295, in decode
        output = self.decoder.decode(input, final=final)
      File "C:\python30\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position
    64: character maps to <undefined>

    byte-compiling C:\python26\Lib\site-packages\x9d.py to x9d.pyc
    SyntaxError: ('unknown encoding: cp1252',
    ('C:\\python26\\Lib\\site-packages\\x9d.py', 0, 0, None))

    byte-compiling c:\python25\Lib\site-packages\x9d.py to x9d.pyc
    File "c:\python25\Lib\site-packages\x9d.py", line 0
    SyntaxError: ('unknown encoding: cp1252',
    ('c:\\python25\\Lib\\site-packages\\x9d.py', 0, 0, None))

    @malemburg
    Member

    On 2008-12-30 13:20, John Machin wrote:

    byte-compiling C:\python26\Lib\site-packages\x9d.py to x9d.pyc
    SyntaxError: ('unknown encoding: cp1252',
    ('C:\\python26\\Lib\\site-packages\\x9d.py', 0, 0, None))

    byte-compiling c:\python25\Lib\site-packages\x9d.py to x9d.pyc
    File "c:\python25\Lib\site-packages\x9d.py", line 0
    SyntaxError: ('unknown encoding: cp1252',
    ('c:\\python25\\Lib\\site-packages\\x9d.py', 0, 0, None))

    Added file: http://bugs.python.org/file12492/x9d.py

    FWIW, I've tried that file with Python 2.5 and 2.6 on my machine:

    lemburg/tmp> python2.5 ~/bin/pycompile.py x9d.py
    compiling x9d.py -> x9d.pyc
    XXX <type 'exceptions.SyntaxError'>: unknown encoding: cp1252 (x9d.py, line 0)

    lemburg/tmp> python2.6 ~/bin/pycompile.py x9d.py
    compiling x9d.py -> x9d.pyc
    XXX <type 'exceptions.SyntaxError'>: unknown encoding: cp1252 (x9d.py, line 0)

    Note that the line number is wrong in both messages.

    It is interesting that simply running the files gives a more correct
    error message:

    lemburg/tmp> python2.5 x9d.py
    File "x9d.py", line 2
    SyntaxError: 'charmap' codec can't decode byte 0x9d in position 0: character
    maps to <undefined>

    lemburg/tmp> python2.6 x9d.py
    File "x9d.py", line 2
    SyntaxError: 'charmap' codec can't decode byte 0x9d in position 0: character
    maps to <undefined>

    The character position is wrong again in both messages.

    Needless to say, the encoding "cp1252" is *not* unknown. It looks
    like compile() causes the decoding error to be overwritten with a
    misleading error message.

    @tarekziade
    Mannequin

    tarekziade mannequin commented Dec 30, 2008

    Yup, here's the test I have written to demonstrate the problem. In any
    case, compile doesn't behave the right way in the first place.

    @sjmachin
    Mannequin Author

    sjmachin mannequin commented Dec 30, 2008

    (1) What am I supposed to infer from "Yup"? That all of that \x9d stuff
    was a mistake?

    (2)
    + def tearDown(self):
    +     pyc_file = os.path.join(os.path.dirname(__file__), 'cp1252.pyc')
    +     if os.path.exists(pyc_file):
    +         os.patth.remove(pyc_file)

    os.patth is novel :-)

    @amauryfa
    Member

    This is a duplicate of bpo-4626.

    Here, the content is correctly decoded with cp1252, then passed to
    compile(); but compile() works on the internal utf-8 representation, and
    tries to decode it again with cp1252!

    Yes, the error message is overwritten. If I remove the code that sets
    the "unknown encoding" exception, I get:

    >>> compile(open("c:/temp/t1252.py", encoding="cp1252").read(),
    "t1252.py", "exec")
    SyntaxError: 'charmap' codec can't decode byte 0x9d in position 35:
    character maps to <undefined>
    
    The 0x9d is easily explained:
    >>> b"\x94".decode('cp1252').encode('utf8')
    b'\xe2\x80\x9d'
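
The double decoding described above can be simulated outside the compiler. This is only a sketch of the mechanism, assuming Python 3, and not the actual code path inside compile(). `survives_second_decode` is a hypothetical name: it decodes a byte with the declared encoding, re-encodes the result as UTF-8, and then tries to decode that UTF-8 data with the declared encoding a second time.

```python
def survives_second_decode(raw, encoding="cp1252"):
    """Simulate decoding source bytes twice with the declared encoding.

    Hypothetical helper for illustration; it mimics the mechanism described
    in the comment above, not the real compile() internals.
    """
    text = raw.decode(encoding)   # first, correct decode
    utf8 = text.encode("utf-8")   # internal UTF-8 representation
    try:
        utf8.decode(encoding)     # second, erroneous decode
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for raw in (b"\xa0", b"\xb7", b"\x94"):
        print(raw, survives_second_decode(raw))
```

Under this model, 0xa0 and 0xb7 pass the second decode because their UTF-8 bytes are all defined in cp1252, while 0x94 fails because its UTF-8 form ends in 0x9d. That is consistent with xa0b7.py compiling cleanly and x94.py failing.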

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022