Py3k fails to parse a file with an iso-8859-1 string #46912

Closed
azverkan mannequin opened this issue Apr 19, 2008 · 8 comments
Labels
topic-2to3 topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@azverkan
Mannequin

azverkan mannequin commented Apr 19, 2008

BPO 2660
Nosy @vstinner, @devdanzin, @benjaminp
Files
  • 2to3bug.py: testcase
  • 2to3_encoding.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-05-09.00:33:45.244>
    created_at = <Date 2008-04-19.21:04:59.639>
    labels = ['type-bug', 'expert-2to3', 'expert-unicode']
    title = 'Py3k fails to parse a file with an iso-8859-1 string'
    updated_at = <Date 2009-05-09.00:33:45.213>
    user = 'https://bugs.python.org/azverkan'

    bugs.python.org fields:

    activity = <Date 2009-05-09.00:33:45.213>
    actor = 'benjamin.peterson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-05-09.00:33:45.244>
    closer = 'benjamin.peterson'
    components = ['Unicode', '2to3 (2.x to 3.x conversion tool)']
    creation = <Date 2008-04-19.21:04:59.639>
    creator = 'azverkan'
    dependencies = []
    files = ['10063', '13877']
    hgrepos = []
    issue_num = 2660
    keywords = ['patch']
    message_count = 8.0
    messages = ['65637', '65638', '65641', '65642', '86641', '86643', '87175', '87481']
    nosy_count = 5.0
    nosy_names = ['collinwinter', 'vstinner', 'ajaksu2', 'benjamin.peterson', 'azverkan']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2660'
    versions = ['Python 2.6', 'Python 3.1']

    @azverkan
    Mannequin Author

    azverkan mannequin commented Apr 19, 2008

    While running the 2to3 script on the scons codebase, I ran into a
    UnicodeDecodeError.

    Attached is just the portion of the script that causes the error.

    2to3 throws the error on the string regardless of whether the literal
    carries the Unicode prefix (u'').

    RefactoringTool: Skipping implicit fixer: buffer
    RefactoringTool: Skipping implicit fixer: idioms
    RefactoringTool: Skipping implicit fixer: ws_comma
    Traceback (most recent call last):
      File "/usr/local/bin/2to3", line 5, in <module>
        sys.exit(refactor.main())
      File "/usr/local/lib/python3.0/lib2to3/refactor.py", line 81, in main
        rt.refactor_args(args)
      File "/usr/local/lib/python3.0/lib2to3/refactor.py", line 188, in refactor_args
        self.refactor_file(arg)
      File "/usr/local/lib/python3.0/lib2to3/refactor.py", line 217, in refactor_file
        input = f.read() + "\n" # Silence certain parse errors
      File "/usr/local/lib/python3.0/io.py", line 1611, in read
        decoder.decode(self.buffer.read(), final=True))
      File "/usr/local/lib/python3.0/io.py", line 1199, in decode
        output = self.decoder.decode(input, final=final)
      File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 59-60: invalid data
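
The error at the bottom of the traceback can be reproduced without 2to3. This is a minimal sketch, assuming the test case contains a non-ASCII byte encoded as ISO-8859-1. The byte string below is a hypothetical stand-in, not the contents of 2to3bug.py.

```python
# Hypothetical stand-in for the test case: one source line encoded as ISO-8859-1.
source_bytes = "name = u'caf\xe9'\n".encode("iso-8859-1")

# Decoding as UTF-8 fails: 0xE9 announces a multi-byte sequence,
# but the bytes that follow are not valid continuation bytes.
try:
    source_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the encoding the bytes were written in succeeds.
text = source_bytes.decode("iso-8859-1")
```

The same bytes are valid input for one codec and invalid for the other, so whichever encoding the reader assumes decides whether the file can be loaded at all.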

    @azverkan azverkan mannequin assigned collinwinter Apr 19, 2008
    @azverkan azverkan mannequin added the topic-2to3 label Apr 19, 2008
    @collinwinter
    Mannequin

    collinwinter mannequin commented Apr 19, 2008

    2to3 running under Python 2.5.1 handles this file just fine. 2to3
    running under 3.0a4+ (r62404) fails as detailed above. However, the
    file doesn't run correctly under Python itself either:

    collinwinter@Silves:/src/python/py3k$ ./python /home/collinwinter/Desktop/2to3bug.py
    File "/home/collinwinter/Desktop/2to3bug.py", line 3
    collinwinter@Silves:/src/python/py3k

    This suggests the problem isn't 2to3-specific. Refiling this issue
    against py3k's Unicode support.

    @collinwinter collinwinter mannequin added topic-unicode and removed topic-2to3 labels Apr 19, 2008
    @collinwinter collinwinter mannequin removed their assignment Apr 19, 2008
    @collinwinter collinwinter mannequin changed the title from "2to3 throws a utf8 decode error on a iso-8859-1 string" to "Py3k fails to parse a file with an iso-8859-1 string" Apr 19, 2008
    @azverkan
    Mannequin Author

    azverkan mannequin commented Apr 20, 2008

    Someone on the #python IRC channel suggested that the default type of
    string literals in Python 3.0 is the reverse of Python 2.5: unprefixed
    literals are Unicode.

    If you remove the Unicode prefix (u'') from the front of the string,
    the file runs fine under Python 3.0 and fails under 2.5 and 2.6 instead.

    @azverkan
    Mannequin Author

    azverkan mannequin commented Apr 20, 2008

    Also, I can confirm that running 2to3 with Python 2.6 correctly converts
    the script but running 2to3 with Python 3.0 results in a
    UnicodeDecodeError exception.

    @devdanzin
    Mannequin

    devdanzin mannequin commented Apr 27, 2009

    Confirmed in py3k on rev71995.

    @devdanzin devdanzin mannequin added the topic-2to3 and type-bug labels Apr 27, 2009
    @benjaminp
    Contributor

    The problem is that 2to3 just reads the file with whatever
    locale.getpreferredencoding() returns. It should use
    tokenize.detect_encoding() to discover the correct encoding to open it with.
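
As a rough illustration of that suggestion, the sketch below calls tokenize.detect_encoding() on an in-memory source. It assumes the file declares its encoding with a PEP 263 coding cookie; the sample source is hypothetical. Without a BOM or cookie, detect_encoding() falls back to 'utf-8', so an undeclared ISO-8859-1 file would still fail to decode.

```python
import io
import tokenize

# Hypothetical source with a PEP 263 coding cookie, encoded as ISO-8859-1.
source_bytes = (
    "# -*- coding: iso-8859-1 -*-\n"
    "name = 'caf\xe9'\n"
).encode("iso-8859-1")

# detect_encoding() takes a readline callable that returns bytes. It returns
# the encoding name and the list of raw lines it consumed while looking.
encoding, lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)

# Decode the whole source with the declared encoding.
text = source_bytes.decode(encoding)
```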

    @vstinner
    Member

    vstinner commented May 4, 2009

    Patch using tokenize.detect_encoding() to read the encoding of Python
    scripts instead of relying on the default io.open() encoding (utf-8).

    We might want to write a unit test.

    See also related issue: bpo-5093
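
The approach described in this comment might look roughly like the following. This is only a sketch, not the code from 2to3_encoding.patch: the helper name read_python_source and the sample file are hypothetical.

```python
import os
import tempfile
import tokenize


def read_python_source(path):
    """Hypothetical helper: read a Python file using its declared encoding."""
    # Pass 1: open in binary mode so nothing is decoded yet.
    with open(path, "rb") as f:
        encoding, _ = tokenize.detect_encoding(f.readline)
    # Pass 2: reopen in text mode with the detected encoding.
    with open(path, "r", encoding=encoding) as f:
        return f.read(), encoding


# Usage: write a small ISO-8859-1 file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.py")
    with open(path, "wb") as f:
        f.write("# -*- coding: iso-8859-1 -*-\nname = 'caf\xe9'\n".encode("iso-8859-1"))
    text, encoding = read_python_source(path)
```

Later Python 3 releases also provide tokenize.open(), which opens a file in text mode using the encoding found by detect_encoding().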

    @benjaminp
    Contributor

    Fixed in r72491.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022