Py3k fails to parse a file with an iso-8859-1 string #46912

azverkan · 2008-04-19T21:05:00Z

BPO	2660
Nosy	@vstinner, @devdanzin, @benjaminp
Files	2to3bug.py: testcase 2to3_encoding.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2009-05-09.00:33:45.244>
created_at = <Date 2008-04-19.21:04:59.639>
labels = ['type-bug', 'expert-2to3', 'expert-unicode']
title = 'Py3k fails to parse a file with an iso-8859-1 string'
updated_at = <Date 2009-05-09.00:33:45.213>
user = 'https://bugs.python.org/azverkan'

bugs.python.org fields:

activity = <Date 2009-05-09.00:33:45.213>
actor = 'benjamin.peterson'
assignee = 'none'
closed = True
closed_date = <Date 2009-05-09.00:33:45.244>
closer = 'benjamin.peterson'
components = ['Unicode', '2to3 (2.x to 3.x conversion tool)']
creation = <Date 2008-04-19.21:04:59.639>
creator = 'azverkan'
dependencies = []
files = ['10063', '13877']
hgrepos = []
issue_num = 2660
keywords = ['patch']
message_count = 8.0
messages = ['65637', '65638', '65641', '65642', '86641', '86643', '87175', '87481']
nosy_count = 5.0
nosy_names = ['collinwinter', 'vstinner', 'ajaksu2', 'benjamin.peterson', 'azverkan']
pr_nums = []
priority = 'high'
resolution = 'fixed'
stage = 'test needed'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue2660'
versions = ['Python 2.6', 'Python 3.1']

azverkan · 2008-04-19T21:04:58Z

While running the 2to3 script on the scons codebase, I ran into an
UnicodeDecodeError.

Attached is just the portion of the script that causes the error.

2to3 raises the error on the string regardless of whether the string
literal carries the unicode prefix (u'').

RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: ws_comma
Traceback (most recent call last):
  File "/usr/local/bin/2to3", line 5, in <module>
    sys.exit(refactor.main())
  File "/usr/local/lib/python3.0/lib2to3/refactor.py", line 81, in main
    rt.refactor_args(args)
  File "/usr/local/lib/python3.0/lib2to3/refactor.py", line 188, in refactor_args
    self.refactor_file(arg)
  File "/usr/local/lib/python3.0/lib2to3/refactor.py", line 217, in refactor_file
    input = f.read() + "\n" # Silence certain parse errors
  File "/usr/local/lib/python3.0/io.py", line 1611, in read
    decoder.decode(self.buffer.read(), final=True))
  File "/usr/local/lib/python3.0/io.py", line 1199, in decode
    output = self.decoder.decode(input, final=final)
  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 59-60: invalid data
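
The same class of failure can be reproduced without 2to3. This is a minimal sketch, assuming the file holds a non-ASCII character encoded as iso-8859-1; the sample bytes and the helper `try_decode` are hypothetical and are not the contents of the attached 2to3bug.py.

```python
# Hypothetical source bytes: an iso-8859-1 encoded file with a coding cookie.
source_bytes = "# -*- coding: iso-8859-1 -*-\ns = u'caf\xe9'\n".encode("iso-8859-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Byte 0xE9 starts a multi-byte UTF-8 sequence, but the next byte is an
# ASCII quote rather than a continuation byte, so UTF-8 decoding fails.
utf8_result = try_decode(source_bytes, "utf-8")

# Decoding with the encoding the file declares succeeds.
latin1_result = try_decode(source_bytes, "iso-8859-1")
```

Reading the file as text with a UTF-8 default hits the first case, which is where the traceback above ends up.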

collinwinter · 2008-04-19T21:48:49Z

2to3 running under Python 2.5.1 handles this file just fine. 2to3
running under 3.0a4+ (r62404) fails as detailed above. However, that
file doesn't run correctly under py3k itself either:

collinwinter@Silves:/src/python/py3k$ ./python /home/collinwinter/Desktop/2to3bug.py
File "/home/collinwinter/Desktop/2to3bug.py", line 3
collinwinter@Silves:/src/python/py3k

This suggests this problem isn't 2to3-specific. Refiling this issue
against py3k's Unicode support.

azverkan · 2008-04-20T01:38:09Z

Someone on the #python IRC channel suggested that Python 3.0 reverses
the Python 2.5 default for string literals with respect to unicode.

If you remove the unicode prefix (u'') from the front of the string
literal, it runs fine under Python 3.0 and fails under 2.5 and 2.6 instead.

azverkan · 2008-04-20T01:40:01Z

Also, I can confirm that running 2to3 with Python 2.6 correctly converts
the script but running 2to3 with Python 3.0 results in a
UnicodeDecodeError exception.

devdanzin · 2009-04-27T01:42:31Z

Confirmed in py3k on rev71995.

benjaminp · 2009-04-27T02:39:29Z

The problem is that 2to3 just reads the file with whatever
locale.getpreferredencoding() returns. It should use
tokenize.detect_encoding() to discover the correct encoding to open it with.
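
A minimal sketch of that approach, not the actual 2to3 change: `read_python_source` and the sample bytes are hypothetical names introduced here for illustration. `tokenize.detect_encoding()` takes a `readline` callable that yields bytes and returns the encoding name together with the lines it consumed while looking for a BOM or coding cookie.

```python
import io
import tokenize


def read_python_source(raw_bytes):
    """Decode Python source bytes using the encoding the source declares."""
    # detect_encoding() inspects at most the first two lines for a BOM or
    # a coding cookie and falls back to utf-8 when neither is present.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw_bytes).readline)
    return encoding, raw_bytes.decode(encoding)


# Hypothetical iso-8859-1 source with a coding cookie.
sample = "# -*- coding: iso-8859-1 -*-\ns = 'caf\xe9'\n".encode("iso-8859-1")
encoding, text = read_python_source(sample)
```

For a file on disk, the same idea applies by opening it in binary mode, passing the file object's `readline` to `detect_encoding()`, and then reopening or decoding with the returned encoding.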

vstinner · 2009-05-04T20:55:18Z

Patch using tokenize.detect_encoding() to detect the encoding of Python
scripts instead of relying on the default io.open() encoding (utf-8).

We might want to write a unit test.
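
One possible shape for such a test, as a sketch only: the class name, test name and sample source are hypothetical, and it uses `tokenize.open()` (added in later Python 3 releases, which calls `detect_encoding()` internally) rather than the code in the attached patch.

```python
import os
import tempfile
import tokenize
import unittest


class EncodingDetectionTest(unittest.TestCase):
    """Hypothetical test: a latin-1 file with a coding cookie reads back intact."""

    def test_reads_latin1_file(self):
        source = "# -*- coding: iso-8859-1 -*-\ns = 'caf\xe9'\n"
        # Write the bytes in binary mode so the on-disk encoding is exact.
        with tempfile.NamedTemporaryFile(suffix=".py", delete=False) as f:
            f.write(source.encode("iso-8859-1"))
            path = f.name
        try:
            # tokenize.open() picks the codec from the file's coding cookie.
            with tokenize.open(path) as stream:
                self.assertEqual(stream.read(), source)
        finally:
            os.remove(path)
```

A test against 2to3 itself would instead feed such a file to the refactoring tool and check that no UnicodeDecodeError is raised.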
