msg71251 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-17 00:39 |
The following leads to a SyntaxError in 3.0:
compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec')
That is not the case in Python 2.6.
|
msg71252 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-17 00:39 |
Looks like Parser/tokenizer.c:check_coding_spec() considered Latin-1 a
raw encoding just like UTF-8. Patch is in the works.
|
msg71257 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-17 01:56 |
Here is a potential fix. It broke test_imp because it assumed that
Latin-1 source files would be encoded at Latin-1 instead of UTF-8 when
returned by imp.new_module(). Doesn't seem like a critical change as the
file is still properly decoded.
|
msg71258 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-17 01:57 |
Attached is a test for test_pep3120 (since that is what most likely
introduced the breakage). It's a separate patch since the source file is
marked as binary and thus can't be diffed by ``svn diff``.
|
msg71401 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-19 02:34 |
Can someone double-check this patch for me? I don't have much experience
with the parser so I want to make sure I am not doing anything wrong.
|
msg71402 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-19 02:37 |
There is a potential dependency on issue3594 as it would change how
imp.find_module() acts and thus make test_imp no longer fail in the way
it has.
|
msg71403 - (view) |
Author: Benjamin Peterson (benjamin.peterson) * |
Date: 2008-08-19 02:54 |
That line dates back to the PEP 263 implementation. Martin?
|
msg71846 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2008-08-24 17:57 |
Since this is marked "release blocker", I'll provide a shallow comment:
I don't think it should be a release blocker. It's a bug in the compile
function, and there are various work-arounds (such as saving the bytes
to a temporary file and executing that one, or decoding the byte string
to a Unicode string, and then compiling the Unicode string). It is
sufficient to fix it in 3.0.1.
I don't think the patch is right: as the test had to be changed, it
means that somewhere, the detection of the encoding declaration now
fails. This is clearly a new bug, but I don't have the time to analyse
the cause further.
In principle, there is nothing wrong with the tokenizer treating latin-1
as "raw" - that only means we don't go through a codec.
|
msg71850 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-24 18:12 |
Actually, the tests don't have to change; if issue 3594 gets applied
then that change cascades into this issue and negates the need to change
the tests themselves.
As for treating Latin-1 as a raw encoding, how can that be theoretically
okay if the parser assumes UTF-8 and Latin-1 is not a superset of Latin-1?
|
msg71852 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2008-08-24 19:19 |
> As for treating Latin-1 as a raw encoding, how can that be theoretically
> okay if the parser assumes UTF-8 and Latin-1 is not a superset of Latin-1?
The parser doesn't assume UTF-8, but "ascii+", i.e. it passes all
non-ASCII bytes on to the AST, which then needs to deal with them;
it then could (but apparently doesn't) take into account whether the
internal representation was UTF-8 or Latin-1: see ast.c:decode_unicode
for some remains of that.
The other case (besides string literals) where bytes > 127 matter is
tokenizer.c:verify_identifier; this indeed assumes UTF-8 only (but
could be easily extended to support Latin-1 as well).
The third case where non-ASCII bytes are allowed is comments; there
they are entirely ignored (i.e. it is not even verified that the
comment is well-formed UTF-8).
Removal of the special case should simplify the code; I would agree
that any speedup gained by not going through a codec is irrelevant.
I'm still puzzled why test_imp if the special case is removed.
|
msg71855 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-08-24 19:53 |
The test_imp stuff has to do with PyTokenizer_FindEncoding().
imp.find_module() only opens the file, passes the file descriptor to
PyTokenizer_FindEncoding() and then returns a file object with the found
encoding.
Problem is that (as issue 3594 points out), PyTokenizer_FindEncoding()
always fails. That means it assumes only the raw encodings are okay.
With Latin-1 being one of them, it returns the file opened as Latin-1 as
is correct. Removing that case here means PyTokenizer_FindEncoding()
fails, and thus assumes only UTF-8 as a legitimate encoding and opens
the files with the UTF-8 encoding. It took a while to find these two
bugs obviously. =)
|
msg72624 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-09-05 20:36 |
I have attached a new version of the patch with the changes to test_imp
removed as issue 3594 fixed the need for the change. I have also
directly uploaded test_pep3120.py since it is flagged as binary and thus
cannot be diffed by svn.
|
msg74287 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-04 00:19 |
Using py3k trunk + fix_latin.diff:
- compile(b'# coding: latin-1\nu = "\xC7"\n', '<dummy>', 'exec')
doesn't fail
- test_pep3120.py is ok
- but execute a ISO-8859-1 script fails: see attached iso.py
Original Python3:
$ python iso.py
'Bonjour ma ch\xe8re amie'
Patched Python3:
$ python iso.py
'Bonjour ma ch\xc3\xa8re amie'
|
msg74288 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-04 00:23 |
See also Lib/test/test_shlex.py: trunk is ok, but with fix_latin.diff
the test fails.
|
msg74291 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-04 00:40 |
It looks like the problem of fix_latin.diff is the decoding_state:
it's set to STATE_NORMAL whereas current behaviour is to stay in state
STATE_RAW.
I wrote another patch which is a mix of case 1 (utf-8: just set
tok->encoding) and case 2 (another charset: set tok->enc,
tok->encoding and tok>decoding_state): a new case 3 which set enc,
encoding but stay a the state STATE_RAW. I don't understand my patch,
so review it (twice or more :-D). Using my patch:
- compile(...) works
- test_shlex.py works
- test_pep3120.py
- iso.py works
|
msg74294 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-04 01:24 |
But why does iso-8859-1 need to be treated as a special case? UTF-8 is
special because it is the default encoding for source. But iso-8859-1
really shouldn't be special, IMO.
Your patch does exactly what happens lower when set_readline() succeeds.
So I want to know why set_readline() is not returning something for
iso-8859-1 (which your patch seems to be suggesting).
|
msg74295 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-04 01:28 |
Sorry, I mis-spoke: your patch, Victor, doesn't change the state to
NORMAL. But my worry still stands; why does iso-8859-1 need to be
special-cased? It suggests to me that some more fundamental needs to be
dealt with instead of just patching around iso-8859-1.
|
msg74296 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-04 01:42 |
@brett.cannon: I found it: ast.c used a hack for iso-8859-1! Since
this hack introduces a bug (your compile(...) example), I prefer to
remove it to simplify to code. The new patch just removes the hack in
tokenizer.c and ast.c. It does also document encoding and enc
attributes of tok_state in tokenizer.h.
Using tokenizer_iso-8859-1-patch2.patch, all tests (test_pep3120.py,
iso.py, test_shlex.py, etc.) are OK.
|
msg74299 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-04 02:23 |
Thanks for finding that, Victor! I will do a patch review when I have a
chance (it won't be until after the weekend).
|
msg74300 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-04 02:36 |
After reading tokenizer.c 1000 times, I finally used grep:
$ grep -l -i 'iso.8859.1' $(find -name "*.c")
./Python/ast.c <~~~ WTF?
./Objects/unicodeobject.c
./Parser/tokenizer.c
./Modules/cjkcodecs/_codecs_iso2022.c
./Modules/expat/xmltok.c
|
msg74393 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-06 21:48 |
My patch version 2 included an "unrelated" fix for the issue2384.
|
msg74415 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-07 00:26 |
So I tried the patch (I attached my version with different comments in
the header file and removed some redundant change in whitespacing), and
test_sys consistently fails for me:
test_current_frames (__main__.SysModuleTest)
AssertionError: '' != 'g456()'
And this is after a ``make distclean`` and a new compile. This failure
doesn't occur on a clean checkout made after a 'distclean'.
|
msg74428 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-07 08:21 |
test_sys failure is fixed by the issue #2384.
|
msg74620 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2008-10-10 08:33 |
Amaury applied my both patches for issues #2384 and #3975. So all
tests now pass with python trunk + alt_latin_1.diff.
|
msg74642 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-10 20:24 |
Yep, it passes for me now.
Martin, have any objection to this patch?
|
msg74793 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2008-10-15 06:05 |
The patch looks fine to me, please apply.
I notice that the diff file reports changes to test_pep3120.py. No such
changes should be necessary, so please exclude them from committing.
|
msg74807 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-15 17:15 |
On Tue, Oct 14, 2008 at 11:05 PM, Martin v. Löwis
<report@bugs.python.org> wrote:
>
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> The patch looks fine to me, please apply.
>
Great!
> I notice that the diff file reports changes to test_pep3120.py. No such
> changes should be necessary, so please exclude them from committing.
>
I added a test for compile() in there, which is why the patch is
claiming that. There is an uploaded version of test_pep3120.py on the
issue.
|
msg74811 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2008-10-15 19:33 |
> I added a test for compile() in there, which is why the patch is
> claiming that. There is an uploaded version of test_pep3120.py on the
> issue.
Ah, ok. I missed that - that change is also fine.
|
msg74892 - (view) |
Author: Barry A. Warsaw (barry) * |
Date: 2008-10-17 01:55 |
Brett, please apply and close the issue.
|
msg74894 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-17 03:34 |
Sorry about that; been one of those days.
Doing a svn up and making sure it still compiles fine.
|
msg74895 - (view) |
Author: Brett Cannon (brett.cannon) * |
Date: 2008-10-17 03:39 |
Applied in r66951.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:56:37 | admin | set | github: 47824 |
2008-10-17 03:39:11 | brett.cannon | set | status: open -> closed messages:
+ msg74895 |
2008-10-17 03:34:07 | brett.cannon | set | messages:
+ msg74894 |
2008-10-17 01:55:11 | barry | set | nosy:
+ barry messages:
+ msg74892 |
2008-10-15 19:33:06 | loewis | set | messages:
+ msg74811 |
2008-10-15 17:15:34 | brett.cannon | set | messages:
+ msg74807 |
2008-10-15 06:05:29 | loewis | set | keywords:
- needs review assignee: loewis -> brett.cannon resolution: accepted messages:
+ msg74793 |
2008-10-10 20:24:41 | brett.cannon | set | assignee: loewis messages:
+ msg74642 |
2008-10-10 08:33:03 | vstinner | set | messages:
+ msg74620 |
2008-10-07 08:21:16 | vstinner | set | dependencies:
+ [Py3k] line number is wrong after encoding declaration messages:
+ msg74428 |
2008-10-07 00:26:56 | brett.cannon | set | files:
+ alt_latin_1.diff |
2008-10-07 00:26:18 | brett.cannon | set | assignee: brett.cannon -> (no value) messages:
+ msg74415 |
2008-10-06 21:49:17 | vstinner | set | files:
+ tokenizer_iso-8859-1-patch3.patch |
2008-10-06 21:49:07 | vstinner | set | files:
- python3_bytes_filename-3.patch |
2008-10-06 21:48:35 | vstinner | set | files:
+ python3_bytes_filename-3.patch messages:
+ msg74393 |
2008-10-06 21:32:38 | vstinner | set | files:
- tokenizer_iso-8859-1.patch |
2008-10-04 02:36:38 | vstinner | set | messages:
+ msg74300 |
2008-10-04 02:23:42 | brett.cannon | set | assignee: brett.cannon messages:
+ msg74299 |
2008-10-04 01:42:22 | vstinner | set | files:
+ tokenizer_iso-8859-1-patch2.patch messages:
+ msg74296 |
2008-10-04 01:28:14 | brett.cannon | set | messages:
+ msg74295 |
2008-10-04 01:24:43 | brett.cannon | set | messages:
+ msg74294 |
2008-10-04 00:40:19 | vstinner | set | files:
+ tokenizer_iso-8859-1.patch messages:
+ msg74291 |
2008-10-04 00:23:23 | vstinner | set | messages:
+ msg74288 |
2008-10-04 00:19:39 | vstinner | set | files:
+ iso.py nosy:
+ vstinner messages:
+ msg74287 |
2008-10-02 12:54:23 | barry | set | priority: deferred blocker -> release blocker |
2008-09-26 22:18:34 | barry | set | priority: release blocker -> deferred blocker |
2008-09-18 05:43:20 | barry | set | priority: deferred blocker -> release blocker |
2008-09-09 13:12:01 | barry | set | priority: release blocker -> deferred blocker |
2008-09-05 20:37:13 | brett.cannon | set | files:
- pep3120_test.diff |
2008-09-05 20:37:07 | brett.cannon | set | files:
+ test_pep3120.py |
2008-09-05 20:36:41 | brett.cannon | set | files:
- fix_latin.diff |
2008-09-05 20:36:29 | brett.cannon | set | files:
+ fix_latin.diff messages:
+ msg72624 |
2008-08-24 19:53:27 | brett.cannon | set | messages:
+ msg71855 |
2008-08-24 19:19:03 | loewis | set | messages:
+ msg71852 |
2008-08-24 18:12:06 | brett.cannon | set | messages:
+ msg71850 |
2008-08-24 17:57:03 | loewis | set | messages:
+ msg71846 |
2008-08-21 20:33:57 | brett.cannon | set | keywords:
+ needs review |
2008-08-21 18:35:23 | brett.cannon | set | priority: critical -> release blocker |
2008-08-19 02:54:03 | benjamin.peterson | set | nosy:
+ loewis, benjamin.peterson messages:
+ msg71403 |
2008-08-19 02:37:05 | brett.cannon | set | dependencies:
+ PyTokenizer_FindEncoding() never succeeds messages:
+ msg71402 |
2008-08-19 02:34:23 | brett.cannon | set | messages:
+ msg71401 |
2008-08-17 01:58:08 | brett.cannon | set | type: behavior |
2008-08-17 01:57:59 | brett.cannon | set | priority: critical files:
+ pep3120_test.diff messages:
+ msg71258 components:
+ Interpreter Core versions:
+ Python 3.0 |
2008-08-17 01:56:36 | brett.cannon | set | files:
+ fix_latin.diff keywords:
+ patch messages:
+ msg71257 |
2008-08-17 00:39:59 | brett.cannon | set | messages:
+ msg71252 |
2008-08-17 00:39:39 | brett.cannon | create | |