classification
Title: [Py3k] No text shown when SyntaxError (when not UTF8)
Type: behavior Stage:
Components: None Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: loewis, ocean-city
Priority: normal Keywords:

Created on 2008-03-16 13:37 by ocean-city, last changed 2008-03-17 20:45 by loewis. This issue is now closed.

Messages (9)
msg63576 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-03-16 13:37
The following code

# coding: utf-8
print "年"

outputs

C:\Documents and Settings\WhiteRabbit>py3k b.py
  File "b.py", line 3
    print "年"

as expected, but the following code

# coding: cp932
print "年"

outputs

C:\Documents and Settings\WhiteRabbit>py3k a.py
  File "a.py", line 4
    [22605 refs]

Probably this happens because PyUnicode_DecodeUTF8 at
Python/pythonrun.c(1757) assumes err->text to be UTF-8, but this is not
true when the source file is not encoded in UTF-8.

# Sorry there is no patch.
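
The hypothesis above can be illustrated in plain Python. This is only a sketch, not what pythonrun.c does internally: it assumes the error text still holds the raw cp932 bytes of the source line, and the variable names are made up for illustration.

```python
# Bytes of the offending line as they would sit in a cp932-encoded file
# (assumption: err->text carries these bytes unchanged).
cp932_line = 'print "年"'.encode("cp932")

# A strict UTF-8 decode rejects those bytes, so no text can be shown.
try:
    cp932_line.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The same line encoded as UTF-8 decodes without trouble.
utf8_line = 'print "年"'.encode("utf-8")
roundtrip = utf8_line.decode("utf-8")
```

This matches the report in spirit: the UTF-8 file shows its source line, while the cp932 file does not.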
msg63578 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-03-16 14:47
Probably the same problem exists in PyErr_ProgramText().
msg63581 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-03-16 15:47
This will involve quite some work to fix. When fetching the code, the 
source encoding must be recognized. Contributions are welcome.
(I personally consider this issue minor, as I would encourage users to use 
UTF-8 as the source encoding, anyway).
msg63628 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-03-17 08:56
Hello. I tracked down the source code and found where err->text is set.

Index: Parser/parsetok.c
===================================================================
--- Parser/parsetok.c	(revision 61411)
+++ Parser/parsetok.c	(working copy)
@@ -218,7 +218,7 @@
 			assert(tok->cur - tok->buf < INT_MAX);
 			err_ret->offset = (int)(tok->cur - tok->buf);
 			len = tok->inp - tok->buf;
-			text = PyTokenizer_RestoreEncoding(tok, len, &err_ret->offset);
+/*			text = PyTokenizer_RestoreEncoding(tok, len, &err_ret->offset); */
 			if (text == NULL) {
 				text = (char *) PyObject_MALLOC(len + 1);
 				if (text != NULL) {

It seems tok->buf is encoded in UTF-8, and
PyTokenizer_RestoreEncoding() restores it to the original encoding of the
source file. So I tried the above patch, and the output was as expected on
cp932/euc_jp source files.

Maybe this function is not needed in py3k? I cannot find any other place
where this function is used.

# Probably PyErr_ProgramText() needs more effort to be fixed.
msg63633 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-03-17 11:18
You are probably right about the source of the problem; I was confusing 
it with a regular exception, e.g.

print("年",a)

However, I also fail to reproduce the problem on OSX. I get

  File "a.py", line 3
    print "�N"
             ^
SyntaxError: invalid syntax

I'm not quite sure what the N is doing in there, but the first character 
is the replacement character (hopefully, the tracker will reproduce it 
correctly); I get that because pythonrun uses the "replace" codec.

I guess you are not seeing it because then the replacement character 
cannot actually be output to your terminal. Please try

print("\ufffd")

to see what that does.
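
One possible explanation for the stray N, offered as a sketch: it assumes the bytes reaching the decoder are the cp932 encoding of the character and that they are decoded as UTF-8 with the "replace" error handler. The names below are illustrative only.

```python
# cp932 encodes the character as two bytes; the second byte happens to be
# the ASCII code for "N".
raw = "年".encode("cp932")

# Decoding as UTF-8 with "replace": the first byte is not a valid UTF-8
# start byte and becomes U+FFFD, while the second byte survives as "N".
shown = raw.decode("utf-8", errors="replace")
```

Under that assumption, `shown` is the replacement character followed by N, which is what the OSX output displays between the quotes.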
msg63636 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-03-17 13:30
> I was confusing it with a regular exception, e.g.
> print("年",a)

I'm now investigating this problem. This comes from another reason.
Please look at fp_setreadl in Parser/tokenizer.c.
This function opens the file using a codec and doesn't seek to the current
position. (fp_setreadl is used when the codec is neither utf-8 nor
iso-8859-1 .... tok->decoding_state == STATE_NORMAL)

So

# coding: ascii
# 1
# 2
# 3
raise RuntimeError("a")
# 4
# 5
# 6

outputs 

C:\Documents and Settings\WhiteRabbit>py3k ascii.py

Traceback (most recent call last):
  File "ascii.py", line 6, in <module>
    # 4
RuntimeError: a
[22821 refs]

# One line shifted.

And

# dummy
# coding: ascii
# 1
# 2
# 3
raise RuntimeError("a")
# 4
# 5
# 6

outputs

C:\Documents and Settings\WhiteRabbit>py3k ascii.py

Traceback (most recent call last):
  File "ascii.py", line 8, in <module>
    # 5
RuntimeError: a
[22821 refs]

# Two lines shifted.
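
The shift can be reproduced with a small model in Python. This is a sketch of the behaviour described above, not the tokenizer's actual code: it assumes the line counter keeps running while the file is reopened from the start once the coding line has been read. The helper `reported_line` is hypothetical.

```python
import io

def reported_line(source: str, target: str) -> int:
    """Line number this model assigns to the first line containing target."""
    lineno = 0
    # First pass: read raw lines until the coding declaration is seen.
    for line in io.StringIO(source):
        lineno += 1
        if target in line:
            return lineno
        if "coding:" in line:
            break
    # Modelled bug: reopen through the codec at offset 0 instead of seeking
    # to the current position, while the line counter keeps going.
    for line in io.StringIO(source):
        lineno += 1
        if target in line:
            return lineno
    raise ValueError("target not found")

body = '# 1\n# 2\n# 3\nraise RuntimeError("a")\n# 4\n# 5\n# 6\n'
first = "# coding: ascii\n" + body
second = "# dummy\n# coding: ascii\n" + body

line_first = reported_line(first, "raise")
line_second = reported_line(second, "raise")

# The traceback then prints whatever really sits on the reported line.
text_first = first.splitlines()[line_first - 1]
text_second = second.splitlines()[line_second - 1]
```

In this model the first file reports line 6 with the text "# 4" and the second reports line 8 with "# 5", the same shifts as in the two tracebacks.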
msg63639 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-03-17 13:36
>However, I also fail to reproduce the problem on OSX. I get
>
>  File "a.py", line 3
>    print "�N"
>             ^
>SyntaxError: invalid syntax

Umm, strange... I get the correct output even when
using euc_jp (my terminal, the command prompt, cannot
output euc_jp strings directly, AFAIK)

> print("\ufffd")

>>> print("\ufffd")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "e:\python-dev\py3k\lib\io.py", line 1247, in write
    b = encoder.encode(s)
UnicodeEncodeError: 'cp932' codec can't encode character '\ufffd' in position 0: illegal multibyte sequence
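
The same failure can be checked without a terminal. A minimal sketch, assuming only the standard cp932 codec; the variable names are illustrative.

```python
# U+FFFD has no mapping in cp932, so a strict encode raises.
try:
    "\ufffd".encode("cp932")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# With the "replace" error handler the codec substitutes "?" instead.
fallback = "\ufffd".encode("cp932", errors="replace")
```

This suggests why a cp932 console cannot display the replacement character that shows up in the OSX output.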
msg63641 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-03-17 13:42
>I'm now investigating this problem. This comes from another reason.
Of course, even if this line number problem is fixed, the encoding
problem still remains. Probably I'll look at it next.
msg63767 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-03-17 20:45
The original issue is now fixed in r61462. Please open another issue for 
the case of regular exceptions.
History
Date User Action Args
2008-03-17 20:45:08 loewis set status: open -> closed, resolution: fixed, messages: + msg63767
2008-03-17 13:42:20 ocean-city set messages: + msg63641
2008-03-17 13:36:26 ocean-city set messages: + msg63639
2008-03-17 13:30:16 ocean-city set messages: + msg63636
2008-03-17 11:18:17 loewis set messages: + msg63633
2008-03-17 08:56:14 ocean-city set messages: + msg63628
2008-03-16 15:47:28 loewis set nosy: + loewis, messages: + msg63581
2008-03-16 14:47:13 ocean-city set messages: + msg63578
2008-03-16 13:38:22 ocean-city set title: [Py3k] -> [Py3k] No text shown when SyntaxError (when not UTF8)
2008-03-16 13:37:32 ocean-city create