[Py3k] SyntaxError cursor shifted if multibyte character is in line. #46635
Comments
Hello. I found another problem related to bpo-2301. I think this is because err->text is stored as UTF-8, so "^" is shifted right by 5 bytes because there are 5 multibyte chars.
C:\Documents and Settings\WhiteRabbit>py3k x.py
push any key....
File "x.py", line 3
print "あいうえお"
^
SyntaxError: invalid syntax
[22567 refs]
Sorry, I didn't know what PyTokenizer_RestoreEncoding was really doing.
C:\Documents and Settings\WhiteRabbit>py a.py
File "a.py", line 2
x "、「、、、ヲ、ィ、ェ"
^
SyntaxError: invalid syntax
[8728 refs]
I tried to fix this problem, but I'm not sure how to fix it. |
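To illustrate the units mismatch suspected in the report, the sketch below compares a character index with the number of bytes that precede it once the line is encoded. The helper name `bytes_before` and the choice of the end of the line as the error position are assumptions made for illustration. The thread does not establish which encoding the offset was actually measured in, so both UTF-8 and cp932 are shown.

```python
def bytes_before(line, char_offset, encoding="utf-8"):
    """Return how many bytes precede character index char_offset once line is encoded."""
    return len(line[:char_offset].encode(encoding))


line = 'print "あいうえお"'
char_offset = len(line)  # assumed error position: just past the closing quote

for encoding in ("utf-8", "cp932"):
    # A caret indented by the byte count lands to the right of the character position.
    print(encoding, char_offset, bytes_before(line, char_offset, encoding))
```

Any non-ASCII character before the error position makes the two numbers diverge, which is why a byte offset cannot be used directly as a column.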
Quick observation... Attached as experimental patch of solution 2. Looks ugly, but |
This assumption still lives, but I cannot find better solution. |
Patch revised. |
I think that your patch works only for terminals where one byte of the In the attached patch, I tried to write some unit tests, (I had to adapt |
You are right, this issue is more difficult than I thought... Maybe we can use |
For the moment, I'd suggest that one unicode character has the same Then the C implementation could do something similar to the statements I |
Amaury, if doing so, the cursor will shift left by 5 columns on my print "あいうえお"
^ |
This seems to be a difficult problem. Doesn't the exact width depend on An easy way to put the caret at the same exact position is to repeat the At least my "one unicode char is one space" suggestion corrects the case |
See also a related issue: bpo-3975. |
I'm not happy with this solution. ;-(
I have to admit you are right. Nevertheless, I got coLinux(Debian) which has localed wcswidth(3), so I The strategy is ...
This patch ignores file encoding. Again, this patch is experimental, P.S.
I tested this patch on coLinux with ja_JP.UTF-8 locale and manual
#define HAVE_WCSWIDTH 1
because I don't know how to change configure script. |
Experimental patch was experimental, wcswidth(3) returns 1 for East
debian:~/python-dev/py3k# ./python /mnt/windows/a.py
File "/mnt/windows/a.py", line 3
"♪xÅx" abc
^ should point to 'c'.
And another one
debian:~/python-dev/py3k# export LANG=C
debian:~/python-dev/py3k# ./python /mnt/windows/a.py
File "/mnt/windows/a.py", line 3
"\u266ax\u212bx" abc
^
SyntaxError: invalid syntax
Please forget my patch. :-( |
This issue is a problem of units. The error text is an utf8 *byte* It's already possible to get (2) from the utf8 string, and code from I will try to implement that. |
Resolution of this may be applicable to bpo-3446 as well. |
Proof of concept of patch fixing this issue:
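def utf8_to_unicode_offset(text, byte_offset):
    # Convert an offset counted in UTF-8 bytes into an offset counted in characters.
    utf8 = text.encode("utf-8")
    utf8 = utf8[:byte_offset]
    # Note: decoding raises UnicodeDecodeError if byte_offset falls inside a character.
    text = str(utf8, "utf-8")
    return len(text)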
The patch should be refactored:
|
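A possible variant of the proof of concept above, sketched under the assumption that the byte offset may fall inside a multibyte sequence. The function name `utf8_offset_to_char_offset` is hypothetical, and decoding with `errors="ignore"` is a choice made for this sketch, not something taken from the attached patches.

```python
def utf8_offset_to_char_offset(text, byte_offset):
    """Map an offset in UTF-8 bytes to an offset in characters.

    A trailing partial character is dropped instead of raising
    UnicodeDecodeError (assumption: rounding down is acceptable).
    """
    prefix = text.encode("utf-8")[:byte_offset]
    return len(prefix.decode("utf-8", errors="ignore"))


line = 'print "あいうえお"'
# Byte offset 8 points into the middle of the first Hiragana character.
for byte_offset in (7, 8, 10, 23):
    print(byte_offset, utf8_offset_to_char_offset(line, byte_offset))
```

Rounding down keeps the caret on the last complete character, which seems a reasonable fallback when the offset itself is suspect.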
For an easier review, I split my patch into multiple small patches:
Dependencies:
Changes since bpo-2382.patch:
|
Comments about my own patches. unicode_width.patch:
adjust_offset.patch:
print_exception.patch:
|
I just created the issue bpo-12568 for unicode_width.patch. |
What's the status of this issue? FWIW, this is not only a problem with East Asian characters:
>>> ä äää
File "<stdin>", line 1
ä äää
^
SyntaxError: invalid syntax |
Here is a patch upgraded to Python 3.3. It uses a slightly different approach and works with invalidly encoded data. unicode_utf8size.patch is not needed. This patch fixes half of the issue: working with non-ASCII, non-wide characters. That is enough for many people. Let's commit it and go further. |
The purpose of this issue is to handle CJK characters taking 2 columns instead of 1 in a terminal, or did I misunderstand it? |
haypo> The purpose of this issue is to handle CJK characters taking 2
haypo> columns instead of 1 in a terminal, or did I misunderstand it?
That's the other half of the problem, but the more common issue is a misplaced caret when non-ASCII characters are present:
>>> ¡™£¢∞§¶•ªº
File "<stdin>", line 1
¡™£¢∞§¶•ªº
^
SyntaxError: invalid character in identifier
With Serhiy's patch:
>>> ¡™£¢∞§¶•ªº
File "<stdin>", line 1
¡™£¢∞§¶•ªº
^
SyntaxError: invalid character in identifier |
Serhiy's patch is lacking tests, but it passes the test I proposed at bpo-10382 and am attaching here. |
Added tests. I think it is worth applying this patch, which fixes the issue for most Europeans, and then continuing to work on the issue of wide characters. |
If no one complains, I'll commit the last patch tomorrow. |
New changeset eb7565c212f1 by Serhiy Storchaka in branch '3.3':
New changeset ea34b2b0b8ae by Serhiy Storchaka in branch 'default': |
The issue bpo-10384 has been marked as a duplicate of this issue. It is a similar problem: an identifier that contains an invisible character. |
The original problem is still present:
Python 3.5.0a0 (default:5313b4c0bb6c, Sep 30 2014, 18:55:45)
>>> A_I_U_E_O$ = None
File "<stdin>", line 1
A_I_U_E_O$ = None
^
SyntaxError: invalid syntax
Replace A_I_U_E_O above with the Japanese script. I get a codec error from the server when I try to paste my session as is. (Note that the invalid character is $ above and not the Japanese AIUEO.) Another outstanding issue is with zero-width characters. See bpo-10384. |
IDLE avoids the problem of calculating a location for a '^' below the bad line by instead asking tk to give the marked character (and maybe more) an 'ERROR' tag, which shows as a red background. So it marks the '$' of 'A_I_U_E_O$' and the 'alid' slice of 'inv\u200balid' (from duplicate bpo-10384). When the marked character is '\n', the space following the line is tagged. Is it possible to do something similar with any of the major system consoles? |
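One possible console analogue of the tagging approach described above, assuming a terminal that honors ANSI SGR escape sequences (many do, but this is not guaranteed everywhere). The helper name `highlight_span` is hypothetical, and nothing here is taken from IDLE or tk. It is only a sketch of marking the offending slice in place instead of computing a caret column.

```python
RED_BACKGROUND = "\x1b[41m"  # SGR 41: red background
RESET = "\x1b[0m"            # SGR 0: reset all attributes


def highlight_span(line, start, end):
    """Wrap line[start:end] in escape codes instead of drawing a caret below it."""
    if start == end:
        # Nothing to mark (e.g. the error is at the end of the line):
        # tag one trailing space, similar to the '\n' case described above.
        return line + RED_BACKGROUND + " " + RESET
    return line[:start] + RED_BACKGROUND + line[start:end] + RESET + line[end:]


print(highlight_span("A_I_U_E_O$ = None", 9, 10))  # marks the '$'
print(highlight_span("inv\u200balid", 4, 8))       # marks the 'alid' slice
```

Because the marker travels with the text, the result does not depend on how wide the terminal draws the preceding characters.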
I think it has been fixed by now. On main (3.12) I get:
|
I tried the examples in the comments above and the output looks correct to me. Note however that there are two different issues. The first is about multibyte characters that have regular width (like
>>> ä äää
File "<stdin>", line 1
ä äää
^^^
SyntaxError: invalid syntax
The second issue affects characters that, when displayed, are wider than normal (regardless of the number of bytes used to represent them):
>>> A_I_U_E_O$ = None
File "<stdin>", line 1
A_I_U_E_O$ = None
^
SyntaxError: invalid character '$' (U+FF04)
>>> print("あいうえお"$)
File "<stdin>", line 1
print("あいうえお"$)
^
SyntaxError: invalid syntax
For these, the cursor is preceded by the correct number of spaces, but since the characters are wider than usual (even with a monospace font), the output looks misaligned. IIUC this report is about the first issue, which is now fixed, so we can close this. |
The original error is still present.
This issue was not closed in 2014, because it consists of two parts, and my patch only fixed the first part. It now mostly works for many European languages (unless you use combining characters), but not for the CJK languages with double width characters and not for languages with a wide use of zero width characters. |
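The remaining half of the problem (double-width and zero-width characters) can be approximated with the standard `unicodedata` module. The sketch below is a heuristic, not the fix that was eventually applied: the helper names `display_width` and `caret_line` are hypothetical, East Asian "Ambiguous" characters are assumed to occupy one column, and real terminals and fonts may disagree with these widths.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns text occupies (heuristic)."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            # Combining marks and format characters such as ZERO WIDTH SPACE
            # are assumed to take no column.
            continue
        # Wide (W) and Fullwidth (F) characters are assumed to take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def caret_line(line, char_offset):
    """Build the marker line for an error at character index char_offset."""
    return " " * display_width(line[:char_offset]) + "^"


line = 'print("あいうえお"\uff04)'
print(line)
print(caret_line(line, 13))  # index 13 is the fullwidth dollar sign
```

Padding by estimated columns instead of by character count is what moves the caret under the right glyph when the preceding text contains wide or invisible characters.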
If you are referring to the fact that In the example you posted, there are 13 characters and thus 13
>>> print "あいうえお" # 13 ASCII/Hiragana characters
File "<stdin>", line 1
print "あいうえお"
^^^^^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
The alignment doesn't match simply because -- even with a monospace font -- the Hiragana characters are wider than the ASCII characters. If the
>>> print "AIUEO" # 13 ASCII characters
File "<stdin>", line 1
print "AIUEO"
^^^^^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
If instead we want to use a number of If the Hiragana characters were twice as wide we could in theory duplicate the number of
I also suspect the exact output depends both on the available (and used) fonts and on the rendering system (that might e.g. decide to increase the width of the ASCII characters to match that of the widest character in the string). For characters that actually are twice as wide, or for zero-width characters, we might in theory duplicate or remove the
>>> print 'AIUEO' # 17 ASCII/ZERO WIDTH SPACE characters
File "<stdin>", line 1
print 'AIUEO'
^^^^^^^^^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
Arguably having the number of |
Superseded by and fixed in gh-102310. |