Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Py3k] SyntaxError cursor shifted if multibyte character is in line. #46635

Closed
ocean-city mannequin opened this issue Mar 18, 2008 · 34 comments
Closed

[Py3k] SyntaxError cursor shifted if multibyte character is in line. #46635

ocean-city mannequin opened this issue Mar 18, 2008 · 34 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@ocean-city
Copy link
Mannequin

ocean-city mannequin commented Mar 18, 2008

BPO 2382
Nosy @terryjreedy, @amauryfa, @abalkin, @vstinner, @ezio-melotti, @akheron, @serhiy-storchaka
Files
  • py3k_adjust_cursor_at_syntax_error.patch
  • traceback_adjust_cursor.patch
  • py3k_adjust_cursor_at_syntax_error_v2.patch
  • issue2382.patch
  • unicode_utf8size.patch
  • unicode_width.patch
  • adjust_offset.patch
  • print_exception.patch
  • test.py
  • adjust_offset_2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2008-03-18.05:22:30.823>
    labels = ['interpreter-core', 'type-bug']
    title = '[Py3k] SyntaxError cursor shifted if multibyte character is in line.'
    updated_at = <Date 2018-08-18.21:23:35.490>
    user = 'https://bugs.python.org/ocean-city'

    bugs.python.org fields:

    activity = <Date 2018-08-18.21:23:35.490>
    actor = 'terry.reedy'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Interpreter Core']
    creation = <Date 2008-03-18.05:22:30.823>
    creator = 'ocean-city'
    dependencies = []
    files = ['11548', '11670', '11707', '13354', '13356', '13357', '13358', '13359', '30534', '31874']
    hgrepos = []
    issue_num = 2382
    keywords = ['patch']
    message_count = 29.0
    messages = ['63895', '63904', '64156', '73539', '74106', '74114', '74119', '74129', '74148', '74149', '74361', '74362', '83636', '83659', '83700', '83712', '83714', '140377', '149960', '172500', '172525', '190933', '190934', '198419', '208115', '208697', '228027', '228034', '323734']
    nosy_count = 11.0
    nosy_names = ['terry.reedy', 'amaury.forgeotdarc', 'belopolsky', 'vstinner', 'ocean-city', 'LambertDW', 'ezio.melotti', 'Arfrever', 'python-dev', 'petri.lehtinen', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2382'
    versions = ['Python 3.3', 'Python 3.4']

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 18, 2008

    Hello. I found another problem related to bpo-2301.
    SyntaxError cursor "^" is shifted when multibyte
    characters are in line (before "^").

    I think this is because err->text is stored as UTF-8
    which requires 3 bytes for multibyte character,
    but actually cp932 (my console encoding) requires only 2 bytes for it.

    So "^" is shited to right 5 bytes because there is 5 multibyte chars.

    C:\\Documents and Settings\\WhiteRabbit\>py3k x.py
    push any key....
    
      File "x.py", line 3
        print "あいうえお"
                              ^
    SyntaxError: invalid syntax
    [22567 refs]

    Sorry, I didn't know what PyTokenizer_RestoreEncoding really doing.
    That function adjusted err_ret->offset for this encoding conversion.
    So, Python2.5 can output cursor in right place. (Of course, if source
    encoding is not compatible for console encoding, broken string is printed
    though. Anyway, cursor is right)

    C:\\Documents and Settings\\WhiteRabbit\>py a.py
      File "a.py", line 2
        x "、「、、、ヲ、ィ、ェ"
                     ^
    SyntaxError: invalid syntax
    [8728 refs]

    I tried to fix this problem, but I'm not sure how to fix this.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 18, 2008

    I tried to fix this problem, but I'm not sure how to fix this.

    Quick observation...

    ///////////////////////////////////
    // Possible Solution

    1. Convert err->text to console compatible encoding (not to source
      encoding like in python2.x) where PyTokenizer_RestoreEncoding is there.

    2. err->text is UTF-8, actual output is done in
      Python/pythonrun.c(print_error_text), so adjust offset there.

    ///////////////////////////////////
    // Solution requires...
    1.

    • PyUnicode_DecodeUTF8 in Python/pythonrun.c(err_input) should
      be changed to some kind of "bytes" API.

    • The way to write "bytes" to File object directly is needed.

    • The way to know actual byte length of given unicode + encoding.

    ////////////////////////////////////////////////////
    // Experimental patch

    Attached as experimental patch of solution 2. Looks agly, but
    seems working on my environment.
    (I assumed get_length_in_bytes(f, " ", 1) == 1 but I'm not sure
    this is always true in other platforms. Probably nicer and more
    general solution may exist)

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Mar 20, 2008

    (I assumed get_length_in_bytes(f, " ", 1) == 1 but I'm not sure
    this is always true in other platforms. Probably nicer and more
    general solution may exist)

    This assumption still lives, but I cannot find better solution.
    I'm thinking now attached patch is good enough.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Sep 21, 2008

    Patch revised.

    @ocean-city ocean-city mannequin added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Sep 21, 2008
    @amauryfa
    Copy link
    Member

    I think that your patch works only for terminals where one byte of the
    encoded text is displayed as one character on the terminal. This is not
    true for utf-8 terminals, for example.

    In the attached patch, I tried to write some unit tests, (I had to adapt
    the traceback module as well), and one test still fails because the
    captured stderr has a utf-8 encoding.
    I think that it's better to count unicode characters.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Oct 1, 2008

    You are right, this issue is more difficult than I thought...
    I found wcswidth(3), if this function is available we can use this
    function, but unfortunately there is no such function in VC6 and this
    function is meaningless on cygwn, so I cannot test it. ;-(

    Maybe we can use
    import unicodedata
    unicodedata.east_asian_width()
    but I need to investigate more.

    @amauryfa
    Copy link
    Member

    amauryfa commented Oct 1, 2008

    For the moment, I'd suggest that one unicode character has a the same
    with as the space character, assuming that stdout.encoding correctly
    matches the terminal.

    Then the C implementation could do something similar to the statements I
    added in traceback.py:
    offset = len(line.encode('utf-8')[:offset].decode('utf-8'))

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Oct 1, 2008

    Amaury, if doing so, the cursor will shift left by 5 columns on my
    environment like this, no? ("あ" requires 2 columns for example)

        print "あいうえお"
                    ^

    @amauryfa
    Copy link
    Member

    amauryfa commented Oct 1, 2008

    This seems to be a difficult problem. Doesn't the exact width depend on
    the terminal capabilities? and fonts, and combining diacritics...

    An easy way to put the caret at the same exact position is to repeat the
    beginning of the line up to the offending offset:
    print "あいうえお"
    print "あいうえお^<------------------
    But I don't know how to make it look less ugly.

    At least my "one unicode char is one space" suggestion corrects the case
    of Western languages, and all messages with single-width characters.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 2, 2008

    See also a related issue: bpo-3975.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Oct 6, 2008

    At least my "one unicode char is one space" suggestion corrects the case
    of Western languages, and all messages with single-width characters.

    I'm not happy with this solution. ;-(

    Doesn't the exact width depend on
    the terminal capabilities? and fonts, and combining diacritics...

    I have to admit you are right.

    Nevertheless, I got coLinux(Debian) which has localed wcswidth(3), so I
    created another experimental patch.
    (py3k_adjust_cursor_at_syntax_error_v2.patch)

    The strategy is ...

    1. Try to convert to unicode. If fails, nothing changed to offset.
    2. If system has wcswidth(3), try that function
    3. If system is windows, try WideCharToMultibyte with CP_ACP
    4. If above 2/3 fails or system is others, use unicode length as offset
      (Amaury's suggestion)

    This patch ignores file encoding. Again, this patch is experimental,
    best effort, but maybe better than current state.

    P.S.
    I tested this patch on coLinux with ja_JP.UTF-8 locale and manual
    #define HAVE_WCSWIDTH 1
    because I don't know how to change configure script.

    @ocean-city
    Copy link
    Mannequin Author

    ocean-city mannequin commented Oct 6, 2008

    Experimental patch was experimental, wcswidth(3) returns 1 for East
    Asian Ambiguous character.

    debian:~/python-dev/py3k# ./python /mnt/windows/a.py
      File "/mnt/windows/a.py", line 3
        "♪xÅx" abc
                 ^

    should point 'c'. And another one

    debian:~/python-dev/py3k# export LANG=C
    debian:~/python-dev/py3k# ./python /mnt/windows/a.py
      File "/mnt/windows/a.py", line 3
        "\\u266ax\\u212bx" abc
                 ^
    SyntaxError: invalid syntax

    Please forget my patch. :-(

    @vstinner
    Copy link
    Member

    This issue is a problem of units. The error text is an utf8 *byte*
    string and offset is a number of *bytes*. The goal is to get the text
    *width* of a *character* string. We have to:
    1- convert offset from bytes number to character number
    2- get the error message as (unicode) characters
    3- get the width of text[:offset]

    It's already possible to get (2) from the utf8 string, and code from
    ocean-city's patch (py3k_adjust_cursor_at_syntax_error_v2.patch) can
    be used for (3). The most difficult point is (1).

    I will try to implement that.

    @lambertdw
    Copy link
    Mannequin

    lambertdw mannequin commented Mar 16, 2009

    Resolution of this may be applicable to bpo-3446 as well.
    "center, ljust and rjust are inconsistent with unicode parameters"

    @vstinner
    Copy link
    Member

    Proof of concept of patch fixing this issue:

    • parse_syntax_error() reads the text line into a PyUnicodeObject*
      instead of a "const char**"
    • create utf8_to_unicode_offset(): convert byte offset to a number of
      characters. The Python version should be something like:
       def utf8_to_unicode_offset(text, byte_offset):
          utf8 = text.encode("utf-8")
          utf8 = utf8[:byte_offset]
          text = str(utf8, "utf-8")
          return len(text)
    • reuse adjust_offset() from
      py3k_adjust_cursor_at_syntax_error_v2.patch, but force the use of
      wcswidth() because HAVE_WCSWIDTH is not defined by configure
    • print_error_text() works on unicode characters and not on bytes!

    The patch should be refactorized:

    • move adjust_offset(), utf8_to_unicode_offset(), utf8_len() in
      unicodeobject.c. You might create a new method "width()" for the
      unicode type. This method can be used to fix center(), ljust() and
      rjust() unicode methods (see issue bpo-3446).

    @vstinner
    Copy link
    Member

    For an easier review, I splitted my patch in multiple small patches:

    • unicode_utf8size.patch: create _PyUnicode_UTF8Size() function:
      Number of bytes needed to encode the unicode character as UTF-8
    • unicode_width.patch: create PyUnicode_Width(): Number of column
      needed to represent the string in the current locale. -1 is returned
      in case of an error.
    • adjust_offset.patch: Change unit of SyntaxError.offset, convert
      utf8 offset to unicode offset
    • print_exception.patch: process error text as an unicode string
      (instead of a byte string), convert offset from characters
      to "columns"

    Dependencies:

    • adjust_offset.patch depends on unicode_utf8size.patch
    • print_exception.patch depends on unicode_width.patch

    Changes since bpo-2382.patch:

    • PyUnicode_Width() doesn't change the locale
    • PyUnicode_Width() uses WideCharToMultiByte() on MS_WINDOWS, and
      wcswidth() otherwise (before: do nothing if HAVE_WCSWIDTH is not
      definied)
    • the offset was converted from utf8 index to unicode index only in
      print_error_text(), not on SyntaxError creation
    • _PyUnicode_UTF8Size() and PyUnicode_Width() are public

    @vstinner
    Copy link
    Member

    Comments about my own patches.

    unicode_width.patch:

    • error messages should be improved:
      ValueError("Unable to compute string width") for Windows
      IOError(strerror(errno)) otherwise

    adjust_offset.patch:

    • format_exception_only() from Lib/traceback.py may need a fix
    • about the documentation: it looks like SyntaxError.offset unit is
      not documentation in exceptions.rst (should it be documented, or
      leaved unchanged?)

    print_exception.patch:

    • i'm not sure of the reference counts (ref leak?)
    • in case of PyUnicode_FromUnicode(text, textlen) error,

    >PyFile_WriteObject(textobj, f, Py_PRINT_RAW);
    PyFile_WriteString("\n", f);<< is used to display the line but textobj
    may already ends with \n.

    • format_exception_only() from Lib/traceback.py should do the same job
      than fixed print_exception(): get the string width (to fix this
      issue!)

    @vstinner
    Copy link
    Member

    I just created the issue bpo-12568 for unicode_width.patch.

    @akheron
    Copy link
    Member

    akheron commented Dec 21, 2011

    What's the status of this issue?

    FWIW, this is not only a problem with east asian characters:

    >>> ä äää
      File "<stdin>", line 1
        ä äää
                ^
    SyntaxError: invalid syntax

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch upgraded to Python 3.3. It uses a little different approach and works with invalid encoded data. unicode_utf8size.patch is not needed.

    This patch fixes a half of the issue - working with non-ascii non-wide characters. It's enough for many people. Let's commit it and go further.

    @vstinner
    Copy link
    Member

    vstinner commented Oct 9, 2012

    This patch fixes a half of the issue - working with non-ascii
    non-wide characters.

    The purpose of this issue is to handle CJK characters taking 2 columns instead of 1 in a terminal, or did I misunderstand it?

    @abalkin
    Copy link
    Member

    abalkin commented Jun 10, 2013

    haypo> The purpose of this issue is to handle CJK characters taking 2 haypo> columns instead of 1 in a terminal, or did I misunderstand it?

    That's the other half of the problem, but the more common issue is misplaced caret when non-ascii characters are present:

    >>> ¡™£¢∞§¶•ªº
      File "<stdin>", line 1
        ¡™£¢∞§¶•ªº
                              ^
    SyntaxError: invalid character in identifier

    With Serhiy's patch:

    >>> ¡™£¢∞§¶•ªº
      File "<stdin>", line 1
        ¡™£¢∞§¶•ªº
                 ^
    SyntaxError: invalid character in identifier

    @abalkin
    Copy link
    Member

    abalkin commented Jun 10, 2013

    Serhiy's patch is lacking tests, but it passes the test I proposed at bpo-10382 at attaching here.

    @serhiy-storchaka
    Copy link
    Member

    Added tests. I think it will be worth apply this patch which fixes the issue for most Europeans and than continue working on the issue of wide characters.

    @serhiy-storchaka
    Copy link
    Member

    If no one complain I'll commit last patch tomorrow.

    @serhiy-storchaka serhiy-storchaka self-assigned this Jan 14, 2014
    @serhiy-storchaka serhiy-storchaka added the type-bug An unexpected behavior, bug, or error label Jan 14, 2014
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 21, 2014

    New changeset eb7565c212f1 by Serhiy Storchaka in branch '3.3':
    Issue bpo-2382: SyntaxError cursor "^" now is written at correct position in most
    http://hg.python.org/cpython/rev/eb7565c212f1

    New changeset ea34b2b0b8ae by Serhiy Storchaka in branch 'default':
    Issue bpo-2382: SyntaxError cursor "^" now is written at correct position in most
    http://hg.python.org/cpython/rev/ea34b2b0b8ae

    @serhiy-storchaka serhiy-storchaka removed their assignment Jan 21, 2014
    @vstinner
    Copy link
    Member

    The issue bpo-10384 has been marked as a duplicate of this issue: it's a similar issue, identifier which contains invisible character.

    @abalkin
    Copy link
    Member

    abalkin commented Sep 30, 2014

    The original problem is still present

    Python 3.5.0a0 (default:5313b4c0bb6c, Sep 30 2014, 18:55:45)
    >>> A_I_U_E_O$ = None
      File "<stdin>", line 1
        A_I_U_E_O$ = None
             ^
    SyntaxError: invalid syntax

    Replace A_I_U_E_O above with the Japanese script. I get codec error from the server when I try to paste my session as is.

    (Note that invalid character is $ above and not the Japanese AIUEO.)

    Another outstanding issue is with zero-width characters. See bpo-10384.

    @terryjreedy
    Copy link
    Member

    IDLE avoids the problem of calculating a location for a '^' below the bad line by instead asking tk to give the marked character (and maybe more) a 'ERROR' tag, which shows as a red background. So it marks the '$' of 'A_I_U_E_O$' and the 'alid' slice of 'inv\u200balid' (from duplicate bpo-10384). When the marked character is '\n', the space following the line is tagged. Is it possible to do something similar with any of the major system consoles?

    @iritkatriel
    Copy link
    Member

    The original problem is still present

    Python 3.5.0a0 (default:5313b4c0bb6c, Sep 30 2014, 18:55:45)
    >>> A_I_U_E_O$ = None
      File "<stdin>", line 1
        A_I_U_E_O$ = None
             ^
    SyntaxError: invalid syntax
    

    I think it has been fixed by now. On main (3.12) I get:

    >>> A_I_U_E_O$ = None
      File "<stdin>", line 1
        A_I_U_E_O$ = None
                 ^
    SyntaxError: invalid syntax
    
    

    @ezio-melotti
    Copy link
    Member

    I tried the examples in the comments above and the output look correct to me.

    Note however that there are two different issues. The first is about multibyte characters that have regular width (like ä) -- and that issue is fixed:

    >>> ä äää
      File "<stdin>", line 1
        ä äää
          ^^^
    SyntaxError: invalid syntax

    The second issue affects characters that, when displayed, are wider than normal (regardless of the number of bytes used to represent them):

    >>> A_I_U_E_O= None
      File "<stdin>", line 1
        A_I_U_E_O= None
                 ^
    SyntaxError: invalid character '$' (U+FF04)
    >>> print("あいうえお"$)
      File "<stdin>", line 1
        print("あいうえお"$)
                     ^
    SyntaxError: invalid syntax

    For these, the cursor is preceded by the correct amount of spaces, but since the characters are wider than usual (even with a monospace font), the output looks misaligned.

    IIUC this report is about the first issue, which is now fixed, so we can close this.
    I'm not sure if the second issue is something we can/should fix on Python's side.

    @serhiy-storchaka
    Copy link
    Member

    The original error is still present.

    >>> print "あいうえお"
      File "<stdin>", line 1
        print "あいうえお"
        ^^^^^^^^^^^^^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
    

    This issue was not closed in 2014, because it consists of two parts, and my patch only fixed the first part. It now mostly works for many European languages (unless you use combining characters), but not for the CJK languages with double width characters and not for languages with a wide use of zero width characters.

    @ezio-melotti
    Copy link
    Member

    If you are referring to the fact that print "あいうえお" is longer than ^^^^^^^^^^^^^, is there even any way to fix this on our end?

    In the example you posted, there are 13 characters and thus 13 ^ in the error message:

    >>> print "あいうえお"  # 13 ASCII/Hiragana characters
      File "<stdin>", line 1
        print "あいうえお"
        ^^^^^^^^^^^^^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?

    The alignment doesn't match simply because -- even with a monospace font -- the Hiragana character are wider than the ASCII characters.

    If the ^s are simply supposed to match the number of characters, then this is no different from:

    >>> print "AIUEO"  # 13 ASCII characters
      File "<stdin>", line 1
        print "AIUEO"
        ^^^^^^^^^^^^^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?

    If instead we want to use a number of ^s that aligns nicely with the Hiragana characters, then it's more complicated.

    If the Hiragana characters were twice as wide we could in theory duplicate the number of ^ (even though it might be confusing to have two ^^ for a single character), but they aren't, so we can't do that:

    01234567890123456789
    あいうえおあいうえお
    ^^^^^^^^^^^^^^^^^^^^
    

    I also suspect the exact output depends both on the available (and used) fonts, and by the rendering system (that might e.g. decide to increase the width of the ASCII characters to match the one of the widest character in the string).
    So in order to produce the "right" amount of ^s, Python will have to know the width of the string and divide it by the width of a single ^. In my browser, the example above would need about 16.5 ^.

    For character that actually are twice as wide or zero-width character we might in theory duplicate or remove the ^ for each such character (assuming there is a Unicode property to identify them), and the output will look right. Currently with zero-width characters it looks like:

    >>> print 'AIUEO'  # 17 ASCII/ZERO WIDTH SPACE characters
      File "<stdin>", line 1
        print 'AIUEO'
        ^^^^^^^^^^^^^^^^^
    SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?

    Arguably having the number of ^ matching the number of characters -- in addition to being much easier to handle -- is also better in some situation. In this example it hints to the fact that there are 4 extra invisible characters.

    @arhadthedev
    Copy link
    Member

    Superseded by and fixed in gh-102310.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    9 participants