classification
Title: Command line error marker misplaced on unicode entry
Type: behavior
Stage: patch review
Components: Interpreter Core
Versions: Python 3.2
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: [Py3k] SyntaxError cursor shifted if multibyte character is in line. (#2382)
Assigned To: belopolsky
Nosy List: belopolsky, ezio.melotti, lemburg, loewis, vstinner
Priority: normal
Keywords: patch

Created on 2010-11-10 19:34 by belopolsky, last changed 2013-06-10 20:37 by belopolsky. This issue is now closed.

Files
File name Uploaded
issue10382.diff belopolsky, 2010-11-11 00:04
issue10382a.diff belopolsky, 2010-11-11 23:06
Messages (5)
msg120930 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-10 19:34
>>> ¡™£¢∞§¶•ªº
  File "<stdin>", line 1
    ¡™£¢∞§¶•ªº
                          ^
SyntaxError: invalid character in identifier


It looks like the caret position is computed with strlen(), which counts bytes of the encoded line, instead of the number of characters in the decoded string.
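
To illustrate the mismatch described above (this sketch is not part of the original report): the ten characters entered at the prompt take more than ten bytes once encoded, so an offset measured in bytes puts the caret too far to the right. The sketch assumes the line is held as UTF-8; the variable names are illustrative only.

```python
# The line typed at the prompt in the report above.
line = "¡™£¢∞§¶•ªº"

# Assumption: the interpreter's internal buffer holds the line as UTF-8.
encoded = line.encode("utf-8")

char_count = len(line)     # number of decoded characters
byte_count = len(encoded)  # what a strlen()-style count would see

print("characters:", char_count)
print("UTF-8 bytes:", byte_count)
```

Every character in this line needs two or three bytes in UTF-8, which is why the byte count is more than double the character count.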
msg120933 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-11 00:04
I am attaching a patch that seems to fix the issue.  Note that I considered fixing the problem in parsetok.c where offset is originally computed, but this is part of pgen which has to be compiled without unicode support.

The test case suitable to be included in unittests is:

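try:
    eval(b'\xc2\xa1'.decode('utf-8'))  # a single two-byte character
except SyntaxError as err:
    # the offset should count characters, not UTF-8 bytes
    assert err.offset == 1
else:
    raise AssertionError('SyntaxError not raised')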
msg120941 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-11-11 08:53
See also #2382: I wrote patches two years ago for this issue.
msg120982 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-11-11 23:05
haypo> See also #2382: I wrote patches two years ago for this issue.

Yes, this is the same issue.  I don't want to close this as a duplicate because #2382 contains a much more ambitious set of patches.  What I am trying to achieve here is similar to the adjust_offset.patch there.

I am attaching a patch that takes an alternative approach and computes the number of characters in the parser.  I strongly believe that the buffer in the tokenizer always contains UTF-8 encoded text.  If it is not so already, I would consider making it so by replacing a call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String(), if that matters.

The patch still needs unittests and possibly has some off-by-one issues, but I would like to get to an agreement that this is the right level at which the problem should be fixed first.
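
The idea of deriving a character count from a UTF-8 buffer can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code in either attached patch: the helper name byte_offset_to_char_offset is hypothetical, and it assumes the offset is a 1-based byte position into a UTF-8 encoded line.

```python
def byte_offset_to_char_offset(line_utf8: bytes, byte_offset: int) -> int:
    """Convert a 1-based byte offset into a 1-based character offset.

    Hypothetical helper: assumes line_utf8 is UTF-8 encoded text.
    """
    # Bytes that precede the error position (none if the offset is <= 1).
    prefix = line_utf8[:max(byte_offset - 1, 0)]
    # Decoding the prefix counts whole characters instead of bytes.
    # errors="replace" keeps this from raising if the offset happens
    # to fall in the middle of a multibyte sequence.
    return len(prefix.decode("utf-8", errors="replace")) + 1


# '¡' occupies two bytes, so 'x' sits at byte 3 but is character 2.
print(byte_offset_to_char_offset("¡x".encode("utf-8"), 3))
```

Decoding the prefix is the simplest way to express this in Python; a C implementation would more likely walk the buffer and skip UTF-8 continuation bytes.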
msg190931 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2013-06-10 20:37
The latest patch at #2382 is simpler than mine, so I am closing this as a duplicate.
History
Date User Action Args
2013-06-10 20:37:57 belopolsky set status: open -> closed
    superseder: [Py3k] SyntaxError cursor shifted if multibyte character is in line.
    resolution: duplicate
    messages: + msg190931
2010-11-11 23:06:14 belopolsky set files: + issue10382a.diff
2010-11-11 23:05:52 belopolsky set messages: + msg120982
2010-11-11 08:53:41 vstinner set messages: + msg120941
2010-11-11 01:37:09 belopolsky link issue10384 dependencies
2010-11-11 00:17:27 belopolsky set nosy: + loewis
2010-11-11 00:04:06 belopolsky set files: + issue10382.diff
    messages: + msg120933
    assignee: belopolsky
    keywords: + patch
    stage: needs patch -> patch review
2010-11-10 20:57:44 belopolsky set nosy: + lemburg, vstinner, ezio.melotti
2010-11-10 19:34:23 belopolsky create