Issue 10382: Command line error marker misplaced on unicode entry


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54591

classification

Title:	Command line error marker misplaced on unicode entry
Type:	behavior	Stage:	patch review
Components:	Interpreter Core	Versions:	Python 3.2

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	[Py3k] SyntaxError cursor shifted if multibyte character is in line. View: 2382
Assigned To:	belopolsky	Nosy List:	belopolsky, ezio.melotti, lemburg, loewis, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-11-10 19:34 by belopolsky, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue10382.diff	belopolsky, 2010-11-11 00:04		review
issue10382a.diff	belopolsky, 2010-11-11 23:06		review

Messages (5)
msg120930 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-10 19:34
>>> ¡™£¢∞§¶•ªº
  File "<stdin>", line 1
    ¡™£¢∞§¶•ªº
                          ^
SyntaxError: invalid character in identifier

It looks like strlen() is used instead of the number of characters in the decoded string.
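A minimal sketch of the mismatch suspected above, assuming the caret column is derived from the UTF-8 byte length of the line rather than from its character count. The variable names are illustrative and are not taken from the CPython source.

```python
# The line typed at the prompt in the report.
line = "¡™£¢∞§¶•ªº"

# What a C strlen() on the UTF-8 encoded buffer would measure: bytes.
byte_count = len(line.encode("utf-8"))

# What the caret position should be based on: decoded characters (code points).
char_count = len(line)

print(byte_count, char_count)
```

Every character in this line takes two or three bytes in UTF-8, so the byte count is more than double the character count, which would push the caret well past the end of the displayed line.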
msg120933 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-11 00:04
I am attaching a patch that seems to fix the issue. Note that I considered fixing the problem in parsetok.c where offset is originally computed, but this is part of pgen which has to be compiled without unicode support.

The test case suitable to be included in unittests is:

try:
    eval(b'\xc2\xa1'.decode('utf-8'))
except SyntaxError as err:
    assert(err.offset == 1)
msg120941 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-11-11 08:53
See also #2382: I wrote patches two years ago for this issue.
msg120982 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2010-11-11 23:05
haypo> See also #2382: I wrote patches two years ago for this issue.

Yes, this is the same issue. I don't want to close this as a duplicate because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to the adjust_offset.patch there.

I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8 encoded text. If it is not so already, I would consider making it so by replacing a call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String(). (if that matters)

The patch still needs unittests and possibly has some off-by-one issues, but I would like to get to an agreement that this is the right level at which the problem should be fixed first.
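As an illustration of the idea of computing a character count from a UTF-8 buffer, here is a rough Python sketch. It is not the attached patch (which is C code in the parser). The function name byte_offset_to_char_offset is hypothetical, and the sketch assumes a zero-based byte offset into well-formed UTF-8.

```python
def byte_offset_to_char_offset(utf8_line: bytes, byte_offset: int) -> int:
    """Count the characters that start within the first byte_offset bytes."""
    prefix = utf8_line[:byte_offset]
    # UTF-8 continuation bytes have the form 0b10xxxxxx;
    # every other byte begins a new character.
    return sum(1 for b in prefix if (b & 0xC0) != 0x80)


# Hypothetical usage: "¡" occupies two bytes, "x" occupies one.
line = "¡x".encode("utf-8")
print(byte_offset_to_char_offset(line, 2))
print(byte_offset_to_char_offset(line, 3))
```

Counting lead bytes avoids decoding the buffer, which matches the constraint mentioned earlier that this part of the code may need to work without unicode support. Whether the result needs a further adjustment by one depends on whether the caller treats offsets as zero-based or one-based.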
msg190931 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2013-06-10 20:37
The latest patch at #2382 is simpler than mine, so I am closing this as duplicate.

History
Date	User	Action	Args
2022-04-11 14:57:08	admin	set	github: 54591
2013-06-10 20:37:57	belopolsky	set	status: open -> closed superseder: [Py3k] SyntaxError cursor shifted if multibyte character is in line. resolution: duplicate messages: + msg190931
2010-11-11 23:06:14	belopolsky	set	files: + issue10382a.diff
2010-11-11 23:05:52	belopolsky	set	messages: + msg120982
2010-11-11 08:53:41	vstinner	set	messages: + msg120941
2010-11-11 01:37:09	belopolsky	link	issue10384 dependencies
2010-11-11 00:17:27	belopolsky	set	nosy: + loewis
2010-11-11 00:04:06	belopolsky	set	files: + issue10382.diff messages: + msg120933 assignee: belopolsky keywords: + patch stage: needs patch -> patch review
2010-11-10 20:57:44	belopolsky	set	nosy: + lemburg, vstinner, ezio.melotti
2010-11-10 19:34:23	belopolsky	create