Message 222418 - Python tracker

Author	taleinat
Recipients	ezio.melotti, lemburg, loewis, taleinat, terry.reedy
Date	2014-07-06.21:02:59
Message-id	<1404680581.99.0.877331282597.issue21765@psf.upfronthosting.co.za>
In-reply-to

Content

Indeed, I seem to have been misinterpreting the grammar, despite taking care and reading it several times. This strengthens my opinion that we should use str.isidentifier() rather than attempt to correctly re-implement just the parts that we need.

Attached is a patch which fixes HyperParser._eat_identifier(), to the extent of my testing (tests included).

When non-ASCII characters are encountered, this patch uses Terry's suggestion of checking for valid identifier characters using ('a' + string_part).isidentifier(). It also employs his suggestion of how to avoid executing this check at every index, by skipping 4 characters at a time.
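To illustrate the check described above, here is a rough sketch, not the attached patch itself. The function name eat_identifier and its (text, limit, pos) signature are assumptions made for this example, and it leaves out details such as keyword handling and any ASCII fast path.

```python
def eat_identifier(text, limit, pos):
    """Return the length of the identifier ending at text[pos], or 0.

    Hypothetical helper for illustration only. It scans backwards from
    `pos` and never looks at characters before index `limit`.
    """
    i = pos

    # Prefixing 'a' turns the question "is this a valid identifier?" into
    # "does this consist only of valid identifier characters?", since the
    # first character of the candidate no longer has to be a valid start.
    # Step back 4 characters at a time to avoid one check per index.
    while i - 4 >= limit and ('a' + text[i - 4:pos]).isidentifier():
        i -= 4

    # Fewer than 4 valid characters remain; pick them up one by one.
    while i - 1 >= limit and ('a' + text[i - 1:pos]).isidentifier():
        i -= 1

    # The candidate may still begin with a character that is only valid
    # inside an identifier (a digit, for example); reject it in that case.
    if i == pos or not text[i:pos].isidentifier():
        return 0
    return pos - i
```

For example, eat_identifier("x = naïve_é", 0, 11) finds the 7-character identifier "naïve_é", while eat_identifier("9abc", 0, 4) returns 0 because "9abc" cannot start with a digit.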

However, even with this fix, HyperParser.get_expression() still fails with non-ASCII Unicode strings. This is because it uses PyParse, which doesn't support Unicode! For example, it apparently replaces all non-ASCII characters with 'x'. I've added (in this patch) a few tests for this, which currently fail.

FWIW, PyParse includes a comment to this effect[1]:

<quote>
The parse functions have no idea what to do with Unicode, so
replace all Unicode characters with "x".  This is "safe"
so long as the only characters germane to parsing the structure
of Python are 7-bit ASCII.  It's *necessary* because Unicode
strings don't have a .translate() method that supports
deletechars.
</quote>
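
The effect that comment describes can be illustrated with a small sketch. This is not PyParse's actual code; mask_non_ascii is a made-up name used only to show the kind of replacement involved.

```python
def mask_non_ascii(source):
    """Replace every non-ASCII character with 'x' (illustration only).

    The string keeps its length, so indices still line up, but the
    original non-ASCII characters are no longer present in the result.
    """
    return "".join(ch if ord(ch) < 128 else "x" for ch in source)


masked = mask_non_ascii("naïve = 'é'")
# masked is now "naxve = 'x'"
```

Anything that reads identifiers back from the masked string sees "naxve" rather than "naïve".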

Properly resolving this issue will apparently require fixing PyParse to support Unicode.

.. [1]: http://hg.python.org/cpython/file/d25ae22cc992/Lib/idlelib/PyParse.py#l117

History
Date	User	Action	Args
2014-07-06 21:03:02	taleinat	set	recipients: + taleinat, lemburg, loewis, terry.reedy, ezio.melotti
2014-07-06 21:03:01	taleinat	set	messageid: <1404680581.99.0.877331282597.issue21765@psf.upfronthosting.co.za>
2014-07-06 21:03:01	taleinat	link	issue21765 messages
2014-07-06 21:03:01	taleinat	create