Author taleinat
Recipients ezio.melotti, lemburg, loewis, taleinat, terry.reedy
Date 2014-07-06.21:02:59
Indeed, I seem to have been misinterpreting the grammar, despite taking care and reading it several times. This strengthens my opinion that we should use str.isidentifier() rather than attempt to correctly re-implement just the parts that we need.

Attached is a patch which fixes HyperParser._eat_identifier(), to the extent of my testing (tests included).

When non-ASCII characters are encountered, this patch uses Terry's suggestion of checking for valid identifier characters using ('a' + string_part).isidentifier(). It also employs his suggestion of how to avoid executing this check at every index, by skipping 4 characters at a time.
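A rough sketch of that check follows. It is not the code from the attached patch: the helper names `is_id_continuation` and `trailing_identifier_length` are made up for illustration, and the skip-4-characters optimization is left out for clarity. It only shows how prefixing `'a'` lets `str.isidentifier()` validate characters that are legal inside an identifier but not at its start.

```python
import string

# ASCII fast path: these characters never need the isidentifier() call.
_ASCII_ID_CHARS = frozenset(string.ascii_letters + string.digits + "_")


def is_id_continuation(part):
    """Return True if every character of `part` may follow the first
    character of an identifier (hypothetical helper)."""
    # Prefixing a known-valid start character means digits, combining
    # marks and the like are judged as continuation characters.
    return ("a" + part).isidentifier()


def trailing_identifier_length(text, pos):
    """Return the length of the identifier ending just before text[pos],
    or 0 if there is none (hypothetical helper)."""
    i = pos
    # Walk left while the previous character can appear in an identifier.
    while i > 0:
        ch = text[i - 1]
        if ch in _ASCII_ID_CHARS:
            i -= 1
        elif ord(ch) >= 128 and is_id_continuation(ch):
            i -= 1
        else:
            break
    # The candidate may still start with an invalid first character
    # (for example a digit); in that case report no identifier.
    if i == pos or not text[i:pos].isidentifier():
        return 0
    return pos - i
```

For instance, `trailing_identifier_length("foo.bär", 7)` covers `bär`, while a candidate such as `9abc` is rejected because it cannot start with a digit.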

However, even with this fix, HyperParser.get_expression() still fails with non-ASCII Unicode strings. This is because it uses PyParse, which doesn't support Unicode! For example, it apparently replaces all non-ASCII characters with 'x'. I've added (in this patch) a few tests for this, which currently fail.
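To illustrate the effect described above, here is a minimal emulation of that replacement. It is an assumption about the observable behavior only, not PyParse's actual code, and `mask_non_ascii` is a name invented for this example.

```python
def mask_non_ascii(source):
    """Replace every non-ASCII character with 'x' (illustrative only)."""
    return "".join(ch if ord(ch) < 128 else "x" for ch in source)


original = "bär.foo"
masked = mask_non_ascii(original)

# The length and all ASCII structure are preserved, but the original
# identifier text can no longer be recovered from the masked string.
print(masked)
print(len(masked) == len(original))
```

Any code that reads identifiers back from the masked text would therefore see `bxr` rather than `bär`.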

FWIW, PyParse includes a comment to this effect[1]:

    The parse functions have no idea what to do with Unicode, so
    replace all Unicode characters with "x".  This is "safe"
    so long as the only characters germane to parsing the structure
    of Python are 7-bit ASCII.  It's *necessary* because Unicode
    strings don't have a .translate() method that supports
Fully resolving this issue will apparently require fixing PyParse to support Unicode properly.

.. [1]:
Date                 User      Action  Args
2014-07-06 21:03:02  taleinat  set     recipients: + taleinat, lemburg, loewis, terry.reedy, ezio.melotti
2014-07-06 21:03:01  taleinat  set     messageid: <>
2014-07-06 21:03:01  taleinat  link    issue21765 messages
2014-07-06 21:03:01  taleinat  create