classification
Title: Idle: make 3.x HyperParser work with non-ascii identifiers.
Type: behavior Stage: resolved
Components: Versions: Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: 21686 Superseder:
Assigned To: Nosy List: ezio.melotti, lemburg, loewis, python-dev, taleinat, terry.reedy
Priority: normal Keywords: patch

Created on 2014-06-14 23:12 by terry.reedy, last changed 2014-07-16 17:57 by terry.reedy. This issue is now closed.

Files
File name Uploaded Description Edit
taleinat.20140706.IDLE_HyperParser_unicode_ids.patch taleinat, 2014-07-06 21:02 review
taleinat.20140716.IDLE_HyperParser_unicode_ids_v2.patch taleinat, 2014-07-15 22:04 first fix + tests review
Messages (26)
msg220593 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-06-14 23:12
idlelib.HyperParser.Hyperparser has these lines
    _whitespace_chars = " \t\n\\"
    _id_chars = string.ascii_letters + string.digits + "_"
    _id_first_chars = string.ascii_letters + "_"
used in _eat_identifier() and get_expression. At least the latter two should be turned into functions that access the unicode datebase. (Such functions, if not already present in the stdlib, should be. But adding something elsewhere would be another issue.)
msg220594 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-06-14 23:19
#21686 adds the test file that a new test would go in.
msg220613 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-06-15 06:22
It seems that the unicodedata module already supplies relevant functions which can be used for this. For example, we can replace "char in self._id_first_chars" with something like:

from unicodedata import normalize, category
norm_char = normalize(char)[0]
is_id_first_char = norm_char_first == '_' or category(norm_char_first) in {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

I'm not sure what the "Other_ID_Start property" mentioned in [1] and [2] means, though. Can we get someone with more in-depth knowledge of unicode to help with this? 

The real question is how to do this *fast*, since HyperParser does a *lot* of these checks. Do you think caching would be a good approach?

See:
.. [1]: https://docs.python.org/3/reference/lexical_analysis.html#identifiers
.. [2]: http://legacy.python.org/dev/peps/pep-3131/
msg220614 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-06-15 06:23
Bah, I messed up the code sample in my previous message. It was supposed to be:

from unicodedata import normalize, category
norm_char_first = normalize(char)[0]
is_id_first_char = (
    norm_char_first == '_' or
    category(norm_char_first) in {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
)
msg220621 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-06-15 09:14
I checked for usage: _id(_first)_chars is only used in _eat_identifier, which is used in one place in get_expression. That is called once each in AutoComplete and CallTips. Both are seldom visible accept as requested (by waiting, for calltips). Calltips is only called on entry of '('. Is AutoComplete doing hidden checks?

_eat_identifier currently does a linear 'c in string' scan of the 2 char strings. I believe that both are long enough that O(1) 'c in set' checks would be faster. The sets could be augmented with latin1 id chars without becoming hugh or slowing the check (see below). This is a change we could make as soon as the test file and new failing tests are ready.

I just discovered, new in 3.x, str.isidentifier.
>>> '1'.isidentifier()
False
>>> 'a'.isidentifier()
True
>>> '\ucccc'
'쳌'
>>> '\ucccc'.isidentifier()
True

This is, however, meant to be used in the forward direction. If s[pos].isidentifier(), check s[pos:end].identifier(), where end is progressively incremented until the check fails. For backwards checking, it could be used with a start char prefixed: ('a'+s[start:pos]).isidentifier(). To limit the cost, the start decrement could be 4 chars at a time, with 2 extra tests (binary search) on failure to find the actual start.

The 3.x versions of other isxyg functions could be useful: isalpha, isdecimal, isdigit, isnumeric. We just have to check their definitions against the two identifier class definitions.

What is slightly annoying is that in CPython 3.3+, all-ascii strings are marked as such but the info is not directly accessible without without ctypes. I believe all-latin-1 strings can be detected by comparing sys.getsizeof(s) to len(s), so we could augment the char sets to include the extra identifier chars in latin-1.

We could add a configuation option to assume all-ascii (or better, all-latin1 code chars or not, and note that 'all latin1' will run faster but not recognize identifiers for the two features that use this.
msg220626 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-06-15 11:03
AutoComplete isn't doing hidden checks. My concern is that auto-completion happens automatically and the parsing is done synchronously, so if the parsing takes a significant amount of time it can cause a delay long enough to be noticeable by users. We should also consider the IDLE is being used on some relatively weak computers, such as Raspberry Pi. I have one for testing if needed!

Your suggestion of jumping backwards several characters at a time and calling isidentifier() seems problematic. For example, how would it deal with something like "2fast2furious"? Recognizing "fast2furious" as the identifier candidate is incorrect, but that's what we'd get with a straightforward implementation of your suggestion. I can't think of a way to get this right without be able to check if each character is valid for identifiers.

I still think the category(normalize(char)[0]) in {...} approach could be just fine. I'd really like to get feedback from an expert on unicode. The experts index lists three people as experts on unicodedata; I've added them to the nosy for this issue.

Once we have an identifier candidate, the code currently just checks that the first char is valid and that the identifier isn't a keyword. We probably should use str.isidentifier() instead of just checking the first character.
msg221066 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-06-20 04:08
> I'm not sure what the "Other_ID_Start property" mentioned in [1] and
> [2] means, though. Can we get someone with more in-depth knowledge of
> unicode to help with this? 

See http://www.unicode.org/reports/tr31/#Backward_Compatibility.
Basically they were considered valid ID_Start characters in previous versions of Unicode, but they are no longer valid.  I think it's safe to leave them out (perhaps they could/should be removed from the Python parser too), but if you want to add them the list includes only 4 characters (there are 12 more for Other_ID_Continue).

> The real question is how to do this *fast*, since HyperParser does a
> *lot* of these checks. Do you think caching would be a good approach?

I think it would be enough to check explicitly for ASCII chars, since most of them will be ASCII anyway.  If they are not ASCII you can use unicodedata.category (or .isidentifier() if it does the right thing).
msg221158 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-06-21 07:15
Alright, so I'm going to use the equivalent of the following code, unless someone can tell me that something is wrong:


from keyword import iskeyword
from unicodedata import category, normalize

_ID_FIRST_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl",
                        "Other_ID_Start"}
_ID_CATEGORIES = _ID_FIRST_CATEGORIES | {"Mn", "Mc", "Nd", "Pc",
                                         "Other_ID_Continue"}

_ASCII_ID_CHARS = set(string.ascii_letters + string.digits + "_")
_ID_KEYWORDS = {"True", "False", "None"}

def is_id_char(char):
    return char in _ASCII_ID_CHARS or (
        ord(char) >= 128 and
        category(normalize(char)[0]) in _ID_CATEGORIES
    )

def is_identifier(id_candidate):
    return id_candidate.isidentifier() and (
        (not iskeyword(id_candidate)) or
        id_candidate in _ID_KEYWORDS
    )

 def _eat_identifier(str, limit, pos):
    i = pos
    while i > limit and is_id_char(str[pos - i]):
        i -= 1
    if i < pos and not is_identifier(str[i:pos]):
        return 0
    return pos - i
msg221161 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-06-21 08:33
> _ID_FIRST_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl",
>                         "Other_ID_Start"}
> _ID_CATEGORIES = _ID_FIRST_CATEGORIES | {"Mn", "Mc", "Nd", "Pc",
>                                          "Other_ID_Continue"}

Note that "Other_ID_Start" and "Other_ID_Continue" are not categories -- they are properties -- and that unicodedata.category() won't return them, so adding them to these set won't have any effect.  I don't think there's a way to check if chars have that property, but as I said in my previous message it's probably safe to ignore them (nothing will explode even in the unlikely case that those chars are used, right?).

> def is_id_char(char):
>    return char in _ASCII_ID_CHARS or (
>        ord(char) >= 128 and

What's the reason for checking if the ord is >= 128?

>        category(normalize(char)[0]) in _ID_CATEGORIES
>    )
msg221165 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-06-21 10:26
The Other_ID_Start property is defined in

http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt

It currently includes 4 characters.

However, I would propose that methods .isidstart and .isidcontinue get added to the str type if there is a need for them.
msg221172 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-06-21 11:54
> What's the reason for checking if the ord is >= 128?

It's an optimization. Assuming the majority of characters will be ASCII, most non-identifier characters will fail this test, thus avoiding the more involved generic Unicode check.

> However, I would propose that methods .isidstart and
> .isidcontinue get added to the str type if there is a
> need for them.

I was surprised to find .isidentifier() is a method of the str type. Even if that is considered useful enough to be a method of str, I don't think the functions you suggest are generally useful enough to warrant being added as methods.

Then again, I'm no authority on decided what gets included as methods of base types. Regardless, I'd rather this patch go in without affecting str; adding these methods to str is a separate issue. If someone decides to pursue that, feel free to use this code and/or +nosy me.
msg221184 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2014-06-21 18:28
> It's an optimization. Assuming the majority of characters will be
> ASCII, most non-identifier characters will fail this test, thus
> avoiding the more involved generic Unicode check.

I don't know what kind of characters are usually received as input.  If things like (ASCII) spaces, parentheses, commas are common, then the optimization is probably OK.

@Martin
Do you know the reason why characters with the Other_ID_Start have been included in the first place, given that they are no longer considered valid identifiers and I can hardly think any situation where someone would need it?  Could they be removed from 3.5 if that makes the code simpler?
msg221236 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-06-22 08:53
The reason the Unicode consortium made this list (Other_ID_Start) is that they want to promise 100% backwards compatibility: if some programming language had been using UAX#31, changes to the Unicode database might break existing code. To avoid this, UAX#31 guarantees 100% stability.

The reason Python uses it is because it uses UAX#31, with the minimum number of modifications. We really shouldn't be making arbitrary changes to it. If we would e.g. say that we drop these four characters now, the next Unicode version might add more characters to Other_ID_Start, and then we would have to say that we include some, but not all, characters from Other_ID_Start.

So if IDLE wants to reimplement the XID_Start and XID_Continue properties, it should do it correctly. Note that the proposed patch only manages to replicate the ID_Start and ID_Continue properties. For the XID versions, see

http://www.unicode.org/reports/tr31/#NFKC_Modifications

Unfortunately, the specification doesn't explain exactly how these modifications are performed. For item 1, I think it is:

Characters which are in ID_Start (because they count as letters) but their NFKC decomposition does not start with an ID_Start character (because it starts with a modifier instead) are removed in XID_Start

For item 2, they unfortunately don't list all characters that get excluded. For the one example that they do give, the reason is clear: U+037A (GREEK YPOGEGRAMMENI, category Lm) decomposes to U+0020 (SPACE) U+0345 (COMBINING GREEK YPOGEGRAMMENI). Having a space in an identifier is clearly out of the question. I assume similar problems occur with "certain Arabic presentation forms". I wish the consortium was more explicit as to what precise algorithms they use to derive their derived properties.
msg221238 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-06-22 09:32
Tal: If you want to verify your is_id_char function, you could use the code

for i in range(65536):
    c = chr(i)
    c2 = 'a'+c
    if is_id_char(c) != c2.isidentifier():
        print('\\u%.4x'%i,is_id_char(c),c2.isidentifier())

Alternatively, you could use the strategy taken in that code for is_id_char itself:

def is_id_char(c):
  return ('a'+c).isidentifier()
msg221267 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-06-22 16:07
> Note that the proposed patch only manages to replicate the
> ID_Start and ID_Continue properties.

Is this just because of the mishandling of the Other_ID_Start and Other_ID_Continue properties, or something else as well? I based my code on the definitions in:

https://docs.python.org/3/reference/lexical_analysis.html#identifiers

Are those actually wrong?


Note that my code uses category(normalize(char)[0]), so it's always making sure that the first character is valid. Actually, though, I now realize that it should check all of the values returned by normalize().

Regarding testing ('a'+something).isidentifier(), Terry already suggested something along those lines. I think I'll end up using something of the sort, to avoid adding additional complex Unicode-related code to maintain in the future.
msg221279 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-06-22 16:56
I think you are misinterpreting the grammar. Your code declares that U+00B2 (SUPERSCRIPT TWO, ²) is an identifier character. Its category is No, so it is actually not. However, its normalization is U+0032 (DIGIT TWO, 2), which is an identifier character - but that doesn't make SUPERSCRIPT TWO a member of XID_Continue.
msg222418 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-07-06 21:02
Indeed, I seem to have been misinterpreting the grammar, despite taking care and reading it several times. This strengthens my opinion that we should use str.isidentifier() rather than attempt to correctly re-implement just the parts that we need.

Attached is a patch which fixes HyperParser._eat_identifier(), to the extent of my testing (tests included).

When non-ASCII characters are encountered, this patch uses Terry's suggestion of checking for valid identifier characters using ('a' + string_part).isidentifier(). It also employs his suggestion of how to avoid executing this check at every index, by skipping 4 characters at a time.

However, even with this fix, HyperParser.get_expression() still fails with non-ASCII Unicode strings. This is because it uses PyParse, which doesn't support Unicode! For example, it apparently replaces all non-ASCII characters with 'x'. I've added (in this patch) a few tests for this, which currently fail.

FWIW, PyParse includes a comment to this effect[1]:

<quote>
The parse functions have no idea what to do with Unicode, so
replace all Unicode characters with "x".  This is "safe"
so long as the only characters germane to parsing the structure
of Python are 7-bit ASCII.  It's *necessary* because Unicode
strings don't have a .translate() method that supports
deletechars.
</quote>

Properly resolving this issue will apparently require fixing PyParse to properly support Unicode.

.. [1]: http://hg.python.org/cpython/file/d25ae22cc992/Lib/idlelib/PyParse.py#l117
msg222454 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-07-07 12:15
Two observations:
1. This issue is only about identifiers. So processing of string literals is technically out of scope.
2. I'd suggest to replace .translate with regular expressions:

py> re.sub('[^(){}\[\]]','','foo(b[a]{r}≠)')
'([]{})'

I'm sure people interested in performance will want to time this approach.
msg222457 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-07-07 12:45
@Martin:

> 1. This issue is only about identifiers. So processing of
> string literals is technically out of scope.

I added a test with a non-ASCII string literal only for good measure, while I was already adding a test with a non-ASCII identifier. The patch doesn't change anything in that regard. Supporting Unicode string literals could be considered a separate issue, though the root cause of the problems is the same, and fixing it properly will fix both issues.

> 2. I'd suggest to replace .translate with regular expressions:

Indeed, something along those lines should work well, and probably would be fast enough.
msg222726 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-07-11 05:44
I just noticed that ColorDelegator has
 idprog = re.compile(r"\s+(\w+)", re.S)
which will recognize unicode 'words', if not exactly Python 'identifiers'.

However, UndoDelegator has 
    alphanumeric = string.ascii_letters + string.digits + "_"
which is the same as in Hyperparser. It is used in

    def classify(self, c):
        if c in self.alphanumeric:
            return "alphanumeric"
        if c == "\n":
            return "newline"
        return "punctuation"
and probably does not do what we really want.
msg222739 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-07-11 12:46
If you think ColorDelegator and UndoDelegator should be fixed as well, I'd be happy to take a look, but we should open a separate tracker issue for it.
msg222767 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2014-07-11 17:32
I did not open another issue yet because I did not want to split the general discussion of parsing 3.x unicode-based python. We might also need another issue for idlelib/PyParse, and that might need to come first.

What I think is that Idle should have at 1 definition of 'identifier' and not a least 3, in 3 separate places ;-). Hyperparser seems like the most appropriate place if there is to be more than str.isidentifier.

That is not to say that equivalents versions might be needed for different uses. ColorDelegator, as it is, needs an re version to be combined with other regexes. I have not looked yet at what UndoDelegator does with its character classification function and whether str.isidentifier could be used.  Hyperparsar seems the hardest since it parses backwards.

I do not consider this issue to be the highest priority, at the moment, but we have collected enough info with help from Ezio and Martin to do some experiments.
msg223150 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-07-15 21:26
I'm attaching a patch which really fixes this issue, along with additional tests for idlelib.HyperParser.

I did indeed have to fix PyParse as well. I got it working with re.subn() as Martin suggested, but the performance was much worse (between 100x and 1000x slower according to my timings). So I came up with the clever solution of passing a defaultdict to str.translate(), with the default value being ord('x'). That works as expected and is much faster than the regexp.

Finally, since the defaultdict is kept around as long as IDLE is running, I decided to avoid having it grow continually and consume memory unnecessarily. So I wrote a simple Mapping class, which wraps a normal dict and uses a custom default value instead of None, ord('x') in this case. Works like a charm :)

I haven't written tests for PyParse since it currently doesn't have any tests at all. But from the HyperParser tests and from some manual testing, it seems to be working as expected.
msg223191 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2014-07-16 12:02
LGTM
msg223199 - (view) Author: Roundup Robot (python-dev) Date: 2014-07-16 13:46
New changeset 8b3f7aecdf85 by Tal Einat in branch '3.4':
Issue #21765: Add support for non-ascii identifiers to HyperParser
http://hg.python.org/cpython/rev/8b3f7aecdf85

New changeset 73a8c614af4d by Tal Einat in branch 'default':
Issue #21765: Add support for non-ascii identifiers to HyperParser
http://hg.python.org/cpython/rev/73a8c614af4d
msg223202 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2014-07-16 13:55
Fix committed to 3.4 and merged to default. (My first commit!)

Not back-porting this to 2.7 despite PEP 434, because support for non-ASCII identifiers only exists in 3.x.

Close this issue as fixed!
History
Date User Action Args
2014-07-16 17:57:41terry.reedysetstatus: open -> closed
resolution: fixed
stage: test needed -> resolved
2014-07-16 13:55:54taleinatsetmessages: + msg223202
2014-07-16 13:46:33python-devsetnosy: + python-dev
messages: + msg223199
2014-07-16 12:02:01loewissetmessages: + msg223191
2014-07-15 22:04:04taleinatsetfiles: + taleinat.20140716.IDLE_HyperParser_unicode_ids_v2.patch
2014-07-15 22:03:42taleinatsetfiles: - taleinat.20140716.IDLE_HyperParser_unicode_ids.patch
2014-07-15 21:26:55taleinatsetfiles: + taleinat.20140716.IDLE_HyperParser_unicode_ids.patch

messages: + msg223150
2014-07-11 17:32:40terry.reedysetmessages: + msg222767
2014-07-11 12:46:12taleinatsetmessages: + msg222739
2014-07-11 05:44:52terry.reedysetmessages: + msg222726
2014-07-07 12:45:47taleinatsetmessages: + msg222457
2014-07-07 12:15:49loewissetmessages: + msg222454
2014-07-06 21:03:01taleinatsetfiles: + taleinat.20140706.IDLE_HyperParser_unicode_ids.patch
keywords: + patch
messages: + msg222418
2014-06-22 16:56:51loewissetmessages: + msg221279
2014-06-22 16:07:28taleinatsetmessages: + msg221267
2014-06-22 09:32:13loewissetmessages: + msg221238
2014-06-22 08:53:57loewissetmessages: + msg221236
2014-06-21 18:28:20ezio.melottisetmessages: + msg221184
2014-06-21 11:54:07taleinatsetmessages: + msg221172
2014-06-21 10:26:11loewissetmessages: + msg221165
2014-06-21 08:33:39ezio.melottisetmessages: + msg221161
2014-06-21 07:15:49taleinatsetmessages: + msg221158
2014-06-20 04:08:37ezio.melottisetmessages: + msg221066
2014-06-15 11:03:05taleinatsetnosy: + lemburg, loewis, ezio.melotti
messages: + msg220626
2014-06-15 09:14:46terry.reedysetmessages: + msg220621
2014-06-15 06:23:58taleinatsetmessages: + msg220614
2014-06-15 06:22:46taleinatsetmessages: + msg220613
2014-06-14 23:19:34terry.reedysetdependencies: + IDLE - Test hyperparser
messages: + msg220594
2014-06-14 23:12:54terry.reedycreate