Author terry.reedy
Recipients taleinat, terry.reedy
Date 2014-06-15.09:14:45
Message-id <1402823686.95.0.815055158855.issue21765@psf.upfronthosting.co.za>
In-reply-to
Content
I checked for usage: _id_chars and _id_first_chars are only used in _eat_identifier, which is used in one place in get_expression. That is called once each in AutoComplete and CallTips. Both are seldom visible except as requested (by waiting, for calltips). Calltips is only called on entry of '('. Is AutoComplete doing hidden checks?

_eat_identifier currently does a linear 'c in string' scan of the two character strings. I believe that both are long enough that O(1) 'c in set' checks would be faster. The sets could be augmented with latin1 id chars without becoming huge or slowing the check (see below). This is a change we could make as soon as the test file and new failing tests are ready.
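A minimal sketch of the set-based idea, not the actual IDLE code. The names ID_CHARS_STR, ID_FIRST_CHARS_STR, ID_CHARS, ID_FIRST_CHARS and _latin1_extras are made up for illustration, and the ASCII strings are assumed stand-ins for the existing two. The latin1 additions are computed with str.isidentifier rather than listed by hand.

```python
import string

# Hypothetical stand-ins for the two existing identifier character strings.
ID_CHARS_STR = string.ascii_letters + string.digits + "_"
ID_FIRST_CHARS_STR = string.ascii_letters + "_"


def _latin1_extras(first):
    """Return the non-ASCII latin1 chars valid in an identifier.

    With first=True, only chars that may start an identifier are kept;
    otherwise chars that may appear after the first position are kept.
    """
    extras = set()
    for code in range(128, 256):
        c = chr(code)
        # Prefix a known-good start char to test 'continue' validity.
        probe = c if first else "a" + c
        if probe.isidentifier():
            extras.add(c)
    return extras


# Membership tests on these are hash lookups instead of linear scans.
ID_CHARS = frozenset(ID_CHARS_STR) | _latin1_extras(first=False)
ID_FIRST_CHARS = frozenset(ID_FIRST_CHARS_STR) | _latin1_extras(first=True)
```

Each set stays below 256 entries, so 'c in ID_CHARS' should not get slower as latin1 chars are added.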

I just discovered, new in 3.x, str.isidentifier.
>>> '1'.isidentifier()
False
>>> 'a'.isidentifier()
True
>>> '\ucccc'
'쳌'
>>> '\ucccc'.isidentifier()
True

This is, however, meant to be used in the forward direction. If s[pos].isidentifier(), check s[pos:end].isidentifier(), where end is progressively incremented until the check fails. For backwards checking, it could be used with a start char prefixed: ('a'+s[start:pos]).isidentifier(). To limit the cost, the start decrement could be 4 chars at a time, with 2 extra tests (binary search) on failure to find the actual start.
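The backward scan could look roughly like the following. This is only a sketch of the idea, with hypothetical names (find_identifier_start, _all_id_chars), not a drop-in replacement for _eat_identifier. It assumes that when the run of id chars does not begin with a valid first char, nothing should be eaten. Keyword handling is left out.

```python
def _all_id_chars(text):
    # Prefixing 'a' makes the test ignore the 'first char' rule, so this is
    # True exactly when every char of text may appear inside an identifier.
    return ("a" + text).isidentifier()


def find_identifier_start(s, pos, step=4):
    """Return the index where the identifier ending at s[:pos] starts.

    Returns pos itself when no identifier ends at pos.
    """
    good = pos  # invariant: s[good:pos] consists only of id chars
    while good > 0:
        cand = max(good - step, 0)
        if _all_id_chars(s[cand:pos]):
            good = cand
            continue
        # Some char in s[cand:good] is not an id char. Binary search for
        # the smallest start that still passes: lo fails, hi passes.
        lo, hi = cand, good
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if _all_id_chars(s[mid:pos]):
                hi = mid
            else:
                lo = mid
        good = hi
        break
    # Assumption: a run such as '9abc' is not an identifier, so eat nothing.
    if not s[good:pos].isidentifier():
        return pos
    return good
```

For example, find_identifier_start('foo.bar_baz1', 12) steps back twice by 4, fails on the third step, and needs 2 more tests to settle on index 4, the start of 'bar_baz1'.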

The 3.x versions of other isxyz methods could be useful: isalpha, isdecimal, isdigit, isnumeric. We just have to check their definitions against the two identifier class definitions.

What is slightly annoying is that in CPython 3.3+, all-ascii strings are marked as such but the info is not directly accessible without ctypes. I believe all-latin-1 strings can be detected by comparing sys.getsizeof(s) to len(s), so we could augment the char sets to include the extra identifier chars in latin-1.
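A rough sketch of the size comparison, relying on a CPython 3.3+ implementation detail (the flexible string representation) rather than on anything the language guarantees. The function name is hypothetical. It compares two freshly built copies instead of s itself, on the assumption that this cancels out the fixed header size.

```python
import sys


def is_one_byte_per_char(s):
    """Guess whether CPython stores s with one byte per character.

    True for all-ascii and all-latin-1 strings on CPython 3.3+.
    Other implementations may size strings differently.
    """
    # s*3 and s*2 use the same storage width and header as s, so the
    # difference is len(s) times the number of bytes used per character.
    return sys.getsizeof(s * 3) - sys.getsizeof(s * 2) == len(s)
```

A portable, if slower, cross-check is all(ord(c) < 256 for c in s). On Python 3.7 and later, str.isascii() exposes the all-ascii case directly.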

We could add a configuration option to assume all-ascii (or better, all-latin1) code chars or not, and note that 'all latin1' will run faster but will not recognize identifiers containing other chars for the two features that use this.
History
Date User Action Args
2014-06-15 09:14:46 terry.reedy set recipients: + terry.reedy, taleinat
2014-06-15 09:14:46 terry.reedy set messageid: <1402823686.95.0.815055158855.issue21765@psf.upfronthosting.co.za>
2014-06-15 09:14:46 terry.reedy link issue21765 messages
2014-06-15 09:14:45 terry.reedy create