Message 141995 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	tchrist
Recipients	Arfrever, r.david.murray, tchrist, terry.reedy
Date	2011-08-12.23:04:52
SpamBayes Score	1.4229129e-10
Marked as misclassified	No
Message-id	<3138.1313190285@chthon>
In-reply-to	<1313187719.71.0.775803768006.issue12729@psf.upfronthosting.co.za>

Content
"Terry J. Reedy" <report@bugs.python.org> wrote on Fri, 12 Aug 2011 22:21:59 -0000: > Does the regex module handle these particular issues better? No, it currently does not. One would have to ask Matthew directly, but I believe it was because he was trying to stay compatible with re, sometimes apparently even if that means being bug compatible. I have brought it to his attention though, and at last report he was pondering the matter. In contrast to how Python behaves on narrow builds, even though Java uses UTF-16 for its internal representation of strings, its Java Pattern is quite adamant about treating with logical code points alone. Besides running afoul of tr18, it is senseless to do otherwise. A dot is one Unicode code point, no matter whether you have 8-bit code units, 16-bit code units, or 32-bit code units. Similarly, character classes and their negations only match entire code points, never pieces of the same. ICU's regexes work the same way the normal Java Pattern library does. So too do Perl, Ruby, and Go. Python is really the odd man out here. Almost. One interesting counterexample is the vim editor. It has dot match a complete grapheme no matter how many code points that requires, because we're dealing with user-visible characters now, not programmer-visible one. It is an unreasonable burden to make the programmer deal with the fine-grained details of low-level serialization schemes instead of at least() the code point level of operations, which is the minimum for getting real work done. (Note that tr18 admits that accessing text at the code point level meets only programmer expectations, not those of the user, and therefore to meet user expectations much more elaborate patterns must necessarily be constructed than if logical groups of coarser granularity than code points alone are supported.) Python should not be subject to changing its behavior from one build to the next. This astonishing narrow-vs-wide build behavior makes it virtually impossible to write portable code to work on arbitrary Unicode text. You cannot even know whether you need to match one dot or two to get a single code point, and similarly for character indexing, etc. Even identifiers come into play. Surrogates should be utterly nonexistent/invisible at this, the normal level of operation. An API that minimally but uniformly deals with logical code points and nothing finer in granularity is the only way to go here. Please trust me on this one. Graphemes (tr18 Level 2) and collation elements (Level 3) will someday build on that, but one must first support code points properly. That's why it's a Level 1 requirement. --tom

"Terry J. Reedy" <report@bugs.python.org> wrote
   on Fri, 12 Aug 2011 22:21:59 -0000: 

> Does the regex module handle these particular issues better?

No, it currently does not.  One would have to ask Matthew directly, but I
believe it was because he was trying to stay compatible with re, sometimes
apparently even if that means being bug compatible.  I have brought it 
to his attention though, and at last report he was pondering the matter.

In contrast to how Python behaves on narrow builds, even though Java uses
UTF-16 for its internal representation of strings, its Java Pattern is
quite adamant about treating with logical code points alone.  Besides
running afoul of tr18, it is senseless to do otherwise.  A dot is one
Unicode code point, no matter whether you have 8-bit code units, 16-bit
code units, or 32-bit code units.  Similarly, character classes and their
negations only match entire code points, never pieces of the same.

ICU's regexes work the same way the normal Java Pattern library does.  
So too do Perl, Ruby, and Go.  Python is really the odd man out here.  

Almost.

One interesting counterexample is the vim editor.  It has dot match a
complete grapheme no matter how many code points that requires, because
we're dealing with user-visible characters now, not programmer-visible one.

It is an unreasonable burden to make the programmer deal with the
fine-grained details of low-level serialization schemes instead of at
least(*) the code point level of operations, which is the minimum for
getting real work done.  (*Note that tr18 admits that accessing text at the
code point level meets only programmer expectations, not those of the user,
and therefore to meet user expectations much more elaborate patterns must
necessarily be constructed than if logical groups of coarser granularity
than code points alone are supported.)

Python should not be subject to changing its behavior from one build to the
next.  This astonishing narrow-vs-wide build behavior makes it virtually
impossible to write portable code to work on arbitrary Unicode text. You
cannot even know whether you need to match one dot or two to get a single
code point, and similarly for character indexing, etc. Even identifiers
come into play.  Surrogates should be utterly nonexistent/invisible at
this, the normal level of operation.  

An API that minimally but uniformly deals with logical code points and
nothing finer in granularity is the only way to go here.  Please trust me
on this one.  Graphemes (tr18 Level 2) and collation elements (Level 3)
will someday build on that, but one must first support code points
properly. That's why it's a Level 1 requirement.

--tom

History
Date	User	Action	Args
2011-08-12 23:04:54	tchrist	set	recipients: + tchrist, terry.reedy, Arfrever, r.david.murray
2011-08-12 23:04:53	tchrist	link	issue12729 messages
2011-08-12 23:04:52	tchrist	create