Message 141924 - Python tracker

Author	tchrist
Recipients	tchrist
Date	2011-08-11.19:59:52
SpamBayes Score	6.445584e-05
Marked as misclassified	No
Message-id	<1313092793.79.0.645060327577.issue12733@psf.upfronthosting.co.za>
In-reply-to

Content
Without proper grapheme support in the regular expression library, it is impossible to correctly process Unicode. At the very least, one needs the \X escape supported, which matches an extended grapheme cluster per UTS#18. This escape is supported by many regex libraries, including Perl's own and of course PCRE (and thence PHP), the standard ICU library, and Matthew Barnett's replacement regex library for Python.

How do you process a string by graphemes if you cannot split on \X? How can you avoid splitting a grapheme into silly pieces if you cannot match one? How do I match the letter O no matter what diacritics have been applied to it? A match of (?=O)\X against an NFD string is by far the simplest and best way.
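To illustrate what (?=O)\X is meant to find, here is a minimal sketch that uses only the standard unicodedata module. Since the stdlib re module has no \X, it approximates a cluster as a base character plus the combining marks that follow it, which is not the full UTS#18 definition. The helper names is_mark and find_base_with_marks are made up for this example.

```python
import unicodedata

def is_mark(ch):
    # General category Mn, Mc, or Me: roughly what \pM would match.
    return unicodedata.category(ch).startswith("M")

def find_base_with_marks(text, base):
    # Normalize to NFD so precomposed letters become base + combining marks.
    nfd = unicodedata.normalize("NFD", text)
    found = []
    i = 0
    while i < len(nfd):
        j = i + 1
        # Extend over the run of marks attached to nfd[i].
        while j < len(nfd) and is_mark(nfd[j]):
            j += 1
        # Keep the whole run only when it starts with the requested base.
        if nfd[i] == base:
            found.append(nfd[i:j])
        i = j
    return found

# Plain O, O with diaeresis, O with tilde, lowercase o + circumflex,
# and O with circumflex and dot below.
sample = "O \u00D6 \u00D5 o\u0302 \u1ED8"
matches = find_base_with_marks(sample, "O")
print([unicodedata.normalize("NFC", m) for m in matches])
```

The lowercase o is skipped because the comparison is against the base character only. With a real \X, the same result would come from a single pattern instead of a hand-written loop.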

Grapheme matching is necessary for a wide variety of reasons. Adding \pM and \PM goes a little way, but not far enough, because that is not how grapheme clusters are defined. You need a proper \X.
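A small sketch of why a \PM\pM* style split falls short, again using only the standard unicodedata module. The helper split_base_plus_marks is hypothetical and only emulates the base-plus-marks approach. The comparison cases rely on UAX #29, which keeps CR LF together and keeps a sequence of conjoining Hangul jamo (leading consonant, vowel, trailing consonant) together as one extended grapheme cluster.

```python
import unicodedata

def split_base_plus_marks(text):
    # Hypothetical helper: emulates splitting on \PM\pM*,
    # i.e. one non-mark character followed by any number of marks.
    pieces = []
    for ch in text:
        if pieces and unicodedata.category(ch).startswith("M"):
            pieces[-1] += ch
        else:
            pieces.append(ch)
    return pieces

# e + combining acute: the mark-based split keeps this together.
accented = split_base_plus_marks("e\u0301")

# CR LF: neither character is a mark, so the split breaks the pair apart,
# although UAX #29 treats CR LF as a single cluster.
crlf = split_base_plus_marks("\r\n")

# Conjoining Hangul jamo (L, V, T): all are letters, not marks, so the
# split yields three pieces where UAX #29 defines one cluster.
jamo = split_base_plus_marks("\u1100\u1161\u11A8")

print(len(accented), len(crlf), len(jamo))
```

The accented letter comes out as one piece, but the CR LF pair and the jamo syllable are broken into two and three pieces respectively, which is the kind of "silly pieces" a proper \X avoids.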

History
Date	User	Action	Args
2011-08-11 19:59:53	tchrist	set	recipients: + tchrist
2011-08-11 19:59:53	tchrist	set	messageid: <1313092793.79.0.645060327577.issue12733@psf.upfronthosting.co.za>
2011-08-11 19:59:53	tchrist	link	issue12733 messages
2011-08-11 19:59:52	tchrist	create