Author vbr
Recipients akitada, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date 2010-09-20.23:51:33
Message-id <1285026697.37.0.770895579108.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
I like the idea of a general "new" flag that introduces the reasonable, backwards-incompatible behaviour; one doesn't have to remember a list of non-standard flags to get these features.

While I recognise that the module probably can't work correctly with wide unicode characters on a narrow Python build (py 2.7, win XP in this case), I noticed a difference from re in this regard (it might be due to the absence of the wide unicode escape \U in the latter).

re.findall(u"\\U00010337", u"a\U00010337bc")
[]
re.findall(u"(?i)\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"(?i)\\U00010337", u"a\U00010337bc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 203, in findall
    return _compile(pattern, flags).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 310, in _compile
    parsed = parsed.optimise(info)
  File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
    if self.is_case_sensitive(info):
  File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
    return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

I.e. re fails to match this pattern (as it actually looks for the literal text "U00010337"); regex doesn't recognise the wide unicode character as a surrogate pair either, but with the i-flag it also raises an error from the narrow unichr. I'm not sure whether or how it should be fixed, but the difference based on the i-flag seems unusual.

Of course it would be nice if surrogate pairs were interpreted, but I can imagine that it would open a whole can of worms, as this is not thoroughly supported in the builtin unicode type either (len, indices, slicing).

I am trying to make wide unicode characters somehow usable in my app, mainly with hacks like an extended unichr:
("\U"+hex(67)[2:].zfill(8)).decode("unicode-escape")
or likewise for ord:
surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000
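
For reference, the arithmetic in both directions can be sketched as below. This is only an illustration: the function names are made up here (they are not part of re or regex), and the helpers work on plain integers so they don't depend on whether unichr/chr accepts values above 0xFFFF on a given build.

```python
# Hypothetical helper names, chosen for this sketch only.
HIGH_START = 0xD800           # first high (lead) surrogate
LOW_START = 0xDC00            # first low (trail) surrogate
SUPPLEMENTARY_START = 0x10000 # first code point outside the BMP


def surrogates_to_code_point(high, low):
    """Combine a high and a low surrogate (as integers) into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Same formula as surrog_ord above, applied to integers.
    return (high - HIGH_START) * 0x400 + (low - LOW_START) + SUPPLEMENTARY_START


def code_point_to_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not (0x10000 <= code_point <= 0x10FFFF):
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - SUPPLEMENTARY_START
    # The upper 10 bits go to the high surrogate, the lower 10 bits to the low one.
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)
```

With these, U+10337 maps to the pair (0xD800, 0xDF37) and back again.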

Actually, using regex, one can work around some of these limitations of len, index or slice by using a list form of the string containing surrogates:

regex.findall(ur"(?s)(?:\p{inHighSurrogates}\p{inLowSurrogates})|.", u"ab𐌷𐌸𐌹cd")
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']
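
The same idea can be sketched with the standard re module by spelling out the surrogate ranges instead of using \p{...}. Note the assumptions: this is written for Python 3, where a str may hold lone surrogates, so the sample string only emulates what a narrow build stores; split_characters is a name made up for this sketch.

```python
import re

# A high surrogate followed by a low surrogate, or else any single code unit.
# The explicit ranges stand in for \p{inHighSurrogates}\p{inLowSurrogates}.
_PAIR_OR_UNIT = re.compile(r"[\ud800-\udbff][\udc00-\udfff]|.", re.DOTALL)


def split_characters(text):
    """Return a list in which each surrogate pair is a single item."""
    return _PAIR_OR_UNIT.findall(text)


# Emulated narrow-build string: U+10337 and U+10338 written as surrogate pairs.
sample = "ab" + "\ud800\udf37" + "\ud800\udf38" + "cd"
chars = split_characters(sample)

unit_count = len(sample)          # counts code units, like len() on a narrow build
char_count = len(chars)           # counts logical characters
middle = "".join(chars[2:4])      # slicing the list never cuts a pair in half
```

Here len, indexing and slicing are applied to the list rather than to the string, so a pair is never split.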

but apparently things like wide unicode literals or character sets (or even extending shorthands like \w etc.) are much more complicated.

regards,
   vbr