Message 117008 - Python tracker


Author	vbr
Recipients	akitada, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date	2010-09-20.23:51:33
Message-id	<1285026697.37.0.770895579108.issue2636@psf.upfronthosting.co.za>
In-reply-to

Content

I like the idea of a general "new" flag that introduces the reasonable, backwards-incompatible behaviour; one doesn't have to remember a list of non-standard flags to get these features.

While I recognise that the module probably can't work correctly with wide unicode characters on a narrow Python build (py 2.7, Win XP in this case), I noticed a difference from re in this regard (it might be due to the absence of the wide unicode literal in the latter).

re.findall(u"\\U00010337", u"a\U00010337bc")
[]
re.findall(u"(?i)\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"\\U00010337", u"a\U00010337bc")
[]
regex.findall(u"(?i)\\U00010337", u"a\U00010337bc")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python27\lib\regex.py", line 203, in findall
    return _compile(pattern, flags).findall(string, pos, endpos,
  File "C:\Python27\lib\regex.py", line 310, in _compile
    parsed = parsed.optimise(info)
  File "C:\Python27\lib\_regex_core.py", line 1735, in optimise
    if self.is_case_sensitive(info):
  File "C:\Python27\lib\_regex_core.py", line 1727, in is_case_sensitive
    return char_type(self.value).lower() != char_type(self.value).upper()
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

I.e. re fails to match this pattern (as it actually looks for the literal "U00010337"); regex doesn't recognise the wide unicode character as a surrogate pair either, but with the i-flag it also raises an error from the narrow unichr. I'm not sure whether or how this should be fixed, but the difference based on the i-flag seems unusual.

Of course it would be nice if surrogate pairs were interpreted, but I can imagine that it would open a whole can of worms, as this is not thoroughly supported in the builtin unicode type either (len, indices, slicing).
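
To illustrate the len/index mismatch mentioned above, here is a minimal sketch. It assumes that a narrow build stores text as UTF-16 code units, so counting those units approximates what len() would report there; utf16_length is a hypothetical helper name, not part of re or regex.

```python
def utf16_length(text):
    """Count UTF-16 code units, i.e. roughly what len() sees on a narrow build."""
    # utf-16-le writes no byte order mark, so every code unit is exactly 2 bytes
    return len(text.encode("utf-16-le")) // 2


sample = u"a\U00010337bc"
code_units = utf16_length(sample)   # U+10337 counts as two units here
```

On Python 3 (or a wide build) len(sample) counts code points instead, so the two numbers differ by one for each character outside the BMP.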

I am trying to make wide unicode characters somehow usable in my app, mainly with hacks like an extended unichr:
("\U"+hex(67)[2:].zfill(8)).decode("unicode-escape")
or likewise for ord:
surrog_ord = (ord(first) - 0xD800) * 0x400 + (ord(second) - 0xDC00) + 0x10000
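
The same arithmetic can be written on plain integers, which avoids unichr and its narrow-build limit entirely. This is only a sketch: to_surrogates and from_surrogates are hypothetical names, and the range checks are my own addition rather than anything from the post.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


def from_surrogates(high, low):
    """Inverse operation; same formula as surrog_ord, applied to integers."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


high, low = to_surrogates(0x10337)
round_trip = from_surrogates(high, low)
```

For U+10337 the pair is 0xD800 and 0xDF37, and from_surrogates maps it back to 0x10337.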

Actually, using regex, one can work around some of these limitations of len, index or slice by using a list form of the string containing surrogates:

regex.findall(ur"(?s)(?:\p{inHighSurrogates}\p{inLowSurrogates})|.", u"ab𐌷𐌸𐌹cd")
[u'a', u'b', u'\U00010337', u'\U00010338', u'\U00010339', u'c', u'd']

but apparently things like wide unicode literals or character sets (even extending the shorthands like \w etc.) are much more complicated.
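
The list-form workaround can also be sketched with the standard re module, assuming explicit surrogate ranges stand in for the \p{inHighSurrogates} and \p{inLowSurrogates} properties (which re does not provide). split_code_points is a hypothetical helper name, and the sample spells each wide character out as a surrogate pair so that the input looks the way it would on a narrow build.

```python
import re

# One high surrogate followed by one low surrogate, or else any single unit.
# re.DOTALL plays the role of (?s), so "." also matches newlines.
CODE_POINT = re.compile(u"[\ud800-\udbff][\udc00-\udfff]|.", re.DOTALL)


def split_code_points(text):
    """Return a list in which each surrogate pair is a single item."""
    return CODE_POINT.findall(text)


# u"\ud800\udf37" and u"\ud800\udf38" are U+10337 and U+10338 as surrogate pairs
sample = u"ab\ud800\udf37\ud800\udf38cd"
chars = split_code_points(sample)
```

len(chars), chars[2] and chars[1:4] then operate on whole characters, while the same operations on sample would count or cut individual surrogates.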

regards,
   vbr

History
Date	User	Action	Args
2010-09-20 23:51:38	vbr	set	recipients: + vbr, loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jhalcrow
2010-09-20 23:51:37	vbr	set	messageid: <1285026697.37.0.770895579108.issue2636@psf.upfronthosting.co.za>
2010-09-20 23:51:35	vbr	link	issue2636 messages
2010-09-20 23:51:33	vbr	create