msg66715 - (view) |
Author: Sven Siegmund (sven.siegmund) |
Date: 2008-05-12 08:43 |
re cannot ignore the case of special Latin characters:
Python 3.0a5 (py3k:62932M, May 9 2008, 16:23:11) [MSC v.1500 32 bit
(Intel)] on win32
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>> import re
>>> rx = re.compile('Á', re.IGNORECASE)
>>> rx.match('á') # should match but won't
>>> rx.match('Á') # will match
<_sre.SRE_Match object at 0x014B08A8>
>>> rx = re.compile('á', re.IGNORECASE)
>>> rx.match('Á') # should match but won't
>>> rx.match('á') # will match
<_sre.SRE_Match object at 0x014B08A8>
|
msg66727 - (view) |
Author: Guido van Rossum (gvanrossum) * |
Date: 2008-05-12 14:44 |
Try adding re.LOCALE to the flags. I'm not sure why that is needed but
it seems to fix this issue.
I still think this is a legitimate bug though.
|
msg67622 - (view) |
Author: Manuel Kaufmann (humitos) * |
Date: 2008-06-02 00:23 |
I have the same error with the re.LOCALE flag...
[humitos] [~]$ python3.0
Python 3.0a5+ (py3k:63855, Jun 1 2008, 13:05:09)
[GCC 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> rx = re.compile('á', re.LOCALE | re.IGNORECASE)
>>> rx.match('Á')
>>> rx.match('á')
<_sre.SRE_Match object at 0x2b955e204d30>
>>> rx = re.compile('Á', re.IGNORECASE | re.LOCALE)
>>> rx.match('Á')
<_sre.SRE_Match object at 0x2b955e204e00>
>>> rx.match('á')
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>>
|
msg68901 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-06-28 19:40 |
Same here, re.LOCALE doesn't circumvent the problem.
|
msg68905 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-06-28 20:27 |
Uh, actually, it works if you specify re.UNICODE. If you don't, the
getlower() function in _sre.c falls back to the plain ASCII algorithm.
>>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>>> pat.match('á')
<_sre.SRE_Match object at 0xb7c66c28>
>>> pat.match('Á')
<_sre.SRE_Match object at 0xb7c66cd0>
I wonder whether re.UNICODE should be the default in Py3k, at least when
the pattern is a string and not a bytes object. There could also be a
re.ASCII flag for those cases where people want to fall back to the old
behaviour.
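A minimal sketch of the proposal above, assuming a released Python 3 in which str patterns use Unicode matching by default and re.ASCII restores ASCII-only case folding. The variable names are illustrative only.

```python
import re

# str pattern with no extra flag: Unicode case folding is assumed to apply.
unicode_rx = re.compile('Á', re.IGNORECASE)

# re.ASCII restricts case-insensitive matching to ASCII letters.
ascii_rx = re.compile('Á', re.IGNORECASE | re.ASCII)

unicode_hit = unicode_rx.match('á')   # expected to match
ascii_hit = ascii_rx.match('á')       # expected to be None
exact_hit = ascii_rx.match('Á')       # the identical character still matches

print(unicode_hit is not None, ascii_hit is None, exact_hit is not None)
```

If the interpreter behaves as assumed, all three printed values are True, which is the behaviour the original report asked for.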
|
msg68920 - (view) |
Author: Guido van Rossum (gvanrossum) * |
Date: 2008-06-28 22:19 |
Sounds like re.UNICODE should be on by default when the pattern is a str
instance.
Also (per mailing list discussion) we should probably only allow
matching bytes when the pattern is bytes, and matching str when the
pattern is str.
Finally, is there a use case of re.LOCALE any more? I'm thinking not.
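The str/bytes separation suggested above can be checked with a short sketch. It assumes an interpreter in which the restriction has been applied; `try_match` is a hypothetical helper written for this illustration, not part of the re module.

```python
import re

def try_match(pattern, subject):
    """Return the match result, or the TypeError raised for mixed types."""
    try:
        return re.match(pattern, subject)
    except TypeError as exc:
        return exc

same_str = try_match('abc', 'abc')        # str pattern on a str subject
same_bytes = try_match(b'abc', b'abc')    # bytes pattern on a bytes subject
mixed_1 = try_match(b'abc', 'abc')        # bytes pattern on a str subject
mixed_2 = try_match('abc', b'abc')        # str pattern on a bytes subject

print(type(mixed_1).__name__, type(mixed_2).__name__)
```

Returning the exception instead of re-raising it keeps the four cases side by side in one script.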
|
msg68922 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-06-28 22:35 |
On Saturday 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:
> Finally, is there a use case of re.LOCALE any more? I'm thinking not.
It's used for locale-specific case matching in the non-unicode case. But
it looks to me like a bad practice and we could probably remove it.
>>> import re, locale
>>> locale.setlocale(locale.LC_CTYPE, 'C')
'C'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
>>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
'fr_FR.ISO-8859-1'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
<_sre.SRE_Match object at 0xb7b9ac28>
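One possible alternative to re.LOCALE, sketched here under the assumption that the encoding of the bytes is known (Latin-1 in this illustration): decode first, then match as str so that Unicode case folding applies regardless of the process locale. This is a swapped-in technique, not something proposed in the thread.

```python
import re

# Bytes as they might arrive from a Latin-1 encoded source (assumed encoding).
raw_pattern = 'À'.encode('latin1')
raw_subject = 'à'.encode('latin1')

# Decode explicitly, then let the str pattern use Unicode case folding.
pattern = raw_pattern.decode('latin1')
subject = raw_subject.decode('latin1')

hit = re.match(pattern, subject, re.IGNORECASE)
print(hit is not None)
```

The result no longer depends on locale.setlocale(), at the cost of having to know the encoding up front.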
|
msg68932 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-06-29 01:15 |
Here is a preliminary patch which doesn't remove re.LOCALE, but adds
TypeErrors for mistyped matchings (a str pattern against bytes, or vice
versa), a ValueError when specifying re.UNICODE with a bytes pattern, and
implies re.UNICODE for unicode patterns. The test suite runs fine after a
few fixes.
It also includes the patch for #3231 ("re.compile fails with some bytes
patterns").
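A small sketch of two of the behaviours described in this message, assuming an interpreter that includes the patch: the implied re.UNICODE flag on str patterns and the ValueError for re.UNICODE combined with a bytes pattern. The variable names are illustrative.

```python
import re

# A str pattern is expected to carry re.UNICODE even though no flag was passed.
implied = bool(re.compile('abc').flags & re.UNICODE)

# A bytes pattern is not expected to receive the flag implicitly.
bytes_has_unicode = bool(re.compile(b'abc').flags & re.UNICODE)

# Explicitly combining re.UNICODE with a bytes pattern should be rejected.
try:
    re.compile(b'abc', re.UNICODE)
    rejected = False
except ValueError:
    rejected = True

print(implied, bytes_has_unicode, rejected)
```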
|
msg68966 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-06-29 20:21 |
This new patch also introduces re.ASCII as discussed on the mailing list.
|
msg68967 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-06-29 20:36 |
Improved patch which also detects incompatibilities for "(?u)".
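A sketch of the "(?u)" incompatibility check, assuming the inline flag is rejected for bytes patterns. `compiles` is a hypothetical helper, and both exception classes are caught because the class raised may differ between releases.

```python
import re

def compiles(pattern):
    """Report whether `pattern` compiles (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except (ValueError, re.error):
        # The exact exception class is an assumption; catch both candidates.
        return False

str_inline = compiles('(?u)abc')      # (?u) is redundant but valid for str
bytes_inline = compiles(b'(?u)abc')   # (?u) conflicts with a bytes pattern

print(str_inline, bytes_inline)
```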
|
msg69298 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-07-05 21:09 |
This new patch adds re.ASCII in all sensitive places I could find in the
stdlib (except lib2to3 which as far as I understand is maintained in a
separate branch, and even has its own copy of tokenize.py...).
Also, I didn't get an answer to the following question on the ML: should
an inline flag "(?a)" be introduced to mirror the existing "(?u)", so
as to set the ASCII flag from inside a pattern string?
|
msg69301 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-07-05 21:30 |
http://codereview.appspot.com/2439
|
msg70354 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-07-28 16:39 |
Final patch adding the (?a) inline flag (equivalent to re.ASCII). Please
review:
http://codereview.appspot.com/2439
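A short sketch of the (?a) inline flag next to the re.ASCII argument, assuming an interpreter that includes this patch. The sample word is arbitrary.

```python
import re

word = 'été'

flag_arg = re.match(r'\w+', word, re.ASCII)   # flag passed as an argument
inline = re.match(r'(?a)\w+', word)           # same flag set inside the pattern
default = re.match(r'\w+', word)              # Unicode \w is the str default

# The compiled pattern should report the flag set by the inline form.
inline_sets_ascii = bool(re.compile(r'(?a)\w+').flags & re.ASCII)

print(flag_arg, inline, default.group(), inline_sets_ascii)
```

Both ASCII variants fail at the first character because "é" is not an ASCII word character, while the default form consumes the whole word.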
|
msg70370 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * |
Date: 2008-07-28 20:41 |
Are all those re.ASCII flags mandatory, or are they here just for
theoretical correctness?
For example, the output of "gcc -dumpversion" is certainly plain ASCII.
I don't mind that \d also matches some exotic digit - it just won't happen.
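To make the "exotic digit" point concrete, here is a minimal sketch assuming str patterns default to Unicode matching. The version string is an arbitrary ASCII sample of the kind "gcc -dumpversion" prints.

```python
import re

arabic_indic = '\u0663\u0664'   # ARABIC-INDIC DIGIT THREE and FOUR
version = '4.1.2'               # plain ASCII sample

# Unicode mode: \d accepts any Unicode decimal digit.
unicode_digits = re.fullmatch(r'\d+', arabic_indic)
# ASCII mode: \d is limited to 0-9.
ascii_digits = re.fullmatch(r'\d+', arabic_indic, re.ASCII)

# For ASCII-only input the flag makes no observable difference.
v_default = re.fullmatch(r'\d+\.\d+\.\d+', version)
v_ascii = re.fullmatch(r'\d+\.\d+\.\d+', version, re.ASCII)

print(unicode_digits is not None, ascii_digits is None)
```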
|
msg70371 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-07-28 20:49 |
On Monday 28 July 2008 at 20:41 +0000, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Are all those re.ASCII flags mandatory, or are they here just for
> theoretical correctness?
For theoretical correctness. I just don't want to analyze each case
individually and I'm probably not competent for many of them.
|
msg70780 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-08-06 10:29 |
If nobody (except Amaury :-)) has anything to say about the current
patch, should it be committed?
|
msg70787 - (view) |
Author: Guido van Rossum (gvanrossum) * |
Date: 2008-08-06 16:34 |
Let's make sure the release manager is OK with this.
|
msg71186 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-08-15 21:31 |
Barry?
|
msg71413 - (view) |
Author: Barry A. Warsaw (barry) * |
Date: 2008-08-19 12:57 |
I haven't looked at the specific patch, but based on the description of
the behavior, I'm +1 on committing this before beta 3. I'm fine with
leaving the re.ASCII flags in there -- it will be a marker to indicate
perhaps the code needs a closer examination (eventually).
|
msg71414 - (view) |
Author: Barry A. Warsaw (barry) * |
Date: 2008-08-19 12:58 |
Make sure of course that the documentation is updated and a NEWS file
entry is added.
|
msg71455 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-08-19 17:59 |
Fixed in r65860. Someone should check the docs though (at least try to
generate them, and review my changes a bit since English isn't my mother
tongue).
|
msg71516 - (view) |
Author: Mark Summerfield (mark) * |
Date: 2008-08-20 07:36 |
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).
I've revised the ASCII and LOCALE-related texts in re.rst in r65903.
|
msg71517 - (view) |
Author: Mark Summerfield (mark) * |
Date: 2008-08-20 07:40 |
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).
And two more (tiny) fixes in r65904; that's my lot:-)
|
msg71519 - (view) |
Author: Antoine Pitrou (pitrou) * |
Date: 2008-08-20 08:49 |
Thanks a lot Mark!
|
Date | User | Action | Args |
2022-04-11 14:56:34 | admin | set | github: 47083 |
2009-02-13 14:02:50 | ezio.melotti | set | nosy: + ezio.melotti |
2009-02-13 13:52:31 | ocean-city | link | issue5239 dependencies |
2009-02-13 12:34:49 | ocean-city | link | issue5240 dependencies |
2008-08-20 08:49:38 | pitrou | set | messages: + msg71519 |
2008-08-20 07:40:55 | mark | set | messages: + msg71517 |
2008-08-20 07:36:30 | mark | set | messages: + msg71516 |
2008-08-19 17:59:29 | pitrou | set | status: open -> closed; resolution: accepted -> fixed; messages: + msg71455 |
2008-08-19 12:58:07 | barry | set | messages: + msg71414 |
2008-08-19 12:57:41 | barry | set | resolution: accepted; messages: + msg71413 |
2008-08-15 21:31:11 | pitrou | set | messages: + msg71186 |
2008-08-06 16:34:33 | gvanrossum | set | nosy: + barry; messages: + msg70787 |
2008-08-06 10:29:16 | pitrou | set | messages: + msg70780 |
2008-07-28 20:49:16 | pitrou | set | messages: + msg70371 |
2008-07-28 20:41:56 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg70370 |
2008-07-28 16:39:31 | pitrou | set | files: + reunicode5.patch; messages: + msg70354 |
2008-07-24 15:07:53 | pitrou | set | priority: critical; assignee: pitrou |
2008-07-24 12:39:00 | mark | set | nosy: + mark |
2008-07-05 21:30:04 | pitrou | set | messages: + msg69301 |
2008-07-05 21:10:11 | pitrou | set | files: + reunicode4.patch; messages: + msg69298 |
2008-06-29 20:36:38 | pitrou | set | files: + reunicode3.patch; messages: + msg68967 |
2008-06-29 20:21:07 | pitrou | set | files: + reunicode2.patch; messages: + msg68966 |
2008-06-29 01:19:44 | pitrou | set | files: + reunicode.patch |
2008-06-29 01:19:17 | pitrou | set | files: - reunicode.patch |
2008-06-29 01:15:28 | pitrou | set | files: + reunicode.patch; keywords: + patch; messages: + msg68932 |
2008-06-28 22:35:39 | pitrou | set | messages: + msg68922 |
2008-06-28 22:19:03 | gvanrossum | set | messages: + msg68920 |
2008-06-28 20:27:24 | pitrou | set | messages: + msg68905 |
2008-06-28 19:40:35 | pitrou | set | nosy: + pitrou; messages: + msg68901 |
2008-06-02 00:23:02 | humitos | set | nosy: + humitos; messages: + msg67622 |
2008-05-12 14:44:03 | gvanrossum | set | nosy: + gvanrossum; messages: + msg66727 |
2008-05-12 08:44:03 | sven.siegmund | create | |