classification
Title: re.sub confusion between count and flags args
Type: enhancement Stage:
Components: Regular Expressions, Unicode Versions: Python 3.4, Python 3.3, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: eric.araujo, eric.smith, ezio.melotti, mindauga, mmilkin, mrabarnett, rhettinger, terry.reedy
Priority: normal Keywords:

Created on 2011-04-29 18:27 by mindauga, last changed 2013-04-16 10:28 by rhettinger.

Messages (13)
msg134806 - Author: Mindaugas (mindauga) Date: 2011-04-29 18:27
re.sub doesn't substitute non-ASCII characters:

Python 2.7.1 (r271:86832, Apr 15 2011, 12:11:58) Arch Linux

>>> import re

>>> a=u'aaa'
>>> print re.search('(\w+)',a,re.U).groups()
(u'aaa',)
>>> print re.sub('(\w+)','x',a,re.U)
x

      BUT:

>>> a=u'ąąą'
>>> print re.search('(\w+)',a,re.U).groups()
(u'\u0105\u0105\u0105',)
>>> print re.sub('(\w+)','x',a,re.U)
ąąą
msg134820 - Author: Eric V. Smith (eric.smith) (Python committer) Date: 2011-04-29 22:58
The 4th parameter to re.sub() is a count, not flags.
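
A minimal sketch of the difference, written for Python 3 (an assumption; the report uses Python 2.7). re.I stands in for re.U because Python 3 str patterns already match Unicode word characters by default, so the original symptom would not show. Newer Python releases may emit a DeprecationWarning for a positional count, which the sketch silences.

```python
import re
import warnings

with warnings.catch_warnings():
    # A positional count may trigger a DeprecationWarning on newer versions.
    warnings.simplefilter("ignore")
    # re.I has the integer value 2, so here it is taken as count=2:
    # only the first two matches are replaced.
    positional_lower = re.sub("a", "x", "aaaa", re.I)
    # No flag is actually applied, so "A" does not match lowercase text.
    positional_upper = re.sub("A", "x", "aaaa", re.I)

# Passing the flag by keyword applies it as a flag.
keyword_upper = re.sub("A", "x", "aaaa", flags=re.I)

print(positional_lower)  # xxaa
print(positional_upper)  # aaaa
print(keyword_upper)     # xxxx
```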
msg134830 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2011-04-30 02:23
Since this has been reported already several times (see e.g. #11947), and it's a fairly common mistake, I think we should do something to avoid it.

A few possibilities are:
  1) add a warning in the doc;
  2) make count and flags keyword-only arguments (raising a deprecation warning in 3.3 and actually changing it later);
  3) change the regex flags to some object that can be distinguished from ints and raise an error when a flag is passed to count.
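
As a rough illustration of option 2, here is a sketch of a keyword-only signature. The wrapper name sub_kwonly is hypothetical and is not part of the re module; it only shows how a flag passed in the count position would be rejected instead of silently misread.

```python
import re


def sub_kwonly(pattern, repl, string, *, count=0, flags=0):
    """Hypothetical wrapper: count and flags can only be passed by keyword."""
    return re.sub(pattern, repl, string, count=count, flags=flags)


text = "\u0105\u0105\u0105"  # the non-ASCII sample string from the report

# Keyword use works as intended.
result = sub_kwonly(r"(\w+)", "x", text, flags=re.U)

# Passing the flag positionally is now an error rather than a silent count.
try:
    sub_kwonly(r"(\w+)", "x", text, re.U)
    rejected = False
except TypeError:
    rejected = True
```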
msg135371 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2011-05-06 21:41
I like the idea of an internal REflag class with __new__, __or__, and __repr__ == __str__. str(re.A|re.L) might print as
"REflag: re.ASCII | re.LOCALE"
If it is *not* an int subclass, any attempt to use it as an int or mix it with an int would raise. I checked, and the doc only promises that flags can be or'ed. An __and__ method might be added if it were thought that people currently use & to check for flags set, though that is not currently promised.
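
A minimal sketch of such a non-int flag type follows. The class name REFlag, the bit values, the repr format and the check_count helper are all illustrative assumptions, not anything the re module provides. It uses __init__ rather than __new__ for brevity, and __str__ falls back to __repr__ by default.

```python
class REFlag:
    """Hypothetical flag type that is deliberately not an int subclass."""

    # Illustrative bit values only; not the real values used by re.
    _NAMES = {1: "ASCII", 2: "LOCALE"}

    def __init__(self, bits):
        self._bits = bits

    def __or__(self, other):
        if not isinstance(other, REFlag):
            # Returning NotImplemented makes `flag | 4` raise TypeError.
            return NotImplemented
        return REFlag(self._bits | other._bits)

    def __repr__(self):
        names = [
            "re." + name
            for bit, name in sorted(self._NAMES.items())
            if self._bits & bit
        ]
        return "REflag: " + " | ".join(names)


def check_count(count):
    """Hypothetical guard: reject anything that is not an int."""
    if not isinstance(count, int):
        raise TypeError("count must be an int, not " + type(count).__name__)
    return count


ASCII = REFlag(1)
LOCALE = REFlag(2)
combined = ASCII | LOCALE
```

With this shape, passing `combined` where a count is expected fails loudly in `check_count` instead of being interpreted as a number.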
msg135386 - Author: Matthew Barnett (mrabarnett) Date: 2011-05-06 23:32
Something like "<re.Flag ASCII | LOCALE>" may be more Pythonic.
msg135391 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2011-05-07 00:13
Agreed, if we go that route.
msg136657 - Author: Éric Araujo (eric.araujo) (Python committer) Date: 2011-05-23 15:01
I’d favor 1) or 2) over 3).  Ints are short and very commonly used for flags.
msg143520 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2011-09-05 14:55
See also #12888 for an error in the stdlib caused by this.
msg186784 - Author: Mike Milkin (mmilkin) Date: 2013-04-13 18:27
I like option #2, and I was thinking of working on it today; poke me if anyone has a problem with this.
msg186825 - Author: Mike Milkin (mmilkin) Date: 2013-04-13 20:24
There is no sane way to issue a warning without changing the signature, and we don't want to change the signature without first issuing a deprecation warning for the function, so sadly option 3 is the only way for this to work. (I'm not going to touch this until enums are merged in.)
msg186832 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-04-13 20:32
Can't you use *args and **kwargs and then raise a deprecation warning if count and/or flags are in args?
Even if enums are merged in, there might still be issues depending on their implementation.
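
One way the *args/**kwargs idea could look, as a sketch only. The name sub_compat and the warning text are hypothetical; the wrapper simply forwards to re.sub after converting positional count/flags into keywords.

```python
import re
import warnings


def sub_compat(pattern, repl, string, *args, **kwargs):
    """Hypothetical transitional wrapper around re.sub."""
    if len(args) > 2:
        raise TypeError("sub_compat() takes at most 5 positional arguments")
    if args:
        warnings.warn(
            "passing count or flags positionally is deprecated; use keywords",
            DeprecationWarning,
            stacklevel=2,
        )
        # Map the extra positional values onto their keyword names.
        for name, value in zip(("count", "flags"), args):
            if name in kwargs:
                raise TypeError(
                    "sub_compat() got multiple values for argument %r" % name
                )
            kwargs[name] = value
    return re.sub(pattern, repl, string, **kwargs)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = sub_compat("a", "x", "aaaa", 2)  # still works, but warns

new_style = sub_compat("a", "x", "aaaa", count=2)  # no warning
```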
msg186844 - Author: Mike Milkin (mmilkin) Date: 2013-04-13 21:00
We could do that, but we would be changing the signature before adding the warning.
msg186856 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-04-13 21:37
The change would still be backwards compatible (even though inspect.signature and similar functions might return something different).  Note that I'm not saying that's the best option, but it should be doable.
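
To make the introspection caveat concrete, here is a small sketch comparing two stand-in functions. Both names are hypothetical; sub_current only mirrors the parameter list documented for re.sub, and neither function does any work.

```python
import inspect


def sub_current(pattern, repl, string, count=0, flags=0):
    """Stand-in with the documented re.sub parameter list."""


def sub_transitional(pattern, repl, string, *args, **kwargs):
    """Stand-in for a hypothetical *args/**kwargs version."""


current_sig = str(inspect.signature(sub_current))
transitional_sig = str(inspect.signature(sub_transitional))

print(current_sig)       # (pattern, repl, string, count=0, flags=0)
print(transitional_sig)  # (pattern, repl, string, *args, **kwargs)
```

Existing calls keep working either way; only tools that read the signature would see the difference.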
History
Date                 User          Action  Args
2013-04-16 10:28:23  rhettinger    set     assignee: rhettinger -> ezio.melotti
2013-04-13 21:37:56  ezio.melotti  set     messages: + msg186856
2013-04-13 21:00:59  mmilkin       set     messages: + msg186844
2013-04-13 20:32:46  ezio.melotti  set     messages: + msg186832
2013-04-13 20:24:01  mmilkin       set     messages: + msg186825
2013-04-13 18:27:21  mmilkin       set     nosy: + mmilkin
                                           messages: + msg186784
2013-04-10 16:53:10  ezio.melotti  set     type: enhancement
                                           versions: + Python 3.4, - Python 3.1, Python 3.2
2012-11-10 05:25:06  eric.snow     set     nosy: - eric.snow
2011-12-15 19:08:01  eric.snow     set     nosy: + eric.snow
2011-09-05 14:55:41  ezio.melotti  set     messages: + msg143520
2011-05-23 15:01:30  eric.araujo   set     nosy: + eric.araujo
                                           messages: + msg136657
2011-05-14 22:30:36  rhettinger    set     assignee: rhettinger
                                           nosy: + rhettinger
2011-05-14 21:49:39  ezio.melotti  link    issue12078 superseder
2011-05-07 00:13:28  terry.reedy   set     messages: + msg135391
2011-05-06 23:32:44  mrabarnett    set     messages: + msg135386
2011-05-06 21:41:18  terry.reedy   set     nosy: + terry.reedy
                                           messages: + msg135371
2011-04-30 02:23:35  ezio.melotti  set     nosy: + ezio.melotti, mrabarnett
                                           title: re.sub problem with unicode string -> re.sub confusion between count and flags args
                                           messages: + msg134830
                                           versions: + Python 3.1, Python 3.2, Python 3.3
2011-04-29 22:58:40  eric.smith    set     nosy: + eric.smith
                                           messages: + msg134820
2011-04-29 18:27:10  mindauga      create