Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re.sub confusion between count and flags args #56166

Closed
mindauga mannequin opened this issue Apr 29, 2011 · 30 comments
Closed

re.sub confusion between count and flags args #56166

mindauga mannequin opened this issue Apr 29, 2011 · 30 comments
Assignees
Labels
3.7 (EOL) end of life topic-regex type-feature A feature request or enhancement

Comments

@mindauga
Copy link
Mannequin

mindauga mannequin commented Apr 29, 2011

BPO 11957
Nosy @rhettinger, @terryjreedy, @vstinner, @ericvsmith, @jwilk, @ezio-melotti, @merwok, @serhiy-storchaka
Dependencies
  • bpo-23591: enum: Add Flags and IntFlags
  • Files
  • patch_11957
  • re_keyword_only.patch
  • re_check_flags_type.patch
  • re_deprecate_positional_count.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = None
    created_at = <Date 2011-04-29.18:27:10.198>
    labels = ['expert-regex', 'type-feature', '3.7']
    title = 're.sub confusion between count and flags args'
    updated_at = <Date 2017-04-14.18:55:29.189>
    user = 'https://bugs.python.org/mindauga'

    bugs.python.org fields:

    activity = <Date 2017-04-14.18:55:29.189>
    actor = 'jwilk'
    assignee = 'ezio.melotti'
    closed = False
    closed_date = None
    closer = None
    components = ['Regular Expressions']
    creation = <Date 2011-04-29.18:27:10.198>
    creator = 'mindauga'
    dependencies = ['23591']
    files = ['30821', '37067', '44825', '44826']
    hgrepos = []
    issue_num = 11957
    keywords = ['patch']
    message_count = 27.0
    messages = ['134806', '134820', '134830', '135371', '135386', '135391', '136657', '143520', '186784', '186825', '186832', '186844', '186856', '192416', '230223', '230224', '230226', '230236', '230237', '230238', '230358', '230375', '277386', '277398', '291655', '291659', '291677']
    nosy_count = 13.0
    nosy_names = ['rhettinger', 'terry.reedy', 'vstinner', 'eric.smith', 'jwilk', 'ezio.melotti', 'eric.araujo', 'mrabarnett', 'python-dev', 'mindauga', 'serhiy.storchaka', 'mmilkin', 'umi']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue11957'
    versions = ['Python 3.6', 'Python 3.7']

    Linked PRs

    @mindauga
    Copy link
    Mannequin Author

    mindauga mannequin commented Apr 29, 2011

    re.sub don't substitute not ASCII characters:

    Python 2.7.1 (r271:86832, Apr 15 2011, 12:11:58) Arch Linux

    >>import re

    >>a=u'aaa'
    >>print re.search('(\w+)',a,re.U).groups()
    (u'aaa')
    >>print re.sub('(\w+)','x',a,re.U)
    x

      BUT:
    

    >>a=u'ąąą'
    >>print re.search('(\w+)',a,re.U).groups()
    (u'\u0105\u0105\u0105')
    >>print re.sub('(\w+)','x',a,re.U)
    ąąą

    @ericvsmith
    Copy link
    Member

    The 4th parameter to re.sub() is a count, not flags.

    @ezio-melotti
    Copy link
    Member

    Since this has been reported already several times (see e.g. bpo-11947), and it's a fairly common mistake, I think we should do something to avoid it.

    A few possibilities are:

    1. add a warning in the doc;
    2. make count and flag keyword-only argument (raising a deprecation warning in 3.3 and actually change it later);
    3. change the regex flags to some object that can be distinguished from ints and raise an error when a flag is passed to count;

    @ezio-melotti ezio-melotti changed the title re.sub problem with unicode string re.sub confusion between count and flags args Apr 30, 2011
    @terryjreedy
    Copy link
    Member

    I like the idea of an internal REflag class with __new__, __or__, and __repr__==str. Str(re.A|re.L) might print as
    "REflag: re.ASCII | re.IGNORE"
    If it is *not* an int subclass, any attempt to use or mix with an int would raise. I checked and the doc only promises that flags can be or'ed. An __and__ method might be added if it were thought that people currently use & to check for flags set, though that is not currently promised.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented May 6, 2011

    Something like "<re.Flag ASCII | IGNORE>" may be more Pythonic.

    @terryjreedy
    Copy link
    Member

    Agreed, if we go that route.

    @rhettinger rhettinger self-assigned this May 14, 2011
    @merwok
    Copy link
    Member

    merwok commented May 23, 2011

    I’d favor 1) or 2) over 3). Ints are short and very commonly used for flags.

    @ezio-melotti
    Copy link
    Member

    See also bpo-12888 for an error in the stdlib caused by this.

    @ezio-melotti ezio-melotti added the type-feature A feature request or enhancement label Apr 10, 2013
    @mmilkin
    Copy link
    Mannequin

    mmilkin mannequin commented Apr 13, 2013

    I like option #2, and I was thinking of working on it today, poke me if anyone has a problem with this.

    @mmilkin
    Copy link
    Mannequin

    mmilkin mannequin commented Apr 13, 2013

    There is no sane way to issue a warning without changing the signature and we don't want to change the signature without issuing a deprecation warning for the function, so sadly option 3 is the only way for this to work, (Im going to not touch this till ENUMS are merged in.)

    @ezio-melotti
    Copy link
    Member

    Can't you use *args and **kwargs and then raise a deprecation warning if count and/or flags are in args?
    Even if enums are merged in, there might still be issues depending on their implementation.

    @mmilkin
    Copy link
    Mannequin

    mmilkin mannequin commented Apr 13, 2013

    We could do that but we would be changing the signature before adding the warning

    @ezio-melotti
    Copy link
    Member

    The change would still be backwards compatible (even though inspect.signature and similar functions might return something different). Note that I'm not saying that's the best option, but it should be doable.

    @umi
    Copy link
    Mannequin

    umi mannequin commented Jul 6, 2013

    Please see my patch, I have changed flags to be instances of IntEnum and added a check to re.sub, re.subn and re.split. The patch contains some tests. This solution also allowed me to discover several bugs in the standard library, and I am going to create tickets for them shortly.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 29, 2014

    New changeset 767fd62b59a9 by Victor Stinner in branch 'default':
    Issue bpo-11957: Explicit parameter name when calling re.split() and re.sub()
    https://hg.python.org/cpython/rev/767fd62b59a9

    @vstinner
    Copy link
    Member

    I suggest to make the 2 last parameters of re.sub(), re.subn() and re.split() parameters as keyword-only. It will break applications using count and maxsplit parameters as index parameters, but it's easy to fix these applications if they want to support also Python 3.5.

    I checked Python 2.6: the name of the maxsplit and count parameters didn't change. So it's possible to write code working on Python 2.6-3.5 if the parameter name is explicitly used:

    • re.sub("a", "a", "a", count=1)
    • re.subn("a", "a", "a", count=1)
    • re.split("a", "a", maxsplit=1)

    The flags parameter was added to re.sub(), re.subn() and re.split() functions in Python 2.7:

    See my attached re_keyword_only.patch:

    • sub(), subn(): count and flags become keyword-only parameters
    • split(): maxsplit and flags become keyword-only parameters

    @vstinner
    Copy link
    Member

    Confusion between count/maxplit and count parameters is common, duplicated issues:

    See also issue bpo-13385 which proposed an explicit "re.NOFLAGS flag".

    @serhiy-storchaka
    Copy link
    Member

    Thank you for your patch Valentina. But it makes flags combinations not pickleable.

    >>> import re, pickle
    >>> pickle.dumps(re.I|re.S, 3)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    _pickle.PicklingError: Can't pickle <enum 'SubFlag'>: attribute lookup SubFlag on sre_constants failed
    >>> pickle.dumps(re.I|re.S, 4)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    _pickle.PicklingError: Can't pickle <enum 'SubFlag'>: attribute lookup BaseFlags.__or__.<locals>.SubFlag on sre_constants failed

    And I'm afraid that creating new class in the "|" operator can affect performance.

    @serhiy-storchaka
    Copy link
    Member

    As for 767fd62b59a9, I doubt that changing positional arguments to keyword argumennts in tests is justified. This can hide a bug.

    @serhiy-storchaka
    Copy link
    Member

    And again about patch_11957. I afraid that testing isinstance(count, sre_constants.BaseFlags) on every re.sub() call will hit performance too.

    @ezio-melotti
    Copy link
    Member

    I agree about 767fd62b59a9, there should be tests for args passed both by position and as keyword args.

    Serhiy, do you think the enum solution is worth pursuing, or is it better to just turn those args to keyword-only (after a proper deprecation process)?

    @serhiy-storchaka
    Copy link
    Member

    I think that the enum solution is worth pursuing, and that we need general
    class which represents the set of specified named flags. I'm working on
    implementation of enum.IntFlags.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 25, 2016

    New changeset 216e8b809e4e by Serhiy Storchaka in branch '3.5':
    Issue bpo-11957: Restored re tests for passing count and maxsplit as positional
    https://hg.python.org/cpython/rev/216e8b809e4e

    New changeset b39b09290718 by Serhiy Storchaka in branch '3.6':
    Issue bpo-11957: Restored re tests for passing count and maxsplit as positional
    https://hg.python.org/cpython/rev/b39b09290718

    New changeset da2c96cf2ce6 by Serhiy Storchaka in branch 'default':
    Issue bpo-11957: Restored re tests for passing count and maxsplit as positional
    https://hg.python.org/cpython/rev/da2c96cf2ce6

    @serhiy-storchaka
    Copy link
    Member

    Here are two alternative patches. The first patch checks if count or maxsplit arguments are re.RegexFlag and raise TypeError if it is true. This makes misusing flags fail fast. The second patch deprecates passing count and maxsplit arguments as positional arguments. This imposes your to change your code (even if it is valid now) and makes hard misusing flags.

    Unfortunately both ways slow down calling functions.

    $ ./python -m perf timeit -s "import re" -- 're.split(":", ":a:b::c", 2)'

    unpatched: Median +- std dev: 2.73 us +- 0.09 us
    check_flags_type: Median +- std dev: 3.74 us +- 0.09 us
    deprecate_positional_count: Median +- std dev: 10.6 us +- 0.2 us

    $ ./python -m perf timeit -s "import re" -- 're.split(":", ":a:b::c", maxsplit=2)'

    unpatched: Median +- std dev: 2.78 us +- 0.07 us
    check_flags_type: Median +- std dev: 3.75 us +- 0.10 us
    deprecate_positional_count: Median +- std dev: 2.86 us +- 0.08 us

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Sep 25, 2016
    @vstinner
    Copy link
    Member

    My issue bpo-30072 has been marked as a duplicate of this one. Copy of my msg291650:

    The re API seems commonly misused. Example passing a re flag to re.sub():

    >>> re.sub("A", "B", "ahah", re.I)
    'ahah'

    No error, no warning, but it doesn't work. Oh, sub has 5 paramters, no 4...

    I suggest to convert count and flags to keyword-only parameters. To not break the world, especially legit code passing the count parameter as a position argument, an option is to have a deprecation period if these two parameters are passed a positional-only parameter.

    --

    Another option would be to rely on the fact that re flags are now enums instead of raw integers, and so add basic type check...

    Is there are risk of applications using re flags serialized by pickle from Pyhon < 3.6 and so getting integers?

    Maybe the check should only be done if flags are passing as positional-only argument... but the implementation of such check seems may be overkill for such simple and performance-critical function, no?

    See issue bpo-30067 for a recent bug in the Python stdlib!

    @serhiy-storchaka
    Copy link
    Member

    Victor, I borrowed Guido's time machine and wrote patches implementing both your suggestions a half year ago.

    @jwilk
    Copy link
    Mannequin

    jwilk mannequin commented Apr 14, 2017

    + raise TypeError("sub() takes from 2 to 4 positional arguments "
    + "but %d were given" % (4 + len(args)))

    It's actually 3 to 5 for sub() and subn().

    @Zac-HD
    Copy link
    Contributor

    Zac-HD commented Aug 8, 2023

    I'd like to propose that we merge Serhiy's second patch, which deprecates passing count and maxsplit arguments as positional arguments.

    If necessary, I'd be happy to write or organize a PR.

    @hauntsaninja
    Copy link
    Contributor

    Thanks for bumping this, I'd forgotten about it. As you can tell by my previous comments, I think this is something we need to do. Just fixed another one of these a couple weeks ago. I'll open a PR with Serhiy's patch

    hauntsaninja added a commit to hauntsaninja/cpython that referenced this issue Aug 8, 2023
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 8, 2023
    …e functions
    
    Deprecate passing optional arguments maxsplit, count and flags in
    module-level functions re.split(), re.sub() and re.subn() as positional.
    They should only be passed by keyword.
    @serhiy-storchaka
    Copy link
    Member

    It is an old patch and it required polishing.

    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Aug 8, 2023
    …e functions
    
    Deprecate passing optional arguments maxsplit, count and flags in
    module-level functions re.split(), re.sub() and re.subn() as positional.
    They should only be passed by keyword.
    hauntsaninja pushed a commit that referenced this issue Aug 16, 2023
    …tions (#107778)
    
    Deprecate passing optional arguments maxsplit, count and flags in
    module-level functions re.split(), re.sub() and re.subn() as positional.
    They should only be passed by keyword.
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life topic-regex type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    9 participants