Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate splitting on possible zero-width re patterns #67007

Closed
serhiy-storchaka opened this issue Nov 8, 2014 · 13 comments
Closed

Deprecate splitting on possible zero-width re patterns #67007

serhiy-storchaka opened this issue Nov 8, 2014 · 13 comments
Assignees
Labels
extension-modules C modules in the Modules dir topic-regex type-bug An unexpected behavior, bug, or error

Comments

@serhiy-storchaka
Copy link
Member

BPO 22818
Nosy @pitrou, @ezio-melotti, @serhiy-storchaka
Files
  • re_deprecate_split_zero_width.patch
  • re_deprecate_split_zero_width_2.patch
  • re_deprecate_split_zero_width_3.patch
  • re_deprecate_split_zero_width_4.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-02-03.09:12:15.984>
    created_at = <Date 2014-11-08.11:01:20.317>
    labels = ['extension-modules', 'expert-regex', 'type-bug']
    title = 'Deprecate splitting on possible zero-width re patterns'
    updated_at = <Date 2015-02-03.09:12:15.983>
    user = 'https://github.com/serhiy-storchaka'

    bugs.python.org fields:

    activity = <Date 2015-02-03.09:12:15.983>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-02-03.09:12:15.984>
    closer = 'serhiy.storchaka'
    components = ['Extension Modules', 'Regular Expressions']
    creation = <Date 2014-11-08.11:01:20.317>
    creator = 'serhiy.storchaka'
    dependencies = []
    files = ['37148', '37765', '37863', '37864']
    hgrepos = []
    issue_num = 22818
    keywords = ['patch']
    message_count = 13.0
    messages = ['230843', '232657', '234259', '234262', '234263', '234737', '234742', '234762', '235158', '235162', '235209', '235321', '235323']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'ezio.melotti', 'mrabarnett', 'python-dev', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue22818'
    versions = ['Python 3.5']

    @serhiy-storchaka
    Copy link
    Member Author

    For now re.split doesn't split with zero-width regex. There are a number of issues for this (bpo-852532, bpo-988761, bpo-3262, bpo-22817). This is definitely a bug, but fixing this bug will likely break existing code which use regular expressions which can match zero-width (e.g. re.split('(:*)', 'ab')).

    I propose to deprecate splitting on possible zero-width regular expressions. This expressions either not work at all as expected (r'\b' never split) or can be rewritten to not match empty string ('(:*)' to '(:+)').

    In next release (3.6) we can convert deprecation warning to the exception, an then after transitional period change behavior to more correct handling zero-width matches without breaking backward compatibility.

    @serhiy-storchaka serhiy-storchaka added extension-modules C modules in the Modules dir topic-regex type-bug An unexpected behavior, bug, or error labels Nov 8, 2014
    @serhiy-storchaka
    Copy link
    Member Author

    I there are no objections I'm going to commit the patch soon.

    @serhiy-storchaka serhiy-storchaka self-assigned this Dec 15, 2014
    @serhiy-storchaka
    Copy link
    Member Author

    Now patterns which could match only an empty string (e.g. '(?m)^$' or '(?<=\w-)(?=\w)') are rejected at all. They never worked with current regex engine. Updated the documentation.

    Could anyone please make a review and correct my wording. It is desirable to get this in alpha 1 and receive early feedback.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 18, 2015

    I don't really understand why "this is definitely a bug".

    @serhiy-storchaka
    Copy link
    Member Author

    Because users expect that split() supports zero-width patterns (as sub() supports them) and regexps in other languages support splitting on zero-width patterns. This looks as accidental implementation detail (see my patch in bpo-22817 -- the difference is pretty small) frozen in the ages for backward compatibility. We can't change this behavior in maintained releases because this will break mach code which accidentally use zero-width patterns. But we can change it in future as new feature, after deprecating current behavior. This would be very useful feature. For example it would allow to simplify and speed up the regex used for splitting on hyphens in textwrap (something like r'(?<=\w-)(?=\w)').

    @serhiy-storchaka
    Copy link
    Member Author

    Could anyone please make a review (mainly documentation)? It would be good to get this change in first alpha.

    @serhiy-storchaka
    Copy link
    Member Author

    Thank you Ezio for your review. Updated patch includes most of your suggestions. But I think some places still can be dim.

    @serhiy-storchaka
    Copy link
    Member Author

    Updated patch includes Ezio's suggestions. Thank you Ezio, they looks great to me.

    @serhiy-storchaka
    Copy link
    Member Author

    I hesitate about warning type. Originally I was going to emit a DeprecationWarning in 3.5, may be change it to a UserWarning in 3.6, and raise a ValueError or change behavior in 3.7. What would be better?

    @serhiy-storchaka
    Copy link
    Member Author

    May be RuntimeWarning or FutureWarning are more appropriate?

    @ezio-melotti
    Copy link
    Member

    DeprecationWarning: Base class for warnings about deprecated features.
    UserWarning: Base class for warnings generated by user code.
    RuntimeWarning: Base class for warnings about dubious runtime behavior.
    FutureWarning: Base class for warnings about constructs that will change semantically in the future.

    I think FutureWarning would be a good choice here, since we are going to change the semantics of a construct (before zero-width pattern didn't split -> in the future they will).

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 3, 2015

    New changeset 7c667d8ae10d by Serhiy Storchaka in branch 'default':
    Issue bpo-22818: Splitting on a pattern that could match an empty string now
    https://hg.python.org/cpython/rev/7c667d8ae10d

    @serhiy-storchaka
    Copy link
    Member Author

    Thank you Ezio and Berker for your reviews.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir topic-regex type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants