horrible performance of textwrap.wrap() with a long word #66877

Closed
inkerman mannequin opened this issue Oct 21, 2014 · 31 comments
Labels
performance Performance or resource usage

Comments

@inkerman
Mannequin

inkerman mannequin commented Oct 21, 2014

BPO 22687
Nosy @birkenfeld, @pitrou, @bitdancer, @serhiy-storchaka
Files
  • wordsplit_complexity.patch
  • wordsplit_complexity2.patch
  • wordsplit.patch
  • wordsplit_2.patch
  • wordsplit_3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-03-24.20:05:10.991>
    created_at = <Date 2014-10-21.16:45:26.428>
    labels = ['performance']
    title = 'horrible performance of textwrap.wrap() with a long word'
    updated_at = <Date 2015-03-24.20:05:10.990>
    user = 'https://bugs.python.org/inkerman'

    bugs.python.org fields:

    activity = <Date 2015-03-24.20:05:10.990>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-03-24.20:05:10.991>
    closer = 'serhiy.storchaka'
    components = []
    creation = <Date 2014-10-21.16:45:26.428>
    creator = 'inkerman'
    dependencies = []
    files = ['37179', '37180', '37188', '37190', '38149']
    hgrepos = []
    issue_num = 22687
    keywords = ['patch']
    message_count = 31.0
    messages = ['229768', '229770', '230864', '231037', '231039', '231044', '231045', '231048', '231050', '231052', '231065', '231066', '231067', '231068', '231071', '231104', '231105', '231106', '231116', '231121', '231122', '231127', '231128', '231129', '231130', '231131', '231144', '231474', '234882', '236046', '239156']
    nosy_count = 7.0
    nosy_names = ['georg.brandl', 'pitrou', 'r.david.murray', 'python-dev', 'serhiy.storchaka', 'roippi', 'inkerman']
    pr_nums = []
    priority = 'low'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue22687'
    versions = ['Python 3.5']

    @inkerman
    Mannequin Author

    inkerman mannequin commented Oct 21, 2014

    Wrapping a paragraph containing a long word takes a lot of time:

    $ time python3 -c 'import textwrap; textwrap.wrap("a" * 2 ** 16)'

    real 3m14.923s
    user 3m14.792s
    sys 0m0.016s
    $

    A straightforward replacement is 5000 times faster:

    $ time python3 -c '("".join(x) for x in zip(*[iter("a" * 2 ** 16)] * 70))'

    real 0m0.053s
    user 0m0.032s
    sys 0m0.016s
    $

    Tested on Debian with python3.4 3.4.2-1 and python2.7 2.7.8-10.
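A note on the comparison above, offered as a sketch and not as a benchmark: the one-liner builds a generator expression but never iterates over it, and zip() silently drops a trailing partial chunk. Assuming the goal is simply fixed-width chunks of a single unbroken word, a version that does the work and keeps the remainder could look like this (fixed_width_chunks is a hypothetical helper name, and it is not a general replacement for textwrap.wrap(), since it ignores whitespace entirely):

```python
def fixed_width_chunks(text, width=70):
    """Cut text into consecutive slices of at most `width` characters."""
    # Slicing keeps the final short chunk, unlike the zip(*[iter(...)] * n) idiom.
    return [text[i:i + width] for i in range(0, len(text), width)]


chunks = fixed_width_chunks("a" * 2**16)
print(len(chunks), len(chunks[-1]))  # 937 chunks, the last one 16 characters long
```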

    @inkerman inkerman mannequin added the performance Performance or resource usage label Oct 21, 2014
    @serhiy-storchaka
    Member

    This particular case is related to the worst-case behavior of the wordsep_re regular expression. When the text contains a long sequence of word characters that does not end with a hyphen, or a long sequence of non-word, non-space characters (and in some other cases), matching this regular expression takes quadratic time. This is a peculiarity of the current implementation of the regular expression engine. It may be possible to rewrite the regular expression so that the quadratic complexity goes away, but this is not so easy.

    The workaround is to use break_on_hyphens=False.
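A minimal sketch of that workaround, applied to the input from the original report. break_on_hyphens is a documented TextWrapper option; the variable names are only illustrative:

```python
import textwrap

long_word = "a" * 2**16

# With break_on_hyphens=False the simpler whitespace-only separator is used,
# so the hyphen-aware wordsep_re pattern is never run over the long word.
lines = textwrap.wrap(long_word, width=70, break_on_hyphens=False)

print(len(lines), max(len(line) for line in lines))
```

The trade-off is that hyphenated compounds are then treated as single words and are only broken by break_long_words.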

    @serhiy-storchaka serhiy-storchaka self-assigned this Oct 21, 2014
    @serhiy-storchaka
    Member

    Maybe atomic grouping or possessive quantifiers (bpo-433030) will help with this issue.

    @pitrou
    Member

    pitrou commented Nov 11, 2014

    Here is a patch which solves the algorithmic complexity issue by using a different scheme: instead of splitting, match words incrementally.

    @pitrou
    Member

    pitrou commented Nov 11, 2014

    Actually, it is enough to change the regexp while still using re.split(). Updated patch attached.

    @serhiy-storchaka
    Member

    Unfortunately there are two disadvantages:

    1. wordsep_re and wordsep_simple_re are public attributes and user code can depend on them. Changing them is a way to customize TextWrapper.

    2. This slows down the common case (no abnormally long words):

    $ ./python -m timeit -s 'import textwrap; s = "abcde " * 10**4' -- 'textwrap.wrap(s)'

    Unpatched: 178 msec per loop
    Patched: 285 msec per loop

    The first reason stopped me from writing a patch.
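To make the first concern concrete: TextWrapper looks up wordsep_re through the instance, so a subclass can swap in its own pattern. A minimal sketch, assuming Python 3.7 or later (where re.split() accepts patterns that can match an empty string). SlashWrapper and its pattern are hypothetical and are not taken from this issue or from the standard library:

```python
import re
import textwrap


class SlashWrapper(textwrap.TextWrapper):
    # Hypothetical customization: allow a break at whitespace or right after "/".
    # The single capturing group keeps separators in the re.split() output,
    # which is the shape TextWrapper expects from this attribute.
    wordsep_re = re.compile(r'(\s+|(?<=/))')


wrapper = SlashWrapper(width=8)
lines = wrapper.wrap("usr/local/lib")
print(lines)  # ['usr/', 'local/', 'lib']
```

Code like this is what would break if the attribute were removed or its expected shape changed.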

    If we change the way words are split, I suggest using the undocumented re scanner.
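For readers unfamiliar with it, re.Scanner takes a list of (pattern, action) pairs and tokenizes a string from left to right, one token per step. The sketch below only illustrates the interface. The token rules are simplified assumptions, not the rules textwrap uses, and since the class is undocumented its behavior may differ between Python versions:

```python
import re

# Each action receives the scanner object and the matched text.
scanner = re.Scanner([
    (r"\s+", lambda sc, tok: ("space", tok)),
    (r"[^\s\-]+-?", lambda sc, tok: ("word", tok)),  # word plus one optional trailing hyphen
    (r"-+", lambda sc, tok: ("dashes", tok)),        # leftover hyphen runs
])

# scan() returns the collected action results and any unmatched tail.
tokens, remainder = scanner.scan("well-known  fact")
print(tokens)
print(repr(remainder))
```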

    @pitrou
    Member

    pitrou commented Nov 11, 2014

    Are you sure? I get the reverse results here (second patch):

    Unpatched:
    $ ./python -m timeit -s 'import textwrap; s = "abcde " * 10**4' -- 'textwrap.wrap(s)'
    10 loops, best of 3: 27 msec per loop

    Patched:
    $ ./python -m timeit -s 'import textwrap; s = "abcde " * 10**4' -- 'textwrap.wrap(s)'
    10 loops, best of 3: 19.2 msec per loop
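The same measurement can be reproduced from a script, which makes it easier to run against a patched and an unpatched build. A small sketch using the timeit module; the loop counts are arbitrary choices, not the ones timeit picked above:

```python
import textwrap
import timeit

text = "abcde " * 10**4
loops = 5

# Three repetitions of `loops` calls each; the minimum is the least noisy figure.
timings = timeit.repeat(lambda: textwrap.wrap(text), number=loops, repeat=3)
best_ms = min(timings) / loops * 1000
print(f"{loops} loops, best of 3: {best_ms:.1f} msec per loop")
```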

    wordsep_re and wordsep_simple_re are public attributes and user code can depend on them. Changing them is a way to customize TextWrapper.

    With my second patch, that shouldn't be a problem.

    @serhiy-storchaka
    Member

    Oh, sorry, I tested your first patch. Your second patch is faster than the current
    code for me. But it changes behavior.

    >>> textwrap.wrap('"1a-2b', width=5)
    ['"1a-', '2b']

    With the patch the result is ['"1a-2', 'b'].

    @pitrou
    Member

    pitrou commented Nov 11, 2014

    Yes... but in both cases the result is nonsensical, and untested.

    @serhiy-storchaka
    Member

    Possessive quantifiers (bpo-433030) are not a panacea. They can speed up regular expressions, but the complexity is still quadratic. Antoine's patch makes the complexity linear.

    @serhiy-storchaka
    Member

    The current regex produces insane results.

    $ ./python -c "import textwrap; print(textwrap.wrap('this-is-a-useful-feature', width=1, break_long_words=False))"
    ['this-', 'is-a', '-useful-', 'feature']

    Antoine's regex produces a more correct result for this case: ['this-', 'is-', 'a-', 'useful-', 'feature']. But this is not totally correct: a one-letter word should not be separated. This can be easily fixed.

    @pitrou
    Member

    pitrou commented Nov 12, 2014

    But this is not totally correct: a one-letter word should not be
    separated.

    Why not? I guess it depends on English's rules for word splitting, which I don't know.
    In any case, this issue is not about improving correctness, only performance.

    @serhiy-storchaka
    Member

    Why not? I guess it depends on English's rules for word splitting, which I
    don't know.

    I suppose this is a common rule in many languages. And the current code supports it (there is special code in the regex to enforce this rule).

    In any case, this issue is not about improving correctness,
    only performance.

    But the patch shouldn't add a regression.

    $ ./python -c "import textwrap; print(textwrap.wrap('this-is-a-useful', width=1, break_long_words=False))"

    Current code: ['this-', 'is-a-useful']
    Patched: ['this-', 'is-', 'a-', 'useful']

    Just use a lookahead assertion to ensure that the hyphen is followed by at least two letters.

    My previous message was meant to show that the current code is not always correct, so it is acceptable to replace it with code that is not exactly equivalent.

    @pitrou
    Member

    pitrou commented Nov 12, 2014

    I suppose this is a common rule in many languages.

    I frankly don't know about this rule. And the tests don't check for it, so for me it's not broken.

    @serhiy-storchaka
    Member

    Tests are not perfect. But this is intentional design. Part of the initial
    regex:

    r'\w{2,}-(?=\w{2,})|'     # hyphenated words

    Now it is more complicated. Note '(?=\w{2,})'.
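The effect of that lookahead can be seen in isolation. The sketch below compares the quoted fragment (without its trailing alternation bar) against a simplified pattern that only requires one word character after the hyphen. The second pattern and the split_keep helper are assumptions made for illustration; neither is the real wordsep_re nor any of the attached patches:

```python
import re


def split_keep(pattern, text):
    """Split on a capturing pattern and drop the empty strings re.split() leaves behind."""
    return [chunk for chunk in re.split(pattern, text) if chunk]


# Hyphen must be preceded and followed by at least two word characters.
two_letter_rule = r'(\w{2,}-(?=\w{2,}))'
# Hypothetical relaxed variant: a single word character on each side is enough.
one_letter_rule = r'(\w+-(?=\w))'

print(split_keep(two_letter_rule, 'this-is-a-useful'))  # ['this-', 'is-a-useful']
print(split_keep(one_letter_rule, 'this-is-a-useful'))  # ['this-', 'is-', 'a-', 'useful']
```

With the two-letter rule, "is-" cannot be split off because only the single letter "a" follows its hyphen, which is why "is-a-useful" stays in one piece.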

    @serhiy-storchaka
    Member

    Here is a patch which is closer to the current code but solves the complexity issue and also fixes some bugs in the current code, which produces:

    $ ./python -c "import textwrap; print(textwrap.wrap('this-is-a-useful-feature', width=1, break_long_words=False))"
    ['this-', 'is-a', '-useful-', 'feature']
    $ ./python -c "import textwrap; print(textwrap.wrap('what-d\x27you-call-it.', width=1, break_long_words=False))"
    ['what-d', "'you-", 'call-', 'it.']

    @birkenfeld
    Member

    LGTM.

    @pitrou
    Member

    pitrou commented Nov 13, 2014

    I don't understand:

    + expect = ("this-|is-a-useful-|feature-|for-|"
    + "reformatting-|posts-|from-|tim-|peters'ly").split('|')
    + self.check_wrap(text, 1, expect, break_long_words=False)
    + self.check_split(text, expect)

    Why would "is-a-useful" remain unsplit? It looks like you're making up new rules.

    @serhiy-storchaka
    Member

    This is an old rule: \w{2,}-(?=\w{2,}) means a single letter shouldn't be separated. But there was a bug in such a simple regex: it splits a word after a non-word character (in particular an apostrophe or hyphen) if it is followed by word characters and a hyphen. There were attempts to fix this bug in bpo-596434 and bpo-965425, but they missed the cases where a non-word character occurs inside a word.

    Originally I had assigned this issue only to 3.5 because I supposed that the solution needed either new features in re or backward-incompatible changes to the word splitting algorithm. But the solution I found doesn't require 3.5-only features, doesn't change the interface, and fixes performance and behavior bugs. So I think it should be applied to maintained releases too.

    @pitrou
    Member

    pitrou commented Nov 13, 2014

    This is an old rule: \w{2,}-(?=\w{2,}) means a single letter shouldn't be separated.

    I don't agree. This was an implementation detail. There was no test, and it wasn't specified anywhere.
    If you think single letter shouldn't be separated, there should be some grammatical or typographical reference on the Internet to prove it.

    There were attempts to fix this bug in bpo-596434 and bpo-965425

    Those don't seem related to single letters between hyphens.

    But the solution I found doesn't require 3.5-only features, doesn't change the interface, and fixes performance and behavior bugs.

    It does change behaviour in ways that could break existing code. The textwrap behaviour is underspecified so it's not ok to assume that previous behaviour was obviously buggy.

    @bitdancer
    Member

    https://owl.english.purdue.edu/owl/resource/576/01/

    Rule 8.

    So, no, in the middle of the word single letters aren't a problem, only at the beginning or the end of the word.

    @serhiy-storchaka
    Member

    Thank you, David. If splitting a single letter surrounded by hyphens is desirable, here is a more complicated patch which does this. It deviates from the original code more, but it doesn't seem to break any reasonable example.

    The textwrap behaviour is underspecified so it's not ok to assume that previous behaviour was obviously buggy.

    Aren't ['this-', 'is-a', '-useful-', 'feature'] and ['what-d', "'you-", 'call-', 'it.'] obvious bugs?

    @pitrou
    Member

    pitrou commented Nov 13, 2014

    To clarify, I would be fine with the previous patch if it didn't add the tests.

    @pitrou
    Member

    pitrou commented Nov 13, 2014

    Aren't ['this-', 'is-a', '-useful-', 'feature'] and
    ['what-d', "'you-", 'call-', 'it.'] obvious bugs?

    Obvious according to which rules?

    If we want to improve the behaviour of textwrap, IMHO it should be in a separate issue. And someone would have to study the word-wrapping rules of the English language :-)

    @bitdancer
    Member

    What I usually do in cases like this is to add the tests but mark them with comments saying that the tests test current behavior but are not testing parts of the (currently defined) API. That way you know if a change changes behavior and then can decide if that is a problem or not, as opposed to inadvertently changing behavior and only finding out when the bug reports roll in :)
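A sketch of what such a marked test might look like. The class name and the wording of the comment are illustrative assumptions; the input string is the one used in the test excerpt quoted earlier in this thread:

```python
import textwrap
import unittest


class HyphenWrapCharacterization(unittest.TestCase):
    def test_hyphenated_width_40(self):
        # NOTE: this test pins down current behavior only. Where hyphenated
        # words may be broken is not part of the documented API, so a failure
        # here means "behavior changed, decide whether that matters", not
        # necessarily "the change is wrong".
        text = "this-is-a-useful-feature-for-reformatting-posts-from-tim-peters'ly"
        expect = ["this-is-a-useful-feature-for-",
                  "reformatting-posts-from-tim-peters'ly"]
        self.assertEqual(textwrap.wrap(text, 40), expect)
```

Saved in a test module, this runs with python -m unittest like any other test case.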

    But yeah, defining the rules textwrap should follow is a different issue than the performance issue.

    @serhiy-storchaka
    Member

    To clarify, I would be fine with the previous patch if it didn't add the tests.

    The absence of tests could lead to new undetected bugs and the reappearance of old ones.

    Obvious according to which rules?

    If you think a word should be split before a hyphen or apostrophe, there should be some grammatical or typographical reference on the Internet to prove it.

    I would be fine with moving the fix of textwrap behavior to a separate issue, but what should we do with this issue then? We do not have a patch that fixes only the performance problem and doesn't change the behavior.

    @pitrou
    Member

    pitrou commented Nov 14, 2014

    What I usually do in cases like this is to add the tests but mark
    them with comments saying that the tests test current behavior but
    are not testing parts of the (currently defined) API. That way
    you know if a change changes behavior and then can decide if that is
    a problem or not, as opposed to inadvertently changing behavior
    and only finding out when the bug reports roll in :)

    That's a good idea!

    @serhiy-storchaka
    Member

    So which patch (with mitigated tests) is preferable?

    @serhiy-storchaka
    Member

    Ping. What can I do to move this issue forward?

    @serhiy-storchaka
    Member

    wordsplit_3.patch is wordsplit_2.patch with a few added comments in the tests. Is it enough?

    @python-dev
    Mannequin

    python-dev mannequin commented Mar 24, 2015

    New changeset 7bd87a219813 by Serhiy Storchaka in branch 'default':
    Issue bpo-22687: Fixed some corner cases in breaking words in textwrap.
    https://hg.python.org/cpython/rev/7bd87a219813

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022