Issue 22687: horrible performance of textwrap.wrap() with a long word

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/66877

classification

Title:	horrible performance of textwrap.wrap() with a long word
Type:	performance	Stage:	resolved
Components:		Versions:	Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	georg.brandl, inkerman, pitrou, python-dev, r.david.murray, roippi, serhiy.storchaka
Priority:	low	Keywords:	patch

Created on 2014-10-21 16:45 by inkerman, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
wordsplit_complexity.patch	pitrou, 2014-11-11 18:44		review
wordsplit_complexity2.patch	pitrou, 2014-11-11 19:00		review
wordsplit.patch	serhiy.storchaka, 2014-11-13 10:31		review
wordsplit_2.patch	serhiy.storchaka, 2014-11-13 18:19		review
wordsplit_3.patch	serhiy.storchaka, 2015-02-15 16:54		review

Messages (31)
msg229768 - (view)	Author: Piotr Engelking (inkerman)	Date: 2014-10-21 16:45
Wrapping a paragraph containing a long word takes a lot of time: $ time python3 -c 'import textwrap; textwrap.wrap("a" * 2 ** 16)' real 3m14.923s user 3m14.792s sys 0m0.016s $ A straightforward replacement is 5000 times faster: $ time python3 -c '("".join(x) for x in zip([iter("a" 2 ** 16)] * 70))' real 0m0.053s user 0m0.032s sys 0m0.016s $ Tested on Debian with python3.4 3.4.2-1 and python2.7 2.7.8-10.
msg229770 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-10-21 18:09
This particular case is related to the behavior of the wordsep_re regular expression in worst case. When text contains long sequence of words characters which is not ended by a hypen, or long sequence of non-word and non-space characters (and in some other cases), computational complexity of this regular expression matching is quadratic. This is a peculiarity of current implementation of regular expression engine. May be it is possible to rewrite the regular expression so that quadratic complexity will gone, but this is not so easy. The workaround -- use break_on_hyphens=False.
msg230864 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-08 15:21
May be atomic grouping or possessive quantifiers (issue433030) will help with this issue.
msg231037 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-11 18:44
Here is a patch which solves the algorithmic complexity issue by using a different scheme: instead of splitting, match words incrementally.
msg231039 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-11 19:00
Actually, it is enough to change the regexp while still using re.split(). Updated patch attached.
msg231044 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-11 19:42
Unfortunately there are two disadvantages: 1. wordsep_re and wordsep_simple_re are public attributes and user code can depend on this. Changing their is a way to customize TextWrapper. 2. This is slowdown common case (no abnormally long words): $ ./python -m timeit -s 'import textwrap; s = "abcde " * 10**4' -- 'textwrap.wrap(s)' Unpatched: 178 msec per loop Patched: 285 msec per loop First reason stopped me from writing a patch. When change the way how to split words, I suggest to use undocumented re scanner.
msg231045 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-11 19:46
Are you sure? I get the reverse results here (second patch): Unpatched: $ ./python -m timeit -s 'import textwrap; s = "abcde " * 10*4' -- 'textwrap.wrap(s)' 10 loops, best of 3: 27 msec per loop Patched: $ ./python -m timeit -s 'import textwrap; s = "abcde " 10**4' -- 'textwrap.wrap(s)' 10 loops, best of 3: 19.2 msec per loop > wordsep_re and wordsep_simple_re are public attributes and user code can depend on this. Changing their is a way to customize TextWrapper. With my second patch, that shouldn't be a problem.
msg231048 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-11 20:14
Oh, sorry, I tested your first patch. Your second patch is faster than current code to me. But it changes behavior. >>> textwrap.wrap('"1a-2b', width=5) ['"1a-', '2b'] With the patch the result is ['"1a-2', 'b'].
msg231050 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-11 20:24
Yes... but in both cases the result is nonsensical, and untested.
msg231052 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-11 20:48
Possessive quantifiers (issue433030) is not a panacea. They allow to speed up regular expressions, but the complexity is still quadratic. Antoine's patch makes the complexity linear.
msg231065 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-12 08:34
Current regex produces insane result. $ ./python -c "import textwrap; print(textwrap.wrap('this-is-a-useful-feature', width=1, break_long_words=False))" ['this-', 'is-a', '-useful-', 'feature'] Antoine's regex produces more correct result for this case: ['this-', 'is-', 'a-', 'useful-', 'feature']. But this is not totally correct, one-letter word should not be separated. This can be easy fixed.
msg231066 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-12 09:41
> But this is not totally correct, one-letter word should not be > separated. Why not? I guess it depends on English's rules for word splitting, which I don't know. In any case, this issue is not about improving correctness, only performance.
msg231067 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-12 10:06
> Why not? I guess it depends on English's rules for word splitting, which I > don't know. I suppose this is common rule in many languages. And current code supports it (there is a special code in the regex to ensure this rule). > In any case, this issue is not about improving correctness, > only performance. But the patch shouldn't add a regression. $ ./python -c "import textwrap; print(textwrap.wrap('this-is-a-useful', width=1, break_long_words=False))" Current code: ['this-', 'is-a-useful'] Patched: ['this-', 'is-', 'a-', 'useful'] Just use lookahead assertion to ensure that the hyphen is followed by at least two letters. My previous message is about that current code is not always correct so it is acceptable to replace it with not absolutely equivalent code.
msg231068 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-12 10:13
> I suppose this is common rule in many languages. I frankly don't know about this rule. And the tests don't check for it, so for me it's not broken.
msg231071 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-12 11:06
Tests are not perfect. But this is intentional design. The part of initial regex: r'\w{2,}-(?=\w{2,})\|' # hyphenated words Now it is more complicated. Note '(?=\w{2,})'.
msg231104 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-13 10:31
Here is a patch which is closer to current code but solves complexity issue and also fixes some bugs in current code. $ ./python -c "import textwrap; print(textwrap.wrap('this-is-a-useful-feature', width=1, break_long_words=False))" ['this-', 'is-a', '-useful-', 'feature'] $ ./python -c "import textwrap; print(textwrap.wrap('what-d\x27you-call-it.', width=1, break_long_words=False))" ['what-d', "'you-", 'call-', 'it.']
msg231105 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2014-11-13 10:50
LGTM.
msg231106 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-13 11:00
I don't understand: + expect = ("this-\|is-a-useful-\|feature-\|for-\|" + "reformatting-\|posts-\|from-\|tim-\|peters'ly").split('\|') + self.check_wrap(text, 1, expect, break_long_words=False) + self.check_split(text, expect) Why would "is-a-useful" remain unsplit? It looks like you're making up new rules.
msg231116 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-13 14:43
This is old rule. \w{2,}-(?=\w{2,} -- single letter shouldn't be separated. But there was a bug in such simple regex, it splits a word after non-word character (in particular apostrophe or hyphen) if it followed by word characters and hyphen. There were attempts to fix this bug in issue596434 and issue965425 but they missed a cases when non-word character is occurred inside a word. Originally I had assigned this issue only to 3.5 because I supposed that the solution needs either new features in re or backward-incompatible changes to word splitting algorithm. But found solution doesn't require 3.5-only features, doesn't change interface, and fixes performance and behavior bugs. So I think it should be applied to maintained releases too.
msg231121 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-13 15:23
> This is old rule. \w{2,}-(?=\w{2,} -- single letter shouldn't be separated. I don't agree. This was an implementation detail. There was no test, and it wasn't specified anywhere. If you think single letter shouldn't be separated, there should be some grammatical or typographical reference on the Internet to prove it. > There were attempts to fix this bug in issue596434 and issue965425 Those don't seem related to single letters between hyphens. > But found solution doesn't require 3.5-only features, doesn't change interface, and fixes performance and behavior bugs. It does change behaviour in ways that could break existing code. The textwrap behaviour is underspecified so it's not ok to assume that previous behaviour was obviously buggy.
msg231122 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-11-13 16:03
https://owl.english.purdue.edu/owl/resource/576/01/ Rule 8. So, no, in the middle of the word single letters aren't a problem, only at the beginning or the end of the word.
msg231127 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-13 18:18
Thank you David. If splitting single letter surrounded with hyphens is desirable, here is more complicated patch which does this. It deviates from original code more, but it doesn't look break any reasonable example. > The textwrap behaviour is underspecified so it's not ok to assume that previous behaviour was obviously buggy. Aren't ['this-', 'is-a', '-useful-', 'feature'] and ['what-d', "'you-", 'call-', 'it.'] obvious bugs?
msg231128 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-13 18:19
To clarify, I would be fine with the previous patch if it didn't add the tests.
msg231129 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-13 18:21
> Aren't ['this-', 'is-a', '-useful-', 'feature'] and > ['what-d', "'you-", 'call-', 'it.'] obvious bugs? Obvious according to which rules? If we want to improve the behaviour of textwrap, IMHO it should be in a separate issue. And someone would have to study the word-wrapping rules of the English language :-)
msg231130 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-11-13 18:42
What I usually do in cases like this is to add the tests but mark them with comments saying that the tests test current behavior but are not testing parts of the (currently defined) API. That way you know if a change changes behavior and then can decide if that is a problem or not, as opposed to inadvertently changing behavior and only finding out when the bug reports roll in :) But yeah, defining the rules textwrap should follow is a different issue than the performance issue.
msg231131 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-13 18:49
> To clarify, I would be fine with the previous patch if it didn't add the tests. The absent of tests could cause introducing new non-detected bugs and reappearing old bugs. > Obvious according to which rules? If you think a word should be splitted before hyphen or apostrophe, there should be some grammatical or typographical reference on the Internet to prove it. I would be fine with moving the fix of textwrap behavior to a separate issue, but what to do with this issue then? We have not a patch which only fixes performance complexity and doesn't change the behavior.
msg231144 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-11-14 00:35
> What I usually do in cases like this is to add the tests but mark > them with comments saying that the tests test current behavior but > are not testing parts of the (currently defined) API. That way > you know if a change changes behavior and then can decide if that is > a problem or not, as opposed to inadvertently changing behavior > and only finding out when the bug reports roll in :) That's a good idea!
msg231474 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-11-21 14:31
So what the patch (with mitigated tests) is more preferable?
msg234882 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-01-28 10:24
Ping. What can I do to move this issue forward?
msg236046 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-02-15 16:54
wordsplit_3.patch is wordsplit_2.patch with few added comments in tests. Is it enough?
msg239156 - (view)	Author: Roundup Robot (python-dev)	Date: 2015-03-24 16:35
New changeset 7bd87a219813 by Serhiy Storchaka in branch 'default': Issue #22687: Fixed some corner cases in breaking words in tetxtwrap. https://hg.python.org/cpython/rev/7bd87a219813

History
Date	User	Action	Args
2022-04-11 14:58:09	admin	set	github: 66877
2015-12-15 16:19:45	r.david.murray	link	issue25870 superseder
2015-03-24 20:05:10	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2015-03-24 16:35:56	python-dev	set	nosy: + python-dev messages: + msg239156
2015-02-15 16:54:19	serhiy.storchaka	set	files: + wordsplit_3.patch messages: + msg236046
2015-01-28 10:24:08	serhiy.storchaka	set	messages: + msg234882
2014-11-21 14:31:35	serhiy.storchaka	set	messages: + msg231474
2014-11-14 00:35:07	pitrou	set	messages: + msg231144
2014-11-13 18:49:07	serhiy.storchaka	set	messages: + msg231131
2014-11-13 18:42:08	r.david.murray	set	messages: + msg231130
2014-11-13 18:21:26	pitrou	set	messages: + msg231129
2014-11-13 18:19:54	pitrou	set	messages: + msg231128
2014-11-13 18:19:03	serhiy.storchaka	set	files: + wordsplit_2.patch
2014-11-13 18:18:23	serhiy.storchaka	set	messages: + msg231127
2014-11-13 16:03:21	r.david.murray	set	nosy: + r.david.murray messages: + msg231122
2014-11-13 15:23:03	pitrou	set	messages: + msg231121
2014-11-13 14:43:04	serhiy.storchaka	set	messages: + msg231116
2014-11-13 11:00:49	pitrou	set	versions: - Python 2.7, Python 3.4
2014-11-13 11:00:17	pitrou	set	messages: + msg231106
2014-11-13 10:50:57	georg.brandl	set	messages: + msg231105
2014-11-13 10:31:44	serhiy.storchaka	set	files: + wordsplit.patch messages: + msg231104 versions: + Python 2.7, Python 3.4
2014-11-12 11:06:15	serhiy.storchaka	set	messages: + msg231071
2014-11-12 10:13:27	pitrou	set	messages: + msg231068
2014-11-12 10:06:55	serhiy.storchaka	set	messages: + msg231067
2014-11-12 09:41:42	pitrou	set	messages: + msg231066
2014-11-12 08:34:21	serhiy.storchaka	set	messages: + msg231065
2014-11-11 20:48:14	serhiy.storchaka	set	messages: + msg231052
2014-11-11 20:24:41	pitrou	set	messages: + msg231050
2014-11-11 20:14:16	serhiy.storchaka	set	messages: + msg231048
2014-11-11 19:46:12	pitrou	set	messages: + msg231045
2014-11-11 19:42:36	serhiy.storchaka	set	messages: + msg231044
2014-11-11 19:00:25	pitrou	set	files: + wordsplit_complexity2.patch messages: + msg231039
2014-11-11 18:44:00	pitrou	set	files: + wordsplit_complexity.patch versions: - Python 2.7, Python 3.4 keywords: + patch nosy: + pitrou messages: + msg231037 stage: needs patch -> patch review
2014-11-08 15:21:16	serhiy.storchaka	set	messages: + msg230864
2014-10-21 21:17:32	roippi	set	nosy: + roippi
2014-10-21 18:09:14	serhiy.storchaka	set	priority: normal -> low assignee: serhiy.storchaka messages: + msg229770 stage: needs patch
2014-10-21 17:15:20	serhiy.storchaka	set	nosy: + georg.brandl, serhiy.storchaka versions: + Python 3.5
2014-10-21 16:45:26	inkerman	create