Issue 32397: textwrap output may change if you wrap a paragraph twice

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/76578

classification

Title:	textwrap output may change if you wrap a paragraph twice
Type:	enhancement	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.11

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	andrei.avk, cheryl.sabella, iritkatriel, larry, mdk, serhiy.storchaka
Priority:	low	Keywords:	patch

Created on 2017-12-21 14:10 by larry, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
textwrap.isnt.stable.py	larry, 2017-12-21 14:10

Pull Requests
URL	Status	Linked	Edit
PR 5615	open	larry, 2018-02-11 01:48
PR 27587	open	andrei.avk, 2021-08-04 04:12

Messages (10)
msg308872 - (view)	Author: Larry Hastings (larry) *	Date: 2017-12-21 14:10
If you word-wrap a paragraph twice with textwrap, you may get different results. Specifically, you will get different results when: * the original text has a line that is too long by one character, * the last word on the line is the first word in a new sentence, and * there are two spaces after the period. The first textwrap will replace the two spaces after the period with a newline; the second textwrap will replace the newline with a single space. Attached is a test case demonstrating the problem. It's not a big problem, but it did cause an assertion failure in blurb. The workaround was to word-wrap all paragraphs twice, which works but is kind of dumb.
msg308873 - (view)	Author: Larry Hastings (larry) *	Date: 2017-12-21 14:11
FWIW, the test program produces this output: ------------------------------ original: 'xxxx xxxx xxxx xxxx xxxx. xxxx' wrapped: 'xxxx xxxx xxxx xxxx xxxx.\nxxxx' wrapped twice: 'xxxx xxxx xxxx xxxx xxxx. xxxx' Traceback (most recent call last): File "textwrap.isnt.stable.py", line 24, in <module> assert wrapped == wrapped2 AssertionError
msg342825 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2019-05-18 23:52
@larry, it looks like this was close to being merged pending some review comments by Serhiy. Although this is considered a bug and not a new feature, it might be nice to try to get this in for 3.8. Thanks!
msg378248 - (view)	Author: Julien Palard (mdk) *	Date: 2020-10-08 13:20
Could be "related" to https://bugs.python.org/issue41975.
msg398726 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-08-02 06:01
This issue is due to drop_whitespace arg being True by default. Documentation states that drop_whitespace applies after wrapping, so the fix in the PR would break that promise because, as far as I understand, it applies the wrapping after drop_whitespace. (Irit's comment on the PR refers to this). I feel like the behaviour in the PR is more logical and probably what most users would prefer and need in most cases. But it's a backwards compatibility change of behaviour that's not buggy. 'foo bar' # wrap for width=10 ['foo','bar'] # current behaviour ['foo bar'] # more intuitive wrapping to width=10 As I was looking at this, I thought that wrapping with drop_whitespace=False would surely be stable. Not so! It looks like the logic assumes that \n needs to be replaced by a space for cases like 'foo\nbar', which makes sense because otherwise two words would be joined together. But with drop_whitespace=False, repeated fill() calls will keep replacing \n with a space and then adding a new \n to split lines again: original: 'xxxx xxxx xxxx xxxx xxxx. xxxx' wrapped: 'xxxx xxxx xxxx xxxx xxxx. \nxxxx' wrapped twice: 'xxxx xxxx xxxx xxxx xxxx. \nxxxx' wrapped thrice: 'xxxx xxxx xxxx xxxx xxxx. \nxxxx' Further, a newline with a nearby space within the requested width will be converted to two spaces: 'a\n b' => 'a b' I don't know if the original issue is worth fixing or not, but I think the issue shown above would be good to fix.
msg398836 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-08-03 16:04
I think fix to make `drop_whitespace=False` stable, can be as simple as adding two lines in `_munge_whitespace()`: + text = re.sub(r' \n', ' ', text) + text = re.sub(r'\n ', ' ', text) text = text.translate(self.unicode_whitespace_trans) The perf impact is not small though, 12% : 2892 (~/opensource/cpython) % ./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\nbar baz", 5)' --INS-- 5000 loops, best of 5: 60.2 usec per loop 2893 (~/opensource/cpython) % r --INS-- ./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\nbar baz", 5)' 5000 loops, best of 5: 52.9 usec per loop I don't know if it's worth doing, but if yes, the options are: - just add this change for drop_whitespace=False, which is not the default, so perf regression will not affect default usage of wrap. - add a new arg that will only have effect when drop_whitespace=False, and will run these 2 lines. Name could be something like `collapse_space_newline`. It's hard to think of a good name. If '\r\n' is handled, it needs one additional `sub()` line, and the perf. difference is 22%.
msg398839 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-08-03 17:06
You should be able to do them in one re, something like text = re.sub(r' ?\n', ' ', text)
msg398846 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-08-03 17:53
Irit: I assume you mean r' \r?\n', that's a great idea, it's much faster than adding a separate replacement step. Latest version I came up with is this: if re.search(r' \r?\n', text): text = re.sub(r' \r?\n', ' ', text) if re.search(r'\r?\n ', text): text = re.sub(r'\r?\n ', ' ', text) This optimizes the case when there's no newlines, which is likely the most common case for small fragments of text, but it may be the less common case for larger fragments where performance is more important; so I'm not sure if it's worth it. Timings: # sub() has to run 2904 (~/opensource/cpython) % ./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\n bar baz", 5)' ----VICMD---- 5000 loops, best of 5: 67.6 usec per loop # search() runs; but sub() does NOT because there's no adjacent space 2906 (~/opensource/cpython) % ./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\nbar baz", 5)' ----VICMD---- 5000 loops, best of 5: 60.3 usec per loop
msg398847 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-08-03 17:54
Note that I'm not handling a single '\r' because that was before Mac OS X; but it is handled by the following line (i.e. by the old logic): text = text.translate(self.unicode_whitespace_trans)
msg398858 - (view)	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-08-04 04:15
I've added an initial draft PR: https://github.com/python/cpython/pull/27587/files I will add docs and news if this looks good in general.

History
Date	User	Action	Args
2022-04-11 14:58:55	admin	set	github: 76578
2021-08-23 19:11:29	terry.reedy	set	type: behavior -> enhancement versions: + Python 3.11, - Python 3.7, Python 3.8, Python 3.9
2021-08-04 04:15:31	andrei.avk	set	messages: + msg398858
2021-08-04 04:12:08	andrei.avk	set	pull_requests: + pull_request26089
2021-08-03 17:54:40	andrei.avk	set	messages: + msg398847
2021-08-03 17:53:12	andrei.avk	set	messages: + msg398846
2021-08-03 17:06:55	iritkatriel	set	nosy: + iritkatriel messages: + msg398839
2021-08-03 16:04:21	andrei.avk	set	messages: + msg398836
2021-08-02 06:01:27	andrei.avk	set	nosy: + andrei.avk messages: + msg398726
2020-10-08 13:20:58	mdk	set	nosy: + mdk messages: + msg378248
2019-10-20 13:03:55	cheryl.sabella	set	versions: + Python 3.9, - Python 3.6
2019-05-18 23:52:03	cheryl.sabella	set	nosy: + cheryl.sabella messages: + msg342825
2018-02-11 09:39:18	serhiy.storchaka	set	nosy: + serhiy.storchaka type: behavior versions: + Python 3.6, Python 3.8
2018-02-11 01:48:02	larry	set	keywords: + patch stage: patch review pull_requests: + pull_request5425
2017-12-21 14:11:26	larry	set	messages: + msg308873
2017-12-21 14:10:43	larry	create