This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: textwrap output may change if you wrap a paragraph twice
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: andrei.avk, cheryl.sabella, iritkatriel, larry, mdk, serhiy.storchaka
Priority: low Keywords: patch

Created on 2017-12-21 14:10 by larry, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description Edit
textwrap.isnt.stable.py larry, 2017-12-21 14:10
Pull Requests
URL Status Linked Edit
PR 5615 open larry, 2018-02-11 01:48
PR 27587 open andrei.avk, 2021-08-04 04:12
Messages (10)
msg308872 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2017-12-21 14:10
If you word-wrap a paragraph twice with textwrap, you may get different results.  Specifically, you *will* get different results when:
* the original text has a line that is too long by one character,
* the last word on the line is the first word in a new sentence, and
* there are two spaces after the period.

The first textwrap will replace the two spaces after the period with a newline; the second textwrap will replace the newline with a single space.

Attached is a test case demonstrating the problem.

It's not a big problem, but it did cause an assertion failure in blurb.  The workaround was to word-wrap all paragraphs twice, which works but is kind of dumb.
msg308873 - (view) Author: Larry Hastings (larry) * (Python committer) Date: 2017-12-21 14:11
FWIW, the test program produces this output:

------------------------------

     original: 'xxxx xxxx xxxx xxxx xxxx.  xxxx'
      wrapped: 'xxxx xxxx xxxx xxxx xxxx.\nxxxx'
wrapped twice: 'xxxx xxxx xxxx xxxx xxxx. xxxx'

Traceback (most recent call last):
  File "textwrap.isnt.stable.py", line 24, in <module>
    assert wrapped == wrapped2
AssertionError
msg342825 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-05-18 23:52
@larry, it looks like this was close to being merged pending some review comments by Serhiy.  Although this is considered a bug and not a new feature, it might be nice to try to get this in for 3.8.  Thanks!
msg378248 - (view) Author: Julien Palard (mdk) * (Python committer) Date: 2020-10-08 13:20
Could be "related" to https://bugs.python.org/issue41975.
msg398726 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-02 06:01
This issue is due to *drop_whitespace* arg being True by default.

Documentation states that *drop_whitespace* applies after wrapping, so the fix in the PR would break that promise because, as far as I understand, it applies the wrapping after *drop_whitespace*. (Irit's comment on the PR refers to this).

I feel like the behaviour in the PR is more logical and probably what most users would prefer and need in most cases. But it's a backwards compatibility change of behaviour that's not buggy.

'foo                  bar'
# wrap for width=10
['foo','bar']  # current behaviour
['foo bar']    # more intuitive wrapping to width=10

As I was looking at this, I thought that wrapping with drop_whitespace=False would surely be stable. Not so!

It looks like the logic assumes that \n needs to be replaced by a space for cases like 'foo\nbar', which makes sense because otherwise two words would be joined together. But with drop_whitespace=False, repeated fill() calls will keep replacing \n with a space and then adding a new \n to split lines again:

      original: 'xxxx xxxx xxxx xxxx xxxx.  xxxx'
       wrapped: 'xxxx xxxx xxxx xxxx xxxx.  \nxxxx'
 wrapped twice: 'xxxx xxxx xxxx xxxx xxxx.   \nxxxx'
wrapped thrice: 'xxxx xxxx xxxx xxxx xxxx.    \nxxxx'

Further, a newline with a nearby space within the requested width will be converted to two spaces:
'a\n b' => 'a  b'

I don't know if the original issue is worth fixing or not, but I think the issue shown above would be good to fix.
msg398836 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-03 16:04
I think fix to make `drop_whitespace=False` stable, can be as simple as adding two lines in `_munge_whitespace()`:

+            text = re.sub(r' \n', ' ', text)
+            text = re.sub(r'\n ', ' ', text)
             text = text.translate(self.unicode_whitespace_trans)

The perf impact is not small though, 12% :

2892 (~/opensource/cpython) % ./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\nbar baz", 5)'              --INS--
5000 loops, best of 5: 60.2 usec per loop

2893 (~/opensource/cpython) % r                                                                                           --INS--
./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\nbar baz", 5)'
5000 loops, best of 5: 52.9 usec per loop


I don't know if it's worth doing, but if yes, the options are:

 - just add this change for drop_whitespace=False, which is not the default, so perf regression will not affect default usage of wrap.

 - add a new arg that will only have effect when drop_whitespace=False, and will run these 2 lines. Name could be something like `collapse_space_newline`. It's hard to think of a good name.

If '\r\n' is handled, it needs one additional `sub()` line, and the perf. difference is 22%.
msg398839 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-03 17:06
You should be able to do them in one re, something like 

text = re.sub(r' ?\n', ' ', text)
msg398846 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-03 17:53
Irit: I assume you mean r' \r?\n', that's a great idea, it's much faster than adding a separate replacement step.

Latest version I came up with is this:

                if re.search(r' \r?\n', text):
                    text = re.sub(r' \r?\n', ' ', text)
                if re.search(r'\r?\n ', text):
                    text = re.sub(r'\r?\n ', ' ', text)

This optimizes the case when there's no newlines, which is likely the most common case for small fragments of text, but it may be the less common case for larger fragments where performance is more important; so I'm not sure if it's worth it.

Timings:
# sub() has to run
2904 (~/opensource/cpython) % ./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\n bar baz", 5)'       ----VICMD----
5000 loops, best of 5: 67.6 usec per loop

# search() runs; but sub() does NOT because there's no adjacent space
2906 (~/opensource/cpython) % ./python.exe -mtimeit 'import textwrap' 'textwrap.wrap("abc foo\nbar baz", 5)'        ----VICMD----
5000 loops, best of 5: 60.3 usec per loop
msg398847 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-03 17:54
Note that I'm not handling a single '\r' because that was before Mac OS X; but it is handled by the following line (i.e. by the old logic):

            text = text.translate(self.unicode_whitespace_trans)
msg398858 - (view) Author: Andrei Kulakov (andrei.avk) * (Python triager) Date: 2021-08-04 04:15
I've added an initial draft PR: https://github.com/python/cpython/pull/27587/files

I will add docs and news if this looks good in general.
History
Date User Action Args
2022-04-11 14:58:55adminsetgithub: 76578
2021-08-23 19:11:29terry.reedysettype: behavior -> enhancement
versions: + Python 3.11, - Python 3.7, Python 3.8, Python 3.9
2021-08-04 04:15:31andrei.avksetmessages: + msg398858
2021-08-04 04:12:08andrei.avksetpull_requests: + pull_request26089
2021-08-03 17:54:40andrei.avksetmessages: + msg398847
2021-08-03 17:53:12andrei.avksetmessages: + msg398846
2021-08-03 17:06:55iritkatrielsetnosy: + iritkatriel
messages: + msg398839
2021-08-03 16:04:21andrei.avksetmessages: + msg398836
2021-08-02 06:01:27andrei.avksetnosy: + andrei.avk
messages: + msg398726
2020-10-08 13:20:58mdksetnosy: + mdk
messages: + msg378248
2019-10-20 13:03:55cheryl.sabellasetversions: + Python 3.9, - Python 3.6
2019-05-18 23:52:03cheryl.sabellasetnosy: + cheryl.sabella
messages: + msg342825
2018-02-11 09:39:18serhiy.storchakasetnosy: + serhiy.storchaka

type: behavior
versions: + Python 3.6, Python 3.8
2018-02-11 01:48:02larrysetkeywords: + patch
stage: patch review
pull_requests: + pull_request5425
2017-12-21 14:11:26larrysetmessages: + msg308873
2017-12-21 14:10:43larrycreate