classification
Title: textwrap.shorten does not always respect word boundaries
Type: behavior
Stage: needs patch
Components: Library (Lib)
Versions: Python 3.11, Python 3.10, Python 3.9

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: andrei.avk, annesylvie, pitrou, serhiy.storchaka, terry.reedy
Priority: normal
Keywords:

Created on 2021-03-16 16:20 by annesylvie, last changed 2022-04-11 14:59 by admin.

Messages (4)
msg388858 - Author: annesylvie Date: 2021-03-16 16:24
The `shorten` function from the `textwrap` module does not always break strings at the correct location.

`shorten("hello world!", width=7, placeholder="")`
returns
`'hello'`
as expected, but
`shorten("hello world!!!!!!", width=7, placeholder="")`
returns
`'hello w'`
which is incorrect. The error seems to appear when two or more exclamation marks (in this specific example) are added.
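
A minimal script to reproduce the report, assuming only the standard-library `textwrap` module. The outputs quoted above were reported on Python 3.9 to 3.11 and may differ on other versions, so the script prints the results rather than asserting them.

```python
import textwrap

# Short trailing word: it is dropped whole, as the documentation describes.
short_tail = textwrap.shorten("hello world!", width=7, placeholder="")

# Longer trailing word: on affected versions a fragment of it leaks into the result.
long_tail = textwrap.shorten("hello world!!!!!!", width=7, placeholder="")

print(repr(short_tail))
print(repr(long_tail))
```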
msg389129 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2021-03-20 00:16
Verified in 3.10.0a6 that the change happens at 3 !s.  I agree that this is a bug relative to the doc.

The issue is that 'world!!!' is 8 chars, which is more than the width of 7, so by default wrap splits it into 'w' and 'orld!!!' and adds ' w' to 'hello'.
>>> sh('hello world!!!', width=7)
['hello w', 'orld!!!']

A solution is to not break long words.
>>> sh('hello world!!!', width=7, placeholder='', break_long_words=False)
'hello'

Then

>>> sh('hello!!!! world!!!', width=7, placeholder='', break_long_words=False)
''

versus

>>> sh('hello!!!! world!!!', width=7, placeholder='')
'hello!!'

The docstring and doc say "enough words are dropped from the end so that the remaining words plus the placeholder fit within width:".  Taking this literally, '' is correct.  So a fix would be to add "break_long_words=False" to options if break_long_words not in options.
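
The fix suggested above can be approximated in user code today. This is only a sketch: `shorten_whole_words` is a hypothetical name, not part of `textwrap`, and it simply defaults `break_long_words` to False unless the caller passes it explicitly.

```python
import textwrap


def shorten_whole_words(text, width, **kwargs):
    """Hypothetical wrapper around textwrap.shorten (not a stdlib function).

    Defaults break_long_words to False so that words are dropped whole,
    while still honouring an explicit value from the caller.
    """
    kwargs.setdefault("break_long_words", False)
    return textwrap.shorten(text, width, **kwargs)


# The trailing word is dropped instead of being split.
print(repr(shorten_whole_words("hello world!!!", width=7, placeholder="")))

# If even the first word is too long, nothing is left.
print(repr(shorten_whole_words("hello!!!! world!!!", width=7, placeholder="")))
```

Passing `break_long_words=True` explicitly keeps the current behaviour, so callers that rely on it are not affected by the wrapper.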

Antoine, you last touched the shorten docstring.  Serhiy, you last touched its code.  What do you two think?
msg396821 - Author: Andrei Kulakov (andrei.avk) (Python triager) Date: 2021-07-01 18:22
Also see https://bugs.python.org/issue44544 which I think will help users avoid this issue.
msg396835 - Author: Andrei Kulakov (andrei.avk) (Python triager) Date: 2021-07-02 00:43
Some observations:

 - Just to be clear (because annesylvie implied this is caused by exclamation marks), punctuation at the end of the word is not required to hit this bug:

In [44]: shorten("hello universe", width=7, placeholder="")
Out[44]: 'hello u'

(so for example adding an option to break at the boundary of word/punctuation would not fix this issue)

 - It would be good to fix this because my guess would be that most code using `shorten` relies on the default value of break_long_words, and this issue is easy to miss in testing.

 - My guess is that the goal of shorten is to return a shortened (okay, this much is obvious :) ) but representative snapshot of the text.

 - A user might also expect that it's consistent with TextWrapper, since it's essentially a wrapper around TextWrapper :)

Therefore, if we make a backwards incompatible change, the following would also be nice to have, perhaps requiring a new arg:

width=5
 1. universe => unive
 2. hi universe => hi
 3. hi universe => hi un
 4. universe => universe # allow longer if can't get width without breaking words

#4 would be consistent with TextWrapper handling of `break_long_words=False`

Some option (perhaps new arg?) should produce both #1 and #2, the idea being that we remove the Nth word if it doesn't fit, but break the 1st word so that there's still representation of text rather than a blank.

#3 would be the existing `break_long_words=True`, respecting width but providing max possible representation.
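
To make cases #1 and #2 concrete, here is a rough sketch of what such an option could do. `shorten_representative` is a hypothetical helper invented for illustration, not an existing or proposed `textwrap` API; it ignores placeholders and assumes width is at least 1.

```python
import textwrap


def shorten_representative(text, width):
    """Hypothetical helper, not part of textwrap.

    Keeps as many whole words as fit within width, but truncates the
    first word instead of returning an empty string.
    """
    words = text.split()
    if not words:
        return ""
    first = words[0]
    if len(first) > width:
        # Case 1: the first word alone is too long, so break it.
        return first[:width]
    kept = [first]
    length = len(first)
    for word in words[1:]:
        # The +1 accounts for the joining space.
        if length + 1 + len(word) > width:
            # Case 2: drop this word and everything after it.
            break
        kept.append(word)
        length += 1 + len(word)
    return " ".join(kept)


print(shorten_representative("universe", 5))     # case 1
print(shorten_representative("hi universe", 5))  # case 2
# Case 3 corresponds to calling shorten() with its defaults.
print(textwrap.shorten("hi universe", width=5, placeholder=""))
# Case 4 mirrors how wrap() treats break_long_words=False.
print(textwrap.wrap("universe", width=5, break_long_words=False))
```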

 - Generally speaking, shortening into one line is somewhat different than splitting into multiple lines, so it results in awkwardness when shortening is done by splitting into lines and keeping the first line.
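
A small sketch of that relationship, assuming an empty placeholder: the result of `shorten` can be compared with the first line produced by `textwrap.wrap` for the same text and width. On the affected versions listed in this issue the two are expected to match.

```python
import textwrap

text = "hello universe"

# wrap() breaks the over-long word "universe" to fill the first line.
lines = textwrap.wrap(text, width=7)
first_line = lines[0]

# shorten() with an empty placeholder, for comparison with first_line.
shortened = textwrap.shorten(text, width=7, placeholder="")

print(lines)
print(repr(shortened))
```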
History
Date | User | Action | Args
2022-04-11 14:59:42 | admin | set | github: 87684
2021-07-02 08:35:05 | serhiy.storchaka | set | assignee: serhiy.storchaka; versions: + Python 3.11, - Python 3.8
2021-07-02 00:43:19 | andrei.avk | set | messages: + msg396835
2021-07-01 18:22:31 | andrei.avk | set | nosy: + andrei.avk; messages: + msg396821
2021-03-20 00:16:07 | terry.reedy | set | versions: + Python 3.9, Python 3.10; nosy: + terry.reedy, serhiy.storchaka, pitrou; messages: + msg389129; stage: needs patch
2021-03-16 16:24:25 | annesylvie | set | messages: + msg388858
2021-03-16 16:20:43 | annesylvie | create |