Issue 43518: textwrap.shorten does not always respect word boundaries

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/87684

classification

Title:	textwrap.shorten does not always respect word boundaries
Type:	behavior	Stage:	needs patch
Components:	Library (Lib)	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	serhiy.storchaka	Nosy List:	andrei.avk, annesylvie, pitrou, serhiy.storchaka, terry.reedy
Priority:	normal	Keywords:

Created on 2021-03-16 16:20 by annesylvie, last changed 2022-04-11 14:59 by admin.

Messages (4)
msg388858	Author: (annesylvie)	Date: 2021-03-16 16:24
The `shorten` function from the `textwrap` module does not always break strings at the correct location. `shorten("hello world!", width=7, placeholder="")` returns `'hello'` as expected, but `shorten("hello world!!!!!!", width=7, placeholder="")` returns `'hello w'` which is incorrect. The error seems to appear when two or more exclamation marks (in this specific example) are added.
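The report above can be reproduced with a short script. This is a minimal sketch that uses only `textwrap.shorten` from the standard library; the helper name `show` is made up for illustration, and the values in the comments are the ones reported in this issue.

```python
import textwrap


def show(text, width=7):
    """Hypothetical helper: shorten with an empty placeholder and print the result."""
    result = textwrap.shorten(text, width=width, placeholder="")
    print(f"{text!r} -> {result!r}")
    return result


# The second word fits within the width limit check, so it is dropped whole.
ok = show("hello world!")        # reported: 'hello'

# The second word is longer than the width, so it is split mid-word.
bad = show("hello world!!!!!!")  # reported: 'hello w'
```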
msg389129	Author: Terry J. Reedy (terry.reedy) *	Date: 2021-03-20 00:16
Verified in 3.10.0a6 that the change happens at 3 !s. I agree that this is a bug relative to the doc. The issue is that 'world!!!' is 8 chars, and by default, wrap splits that into 'w' and 'orld!!!' and adds ' w' to 'hello'.

>>> sh('hello world!!!', width=7)
['hello w', 'orld!!!']

A solution is to not break long words.

>>> sh('hello world!!!', width=7, placeholder='', break_long_words=False)
'hello'

Then

>>> sh('hello!!!! world!!!', width=7, placeholder='', break_long_words=False)
''

versus

>>> sh('hello!!!! world!!!', width=7, placeholder='')
'hello!!'

The docstring and doc say "enough words are dropped from the end so that the remaining words plus the placeholder fit within width:". Taking this literally, '' is correct. So a fix would be to add "break_long_words=False" to options if break_long_words is not in options.

Antoine, you last touched the shorten docstring. Serhiy, you last touched its code. What do you two think?
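The fix suggested in this message can be approximated in user code today. The following is only a sketch of that idea as a wrapper, not a patch to `textwrap`; the name `shorten_whole_words` is hypothetical. It defaults `break_long_words` to `False` unless the caller passes it explicitly.

```python
import textwrap


def shorten_whole_words(text, width, **kwargs):
    """Hypothetical wrapper: like textwrap.shorten, but never splits a word
    unless the caller explicitly asks for break_long_words=True."""
    # Only fill in the default; an explicit caller choice is left untouched.
    kwargs.setdefault("break_long_words", False)
    return textwrap.shorten(text, width, **kwargs)


# The over-long second word is dropped instead of being split.
print(repr(shorten_whole_words("hello world!!!", 7, placeholder="")))      # 'hello'

# When even the first word does not fit, nothing is left.
print(repr(shorten_whole_words("hello!!!! world!!!", 7, placeholder="")))  # ''
```

The second call shows the trade-off discussed above: taking the documentation literally yields an empty string when the first word alone exceeds the width.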
msg396821	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-07-01 18:22
Also see https://bugs.python.org/issue44544 which I think will help users avoid this issue.
msg396835	Author: Andrei Kulakov (andrei.avk) *	Date: 2021-07-02 00:43
Some observations:

- Just to be clear (because annesylvie implied this is caused by exclamation marks), punctuation at the end of the word is not required to hit this bug:

  In [44]: shorten("hello universe", width=7, placeholder="")
  Out[44]: 'hello u'

  (so, for example, adding an option to break at the boundary of word/punctuation would not fix this issue)

- It would be good to fix this because my guess would be that most code using `shorten` does so with the default value of break_long_words, and this issue is easy to miss in testing.

- My guess is that the goal of shorten is to return a shortened (okay, this much is obvious :) ) but representative snapshot of the text.

- A user might also expect that it's consistent with TextWrapper, since it's essentially a wrapper around TextWrapper :)

  Therefore if we make a backwards incompatible change, the following would also be nice to have, perhaps requiring a new arg:

  width=5
  1. universe => unive
  2. hi universe => hi
  3. hi universe => hi un
  4. universe => universe  # allow longer if can't get width without breaking words

  #4 would be consistent with TextWrapper handling of `break_long_words=False`.

  Some option (perhaps new arg?) should produce both #1 and #2, the idea being that we remove the Nth word if it doesn't fit, but break the 1st word so that there's still representation of text rather than a blank.

  #3 would be the existing `break_long_words=True`, respecting width but providing max possible representation.

- Generally speaking, shortening into one line is somewhat different than splitting into multiple lines, so it results in awkwardness when shortening is done by splitting into lines and keeping the first line.
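The behavior described for #1 and #2 (drop a later word that does not fit, but truncate the first word rather than return a blank) could look roughly like the sketch below. This is an illustration only: `shorten_representative` is a hypothetical function, not part of `textwrap`, and it ignores the placeholder entirely to keep the example short.

```python
def shorten_representative(text, width):
    """Hypothetical sketch of proposals #1 and #2 (no placeholder support).

    Later words that do not fit are dropped whole; only the first word
    is ever truncated, so the result is never blank for non-empty input.
    """
    words = text.split()
    if not words:
        return ""
    first = words[0]
    if len(first) > width:
        # Proposal #1: break the first word so some text is still shown.
        return first[:width]
    kept = [first]
    length = len(first)
    for word in words[1:]:
        # Proposal #2: stop at the first word that would exceed the width
        # (the +1 accounts for the joining space).
        if length + 1 + len(word) > width:
            break
        kept.append(word)
        length += 1 + len(word)
    return " ".join(kept)


print(repr(shorten_representative("universe", 5)))     # 'unive'
print(repr(shorten_representative("hi universe", 5)))  # 'hi'
```

A real change to `shorten` would also need to account for the placeholder width and for the existing `TextWrapper` options, which this sketch deliberately leaves out.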

History
Date	User	Action	Args
2022-04-11 14:59:42	admin	set	github: 87684
2021-07-02 08:35:05	serhiy.storchaka	set	assignee: serhiy.storchaka versions: + Python 3.11, - Python 3.8
2021-07-02 00:43:19	andrei.avk	set	messages: + msg396835
2021-07-01 18:22:31	andrei.avk	set	nosy: + andrei.avk messages: + msg396821
2021-03-20 00:16:07	terry.reedy	set	versions: + Python 3.9, Python 3.10 nosy: + terry.reedy, serhiy.storchaka, pitrou messages: + msg389129 stage: needs patch
2021-03-16 16:24:25	annesylvie	set	messages: + msg388858
2021-03-16 16:20:43	annesylvie	create