Issue 28660: TextWrapper break_long_words=True, break_on_hyphens=True on long words

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/72846

classification

Title:	TextWrapper break_long_words=True, break_on_hyphens=True on long words
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.7, Python 3.6, Python 3.5, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	georg.brandl, iritkatriel, maubp, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2016-11-10 16:35 by maubp, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked
PR 22721	merged	iritkatriel, 2020-10-16 16:01

Messages (5)
msg280522 - (view)	Author: Peter (maubp)	Date: 2016-11-10 16:35
Quoting https://docs.python.org/2/library/textwrap.html

width (default: 70)
The maximum length of wrapped lines. As long as there are no individual words in the input text longer than width, TextWrapper guarantees that no output line will be longer than width characters.

It appears that with break_long_words=True and break_on_hyphens=True, any hyphenated term longer than the specified width does not get preferentially broken at a hyphen.

Example input:

We used the enyzme 2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate synthase.

Using break_long_words=True, break_on_hyphens=True
==================================================
We used the enyzme 2-succinyl-6-hydroxy-2,4-cycloh
exadiene-1-carboxylate synthase.
==================================================

Expected result using break_long_words=True, break_on_hyphens=True
==================================================
We used the enyzme 2-succinyl-6-hydroxy-2,4-
cyclohexadiene-1-carboxylate synthase.
==================================================

Given a width=50, then the 53 character long "word" of "2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate" must be split somewhere, and since break_on_hyphens=True it should break at a hyphen as shown above as the desired output.

Sample code:

import textwrap
w = 50
text = "We used the enyzme 2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate synthase."
print("Input:")
print("=" * w)
print(text)
print("=" * w)
print("Using break_long_words=True, break_on_hyphens=True")
print("=" * w)
print(textwrap.fill(text, width=w, break_long_words=True, break_on_hyphens=True))
print("=" * w)
msg304806 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-10-23 13:49
This is because the current hyphen-breaking algorithm only allows a break at a hyphen that sits between letters. This prevents dates and times from being broken. Perhaps it should be made more lenient when a word is too long to fit.
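
The letters-only rule described above can be observed through the public API. The following is a minimal sketch, not taken from the issue: the sample string is made up, and break_long_words=False is set so that any split can only come from the hyphen rule.

```python
import textwrap

# Made-up sample: one ordinary hyphenated word and one ISO-style date.
sample = "well-known 2016-11-10"

# With break_long_words=False, a word is only ever split by the hyphen rule.
with_hyphens = textwrap.wrap(sample, width=6, break_long_words=False,
                             break_on_hyphens=True)
without_hyphens = textwrap.wrap(sample, width=6, break_long_words=False,
                                break_on_hyphens=False)

# "well-known" is split after its hyphen; the date is left whole because
# its hyphens sit between digits, not letters.
print(with_hyphens)
# Neither word is split when hyphen breaking is disabled.
print(without_hyphens)
```

The chemical name in the report behaves like the date: none of its hyphens has letters on both sides, so the hyphen rule never offers a break point inside it.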
msg378718 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2020-10-16 16:05
textwrap does not actually apply the break-on-hyphen algorithm at all to long words. It just chops them up into width-sized pieces. The PR I just submitted looks for hyphens and uses them as cut points if they exist, without any attempt to understand their context.
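
The approach described in this message can be sketched roughly as follows. This is an illustration only, not the code from the PR: the helper name split_at_last_hyphen and its signature are hypothetical, and it assumes the caller already knows how many columns remain on the current line.

```python
def split_at_last_hyphen(word, space_left):
    """Split `word` so that the first part fits in `space_left` columns.

    Hypothetical helper for illustration: cut just after the last hyphen
    that still fits, and fall back to a hard cut at `space_left` when no
    usable hyphen exists.
    """
    if space_left < 1:
        space_left = 1  # always consume at least one character
    end = space_left
    if len(word) > space_left:
        # Search only the part of the word that fits on the line.
        hyphen = word.rfind("-", 0, space_left)
        # A hyphen at index 0 would leave a lone "-", so ignore it.
        if hyphen > 0:
            end = hyphen + 1
    return word[:end], word[end:]


term = "2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate"
# "We used the enyzme " occupies 19 of the 50 columns, leaving 31.
head, tail = split_at_last_hyphen(term, 50 - 19)
print(head)
print(tail)
```

With the report's input this yields the split shown in the expected result: the first line ends at "2,4-" and the rest of the name moves to the next line. The hyphen is used purely as a cut point, with no check of what surrounds it.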
msg378724 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2020-10-16 17:29
Actually I see what Serhiy meant about the hyphen algorithm - the regex breaking up words. Yes, this is applied to long words and the reason he stated for this issue is correct. It is probably possible to make that regex understand width and long-words, but it would be more complicated and will need to be recalculated for each width. I think long words are not the typical input, so it's better to handle them separately and keep the rest simple.
msg378878 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-10-18 17:01
New changeset b81c833ab51fb7d7f0f8eaace37f60ef7455aa85 by Irit Katriel in branch 'master': bpo-28660: Make TextWrapper break long words on hyphens (GH-22721) https://github.com/python/cpython/commit/b81c833ab51fb7d7f0f8eaace37f60ef7455aa85

History
Date	User	Action	Args
2022-04-11 14:58:39	admin	set	github: 72846
2020-10-18 17:14:00	serhiy.storchaka	set	status: open -> closed resolution: fixed stage: patch review -> resolved
2020-10-18 17:01:26	serhiy.storchaka	set	messages: + msg378878
2020-10-16 17:29:34	iritkatriel	set	messages: + msg378724
2020-10-16 16:05:14	iritkatriel	set	messages: + msg378718
2020-10-16 16:01:24	iritkatriel	set	keywords: + patch nosy: + iritkatriel pull_requests: + pull_request21688 stage: patch review
2017-10-23 13:49:01	serhiy.storchaka	set	messages: + msg304806
2016-11-10 17:07:13	serhiy.storchaka	set	nosy: + georg.brandl, serhiy.storchaka type: behavior components: + Library (Lib) versions: + Python 3.5, Python 3.6, Python 3.7
2016-11-10 16:35:01	maubp	create