classification
Title: TextWrapper break_long_words=True, break_on_hyphens=True on long words
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: georg.brandl, iritkatriel, maubp, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2016-11-10 16:35 by maubp, last changed 2020-10-18 17:14 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL | Status | Linked
PR 22721 | merged | iritkatriel, 2020-10-16 16:01
Messages (5)
msg280522 - Author: Peter (maubp) Date: 2016-11-10 16:35
Quoting https://docs.python.org/2/library/textwrap.html

width (default: 70) The maximum length of wrapped lines. As long as there are no individual words in the input text longer than width, TextWrapper guarantees that no output line will be longer than width characters.

It appears that with break_long_words=True and break_on_hyphens=True, any hyphenated term longer than the specified width does not get preferentially broken at a hyphen.

Example input:

We used the enyzme 2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate synthase.


Using break_long_words=True, break_on_hyphens=True
==================================================
We used the enyzme 2-succinyl-6-hydroxy-2,4-cycloh
exadiene-1-carboxylate synthase.
==================================================


Expected result using break_long_words=True, break_on_hyphens=True
==================================================
We used the enyzme 2-succinyl-6-hydroxy-2,4-
cyclohexadiene-1-carboxylate synthase.
==================================================


With width=50, the 53-character "word" "2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate" must be split somewhere, and since break_on_hyphens=True it should be split at a hyphen, as in the expected result above.


Sample code:


import textwrap
w = 50
text = "We used the enyzme 2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate synthase."
print("Input:")
print("=" * w)
print(text)
print("=" * w)
print("Using break_long_words=True, break_on_hyphens=True")
print("=" * w)
print(textwrap.fill(text, width=w, break_long_words=True, break_on_hyphens=True))
print("=" * w)
msg304806 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-10-23 13:49
This is because the current hyphen-breaking algorithm only allows a break between letters, which prevents dates and times from being broken. Perhaps it should be made more lenient when a word is too long.
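
To illustrate the letters-only rule described above, here is a small sketch using only the public textwrap.wrap function with its default settings. The sample sentences and the width of 24 are made up for this illustration; the point is that a hyphen between letters is accepted as a break point, while the hyphens inside a date are not.

```python
import textwrap

# A hyphen with letters on both sides is a permitted break point,
# so "well-known" can be split across two lines.
words = textwrap.wrap("This behavior is well-known", width=24)
print(words)   # ['This behavior is well-', 'known']

# Hyphens between digits are not break points, so the date is
# treated as one unbreakable word and moves to the next line whole.
dates = textwrap.wrap("The release date is 2016-11-10", width=24)
print(dates)   # ['The release date is', '2016-11-10']
```

The same rule is what stops the chemical name in the report from breaking after "2,4-": that hyphen follows a digit, not a letter.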
msg378718 - Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-10-16 16:05
textwrap does not actually apply the break-on-hyphen algorithm at all to long words. It just chops them up into width-sized pieces.

The PR I just submitted looks for hyphens and uses them as cut points if they exist, without any attempt to understand their context.
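
The approach described above might be sketched roughly as follows. This is not the code from the PR: split_long_word and its parameters are hypothetical names chosen for illustration, and the sketch only shows the idea of cutting after the last hyphen that still fits in the remaining space, falling back to a hard cut when there is none.

```python
def split_long_word(word, space_left, break_on_hyphens=True):
    """Hypothetical helper: return (head, tail) with len(head) <= space_left."""
    end = space_left  # default: hard cut at the available width
    if break_on_hyphens and len(word) > space_left:
        # Find the last hyphen that still fits on the current line.
        hyphen = word.rfind("-", 0, space_left)
        if hyphen > 0:
            end = hyphen + 1  # keep the hyphen at the end of the head
    return word[:end], word[end:]


word = "2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate"
# "We used the enyzme " occupies 19 of the 50 columns, leaving 31.
head, tail = split_long_word(word, 31)
print(head)  # 2-succinyl-6-hydroxy-2,4-
print(tail)  # cyclohexadiene-1-carboxylate
```

With the reporter's input this reproduces the split shown in the expected result: the first line ends at "2,4-" instead of in the middle of "cyclohexadiene".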
msg378724 - Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2020-10-16 17:29
Actually I see what Serhiy meant about the hyphen algorithm - the regex breaking up words. Yes, this is applied to long words and the reason he stated for this issue is correct.

It is probably possible to make that regex understand width and long words, but it would be more complicated and would need to be recalculated for each width. I think long words are not the typical input, so it's better to handle them separately and keep the rest simple.
msg378878 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-10-18 17:01
New changeset b81c833ab51fb7d7f0f8eaace37f60ef7455aa85 by Irit Katriel in branch 'master':
bpo-28660: Make TextWrapper break long words on hyphens (GH-22721)
https://github.com/python/cpython/commit/b81c833ab51fb7d7f0f8eaace37f60ef7455aa85
History
Date | User | Action | Args
2020-10-18 17:14:00 | serhiy.storchaka | set | status: open -> closed; resolution: fixed; stage: patch review -> resolved
2020-10-18 17:01:26 | serhiy.storchaka | set | messages: + msg378878
2020-10-16 17:29:34 | iritkatriel | set | messages: + msg378724
2020-10-16 16:05:14 | iritkatriel | set | messages: + msg378718
2020-10-16 16:01:24 | iritkatriel | set | keywords: + patch; nosy: + iritkatriel; pull_requests: + pull_request21688; stage: patch review
2017-10-23 13:49:01 | serhiy.storchaka | set | messages: + msg304806
2016-11-10 17:07:13 | serhiy.storchaka | set | nosy: + georg.brandl, serhiy.storchaka; type: behavior; components: + Library (Lib); versions: + Python 3.5, Python 3.6, Python 3.7
2016-11-10 16:35:01 | maubp | create |