classification
Title: Add a text truncation function
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: barry, dholth, ezio.melotti, georg.brandl, pitrou, python-dev, r.david.murray, serhiy.storchaka, steven.daprano, vajrasky
Priority: normal Keywords: patch

Created on 2013-07-29 12:49 by pitrou, last changed 2014-01-03 04:22 by dholth. This issue is now closed.

Files
File name Uploaded Description Edit
summarize.patch pitrou, 2013-07-29 12:49 review
summarize2.patch pitrou, 2013-07-30 10:05 review
shorten.patch pitrou, 2013-08-12 19:15 review
shorten2.patch pitrou, 2013-08-12 20:05 review
Messages (27)
msg193862 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-07-29 12:49
Following patch proposed to add a function named textwrap.summarize():

   >>> textwrap.summarize("Hello  world!", width=12)
   'Hello world!'
   >>> textwrap.summarize("Hello  world!", width=11)
   'Hello (...)'
msg193867 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-07-29 13:55
Perhaps the "placeholder" argument should actually include the last whitespace, to allow people to omit the whitespace, or use a non-breaking space instead?

   >>> textwrap.summarize("Hello  world!", width=11, placeholder='...')
   'Hello...'
msg193873 - (view) Author: Barry A. Warsaw (barry) * (Python committer) Date: 2013-07-29 15:20
On Jul 29, 2013, at 01:55 PM, Antoine Pitrou wrote:

>Perhaps the "placeholder" argument should actually include the last whitespace, to allow people to omit the whitespace, or use a non-breaking space instead?
>
>   >>> textwrap.summarize("Hello  world!", width=11, placeholder='...')
>   'Hello...'

I guess the placeholder default is ' (...)' then?
msg193874 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-07-29 15:34
> >Perhaps the "placeholder" argument should actually include the last
> >whitespace, to allow people to omit the whitespace, or use a
> >non-breaking space instead?
> >
> >   >>> textwrap.summarize("Hello  world!", width=11,
> >   >>> placeholder='...')
> >   'Hello...'
> 
> I guess the placeholder default is ' (...)' then?

Yeah.
(of course, another default could be used)
msg193879 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-07-29 16:09
Something is not right if we use more than one space.

>>> textwrap.summarize('hello      world!', width=12)
'hello world!'
>>> textwrap.summarize('hello      world!', width=11)
'hello (...)'
>>> textwrap.summarize('hello      world!', width=10)
'(...)'

I expect the last statement would give result: 'hello (...)' because 'hello' is just 5 characters, less than 10.
msg193880 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-07-29 16:33
Beside of that, I notice the new lines are deleted silently.

>>> textwrap.summarize('republicans are red,\ndemocrats are blue,\nneither one of them,\ncares about you.', width=46)
'republicans are red, democrats are blue, (...)'
msg193898 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-07-29 20:30
Vajrasky, thanks. The former is a bug, but the latter is a feature. summarize() re-uses the textwrap machinery to normalize spaces.
msg193920 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-07-30 09:34
Oops, sorry, I was mistaken. There is no bug actually here:

>>> textwrap.summarize('hello      world!', width=10)
'(...)'

'hello (...)' cannot be the right answer since its len() is 11, greater than 10.
msg193921 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-07-30 10:05
Updated patch to not add any space before the placeholder, with " (...)" as default placeholder value.
msg193975 - (view) Author: Vajrasky Kok (vajrasky) * Date: 2013-07-31 06:11
Monsieur Pitrou, thanks for the explanation. Actually, IMHO I prefer, 'hello (...)' should be the minimum words we can use not '(...)' because '(...)' does not make any sense. But, anyway, it's your call. :)

Anyway, using your summarize2.patch:

>>> textwrap.summarize('hello      world!', width=6)
'(...)'

>>> textwrap.summarize('hello      world!', width=5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ethan/Documents/code/python/cpython/Lib/textwrap.py", line 378, in summarize
    return w.summarize(text, placeholder=placeholder)
  File "/home/ethan/Documents/code/python/cpython/Lib/textwrap.py", line 314, in summarize
    raise ValueError("placeholder too large for max width")
ValueError: placeholder too large for max width

Why? '(...)' is 5 characters only. I checked the patch and found out that the placeholder is ' (...)' (with space) and you compare the width with the placeholder.
msg194187 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2013-08-02 17:12
A function like this often gets called to truncate lots of lines. Unfortunately for many use-cases, the part truncated is the most significant part of the line. E.g.:

Scanning file:
/home/fred/documents/datafil...
/home/fred/documents/datafil...
/home/fred/documents/datafil...
/home/fred/documents/datafil...


It's often better in cases like this to truncate in the middle:

Scanning file:
/home/fred/doc...datafile01.txt
/home/fred/doc...datafile02.txt
/home/fred/doc...datafile03.txt
/home/fred/doc...datafile04.txt

(I believe Mac OS-X routinely truncates long lines in the middle in this way.)

Perhaps there could be an argument controlling where to truncate (left, right or centre). A good use-case for the new Enums, perhaps? :-)
msg194188 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2013-08-02 17:19
Bike-shedding here... why "(...)"? Is it common to use round brackets for this purpose? In English-speaking countries, it is usual to use square brackets for editorial comments, including ellipsis "[...]".

Either way, if you wanted to be more Unicode aware, you could save two characters by using \N{HORIZONTAL ELLIPSIS} "(…)" as the default.
msg194189 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-02 17:40
> Bike-shedding here... why "(...)"? Is it common to use round brackets
> for this purpose? In English-speaking countries, it is usual to use
> square brackets for editorial comments, including ellipsis "[...]".

Ah, really? French uses "[...]" but I thought English-speaking people,
being different, used "(...)". So I'll change it.

> Either way, if you wanted to be more Unicode aware, you could save two
> characters by using \N{HORIZONTAL ELLIPSIS} "(…)" as the default.

I'd rather stay on the ASCII side of things here.
(also, the ellipsis character doesn't often look nice)
msg194483 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-08-05 14:41
[...] and ASCII are fine with me.

> Perhaps there could be an argument controlling where to truncate
> (left, right or centre). A good use-case for the new Enums, perhaps? :-)

I wrote a similar function once and in addition to the width it had this feature too (defaulting on "center"), so it sounds like a reasonable addition to me.  Back then I was simply passing a "left"/"right"/"center" string -- not sure it's worth adding an enum for this (FWIW for text alignment there are 3 separate methods: ljust, center, and rjust).
msg194484 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2013-08-05 14:48
Perhaps "shorten" would be a better name -- "summarize" sounds smarter than it actually is :)
msg194486 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-05 15:14
Looking just at the proposed functionality (taking a prefix) and ignoring the requested complexification :), the usual name for the text produced by this process is a "lead" (http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section), although formally a lead is actually written to be used as such, as opposed to just taking a prefix, so that word really has the same problem as 'summarize'.

I think 'truncate' would be a better name.  Or, if you don't mind being wordier, extract_prefix.  The fact that it is part of the textwrap module should be enough clue that the truncation happens at whitespace.  Truncate could also apply to the expanded version if you squint a little, if Antoine is interested in that.  On the other hand, the use case presented for that is not going to be served by this function anyway, since this function (being part of textwrap) breaks on whitespace...it shouldn't (IMO) elide text other than at whitespace.  If you want that functionality it belongs in some other module, I think.

The placeholder argument could alternatively be named 'ellipsis', but placeholder is certainly fine.

shorten would probably be better if you are going with the expanded version, but I like truncate.  It is probably significant that that is what the title of the issue calls it :)
msg194501 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-05 18:36
> Looking just at the proposed functionality (taking a prefix) and
> ignoring the requested complexification :), the usual name for the
> text produced by this process is a
> "lead" (http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section), although formally a lead is actually written to be used as such, as opposed to just taking a prefix, so that word really has the same problem as 'summarize'.

Good point.

> The placeholder argument could alternatively be named 'ellipsis', but
> placeholder is certainly fine.

I would certainly like ellipsis if it didn't already mean something else
in Python.

> shorten would probably be better if you are going with the expanded
> version, but I like truncate.  It is probably significant that that is
> what the title of the issue calls it :)

I'm a bit negative towards truncate(), mostly because I've worked on the
I/O stack and truncate means something much less careful there (e.g.
StringIO.truncate()).
But really, I'm fine with either shorten() or truncate(). I agree
summarize() may try to look a bit too smart.
msg194987 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-12 19:15
Updated patch renaming summarize() to shorten(), and adding docs and a fix for a nit reported by Vajrasky.
msg195005 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-12 20:05
Updated patch addressing Ezio's comments.
msg195012 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-12 20:39
New changeset c27ec198d3d1 by Antoine Pitrou in branch 'default':
Issue #18585: Add :func:`textwrap.shorten` to collapse and truncate a piece of text to a given length.
http://hg.python.org/cpython/rev/c27ec198d3d1
msg195013 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-12 20:39
Ok, I've committed after having addressed (most of) RDM's comments.
msg195020 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-12 21:22
What about a multiline summarize? The textwrap module is designed to work with multiline text.

Let we want wrap 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.' in 40 column and shorten it to three lines:

Lorem ipsum dolor sit amet, consectetur
adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore (...)

For this we need to add two arguments for TextWrapper: max_lines and placeholder. Then shorten() will be just fill() with max_lines=1.
msg195412 - (view) Author: Roundup Robot (python-dev) Date: 2013-08-16 20:31
New changeset be5481bf4c57 by Antoine Pitrou in branch 'default':
Fix the default placeholder in textwrap.shorten() to be " [...]".
http://hg.python.org/cpython/rev/be5481bf4c57
msg195413 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-16 20:31
(Ezio noticed that I had left the placeholder as " (...)". This is now fixed.)
msg200050 - (view) Author: Roundup Robot (python-dev) Date: 2013-10-16 10:08
New changeset 0bd257cd3e88 by Serhiy Storchaka in branch 'default':
Add shorten to __all_ (issues #18585 and #18725).
http://hg.python.org/cpython/rev/0bd257cd3e88
msg207203 - (view) Author: Roundup Robot (python-dev) Date: 2014-01-03 04:20
New changeset 536a2cf5f1d2 by Daniel Holth in branch 'default':
Issue #18585: speed zipfile import by only generating zipfile._ZipDecryptor on demand
http://hg.python.org/cpython/rev/536a2cf5f1d2
msg207204 - (view) Author: Daniel Holth (dholth) (Python committer) Date: 2014-01-03 04:22
Previous changeset was meant for #18515
History
Date User Action Args
2014-01-03 04:22:26dholthsetnosy: + dholth
messages: + msg207204
2014-01-03 04:20:48python-devsetmessages: + msg207203
2013-10-16 10:08:46python-devsetmessages: + msg200050
2013-08-16 20:31:54pitrousetmessages: + msg195413
2013-08-16 20:31:21python-devsetmessages: + msg195412
2013-08-12 21:22:10serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg195020
2013-08-12 20:40:22pitrousetstatus: open -> closed
2013-08-12 20:39:48pitrousetresolution: fixed
messages: + msg195013
stage: patch review -> resolved
2013-08-12 20:39:19python-devsetnosy: + python-dev
messages: + msg195012
2013-08-12 20:05:22pitrousetfiles: + shorten2.patch

messages: + msg195005
2013-08-12 19:15:30pitrousetfiles: + shorten.patch

messages: + msg194987
2013-08-05 18:36:11pitrousetmessages: + msg194501
2013-08-05 15:14:26r.david.murraysetnosy: + r.david.murray
messages: + msg194486
2013-08-05 14:48:10ezio.melottisetmessages: + msg194484
2013-08-05 14:41:28ezio.melottisetmessages: + msg194483
2013-08-02 17:40:43pitrousetmessages: + msg194189
2013-08-02 17:19:04steven.dapranosetmessages: + msg194188
2013-08-02 17:12:04steven.dapranosetnosy: + steven.daprano
messages: + msg194187
2013-07-31 06:11:41vajraskysetmessages: + msg193975
2013-07-30 10:05:44pitrousetfiles: + summarize2.patch

messages: + msg193921
2013-07-30 09:34:29pitrousetmessages: + msg193920
2013-07-29 20:30:21pitrousetmessages: + msg193898
2013-07-29 16:33:24vajraskysetmessages: + msg193880
2013-07-29 16:09:02vajraskysetnosy: + vajrasky
messages: + msg193879
2013-07-29 15:36:40ezio.melottisetnosy: + ezio.melotti
2013-07-29 15:34:37pitrousetmessages: + msg193874
2013-07-29 15:20:38barrysetmessages: + msg193873
2013-07-29 13:55:30pitrousetmessages: + msg193867
2013-07-29 12:49:10pitroucreate