Add a text truncation function #62785
Comments
The following patch proposes adding a function named textwrap.summarize():
>>> textwrap.summarize("Hello world!", width=12)
'Hello world!'
>>> textwrap.summarize("Hello world!", width=11)
'Hello (...)'

Perhaps the "placeholder" argument should actually include the last whitespace, to allow people to omit the whitespace, or use a non-breaking space instead?
>>> textwrap.summarize("Hello world!", width=11, placeholder='...')
'Hello...'

On Jul 29, 2013, at 01:55 PM, Antoine Pitrou wrote:
I guess the placeholder default is ' (...)' then?
Yeah.

Something is not right if we use more than one space.
>>> textwrap.summarize('hello world!', width=12)
'hello world!'
>>> textwrap.summarize('hello world!', width=11)
'hello (...)'
>>> textwrap.summarize('hello world!', width=10)
'(...)'
I expected the last statement to give the result 'hello (...)', because 'hello' is just 5 characters, fewer than 10.

Besides that, I notice that newlines are deleted silently.
>>> textwrap.summarize('republicans are red,\ndemocrats are blue,\nneither one of them,\ncares about you.', width=46)
'republicans are red, democrats are blue, (...)'

Vajrasky, thanks. The former is a bug, but the latter is a feature. summarize() re-uses the textwrap machinery to normalize spaces.

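A minimal sketch of the whitespace normalization described above, using textwrap.shorten() (the name the function is given later in this thread). It assumes the released function behaves like the patch in this respect, and it passes the ' (...)' placeholder explicitly to match the examples here rather than relying on any default.

```python
import textwrap

poem = ("republicans are red,\ndemocrats are blue,\n"
        "neither one of them,\ncares about you.")

# Collapse every run of whitespace (newlines included) into one space.
# This is an illustration of the normalization, not the library's own code.
normalized = " ".join(poem.split())

# Placeholder passed explicitly to mirror the patch under discussion.
short = textwrap.shorten(poem, width=46, placeholder=" (...)")
print(repr(short))
```

Because the input is normalized first, shortening `poem` and shortening `normalized` should give the same result.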
Oops, sorry, I was mistaken. There is actually no bug here:
>>> textwrap.summarize('hello world!', width=10)
'(...)'
'hello (...)' cannot be the right answer since its len() is 11, greater than 10.

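The width budget can be checked directly. This sketch uses textwrap.shorten() (the eventual name of the function) with the ' (...)' placeholder passed explicitly, and asserts that the result never exceeds the requested width. The chosen widths are the ones from the comments above.

```python
import textwrap

text = "hello world!"
placeholder = " (...)"

results = {}
for width in (12, 11, 10):
    results[width] = textwrap.shorten(text, width=width, placeholder=placeholder)
    # The result, placeholder included, must fit in the requested width.
    assert len(results[width]) <= width
    print(width, repr(results[width]))
```

At width 10 the word 'hello' plus the six-character placeholder would need 11 characters, so only the placeholder (with its leading space stripped) remains.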
Updated patch to not add any space before the placeholder, with " (...)" as default placeholder value.

Monsieur Pitrou, thanks for the explanation. IMHO, 'hello (...)' should be the minimum output rather than '(...)', because '(...)' on its own does not make any sense. But anyway, it's your call. :)
Anyway, using your summarize2.patch:
>>> textwrap.summarize('hello world!', width=6)
'(...)'
>>> textwrap.summarize('hello world!', width=5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ethan/Documents/code/python/cpython/Lib/textwrap.py", line 378, in summarize
    return w.summarize(text, placeholder=placeholder)
  File "/home/ethan/Documents/code/python/cpython/Lib/textwrap.py", line 314, in summarize
    raise ValueError("placeholder too large for max width")
ValueError: placeholder too large for max width
Why? '(...)' is only 5 characters. I checked the patch and found that the placeholder is ' (...)' (with a leading space), and the width is compared against that full placeholder, space included.

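Callers that cannot guarantee a sensible width may want to guard against this error. The sketch below wraps textwrap.shorten() (the function's eventual name) in a hypothetical helper, `try_shorten`, which is not part of textwrap. Exactly which width triggers the error depends on the version of the code, so the example deliberately uses a width well below the placeholder length.

```python
import textwrap

def try_shorten(text, width, placeholder=" (...)"):
    """Hypothetical helper: return the shortened text, or None when the
    width is too small to hold even the placeholder."""
    try:
        return textwrap.shorten(text, width=width, placeholder=placeholder)
    except ValueError:
        # Raised when the placeholder cannot fit in the requested width.
        return None

print(try_shorten("hello world!", 6))
print(try_shorten("hello world!", 3))
```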
A function like this often gets called to truncate lots of lines. Unfortunately, for many use cases the part truncated is the most significant part of the line. E.g.:
Scanning file:
It's often better in cases like this to truncate in the middle:
Scanning file:
(I believe Mac OS X routinely truncates long lines in the middle in this way.)
Perhaps there could be an argument controlling where to truncate (left, right or centre). A good use case for the new Enums, perhaps? :-)

Bike-shedding here... why "(...)"? Is it common to use round brackets for this purpose? In English-speaking countries, it is usual to use square brackets for editorial comments, including ellipsis "[...]".
Either way, if you wanted to be more Unicode aware, you could save two characters by using \N{HORIZONTAL ELLIPSIS} "(…)" as the default.

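The two-character saving can be seen with a custom placeholder. This sketch uses textwrap.shorten() (the function's eventual name) and passes both placeholders explicitly. The width of 9 is chosen only for illustration.

```python
import textwrap

ELLIPSIS = "\N{HORIZONTAL ELLIPSIS}"  # a single code point, U+2026

# ASCII placeholder: ' (...)' is 6 characters, so 'Hello' no longer fits in 9.
ascii_result = textwrap.shorten("Hello world!", width=9, placeholder=" (...)")

# Unicode placeholder: ' (…)' is 4 characters, leaving room for 'Hello'.
unicode_result = textwrap.shorten("Hello world!", width=9,
                                  placeholder=f" ({ELLIPSIS})")

print(repr(ascii_result), repr(unicode_result))
```

Note that len() counts code points, not display columns, so this holds only where the ellipsis renders one cell wide.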
Ah, really? French uses "[...]" but I thought English-speaking people,
I'd rather stay on the ASCII side of things here.

[...] and ASCII are fine with me.
I wrote a similar function once and in addition to the width it had this feature too (defaulting on "center"), so it sounds like a reasonable addition to me. Back then I was simply passing a "left"/"right"/"center" string -- not sure it's worth adding an enum for this (FWIW for text alignment there are 3 separate methods: ljust, center, and rjust).

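A rough sketch of what such a string-controlled truncation might look like. The function `truncate`, its `where` parameter, and the default "..." placeholder are all hypothetical and not part of textwrap. Unlike the patch discussed here, this cuts at arbitrary character positions rather than at whitespace.

```python
def truncate(text, width, where="right", placeholder="..."):
    """Hypothetical helper: cut text to at most `width` characters,
    eliding on the left, on the right, or in the center."""
    if where not in ("left", "right", "center"):
        raise ValueError(f"unknown truncation side: {where!r}")
    if len(text) <= width:
        return text
    keep = width - len(placeholder)  # characters of the original to keep
    if keep < 0:
        raise ValueError("placeholder too large for max width")
    if where == "right":
        return text[:keep] + placeholder
    if where == "left":
        return placeholder + text[len(text) - keep:]
    # center: split the kept characters between head and tail,
    # giving the head the extra one when `keep` is odd
    head = (keep + 1) // 2
    tail = keep // 2
    return text[:head] + placeholder + text[len(text) - tail:]

for side in ("left", "right", "center"):
    print(side, repr(truncate("abcdefghij", 7, where=side)))
```

A plain string keeps the call site short. An enum would mainly add validation, which the explicit check on `where` already covers in this sketch.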
Perhaps "shorten" would be a better name -- "summarize" sounds smarter than it actually is :)

Looking just at the proposed functionality (taking a prefix) and ignoring the requested complexification :), the usual name for the text produced by this process is a "lead" (http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section), although formally a lead is actually written to be used as such, as opposed to just taking a prefix, so that word really has the same problem as 'summarize'.
I think 'truncate' would be a better name. Or, if you don't mind being wordier, extract_prefix. The fact that it is part of the textwrap module should be enough clue that the truncation happens at whitespace.
Truncate could also apply to the expanded version if you squint a little, if Antoine is interested in that. On the other hand, the use case presented for that is not going to be served by this function anyway, since this function (being part of textwrap) breaks on whitespace...it shouldn't (IMO) elide text other than at whitespace. If you want that functionality it belongs in some other module, I think.
The placeholder argument could alternatively be named 'ellipsis', but placeholder is certainly fine. shorten would probably be better if you are going with the expanded version, but I like truncate. It is probably significant that that is what the title of the issue calls it :)

Good point.
I would certainly like ellipsis if it didn't already mean something else
I'm a bit negative towards truncate(), mostly because I've worked on the

Updated patch renaming summarize() to shorten(), and adding docs and a fix for a nit reported by Vajrasky.

Updated patch addressing Ezio's comments.

New changeset c27ec198d3d1 by Antoine Pitrou in branch 'default':

Ok, I've committed after having addressed (most of) RDM's comments.

What about a multiline summarize? The textwrap module is designed to work with multiline text. Say we want to wrap 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.' to 40 columns and shorten it to three lines:
Lorem ipsum dolor sit amet, consectetur
For this we need to add two arguments to TextWrapper: max_lines and placeholder. Then shorten() will just be fill() with max_lines=1.

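Current versions of textwrap do accept max_lines and placeholder on TextWrapper, so the proposal can be sketched as follows. The ' [...]' placeholder is passed explicitly instead of relying on a default, and the 40-column, three-line limits come from the comment above.

```python
import textwrap

text = ("Lorem ipsum dolor sit amet, consectetur adipisicing elit, "
        "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.")

# Wrap to 40 columns, keep at most three lines, and mark the cut.
wrapper = textwrap.TextWrapper(width=40, max_lines=3, placeholder=" [...]")
lines = wrapper.wrap(text)
for line in lines:
    print(line)

# The single-line case: fill() limited to one line.
one_line = textwrap.fill(text, width=40, max_lines=1, placeholder=" [...]")
print(one_line)
```

For text that is already whitespace-normalized, as here, the one-line fill() result should match what shorten() returns with the same width and placeholder.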
New changeset be5481bf4c57 by Antoine Pitrou in branch 'default':

(Ezio noticed that I had left the placeholder as " (...)". This is now fixed.)

New changeset 0bd257cd3e88 by Serhiy Storchaka in branch 'default':

New changeset 536a2cf5f1d2 by Daniel Holth in branch 'default':

Previous changeset was meant for bpo-18515