Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a text truncation function #62785

Closed
pitrou opened this issue Jul 29, 2013 · 27 comments
Closed

Add a text truncation function #62785

pitrou opened this issue Jul 29, 2013 · 27 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@pitrou
Copy link
Member

pitrou commented Jul 29, 2013

BPO 18585
Nosy @warsaw, @birkenfeld, @pitrou, @ezio-melotti, @stevendaprano, @bitdancer, @dholth, @serhiy-storchaka, @vajrasky
Files
  • summarize.patch
  • summarize2.patch
  • shorten.patch
  • shorten2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2013-08-12.20:40:22.114>
    created_at = <Date 2013-07-29.12:49:10.600>
    labels = ['type-feature', 'library']
    title = 'Add a text truncation function'
    updated_at = <Date 2014-01-03.04:22:26.775>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2014-01-03.04:22:26.775>
    actor = 'dholth'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-08-12.20:40:22.114>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2013-07-29.12:49:10.600>
    creator = 'pitrou'
    dependencies = []
    files = ['31072', '31084', '31250', '31257']
    hgrepos = []
    issue_num = 18585
    keywords = ['patch']
    message_count = 27.0
    messages = ['193862', '193867', '193873', '193874', '193879', '193880', '193898', '193920', '193921', '193975', '194187', '194188', '194189', '194483', '194484', '194486', '194501', '194987', '195005', '195012', '195013', '195020', '195412', '195413', '200050', '207203', '207204']
    nosy_count = 10.0
    nosy_names = ['barry', 'georg.brandl', 'pitrou', 'ezio.melotti', 'steven.daprano', 'r.david.murray', 'dholth', 'python-dev', 'serhiy.storchaka', 'vajrasky']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18585'
    versions = ['Python 3.4']

    @pitrou
    Copy link
    Member Author

    pitrou commented Jul 29, 2013

    Following patch proposed to add a function named textwrap.summarize():

       >>> textwrap.summarize("Hello  world!", width=12)
       'Hello world!'
       >>> textwrap.summarize("Hello  world!", width=11)
       'Hello (...)'

    @pitrou pitrou added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Jul 29, 2013
    @pitrou
    Copy link
    Member Author

    pitrou commented Jul 29, 2013

    Perhaps the "placeholder" argument should actually include the last whitespace, to allow people to omit the whitespace, or use a non-breaking space instead?

       >>> textwrap.summarize("Hello  world!", width=11, placeholder='...')
       'Hello...'

    @warsaw
    Copy link
    Member

    warsaw commented Jul 29, 2013

    On Jul 29, 2013, at 01:55 PM, Antoine Pitrou wrote:

    Perhaps the "placeholder" argument should actually include the last whitespace, to allow people to omit the whitespace, or use a non-breaking space instead?

    >>> textwrap.summarize("Hello world!", width=11, placeholder='...')
    'Hello...'

    I guess the placeholder default is ' (...)' then?

    @pitrou
    Copy link
    Member Author

    pitrou commented Jul 29, 2013

    >Perhaps the "placeholder" argument should actually include the last
    >whitespace, to allow people to omit the whitespace, or use a
    >non-breaking space instead?
    >
    > >>> textwrap.summarize("Hello world!", width=11,
    > >>> placeholder='...')
    > 'Hello...'

    I guess the placeholder default is ' (...)' then?

    Yeah.
    (of course, another default could be used)

    @vajrasky
    Copy link
    Mannequin

    vajrasky mannequin commented Jul 29, 2013

    Something is not right if we use more than one space.

    >>> textwrap.summarize('hello      world!', width=12)
    'hello world!'
    >>> textwrap.summarize('hello      world!', width=11)
    'hello (...)'
    >>> textwrap.summarize('hello      world!', width=10)
    '(...)'

    I expect the last statement would give result: 'hello (...)' because 'hello' is just 5 characters, less than 10.

    @vajrasky
    Copy link
    Mannequin

    vajrasky mannequin commented Jul 29, 2013

    Beside of that, I notice the new lines are deleted silently.

    >>> textwrap.summarize('republicans are red,\ndemocrats are blue,\nneither one of them,\ncares about you.', width=46)
    'republicans are red, democrats are blue, (...)'

    @pitrou
    Copy link
    Member Author

    pitrou commented Jul 29, 2013

    Vajrasky, thanks. The former is a bug, but the latter is a feature. summarize() re-uses the textwrap machinery to normalize spaces.

    @pitrou
    Copy link
    Member Author

    pitrou commented Jul 30, 2013

    Oops, sorry, I was mistaken. There is no bug actually here:

    >>> textwrap.summarize('hello      world!', width=10)
    '(...)'

    'hello (...)' cannot be the right answer since its len() is 11, greater than 10.

    @pitrou
    Copy link
    Member Author

    pitrou commented Jul 30, 2013

    Updated patch to not add any space before the placeholder, with " (...)" as default placeholder value.

    @vajrasky
    Copy link
    Mannequin

    vajrasky mannequin commented Jul 31, 2013

    Monsieur Pitrou, thanks for the explanation. Actually, IMHO I prefer, 'hello (...)' should be the minimum words we can use not '(...)' because '(...)' does not make any sense. But, anyway, it's your call. :)

    Anyway, using your summarize2.patch:

    >>> textwrap.summarize('hello      world!', width=6)
    '(...)'
    
    >>> textwrap.summarize('hello      world!', width=5)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/ethan/Documents/code/python/cpython/Lib/textwrap.py", line 378, in summarize
        return w.summarize(text, placeholder=placeholder)
      File "/home/ethan/Documents/code/python/cpython/Lib/textwrap.py", line 314, in summarize
        raise ValueError("placeholder too large for max width")
    ValueError: placeholder too large for max width

    Why? '(...)' is 5 characters only. I checked the patch and found out that the placeholder is ' (...)' (with space) and you compare the width with the placeholder.

    @stevendaprano
    Copy link
    Member

    A function like this often gets called to truncate lots of lines. Unfortunately for many use-cases, the part truncated is the most significant part of the line. E.g.:

    Scanning file:
    /home/fred/documents/datafil...
    /home/fred/documents/datafil...
    /home/fred/documents/datafil...
    /home/fred/documents/datafil...

    It's often better in cases like this to truncate in the middle:

    Scanning file:
    /home/fred/doc...datafile01.txt
    /home/fred/doc...datafile02.txt
    /home/fred/doc...datafile03.txt
    /home/fred/doc...datafile04.txt

    (I believe Mac OS-X routinely truncates long lines in the middle in this way.)

    Perhaps there could be an argument controlling where to truncate (left, right or centre). A good use-case for the new Enums, perhaps? :-)

    @stevendaprano
    Copy link
    Member

    Bike-shedding here... why "(...)"? Is it common to use round brackets for this purpose? In English-speaking countries, it is usual to use square brackets for editorial comments, including ellipsis "[...]".

    Either way, if you wanted to be more Unicode aware, you could save two characters by using \N{HORIZONTAL ELLIPSIS} "(…)" as the default.

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 2, 2013

    Bike-shedding here... why "(...)"? Is it common to use round brackets
    for this purpose? In English-speaking countries, it is usual to use
    square brackets for editorial comments, including ellipsis "[...]".

    Ah, really? French uses "[...]" but I thought English-speaking people,
    being different, used "(...)". So I'll change it.

    Either way, if you wanted to be more Unicode aware, you could save two
    characters by using \N{HORIZONTAL ELLIPSIS} "(…)" as the default.

    I'd rather stay on the ASCII side of things here.
    (also, the ellipsis character doesn't often look nice)

    @ezio-melotti
    Copy link
    Member

    [...] and ASCII are fine with me.

    Perhaps there could be an argument controlling where to truncate
    (left, right or centre). A good use-case for the new Enums, perhaps? :-)

    I wrote a similar function once and in addition to the width it had this feature too (defaulting on "center"), so it sounds like a reasonable addition to me. Back then I was simply passing a "left"/"right"/"center" string -- not sure it's worth adding an enum for this (FWIW for text alignment there are 3 separate methods: ljust, center, and rjust).

    @ezio-melotti
    Copy link
    Member

    Perhaps "shorten" would be a better name -- "summarize" sounds smarter than it actually is :)

    @bitdancer
    Copy link
    Member

    Looking just at the proposed functionality (taking a prefix) and ignoring the requested complexification :), the usual name for the text produced by this process is a "lead" (http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section), although formally a lead is actually written to be used as such, as opposed to just taking a prefix, so that word really has the same problem as 'summarize'.

    I think 'truncate' would be a better name. Or, if you don't mind being wordier, extract_prefix. The fact that it is part of the textwrap module should be enough clue that the truncation happens at whitespace. Truncate could also apply to the expanded version if you squint a little, if Antoine is interested in that. On the other hand, the use case presented for that is not going to be served by this function anyway, since this function (being part of textwrap) breaks on whitespace...it shouldn't (IMO) elide text other than at whitespace. If you want that functionality it belongs in some other module, I think.

    The placeholder argument could alternatively be named 'ellipsis', but placeholder is certainly fine.

    shorten would probably be better if you are going with the expanded version, but I like truncate. It is probably significant that that is what the title of the issue calls it :)

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 5, 2013

    Looking just at the proposed functionality (taking a prefix) and
    ignoring the requested complexification :), the usual name for the
    text produced by this process is a
    "lead" (http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section), although formally a lead is actually written to be used as such, as opposed to just taking a prefix, so that word really has the same problem as 'summarize'.

    Good point.

    The placeholder argument could alternatively be named 'ellipsis', but
    placeholder is certainly fine.

    I would certainly like ellipsis if it didn't already mean something else
    in Python.

    shorten would probably be better if you are going with the expanded
    version, but I like truncate. It is probably significant that that is
    what the title of the issue calls it :)

    I'm a bit negative towards truncate(), mostly because I've worked on the
    I/O stack and truncate means something much less careful there (e.g.
    StringIO.truncate()).
    But really, I'm fine with either shorten() or truncate(). I agree
    summarize() may try to look a bit too smart.

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 12, 2013

    Updated patch renaming summarize() to shorten(), and adding docs and a fix for a nit reported by Vajrasky.

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 12, 2013

    Updated patch addressing Ezio's comments.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 12, 2013

    New changeset c27ec198d3d1 by Antoine Pitrou in branch 'default':
    Issue bpo-18585: Add :func:`textwrap.shorten` to collapse and truncate a piece of text to a given length.
    http://hg.python.org/cpython/rev/c27ec198d3d1

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 12, 2013

    Ok, I've committed after having addressed (most of) RDM's comments.

    @pitrou pitrou closed this as completed Aug 12, 2013
    @serhiy-storchaka
    Copy link
    Member

    What about a multiline summarize? The textwrap module is designed to work with multiline text.

    Let we want wrap 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.' in 40 column and shorten it to three lines:

    Lorem ipsum dolor sit amet, consectetur
    adipisicing elit, sed do eiusmod tempor
    incididunt ut labore et dolore (...)

    For this we need to add two arguments for TextWrapper: max_lines and placeholder. Then shorten() will be just fill() with max_lines=1.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Aug 16, 2013

    New changeset be5481bf4c57 by Antoine Pitrou in branch 'default':
    Fix the default placeholder in textwrap.shorten() to be " [...]".
    http://hg.python.org/cpython/rev/be5481bf4c57

    @pitrou
    Copy link
    Member Author

    pitrou commented Aug 16, 2013

    (Ezio noticed that I had left the placeholder as " (...)". This is now fixed.)

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 16, 2013

    New changeset 0bd257cd3e88 by Serhiy Storchaka in branch 'default':
    Add shorten to __all_ (issues bpo-18585 and bpo-18725).
    http://hg.python.org/cpython/rev/0bd257cd3e88

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Jan 3, 2014

    New changeset 536a2cf5f1d2 by Daniel Holth in branch 'default':
    Issue bpo-18585: speed zipfile import by only generating zipfile._ZipDecryptor on demand
    http://hg.python.org/cpython/rev/536a2cf5f1d2

    @dholth
    Copy link
    Mannequin

    dholth mannequin commented Jan 3, 2014

    Previous changeset was meant for bpo-18515

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants