Fix pydoc crashing on unicode strings #41169

Closed
cben mannequin opened this issue Nov 14, 2004 · 31 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@cben
Mannequin

cben mannequin commented Nov 14, 2004

BPO 1065986
Nosy @loewis, @rhettinger, @cben, @vstinner, @benjaminp, @merwok, @akitada, @bitdancer, @taschini
Files
  • PYDOC-UNICODE.diff: Patch to Lib/pydoc.py to use unicode when needed
  • issue1065986.patch
  • issue1065986-2.patch
  • issue1065986-3.patch
  • issue1065986-4.patch
  • issue1065986-5.patch
  • issue1065986-6.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-01-05.20:42:28.609>
    created_at = <Date 2004-11-14.01:36:55.000>
    labels = ['type-bug', 'library']
    title = 'Fix pydoc crashing on unicode strings'
    updated_at = <Date 2014-01-05.22:14:27.082>
    user = 'https://github.com/cben'

    bugs.python.org fields:

    activity = <Date 2014-01-05.22:14:27.082>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-01-05.20:42:28.609>
    closer = 'r.david.murray'
    components = ['Library (Lib)']
    creation = <Date 2004-11-14.01:36:55.000>
    creator = 'cben'
    dependencies = []
    files = ['6363', '25380', '31793', '31832', '32721', '32738', '33316']
    hgrepos = []
    issue_num = 1065986
    keywords = ['patch']
    message_count = 31.0
    messages = ['47289', '47290', '47291', '47292', '158138', '158141', '158142', '159140', '159141', '159160', '159167', '159452', '169209', '175304', '175305', '175313', '175329', '197886', '197992', '198186', '198201', '203482', '203486', '203487', '203530', '203548', '207342', '207364', '207399', '207400', '207403']
    nosy_count = 13.0
    nosy_names = ['loewis', 'ping', 'rhettinger', 'cben', 'vstinner', 'benjamin.peterson', 'eric.araujo', 'akitada', 'r.david.murray', 'mu_mind', 'python-dev', 'ness', 'taschini']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1065986'
    versions = ['Python 2.7']

    The pydoc module currently only outputs ASCII and
    crashes with UnicodeEncodeError when printing a unicode
    string (in contexts where it prints the str rather than
    a repr, e.g. docstrings or variables like
    __credits__). The most ironic example of it is that
    since patch 1009389 was committed, pydoc.py pydoc
    crashes on its own __credits__!

    This patch changes pydoc help functions to return
    unicode strings only when needed; it returns ASCII
    strings if all characters are from ASCII. Therefore
    there should be no compatibility problems.

    For output, all pager functions were changed to encode
    to the locale's preferred encoding and HTML output was
    changed to always use UTF-8.

    cgitb.py, DocXMLRPCServer.py and/or
    SimpleXMLRPCServer.py seems to rely on pydoc to some
    degree. I didn't touch them, so they might still be
    broken in this respect.
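
A rough sketch of the "unicode only when needed" idea described above, written so that it also runs on Python 3 (where Python 2's `str` corresponds to `bytes`). The helper name `narrow_if_ascii` is made up for illustration and is not taken from the patch.

```python
def narrow_if_ascii(text):
    """Return an ASCII byte string when every character is ASCII,
    otherwise return the text object unchanged."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        # At least one non-ASCII character: keep the unicode object.
        return text


plain = narrow_if_ascii(u"Ka-Ping Yee")
fancy = narrow_if_ascii(u"Michele Orr\xf9")
```

Callers that only ever pass ASCII text keep getting byte strings, which is why the description expects no compatibility problems.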

    @cben cben mannequin closed this as completed Nov 14, 2004
    @cben cben mannequin added the stdlib Python modules in the Lib dir label Nov 14, 2004
    @loewis
    Mannequin

    loewis mannequin commented Nov 14, 2004

    This is too major a change so shortly before the 2.4 release,
    so I am postponing it to 2.5.

    @ping
    Mannequin

    ping mannequin commented Nov 17, 2004

    I'm so sorry this has caused so much trouble.
    The silly moose comment is my fault; it can be removed.

    @rhettinger
    Contributor

    I believe this was fixed. Feel free to re-open if something
    is unresolved.

    @ness
    Mannequin

    ness mannequin commented Apr 12, 2012

    Hello,

    [this is my first bug report, so I'm sorry if I'm not adhering to some conventions]

    in what versions of python is this supposed to be fixed? Consider:

    % python
    Python 2.7.2+ (default, Nov 30 2011, 19:22:03) 
    [GCC 4.6.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from pydoc import pager
    >>> from locale import getpreferredencoding
    >>> expr = u'\u211a'
    >>> pager(expr) # error
    >>> pager(expr.encode(getpreferredencoding())) # works

    The error is:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/pydoc.py", line 1318, in pager
        pager(text)
      File "/usr/lib/python2.7/pydoc.py", line 1332, in <lambda>
        return lambda text: pipepager(text, os.environ['PAGER'])
      File "/usr/lib/python2.7/pydoc.py", line 1359, in pipepager
        pipe.write(text)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u211a' in position 0: ordinal not in range(128)
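
For reference, the same failure can be reproduced without a pager. In Python 2, writing a unicode object to a byte-oriented pipe implicitly encodes it with the ASCII codec. The sketch below makes that encode explicit (an assumption about the mechanism, though it matches the error message above) and inspects the resulting exception.

```python
text = u"\u211a"  # DOUBLE-STRUCK CAPITAL Q, the character from the report

error = None
try:
    # Equivalent to what an implicit ASCII conversion attempts.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

print(error.encoding, error.start, error.reason)

# An encoding that can represent the character succeeds.
encoded = text.encode("utf-8")
```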

    Best,
    Tom

    @bitdancer
    Member

    It is fixed in Python3. Apparently Raymond was wrong about it having been fixed earlier (or perhaps he was referring to the unicode being removed from the pydoc __credits__ string).

    @ness
    Mannequin

    ness mannequin commented Apr 12, 2012

    I see. Thank you.


    @taschini
    Mannequin

    taschini mannequin commented Apr 24, 2012

    Shouldn't this be reopened for Python 2.7?

    @taschini taschini mannequin added type-bug An unexpected behavior, bug, or error labels Apr 24, 2012
    @bitdancer
    Member

    I don't think so. We aren't promising unicode support in pydoc in 2.x, and it is too late to add it.

    @taschini
    Mannequin

    taschini mannequin commented Apr 24, 2012

    Oh well, in that case I guess we'll have to work around it.

    Here's the monkey patch I use to overcome this limitation in pydoc, in case others wish to add it to their PYTHONSTARTUP or sitecustomize:

    import os  # needed by pipepager below; must stay importable at call time
    import pydoc

    def pipepager(text, cmd):
        """Page through text by feeding it to another program."""
        try:
            import locale
        except ImportError:
            encoding = "ascii"
        else:
            encoding = locale.getpreferredencoding()
        # Encode unicode text for the pager; byte strings pass through unchanged.
        if isinstance(text, unicode):
            text = text.encode(encoding, 'xmlcharrefreplace')
        pipe = os.popen(cmd, 'w')
        try:
            pipe.write(text)
            pipe.close()
        except IOError:
            pass # Ignore broken pipes caused by quitting the pager program.

    pydoc.pipepager = pipepager
    del pydoc, pipepager
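
To show what the `'xmlcharrefreplace'` error handler used in the work-around does, here is a small comparison with the `'replace'` handler. The sample string is made up, and ASCII is chosen as the target only to force the fallback.

```python
text = u"\u211a is the set of rationals"

# Unencodable characters become numeric XML character references.
as_refs = text.encode("ascii", "xmlcharrefreplace")

# The 'replace' handler substitutes a question mark and loses the code point.
as_marks = text.encode("ascii", "replace")

print(as_refs)
print(as_marks)
```

The character reference is noisier in a terminal, but unlike `?` it still tells the reader which character was there.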

    @bitdancer
    Member

    Hmm. Making it not raise an error while still producing useful output would be acceptable as a bug fix if that's all it takes, I think.

    @bitdancer bitdancer reopened this Apr 24, 2012
    @taschini
    Mannequin

    taschini mannequin commented Apr 27, 2012

    Here's my patch, along the lines of the work-around I posted earlier. A few remarks:

    1. The modifications in pydoc only touch the four console pagers and the html pager (html.page).

    2. A module-wide default encoding is initialized from locale.getpreferredencoding. Pagers that write to a file use the encoding of that file if defined, else they use the module-wide default. The html pager uses ascii. All of them use xml character entity replacement as fall-back.

    3. An additional set of tests has been added to test.test_pydoc to verify the behaviour of the modifications.

    4. No functionality is broken if Python is built without unicode support.
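
The encoding selection in remark 2 could look roughly like the sketch below. This is not the code from the attached patch: `DEFAULT_ENCODING`, `encoding_for`, `encode_for` and `Latin1Stream` are hypothetical names used only to illustrate the rule "use the file's encoding if defined, else the module-wide default, with XML character references as fall-back".

```python
import io
import locale

# Module-wide default taken from the locale; ASCII if the locale gives nothing.
DEFAULT_ENCODING = locale.getpreferredencoding(False) or "ascii"


def encoding_for(stream, default=DEFAULT_ENCODING):
    """Prefer the stream's own encoding when it defines one."""
    return getattr(stream, "encoding", None) or default


def encode_for(stream, text, default=DEFAULT_ENCODING):
    """Encode text for the stream, never raising on unencodable characters."""
    return text.encode(encoding_for(stream, default), "xmlcharrefreplace")


class Latin1Stream(object):
    """Stand-in for a file object that declares its encoding."""
    encoding = "latin-1"


raw = io.BytesIO()  # a byte stream with no 'encoding' attribute

with_stream_encoding = encode_for(Latin1Stream(), u"Orr\xf9 \u211a")
with_default = encode_for(raw, u"Orr\xf9", default="ascii")
```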

    @loewis
    Mannequin

    loewis mannequin commented Aug 27, 2012

    I fail to see how this patch solves this issue. Taking the example from bpo-15791, I still get the traceback of that issue, namely in the line

        result = result + self.section('AUTHOR', str(object.__author__))

    If __author__ is a unicode object, it's the str call that fails. This is long before any attempt is made to render the resulting string to an output device.

    @mumind
    Mannequin

    mumind mannequin commented Nov 11, 2012

    I just ran into this, and I'd like to communicate how unfortunate it is that it's not a priority to fix this fairly trivial (?) bug. It means there's no way to define a unicode string literal with non-ascii characters that won't crash the builtin help() command.

    I ran into this with the desktop package (http://pypi.python.org/pypi/desktop) where the only useful documentation right now is the source code and the docstrings. Apparently the author, who has non-ascii characters in his name, did me a favor by using broken encoding on the doc string so that at least I could read everything except for his name in the help. I tried to correct the encoding and found I get a nice traceback instead of help. And to top it all off, googling for things like "help unicode docstring" and "python help ascii codec" turns up nothing. I only found this issue once I thought to include "pipepager" in the search...

    @mumind
    Mannequin

    mumind mannequin commented Nov 11, 2012

    Also, the resolution is still marked as "fixed", which is not correct...

    @bitdancer
    Member

    It is not so much that it isn't a priority, as that no one has suggested a working fix that is suitable for 2.7. Do you have a suggestion?

    @mumind
    Mannequin

    mumind mannequin commented Nov 11, 2012

    I guess it must be more complicated than it looks, because I thought checking for unicode strings and doing .encode('utf-8') would help at least some cases without making anything worse.

    Anyways, if it's too hard or not worth fixing "correctly", couldn't we at least do something to prevent a crash? Maybe strip out / replace special characters and try again?

    @akitada
    Mannequin

    akitada mannequin commented Sep 16, 2013

    Attaching a modified version of bpo-1065986.patch.
    The differences are:

    • Added _binstr(), which is str() that works with unicode objects.
    • Changed getdoc() to return encoded docstrings/comments
    • Used _binstr() to convert __version__, __date__, __author__ and
      __credits__ to str
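
As an illustration of the first bullet, a helper of this kind might be sketched as follows. This is not the `_binstr()` from the patch, which targets Python 2 and may differ. The name `binstr` and the UTF-8 default are assumptions, and the code is written so that it also runs on Python 3, where byte strings are `bytes`.

```python
def binstr(obj, encoding="utf-8"):
    """Return a byte string for obj without the implicit ASCII step
    that makes str() fail on non-ASCII unicode objects in Python 2."""
    if isinstance(obj, bytes):
        return obj
    if not isinstance(obj, type(u"")):
        # Numbers and other objects: format them as text first.
        obj = u"%s" % (obj,)
    return obj.encode(encoding, "xmlcharrefreplace")


author = binstr(u"Michele Orr\xf9")
version = binstr(1.0)
already_bytes = binstr(b"2004-11-14")
```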

    @akitada
    Mannequin

    akitada mannequin commented Sep 17, 2013

    With this patch applied, the example from bpo-15791 works fine.

    $ echo "__author__ = u'Michele Orr\xf9'" > foo.py && ./python -c "import foo; print foo.__author__; help(foo)"
    Michele Orrù
    Help on module foo:

    NAME
    foo

    FILE
    /tmp/cpython/foo.py

    DATA
    __author__ = u'Michele Orr\xf9'

    AUTHOR
    Michele Orrù

    @akitada
    Mannequin

    akitada mannequin commented Sep 21, 2013

    Updated the previous patch to test that unicode strings in __{version,date,author,credits}__ don't crash pydoc.

    @akitada
    Mannequin

    akitada mannequin commented Sep 21, 2013

    Now we have a working fix for 2.7.
    Could someone please review the attached patch?

    @bitdancer
    Member

    Benjamin: the patch looks pretty good to me, for fixing the problem of docstrings that are explicitly unicode. But before I go to the trouble of a full review and test, is this a level of change you think is acceptable in 2.7 at this point in its lifecycle?

    @akitada
    Mannequin

    akitada mannequin commented Nov 20, 2013

    Added <meta charset="utf-8"> to the HTML that pydoc generates.

    @benjaminp
    Contributor

    Okay with me.

    @merwok
    Member

    merwok commented Nov 20, 2013

    LGTM.

    One thing: did you mean assertEqual in Lib/test/test_pydoc.py:466: self.assertTrue(open('pipe').read(), pydoc._encode(doc))?

    @akitada
    Mannequin

    akitada mannequin commented Nov 20, 2013

    Good catch. Fixed.

    @bitdancer
    Member

    Made some review comments.

    Looks good in general and it seems like the tests are fairly comprehensive. I haven't tried to run any additional experiments, but I don't see how it could make things worse, since the new code paths will only do something different if unicode objects are actually involved.

    @akitada
    Mannequin

    akitada mannequin commented Jan 5, 2014

    Made a few more adjustments to fix things r.david.murray pointed out.

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 5, 2014

    New changeset bf077fc97fdd by R David Murray in branch '2.7':
    bpo-1065986: Make pydoc handle unicode strings.
    http://hg.python.org/cpython/rev/bf077fc97fdd

    @bitdancer
    Member

    Committed, thanks Akira.

    The support for --disable-unicode is not fully tested. I tried running the tests but the _io module wasn't built, so regrtest doesn't work. A command line invocation of pydoc worked fine, though.

    @python-dev
    Mannequin

    python-dev mannequin commented Jan 5, 2014

    New changeset e57660acc6d4 by R David Murray in branch '2.7':
    bpo-1065986: add missing error handler in pydoc unicode fix.
    http://hg.python.org/cpython/rev/e57660acc6d4

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022