Issue 1065986: Fix pydoc crashing on unicode strings

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/41169

classification

Title:	Fix pydoc crashing on unicode strings
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	akitada, benjamin.peterson, cben, eric.araujo, loewis, mu_mind, ness, ping, python-dev, r.david.murray, rhettinger, taschini, vstinner
Priority:	normal	Keywords:	patch

Created on 2004-11-14 01:36 by cben, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
PYDOC-UNICODE.diff	cben, 2004-11-14 01:36	Patch to Lib/pydoc.py to use unicode when needed
issue1065986.patch	taschini, 2012-04-27 12:21
issue1065986-2.patch	akitada, 2013-09-16 10:17
issue1065986-3.patch	akitada, 2013-09-21 07:05
issue1065986-4.patch	akitada, 2013-11-20 14:59
issue1065986-5.patch	akitada, 2013-11-20 23:43
issue1065986-6.patch	akitada, 2014-01-05 07:58

Messages (31)
msg47289 - (view)	Author: Cherniavsky Beni (cben) *	Date: 2004-11-14 01:36
The pydoc module currently only outputs ASCII and crashes with UnicodeEncodeError when printing a unicode string (in contexts where it prints the str rather than a repr, e.g. docstrings or variables like `__credits__`). The most ironic example of it is that since patch 1009389 was committed, ``pydoc.py pydoc`` crashes on its own `__credits__`!

This patch changes pydoc help functions to return unicode strings only when needed; it returns ASCII strings if all characters are from ASCII. Therefore there should be no compatibility problems. For output, all pager functions were changed to encode to the locale's preferred encoding and HTML output was changed to always use UTF-8.

cgitb.py, DocXMLRPCServer.py and/or SimpleXMLRPCServer.py seem to rely on pydoc to some degree. I didn't touch them, so they might still be broken in this respect.
msg47290 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2004-11-14 10:22
This is too major a change so shortly before the 2.4 release, so postponing it to 2.5.
msg47291 - (view)	Author: Ka-Ping Yee (ping) *	Date: 2004-11-17 11:45
I'm so sorry this has caused so much trouble. The silly moose comment is my fault; it can be removed.
msg47292 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2005-08-24 07:10
I believe this was fixed. Feel free to re-open if something is unresolved.
msg158138 - (view)	Author: Tom Bachmann (ness)	Date: 2012-04-12 14:49
Hello,

[this is my first bug report, so I'm sorry if I'm not adhering to some conventions]

in what versions of python is this supposed to be fixed? Consider:

% python
Python 2.7.2+ (default, Nov 30 2011, 19:22:03)
[GCC 4.6.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from pydoc import pager
>>> from locale import getpreferredencoding
>>> expr = u'\u211a'
>>> pager(expr) # error
>>> pager(expr.encode(getpreferredencoding())) # works

The error is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/pydoc.py", line 1318, in pager
    pager(text)
  File "/usr/lib/python2.7/pydoc.py", line 1332, in <lambda>
    return lambda text: pipepager(text, os.environ['PAGER'])
  File "/usr/lib/python2.7/pydoc.py", line 1359, in pipepager
    pipe.write(text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u211a' in position 0: ordinal not in range(128)

Best,
Tom
msg158141 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-04-12 15:08
It is fixed in Python3. Apparently Raymond was wrong about it having been fixed earlier (or perhaps he was referring to the unicode being removed from the pydoc __credits__ string).
msg158142 - (view)	Author: Tom Bachmann (ness)	Date: 2012-04-12 15:08
I see. Thank you.

On 12.04.2012 16:08, R. David Murray wrote:
>
> R. David Murray <rdmurray@bitdance.com> added the comment:
>
> It is fixed in Python3. Apparently Raymond was wrong about it having been fixed earlier (or perhaps he was referring to the unicode being removed from the pydoc __credits__ string).
>
> ----------
> nosy: +r.david.murray
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue1065986>
> _______________________________________
msg159140 - (view)	Author: Stefano Taschini (taschini) *	Date: 2012-04-24 14:01
Shouldn't this be reopened for Python 2.7?
msg159141 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-04-24 14:13
I don't think so. We aren't promising unicode support in pydoc in 2.x, and it is too late to add it.
msg159160 - (view)	Author: Stefano Taschini (taschini) *	Date: 2012-04-24 15:26
Oh well, in that case I guess we'll have to work around it.

Here's the monkey patch I use to overcome this limitation in pydoc, in case others wish to add it to their PYTHONSTARTUP or sitecustomize:

def pipepager(text, cmd):
    """Page through text by feeding it to another program."""
    import os  # imported locally so the startup namespace stays clean
    try:
        import locale
    except ImportError:
        encoding = "ascii"
    else:
        encoding = locale.getpreferredencoding()
    pipe = os.popen(cmd, 'w')
    try:
        # Encode unicode text for the pager; byte strings pass through unchanged.
        pipe.write(text.encode(encoding, 'xmlcharrefreplace') if isinstance(text, unicode) else text)
        pipe.close()
    except IOError:
        pass # Ignore broken pipes caused by quitting the pager program.

import pydoc
pydoc.pipepager = pipepager
del pydoc, pipepager
msg159167 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-04-24 16:27
Hmm. Making it not raise an error while still producing useful output would be acceptable as a bug fix if that's all it takes, I think.
msg159452 - (view)	Author: Stefano Taschini (taschini) *	Date: 2012-04-27 12:21
Here's my patch, along the lines of the work-around I posted earlier. A few remarks:

1. The modifications in pydoc only touch the four console pagers and the html pager (html.page).

2. A module-wide default encoding is initialized from locale.getpreferredencoding. Pagers that write to a file use the encoding of that file if defined, else they use the module-wide default. The html pager uses ascii. All of them use xml character entity replacement as fall-back.

3. An additional set of tests has been added to test.test_pydoc to verify the behaviour of the modifications.

4. No functionality is broken if Python is built without unicode support.
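The encoding choice described in remark 2 can be sketched roughly as follows. This is an illustration only, not the code from the attached patch: the names `DEFAULT_ENCODING`, `encode_for`, `NoEncodingStream` and `Utf8Stream` are hypothetical, and the sketch is written so that it runs on both Python 2 and Python 3.

```python
import locale

# Hypothetical module-wide default, initialised once from the locale.
DEFAULT_ENCODING = locale.getpreferredencoding() or "ascii"


def encode_for(stream, text, default=DEFAULT_ENCODING):
    """Encode text with the stream's encoding if it has one, else the default.

    Characters the target encoding cannot represent are written as XML
    character references instead of raising UnicodeEncodeError.
    """
    encoding = getattr(stream, "encoding", None) or default
    return text.encode(encoding, "xmlcharrefreplace")


class NoEncodingStream(object):
    """Stand-in for a file object that does not define an encoding."""
    encoding = None


class Utf8Stream(object):
    """Stand-in for a file object that declares UTF-8."""
    encoding = "utf-8"


# U+211A cannot be encoded as ASCII, so it falls back to a character reference.
fallback = encode_for(NoEncodingStream(), u"\u211a", default="ascii")
# A stream that declares UTF-8 gets the real UTF-8 bytes.
native = encode_for(Utf8Stream(), u"\u211a", default="ascii")
```

The point of the `xmlcharrefreplace` error handler is that output is never lost silently and never crashes the pager: an unrepresentable character shows up as a numeric reference that can still be looked up.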
msg169209 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-08-27 16:48
I fail to see how this patch solves this issue. Taking the example from issue15791, I still get the traceback of that issue, namely in the line

    result = result + self.section('AUTHOR', str(object.__author__))

If __author__ is a unicode object, it's the str call that fails. This is long before any attempt is made to render the resulting string to an output device.
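A small sketch of why the str call fails: in Python 2, str() on a unicode object performs an implicit encode with the default codec, which is normally ASCII. The helper `py2_str` below is a hypothetical stand-in that makes that implicit step explicit, so the snippet behaves the same on Python 2 and Python 3.

```python
author = u"Michele Orr\xf9"


def py2_str(value):
    """Mimic Python 2's str(unicode_obj): an implicit ASCII encode."""
    return value.encode("ascii")


try:
    py2_str(author)
    failed = False
    position = None
except UnicodeEncodeError as exc:
    # The non-ASCII character is rejected before any pager is involved.
    failed = True
    position = exc.start
```

This is why fixing the pagers alone is not enough: the conversion has to be handled where the metadata strings are turned into text, not only where the text is written out.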
msg175304 - (view)	Author: David Barnett (mu_mind)	Date: 2012-11-11 00:18
I just ran into this, and I'd like to communicate how unfortunate it is that it's not a priority to fix this fairly trivial (?) bug. It means there's no way to define a unicode string literal with non-ascii characters that won't crash the builtin help() command. I ran into this with the desktop package (http://pypi.python.org/pypi/desktop) where the only useful documentation right now is the source code and the docstrings. Apparently the author, who has non-ascii characters in his name, did me a favor by using broken encoding on the doc string so that at least I could read everything except for his name in the help. I tried to correct the encoding and found I get a nice traceback instead of help. And to top it all off, googling for things like "help unicode docstring" and "python help ascii codec" turns up nothing. I only found this issue once I thought to include "pipepager" in the search...
msg175305 - (view)	Author: David Barnett (mu_mind)	Date: 2012-11-11 00:18
Also, the resolution is still marked as "fixed", which is not correct...
msg175313 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-11-11 03:43
It is not so much that it isn't a priority, as that no one has suggested a working fix that is suitable for 2.7. Do you have a suggestion?
msg175329 - (view)	Author: David Barnett (mu_mind)	Date: 2012-11-11 08:22
I guess it must be more complicated than it looks, because I thought checking for unicode strings and doing .encode('utf-8') would help at least some cases without making anything worse. Anyways, if it's too hard or not worth fixing "correctly", couldn't we at least do something to prevent a crash? Maybe strip out / replace special characters and try again?
msg197886 - (view)	Author: Akira Kitada (akitada) *	Date: 2013-09-16 10:17
Attaching a modified version of issue1065986.patch. The differences are:

- Added _binstr(), which is str() that works with unicode objects.
- Changed getdoc() to return encoded docstrings/comments
- Used _binstr() to convert __version__, __date__, __author__ and __credits__ to str
msg197992 - (view)	Author: Akira Kitada (akitada) *	Date: 2013-09-17 14:30
With this patch applied, the example from issue15791 works fine.

$ echo "__author__ = u'Michele Orr\xf9'" > foo.py && ./python -c "import foo; print foo.__author__; help(foo)"
Michele Orrù
Help on module foo:

NAME
    foo

FILE
    /tmp/cpython/foo.py

DATA
    __author__ = u'Michele Orr\xf9'

AUTHOR
    Michele Orrù
msg198186 - (view)	Author: Akira Kitada (akitada) *	Date: 2013-09-21 07:05
Updated the previous patch to test that unicode strings in __{version,date,author,credits}__ don't crash.
msg198201 - (view)	Author: Akira Kitada (akitada) *	Date: 2013-09-21 13:44
Now we have a working fix for 2.7. Could someone please review the attached patch?
msg203482 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2013-11-20 14:23
Benjamin: the patch looks pretty good to me, for fixing the problem of docstrings that are explicitly unicode. But before I go to the trouble of a full review and test, is this a level of change you think is acceptable in 2.7 at this point in its lifecycle?
msg203486 - (view)	Author: Akira Kitada (akitada) *	Date: 2013-11-20 14:59
Added <meta charset="utf-8"> to the HTML that pydoc generates.
msg203487 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2013-11-20 15:00
Okay with me.
msg203530 - (view)	Author: Éric Araujo (eric.araujo) *	Date: 2013-11-20 20:34
LGTM. One thing: did you mean assertEqual in Lib/test/test_pydoc.py:466:

    self.assertTrue(open('pipe').read(), pydoc._encode(doc))

?
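The reason this matters can be shown with a short, self-contained sketch (the `AssertDemo` class is hypothetical, not part of test_pydoc). The second positional argument of assertTrue is the failure message, so the call only checks that the first value is truthy and never compares the two values.

```python
import unittest


class AssertDemo(unittest.TestCase):
    def test_assert_true_ignores_second_argument(self):
        # "expected" is taken as the failure message, not compared,
        # so this passes even though the two strings differ.
        self.assertTrue("actual", "expected")

    def test_assert_equal_compares(self):
        # assertEqual really compares, so differing values raise.
        with self.assertRaises(AssertionError):
            self.assertEqual("actual", "expected")


# Run the two tests programmatically and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(AssertDemo)
result = unittest.TestResult()
suite.run(result)
```

Both tests pass, which shows that an assertTrue call written this way would keep passing even if the pager wrote the wrong bytes.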
msg203548 - (view)	Author: Akira Kitada (akitada) *	Date: 2013-11-20 23:43
Good catch. Fixed.
msg207342 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-01-05 03:31
Made some review comments. Looks good in general and it seems like the tests are fairly comprehensive. I haven't tried to run any additional experiments, but I don't see how it could make things worse, since the new code paths will only do something different if unicode objects are actually involved.
msg207364 - (view)	Author: Akira Kitada (akitada) *	Date: 2014-01-05 07:58
Made a few more adjustments to fix things r.david.murray pointed out.
msg207399 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-01-05 20:39
New changeset bf077fc97fdd by R David Murray in branch '2.7': #1065986: Make pydoc handle unicode strings. http://hg.python.org/cpython/rev/bf077fc97fdd
msg207400 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2014-01-05 20:42
Committed, thanks Akira. The support for --disable-unicode is not fully tested. I tried running the tests but the _io module wasn't built, so regrtest doesn't work. A command line invocation of pydoc worked fine, though.
msg207403 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-01-05 22:14
New changeset e57660acc6d4 by R David Murray in branch '2.7': #1065986: add missing error handler in pydoc unicode fix. http://hg.python.org/cpython/rev/e57660acc6d4

History
Date	User	Action	Args
2022-04-11 14:56:08	admin	set	github: 41169
2014-01-05 22:14:27	python-dev	set	messages: + msg207403
2014-01-05 20:42:28	r.david.murray	set	status: open -> closed resolution: fixed messages: + msg207400 stage: patch review -> resolved
2014-01-05 20:39:40	python-dev	set	nosy: + python-dev messages: + msg207399
2014-01-05 07:58:54	akitada	set	files: + issue1065986-6.patch messages: + msg207364
2014-01-05 03:31:09	r.david.murray	set	messages: + msg207342
2013-11-20 23:43:40	akitada	set	files: + issue1065986-5.patch messages: + msg203548
2013-11-20 20:34:08	eric.araujo	set	messages: + msg203530
2013-11-20 15:00:20	benjamin.peterson	set	messages: + msg203487
2013-11-20 14:59:10	akitada	set	files: + issue1065986-4.patch messages: + msg203486
2013-11-20 14:23:50	r.david.murray	set	nosy: + benjamin.peterson messages: + msg203482
2013-11-20 13:51:08	pitrou	set	nosy: + vstinner
2013-09-21 21:36:33	pitrou	set	stage: patch review
2013-09-21 13:44:35	akitada	set	messages: + msg198201
2013-09-21 07:05:12	akitada	set	files: + issue1065986-3.patch messages: + msg198186
2013-09-17 14:30:18	akitada	set	messages: + msg197992
2013-09-16 10:17:04	akitada	set	files: + issue1065986-2.patch messages: + msg197886
2013-09-07 20:23:57	akitada	set	nosy: + akitada
2012-11-11 08:22:29	mu_mind	set	messages: + msg175329
2012-11-11 03:43:28	r.david.murray	set	resolution: fixed -> (no value) messages: + msg175313
2012-11-11 00:18:54	mu_mind	set	messages: + msg175305
2012-11-11 00:18:28	mu_mind	set	messages: + msg175304
2012-11-10 23:57:21	mu_mind	set	nosy: + mu_mind
2012-08-27 16:48:52	loewis	set	messages: + msg169209
2012-08-27 16:38:22	r.david.murray	link	issue15791 superseder
2012-04-27 12:21:59	taschini	set	files: + issue1065986.patch messages: + msg159452
2012-04-26 16:54:23	eric.araujo	set	nosy: + eric.araujo
2012-04-24 16:27:02	r.david.murray	set	status: closed -> open messages: + msg159167
2012-04-24 15:26:42	taschini	set	messages: + msg159160
2012-04-24 14:13:40	r.david.murray	set	messages: + msg159141
2012-04-24 14:01:36	taschini	set	type: behavior messages: + msg159140 versions: + Python 2.7, - Python 2.5
2012-04-23 07:37:29	taschini	set	nosy: + taschini
2012-04-12 15:08:59	ness	set	messages: + msg158142
2012-04-12 15:08:31	r.david.murray	set	nosy: + r.david.murray messages: + msg158141
2012-04-12 14:49:09	ness	set	nosy: + ness messages: + msg158138
2004-11-14 01:36:55	cben	create