classification
Title: pydoc 3.x raises UnicodeEncodeError on sqlite3 package
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.5, Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: martin.panter, python-dev, r.david.murray, serhiy.storchaka, skip.montanaro
Priority: normal Keywords: patch

Created on 2015-02-01 19:24 by skip.montanaro, last changed 2015-02-20 21:49 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
pydoc_encoding.patch serhiy.storchaka, 2015-02-02 15:21
pydoc_encoding_2.patch serhiy.storchaka, 2015-02-15 12:59
Messages (12)
msg235200 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2015-02-01 19:24
I'm probably doing something wrong, but I've tried everything I can think of
without any success.

In Python 2.7, the pydoc command successfully displays help for the sqlite3
package, though it muffs the output of Gerhard Häring's name, spitting out
the original Latin-1 spelling. In Python 3.x, I get a UnicodeEncodeError for
my trouble, and it hoses my tty settings to boot, requiring a LF reset LF
sequence to put right unless I set PAGER to "cat".

Here's a sample run:

% PAGER=cat pydoc3.5 sqlite3
Traceback (most recent call last):
  File "/Users/skip/local/bin/pydoc3.5", line 5, in <module>
    pydoc.cli()
  File "/Users/skip/local/lib/python3.5/pydoc.py", line 2591, in cli
    help.help(arg)
  File "/Users/skip/local/lib/python3.5/pydoc.py", line 1874, in help
    elif request: doc(request, 'Help on %s:', output=self._output)
  File "/Users/skip/local/lib/python3.5/pydoc.py", line 1612, in doc
    pager(render_doc(thing, title, forceload))
  File "/Users/skip/local/lib/python3.5/pydoc.py", line 1412, in pager
    pager(text)
  File "/Users/skip/local/lib/python3.5/pydoc.py", line 1428, in <lambda>
    return lambda text: pipepager(text, os.environ['PAGER'])
  File "/Users/skip/local/lib/python3.5/pydoc.py", line 1455, in pipepager
    pipe.write(text)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 600: ordinal not in range(128)

I understand the error, but I see no way to convince it to use any codec
other than "ascii".  Stuff I tried:

* setting PYTHONIOENCODING to "UTF-8" (suggested by Peter Otten on c.l.py)
* setting LANG to "en_US.utf8"

This is on a Mac running Yosemite with pydoc invoked in Apple's Terminal
app. Display is fine in my browser when I run pydoc as a web server.

The source it is attempting to display has a coding cookie, so it should
know that the code is encoded using Latin-1. The problem seems to all be
about generating output.
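
The failure in the traceback can be reproduced without pydoc or a pager. The following is a minimal sketch, assuming an in-memory io.BytesIO as a stand-in for the pipe to the pager and an explicitly ASCII text wrapper in place of the one os.popen() created; the names raw, pipe and error are illustrative only.

```python
import io

# Stand-in for the pipe to the pager: a text stream whose encoding is ASCII,
# matching what os.popen('cat', 'w').encoding reported in this environment.
raw = io.BytesIO()
pipe = io.TextIOWrapper(raw, encoding="ascii")

# The author's name contains U+00E4, which ASCII cannot represent.
text = "Gerhard H\xe4ring"

error = None
try:
    pipe.write(text)
    pipe.flush()
except UnicodeEncodeError as exc:
    error = exc

# Report the codec and the offending character in an ASCII-safe form.
print(type(error).__name__, error.encoding, ascii(error.object[error.start]))
```

With the default "strict" error handler, a single unencodable character is enough to abort the whole write.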
msg235201 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-01 19:51
What are sys.getfilesystemencoding(), locale.getpreferredencoding(False), os.popen('cat', 'w').encoding?
msg235202 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2015-02-01 20:15
Without setting any environment variables:

>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'
>>> import locale
>>> locale.getpreferredencoding(False)
'US-ASCII'
>>> import os
>>> os.popen('cat', 'w').encoding
'US-ASCII'

If I set PYTHONIOENCODING=UTF-8:

>>> import sys, locale, os
>>> sys.getfilesystemencoding()
'utf-8'
>>> locale.getpreferredencoding(False)
'US-ASCII'
>>> os.popen('cat', 'w').encoding
'US-ASCII'

If I set LANG=en_US.utf8:

>>> import sys, locale, os
>>> sys.getfilesystemencoding()
'utf-8'
>>> locale.getpreferredencoding(False)
'US-ASCII'
>>> os.popen('cat', 'w').encoding
'US-ASCII'

It appears neither of these environment variables does much in my environment.

I should point out that I just updated to Mac OS X 10.10.2 a couple
days ago. I have no idea if this problem existed before that upgrade.
Realizing that perhaps something had changed in the underlying
operating system support, I rebuilt Python 2.6 through 3.5 from
scratch. Same result.
msg235203 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2015-02-01 20:19
Peter Otten posted a solution on c.l.py. The issue is that I didn't
mix my case properly when setting LANG:

hgpython% LANG=en_US.UTF-8 python3.5 -c 'import locale;
print(locale.getpreferredencoding(False))'
UTF-8
hgpython% LANG=en_US.utf8 python3.5 -c 'import locale;
print(locale.getpreferredencoding(False))'
US-ASCII
msg235204 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2015-02-01 20:26
On Sun, Feb 1, 2015 at 2:19 PM, Skip Montanaro <report@bugs.python.org> wrote:
> The issue is that I didn't
> mix my case properly when setting LANG:

Actually, it's that the hyphen is required in "utf-8" or "UTF-8".
msg235206 - (view) Author: Skip Montanaro (skip.montanaro) * Date: 2015-02-01 20:59
Final note here. Peter also did a bit of digging. Here's his note about
what he found on c.l.py:

The pager is invoked by os.popen(), and after some digging I find that it
uses an io.TextIOWrapper() to write the help text. This in turn uses
locale.getpreferredencoding(False), i.e. you were right to set LANG and
PYTHONIOENCODING is not relevant.

I was also able to provoke this problem on an openSuSE 12.2 system with
3.2.3 installed. In that environment (confirmed by Chris Angelico on his
Linux system), the case of "utf" didn't matter, nor did it matter if
"utf-8" was hyphenated or not. Obviously the Mac continues to be a rather
touchy system w.r.t. locale.

I don't know if Python should try to be accommodating here, but my
inclination is "no". OTOH, maybe io.TextIOWrapper should look at
PYTHONIOENCODING, or the pager should be invoked through something other
than os.popen (assuming there is a suitable replacement which does pay
attention to PYTHONIOENCODING).
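
Peter's finding can be checked directly. This is a small sketch, assuming an io.BytesIO in place of the real pipe: a text wrapper created without an explicit encoding takes its default from the locale, not from PYTHONIOENCODING. The helper name canonical is hypothetical, and newer Python versions or UTF-8 mode may choose a different default than the one seen in this report.

```python
import codecs
import io
import locale

def canonical(name):
    """Normalize an encoding name so that 'UTF-8' and 'utf8' compare equal."""
    return codecs.lookup(name).name

# No encoding argument: the wrapper falls back to the locale's preferred
# encoding, which is why setting LANG mattered and PYTHONIOENCODING did not.
stream = io.TextIOWrapper(io.BytesIO())

default_encoding = canonical(stream.encoding)
locale_encoding = canonical(locale.getpreferredencoding(False))
print(default_encoding, locale_encoding)
```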
msg235208 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-02-01 21:32
Maybe because a pager sends its bytes more-or-less straight through from input to output, the PYTHONIOENCODING (sys.stdout.encoding?) should be used for the TextIOWrapper to the pager’s input in this case. I’m not so sure this should be assumed in general though.
msg235263 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-02 15:21
There are a few levels to this issue:

1) pydoc doesn't escape characters according to the output encoding. It escapes characters that are unencodable with sys.getfilesystemencoding(), but this encoding can differ from the encoding of sys.stdout or the default encoding.

2) The default encoding for io.TextIOWrapper() and open() can be different from sys.getfilesystemencoding(). And it can unexpectedly be ASCII.

3) Mac OS doesn't support locales with the utf8 encoding (without hyphen).

Here is a patch which solves the first level: it makes pydoc use the appropriate encoding with the backslashreplace error handler.
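
The effect of the backslashreplace error handler can be sketched as follows. This is not the code from the attached patch; the helper name escape_for is hypothetical and only illustrates the idea of escaping whatever the output encoding cannot represent instead of raising.

```python
def escape_for(text, encoding):
    """Return text with characters that `encoding` cannot represent
    replaced by backslash escapes, so a later strict encode cannot fail."""
    return text.encode(encoding, "backslashreplace").decode(encoding)

name = "Gerhard H\xe4ring"

# ASCII cannot represent U+00E4, so it becomes the four-character escape \xe4.
ascii_safe = escape_for(name, "ascii")

# UTF-8 can represent it, so the text is returned unchanged.
utf8_safe = escape_for(name, "utf-8")

print(ascii_safe)
```

The output is less pretty than the real character, but the pager receives something readable and pydoc does not crash.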
msg236036 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-15 12:59
Added a test.
msg236076 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-02-15 22:49
Patch looks sensible to me. This is another example of where Issue 15216 would be useful (a standard way to modify the encoding settings of a stream).
msg236078 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-02-15 23:11
In the case of this issue, pydoc needs to change not the encoding of stdout but its error handler. There is a similar issue with pprint (issue19100).
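
One way to change only the error handler of an existing text stream is to rewrap its underlying buffer. The sketch below assumes a helper named rewrap_with_backslashreplace (hypothetical, not part of pydoc or of the patch) and uses an in-memory io.BytesIO rather than the real stdout.

```python
import io

def rewrap_with_backslashreplace(stream):
    """Return a new text wrapper over the same binary buffer that keeps the
    encoding of `stream` but escapes unencodable characters."""
    encoding = stream.encoding
    line_buffering = stream.line_buffering
    raw = stream.detach()  # flushes; the old wrapper is unusable afterwards
    return io.TextIOWrapper(raw, encoding=encoding,
                            errors="backslashreplace",
                            line_buffering=line_buffering)

# Demo on an ASCII stream that would otherwise raise UnicodeEncodeError.
demo_raw = io.BytesIO()
strict = io.TextIOWrapper(demo_raw, encoding="ascii")
lenient = rewrap_with_backslashreplace(strict)
lenient.write("Gerhard H\xe4ring")
lenient.flush()

print(demo_raw.getvalue())
```

Python 3.7 and later also provide TextIOWrapper.reconfigure(errors=...), which changes the handler in place; that method did not exist in the versions this issue covers.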
msg236334 - (view) Author: Roundup Robot (python-dev) Date: 2015-02-20 21:48
New changeset e7b6b1f57268 by Serhiy Storchaka in branch '3.4':
Issue #23374: Fixed pydoc failure with non-ASCII files when stdout encoding
https://hg.python.org/cpython/rev/e7b6b1f57268

New changeset affe167a45f3 by Serhiy Storchaka in branch 'default':
Issue #23374: Fixed pydoc failure with non-ASCII files when stdout encoding
https://hg.python.org/cpython/rev/affe167a45f3
History
Date User Action Args
2015-02-20 21:49:08 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2015-02-20 21:48:31 python-dev set nosy: + python-dev; messages: + msg236334
2015-02-15 23:11:53 serhiy.storchaka set messages: + msg236078
2015-02-15 22:49:05 martin.panter set messages: + msg236076
2015-02-15 12:59:01 serhiy.storchaka set files: + pydoc_encoding_2.patch; messages: + msg236036
2015-02-15 12:08:41 serhiy.storchaka set assignee: serhiy.storchaka
2015-02-02 15:21:25 serhiy.storchaka set files: + pydoc_encoding.patch; versions: - Python 3.2, Python 3.3; messages: + msg235263; keywords: + patch; type: crash -> behavior; stage: patch review
2015-02-01 23:00:19 r.david.murray set nosy: + r.david.murray
2015-02-01 21:32:48 martin.panter set nosy: + martin.panter; messages: + msg235208
2015-02-01 20:59:39 skip.montanaro set messages: + msg235206
2015-02-01 20:26:39 skip.montanaro set messages: + msg235204
2015-02-01 20:19:07 skip.montanaro set messages: + msg235203
2015-02-01 20:15:41 skip.montanaro set messages: + msg235202
2015-02-01 19:51:30 serhiy.storchaka set nosy: + serhiy.storchaka; messages: + msg235201
2015-02-01 19:24:18 skip.montanaro create