Issue 19100: Use backslashreplace in pprint

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63299

classification

Title:	Use backslashreplace in pprint
Type:	behavior	Stage:	patch review
Components:	Library (Lib), Unicode	Versions:	Python 3.3, Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	fdrake	Nosy List:	doerwalter, ezio.melotti, fdrake, martin.panter, pitrou, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2013-09-27 11:05 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
pprint_unencodable.patch	serhiy.storchaka, 2013-09-27 11:05		review
pprint_unencodable_2.patch	serhiy.storchaka, 2013-12-10 19:18		review

Messages (13)
msg198465 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-09-27 11:05
Currently pprint.pprint() fails on unencodable characters.

$ LANG=en_US.utf8 ./python -c "import pprint; pprint.pprint('\u20ac')"
'€'
$ LANG= ./python -c "import pprint; pprint.pprint('\u20ac')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/pprint.py", line 56, in pprint
    printer.pprint(object)
  File "/home/serhiy/py/cpython/Lib/pprint.py", line 137, in pprint
    self._format(object, self._stream, 0, 0, {}, 0)
  File "/home/serhiy/py/cpython/Lib/pprint.py", line 274, in _format
    write(rep)
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 1: ordinal not in range(128)

This is a regression from Python 2, in which repr() always returns an ASCII string.

$ LANG= python2.7 -c "import pprint; pprint.pprint(u'\u20ac')"
u'\u20ac'

Perhaps pprint() should use the backslashreplace error handler (as sys.displayhook() does).

With the proposed patch:

$ LANG= ./python -c "import pprint; pprint.pprint('\u20ac')"
'\u20ac'
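For illustration only, the effect of the backslashreplace handler on pprint output can be sketched with a stream that is explicitly configured to use it. This is not the attached patch; the helper name pprint_escaped and the in-memory buffer are assumptions made for the example.

```python
import io
import pprint

def pprint_escaped(obj, encoding="ascii"):
    """Pretty-print obj to bytes, escaping characters the encoding cannot represent.

    Hypothetical helper for illustration; it is not part of pprint.
    """
    raw = io.BytesIO()
    # Text layer over a byte buffer, using the permissive error handler.
    text = io.TextIOWrapper(raw, encoding=encoding,
                            errors="backslashreplace", newline="\n")
    pprint.pprint(obj, stream=text)
    text.flush()
    data = raw.getvalue()
    text.detach()  # leave the underlying buffer open
    return data

print(pprint_escaped("\u20ac"))
```

With a strict ASCII stream the same call would raise UnicodeEncodeError, as in the traceback above.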
msg204952 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-01 20:01
Any review?
msg205846 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-10 19:18
In the new patch, the wrapping of the stream is moved to the PrettyPrinter constructor.
msg205902 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2013-12-11 11:42
This is not the fault of pprint. IMHO it doesn't make sense to fix anything here, at least not for pprint specifically. print() has the same "problem":

$ LANG= ./python -c "print('\u20ac')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 0: ordinal not in range(128)
msg205907 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-11 15:07
pprint is not print.

>>> print('\u20ac')
€
>>> import pprint; pprint.pprint('\u20ac')
'€'

Default sys.displayhook doesn't fail on unencodable output.

$ LANG=C ./python
Python 3.4.0b1 (default:e961a166dc70+, Dec 11 2013, 13:57:17)
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u20ac'
'\u20ac'
msg206178 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2013-12-14 11:49
sys.displayhook doesn't fail, because it uses the backslashreplace error handler, and for sys.displayhook that's OK, because it's only used for screen output and there some output is better than no output. However print and pprint.pprint might be used for output that is consumed by other programs (via pipes etc.) and IMHO in this case "Errors should never pass silently."
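The fallback that sys.displayhook relies on can be sketched roughly as follows, modelled on the pseudo-code in the sys.displayhook documentation. The function name write_with_fallback and the in-memory ASCII stream are assumptions for the example; whether pprint should behave this way is exactly what is being debated here.

```python
import io
import pprint

def write_with_fallback(stream, text):
    """Write text; on an encoding error, retry with backslashreplace.

    Hypothetical helper modelled on the documented sys.displayhook logic.
    """
    try:
        stream.write(text)
    except UnicodeEncodeError:
        data = text.encode(stream.encoding, "backslashreplace")
        if hasattr(stream, "buffer"):
            stream.flush()  # keep earlier text ahead of the raw bytes
            stream.buffer.write(data)
        else:
            # No byte layer available: write the escaped text instead.
            stream.write(data.decode(stream.encoding, "strict"))

raw = io.BytesIO()
ascii_stream = io.TextIOWrapper(raw, encoding="ascii",
                                errors="strict", newline="\n")
write_with_fallback(ascii_stream, pprint.pformat("\u20ac") + "\n")
ascii_stream.flush()
print(raw.getvalue())
```

Text that the stream can encode takes the first branch unchanged, so a strict stream only sees escaping when it would otherwise have raised.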
msg206200 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2013-12-14 19:57
The purpose of pprint.pprint() is to produce human-readable output. In this case some output is better than nothing. It isn't designed to be parseable by other programs, because sometimes it is even less accurate than the result of repr() (pprint() truncates long reprs and loses information for dict subclasses). Also, the result of pprint() can change from version to version (e.g. issue17150). The main source of non-ASCII characters is string reprs, and for them the backslashreplace error handler doesn't lose information. And pprint.pprint() is mainly used for screen output too.
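The point about string reprs not losing information can be checked with a small round trip. This is a sketch with an arbitrary sample string; the variable names are chosen for the example only.

```python
import ast

original = "price: 5\u20ac, \U0001f40d"

# repr() keeps printable non-ASCII characters as they are; encoding the repr
# with backslashreplace turns them into \uXXXX / \UXXXXXXXX escapes.
escaped = repr(original).encode("ascii", "backslashreplace").decode("ascii")

# Those escapes are valid string-literal syntax, so the value is recoverable.
restored = ast.literal_eval(escaped)

print(escaped)
print(restored == original)
```

This holds for str reprs because repr() already doubles literal backslashes; it is not a claim about the repr of arbitrary objects.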
msg239650 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-31 00:35
I agree with Serhiy that using a permissive error handler with pprint() is appropriate.

What is the reasoning behind the DecodeWriter case, where the original stream has an interesting encoding, but “buffer” is None? Are there any real-world cases like that? Your mock test case sets encoding="latin1" with no buffer, but that class will also write non-latin1 strings, so there is no problem.

Also I wonder if flushing the stream once or twice for each pprint() call is a wise move.

Another way to tackle this might be a function that translates the non-Latin-1 (or otherwise unencodable) characters, allowing the original write() method (or equivalent) to still be used. Here is a Python 2 and 3 compatible attempt: <https://bitbucket.org/Gfy/pyrescene/src/560cafe/rescene/utility.py#cl-426>. Python 3 only version: <https://github.com/vadmium/python-iview/commit/68b0559>. This function is originally used for printing descriptive comments to stdout (alongside other text where the “strict” error handler is appropriate). But I think it could be generally usable for pprint(), sys.displayhook(), etc. as well.
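A minimal sketch of the text-level idea, assuming a stateless encoding such as Latin-1: escape the unencodable characters in the string first, then hand the result to the stream's own write(). The helper name escape_unencodable is hypothetical and this is not the linked code.

```python
import io

def escape_unencodable(text, encoding, errors="backslashreplace"):
    """Return text with characters the encoding lacks replaced by escapes.

    Hypothetical helper; relies on encode/decode round-tripping cleanly,
    which is an assumption that holds for simple single-byte encodings.
    """
    return text.encode(encoding, errors).decode(encoding)

stream = io.TextIOWrapper(io.BytesIO(), encoding="latin-1", newline="\n")
# The original stream and its strict error handler are left untouched.
stream.write(escape_unencodable("caf\u00e9 costs 3\u20ac\n", stream.encoding))
stream.flush()
result = stream.buffer.getvalue()
print(result)
```

Because the stream itself is not re-wrapped, its buffering and newline handling stay as they were.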
msg239692 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2015-03-31 12:19
The linked code at https://github.com/vadmium/python-iview/commit/68b0559 seems strange to me:

    try:
        text.encode(encoding, textio.errors or "strict")
    except UnicodeEncodeError:
        text = text.encode(encoding, errors).decode(encoding)
    return text

is the same as:

    return text.encode(encoding, errors).decode(encoding)

because when there are no unencodable characters in text, the error handler will never be invoked.
msg239742 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-31 17:40
> What is the reasoning behind the DecodeWriter case, where the original stream has an interesting encoding, but “buffer” is None? Are there any real-world cases like that?

sys.stdout and sys.stderr in IDLE.
msg239756 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-31 22:26
Walter: the first line, which encodes with textio.errors, is meant to handle the case where the output stream already has its own permissive error handler set. But anyway I was just trying to point out that it might be better to do the backslash escaping at the text level, and write the escaped text string to the original stream.

Serhiy: thanks for pointing out IDLE’s stdout. It seems the encoding can be set to, say, ASCII by the locale, yet it still accepts non-ASCII text. But I guess that’s a separate issue.

I haven’t tested the patch, but reading it, I think there may be a couple of problems:

* Newline handling will be wrong, e.g. on Windows, where CRLF would be expected. I am not aware of a proper way to determine the newline translation mode of a text stream in arbitrary cases.
* The order of text written directly to stdout and via pprint would get messed up, because pprint would bypass the buffering in the original text stream.
* For encodings that store state, such as “utf-8-sig”, I think you may see an extra signature output, due to creating a new TextIOWrapper. With encoders whose state depends on the actual text, like the "hz" codec, multiplexing ASCII and GB2312 could be a more serious problem.

Issue 15216 is slightly related, and has a patch apparently allowing the encoding and error handler to be changed on a text stream. But I guess it is no good here because you need backwards compatibility with other non-TextIOWrapper streams.
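The “utf-8-sig” concern can be sketched as below. The class PipeLikeBytesIO is a hypothetical stand-in for a non-seekable byte stream such as a pipe; it is used on the assumption that TextIOWrapper can only adjust the encoder state when the underlying stream is seekable and reports a non-zero position. The sketch does not reproduce the patch itself.

```python
import codecs
import io

class PipeLikeBytesIO(io.BytesIO):
    """Hypothetical stand-in for a non-seekable byte stream such as a pipe."""
    def seekable(self):
        return False

raw = PipeLikeBytesIO()

# The "original" text stream writes some output first.
original = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="\n")
original.write("a\n")
original.flush()

# A second wrapper over the same byte stream starts with a fresh encoder,
# so it emits the signature again.
extra = io.TextIOWrapper(raw, encoding="utf-8-sig",
                         errors="backslashreplace", newline="\n")
extra.write("b\n")
extra.flush()

data = raw.getvalue()
signature_count = data.count(codecs.BOM_UTF8)
print(data, signature_count)

# Detach both wrappers so neither closes the shared buffer on cleanup.
original.detach()
extra.detach()
```

A signature in the middle of the output is what a consumer of the byte stream would see as stray U+FEFF characters.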
msg308600 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-12-19 01:00
$ LANG= ./python -c "import pprint; pprint.pprint('\u20ac')"

Thanks to PEP 538 and PEP 540, this command now works as expected in Python 3.7:

vstinner@apu$ LANG= python3.7 -c "import pprint; pprint.pprint('\u20ac')"
'€'

Do we still need the pprint_unencodable_2.patch workaround?
msg308618 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-12-19 07:02
Try with LANG=en_US. And even UTF-8 can fail.

History
Date	User	Action	Args
2022-04-11 14:57:51	admin	set	github: 63299
2017-12-19 07:02:52	serhiy.storchaka	set	messages: + msg308618
2017-12-19 01:00:15	vstinner	set	nosy: + vstinner messages: + msg308600
2015-03-31 22:26:42	martin.panter	set	messages: + msg239756
2015-03-31 17:40:43	serhiy.storchaka	set	messages: + msg239742
2015-03-31 12:19:59	doerwalter	set	messages: + msg239692
2015-03-31 00:35:31	martin.panter	set	nosy: + martin.panter messages: + msg239650
2013-12-14 19:57:59	serhiy.storchaka	set	messages: + msg206200
2013-12-14 11:49:29	doerwalter	set	messages: + msg206178
2013-12-11 16:14:48	fdrake	set	assignee: fdrake
2013-12-11 15:07:01	serhiy.storchaka	set	messages: + msg205907
2013-12-11 11:42:35	doerwalter	set	nosy: + doerwalter messages: + msg205902
2013-12-10 19:18:07	serhiy.storchaka	set	files: + pprint_unencodable_2.patch messages: + msg205846
2013-12-01 20:01:33	serhiy.storchaka	set	messages: + msg204952
2013-09-27 15:12:04	serhiy.storchaka	link	issue19103 dependencies
2013-09-27 11:05:10	serhiy.storchaka	create