Issue 5093: 2to3 with a pipe on non-ASCII script

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/49343

classification

Title:	2to3 with a pipe on non-ASCII script
Type:		Stage:
Components:	2to3 (2.x to 3.x conversion tool)	Versions:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	abbeyj, benjamin.peterson, collinwinter, mrabarnett, vstinner
Priority:	normal	Keywords:	patch

Created on 2009-01-29 00:29 by vstinner, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
2to3_write.patch	vstinner, 2009-01-29 00:37
output_encoding.patch	abbeyj, 2009-07-30 01:29

Messages (7)
msg80733	Author: STINNER Victor (vstinner) *	Date: 2009-01-29 00:29
If Python output is redirected to a pipe, sys.stdout encoding is ASCII. So "2to3 script.py | cat" will write the patch in ASCII. If the script contains a non-ASCII character, 2to3 fails with:

...
  File ".../lib2to3/refactor.py", line 238, in refactor_file
    self.processed_file(str(tree)[:-1], filename, write=write)
  File ".../lib2to3/refactor.py", line 342, in processed_file
    self.print_output(diff_texts(old_text, new_text, filename))
  File ".../main.py", line 48, in print_output
    print(line)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfb' in position 11: ordinal not in range(128)

Should we consider the input file and stdout as binary files?

Workaround: modify the files in place (-w option) but don't write the patch to stdout (no such option yet).

A project may contain scripts in ASCII, Latin-1 and UTF-8 (e.g. Python source code ;-)).
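
The failure mode reported above can be reproduced in isolation. The following is a minimal sketch, not code from lib2to3: it assumes a text stream whose encoding is ASCII (standing in for a piped sys.stdout), and the helper name try_print and the sample line are hypothetical.

```python
import io

# Stand-in for sys.stdout when its encoding is ASCII (an assumption that
# mirrors the report; the real stream would be sys.stdout behind a pipe).
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii")


def try_print(line, stream):
    """Print `line` to `stream`; return the UnicodeEncodeError if one occurs."""
    try:
        print(line, file=stream)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return None


# A diff line containing a non-ASCII character ('\xfb'), as in the traceback.
error = try_print("-print 'd\xfb'", ascii_stdout)
```

Here `error` holds the exception instead of letting it propagate, which makes it easy to confirm that the stream's encoding, not the diff content itself, is what triggers the failure.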
msg80734	Author: STINNER Victor (vstinner) *	Date: 2009-01-29 00:37
Example of workaround: don't write the patch if the option -w is used. I don't need the patch if I chose to modify the files in place.
msg91077	Author: James Abbatiello (abbeyj)	Date: 2009-07-30 01:29
The --no-diffs option was recently added, which looks like a good workaround.

Here's an attempt at a solution. If sys.stdout has an encoding set then use that, just as is being done now. If there is no encoding (implying "ascii") then use the encoding of the input file.
msg91130	Author: Matthew Barnett (mrabarnett) *	Date: 2009-07-31 12:33
I'd like to suggest that the output could/should be encoded in UTF-8.
msg91136	Author: James Abbatiello (abbeyj)	Date: 2009-07-31 18:02
In what case(s) do you propose the output to be encoded in UTF-8?

If output is to a terminal and that terminal is set to Latin-1 or cp437 or whatever, then outputting UTF-8 in that case will only show garbage characters to the user.

If output is to a file then using the encoding of the input file makes the most sense to me. Assume you have a simple program encoded in Latin-1 that prints out a string with some non-ASCII characters. The patch is printed in UTF-8 encoding and redirected to a file. The patch program has no idea what encodings are used and it will just compare the bytes in the original to the bytes in the patch file. These won't match since the encodings are different, and the patch will fail.

If the output is to a pipe then I'm not sure what the right thing is. It may be intended for display on the screen with something like `less` or it may not. I don't think there's a good solution for this.

So following the above logic the patch attached here does the following:
1) If output is to a terminal (sys.stdout.encoding is set) then use that encoding for output
2) Otherwise if an encoding was determined for the input file, use that encoding for output
3) If all else fails, use 'ascii' encoding. If the input contained non-ASCII characters and no encoding has been determined for the input then this will cause an exception to be raised. I think this can only happen when reading the input file from stdin. Perhaps that case needs to be looked at for how to detect the encoding of stdin.
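
The three-step fallback described above can be sketched as follows. This is an illustration only, not the contents of output_encoding.patch: the function names choose_output_encoding and encode_diff_line are hypothetical, and the input encoding is assumed to have been determined elsewhere.

```python
import sys


def choose_output_encoding(stdout_encoding, input_encoding):
    """Pick an encoding for diff output using the three-step fallback."""
    if stdout_encoding:   # 1) stdout reports an encoding (e.g. a terminal)
        return stdout_encoding
    if input_encoding:    # 2) otherwise reuse the input file's encoding
        return input_encoding
    return "ascii"        # 3) last resort; non-ASCII text will fail to encode


def encode_diff_line(line, stdout_encoding, input_encoding):
    """Encode one diff line to bytes with the chosen encoding."""
    encoding = choose_output_encoding(stdout_encoding, input_encoding)
    return (line + "\n").encode(encoding)


# Example: the current stdout encoding, with a Latin-1 input file assumed.
chosen = choose_output_encoding(sys.stdout.encoding, "latin-1")
```

With no stdout encoding and a Latin-1 input, the diff bytes match the bytes of the original file, which is what a byte-comparing patch program needs. With neither encoding known, non-ASCII text raises UnicodeEncodeError, as step 3 notes.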
msg91140	Author: Matthew Barnett (mrabarnett) *	Date: 2009-07-31 18:37
I was thinking that if you're converting a Python 2.x script to Python 3.x using 2to3 then also encoding the new script in UTF-8 might be a good idea.
msg96546	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2009-12-18 02:49
Fixed in r76871.

History
Date	User	Action	Args
2022-04-11 14:56:44	admin	set	github: 49343
2009-12-18 02:49:52	benjamin.peterson	set	status: open -> closed nosy: + benjamin.peterson, collinwinter messages: + msg96546 resolution: fixed
2009-07-31 18:37:14	mrabarnett	set	messages: + msg91140
2009-07-31 18:02:48	abbeyj	set	messages: + msg91136
2009-07-31 12:33:29	mrabarnett	set	nosy: + mrabarnett messages: + msg91130
2009-07-30 01:29:29	abbeyj	set	files: + output_encoding.patch nosy: + abbeyj messages: + msg91077
2009-01-29 00:37:46	vstinner	set	files: + 2to3_write.patch keywords: + patch messages: + msg80734
2009-01-29 00:29:57	vstinner	create