classification
Title: 2to3 with a pipe on non-ASCII script
Type:
Stage:
Components: 2to3 (2.x to 3.x conversion tool)
Versions:
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: abbeyj, benjamin.peterson, collinwinter, mrabarnett, vstinner
Priority: normal
Keywords: patch

Created on 2009-01-29 00:29 by vstinner, last changed 2009-12-18 02:49 by benjamin.peterson. This issue is now closed.

Files
File name Uploaded Description
2to3_write.patch vstinner, 2009-01-29 00:37
output_encoding.patch abbeyj, 2009-07-30 01:29
Messages (7)
msg80733 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-29 00:29
If Python output is redirected to a pipe, sys.stdout encoding is 
ASCII. So "2to3 script.py|cat" will write the patch in ASCII. If the 
script contains a non-ASCII character, 2to3 fails with:
  ...
  File ".../lib2to3/refactor.py", line 238, in refactor_file
    self.processed_file(str(tree)[:-1], filename, write=write)
  File ".../lib2to3/refactor.py", line 342, in processed_file
    self.print_output(diff_texts(old_text, new_text, filename))
  File ".../main.py", line 48, in print_output
    print(line)
UnicodeEncodeError: 'ascii' codec can't encode character '\xfb' in 
position 11: ordinal not in range(128)
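
The failure can be reproduced outside 2to3. The following is a minimal sketch, assuming an ASCII-encoded text stream behaves like sys.stdout does when it is redirected to a pipe. The names `ascii_stdout` and `line` are illustrative and do not come from lib2to3.

```python
import io

# Stand-in for sys.stdout when output goes to a pipe: a text stream
# whose encoding is ASCII (an assumption made for this illustration).
ascii_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

# A diff line containing the non-ASCII character U+00FB.
line = "+print('\xfb')"

try:
    print(line, file=ascii_stdout)
    ascii_stdout.flush()
    failed = False
    error_text = ""
except UnicodeEncodeError as exc:
    # The ASCII codec cannot represent U+00FB, so the write fails.
    failed = True
    error_text = str(exc)

print(failed, error_text)
```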

Should we consider the input file and stdout as binary files? 
Workaround: modify the files in place (-w option) but don't write the 
patch to stdout (no such option yet).

A project may contain scripts in ASCII, Latin-1 and UTF-8 (e.g. Python 
source code ;-)).
msg80734 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-29 00:37
Example of workaround: don't write the patch if the option -w is used. 
I don't need the patch if I chose to modify the files in place.
msg91077 - (view) Author: James Abbatiello (abbeyj) Date: 2009-07-30 01:29
The --no-diffs option was recently added, which looks like a good workaround.

Here's an attempt at a solution.  If sys.stdout has an encoding set then
use that, just as is being done now.  If there is no encoding (implying
"ascii") then use the encoding of the input file.
msg91130 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-07-31 12:33
I'd like to suggest that the output could/should be encoded in UTF-8.
msg91136 - (view) Author: James Abbatiello (abbeyj) Date: 2009-07-31 18:02
In what case(s) do you propose the output to be encoded in UTF-8?  If
output is to a terminal and that terminal is set to Latin-1 or cp437 or
whatever then outputting UTF-8 in that case will only show garbage
characters to the user.

If output is to a file then using the encoding of the input file makes
the most sense to me.  Assume you have a simple program encoded in
Latin-1 that prints out a string with some non-ASCII characters.  The
patch is printed in UTF-8 encoding and redirected to a file.  The patch
program has no idea what encodings are used and it will just compare the
bytes in the original to the bytes in the patch file.  These won't match
since the encodings are different and the patch will fail.
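
The byte mismatch described above can be shown with a short sketch. The sample string is illustrative only. It encodes the same source line in Latin-1 and in UTF-8 and compares the resulting bytes, which is effectively what a byte-oriented patch tool sees.

```python
# One line of source text containing U+00FB.
text = "print('\xfb')"

# The bytes as stored in a Latin-1 encoded original file.
latin1_bytes = text.encode("latin-1")

# The bytes as they would appear in a UTF-8 encoded patch.
utf8_bytes = text.encode("utf-8")

# The character is one byte in Latin-1 and two bytes in UTF-8,
# so a byte-for-byte comparison of the context lines fails.
same = latin1_bytes == utf8_bytes
print(latin1_bytes, utf8_bytes, same)
```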

If the output is to a pipe then I'm not sure what the right thing is. 
It may be intended for display on the screen with something like `less`
or it may not.  I don't think there's a good solution for this.

So following the above logic the patch attached here does the following:
1) If output is to a terminal (sys.stdout.encoding is set) then use that
encoding for output
2) Otherwise if an encoding was determined for the input file, use that
encoding for output
3) If all else fails, use 'ascii' encoding.  If the input contained
non-ASCII characters and no encoding has been determined for the input
then this will cause an exception to be raised.  I think this can only
happen when reading the input file from stdin.  Perhaps that case needs
to be looked at for how to detect the encoding of stdin.
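
The three-step fallback above can be sketched as follows. This is not the code from output_encoding.patch; the function name `choose_output_encoding` and its parameters are hypothetical and only illustrate the order of the checks.

```python
def choose_output_encoding(stdout_encoding, input_encoding):
    """Pick an encoding for the diff output (hypothetical helper).

    stdout_encoding: value of sys.stdout.encoding, or None if unset.
    input_encoding: encoding detected for the input file, or None.
    """
    # 1) Output has an encoding set (e.g. a terminal): use it.
    if stdout_encoding:
        return stdout_encoding
    # 2) Otherwise fall back to the encoding of the input file.
    if input_encoding:
        return input_encoding
    # 3) Last resort; non-ASCII output will raise UnicodeEncodeError.
    return "ascii"


terminal_case = choose_output_encoding("cp437", "latin-1")
file_case = choose_output_encoding(None, "latin-1")
stdin_case = choose_output_encoding(None, None)
print(terminal_case, file_case, stdin_case)
```

With these inputs the terminal case keeps the terminal's encoding, the redirected case follows the input file, and the case with no information at all falls back to 'ascii'.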
msg91140 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2009-07-31 18:37
I was thinking that if you're converting a Python 2.x script to Python
3.x using 2to3 then also encoding the new script in UTF-8 might be a
good idea.
msg96546 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2009-12-18 02:49
Fixed in r76871.
History
Date User Action Args
2009-12-18 02:49:52 benjamin.peterson set status: open -> closed; nosy: + benjamin.peterson, collinwinter; messages: + msg96546; resolution: fixed
2009-07-31 18:37:14 mrabarnett set messages: + msg91140
2009-07-31 18:02:48 abbeyj set messages: + msg91136
2009-07-31 12:33:29 mrabarnett set nosy: + mrabarnett; messages: + msg91130
2009-07-30 01:29:29 abbeyj set files: + output_encoding.patch; nosy: + abbeyj; messages: + msg91077
2009-01-29 00:37:46 vstinner set files: + 2to3_write.patch; keywords: + patch; messages: + msg80734
2009-01-29 00:29:57 vstinner create