classification
Title: compile() should not encode 'filename' (at least on Windows)
Type: behavior Stage: test needed
Components: Interpreter Core Versions: Python 3.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Drekin, haypo, terry.reedy
Priority: normal Keywords:

Created on 2012-01-11 03:46 by terry.reedy, last changed 2013-09-14 13:47 by haypo. This issue is now closed.

Messages (8)
msg151034 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2012-01-11 03:46
The 3.2.2 doc for compile() says "The filename argument should give the file from which the code was read; pass some recognizable value if it wasn’t read from a file ('<string>' is commonly used)."

I am not sure what 'recognizable' is supposed to mean, but as I understand it, it would be user-specific and any string containing a fake 'filename' should be accepted and attached to the output code object as the .co_filename attribute. (At least on Windows.)

In fact, compile() has a hidden restriction: it encodes 'filename' with the local filesystem encoding. It tosses the bytes result (at least on Windows) but lets a UnicodeEncodeError terminate compilation. The effect is to add an undocumented and spurious dependency to code that has nothing to do with real files or the local machine.
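A minimal sketch of the case described above, assuming a fake filename that contains non-ASCII characters. The name `fake_name` and its value are hypothetical. On an affected version the `except` branch would be taken; on a version without the restriction, the name is attached unchanged as a str.

```python
# Compile a snippet that never touched the filesystem, labeled with a
# fake, non-ASCII "filename" (hypothetical value, not a real file).
source = "x = 1 + 1"
fake_name = "<示例-snippet>"

try:
    code = compile(source, fake_name, "exec")
except UnicodeEncodeError as exc:
    # The hidden restriction reported in this issue: the name could not be
    # encoded with the local filesystem encoding.
    code = None
    print("compile() rejected the filename:", exc)
else:
    # No restriction: the fake name is stored as-is on the code object.
    print(type(code.co_filename).__name__, code.co_filename)
```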

In #10114, msg118845, Victor Stinner justified this with 
"co_filename attribute is used to display the traceback: Python opens the related file, read the source code line and display it."
If the filename is fake, it cannot do that. (Perhaps the doc should warn users to make sure that fake filenames do not match any possibly real filenames ;-). The traceback mechanism could ignore UnicodeEncodeErrors just as well as it now ignores IO(?)Errors when open('fakename') does not work.
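To illustrate the point about tracebacks, here is a small sketch, assuming a fake name (`<no-such-file>`, hypothetical) that matches no real file. The traceback is still produced and names the fake file; the failed source lookup is silently skipped.

```python
import traceback

# Code compiled under a fake filename that cannot be opened.
code = compile("1 / 0", "<no-such-file>", "exec")

try:
    exec(code, {})
except ZeroDivisionError:
    # Formatting the traceback does not fail even though the source
    # line cannot be read from any file.
    report = traceback.format_exc()

print(report)
```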

Victor continues "On Windows, co_filename is directly used because Windows accepts unicode for filenames." This is not true, in that on at least some Windows versions compile() tries to encode with the mbcs codec, which in turn uses the hidden local codepage. I believe that for most or all codepages, this will even raise errors for some valid Unicode filenames.
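A rough sketch of why a codepage-based encoding rejects valid names. The mbcs codec exists only on Windows and depends on the active codepage, so this uses cp1252 as a stand-in (an assumption, not necessarily the codepage involved in the report). The filename is hypothetical.

```python
# A valid Unicode filename that a single-byte codepage cannot represent.
name = "中文.py"

results = {}
for encoding in ("cp1252", "utf-8"):
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        results[encoding] = False   # this encoding cannot represent the name
    else:
        results[encoding] = True    # the name encodes without error

print(results)
```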

I do not know whether the stored .co_filename attribute type for *nix is str, as on Windows, or bytes. If the latter, the doc should say so.
If compile() continues to filter fake filenames, which I oppose, the doc should also say so and say what it does.

This issue came up on python-list when someone used a Chinese filename and mbcs rejected it.
msg151076 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2012-01-11 18:41
My supposition that compile() rejects some real file names appears correct: from python-list
ME: Is this a filename that could be an actual, valid filename on your system?
OP: Yes it is. open works on that file.
msg195954 - Author: Adam Bartoš (Drekin) Date: 2013-08-23 09:07
Hello. Will this be fixed? It's really annoying that you cannot pass a valid Unicode filename to compile(). I'm using a workaround: I just pass "<placeholder>" and then “update” the resulting code object recursively to set the correct co_filename. Afterwards the code object can be executed and produces correct tracebacks. (I'm using Windows.)
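A sketch of what such a workaround might look like. This is not the code used at the time: it assumes Python 3.8 or later, where code objects have a replace() method, and the helper name `relabel_code` and the target filename are hypothetical.

```python
import types

def relabel_code(code, filename):
    """Return a copy of `code` and of all nested code objects with co_filename replaced."""
    new_consts = tuple(
        relabel_code(const, filename) if isinstance(const, types.CodeType) else const
        for const in code.co_consts
    )
    # CodeType.replace() is available from Python 3.8 on.
    return code.replace(co_filename=filename, co_consts=new_consts)

source = "def f():\n    return 1 / 0\n"

# Compile under an ASCII placeholder, then attach the real (Unicode) name.
placeholder = compile(source, "<placeholder>", "exec")
relabeled = relabel_code(placeholder, "пример.py")  # hypothetical filename
```

The nested function bodies live in co_consts, which is why the update has to be recursive; changing only the top-level object would leave "<placeholder>" in tracebacks that pass through f().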

Fixing this will probably also fix http://bugs.python.org/issue17588 . It doesn't bother just me. See e.g. http://stackoverflow.com/questions/8798591/unicodeencodeerror-when-using-the-compile-function .

Thank you. Drekin
msg195983 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2013-08-23 15:56
Victor, do you have any opinion on this unicode filename issue?
msg196005 - Author: STINNER Victor (haypo) (Python committer) Date: 2013-08-23 19:02
> Victor, do you have any opinion on this unicode filename issue?

I closed issue #11619 in January 2013 because there was no user requesting the feature. I just reopened the issue because users now ask for it.
msg196247 - Author: STINNER Victor (haypo) (Python committer) Date: 2013-08-26 20:39
This issue has been fixed in issue #11619 by:

New changeset df2fdd42b375 by Victor Stinner in branch 'default':
Close #11619: The parser and the import machinery do not encode Unicode
http://hg.python.org/cpython/rev/df2fdd42b375

Thanks for the report!

(I don't plan to backport the fix to Python 3.3, it's a huge patch for a rare use case.)
msg197706 - Author: Adam Bartoš (Drekin) Date: 2013-09-14 12:18
Since this issue was fixed, shouldn't it be marked fixed here?
msg197709 - Author: STINNER Victor (haypo) (Python committer) Date: 2013-09-14 13:47
Closed.
History
Date | User | Action | Args
2013-09-14 13:47:09 | haypo | set | status: open -> closed; resolution: fixed; messages: + msg197709
2013-09-14 12:18:17 | Drekin | set | messages: + msg197706
2013-08-26 20:39:04 | haypo | set | messages: + msg196247; versions: + Python 3.4, - Python 3.2, Python 3.3
2013-08-23 19:02:03 | haypo | set | messages: + msg196005
2013-08-23 15:56:08 | terry.reedy | set | nosy: + haypo; messages: + msg195983
2013-08-23 09:07:05 | Drekin | set | nosy: + Drekin; messages: + msg195954
2012-01-11 18:41:30 | terry.reedy | set | messages: + msg151076
2012-01-11 03:46:43 | terry.reedy | create |