classification
Title: os.walk fails on undecodable filenames
Type: behavior
Stage: resolved
Components: Library (Lib), Unicode
Versions: Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, fhoech, haypo
Priority: normal
Keywords:

Created on 2014-11-13 13:16 by fhoech, last changed 2014-11-13 15:16 by fhoech. This issue is now closed.

Messages (7)
msg231110 - Author: Florian Höch (fhoech) * Date: 2014-11-13 13:16
If 'top' is a unicode directory name, os.listdir can still return non-unicode (byte string) filenames if they can't be decoded. The Python 2.x standard library version of os.walk does not handle this case, so join(top, name) fails on such filenames with a UnicodeDecodeError.
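
A minimal sketch of the failure mode, assuming a filename encoded as Latin-1 on a UTF-8 system (the name is an illustrative assumption, not taken from the report). In Python 2, joining a unicode path with a byte string implicitly decodes the byte string with the default ASCII codec; the explicit decode below stands in for that implicit step so the snippet behaves the same on Python 2 and 3.

```python
# Hypothetical undecodable name: "cafe" with an accented e, encoded as Latin-1.
# The trailing byte 0xE9 is valid neither as ASCII nor as UTF-8.
raw_name = b"caf\xe9"


def implicit_join_decode(name):
    """Mimic the ASCII decode Python 2 performs when mixing unicode and str."""
    return name.decode("ascii")


try:
    implicit_join_decode(raw_name)
    failed = False
except UnicodeDecodeError as exc:
    # This is the error os.walk surfaces from join(top, name) on Python 2.
    failed = True
    print("join would fail: %s" % exc)
```

Pure-ASCII byte names pass through this decode unchanged, which is why the problem only shows up for names with non-ASCII bytes.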
msg231111 - Author: STINNER Victor (haypo) * (Python committer) Date: 2014-11-13 13:23
What is your OS?
msg231112 - Author: Florian Höch (fhoech) * Date: 2014-11-13 13:30
This problem only affects Linux as far as I know (in my case I'm using Fedora 21 Beta).
msg231115 - Author: STINNER Victor (haypo) * (Python committer) Date: 2014-11-13 14:40
Your problem has two solutions.

1) Upgrade to Python 3, which handles your use case correctly (thanks to PEP 383 and its surrogateescape error handler)

2) Only process filenames as bytes, and encode/decode manually (so you can decide how to handle undecodable filenames)
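
Option 2 could look roughly like the sketch below. The helper name `walk_decoded`, the UTF-8 default, and the choice of the "replace" error handler are assumptions made for illustration; the point is only that passing a byte string to os.walk keeps every name as bytes, so the caller controls the decoding.

```python
import os


def walk_decoded(top, encoding="utf-8", errors="replace"):
    """Walk `top` as bytes and yield (raw_path, display_name) pairs.

    raw_path stays a byte string and is safe to pass back to open() or
    os.stat(); display_name is text and may be lossy when errors="replace".
    """
    if not isinstance(top, bytes):
        top = top.encode(encoding)
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            raw_path = os.path.join(dirpath, name)
            yield raw_path, name.decode(encoding, errors)


if __name__ == "__main__":
    for raw_path, display_name in walk_decoded("."):
        print(display_name)
```

Keeping the raw bytes next to the decoded name matters because a name decoded with "replace" cannot be encoded back to the original bytes.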
msg231117 - Author: Florian Höch (fhoech) * Date: 2014-11-13 14:50
1) Is not yet possible for me, unfortunately: some libraries I require are not yet available for Python 3 (but in the long run, this would be my preferred solution)

2) Would necessitate too many changes in a carefully crafted, unicode-only application. I think I'll just override os.listdir and filter out filenames that are not decodable, or override os.walk and do something equivalent.
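
The filtering idea above could be sketched as follows. Both helper names are hypothetical, and the walk is a simplified top-down variant rather than a drop-in replacement for os.walk (no topdown, onerror, or followlinks parameters). It assumes `top` is a unicode string, so that on Python 2 os.listdir returns unicode for decodable names and byte strings for the rest.

```python
import os

TEXT_TYPE = type(u"")  # `unicode` on Python 2, `str` on Python 3


def listdir_decodable(path):
    """Return only the entries that os.listdir() managed to decode to text."""
    return [name for name in os.listdir(path) if isinstance(name, TEXT_TYPE)]


def walk_decodable(top):
    """Simplified top-down walk that silently skips undecodable names."""
    dirs, files = [], []
    for name in listdir_decodable(top):
        full = os.path.join(top, name)
        (dirs if os.path.isdir(full) else files).append(name)
    yield top, dirs, files
    for name in dirs:
        full = os.path.join(top, name)
        # Like os.walk's default, do not descend into symlinked directories.
        if not os.path.islink(full):
            for entry in walk_decodable(full):
                yield entry
```

The trade-off is that skipped files and directories become invisible to the application, so this only suits cases where ignoring them is acceptable.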
msg231118 - Author: STINNER Victor (haypo) * (Python committer) Date: 2014-11-13 14:57
> 1) Is not yet possible for me, unfortunately: some libraries I require are not yet available for Python 3 (but in the long run, this would be my preferred solution)

I'm curious, which libraries?

Oh, I forgot to say that it's not possible to fix this issue in Python 2. Backporting PEP 383 to Python 2 would require deep changes in the Unicode machinery, starting with the UTF-8 codec. Currently, the UTF-8 encoder encodes surrogates, which violates the Unicode standard and makes it impossible to use this codec with the surrogateescape error handler.
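
For illustration, here is roughly how the PEP 383 round trip behaves on Python 3 (this snippet does not run on Python 2). The sample name is a hypothetical Latin-1 filename. An error handler only runs when the codec reports an error, so an encoder that silently accepts lone surrogates would never give surrogateescape a chance to restore the original byte; "surrogatepass" is used at the end only to show what such an encoder would emit.

```python
raw = b"caf\xe9"  # hypothetical Latin-1 name, invalid as UTF-8

# PEP 383: each undecodable byte 0xNN is mapped to the lone surrogate U+DCNN.
text = raw.decode("utf-8", "surrogateescape")
assert text == "caf\udce9"

# Encoding with the same handler restores the original bytes exactly.
assert text.encode("utf-8", "surrogateescape") == raw

# Python 3's strict UTF-8 encoder refuses lone surrogates, which is what
# lets the surrogateescape handler step in during encoding.
try:
    text.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True

# An encoder that accepts surrogates emits a three-byte sequence instead of
# the original single byte, so the round trip would be lost.
legacy_bytes = "\udce9".encode("utf-8", "surrogatepass")
```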
msg231120 - Author: Florian Höch (fhoech) * Date: 2014-11-13 15:16
> I'm curious, which libraries?

wxPython and wexpect (wexpect I could probably port myself, so the problem is mainly with wx)

> Oh, I forgot to say that it's not possible to fix this issue in Python 2. Backporting PEP 383 to Python 2 would require deep changes in the Unicode machinery, starting with the UTF-8 codec.

Ok, that's understandable of course.
History
Date                 User            Action  Args
2014-11-13 15:16:27  fhoech          set     messages: + msg231120
2014-11-13 15:15:54  r.david.murray  set     status: open -> closed; resolution: wont fix; stage: resolved
2014-11-13 14:57:11  haypo           set     messages: + msg231118
2014-11-13 14:50:07  fhoech          set     messages: + msg231117
2014-11-13 14:40:44  haypo           set     messages: + msg231115
2014-11-13 13:30:54  fhoech          set     messages: + msg231112
2014-11-13 13:23:11  haypo           set     nosy: + ezio.melotti, haypo; messages: + msg231111; components: + Unicode
2014-11-13 13:16:20  fhoech          create