Title: Ugly behavior of binary and unicode handling on reading unknown encoded files
Type: enhancement Stage: resolved
Components: IO Versions: Python 3.4
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, Sworddragon, r.david.murray
Priority: normal Keywords:

Created on 2013-06-22 14:54 by Sworddragon, last changed 2013-06-22 19:48 by Arfrever. This issue is now closed.

Messages (3)
msg191643 - Author: (Sworddragon) Date: 2013-06-22 14:54
Currently Python 3 has some problems handling files with an unknown encoding. In this example we have a file encoded as ISO-8859-1 with the content "ä" which we want to read. Let's see what Python 3 can currently do here:

1. We can simply open the file and try to read the content. The encoding will be set in my case automatically to UTF-8. But the read() operation will throw an exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data

2. Now let's look a little more into the arguments of open(): We will find an errors argument which might be useful:
2.1. "strict" is the default behavior which was already tested.
2.2. "ignore" will not throw any exception but silently drops every byte which can't be decoded. This would be problematic in many cases.
2.3. "replace" will replace every byte which can't be decoded with the replacement character U+FFFD, which will be problematic in many cases too.
2.4. "surrogateescape" can throw exceptions too: UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 0: surrogates not allowed
2.5. "xmlcharrefreplace" and "backslashreplace" are not used for reading.

3. Since trying to decode the file causes many problems, we can try to read the file as binary content. This will work in all cases but causes another problem: Any unicode string that must be concatenated with the content of the file must be converted to a binary string too (like b'some_unicode_content' or some_unicode_variable.encode()). The same applies to unicode strings that must be concatenated somewhere else with the newly converted unicode_to_binary variable, even if they don't touch the file content. This behavior can affect the maintainability in a bad way.
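
The behavior described in points 2.1 to 2.4 and the str/bytes mismatch from point 3 can be reproduced in memory without a file. The following is only a minimal sketch, assuming the single byte 0xe4 from the example above; all variable names are made up for illustration.

```python
# The ISO-8859-1 encoding of "ä" is the single byte 0xe4.
data = "ä".encode("iso-8859-1")

# 2.1 "strict": decoding as UTF-8 fails.
try:
    data.decode("utf-8")
    strict_reason = None
except UnicodeDecodeError as exc:
    strict_reason = exc.reason

# 2.2 "ignore": the undecodable byte is dropped.
ignored = data.decode("utf-8", errors="ignore")

# 2.3 "replace": the undecodable byte becomes U+FFFD.
replaced = data.decode("utf-8", errors="replace")

# 2.4 "surrogateescape": the byte becomes a lone surrogate.
escaped = data.decode("utf-8", errors="surrogateescape")

# Encoding it again only works with the same error handler ...
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# ... and fails with the default "strict" handler.
try:
    escaped.encode("utf-8")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# 3. Binary content can't be mixed with str without an explicit conversion.
try:
    "prefix: " + data
    mixing_failed = False
except TypeError:
    mixing_failed = True
mixed = "prefix: ".encode("utf-8") + data
```

The surrogateescape round trip keeps the original byte intact, but the resulting string can't be encoded with the default "strict" handler, which is what produces the UnicodeEncodeError quoted in 2.4.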

As you can see all current solutions of Python 3 have big disadvantages. If I'm overlooking something feel free to correct me. For now I have developed my own solution in Python which solves the problem: A function that autodetects the encoding of the file. Maybe there could also be a native way to do this in open(), or maybe another way could be found to solve this problem.
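
The autodetection function itself is not shown in this report. Purely as an illustration of the idea, and not as the actual implementation, a very simple variant could try a list of candidate encodings in order. The function name decode_with_fallback and the candidate list are assumptions made for this sketch.

```python
def decode_with_fallback(data, candidates=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes data.

    Hypothetical helper: this is trial decoding, not real detection.
    ISO-8859-1 accepts every byte sequence, so it only makes sense as
    the last candidate.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used_encoding = decode_with_fallback(b"\xe4")
```

Such a function can only say that the bytes are valid in an encoding, not that the encoding is the one the author of the file intended, so the result is still a guess.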
msg191653 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-06-22 18:43
In Python we have a saying that we follow most of the time: if you don't know, refuse the temptation to guess. So currently this is all working as designed: you have to know the encoding of the file you are trying to read as unicode.
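
For the file from the original report, knowing the encoding means passing it to open() explicitly. A minimal sketch, assuming a temporary file that holds the single byte 0xe4 (the temporary file only stands in for the real one):

```python
import os
import tempfile

# Create a stand-in for the ISO-8859-1 file containing "ä".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xe4")
    path = tmp.name

try:
    # Name the encoding instead of relying on the locale default.
    with open(path, encoding="iso-8859-1") as f:
        text = f.read()
finally:
    os.remove(path)
```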

Adding a 'guess' function that could be called explicitly is a possibility, but if we were to go that route we'd probably really want something general to guess the encoding of strings, such as (I think) ICU has. This larger topic is better suited to python-ideas, probably followed, if the response is positive, by a PEP.

So I'm closing this issue as rejected, but feel free to bring it up on python-ideas.  (Search for existing threads about it first, please.)
msg191668 - Author: Arfrever Frehtes Taifersar Arahesis (Arfrever) * (Python triager) Date: 2013-06-22 19:48
You can try to use charade or potentially another encoding detector.
Date User Action Args
2013-06-22 19:48:50 Arfrever set nosy: + Arfrever; messages: + msg191668
2013-06-22 18:43:25 r.david.murray set status: open -> closed; versions: + Python 3.4, - Python 3.3; nosy: + r.david.murray; messages: + msg191653; resolution: rejected; stage: resolved
2013-06-22 14:54:24 Sworddragon create