Message 191643 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Sworddragon
Recipients	Sworddragon
Date	2013-06-22.14:54:24
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1371912864.83.0.123376262998.issue18282@psf.upfronthosting.co.za>
In-reply-to

Content
Currently Python 3 has some problems of handling files with an unknown encoding. In this example we have a file encoded as ISO-8859-1 with the content "ä" which should be tried to be read. Lets see what Python 3 can currently do here: 1. We can simply open the file and try to read the content. The encoding will be set in my case automatically to UTF-8. But the read() operation will throw an exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data 2. Now lets look a little more into the arguments of open(): We will find an errors argument which could maybe be useful: 2.1. "strict" is the default behavior which was already tested. 2.2. "ignore" will not throw any exception but delete any character which can't be read. This would be problematic in many cases. 2.3. "replace" will replace any character which can't be read which will be problematic in many cases too. 2.4. "surrogateescape" can throw exceptions too: UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 0: surrogates not allowed 2.5. "xmlcharrefreplace" and "backslashreplace" are not used for reading. 3. Since trying to decode the file will make many problems we can try to read the file as binary content. This will work in all cases but causing another problem: Any unicode string that must be concatenated with the content of the file must be converted to a binary string too (like b'some_unicode_content' or some_unicode_variable.encode()). The same happens for unicode strings that must be concatenated somewhere else with the newly converted unicode_to_binary variable even if they doesn't touch the file content. This behavior can affect the maintainability in a bad way. As you can see all current solutions of Python 3 have big disadvantages. If I'm overlooking something feel free to correct me. Currently I have developed my own solution in Python which solved the problem: A function that autodetects the encoding of the file. Maybe there could also be a native way to do this on open() or maybe there could be another way found to solve this problem.

Currently Python 3 has some problems of handling files with an unknown encoding. In this example we have a file encoded as ISO-8859-1 with the content "ä" which should be tried to be read. Lets see what Python 3 can currently do here:

1. We can simply open the file and try to read the content. The encoding will be set in my case automatically to UTF-8. But the read() operation will throw an exception: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0: unexpected end of data

2. Now lets look a little more into the arguments of open(): We will find an errors argument which could maybe be useful:
2.1. "strict" is the default behavior which was already tested.
2.2. "ignore" will not throw any exception but delete any character which can't be read. This would be problematic in many cases.
2.3. "replace" will replace any character which can't be read which will be problematic in many cases too.
2.4. "surrogateescape" can throw exceptions too: UnicodeEncodeError: 'utf-8' codec can't encode character '\udce4' in position 0: surrogates not allowed
2.5. "xmlcharrefreplace" and "backslashreplace" are not used for reading.

3. Since trying to decode the file will make many problems we can try to read the file as binary content. This will work in all cases but causing another problem: Any unicode string that must be concatenated with the content of the file must be converted to a binary string too (like b'some_unicode_content' or some_unicode_variable.encode()). The same happens for unicode strings that must be concatenated somewhere else with the newly converted unicode_to_binary variable even if they doesn't touch the file content. This behavior can affect the maintainability in a bad way.

As you can see all current solutions of Python 3 have big disadvantages. If I'm overlooking something feel free to correct me. Currently I have developed my own solution in Python which solved the problem: A function that autodetects the encoding of the file. Maybe there could also be a native way to do this on open() or maybe there could be another way found to solve this problem.

History
Date	User	Action	Args
2013-06-22 14:54:24	Sworddragon	set	recipients: + Sworddragon
2013-06-22 14:54:24	Sworddragon	set	messageid: <1371912864.83.0.123376262998.issue18282@psf.upfronthosting.co.za>
2013-06-22 14:54:24	Sworddragon	link	issue18282 messages
2013-06-22 14:54:24	Sworddragon	create