classification
Title: utf8 codec readlines error after "\x85 "
Type: behavior Stage:
Components: Interpreter Core, IO, Unicode Versions: Python 2.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, jcope, pitrou
Priority: normal Keywords:

Created on 2010-08-13 19:27 by jcope, last changed 2010-08-13 20:54 by jcope. This issue is now closed.

Files
File name Uploaded Description
ErrorProof-utf8-x85.py jcope, 2010-08-13 19:27
Messages (5)
msg113818 - (view) Author: Joseph Copenhaver (jcope) Date: 2010-08-13 19:27
The IO readlines() facility incorrectly processes utf8 files for some unknown reason. Specifically, the call generates too many entries in the resulting lines array after a character sequence "\x85 blah", which gets cut as ("\x85 ","blah") according to the resulting array. My workaround for this issue is not elegant, especially since I need the newline characters:

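#BEGIN: WTF
# Read the whole file, then split on "\n" only so that "\x85" is left alone.
a_str_whole = fs_in.read()
fs_in.close()
a_str_lines = a_str_whole.split("\n")
# Put the newline back on every entry except the final piece.
for idx in range(len(a_str_lines) - 1):
    a_str_lines[idx] += "\n"
#END: WTF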

Attached is an example script that defines the problem clearly.
msg113821 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-13 19:37
U+0085 corresponds to a line terminator (*), and codecs.open() observes this convention.
Do note that the new io.open() (or the built-in open() in 3.x) only recognizes '\r' and '\n' as line separators.

In any case, changing this behaviour would break compatibility, therefore I'm rejecting the issue.

(*) http://en.wikipedia.org/wiki/Newline#Unicode
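
The difference described above can be reproduced in memory. This is only a minimal sketch: the sample text is made up, and it uses codecs.getreader() on a BytesIO object rather than codecs.open() on a real file, on the assumption that both go through the same StreamReader line splitting.

```python
import codecs
import io

# Made-up sample data. U+0085 (NEL) is encoded in UTF-8 as b"\xc2\x85".
data = u"first\u0085 blah\nsecond\n".encode("utf-8")

# codecs stream reader: decodes first, then splits on every Unicode
# line boundary, which includes U+0085.
codecs_lines = codecs.getreader("utf-8")(io.BytesIO(data)).readlines()

# io text layer: only '\n', '\r' and '\r\n' end a line.
io_lines = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8").readlines()

print(len(codecs_lines))  # 3 entries: the line is cut after U+0085
print(len(io_lines))      # 2 entries: U+0085 stays inside the first line
```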
msg113823 - (view) Author: Joseph Copenhaver (jcope) Date: 2010-08-13 20:13
I now recognize the issue was in regard to format problems and not python, but the area where this code will be used requires the use of the codecs module.
Is there any way to get the efficiency of codecs I/O readlines() chunking behavior and specify a list of characters to use? Can the file delimiter be changed in python as in perl?

Thanks for the quick response.
msg113827 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-08-13 20:29
> Is there any way to get the efficiency of codecs I/O readlines()
> chunking behavior and specify a list of characters to use? Can the
> file delimiter be changed in python as in perl?

No, but you can use readlines() from the standard open() function (which
will give you 8-bit strings), and then decode individual lines yourself.
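
A rough sketch of that suggestion follows. The helper name read_utf8_lines and the sample text are hypothetical, and the file is opened in binary mode explicitly so that readlines() returns byte strings on 3.x as well as 2.7.

```python
import os
import tempfile


def read_utf8_lines(path):
    # Hypothetical helper: binary readlines() splits on the newline byte
    # only, so the UTF-8 bytes of U+0085 never end a line. Each raw line
    # is decoded afterwards and keeps its trailing newline.
    with open(path, "rb") as f:
        return [raw.decode("utf-8") for raw in f.readlines()]


# Write made-up sample data to a temporary file and read it back.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(u"first\u0085 blah\nsecond\n".encode("utf-8"))
    lines = read_utf8_lines(path)
finally:
    os.remove(path)

print(len(lines))  # 2 entries: no split after U+0085
```

Decoding line by line works here because no UTF-8 multi-byte sequence contains the newline byte; the same approach would not be safe for encodings such as UTF-16.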
msg113829 - (view) Author: Joseph Copenhaver (jcope) Date: 2010-08-13 20:54
It is better, thanks.
History
Date User Action Args
2010-08-13 20:54:53 jcope set messages: + msg113829
2010-08-13 20:29:21 pitrou set messages: + msg113827
2010-08-13 20:13:07 jcope set messages: + msg113823
2010-08-13 19:37:49 pitrou set status: open -> closed; nosy: + pitrou; messages: + msg113821; resolution: rejected
2010-08-13 19:29:33 ezio.melotti set nosy: + ezio.melotti; components: - Regular Expressions
2010-08-13 19:27:02 jcope create