Message191758
I hope I am writing in the right place.
When using codecs.open with UTF-8 encoding, it seems the characters \x1c, \x1d, and \x1e are interpreted as end-of-line.
Example code:
>>> with open('unicodetest.txt', 'w') as f:
...     f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
...
>>> with open('unicodetest.txt', 'r') as f:
...     for i,l in enumerate(f):
...         print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe
The point here is that the file is read as a single line, as I would expect. With codecs.open and UTF-8 encoding, however, it is read as several lines:
>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
...     for i,l in enumerate(f):
...         print i, l
...
0 a\x1c
1 b\x1d
2 c\x1e
3 d\x1fe
The characters \x1c through \x1f are described as "Information Separator Four" through "One" (in that order). As far as I can see, they never mark line ends. Interestingly, \x1f is not interpreted as a line end, unlike the other three.
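A possible explanation, which is an assumption and not something established in this report, is that the codecs reader splits the decoded text with the unicode splitlines method. That method treats chr(28), chr(29) and chr(30) (\x1c, \x1d, \x1e) as line boundaries but not chr(31) (\x1f), while the byte-string version splits only on \n and \r. A minimal sketch comparing the two (the variable names are only for illustration):

```python
# The same content as in unicodetest.txt, as a unicode string.
text = u'a\x1cb\x1dc\x1ed\x1fe'

# Unicode splitlines: \x1c, \x1d and \x1e count as line boundaries,
# \x1f does not, so this gives four pieces.
unicode_lines = text.splitlines(True)

# Byte-string splitlines: only \n and \r count, so this gives one piece.
byte_lines = text.encode('utf-8').splitlines(True)

print(len(unicode_lines))
print(len(byte_lines))
```

The four pieces produced by the unicode split match the four lines shown in the codecs.open output above, which is why this looks like a plausible cause.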
As a side note, I tested and verified that io.open behaves correctly (although when reading large amounts of data it appears to be about 5 times slower than codecs.open):
>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
...     for i,l in enumerate(f):
...         print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe
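If neither codecs.open (extra line breaks) nor io.open (slower in my test) is acceptable, one possible workaround is to iterate over the file in binary mode and decode each line afterwards. This is only a sketch: iter_utf8_lines is a hypothetical helper, not part of any library, and it relies on the assumption that the encoding is UTF-8, where the byte for \n never occurs inside a multi-byte sequence. It splits on \n only and does no universal-newline translation.

```python
import os
import tempfile


def iter_utf8_lines(path):
    """Yield unicode lines, split only where the binary file object splits."""
    # Binary-mode iteration breaks lines on the \n byte only, so the
    # information separators stay inside the line.
    with open(path, 'rb') as f:
        for raw_line in f:
            yield raw_line.decode('utf-8')


# Small demonstration on a temporary file (hypothetical test data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\x1cb\x1dc\x1ed\x1fe\nsecond line\n')

demo_lines = list(iter_utf8_lines(demo_path))
os.remove(demo_path)

print(len(demo_lines))
```

I have not measured how this compares in speed to codecs.open or io.open.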
Date | User | Action | Args
2013-06-24 13:11:12 | wpk | set | recipients: + wpk, ezio.melotti
2013-06-24 13:11:12 | wpk | set | messageid: <1372079472.71.0.230881178227.issue18291@psf.upfronthosting.co.za>
2013-06-24 13:11:12 | wpk | link | issue18291 messages
2013-06-24 13:11:12 | wpk | create |