
Author wpk
Recipients ezio.melotti, wpk
Date 2013-06-24.13:11:12
Message-id <1372079472.71.0.230881178227.issue18291@psf.upfronthosting.co.za>
In-reply-to
Content
I hope I am writing in the right place.

When using codecs.open with UTF-8 encoding, the characters \x1c, \x1d, and \x1e (chr(28), chr(29), and chr(30)) appear to be interpreted as end-of-line markers.

Example code:

>>> with open('unicodetest.txt', 'w') as f:
...   f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
...
>>> with open('unicodetest.txt', 'r') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe

Here the file is read as a single line, as I would expect. With codecs.open and UTF-8 encoding, however, the same file is read as several lines:

>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1c
1 b\x1d
2 c\x1e
3 d\x1fe

The characters \x1c through \x1f (chr(28) through chr(31)) are described as "Information Separator Four" through "One" (in that order). As far as I can see, they never mark line ends. Interestingly, \x1f is not treated as a line end.
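
A minimal sketch of what may be going on, assuming the splitting comes from the line-boundary rules of text strings rather than from the UTF-8 decoding itself. It calls splitlines on the same characters (chr(28) through chr(31)) once as text and once as bytes. The variable names are only for illustration.

```python
# The same characters the file contains: chr(28) .. chr(31).
text = u'a\x1cb\x1dc\x1ed\x1fe'

# Text (unicode) strings use a wider set of line boundaries.
# keepends=True (passed positionally) keeps the separator on each piece.
text_parts = text.splitlines(True)

# Byte strings only split on \n, \r and \r\n.
byte_parts = text.encode('utf-8').splitlines(True)

print(text_parts)
print(byte_parts)
```

If the text version comes back as four pieces and the bytes version as one, that would match what I see with codecs.open versus plain open, including chr(31) not causing a split.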

As a side note, I tested and verified that io.open handles this correctly (although when reading large amounts of data it appears to be 5 times slower than codecs.open):

>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe
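
For anyone who wants to reproduce the comparison outside the interactive prompt, here is a self-contained sketch. It writes the test data to a temporary file and reads it back two ways. As an assumption on my part, it uses codecs.getreader('utf-8') on a binary file in place of codecs.open, on the understanding that codecs.open wraps the same stream reader. The temporary file and variable names are hypothetical.

```python
import codecs
import io
import os
import tempfile

# chr(28) .. chr(31) between letters, with no real newline anywhere.
payload = u'a\x1cb\x1dc\x1ed\x1fe'

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(payload)

    # io.open: iterate over lines of the decoded text file.
    with io.open(path, encoding='utf-8') as f:
        io_lines = list(f)

    # codecs stream reader over the raw bytes (assumed equivalent
    # to what codecs.open does for reading).
    with open(path, 'rb') as raw:
        codecs_lines = list(codecs.getreader('utf-8')(raw))
finally:
    os.remove(path)

print(len(io_lines), len(codecs_lines))
```

A difference between the two counts would show the inconsistency without relying on how the control characters are displayed on the terminal.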
History
Date User Action Args
2013-06-24 13:11:12  wpk  set  recipients: + wpk, ezio.melotti
2013-06-24 13:11:12  wpk  set  messageid: <1372079472.71.0.230881178227.issue18291@psf.upfronthosting.co.za>
2013-06-24 13:11:12  wpk  link  issue18291 messages
2013-06-24 13:11:12  wpk  create