Message 191758 - Python tracker


Author	wpk
Recipients	ezio.melotti, wpk
Date	2013-06-24.13:11:12
Message-id	<1372079472.71.0.230881178227.issue18291@psf.upfronthosting.co.za>
In-reply-to

Content

I hope I am writing in the right place.

When using codecs.open with UTF-8 encoding, it seems the characters \x1c, \x1d, and \x1e (chr(28), chr(29), and chr(30)) are interpreted as end-of-line.

Example code:

>>> with open('unicodetest.txt', 'w') as f:
...   f.write('a'+chr(28)+'b'+chr(29)+'c'+chr(30)+'d'+chr(31)+'e')
...
>>> with open('unicodetest.txt', 'r') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe

The point here is that the built-in open reads the file as one line, as I would expect. With codecs.open and UTF-8 encoding, however, the same file is read as several lines:

>>> import codecs
>>> with codecs.open('unicodetest.txt', 'r', 'UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1c
1 b\x1d
2 c\x1e
3 d\x1fe

The characters \x1c through \x1f are described as "Information Separator Four" through "One" (in that order). As far as I can see, they never mark line ends. Interestingly, \x1f is not treated as a line end, while the other three are.
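
One way to probe which of these four characters Python's own text routines treat as line boundaries is str.splitlines. The following is a minimal Python 3 sketch, not part of the original report. The names SEPARATORS and splits_line are made up for illustration, and whether codecs relies on the same routine internally is an assumption rather than something verified here.

```python
# Python 3 sketch: check which separators str.splitlines() splits on.
# SEPARATORS and splits_line are illustrative names, not library APIs.
SEPARATORS = {
    "\x1c": "Information Separator Four",
    "\x1d": "Information Separator Three",
    "\x1e": "Information Separator Two",
    "\x1f": "Information Separator One",
}


def splits_line(ch):
    """Return True if str.splitlines() treats ch as a line boundary."""
    return len(("a" + ch + "b").splitlines()) == 2


# Map each code point (as a hex string) to whether it splits a line.
results = {hex(ord(ch)): splits_line(ch) for ch in SEPARATORS}

for ch, name in SEPARATORS.items():
    print(hex(ord(ch)), name, splits_line(ch))
```

If the pattern matches the one reported above (three separators split, the fourth does not), that would suggest the two behaviours share a cause.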

As a side note, I tested and verified that io.open behaves correctly (although when reading large amounts of data it appears to be about 5 times slower than codecs.open):

>>> import io
>>> with io.open('unicodetest.txt', encoding='UTF-8') as f:
...   for i,l in enumerate(f):
...     print i, l
...
0 a\x1cb\x1dc\x1ed\x1fe
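
The comparison above can also be run as a single self-contained script. This is a hedged Python 3 sketch rather than the original Python 2 session: it writes the file to a temporary directory, and the helper names read_lines and compare are assumptions introduced only for this example. On recent Python versions codecs.open may emit a DeprecationWarning.

```python
# Python 3 sketch: compare line iteration of codecs.open and io.open
# on a file containing the four information separators.
import codecs
import io
import os
import tempfile

DATA = "a\x1cb\x1dc\x1ed\x1fe"


def read_lines(opener, path):
    """Open path with the given opener and return the lines it yields."""
    with opener(path, "r", encoding="utf-8") as f:
        return list(f)


def compare():
    """Write DATA to a temporary file and read it back both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "unicodetest.txt")
        with io.open(path, "w", encoding="utf-8") as f:
            f.write(DATA)
        codecs_lines = read_lines(codecs.open, path)
        io_lines = read_lines(io.open, path)
    return codecs_lines, io_lines


codecs_lines, io_lines = compare()
print("codecs.open:", [repr(line) for line in codecs_lines])
print("io.open:    ", [repr(line) for line in io_lines])
```

Printing repr(line) makes the control characters visible instead of sending them raw to the terminal.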

History
Date	User	Action	Args
2013-06-24 13:11:12	wpk	set	recipients: + wpk, ezio.melotti
2013-06-24 13:11:12	wpk	set	messageid: <1372079472.71.0.230881178227.issue18291@psf.upfronthosting.co.za>
2013-06-24 13:11:12	wpk	link	issue18291 messages
2013-06-24 13:11:12	wpk	create