Message 97300 - Python tracker


Author	lemburg
Recipients	flox, lemburg
Date	2010-01-06.09:14:08
SpamBayes Score	6.16307e-12
Marked as misclassified	No
Message-id	<4B44545E.6090805@egenix.com>
In-reply-to	<1262767606.8.0.225636949209.issue7643@psf.upfronthosting.co.za>

Content
Florent Xicluna wrote:
> 
> New submission from Florent Xicluna <laxyf@yahoo.fr>:
> 
> Bytes objects and Unicode objects do not agree on ASCII linebreaks.
> 
> ## Python 2
> 
> for s in '\x0a\x0d\x1c\x1d\x1e':
>   print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
> 
> # [u'a\n', u'b'] ['a\n', 'b']
> # [u'a\r', u'b'] ['a\r', 'b']
> # [u'a\x1c', u'b'] ['a\x1cb']
> # [u'a\x1d', u'b'] ['a\x1db']
> # [u'a\x1e', u'b'] ['a\x1eb']
> 
> 
> ## Python 3
> 
> for s in '\x0a\x0d\x1c\x1d\x1e':
>   print('a{}b'.format(s).splitlines(1),
>         bytes('a{}b'.format(s), 'utf-8').splitlines(1))
> 
> ['a\n', 'b'] [b'a\n', b'b']
> ['a\r', 'b'] [b'a\r', b'b']
> ['a\x1c', 'b'] [b'a\x1cb']
> ['a\x1d', 'b'] [b'a\x1db']
> ['a\x1e', 'b'] [b'a\x1eb']

Unicode defines more line break characters than ASCII. ASCII has
only a single line break character, \n, but the conventions \r
and \r\n are also used to mean "start a new line, go to
position 1".

See e.g. http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters

The three extra code points that Unicode treats as line breaks
here (\x1c, \x1d and \x1e) are the ASCII file, group and record
separators, which are not in common use.
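
For readers who want to check the behaviour themselves, here is a minimal Python 3 sketch of the comparison from the report. It is only an illustration: the names CANDIDATES, split_both and disagreements are made up for this example and do not appear in the original submission.

```python
# Sketch only: compare str.splitlines() and bytes.splitlines() on the five
# characters from the report. All names below are illustrative, not from the issue.
CANDIDATES = "\x0a\x0d\x1c\x1d\x1e"


def split_both(ch):
    """Return (str_result, bytes_result) of splitlines(True) for 'a<ch>b'."""
    text = "a{}b".format(ch)
    # All candidates are in the ASCII range, so encoding to ASCII is safe here.
    return text.splitlines(True), text.encode("ascii").splitlines(True)


def disagreements(chars=CANDIDATES):
    """Return the characters that str splits on but bytes does not."""
    found = []
    for ch in chars:
        as_str, as_bytes = split_both(ch)
        if len(as_str) != len(as_bytes):
            found.append(ch)
    return found


for ch in CANDIDATES:
    print(repr(ch), *split_both(ch))

# The \r\n convention counts as a single break for both types.
print("a\r\nb".splitlines(True), b"a\r\nb".splitlines(True))
```

On Python 3, disagreements() should report only \x1c, \x1d and \x1e, which matches the output quoted in the submission: str follows the Unicode notion of a line boundary, while bytes recognises only \n, \r and \r\n.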

History
Date	User	Action	Args
2010-01-06 09:14:10	lemburg	set	recipients: + lemburg, flox
2010-01-06 09:14:08	lemburg	link	issue7643 messages
2010-01-06 09:14:08	lemburg	create