What is a Unicode line break character? #51892

florentx · 2010-01-06T08:46:45Z

BPO	7643
Nosy	@malemburg, @amauryfa, @florentx
Files	issue7643_use_LineBreak_v2.diff: Patch, apply to 2.x

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/florentx'
closed_at = <Date 2010-03-30.20:21:44.868>
created_at = <Date 2010-01-06.08:46:45.401>
labels = ['interpreter-core', 'type-bug', 'expert-unicode']
title = 'What is a Unicode line break character?'
updated_at = <Date 2010-03-30.20:21:44.866>
user = 'https://github.com/florentx'

bugs.python.org fields:

activity = <Date 2010-03-30.20:21:44.866>
actor = 'flox'
assignee = 'flox'
closed = True
closed_date = <Date 2010-03-30.20:21:44.868>
closer = 'flox'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2010-01-06.08:46:45.401>
creator = 'flox'
dependencies = []
files = ['16577']
hgrepos = []
issue_num = 7643
keywords = ['patch']
message_count = 19.0
messages = ['97299', '97300', '97333', '97407', '97408', '97410', '97438', '97440', '97483', '97502', '97531', '98485', '98486', '101294', '101306', '101494', '101945', '101948', '101955']
nosy_count = 3.0
nosy_names = ['lemburg', 'amaury.forgeotdarc', 'flox']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue7643'
versions = ['Python 2.7', 'Python 3.2']

florentx · 2010-01-06T08:46:45Z

Bytes objects and Unicode objects do not agree on ASCII linebreaks.

## Python 2

for s in '\x0a\x0d\x1c\x1d\x1e':
  print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)

# [u'a\n', u'b'] ['a\n', 'b']
# [u'a\r', u'b'] ['a\r', 'b']
# [u'a\x1c', u'b'] ['a\x1cb']
# [u'a\x1d', u'b'] ['a\x1db']
# [u'a\x1e', u'b'] ['a\x1eb']

## Python 3

for s in '\x0a\x0d\x1c\x1d\x1e':
  print('a{}b'.format(s).splitlines(1),
        bytes('a{}b'.format(s), 'utf-8').splitlines(1))

['a\n', 'b'] [b'a\n', b'b']
['a\r', 'b'] [b'a\r', b'b']
['a\x1c', 'b'] [b'a\x1cb']
['a\x1d', 'b'] [b'a\x1db']
['a\x1e', 'b'] [b'a\x1eb']

malemburg · 2010-01-06T09:14:08Z

Florent Xicluna wrote:

New submission from Florent Xicluna <laxyf@yahoo.fr>:

Bytes objects and Unicode objects do not agree on ASCII linebreaks.

Python 2

for s in '\x0a\x0d\x1c\x1d\x1e':
  print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)

[u'a\n', u'b'] ['a\n', 'b']

[u'a\r', u'b'] ['a\r', 'b']

[u'a\x1c', u'b'] ['a\x1cb']

[u'a\x1d', u'b'] ['a\x1db']

[u'a\x1e', u'b'] ['a\x1eb']

Python 3

for s in '\x0a\x0d\x1c\x1d\x1e':
  print('a{}b'.format(s).splitlines(1),
        bytes('a{}b'.format(s), 'utf-8').splitlines(1))

['a\n', 'b'] [b'a\n', b'b']
['a\r', 'b'] [b'a\r', b'b']
['a\x1c', 'b'] [b'a\x1cb']
['a\x1d', 'b'] [b'a\x1db']
['a\x1e', 'b'] [b'a\x1eb']

Unicode defines more line break characters than ASCII, which
has only a single line break character, \n, but also uses the
conventions \r and \r\n to mean "start a new line,
go to position 1".

See e.g. http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters

The three extra code points Unicode defines for line breaks are
group separators that are not in common use.

voidspace · 2010-01-07T00:03:17Z

'\x85' when decoded using latin-1 is just transcoded to u'\x85' which is treated as the NEL (a C1 control code equivalent to end of line). This changes iteration over the file when you decode and actually broke our csv parsing code when we got some latin-1 encoded data with \x85 in it from our customer.
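The effect described above can be reproduced with a minimal Python 3 sketch. The sample bytes are made up for illustration; only the built-in `splitlines()`, `decode()` and `split()` methods are used.

```python
# Hypothetical latin-1 encoded record containing byte 0x85 in the middle of a line.
raw = b"first\x85second\nthird"

# bytes.splitlines() only knows the ASCII conventions \n, \r and \r\n.
byte_lines = raw.splitlines()

# After decoding, 0x85 becomes U+0085 (NEL), which str.splitlines() treats as a line break.
text_lines = raw.decode("latin-1").splitlines()

# Splitting on "\n" explicitly keeps the NEL inside the first line.
newline_only = raw.decode("latin-1").split("\n")

print(byte_lines)
print(text_lines)
print(newline_only)
```

Code that has to keep the byte-level line structure after decoding may prefer the explicit `split("\n")` form, at the cost of not handling `\r\n` or `\r`.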

florentx · 2010-01-08T10:32:05Z

Some technical background.

== Unicode ==

According to the Unicode Standard Annex #9, a character with
bidirectional class B is a "Paragraph Separator". And “Because a
Paragraph Separator breaks lines, there will be at most one per line,
at the end of that line.”

As a consequence, there are three reasons to identify a character as a
linebreak:

General Category Zl "Line Separator"
General Category Zp "Paragraph Separator"
Bidirectional Class B "Paragraph Separator"

There are 8 linebreaks in the current Unicode Database (5.2):
------------------------------------------------------------------------
000A  LF   LINE FEED                    Cc  B
000D  CR   CARRIAGE RETURN              Cc  B
001C  FS   INFORMATION SEPARATOR FOUR   Cc  B   (UCD 3.1 FILE SEPARATOR)
001D  GS   INFORMATION SEPARATOR THREE  Cc  B   (UCD 3.1 GROUP SEPARATOR)
001E  RS   INFORMATION SEPARATOR TWO    Cc  B   (UCD 3.1 RECORD SEPARATOR)
0085  NEL  NEXT LINE                    Cc  B   (C1 Control Code)
2028  LS   LINE SEPARATOR               Zl  WS  (Unicode)
2029  PS   PARAGRAPH SEPARATOR          Zp  B   (Unicode)
------------------------------------------------------------------------
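The three criteria listed above can be checked against the Unicode database bundled with the running interpreter. This is a rough Python 3 sketch; the helper name `is_linebreak_by_ucd` is made up for illustration and is not the C implementation. The bundled database is newer than UCD 5.2, so the result is only expected, not guaranteed, to match the table.

```python
import sys
import unicodedata


def is_linebreak_by_ucd(ch):
    """Hypothetical helper: general category Zl or Zp, or bidirectional class B."""
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")


# Scan every code point; this takes a moment but needs no external data files.
linebreaks = [cp for cp in range(sys.maxunicode + 1)
              if is_linebreak_by_ucd(chr(cp))]

print(["U+%04X" % cp for cp in linebreaks])
```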

== ASCII ==

The standard ASCII control codes (C0) are in the range 00-1F.
This limits the list to LF, CR, FS, GS and RS.
The last three are not considered line breaks in ASCII:
“The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
structure data, usually on a tape, in order to simulate punched cards. End of
medium (EM) warns that the tape (or whatever) is ending. While many systems use
CR/LF and TAB for structuring data, it is possible to encounter the separator
control characters in data that needs to be structured. The separator control
characters are not overloaded; there is no general use of them except to
separate data into structured groupings. Their numeric values are contiguous
with the space character, which can be considered a member of the group, as a
word separator.”
(Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)

In conclusion, it may be better to keep things unchanged.
We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

References:

The Unicode Character Database (UCD): http://www.unicode.org/ucd/
UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
C0 and C1 Control Codes:
http://en.wikipedia.org/wiki/C0_and_C1_control_codes

voidspace · 2010-01-08T10:33:52Z

Documenting the characters that splitlines treats as newlines for Unicode should definitely be done.

florentx · 2010-01-08T11:42:41Z

It's confusing.

There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
BK, CR, LF, NL

And the resulting list is different:
                                CAT  BIDI  BRK
------------------------------------------------------------------------
000A  LF   LINE FEED            Cc   B     LF
000B  VT   LINE TABULATION      Cc   S     BK   (since Unicode 5.0)
000C  FF   FORM FEED            Cc   WS    BK
000D  CR   CARRIAGE RETURN      Cc   B     CR
0085  NEL  NEXT LINE            Cc   B     NL   (C1 Control Code)
2028  LS   LINE SEPARATOR       Zl   WS    BK
2029  PS   PARAGRAPH SEPARATOR  Zp   B     BK
------------------------------------------------------------------------

Differences:

VT and FF are mandatory breaks (even if “implementations are not
required to support the VT character”)
FS, GS, RS are combining marks (CM): “Prohibit a line break between
the character and the preceding character”

According to this Annex, the current splitlines() implementation violates the Unicode standard.

References:

Unicode Standard Annex #14 - Line Breaking Algorithm
http://www.unicode.org/reports/tr14/
UCD LineBreak.txt
http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

malemburg · 2010-01-08T20:18:21Z

Florent Xicluna wrote:

Florent Xicluna <laxyf@yahoo.fr> added the comment:

Some technical background.

== Unicode ==

According to the Unicode Standard Annex #9, a character with
bidirectional class B is a "Paragraph Separator". And “Because a
Paragraph Separator breaks lines, there will be at most one per line,
at the end of that line.”

As a consequence, there are three reasons to identify a character as a
linebreak:

General Category Zl "Line Separator"

General Category Zp "Paragraph Separator"

Bidirectional Class B "Paragraph Separator"

This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).

There are 8 linebreaks in the current Unicode Database (5.2):
------------------------------------------------------------------------
000A  LF   LINE FEED                    Cc  B
000D  CR   CARRIAGE RETURN              Cc  B
001C  FS   INFORMATION SEPARATOR FOUR   Cc  B   (UCD 3.1 FILE SEPARATOR)
001D  GS   INFORMATION SEPARATOR THREE  Cc  B   (UCD 3.1 GROUP SEPARATOR)
001E  RS   INFORMATION SEPARATOR TWO    Cc  B   (UCD 3.1 RECORD SEPARATOR)
0085  NEL  NEXT LINE                    Cc  B   (C1 Control Code)
2028  LS   LINE SEPARATOR               Zl  WS  (Unicode)
2029  PS   PARAGRAPH SEPARATOR          Zp  B   (Unicode)
------------------------------------------------------------------------

And that's the list we're currently using.

== ASCII ==

The Standard ASCII control codes (C0) are in the range 00-1F.
It limits the list to LF, CR, FS, GS, RS.
Regarding the last three, they are not considered as linebreaks:
“The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
structure data, usually on a tape, in order to simulate punched cards. End of
medium (EM) warns that the tape (or whatever) is ending. While many systems use
CR/LF and TAB for structuring data, it is possible to encounter the separator
control characters in data that needs to be structured. The separator control
characters are not overloaded; there is no general use of them except to
separate data into structured groupings. Their numeric values are contiguous
with the space character, which can be considered a member of the group, as a
word separator.”
(Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)

In conclusion, it may be better to keep things unchanged.

Agreed.

We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

For ASCII we should make the list of characters explicit.
For Unicode, we should mention the above definition and give
the table as example list (the Unicode database may add more
such characters in the future).

References:

The Unicode Character Database (UCD): http://www.unicode.org/ucd/

UCD Property Values: http://unicode.org/reports/tr44/#Property_Values

The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/

C0 and C1 Control Codes:
http://en.wikipedia.org/wiki/C0_and_C1_control_codes

malemburg · 2010-01-08T21:08:19Z

Florent Xicluna wrote:

Florent Xicluna <laxyf@yahoo.fr> added the comment:

It's confusing.

There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
BK, CR, LF, NL

Note that a line breaking algorithm is different from
a line splitting algorithm. The latter is used to separate lines at
pre-defined positions in the text; the former is used to format
a piece of text to fit e.g. into a certain width of available
character positions.

.splitlines() implements a line splitting algorithm, not a line
breaking one.
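The distinction can be illustrated with a small Python 3 sketch. `textwrap.wrap()` stands in here for "fit text into a given width"; it is a simple whitespace-based wrapper and not an implementation of UAX #14. The sample text and width are arbitrary.

```python
import textwrap

text = "alpha beta\ngamma delta epsilon"

# Line splitting: cut only at the break characters already present in the text.
split = text.splitlines()

# Line breaking (wrapping): choose new break positions so each line fits the width.
wrapped = textwrap.wrap(text, width=12)

print(split)
print(wrapped)
```

The first result depends only on which characters count as line boundaries, which is the question this issue is about; the second depends on the available width.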

And the resulting list is different:
                                CAT  BIDI  BRK
------------------------------------------------------------------------
000A  LF   LINE FEED            Cc   B     LF
000B  VT   LINE TABULATION      Cc   S     BK   (since Unicode 5.0)
000C  FF   FORM FEED            Cc   WS    BK
000D  CR   CARRIAGE RETURN      Cc   B     CR
0085  NEL  NEXT LINE            Cc   B     NL   (C1 Control Code)
2028  LS   LINE SEPARATOR       Zl   WS    BK
2029  PS   PARAGRAPH SEPARATOR  Zp   B     BK
------------------------------------------------------------------------

Differences:

VT and FF are mandatory breaks (even if “implementations are not
required to support the VT character”)

FS, GS, RS are combining marks (CM): “Prohibit a line break between
the character and the preceding character”

According to this Annex, the current splitlines() implementation violates the Unicode standard.

It appears so and I guess that's an oversight on my part when
writing the code: in Unicode 2.1 (the version I started with),
FF was marked as "B", later on Unicode 3.0 was published and
the new LineBreak.txt file was added to the standard. FF was
changed to "WS" and instead marked as "BK" in that new LineBreak.txt
file.

Since we only used the main UnicodeData.txt file as basis for
the type database, the "FF" code point dropped out of the
line break code point set.

I guess we'll have to add FF and VT to the generator makeunicodedata.py
to remedy this.

References:

Unicode Standard Annex #14 - Line Breaking Algorithm
http://www.unicode.org/reports/tr14/

UCD LineBreak.txt
http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

Thanks,

Marc-Andre Lemburg
eGenix.com


florentx · 2010-01-10T00:45:26Z

Here is a draft of the patch to do what Marc-Andre proposed in msg97440 (add VT and FF).
Additionally, I upgraded the UCD from 5.1 to 5.2.

The implementation uses field 16 as defined in the "py3k" implementation of "makeunicodedata.py". It should minimize differences between the Py2 and Py3 implementations.

Documentation and tests are missing.
I can provide a "diff.gz" containing "Modules/unicodedata_db.h", "Modules/unicodename_db.h" and "Objects/unicodetype_db.h", if needed.

- /* Returns 1 for Unicode characters having the category 'Zl',
-  * 'Zp' or type 'B', 0 otherwise.
+ /* Returns 1 for Unicode characters having the line break
+  * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
+  * type 'B', 0 otherwise.
   */

Note: the "remove_deprecation" patch should be applied first to remove the "-3" warnings.
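The rule stated in the updated C comment can be approximated in Python 3. This is a sketch only: the `unicodedata` module does not expose the UAX #14 line break property, so the mandatory-break set below is copied by hand from the table earlier in this thread, and the names `MANDATORY_BREAKS` and `is_linebreak` are made up for illustration.

```python
import unicodedata

# Code points with line break property BK, CR, LF or NL, hand-copied from the
# table in this thread (assumption: not read from LineBreak.txt at run time).
MANDATORY_BREAKS = frozenset("\n\x0b\x0c\r\x85\u2028\u2029")


def is_linebreak(ch):
    """Hypothetical Python rendering of the rule in the patched comment."""
    return ch in MANDATORY_BREAKS or unicodedata.bidirectional(ch) == "B"


# Compare the sketch with what str.splitlines() does for a few control characters.
for ch in "\x0b\x0c\x1c\x1d\x1e\x1f":
    splits = len("a{}b".format(ch).splitlines()) == 2
    print(repr(ch), is_linebreak(ch), splits)
```

Under this rule VT and FF become line breaks, FS, GS and RS remain line breaks through their bidirectional class, and US (U+001F) stays excluded.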

florentx · 2010-01-10T10:28:23Z

I don't know what to do about this:

FS, GS, RS are combining marks (CM): “Prohibit a line break between
the character and the preceding character”

I know they are not commonly used. So we can keep them as line breaks.
But if we comply strictly with UAX 14 we do not consider them as line breaks.

malemburg · 2010-01-10T18:05:00Z

Florent Xicluna wrote:

Florent Xicluna <laxyf@yahoo.fr> added the comment:

I don't know what to do about this:

> - FS, GS, RS are combining marks (CM): “Prohibit a line break between
> the character and the preceding character”

I know they are not commonly used. So we can keep them as line breaks.
But if we comply strictly with UAX 14 we do not consider them as line breaks.

Right. The only update we'd have to do is add FF and VT.

I am a little worried about the possible breakage this may cause,
though. E.g. if you look at a file with FFs in Emacs, the FFs don't
show up as line breaks. FFs in CSV files are currently also not regarded
as line breaks and thus don't need to be placed in quotes.

VTs are probably a non-issue, since they are not in common use.

ChrisCarter · 2010-01-29T00:15:43Z

Then I must ask, why did the string attribute behave differently? I added it to allow for that, and the behavior seems inconsistent.

ChrisCarter · 2010-01-29T00:16:20Z

My bad, wrong bug.

florentx · 2010-03-19T00:30:58Z

Cleanup committed as r78982

Patch for LineBreak.txt updated after UCD upgrade to 5.2.
See details: http://bugs.python.org/issue7643#msg97483

Tests added to test_unicodedata.

Backward compatibility concern:

it adds VT u'\x0b' and FF u'\x0c' as line breaks.

The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).
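The compatibility concern can be seen by comparing two ways of reading "lines" from the same text in Python 3. The sample data is hypothetical; the sketch assumes an interpreter that already includes this change.

```python
import io

# Hypothetical text with a form feed (FF, \x0c) inside the first line.
data = "page one\x0cpage two\nlast line\n"

# File-like iteration only recognises newline conventions, so FF stays in the line.
by_newline = io.StringIO(data).readlines()

# str.splitlines() consults the Unicode line break set, so FF ends a line.
by_splitlines = data.splitlines(True)

print(by_newline)
print(by_splitlines)
```

Code that uses `splitlines()` on data containing FF or VT would see more lines after the change than before, which is why the thread suggests a clear warning in Misc/NEWS.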

ChrisCarter · 2010-03-19T05:01:52Z

unwatched

malemburg · 2010-03-22T11:56:00Z

Florent Xicluna wrote:

Backward compatibility concern:

it adds VT u'\x0b' and FF u'\x0c' as line breaks.

The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).

I think we should correct this bug together with a clear warning in
the Misc/NEWS file.

amauryfa · 2010-03-30T16:45:26Z

Which functions are affected by this change?
Py_UNICODE_ISLINEBREAK()? unicode.splitlines()?

florentx · 2010-03-30T17:05:56Z

Committed to trunk: r79494 and r79496.

Afaict, it changes Py_UNICODE_ISLINEBREAK, _PyUnicode_IsLinebreak and the Unicode functions which depend on it (splitlines(), _sre module).

florentx · 2010-03-30T20:21:45Z

Ported to 3.x with r79506

florentx mannequin added the interpreter-core and type-bug labels Jan 6, 2010

florentx mannequin added the topic-unicode label Jan 10, 2010

florentx mannequin changed the title ~~What is an ASCII linebreak?~~ What is a Unicode line break character? Jan 10, 2010

amauryfa assigned florentx Mar 30, 2010

florentx mannequin closed this as completed Mar 30, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022
