classification
Title: What is a Unicode line break character?
Type: behavior
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 3.2, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: flox
Nosy List: amaury.forgeotdarc, flox, lemburg
Priority: normal
Keywords: patch

Created on 2010-01-06 08:46 by flox, last changed 2010-03-30 20:21 by flox. This issue is now closed.

Files
File name Uploaded Description
issue7643_use_LineBreak_v2.diff flox, 2010-03-19 00:30 Patch, apply to 2.x
Messages (19)
msg97299 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-06 08:46
Bytes objects and Unicode objects do not agree on ASCII linebreaks.

## Python 2

for s in '\x0a\x0d\x1c\x1d\x1e':
  print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)

# [u'a\n', u'b'] ['a\n', 'b']
# [u'a\r', u'b'] ['a\r', 'b']
# [u'a\x1c', u'b'] ['a\x1cb']
# [u'a\x1d', u'b'] ['a\x1db']
# [u'a\x1e', u'b'] ['a\x1eb']


## Python 3

for s in '\x0a\x0d\x1c\x1d\x1e':
  print('a{}b'.format(s).splitlines(1),
        bytes('a{}b'.format(s), 'utf-8').splitlines(1))

['a\n', 'b'] [b'a\n', b'b']
['a\r', 'b'] [b'a\r', b'b']
['a\x1c', 'b'] [b'a\x1cb']
['a\x1d', 'b'] [b'a\x1db']
['a\x1e', 'b'] [b'a\x1eb']
msg97300 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-06 09:14
Florent Xicluna wrote:
> 
> New submission from Florent Xicluna <laxyf@yahoo.fr>:
> 
> Bytes objects and Unicode objects do not agree on ASCII linebreaks.
> 
> ## Python 2
> 
> for s in '\x0a\x0d\x1c\x1d\x1e':
>   print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
> 
> # [u'a\n', u'b'] ['a\n', 'b']
> # [u'a\r', u'b'] ['a\r', 'b']
> # [u'a\x1c', u'b'] ['a\x1cb']
> # [u'a\x1d', u'b'] ['a\x1db']
> # [u'a\x1e', u'b'] ['a\x1eb']
> 
> 
> ## Python 3
> 
> for s in '\x0a\x0d\x1c\x1d\x1e':
>   print('a{}b'.format(s).splitlines(1),
>         bytes('a{}b'.format(s), 'utf-8').splitlines(1))
> 
> ['a\n', 'b'] [b'a\n', b'b']
> ['a\r', 'b'] [b'a\r', b'b']
> ['a\x1c', 'b'] [b'a\x1cb']
> ['a\x1d', 'b'] [b'a\x1db']
> ['a\x1e', 'b'] [b'a\x1eb']

Unicode has more line break characters defined than ASCII, which
only has a single line break character \n, but also uses the
conventions \r and \r\n for meaning "start a new line,
go to position 1".

See e.g. http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters

The three extra code points Unicode defines for line breaks are
group separators that are not in common use.
msg97333 - (view) Author: Michael Foord (michael.foord) * (Python committer) Date: 2010-01-07 00:03
'\x85' when decoded using latin-1 is just transcoded to u'\x85' which is treated as the NEL (a C1 control code equivalent to end of line). This changes iteration over the file when you decode and actually broke our csv parsing code when we got some latin-1 encoded data with \x85 in it from our customer.
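A minimal sketch of the effect described above, written for Python 3. The sample bytes are invented for illustration and are not the customer data mentioned in the message.

```python
# Hypothetical latin-1 payload: one record that contains byte 0x85.
raw = b"name,comment\x85more\n"

# bytes.splitlines() only recognizes \n, \r and \r\n as line boundaries.
byte_lines = raw.splitlines()

# After decoding, byte 0x85 becomes U+0085 (NEL), which
# str.splitlines() treats as a line boundary.
text = raw.decode("latin-1")
text_lines = text.splitlines()

# Splitting on "\n" explicitly keeps the record in one piece.
newline_only = text.split("\n")

print(byte_lines)
print(text_lines)
print(newline_only)
```

Splitting on an explicit "\n" is one possible workaround when the data is known to use LF-only line endings; it is not necessarily what the affected code ended up doing.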
msg97407 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-08 10:32
Some technical background.

== Unicode ==

According to the Unicode Standard Annex #9, a character with
bidirectional class B is a "Paragraph Separator". And “Because a
Paragraph Separator breaks lines, there will be at most one per line,
at the end of that line.”

As a consequence, there are three reasons to identify a character as a
linebreak:
 - General Category Zl "Line Separator"
 - General Category Zp "Paragraph Separator"
 - Bidirectional Class B "Paragraph Separator"

There are eight linebreaks in the current Unicode Database (5.2):
------------------------------------------------------------------------
000A    LF  LINE FEED                   Cc  B
000D    CR  CARRIAGE RETURN             Cc  B
001C    FS  INFORMATION SEPARATOR FOUR  Cc  B (UCD 3.1 FILE SEPARATOR)
001D    GS  INFORMATION SEPARATOR THREE Cc  B (UCD 3.1 GROUP SEPARATOR)
001E    RS  INFORMATION SEPARATOR TWO   Cc  B (UCD 3.1 RECORD SEPARATOR)
0085    NEL NEXT LINE                   Cc  B (C1 Control Code)
2028    LS  LINE SEPARATOR              Zl  WS  (Unicode)
2029    PS  PARAGRAPH SEPARATOR         Zp  B   (Unicode)
------------------------------------------------------------------------
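The three criteria above can be checked against the Unicode database bundled with a given interpreter. This is a rough Python 3 sketch using the standard unicodedata module; the helper name is_linebreak_by_ucd is made up for this example, and the result depends on the UCD version shipped with the interpreter.

```python
import sys
import unicodedata


def is_linebreak_by_ucd(ch):
    """True if ch has General Category Zl or Zp, or Bidirectional Class B."""
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")


# Scan every code point and keep those matching one of the three criteria.
linebreaks = [cp for cp in range(sys.maxunicode + 1)
              if is_linebreak_by_ucd(chr(cp))]

for cp in linebreaks:
    ch = chr(cp)
    print("%04X  %-2s  %s" % (cp,
                              unicodedata.category(ch),
                              unicodedata.bidirectional(ch)))
```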


== ASCII ==

The standard ASCII control codes (C0) are in the range 00-1F,
which limits the list to LF, CR, FS, GS, RS.
The last three are not considered line breaks:
“The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
structure data, usually on a tape, in order to simulate punched cards. End of
medium (EM) warns that the tape (or whatever) is ending. While many systems use
CR/LF and TAB for structuring data, it is possible to encounter the separator
control characters in data that needs to be structured. The separator control
characters are not overloaded; there is no general use of them except to
separate data into structured groupings. Their numeric values are contiguous
with the space character, which can be considered a member of the group, as a
word separator.”
(Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)

In conclusion, it may be better to keep things unchanged.
We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

References:
 - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
 - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
 - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
 - C0 and C1 Control Codes:
     http://en.wikipedia.org/wiki/C0_and_C1_control_codes
msg97408 - (view) Author: Michael Foord (michael.foord) * (Python committer) Date: 2010-01-08 10:33
Documenting the characters that splitlines treats as newlines for Unicode should definitely be done.
msg97410 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-08 11:42
It's confusing.

There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
  BK, CR, LF, NL

And the resulting list is different:
                                       CAT BIDI BRK
------------------------------------------------------------------------
000A    LF  LINE FEED                   Cc  B   LF
000B    VT  LINE TABULATION             Cc  S   BK (since Unicode 5.0) 
000C    FF  FORM FEED                   Cc  WS  BK
000D    CR  CARRIAGE RETURN             Cc  B   CR
0085    NEL NEXT LINE                   Cc  B   NL (C1 Control Code)
2028    LS  LINE SEPARATOR              Zl  WS  BK
2029    PS  PARAGRAPH SEPARATOR         Zp  B   BK
------------------------------------------------------------------------

Differences:
 - VT and FF are mandatory breaks (even if “implementations are not
   required to support the VT character”)
 - FS, GS, RS are combining marks (CM): “Prohibit a line break between
   the character and the preceding character”

According to this Annex, the current splitlines() implementation violates the Unicode standard.

References:
 - Unicode Standard Annex #14 - Line Breaking Algorithm
   http://www.unicode.org/reports/tr14/
 - UCD LineBreak.txt
   http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt
msg97438 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-08 20:18
Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> Some technical background.
> 
> == Unicode ==
> 
> According to the Unicode Standard Annex #9, a character with
> bidirectional class B is a "Paragraph Separator". And “Because a
> Paragraph Separator breaks lines, there will be at most one per line,
> at the end of that line.”
> 
> As a consequence, there's 3 reasons to identify a character as a
> linebreak:
>  - General Category Zl "Line Separator"
>  - General Category Zp "Paragraph Separator"
>  - Bidirectional Class B "Paragraph Separator"

This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).

> There's 8 linebreaks in the current Unicode Database (5.2):
> ------------------------------------------------------------------------
> 000A    LF  LINE FEED                   Cc  B
> 000D    CR  CARRIAGE RETURN             Cc  B
> 001C    FS  INFORMATION SEPARATOR FOUR  Cc  B (UCD 3.1 FILE SEPARATOR)
> 001D    GS  INFORMATION SEPARATOR THREE Cc  B (UCD 3.1 GROUP SEPARATOR)
> 001E    RS  INFORMATION SEPARATOR TWO   Cc  B (UCD 3.1 RECORD SEPARATOR)
> 0085    NEL NEXT LINE                   Cc  B (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS  (Unicode)
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B   (Unicode)
> ------------------------------------------------------------------------

And that's the list we're currently using.

> == ASCII ==
> 
> The Standard ASCII control codes (C0) are in the range 00-1F.
> It limits the list to LF, CR, FS, GS, RS.
> Regarding the last three, they are not considered as linebreaks:
> “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
> structure data, usually on a tape, in order to simulate punched cards. End of
> medium (EM) warns that the tape (or whatever) is ending. While many systems use
> CR/LF and TAB for structuring data, it is possible to encounter the separator
> control characters in data that needs to be structured. The separator control
> characters are not overloaded; there is no general use of them except to
> separate data into structured groupings. Their numeric values are contiguous
> with the space character, which can be considered a member of the group, as a
> word separator.”
> (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
> 
> In conclusion, it may be better to keep things unchanged.

Agreed.

> We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

For ASCII we should make the list of characters explicit.
For Unicode, we should mention the above definition and give
the table as example list (the Unicode database may add more
such characters in the future).

> References:
>  - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
>  - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
>  - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
>  - C0 and C1 Control Codes:
>      http://en.wikipedia.org/wiki/C0_and_C1_control_codes
msg97440 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-08 21:08
Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> It's confusing.
> 
> There's a specific annex UAX #14 which defines "Line Breaking Properties".
> Some properties are defines as "Mandatory Line Breaks (non-tailorable)":
>   BK, CR, LF, NL

Note that a line breaking algorithm is something different than
a line split algorithm. The latter is used to separate lines at
pre-defined positions in the text, the former is used to format
a piece of text to fit e.g. into a certain width of available
character positions.

.splitlines() implements a line splitting algorithm, not a line
breaking one.
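To illustrate the distinction, here is a small Python 3 sketch. It uses textwrap.wrap() only as a stand-in for "fitting text into a given width"; textwrap is not an implementation of UAX #14, and the sample text is invented.

```python
import textwrap

text = "alpha beta gamma\ndelta"

# Line splitting: cut the text at the separators already present in it.
split_result = text.splitlines()

# Line breaking (in the loose sense used here): choose break
# opportunities so that each output line fits into 10 columns.
wrapped_result = textwrap.wrap(text, width=10)

print(split_result)    # ['alpha beta gamma', 'delta']
print(wrapped_result)  # ['alpha beta', 'gamma', 'delta']
```

The first call only looks at the position of existing separators; the second ignores them (the newline is treated as ordinary whitespace by default) and picks its own break positions.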

> And the resulting list is different:
>                                        CAT BIDI BRK
> ------------------------------------------------------------------------
> 000A    LF  LINE FEED                   Cc  B   LF
> 000B    VT  LINE TABULATION             Cc  S   BK (since Unicode 5.0) 
> 000C    FF  FORM FEED                   Cc  WS  BK
> 000D    CR  CARRIAGE RETURN             Cc  B   CR
> 0085    NEL NEXT LINE                   Cc  B   NL (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS  BK
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B   BK
> ------------------------------------------------------------------------
>
> Differences:
>  - VT and FF are mandatory breaks (even if “implementations are not
>    required to support the VT character”)
>  - FS, GS, US are combined marks (CM): “Prohibit a line break between
>    the character and the preceding character”
> 
> According to this Annex, the current splitlines() implementation violates the Unicode standard.

It appears so and I guess that's an oversight on my part when
writing the code: in Unicode 2.1 (the version I started with),
FF was marked as "B", later on Unicode 3.0 was published and
the new LineBreak.txt file was added to the standard. FF was
changed to "WS" and instead marked as "BK" in that new LineBreak.txt
file.

Since we only used the main UnicodeData.txt file as basis for
the type database, the "FF" code point dropped out of the
line break code point set.

I guess we'll have to add FF and VT to the generator makeunicodedata.py
to remedy this.

> References:
>  - Unicode Standard Annex #14 - Line Breaking Algorithm
>    http://www.unicode.org/reports/tr14/
>  - UCD LineBreak.txt
>    http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

msg97483 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-10 00:45
Here is a draft of the patch to do what Marc André proposed in msg97440 (add VT and FF).
Additionally, I upgraded the UCD from 5.1 to 5.2.

The implementation uses field 16 as defined in the "py3k" implementation of "makeunicodedata.py". It should minimize differences between the Py2 and Py3 implementations.

Documentation and tests are missing.
I can provide a "diff.gz" containing "Modules/unicodedata_db.h", "Modules/unicodename_db.h" and "Objects/unicodetype_db.h", if needed.


- /* Returns 1 for Unicode characters having the category 'Zl',
-  * 'Zp' or type 'B', 0 otherwise.
+ /* Returns 1 for Unicode characters having the line break
+  * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
+  * type 'B', 0 otherwise.
   */

Note: the "remove_deprecation" patch should be applied first, to remove the "-3" warnings.
msg97502 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-01-10 10:28
I don't know what to do about this:

>  - FS, GS, RS are combined marks (CM): “Prohibit a line break between
>    the character and the preceding character”

I know they are not commonly used. So we can keep them as line breaks.
But if we comply strictly with UAX 14 we do not consider them as line breaks.
msg97531 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-10 18:04
Florent Xicluna wrote:
> 
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
> 
> I don't know what to do about this:
> 
>>  - FS, GS, RS are combined marks (CM): “Prohibit a line break between
>>    the character and the preceding character”
> 
> I know they are not commonly used. So we can keep them as line breaks.
> But if we comply strictly with UAX 14 we do not consider them as line breaks.

Right. The only update we'd have to do is add FF and VT.

I am a little worried about the possible breakage this may cause,
though. E.g. if you look at a file with FFs in Emacs, the FFs don't
show up as line breaks. FFs in CSV files are currently also not regarded
as line breaks and thus don't need to be placed in quotes.
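The CSV concern can be sketched as follows in Python 3, on an interpreter where str.splitlines() treats FF as a line boundary. The sample data is hypothetical; the point is only that the result depends on how the input is cut into lines before it reaches csv.reader.

```python
import csv
import io

# Hypothetical CSV data with an unquoted form feed (U+000C) inside a field.
data = "id,note\n1,page one\x0cpage two\n"

# Reading through a file-like object: lines end only at \n, \r or \r\n,
# so the form feed stays inside the field.
rows_from_file = list(csv.reader(io.StringIO(data, newline="")))

# Feeding str.splitlines() output to the reader: FF ends a line too,
# so the record is cut in two.
rows_from_splitlines = list(csv.reader(data.splitlines()))

print(rows_from_file)
print(rows_from_splitlines)
```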

VTs are probably a non-issue, since they are not in common use.
msg98485 - (view) Author: Chris Carter (Chris.Carter) Date: 2010-01-29 00:15
Then I must ask, why did the string attribute behave differently?  I added it to allow for that, and the behavior seems inconsistent.
msg98486 - (view) Author: Chris Carter (Chris.Carter) Date: 2010-01-29 00:16
My bad, wrong bug.
msg101294 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-19 00:30
Cleanup committed as r78982

Patch for LineBreak.txt updated after UCD upgrade to 5.2.
See details: http://bugs.python.org/issue7643#msg97483

Tests added to test_unicodedata.

Backward compatibility concern:
 * it adds VT u'\x0b' and FF u'\x0c' as line breaks.

The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).
msg101306 - (view) Author: Chris Carter (Chris.Carter) Date: 2010-03-19 05:01
unwatched
msg101494 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-22 11:56
Florent Xicluna wrote:
> Backward compatibility concern:
>  * it adds VT u'\x0b' and FF u'\x0c' as line breaks.
> 
> The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).

I think we should correct this bug together with a clear warning in
the Misc/NEWS file.
msg101945 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-03-30 16:45
Which functions are affected by this change?
Py_UNICODE_ISLINEBREAK()? unicode.splitlines()?
msg101948 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-30 17:05
Committed to trunk: r79494 and r79496.

Afaict, it changes Py_UNICODE_ISLINEBREAK, _PyUnicode_IsLinebreak and the Unicode functions which depend on it (splitlines(), _sre module).
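One way to see the effect on splitlines() from the Python side is to probe every code point. This is only a sketch for Python 3; the helper name splits_line is made up here, and the exact list printed depends on the interpreter and its Unicode database.

```python
import sys


def splits_line(ch):
    """True if str.splitlines() treats ch as a line boundary."""
    return len(("a" + ch + "b").splitlines()) == 2


# Collect every code point that splits "a<ch>b" into two lines.
boundaries = [cp for cp in range(sys.maxunicode + 1) if splits_line(chr(cp))]

print(["U+%04X" % cp for cp in boundaries])
```

On an interpreter that includes this change, VT (U+000B) and FF (U+000C) appear in the list alongside the characters that were already recognized.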
msg101955 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-30 20:21
Ported to 3.x with r79506
History
Date User Action Args
2010-03-30 20:21:44  flox  set  status: open -> closed
    resolution: fixed
    messages: + msg101955
    stage: resolved
2010-03-30 17:05:56  flox  set  messages: + msg101948
2010-03-30 16:45:25  amaury.forgeotdarc  set  assignee: flox
    messages: + msg101945
    nosy: + amaury.forgeotdarc
2010-03-22 11:56:00  lemburg  set  messages: + msg101494
2010-03-19 06:57:07  flox  set  nosy: - Chris.Carter
2010-03-19 05:01:51  Chris.Carter  set  nosy: lemburg, flox, Chris.Carter
    messages: + msg101306
2010-03-19 00:31:00  flox  set  priority: normal
    files: + issue7643_use_LineBreak_v2.diff
    messages: + msg101294
2010-03-18 23:48:19  flox  set  files: - issue7643_use_LineBreak.diff
2010-03-18 22:58:00  michael.foord  set  nosy: - michael.foord
2010-03-18 22:57:18  flox  set  files: - issue7643_remove_deprecation.diff
2010-01-29 00:16:20  Chris.Carter  set  messages: + msg98486
2010-01-29 00:15:43  Chris.Carter  set  nosy: + Chris.Carter
    messages: + msg98485
2010-01-10 18:05:00  lemburg  set  messages: + msg97531
2010-01-10 10:28:24  flox  set  nosy: lemburg, michael.foord, flox
    messages: + msg97502
    components: + Unicode
    title: What is an ASCII linebreak? -> What is a Unicode line break character?
2010-01-10 00:45:28  flox  set  files: + issue7643_use_LineBreak.diff
    messages: + msg97483
2010-01-10 00:36:01  flox  set  files: + issue7643_remove_deprecation.diff
    keywords: + patch
2010-01-08 21:08:20  lemburg  set  messages: + msg97440
2010-01-08 20:18:22  lemburg  set  messages: + msg97438
2010-01-08 11:42:41  flox  set  messages: + msg97410
2010-01-08 10:33:51  michael.foord  set  messages: + msg97408
2010-01-08 10:32:06  flox  set  messages: + msg97407
2010-01-07 00:03:17  michael.foord  set  nosy: + michael.foord
    messages: + msg97333
2010-01-06 09:14:08  lemburg  set  nosy: + lemburg
    messages: + msg97300
2010-01-06 08:46:45  flox  create