Issue 7643: What is a Unicode line break character?

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51892

classification

Title:	What is a Unicode line break character?
Type:	behavior	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	flox	Nosy List:	amaury.forgeotdarc, flox, lemburg
Priority:	normal	Keywords:	patch

Created on 2010-01-06 08:46 by flox, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue7643_use_LineBreak_v2.diff	flox, 2010-03-19 00:30	Patch, apply to 2.x

Messages (19)
msg97299 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-01-06 08:46
Bytes objects and Unicode objects do not agree on ASCII linebreaks.

## Python 2

for s in '\x0a\x0d\x1c\x1d\x1e':
    print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)

# [u'a\n', u'b'] ['a\n', 'b']
# [u'a\r', u'b'] ['a\r', 'b']
# [u'a\x1c', u'b'] ['a\x1cb']
# [u'a\x1d', u'b'] ['a\x1db']
# [u'a\x1e', u'b'] ['a\x1eb']

## Python 3

for s in '\x0a\x0d\x1c\x1d\x1e':
    print('a{}b'.format(s).splitlines(1),
          bytes('a{}b'.format(s), 'utf-8').splitlines(1))

['a\n', 'b'] [b'a\n', b'b']
['a\r', 'b'] [b'a\r', b'b']
['a\x1c', 'b'] [b'a\x1cb']
['a\x1d', 'b'] [b'a\x1db']
['a\x1e', 'b'] [b'a\x1eb']
msg97300 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-01-06 09:14
Florent Xicluna wrote:
>
> New submission from Florent Xicluna <laxyf@yahoo.fr>:
>
> Bytes objects and Unicode objects do not agree on ASCII linebreaks.
>
> ## Python 2
>
> for s in '\x0a\x0d\x1c\x1d\x1e':
>     print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
>
> # [u'a\n', u'b'] ['a\n', 'b']
> # [u'a\r', u'b'] ['a\r', 'b']
> # [u'a\x1c', u'b'] ['a\x1cb']
> # [u'a\x1d', u'b'] ['a\x1db']
> # [u'a\x1e', u'b'] ['a\x1eb']
>
>
> ## Python 3
>
> for s in '\x0a\x0d\x1c\x1d\x1e':
>     print('a{}b'.format(s).splitlines(1),
>           bytes('a{}b'.format(s), 'utf-8').splitlines(1))
>
> ['a\n', 'b'] [b'a\n', b'b']
> ['a\r', 'b'] [b'a\r', b'b']
> ['a\x1c', 'b'] [b'a\x1cb']
> ['a\x1d', 'b'] [b'a\x1db']
> ['a\x1e', 'b'] [b'a\x1eb']

Unicode has more line break characters defined than ASCII, which only has a single line break character \n, but also uses the conventions \r and \r\n for meaning "start a new line, go to position 1".

See e.g. http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters

The three extra code points Unicode defines for line breaks are group separators that are not in common use.
msg97333 - (view)	Author: Michael Foord (michael.foord) *	Date: 2010-01-07 00:03
'\x85' when decoded using latin-1 is just transcoded to u'\x85' which is treated as the NEL (a C1 control code equivalent to end of line). This changes iteration over the file when you decode and actually broke our csv parsing code when we got some latin-1 encoded data with \x85 in it from our customer.
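The effect described in this message can be reproduced in a few lines. The following is a minimal Python 3 sketch, not the reporter's code; the sample byte string is made up for illustration. The byte 0x85 is not a line boundary for bytes.splitlines(), but once decoded as Latin-1 it becomes U+0085 (NEL), which str.splitlines() does treat as one.

```python
# Hypothetical sample data: a single record containing the byte 0x85.
raw = b"name,comment\x85more\n"

# bytes.splitlines() only splits on \n, \r and \r\n, so the record stays whole.
byte_lines = raw.splitlines()

# Latin-1 maps the byte 0x85 to U+0085 (NEL). str.splitlines() treats NEL
# as a line boundary, so the same record is cut in two after decoding.
text_lines = raw.decode("latin-1").splitlines()

print(byte_lines)
print(text_lines)
```

Splitting the raw bytes first and decoding each line afterwards is one way to keep the byte-level line structure, at the cost of ignoring the Unicode-only separators.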
msg97407 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-01-08 10:32
Some technical background.

== Unicode ==

According to the Unicode Standard Annex #9, a character with bidirectional class B is a "Paragraph Separator". And “Because a Paragraph Separator breaks lines, there will be at most one per line, at the end of that line.”

As a consequence, there's 3 reasons to identify a character as a linebreak:
 - General Category Zl "Line Separator"
 - General Category Zp "Paragraph Separator"
 - Bidirectional Class B "Paragraph Separator"

There's 8 linebreaks in the current Unicode Database (5.2):
------------------------------------------------------------------------
000A    LF  LINE FEED                   Cc  B
000D    CR  CARRIAGE RETURN             Cc  B
001C    FS  INFORMATION SEPARATOR FOUR  Cc  B   (UCD 3.1 FILE SEPARATOR)
001D    GS  INFORMATION SEPARATOR THREE Cc  B   (UCD 3.1 GROUP SEPARATOR)
001E    RS  INFORMATION SEPARATOR TWO   Cc  B   (UCD 3.1 RECORD SEPARATOR)
0085    NEL NEXT LINE                   Cc  B   (C1 Control Code)
2028    LS  LINE SEPARATOR              Zl  WS  (Unicode)
2029    PS  PARAGRAPH SEPARATOR         Zp  B   (Unicode)
------------------------------------------------------------------------

== ASCII ==

The Standard ASCII control codes (C0) are in the range 00-1F.
It limits the list to LF, CR, FS, GS, RS.
Regarding the last three, they are not considered as linebreaks:
“The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to structure data, usually on a tape, in order to simulate punched cards. End of medium (EM) warns that the tape (or whatever) is ending. While many systems use CR/LF and TAB for structuring data, it is possible to encounter the separator control characters in data that needs to be structured. The separator control characters are not overloaded; there is no general use of them except to separate data into structured groupings. Their numeric values are contiguous with the space character, which can be considered a member of the group, as a word separator.”
(Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)

In conclusion, it may be better to keep things unchanged.
We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

References:
 - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
 - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
 - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
 - C0 and C1 Control Codes: http://en.wikipedia.org/wiki/C0_and_C1_control_codes
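The three criteria listed in this message (category Zl, category Zp, bidirectional class B) can be checked with the standard unicodedata module. This is only a sketch in Python 3: the helper name is_linebreak_by_category is made up here, and the interpreter's bundled Unicode database may be newer than 5.2, so the output is not guaranteed to match the table above line for line.

```python
import sys
import unicodedata


def is_linebreak_by_category(ch):
    """Rule from the message above: category Zl or Zp, or bidi class B.

    Hypothetical helper for illustration; this is not CPython's
    Py_UNICODE_ISLINEBREAK implementation.
    """
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")


# Scan every code point known to this interpreter.
linebreaks = [cp for cp in range(sys.maxunicode + 1)
              if is_linebreak_by_category(chr(cp))]

for cp in linebreaks:
    ch = chr(cp)
    # Print code point, general category and bidirectional class.
    print("%04X %s %s" % (cp, unicodedata.category(ch),
                          unicodedata.bidirectional(ch)))
```

Note that VT (U+000B) and FF (U+000C) are not selected by this rule, which is the gap discussed later in the thread.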
msg97408 - (view)	Author: Michael Foord (michael.foord) *	Date: 2010-01-08 10:33
Documenting the characters that splitlines treats as newlines for Unicode should definitely be done.
msg97410 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-01-08 11:42
It's confusing.

There's a specific annex UAX #14 which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
 BK, CR, LF, NL

And the resulting list is different:
                                        CAT BIDI BRK
------------------------------------------------------------------------
000A    LF  LINE FEED                   Cc  B    LF
000B    VT  LINE TABULATION             Cc  S    BK  (since Unicode 5.0)
000C    FF  FORM FEED                   Cc  WS   BK
000D    CR  CARRIAGE RETURN             Cc  B    CR
0085    NEL NEXT LINE                   Cc  B    NL  (C1 Control Code)
2028    LS  LINE SEPARATOR              Zl  WS   BK
2029    PS  PARAGRAPH SEPARATOR         Zp  B    BK
------------------------------------------------------------------------

Differences:
 - VT and FF are mandatory breaks (even if “implementations are not required to support the VT character”)
 - FS, GS, RS are combining marks (CM): “Prohibit a line break between the character and the preceding character”

According to this Annex, the current splitlines() implementation violates the Unicode standard.

References:
 - Unicode Standard Annex #14 - Line Breaking Algorithm
   http://www.unicode.org/reports/tr14/
 - UCD LineBreak.txt
   http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt
msg97438 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-01-08 20:18
Florent Xicluna wrote:
>
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
>
> Some technical background.
>
> == Unicode ==
>
> According to the Unicode Standard Annex #9, a character with
> bidirectional class B is a "Paragraph Separator". And “Because a
> Paragraph Separator breaks lines, there will be at most one per line,
> at the end of that line.”
>
> As a consequence, there's 3 reasons to identify a character as a
> linebreak:
>  - General Category Zl "Line Separator"
>  - General Category Zp "Paragraph Separator"
>  - Bidirectional Class B "Paragraph Separator"

This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).

> There's 8 linebreaks in the current Unicode Database (5.2):
> ------------------------------------------------------------------------
> 000A    LF  LINE FEED                   Cc  B
> 000D    CR  CARRIAGE RETURN             Cc  B
> 001C    FS  INFORMATION SEPARATOR FOUR  Cc  B   (UCD 3.1 FILE SEPARATOR)
> 001D    GS  INFORMATION SEPARATOR THREE Cc  B   (UCD 3.1 GROUP SEPARATOR)
> 001E    RS  INFORMATION SEPARATOR TWO   Cc  B   (UCD 3.1 RECORD SEPARATOR)
> 0085    NEL NEXT LINE                   Cc  B   (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS  (Unicode)
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B   (Unicode)
> ------------------------------------------------------------------------

And that's the list we're currently using.

> == ASCII ==
>
> The Standard ASCII control codes (C0) are in the range 00-1F.
> It limits the list to LF, CR, FS, GS, RS.
> Regarding the last three, they are not considered as linebreaks:
> “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
> structure data, usually on a tape, in order to simulate punched cards. End of
> medium (EM) warns that the tape (or whatever) is ending. While many systems use
> CR/LF and TAB for structuring data, it is possible to encounter the separator
> control characters in data that needs to be structured. The separator control
> characters are not overloaded; there is no general use of them except to
> separate data into structured groupings. Their numeric values are contiguous
> with the space character, which can be considered a member of the group, as a
> word separator.”
> (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
>
> In conclusion, it may be better to keep things unchanged.

Agreed.

> We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

For ASCII we should make the list of characters explicit.

For Unicode, we should mention the above definition and give the table as example list (the Unicode database may add more such characters in the future).

> References:
>  - The Unicode Character Database (UCD): http://www.unicode.org/ucd/
>  - UCD Property Values: http://unicode.org/reports/tr44/#Property_Values
>  - The Bidirectional Algorithm: http://www.unicode.org/reports/tr9/
>  - C0 and C1 Control Codes:
>    http://en.wikipedia.org/wiki/C0_and_C1_control_codes
msg97440 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-01-08 21:08
Florent Xicluna wrote:
>
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
>
> It's confusing.
>
> There's a specific annex UAX #14 which defines "Line Breaking Properties".
> Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
>  BK, CR, LF, NL

Note that a line breaking algorithm is something different than a line split algorithm. The latter is used to separate lines at pre-defined positions in the text, the former is used to format a piece of text to fit e.g. into a certain width of available character positions.

.splitlines() implements a line splitting algorithm, not a line breaking one.

> And the resulting list is different:
>                                         CAT BIDI BRK
> ------------------------------------------------------------------------
> 000A    LF  LINE FEED                   Cc  B    LF
> 000B    VT  LINE TABULATION             Cc  S    BK  (since Unicode 5.0)
> 000C    FF  FORM FEED                   Cc  WS   BK
> 000D    CR  CARRIAGE RETURN             Cc  B    CR
> 0085    NEL NEXT LINE                   Cc  B    NL  (C1 Control Code)
> 2028    LS  LINE SEPARATOR              Zl  WS   BK
> 2029    PS  PARAGRAPH SEPARATOR         Zp  B    BK
> ------------------------------------------------------------------------
>
> Differences:
>  - VT and FF are mandatory breaks (even if “implementations are not
>    required to support the VT character”)
>  - FS, GS, RS are combining marks (CM): “Prohibit a line break between
>    the character and the preceding character”
>
> According to this Annex, the current splitlines() implementation violates the Unicode standard.

It appears so and I guess that's an oversight on my part when writing the code: in Unicode 2.1 (the version I started with), FF was marked as "B", later on Unicode 3.0 was published and the new LineBreak.txt file was added to the standard. FF was changed to "WS" and instead marked as "BK" in that new LineBreak.txt file.

Since we only used the main UnicodeData.txt file as basis for the type database, the "FF" code point dropped out of the line break code point set.

I guess we'll have to add FF and VT to the generator makeunicodedata.py to remedy this.
> References:
>  - Unicode Standard Annex #14 - Line Breaking Algorithm
>    http://www.unicode.org/reports/tr14/
>  - UCD LineBreak.txt
>    http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

Thanks,
--
Marc-Andre Lemburg
eGenix.com
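The distinction drawn in this message between line splitting and line breaking can be illustrated with the standard library. This is a rough Python 3 sketch with made-up sample text. textwrap is used only as a stand-in for "fit text into a width"; it wraps on whitespace and is not an implementation of UAX #14.

```python
import textwrap

text = "the quick brown fox\njumps over"

# Line splitting: cut only at line boundary characters already in the text.
split_result = text.splitlines()

# Line breaking: choose break opportunities so each line fits a given width.
# The input here contains no line boundary characters at all.
wrapped_result = textwrap.wrap("the quick brown fox", width=10)

print(split_result)
print(wrapped_result)
```

The first result depends only on which characters count as line boundaries; the second depends on the target width and on where breaks are allowed.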
msg97483 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-01-10 00:45
Here is a draft of the patch to do what is proposed by Marc André in msg97440 (add VT and FF).

Additionally, I upgraded the UCD from 5.1 to 5.2.

The implementation uses field 16 as defined in the "py3k" implementation of "makeunicodedata.py". It should minimize differences between the Py2 and Py3 implementations.

Documentation and tests are missing.
I can provide a "diff.gz" containing "Modules/unicodedata_db.h", "Modules/unicodename_db.h" and "Objects/unicodetype_db.h", if needed.

-    /* Returns 1 for Unicode characters having the category 'Zl',
-     * 'Zp' or type 'B', 0 otherwise.
+    /* Returns 1 for Unicode characters having the line break
+     * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
+     * type 'B', 0 otherwise.
      */

Note: the "remove_deprecation" patch should be applied first to remove "-3" warnings.
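The rule in the new comment (line break property BK, CR, LF or NL, or bidirectional type B) can be approximated in Python 3 as follows. This is a sketch under stated assumptions: unicodedata does not expose the Line_Break property, so the MANDATORY_BREAK_CLASS table is copied by hand from the table earlier in this thread, and is_linebreak is a hypothetical helper, not the C function from the patch.

```python
import unicodedata

# Code points whose UAX #14 line break class is BK, CR, LF or NL.
# Assumption: copied by hand from the table earlier in this thread,
# not parsed from LineBreak.txt.
MANDATORY_BREAK_CLASS = {
    0x000A: "LF",
    0x000B: "BK",
    0x000C: "BK",
    0x000D: "CR",
    0x0085: "NL",
    0x2028: "BK",
    0x2029: "BK",
}


def is_linebreak(ch):
    """Hypothetical Python rendering of the rule in the patched comment."""
    return (ord(ch) in MANDATORY_BREAK_CLASS
            or unicodedata.bidirectional(ch) == "B")


# Candidates: the characters discussed in the thread, plus US and space.
candidates = "\x0a\x0b\x0c\x0d\x1c\x1d\x1e\x1f\x85\u2028\u2029 "
accepted = ["%04X" % ord(ch) for ch in candidates if is_linebreak(ch)]
print(accepted)
```

Under this rule FS, GS and RS stay in the set through their bidirectional type, while VT and FF are added through their line break class.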
msg97502 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-01-10 10:28
I don't know what to do about this:

>  - FS, GS, RS are combining marks (CM): “Prohibit a line break between
>    the character and the preceding character”

I know they are not commonly used. So we can keep them as line breaks.
But if we comply strictly with UAX 14 we do not consider them as line breaks.
msg97531 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-01-10 18:04
Florent Xicluna wrote:
>
> Florent Xicluna <laxyf@yahoo.fr> added the comment:
>
> I don't know what to do about this:
>
>>  - FS, GS, RS are combining marks (CM): “Prohibit a line break between
>>    the character and the preceding character”
>
> I know they are not commonly used. So we can keep them as line breaks.
> But if we comply strictly with UAX 14 we do not consider them as line breaks.

Right. The only update we'd have to do is add FF and VT.

I am a little worried about the possible breakage this may cause, though. E.g. if you look at a file with FFs in Emacs, the FFs don't show up as line breaks. FFs in CSV files are currently also not regarded as line breaks and thus don't need to be placed in quotes.

VTs are probably a non-issue, since they are not in common use.
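The CSV concern raised here can be sketched in Python 3 on an interpreter that includes the change discussed in this issue, where str.splitlines() treats FF as a line boundary. The sample data is made up for illustration. Feeding csv.reader the result of splitlines() cuts an unquoted field containing FF, while reading from a stream opened with newline="" splits rows only on \n, \r and \r\n.

```python
import csv
import io

# Hypothetical sample: an unquoted form feed inside the first field.
data = "x\x0cy,1\nz,2\n"

# splitlines() also splits at FF, so the first record becomes two rows.
rows_splitlines = list(csv.reader(data.splitlines()))

# A text stream with newline="" only ends lines at \n, \r or \r\n,
# so the FF stays inside the field.
rows_stream = list(csv.reader(io.StringIO(data, newline="")))

print(rows_splitlines)
print(rows_stream)
```

This is why passing a file object opened with newline="" to the csv module tends to be the safer choice than pre-splitting text with splitlines().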
msg98485 - (view)	Author: Chris Carter (Chris.Carter)	Date: 2010-01-29 00:15
Then I must ask, why did the string attribute behave differently? I added it to allow for that, and the behavior seems inconsistent.
msg98486 - (view)	Author: Chris Carter (Chris.Carter)	Date: 2010-01-29 00:16
My bad, wrong bug.
msg101294 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-19 00:30
Cleanup committed as r78982

Patch for LineBreak.txt updated after UCD upgrade to 5.2.
See details: http://bugs.python.org/issue7643#msg97483

Tests added to test_unicodedata.

Backward compatibility concern:
 * it adds VT u'\x0b' and FF u'\x0c' as line breaks.

The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).
msg101306 - (view)	Author: Chris Carter (Chris.Carter)	Date: 2010-03-19 05:01
unwatched
msg101494 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-03-22 11:56
Florent Xicluna wrote:
> Backward compatibility concern:
>  * it adds VT u'\x0b' and FF u'\x0c' as line breaks.
>
> The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).

I think we should correct this bug together with a clear warning in the Misc/NEWS file.
msg101945 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-03-30 16:45
Which functions are affected by this change? Py_UNICODE_ISLINEBREAK()? unicode.splitlines()?
msg101948 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-30 17:05
Committed to trunk: r79494 and r79496. Afaict, it changes Py_UNICODE_ISLINEBREAK, _PyUnicode_IsLinebreak and the Unicode functions which depend on it (splitlines(), _sre module).
msg101955 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-30 20:21
Ported to 3.x with r79506

History
Date	User	Action	Args
2022-04-11 14:56:56	admin	set	github: 51892
2010-03-30 20:21:44	flox	set	status: open -> closed resolution: fixed messages: + msg101955 stage: resolved
2010-03-30 17:05:56	flox	set	messages: + msg101948
2010-03-30 16:45:25	amaury.forgeotdarc	set	assignee: flox messages: + msg101945 nosy: + amaury.forgeotdarc
2010-03-22 11:56:00	lemburg	set	messages: + msg101494
2010-03-19 06:57:07	flox	set	nosy: - Chris.Carter
2010-03-19 05:01:51	Chris.Carter	set	nosy: lemburg, flox, Chris.Carter messages: + msg101306
2010-03-19 00:31:00	flox	set	priority: normal files: + issue7643_use_LineBreak_v2.diff messages: + msg101294
2010-03-18 23:48:19	flox	set	files: - issue7643_use_LineBreak.diff
2010-03-18 22:58:00	michael.foord	set	nosy: - michael.foord
2010-03-18 22:57:18	flox	set	files: - issue7643_remove_deprecation.diff
2010-01-29 00:16:20	Chris.Carter	set	messages: + msg98486
2010-01-29 00:15:43	Chris.Carter	set	nosy: + Chris.Carter messages: + msg98485
2010-01-10 18:05:00	lemburg	set	messages: + msg97531
2010-01-10 10:28:24	flox	set	nosy: lemburg, michael.foord, flox messages: + msg97502 components: + Unicode title: What is an ASCII linebreak? -> What is a Unicode line break character?
2010-01-10 00:45:28	flox	set	files: + issue7643_use_LineBreak.diff messages: + msg97483
2010-01-10 00:36:01	flox	set	files: + issue7643_remove_deprecation.diff keywords: + patch
2010-01-08 21:08:20	lemburg	set	messages: + msg97440
2010-01-08 20:18:22	lemburg	set	messages: + msg97438
2010-01-08 11:42:41	flox	set	messages: + msg97410
2010-01-08 10:33:51	michael.foord	set	messages: + msg97408
2010-01-08 10:32:06	flox	set	messages: + msg97407
2010-01-07 00:03:17	michael.foord	set	nosy: + michael.foord messages: + msg97333
2010-01-06 09:14:08	lemburg	set	nosy: + lemburg messages: + msg97300
2010-01-06 08:46:45	flox	create