Message 97410 - Python tracker

Author	flox
Recipients	flox, lemburg, michael.foord
Date	2010-01-08.11:42:40
Message-id	<1262950962.97.0.235825438798.issue7643@psf.upfronthosting.co.za>
In-reply-to

Content

It's confusing.

There's a specific annex, UAX #14, which defines "Line Breaking Properties".
Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
  BK, CR, LF, NL

And the resulting list is different:
                                       CAT BIDI BRK
------------------------------------------------------------------------
000A    LF  LINE FEED                   Cc  B   LF
000B    VT  LINE TABULATION             Cc  S   BK (since Unicode 5.0) 
000C    FF  FORM FEED                   Cc  WS  BK
000D    CR  CARRIAGE RETURN             Cc  B   CR
0085    NEL NEXT LINE                   Cc  B   NL (C1 Control Code)
2028    LS  LINE SEPARATOR              Zl  WS  BK
2029    PS  PARAGRAPH SEPARATOR         Zp  B   BK
------------------------------------------------------------------------
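The CAT and BIDI columns of the table above can be cross-checked with the
standard unicodedata module. This is only a sketch: the helper name
ucd_columns is made up for illustration, the values returned depend on the
Unicode database bundled with the running interpreter, and the BRK column
(the Line_Break property) is not exposed by unicodedata, so it still has to
be read from LineBreak.txt.

```python
import unicodedata

# Code points listed in the table above.
CODE_POINTS = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def ucd_columns(code_point):
    """Return (general category, bidirectional class) for a code point.

    Hypothetical helper for illustration; it only wraps two real
    unicodedata functions.
    """
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


if __name__ == "__main__":
    for cp in CODE_POINTS:
        cat, bidi = ucd_columns(cp)
        print("%04X    %-3s %s" % (cp, cat, bidi))
```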

Differences:
 - VT and FF are mandatory breaks (even if “implementations are not
   required to support the VT character”)
 - FS, GS, US are combining marks (CM): “Prohibit a line break between
   the character and the preceding character”

According to this Annex, the current splitlines() implementation violates the Unicode standard.
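To see which of the characters discussed here a given interpreter actually
treats as line boundaries, a small probe along these lines may help. It is a
sketch, not a conformance test: the names CANDIDATES, splits_on and report
are invented for the example, and the result for VT, FF, FS, GS and US
depends on the Python version being run.

```python
# Characters mentioned in this message, keyed by their usual abbreviations.
CANDIDATES = {
    "LF": "\n",
    "VT": "\x0b",
    "FF": "\x0c",
    "CR": "\r",
    "FS": "\x1c",
    "GS": "\x1d",
    "US": "\x1f",
    "NEL": "\x85",
    "LS": "\u2028",
    "PS": "\u2029",
}


def splits_on(ch):
    """Return True if str.splitlines() breaks the text at character ch."""
    return len(("a" + ch + "b").splitlines()) == 2


def report():
    """Map each abbreviation to whether splitlines() splits on it."""
    return {name: splits_on(ch) for name, ch in CANDIDATES.items()}


if __name__ == "__main__":
    for name, result in report().items():
        print("%-3s %s" % (name, "splits" if result else "does not split"))
```

Comparing the output with the BRK column shows where splitlines() and the mandatory break classes of UAX #14 disagree on that interpreter.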

References:
 - Unicode Standard Annex #14 - Line Breaking Algorithm
   http://www.unicode.org/reports/tr14/
 - UCD LineBreak.txt
   http://www.unicode.org/Public/5.2.0/ucd/LineBreak.txt

History
Date	User	Action	Args
2010-01-08 11:42:43	flox	set	recipients: + flox, lemburg, michael.foord
2010-01-08 11:42:42	flox	set	messageid: <1262950962.97.0.235825438798.issue7643@psf.upfronthosting.co.za>
2010-01-08 11:42:41	flox	link	issue7643 messages
2010-01-08 11:42:40	flox	create