What is a Unicode line break character? #51892

Closed
florentx mannequin opened this issue Jan 6, 2010 · 19 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@florentx
Mannequin

florentx mannequin commented Jan 6, 2010

BPO 7643
Nosy @malemburg, @amauryfa, @florentx
Files
  • issue7643_use_LineBreak_v2.diff: Patch, apply to 2.x
  Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/florentx'
    closed_at = <Date 2010-03-30.20:21:44.868>
    created_at = <Date 2010-01-06.08:46:45.401>
    labels = ['interpreter-core', 'type-bug', 'expert-unicode']
    title = 'What is a Unicode line break character?'
    updated_at = <Date 2010-03-30.20:21:44.866>
    user = 'https://github.com/florentx'

    bugs.python.org fields:

    activity = <Date 2010-03-30.20:21:44.866>
    actor = 'flox'
    assignee = 'flox'
    closed = True
    closed_date = <Date 2010-03-30.20:21:44.868>
    closer = 'flox'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-01-06.08:46:45.401>
    creator = 'flox'
    dependencies = []
    files = ['16577']
    hgrepos = []
    issue_num = 7643
    keywords = ['patch']
    message_count = 19.0
    messages = ['97299', '97300', '97333', '97407', '97408', '97410', '97438', '97440', '97483', '97502', '97531', '98485', '98486', '101294', '101306', '101494', '101945', '101948', '101955']
    nosy_count = 3.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue7643'
    versions = ['Python 2.7', 'Python 3.2']

    @florentx
    Mannequin Author

    florentx mannequin commented Jan 6, 2010

    Bytes objects and Unicode objects do not agree on ASCII linebreaks.

    ## Python 2

    for s in '\x0a\x0d\x1c\x1d\x1e':
      print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)

    # [u'a\n', u'b'] ['a\n', 'b']
    # [u'a\r', u'b'] ['a\r', 'b']
    # [u'a\x1c', u'b'] ['a\x1cb']
    # [u'a\x1d', u'b'] ['a\x1db']
    # [u'a\x1e', u'b'] ['a\x1eb']

    ## Python 3

    for s in '\x0a\x0d\x1c\x1d\x1e':
      print('a{}b'.format(s).splitlines(1),
            bytes('a{}b'.format(s), 'utf-8').splitlines(1))

    ['a\n', 'b'] [b'a\n', b'b']
    ['a\r', 'b'] [b'a\r', b'b']
    ['a\x1c', 'b'] [b'a\x1cb']
    ['a\x1d', 'b'] [b'a\x1db']
    ['a\x1e', 'b'] [b'a\x1eb']

    @florentx florentx mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Jan 6, 2010
    @malemburg
    Member

    Florent Xicluna wrote:

    > New submission from Florent Xicluna <laxyf@yahoo.fr>:
    >
    > Bytes objects and Unicode objects do not agree on ASCII linebreaks.
    >
    > Python 2
    >
    > for s in '\x0a\x0d\x1c\x1d\x1e':
    >   print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
    >
    > [u'a\n', u'b'] ['a\n', 'b']
    > [u'a\r', u'b'] ['a\r', 'b']
    > [u'a\x1c', u'b'] ['a\x1cb']
    > [u'a\x1d', u'b'] ['a\x1db']
    > [u'a\x1e', u'b'] ['a\x1eb']
    >
    > Python 3
    >
    > for s in '\x0a\x0d\x1c\x1d\x1e':
    >   print('a{}b'.format(s).splitlines(1),
    >         bytes('a{}b'.format(s), 'utf-8').splitlines(1))
    >
    > ['a\n', 'b'] [b'a\n', b'b']
    > ['a\r', 'b'] [b'a\r', b'b']
    > ['a\x1c', 'b'] [b'a\x1cb']
    > ['a\x1d', 'b'] [b'a\x1db']
    > ['a\x1e', 'b'] [b'a\x1eb']

    Unicode has more line break characters defined than ASCII, which
    only has a single line break character \n, but also uses the
    conventions \r and \r\n for meaning "start a new line,
    go to position 1".

    See e.g. http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters

    The three extra code points Unicode defines for line breaks are
    group separators that are not in common use.

    @voidspace
    Contributor

    '\x85' when decoded using latin-1 is just transcoded to u'\x85' which is treated as the NEL (a C1 control code equivalent to end of line). This changes iteration over the file when you decode and actually broke our csv parsing code when we got some latin-1 encoded data with \x85 in it from our customer.
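
The behaviour described in this comment can be reproduced with a short sketch. It assumes Python 3; the `raw` record is made up for illustration and is not the customer data mentioned above.

```python
# Hypothetical latin-1 encoded record containing the byte 0x85.
raw = b"name\x85suffix,42\n"

# bytes.splitlines() only splits on \n, \r and \r\n.
byte_lines = raw.splitlines()

# After decoding, byte 0x85 becomes U+0085 (NEL), which
# str.splitlines() treats as a line boundary.
text_lines = raw.decode("latin-1").splitlines()

print(byte_lines)
print(text_lines)
```

The bytes version yields one line and the decoded version yields two, which is how a single CSV record can end up split after decoding.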

    @florentx
    Mannequin Author

    florentx mannequin commented Jan 8, 2010

    Some technical background.

    == Unicode ==

    According to the Unicode Standard Annex #9, a character with
    bidirectional class B is a "Paragraph Separator". And “Because a
    Paragraph Separator breaks lines, there will be at most one per line,
    at the end of that line.”

    As a consequence, there are three reasons to identify a character as a
    line break:

    • General Category Zl "Line Separator"
    • General Category Zp "Paragraph Separator"
    • Bidirectional Class B "Paragraph Separator"

    There are 8 line breaks in the current Unicode Database (5.2):
    ------------------------------------------------------------------------
    Code  Abbr  Name                         Cat  Bidi  Notes
    000A  LF    LINE FEED                    Cc   B
    000D  CR    CARRIAGE RETURN              Cc   B
    001C  FS    INFORMATION SEPARATOR FOUR   Cc   B     (UCD 3.1 FILE SEPARATOR)
    001D  GS    INFORMATION SEPARATOR THREE  Cc   B     (UCD 3.1 GROUP SEPARATOR)
    001E  RS    INFORMATION SEPARATOR TWO    Cc   B     (UCD 3.1 RECORD SEPARATOR)
    0085  NEL   NEXT LINE                    Cc   B     (C1 Control Code)
    2028  LS    LINE SEPARATOR               Zl   WS    (Unicode)
    2029  PS    PARAGRAPH SEPARATOR          Zp   B     (Unicode)
    ------------------------------------------------------------------------
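
A list like this can be recomputed from the three criteria (category Zl, category Zp, bidirectional class B). The following is a minimal sketch using the standard `unicodedata` module in Python 3; the helper name `is_linebreak_by_ucd` is made up for illustration, and the result depends on the Unicode database version bundled with the interpreter rather than UCD 5.2 specifically.

```python
import sys
import unicodedata


def is_linebreak_by_ucd(ch):
    """Hypothetical helper: category Zl or Zp, or bidirectional class B."""
    return (unicodedata.category(ch) in ("Zl", "Zp")
            or unicodedata.bidirectional(ch) == "B")


# Scan every code point and keep those matching the definition.
linebreaks = [cp for cp in range(sys.maxunicode + 1)
              if is_linebreak_by_ucd(chr(cp))]

for cp in linebreaks:
    ch = chr(cp)
    print(f"{cp:04X} {unicodedata.category(ch)} {unicodedata.bidirectional(ch)}")
```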

    == ASCII ==

    The Standard ASCII control codes (C0) are in the range 00-1F.
    It limits the list to LF, CR, FS, GS, RS.
    Regarding the last three, they are not considered as linebreaks:
    “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
    structure data, usually on a tape, in order to simulate punched cards. End of
    medium (EM) warns that the tape (or whatever) is ending. While many systems use
    CR/LF and TAB for structuring data, it is possible to encounter the separator
    control characters in data that needs to be structured. The separator control
    characters are not overloaded; there is no general use of them except to
    separate data into structured groupings. Their numeric values are contiguous
    with the space character, which can be considered a member of the group, as a
    word separator.”
    (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)

    In conclusion, it may be better to keep things unchanged.
    We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

    References:

    @voidspace
    Contributor

    Documenting the characters that splitlines treats as newlines for Unicode should definitely be done.

    @florentx
    Mannequin Author

    florentx mannequin commented Jan 8, 2010

    It's confusing.

    There's a specific annex, UAX #14, which defines "Line Breaking Properties".
    Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
    BK, CR, LF, NL

    And the resulting list is different:
    ------------------------------------------------------------------------
    Code  Abbr  Name                 Cat  Bidi  Brk  Notes
    000A  LF    LINE FEED            Cc   B     LF
    000B  VT    LINE TABULATION      Cc   S     BK   (since Unicode 5.0)
    000C  FF    FORM FEED            Cc   WS    BK
    000D  CR    CARRIAGE RETURN      Cc   B     CR
    0085  NEL   NEXT LINE            Cc   B     NL   (C1 Control Code)
    2028  LS    LINE SEPARATOR       Zl   WS    BK
    2029  PS    PARAGRAPH SEPARATOR  Zp   B     BK
    ------------------------------------------------------------------------

    Differences:

    • VT and FF are mandatory breaks (even if “implementations are not
      required to support the VT character”)
    • FS, GS, RS are combining marks (CM): “Prohibit a line break between
      the character and the preceding character”

    According to this Annex, the current splitlines() implementation violates the Unicode standard.

    References:

    @malemburg
    Member

    Florent Xicluna wrote:

    > Florent Xicluna <laxyf@yahoo.fr> added the comment:
    >
    > Some technical background.
    >
    > == Unicode ==
    >
    > According to the Unicode Standard Annex #9, a character with
    > bidirectional class B is a "Paragraph Separator". And “Because a
    > Paragraph Separator breaks lines, there will be at most one per line,
    > at the end of that line.”
    >
    > As a consequence, there are three reasons to identify a character as a
    > line break:
    >
    > • General Category Zl "Line Separator"
    > • General Category Zp "Paragraph Separator"
    > • Bidirectional Class B "Paragraph Separator"

    This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).

    > There are 8 line breaks in the current Unicode Database (5.2):
    > ------------------------------------------------------------------------
    > Code  Abbr  Name                         Cat  Bidi  Notes
    > 000A  LF    LINE FEED                    Cc   B
    > 000D  CR    CARRIAGE RETURN              Cc   B
    > 001C  FS    INFORMATION SEPARATOR FOUR   Cc   B     (UCD 3.1 FILE SEPARATOR)
    > 001D  GS    INFORMATION SEPARATOR THREE  Cc   B     (UCD 3.1 GROUP SEPARATOR)
    > 001E  RS    INFORMATION SEPARATOR TWO    Cc   B     (UCD 3.1 RECORD SEPARATOR)
    > 0085  NEL   NEXT LINE                    Cc   B     (C1 Control Code)
    > 2028  LS    LINE SEPARATOR               Zl   WS    (Unicode)
    > 2029  PS    PARAGRAPH SEPARATOR          Zp   B     (Unicode)
    > ------------------------------------------------------------------------

    And that's the list we're currently using.

    > == ASCII ==
    >
    > The Standard ASCII control codes (C0) are in the range 00-1F.
    > It limits the list to LF, CR, FS, GS, RS.
    > Regarding the last three, they are not considered as linebreaks:
    > “The separators (File, Group, Record, and Unit: FS, GS, RS and US) were made to
    > structure data, usually on a tape, in order to simulate punched cards. End of
    > medium (EM) warns that the tape (or whatever) is ending. While many systems use
    > CR/LF and TAB for structuring data, it is possible to encounter the separator
    > control characters in data that needs to be structured. The separator control
    > characters are not overloaded; there is no general use of them except to
    > separate data into structured groupings. Their numeric values are contiguous
    > with the space character, which can be considered a member of the group, as a
    > word separator.”
    > (Ref: http://en.wikipedia.org/wiki/Control_character#Data_structuring)
    >
    > In conclusion, it may be better to keep things unchanged.

    Agreed.

    > We may add some words to the documentation for str.splitlines() and bytes.splitlines() to explain what is considered a line break character.

    For ASCII we should make the list of characters explicit.
    For Unicode, we should mention the above definition and give
    the table as example list (the Unicode database may add more
    such characters in the future).

    References:

    @malemburg
    Member

    Florent Xicluna wrote:

    > Florent Xicluna <laxyf@yahoo.fr> added the comment:
    >
    > It's confusing.
    >
    > There's a specific annex, UAX #14, which defines "Line Breaking Properties".
    > Some properties are defined as "Mandatory Line Breaks (non-tailorable)":
    > BK, CR, LF, NL

    Note that a line breaking algorithm is something different than
    a line split algorithm. The latter is used to separate lines at
    pre-defined positions in the text, the former is used to format
    a piece of text to fit e.g. into a certain width of available
    character positions.

    .splitlines() implements a line splitting algorithm, not a line
    breaking one.
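
The distinction between splitting and breaking can be illustrated with the standard library. This is only a sketch: `textwrap.wrap` is a simple whitespace-based wrapper, not an implementation of UAX #14, and is used here as a stand-in for "a line breaking algorithm" in the sense described above.

```python
import textwrap

text = "The quick brown fox\njumps over the lazy dog"

# Line splitting: cut at the separators already present in the text.
split_result = text.splitlines()

# Line breaking: choose break positions so each line fits a given width.
# (textwrap.wrap replaces the embedded newline with a space by default.)
wrapped_result = textwrap.wrap(text, width=15)

print(split_result)
print(wrapped_result)
```

Splitting returns the two lines that were already delimited in the input, whereas wrapping ignores the existing newline and produces three lines of at most 15 characters.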

    > And the resulting list is different:
    > ------------------------------------------------------------------------
    > Code  Abbr  Name                 Cat  Bidi  Brk  Notes
    > 000A  LF    LINE FEED            Cc   B     LF
    > 000B  VT    LINE TABULATION      Cc   S     BK   (since Unicode 5.0)
    > 000C  FF    FORM FEED            Cc   WS    BK
    > 000D  CR    CARRIAGE RETURN      Cc   B     CR
    > 0085  NEL   NEXT LINE            Cc   B     NL   (C1 Control Code)
    > 2028  LS    LINE SEPARATOR       Zl   WS    BK
    > 2029  PS    PARAGRAPH SEPARATOR  Zp   B     BK
    > ------------------------------------------------------------------------
    >
    > Differences:
    >
    > • VT and FF are mandatory breaks (even if “implementations are not
    >   required to support the VT character”)
    > • FS, GS, RS are combining marks (CM): “Prohibit a line break between
    >   the character and the preceding character”
    >
    > According to this Annex, the current splitlines() implementation violates the Unicode standard.

    It appears so and I guess that's an oversight on my part when
    writing the code: in Unicode 2.1 (the version I started with),
    FF was marked as "B", later on Unicode 3.0 was published and
    the new LineBreak.txt file was added to the standard. FF was
    changed to "WS" and instead marked as "BK" in that new LineBreak.txt
    file.

    Since we only used the main UnicodeData.txt file as basis for
    the type database, the "FF" code point dropped out of the
    line break code point set.

    I guess we'll have to add FF and VT to the generator makeunicodedata.py
    to remedy this.

    References:

    Thanks,

    Marc-Andre Lemburg
    eGenix.com



    @florentx
    Mannequin Author

    florentx mannequin commented Jan 10, 2010

    Here is a draft of the patch to do what Marc André proposed in msg97440 (add VT and FF).
    Additionally, I upgraded the UCD from 5.1 to 5.2.

    The implementation uses field 16 as defined in the "py3k" implementation of "makeunicodedata.py". It should minimize differences between the Py2 and Py3 implementations.

    Documentation and tests are missing.
    I can provide a "diff.gz" containing "Modules/unicodedata_db.h", "Modules/unicodename_db.h" and "Objects/unicodetype_db.h", if needed.

    - /* Returns 1 for Unicode characters having the category 'Zl',
    -  * 'Zp' or type 'B', 0 otherwise.
    + /* Returns 1 for Unicode characters having the line break
    +  * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
    +  * type 'B', 0 otherwise.
       */

    Note: the "remove_deprecation" should be applied before to remove "-3" warnings.

    @florentx
    Mannequin Author

    florentx mannequin commented Jan 10, 2010

    I don't know what to do about this:

    • FS, GS, RS are combining marks (CM): “Prohibit a line break between
      the character and the preceding character”

    I know they are not commonly used. So we can keep them as line breaks.
    But if we comply strictly with UAX 14 we do not consider them as line breaks.

    @florentx florentx mannequin added the topic-unicode label Jan 10, 2010
    @florentx florentx mannequin changed the title from "What is an ASCII linebreak?" to "What is a Unicode line break character?" Jan 10, 2010
    @malemburg
    Member

    Florent Xicluna wrote:

    > Florent Xicluna <laxyf@yahoo.fr> added the comment:
    >
    > I don't know what to do about this:
    >
    > > • FS, GS, RS are combining marks (CM): “Prohibit a line break between
    > >   the character and the preceding character”
    >
    > I know they are not commonly used. So we can keep them as line breaks.
    > But if we comply strictly with UAX 14 we do not consider them as line breaks.

    Right. The only update we'd have to do is add FF and VT.

    I am a little worried about the possible breakage this may cause,
    though. E.g. if you look at a file with FFs in Emacs, the FFs don't
    show up as line breaks. FFs in CSV files are currently also not regarded
    as line breaks and thus don't need to be placed in quotes.

    VTs are probably a non-issue, since they are not in common use.
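
For code that must keep treating FF (and the other extra separators) as ordinary data, one option is to split explicitly on the classic newline conventions instead of relying on `str.splitlines()`. The sketch below assumes Python 3; `split_on_newlines` is a hypothetical helper name, not something proposed in this issue.

```python
import re

# Only the three classic newline conventions. \r\n is listed first so
# it is consumed as one separator rather than two.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_on_newlines(text):
    """Split on CRLF, CR and LF only; FF, VT, FS, GS, RS and NEL stay in the data."""
    return _NEWLINE.split(text)


sample = "a\x0cb\r\nc\rd\ne"
print(split_on_newlines(sample))
print(sample.splitlines())
```

Unlike `splitlines()`, this helper returns a trailing empty string when the text ends with a newline, so callers may need to drop it.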

    @ChrisCarter
    Mannequin

    ChrisCarter mannequin commented Jan 29, 2010

    Then I must ask, why did the string attribute behave differently? I added it to allow for that, and the behavior seems inconsistent.

    @ChrisCarter
    Mannequin

    ChrisCarter mannequin commented Jan 29, 2010

    My bad, wrong bug.

    @florentx
    Mannequin Author

    florentx mannequin commented Mar 19, 2010

    Cleanup committed as r78982

    Patch for LineBreak.txt updated after UCD upgrade to 5.2.
    See details: http://bugs.python.org/issue7643#msg97483

    Tests added to test_unicodedata.

    Backward compatibility concern:

    • it adds VT u'\x0b' and FF u'\x0c' as line breaks.

    The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).

    @ChrisCarter
    Mannequin

    ChrisCarter mannequin commented Mar 19, 2010

    unwatched

    @malemburg
    Member

    Florent Xicluna wrote:

    > Backward compatibility concern:
    >
    > • it adds VT u'\x0b' and FF u'\x0c' as line breaks.
    >
    > The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14).

    I think we should correct this bug together with a clear warning in
    the Misc/NEWS file.

    @amauryfa
    Member

    Which functions are affected by this change?
    Py_UNICODE_ISLINEBREAK()? unicode.splitlines()?

    @florentx
    Mannequin Author

    florentx mannequin commented Mar 30, 2010

    Committed to trunk: r79494 and r79496.

    Afaict, it changes Py_UNICODE_ISLINEBREAK, _PyUnicode_IsLinebreak and the Unicode functions which depend on it (splitlines(), _sre module).
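
The effect on `splitlines()` can be checked from Python itself. This is a minimal sketch assuming a Python 3 interpreter that includes the change; `splitlines_boundaries` is a hypothetical helper name, and the result reflects whatever interpreter runs it.

```python
import sys


def splitlines_boundaries():
    """Return the code points that str.splitlines() treats as a line boundary."""
    return [cp for cp in range(sys.maxunicode + 1)
            if len(f"a{chr(cp)}b".splitlines()) == 2]


boundaries = splitlines_boundaries()
print([f"{cp:04X}" for cp in boundaries])
```

On an interpreter with this change applied, VT (000B) and FF (000C) appear in the list alongside the eight characters discussed earlier in the thread.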

    @florentx
    Mannequin Author

    florentx mannequin commented Mar 30, 2010

    Ported to 3.x with r79506

    @florentx florentx mannequin closed this as completed Mar 30, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022