linebreak sequences should be better documented #57064

Closed
MatthewBoehm mannequin opened this issue Aug 29, 2011 · 24 comments
Labels
docs, topic-unicode, type-bug

Comments

@MatthewBoehm
Mannequin

MatthewBoehm mannequin commented Aug 29, 2011

BPO 12855
Nosy @vstinner, @bitdancer, @vadmium
Files
  • linebreakdoc.py27.patch
  • linebreakdoc.v2.py27.patch
  • linebreakdoc.v2.py32.patch
  • linebreakdoc.v3.py3.5.patch
  • python.JPG: Bug resolved
  • linebreakdoc.v4.py3.5.patch
  • linebreakdoc.v5.py2.7.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2016-06-15.02:01:19.539>
    created_at = <Date 2011-08-29.21:42:30.093>
    labels = ['type-bug', 'expert-unicode', 'docs']
    title = 'linebreak sequences should be better documented'
    updated_at = <Date 2016-06-15.02:01:19.538>
    user = 'https://bugs.python.org/MatthewBoehm'

    bugs.python.org fields:

    activity = <Date 2016-06-15.02:01:19.538>
    actor = 'martin.panter'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2016-06-15.02:01:19.539>
    closer = 'martin.panter'
    components = ['Documentation', 'Unicode']
    creation = <Date 2011-08-29.21:42:30.093>
    creator = 'Matthew.Boehm'
    dependencies = []
    files = ['23069', '23076', '23077', '38179', '38523', '38748', '43080']
    hgrepos = []
    issue_num = 12855
    keywords = ['patch']
    message_count = 24.0
    messages = ['143182', '143185', '143187', '143188', '143189', '143194', '143199', '143204', '143217', '143220', '143245', '223411', '225938', '236247', '238262', '238298', '238450', '239653', '239767', '266806', '266812', '268491', '268582', '268599']
    nosy_count = 9.0
    nosy_names = ['vstinner', 'r.david.murray', 'docs@python', 'python-dev', 'Matthew.Boehm', 'martin.panter', 'davidhalter', 'SMRUTI RANJAN SAHOO', 'Alexander Schrijver']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12855'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @MatthewBoehm
    Mannequin Author

    MatthewBoehm mannequin commented Aug 29, 2011

    A file opened with codecs.open() splits on a form feed character (\x0c) while a file opened with open() does not.

    >>> with open("formfeed.txt", "w") as f:
    ...   f.write("line \fone\nline two\n")
    ...
    >>> with open("formfeed.txt", "r") as f:
    ...   s = f.read()
    ...
    >>> s
    'line \x0cone\nline two\n'
    >>> print s
    line
        one
    line two
    
    >>> import codecs
    >>> with open("formfeed.txt", "rb") as f:
    ...   lines = f.readlines()
    ...
    >>> lines
    ['line \x0cone\n', 'line two\n']
    >>> with codecs.open("formfeed.txt", "r", encoding="ascii") as f:
    ...   lines2 = f.readlines()
    ...
    >>> lines2
    [u'line \x0c', u'one\n', u'line two\n']
    >>>

    Note that lines contains two items while lines2 has 3.

    bpo-7643 has a good discussion on newlines in python, but I did not see this discrepancy mentioned.
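The session above is Python 2. As a rough sketch of the same discrepancy in Python 3 (an assumption on my part; the report only covers 2.7), the file can be replaced by an in-memory buffer, with `io.TextIOWrapper` standing in for `open()` and `codecs.getreader()` standing in for `codecs.open()`:

```python
import codecs
import io

# Same content as formfeed.txt in the report, kept in memory.
data = b"line \x0cone\nline two\n"

# Built-in open() in text mode is backed by io.TextIOWrapper,
# which only treats \n, \r and \r\n as line endings.
text_lines = io.TextIOWrapper(io.BytesIO(data), encoding="ascii").readlines()

# codecs stream readers decode the data and then split it with
# str.splitlines(), which also breaks on the form feed.
codec_lines = codecs.getreader("ascii")(io.BytesIO(data)).readlines()

print(text_lines)   # two lines
print(codec_lines)  # three lines
```

If this sketch is right, the two readers disagree on the line count in exactly the way the report describes.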

    @MatthewBoehm MatthewBoehm mannequin added the interpreter-core and type-bug labels Aug 29, 2011
    @vstinner
    Member

    U+000C (Form feed) is considered a line boundary in Unicode (unicode type), but not for a byte string (str type).

    Example:

    >>> u'line \x0cone\nline two\n'.splitlines(True)
    [u'line \x0c', u'one\n', u'line two\n']
    >>> 'line \x0cone\nline two\n'.splitlines(True)
    ['line \x0cone\n', 'line two\n']
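For readers on Python 3, where the two types are `str` and `bytes`, a minimal equivalent of the example above might look like this (the variable names are illustrative only):

```python
text = "line \x0cone\nline two\n"

# str.splitlines() follows the Unicode line boundaries, including U+000C.
str_lines = text.splitlines(True)

# bytes.splitlines() only splits on \n, \r and \r\n.
bytes_lines = text.encode("ascii").splitlines(True)

print(str_lines)    # ['line \x0c', 'one\n', 'line two\n']
print(bytes_lines)  # [b'line \x0cone\n', b'line two\n']
```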

    @MatthewBoehm
    Mannequin Author

    MatthewBoehm mannequin commented Aug 29, 2011

    Thanks for explaining the reasoning.

    Perhaps I should add this to the python wiki (http://wiki.python.org/moin/Unicode)?

    It would be nice if it fit in the docs somewhere, but I'm not sure where.

    I'm curious how (or if) 2to3 would handle this as well, but I'm closing this issue as it's now clear to me why these two are expected to act differently.

    @MatthewBoehm MatthewBoehm mannequin closed this as completed Aug 29, 2011
    @vstinner
    Member

    > It would be nice if it fit in the docs somewhere,
    > but I'm not sure where.

    See:
    http://docs.python.org/library/codecs.html#codecs.StreamReader.readline

    Can you suggest a patch for the documentation? Source code of this document:
    http://hg.python.org/cpython/file/bb7b14dd5ded/Doc/library/codecs.rst

    @MatthewBoehm
    Mannequin Author

    MatthewBoehm mannequin commented Aug 29, 2011

    I'll suggest a patch for the documentation when I get to my home computer in an hour or two.

    @MatthewBoehm MatthewBoehm mannequin added the docs label and removed the interpreter-core label Aug 29, 2011
    @MatthewBoehm MatthewBoehm mannequin reopened this Aug 29, 2011
    @MatthewBoehm MatthewBoehm mannequin assigned docs@python Aug 29, 2011
    @MatthewBoehm
    Mannequin Author

    MatthewBoehm mannequin commented Aug 30, 2011

    I'm taking a look at the docs now.

    I'm considering adding a table/list of characters python treats as newlines, but it seems like this might fit better as a note in http://docs.python.org/library/stdtypes.html#str.splitlines or somewhere else in stdtypes. I'll start working on it now, but please let me know what you think about this.

    This is my first attempt at a patch, so I greatly appreciate your help so far.

    @MatthewBoehm
    Mannequin Author

    MatthewBoehm mannequin commented Aug 30, 2011

    I've attached a patch for python2.7 that adds a small note to library/stdtypes.html#str.splitlines explaining which sequences are treated as line breaks:

    """
    Note: Python recognizes "\r", "\n", and "\r\n" as line boundaries for strings.

    In addition to these, Unicode strings can have line boundaries of u"\x0b", u"\x0c", u"\x85", u"\u2028", and u"\u2029"
    """

    Additional thoughts:

    • Would it be better to put this note in a different place?

    • It looks like \x0b and \x0c (vertical tab and form feed) were first considered line breaks in Python 2.7, probably related to this note from "What's New in 2.7": "The Unicode database provided by the unicodedata module is now used internally to determine which characters are numeric, whitespace, or represent line breaks." It might be worth putting a "changed in 2.7" note somewhere in the docs.

    Please let me know of any thoughts you have and I'll be glad to make any desired changes and submit a new patch.

    @MatthewBoehm MatthewBoehm mannequin changed the title from "open() and codecs.open() treat form-feed differently" to "linebreak sequences should be better documented" Aug 30, 2011
    @vstinner
    Member

    > Would it be better to put this note in a different place?

    You may just say that StreamReader.readline() uses unicode.splitlines(), and so point to the unicode.splitlines() doc (use the :meth:`unicode.splitlines` syntax). unicode.splitlines() is not well documented: line boundaries are not listed, even in the Python 3 documentation.

    Unicode line boundaries used by Python 2.7 and 3.3:

    U+000A: Line feed
    U+000B: Line tabulation
    U+000C: Form feed
    U+000D: Carriage return
    U+001C: File separator
    U+001D: Group separator
    U+001E: Record separator
    U+0085: Next line ("control")
    U+2028: Line separator
    U+2029: Paragraph separator
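The list above can be checked directly. The following is a small sketch for Python 3 (the helper `splits_on` and the dictionary name are made up for illustration) that confirms `str.splitlines()` breaks on each listed character, while `bytes.splitlines()` ignores the ones outside `\n`, `\r` and `\r\n`:

```python
# The ten line boundaries from the list above, as Python 3 str escapes.
LINE_BOUNDARIES = {
    "\n": "Line feed",
    "\x0b": "Line tabulation",
    "\x0c": "Form feed",
    "\r": "Carriage return",
    "\x1c": "File separator",
    "\x1d": "Group separator",
    "\x1e": "Record separator",
    "\x85": "Next line",
    "\u2028": "Line separator",
    "\u2029": "Paragraph separator",
}


def splits_on(ch):
    """Return True if str.splitlines() treats ch as a line boundary."""
    return ("a" + ch + "b").splitlines() == ["a", "b"]


results = {name: splits_on(ch) for ch, name in LINE_BOUNDARIES.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")

# bytes objects only split on \n, \r and \r\n, so VT, FF and FS stay inline.
bytes_parts = b"a\x0bb\x0cc\x1cd\ne".splitlines()
print(bytes_parts)
```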

    > It looks like \x0b and \x0c (vertical tab and form feed) were first
    > considered line breaks in Python 2.7

    Correct: U+000B and U+000C were added to Python 2.7 and 3.2.

    > It might be worth putting a "changed in 2.7" note somewhere in the docs

    We add the following syntax exactly for this:

    .. versionchanged:: 2.6
       Also unset environment variables when calling :meth:`os.environ.clear`
       and :meth:`os.environ.pop`.

    If you downloaded Python source code, go into Doc/ directory and run "make html" to compile the doc to HTML.

    http://docs.python.org/devguide/setup.html
    http://docs.python.org/devguide/docquality.html

    @MatthewBoehm
    Mannequin Author

    MatthewBoehm mannequin commented Aug 30, 2011

    I can fix the patch to list all the unicode line boundaries. The three places I've considered putting it are:

    1. On the howto/unicode.html

    2. Somewhere in the stdtypes.html#typesseq description (maybe with other notes at the bottom)

    3. As a note to the stdtypes.html#str.splitlines method description (where it is in the previous patch.)

    I can move it to any of these places if you think it's a better fit. I'll fix the list so that it's complete, add a note about \x0b and \x0c being added in 2.7/3.2, and possibly reference it from StreamReader.readline.

    After confirming that my documentation matches the style guide, I'll make the docs, test the output, and upload a patch. I can do this for 2.7, 3.2 and 3.3 separately.

    Let me know if that sounds good and if you have any further thoughts. I should be able to upload new patches in 10 hours (after work today).

    @vstinner
    Member

    > 1. On the howto/unicode.html
    > 2. Somewhere in the stdtypes.html#typesseq description (maybe with other notes at the bottom)
    > 3. As a note to the stdtypes.html#str.splitlines method description (where it is in the previous patch.)

    (3) is the best place. For Python 2, you should add a new unicode.splitlines entry, whereas the str.splitlines should be updated in Python 3.

    > I can do this for 2.7, 3.2 and 3.3 separately.

    You don't have to do it for 3.3: 2.7 and 3.2 are enough (I will do the change in 3.3 using Mercurial).

    @MatthewBoehm
    Mannequin Author

    MatthewBoehm mannequin commented Aug 31, 2011

    I've attached a patch for 2.7 and will attach one for 3.2 in a minute.

    I built the docs for both 2.7 and 3.2 and verified that there were no warnings and that the resulting web pages looked okay.

    Things to consider:

    • Placement of unicode.splitlines() method: I placed it next to str.splitlines. I didn't want to place it with the unicode methods further down because docs say "The following methods are present only on unicode objects"

    • The docs for codecs.StreamReader.readlines() already mention "Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true."

    • Feel free to make any wording/style suggestions.

    @davidhalter
    Mannequin

    davidhalter mannequin commented Jul 18, 2014

    I would vote for the inclusion of that patch. I just stumbled over this.

    @vadmium
    Member

    vadmium commented Aug 26, 2014

    Any reason why characters 1C–1E are excluded?

    @vadmium
    Member

    vadmium commented Feb 20, 2015

    Posting linebreakdoc.v3.py3.5.patch:

    • Rebased onto recent “default” (3.5) branch
    • Add missing 1C–1E codes
    • Dropped reference to “universal newlines”, since that only handles CRs and LFs as I understand it

    The newlines are already tested by test_unicodedata.UnicodeMiscTest.test_linebreak_7643() when the VT and FF codes were added in bpo-7643.
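The point about universal newlines handling only CRs and LFs can be illustrated with a short Python 3 sketch. It assumes an in-memory `io.TextIOWrapper` with `newline=None` is a fair stand-in for a file opened in universal newlines mode:

```python
import io

raw = b"one\rtwo\r\nthree\x0cfour\n"

# Universal newlines mode: \r and \r\n are translated to \n,
# but the form feed is left alone and does not end a line.
universal = io.TextIOWrapper(
    io.BytesIO(raw), encoding="ascii", newline=None
).readlines()

# str.splitlines() additionally breaks on the form feed.
split = raw.decode("ascii").splitlines()

print(universal)  # ['one\n', 'two\n', 'three\x0cfour\n']
print(split)      # ['one', 'two', 'three', 'four']
```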

    @vadmium
    Member

    vadmium commented Mar 17, 2015

    Note to self, or anyone else handling this patch: See <https://bugs.python.org/issue22232#msg225769> for further improvement ideas:

    • Might be good to bring back the reference to universal newlines, but say it accepts additional line boundaries
    • Terry also suggested a doc string improvement

    @SMRUTIRANJANSAHOO
    Mannequin

    SMRUTIRANJANSAHOO mannequin commented Mar 17, 2015

    I think that in "line \fone\nline two\n", the space after "line" is taking some garbage value, or you could say the hex value of "\". That is why it shows a hex value. If you write "\n " instead of "\", you will not see that hex value. I have attached an image of my IDLE session here.

    @bitdancer
    Member

    SMRUTI: \f is the python escape code for the ASCII formfeed character. It is the handling of that ASCII character (among others) that this issue is discussing.

    @vadmium
    Member

    vadmium commented Mar 31, 2015

    Patch v4 adds back the reference to “universal newlines”. I did not alter the doc string, because I don’t think doc strings need to be as detailed as the main documentation.

    @python-dev
    Mannequin

    python-dev mannequin commented Apr 1, 2015

    New changeset 6244a5dbaf84 by Benjamin Peterson in branch '3.4':
    document what exactly str.splitlines() splits on (closes bpo-12855)
    https://hg.python.org/cpython/rev/6244a5dbaf84

    New changeset 87af6deb5d26 by Benjamin Peterson in branch 'default':
    merge 3.4 (bpo-12855)
    https://hg.python.org/cpython/rev/87af6deb5d26

    @python-dev python-dev mannequin closed this as completed Apr 1, 2015
    @vadmium
    Member

    vadmium commented Jun 1, 2016

    Reopening to change the Python 2 documentation. A starting point may be Matthew’s patch and/or Alexander’s patch in bpo-22232.

    @vadmium vadmium reopened this Jun 1, 2016
    @vadmium
    Member

    vadmium commented Jun 1, 2016

    Here is an updated patch for Python 2, based on Benjamin’s commit, Matthew’s earlier py27 patch, and Alexander’s backport of related changes from Python 3. Let me know what you think.

    @vadmium
    Member

    vadmium commented Jun 14, 2016

    Alexander: does my latest patch linebreakdoc.v5.py2.7.patch address your concerns about the 2.7 documentation? If so, I can push it to the repository.

    @AlexanderSchrijver
    Mannequin

    AlexanderSchrijver mannequin commented Jun 14, 2016

    Martin: Yes, it does, thank you. Sorry, I didn't know you were waiting for my approval.

    @python-dev
    Mannequin

    python-dev mannequin commented Jun 15, 2016

    New changeset 2e6fda267a20 by Martin Panter in branch '2.7':
    Issue bpo-12855: Document what exactly unicode.splitlines() splits on
    https://hg.python.org/cpython/rev/2e6fda267a20

    @vadmium vadmium closed this as completed Jun 15, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022