linebreak sequences should be better documented #57064
Comments
A file opened with codecs.open() splits on a form feed character (\x0c) while a file opened with open() does not.
>>> with open("formfeed.txt", "w") as f:
...     f.write("line \fone\nline two\n")
...
>>> with open("formfeed.txt", "r") as f:
...     s = f.read()
...
>>> s
'line \x0cone\nline two\n'
>>> print s
line
one
line two
>>> import codecs
>>> with open("formfeed.txt", "rb") as f:
...     lines = f.readlines()
...
>>> lines
['line \x0cone\n', 'line two\n']
>>> with codecs.open("formfeed.txt", "r", encoding="ascii") as f:
...     lines2 = f.readlines()
...
>>> lines2
[u'line \x0c', u'one\n', u'line two\n']
Note that lines contains two items while lines2 has three. bpo-7643 has a good discussion on newlines in Python, but I did not see this discrepancy mentioned. |
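For readers on Python 3, the same discrepancy can be sketched without touching the filesystem. This is only an illustration under assumptions: it swaps the formfeed.txt file for an in-memory io.BytesIO buffer, uses io.TextIOWrapper to stand in for the built-in open() in text mode, and uses codecs.getreader() to stand in for codecs.open(). The variable names are made up for the example.

```python
import codecs
import io

# Same content as formfeed.txt in the report, kept in memory (assumption:
# a BytesIO buffer behaves like the file for this purpose).
data = b"line \x0cone\nline two\n"

# Text layer used by the built-in open(): only \n, \r and \r\n end a line.
text_lines = io.TextIOWrapper(io.BytesIO(data), encoding="ascii").readlines()

# codecs stream reader: decodes the bytes, then splits with str.splitlines(),
# which also treats form feed as a line boundary.
codec_lines = codecs.getreader("ascii")(io.BytesIO(data)).readlines()

print(text_lines)
print(codec_lines)
```

The first list should have two items and the second three, matching lines and lines2 in the session above.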
U+000C (form feed) is considered a line boundary in a Unicode string (unicode type), but not in a byte string (str type). Example:
>>> u'line \x0cone\nline two\n'.splitlines(True)
[u'line \x0c', u'one\n', u'line two\n']
>>> 'line \x0cone\nline two\n'.splitlines(True)
['line \x0cone\n', 'line two\n'] |
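The Python 3 equivalent of this comparison, as a small sketch: there the unicode type is str and the byte string type is bytes. The names sample, str_lines and bytes_lines are chosen only for this example.

```python
sample = "line \x0cone\nline two\n"

# str.splitlines() follows the Unicode notion of a line boundary,
# so the form feed (U+000C) splits the first line in two.
str_lines = sample.splitlines(True)

# bytes.splitlines() only recognises \n, \r and \r\n.
bytes_lines = sample.encode("ascii").splitlines(True)

print(str_lines)
print(bytes_lines)
```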
Thanks for explaining the reasoning. Perhaps I should add this to the Python wiki (http://wiki.python.org/moin/Unicode)? It would be nice if it fit in the docs somewhere, but I'm not sure where. I'm also curious how (or if) 2to3 would handle this, but I'm closing this issue as it's now clear to me why these two are expected to act differently. |
See: Can you suggest a patch for the documentation? Source code of this document: |
I'll suggest a patch for the documentation when I get to my home computer in an hour or two. |
I'm taking a look at the docs now. I'm considering adding a table/list of characters python treats as newlines, but it seems like this might fit better as a note in http://docs.python.org/library/stdtypes.html#str.splitlines or somewhere else in stdtypes. I'll start working on it now, but please let me know what you think about this. This is my first attempt at a patch, so I greatly appreciate your help so far. |
I've attached a patch for Python 2.7 that adds a small note to library/stdtypes.html#str.splitlines explaining which sequences are treated as line breaks:
""" In addition to these, Unicode strings can have line boundaries of u"\x0b", u"\x0c", u"\x85", u"\u2028", and u"\u2029" """
Additional thoughts:
Please let me know of any thoughts you have and I'll be glad to make any desired changes and submit a new patch. |
You may just say that StreamReader.readline() uses unicode.splitlines(), and so point to the unicode.splitlines() doc (use :meth:`unicode.splitlines` syntax). unicode.splitlines() is not well documented: line boundaries are not listed, even in the Python 3 documentation. Unicode line boundaries used by Python 2.7 and 3.3: U+000A: Line feed
Correct: U+000B and U+000C were added to Python 2.7 and 3.2.
We add the following syntax exactly for this:
.. versionchanged:: 2.6
If you downloaded the Python source code, go into the Doc/ directory and run "make html" to compile the docs to HTML. http://docs.python.org/devguide/setup.html |
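One way to check which single characters an interpreter actually treats as line boundaries is to ask str.splitlines() directly. The following is a rough Python 3 sketch, not part of any attached patch. The helper name line_boundaries is hypothetical, and the two-character sequence \r\n is not covered because the loop only probes single code points.

```python
import sys


def line_boundaries():
    """Return the code points that str.splitlines() treats as a line boundary."""
    found = []
    for cp in range(sys.maxunicode + 1):
        # A boundary character between "a" and "b" yields exactly two lines.
        if len(("a" + chr(cp) + "b").splitlines()) == 2:
            found.append(cp)
    return found


boundaries = line_boundaries()
print(" ".join("U+%04X" % cp for cp in boundaries))
```

On a current CPython 3 this should report U+000A through U+000D, U+001C through U+001E, U+0085, U+2028 and U+2029. Older interpreters may differ, given that U+000B and U+000C were only added in 2.7 and 3.2.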
I can fix the patch to list all the unicode line boundaries. The three places I've considered putting it are:
I can move it to any of these places if you think it's a better fit. I'll fix the list so that it's complete, add a note about \x0b and \x0c being added in 2.7/3.2, and possibly reference it from StreamReader.readline. After confirming that my documentation matches the style guide, I'll make the docs, test the output, and upload a patch. I can do this for 2.7, 3.2 and 3.3 separately. Let me know if that sounds good and if you have any further thoughts. I should be able to upload new patches in 10 hours (after work today). |
(3) is the best place. For Python 2, you should add a new unicode.splitlines entry, whereas the str.splitlines should be updated in Python 3.
You don't have to do it for 3.3: 2.7 and 3.2 are enough (I will do the change in 3.3 using Mercurial). |
I've attached a patch for 2.7 and will attach one for 3.2 in a minute. I built the docs for both 2.7 and 3.2 and verified that there were no warnings and that the resulting web pages looked okay. Things to consider:
|
I would vote for the inclusion of that patch. I just stumbled over this. |
Any reason why characters 1C–1E are excluded? |
Posting linebreakdoc.v3.py3.5.patch:
The newlines are already tested by test_unicodedata.UnicodeMiscTest.test_linebreak_7643() when the VT and FF codes were added in bpo-7643. |
Note to self, or anyone else handling this patch: See <https://bugs.python.org/issue22232#msg225769> for further improvement ideas:
|
I think that in "line \fone\nline two\n" the space after "line" is picking up some garbage value, or you could say the hex value of "\", and that is why a hex value is shown. If you write "\n " instead of "\", the hex value does not appear. I have attached a screenshot of my IDLE session here. |
SMRUTI: \f is the python escape code for the ASCII formfeed character. It is the handling of that ASCII character (among others) that this issue is discussing. |
Patch v4 adds back the reference to “universal newlines”. I did not alter the doc string, because I don’t think doc strings need to be as detailed as the main documentation. |
New changeset 6244a5dbaf84 by Benjamin Peterson in branch '3.4':
New changeset 87af6deb5d26 by Benjamin Peterson in branch 'default': |
Reopening to change the Python 2 documentation. A starting point may be Matthew’s patch and/or Alexander’s patch in bpo-22232. |
Here is an updated patch for Python 2, based on Benjamin’s commit, Matthew’s earlier py27 patch, and Alexander’s backport of related changes from Python 3. Let me know what you think. |
Alexander: does my latest patch linebreakdoc.v5.py2.7.patch address your concerns about the 2.7 documentation? If so, I can push it to the repository. |
Martin: Yes, it does, thank you. Sorry, I didn't know you were waiting for my approval. |
New changeset 2e6fda267a20 by Martin Panter in branch '2.7': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.