Message 225561 - Python tracker


Author	scharron
Recipients	ezio.melotti, scharron, vstinner
Date	2014-08-20.10:01:50
Message-id	<1408528911.37.0.452679392827.issue22232@psf.upfronthosting.co.za>
In-reply-to

Content

According to the documentation, str.splitlines uses universal newlines to split lines.
The glossary defines universal newlines as \r, \n, and \r\n only (https://docs.python.org/3.5/glossary.html#term-universal-newlines).

However, it also splits on other characters. Reading the code, the list of characters seems to come from the linebreak array in _PyUnicode_Init, in Objects/unicodeobject.c.
Testing any of these characters confirms that it splits the string.
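A quick way to reproduce this is sketched below. The character list is an assumption taken from the line boundaries that the current str.splitlines documentation enumerates, not copied from the linebreak array itself; the `left`/`right` test strings are made up for illustration.

```python
# Characters the current str.splitlines documentation lists as line
# boundaries (assumed here; not copied from Objects/unicodeobject.c).
LINE_BOUNDARIES = {
    "\n": "line feed",
    "\r": "carriage return",
    "\r\n": "carriage return + line feed",
    "\x0b": "line tabulation",
    "\x0c": "form feed",
    "\x1c": "file separator",
    "\x1d": "group separator",
    "\x1e": "record separator",
    "\x85": "next line (C1)",
    "\u2028": "line separator",
    "\u2029": "paragraph separator",
}

results = {}
for sep, name in LINE_BOUNDARIES.items():
    # Put the candidate between two words and see whether it splits.
    results[sep] = f"left{sep}right".splitlines()
    print(f"{name:<28} {sep!r:<10} -> {results[sep]}")

# bytes.splitlines, by contrast, only recognises \r, \n and \r\n.
bytes_result = b"left\x0bright".splitlines()
print(bytes_result)
```

Every entry in `results` comes back as two pieces, while the bytes variant leaves the vertical tab alone.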

Other libraries use str.splitlines on the assumption that it only breaks on \r and \n. This is the case for email.feedparser, for instance, which http.client uses to parse headers. HTTP headers should be separated by CRLF, as specified by http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.
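To illustrate why this matters for header parsing, here is a minimal sketch that calls str.splitlines directly on a header block. It does not go through email.feedparser or http.client, and the header names and values are invented for the example.

```python
# A made-up header block whose second value contains a form feed (\x0c).
raw_headers = "Host: example.org\r\nX-Note: left\x0cright\r\n"

# splitlines treats the form feed as a line boundary, so the single
# X-Note header is cut into two "lines".
via_splitlines = raw_headers.splitlines()
print(via_splitlines)

# Splitting on CRLF only keeps the header value in one piece
# (and leaves a trailing empty string after the final CRLF).
via_crlf = raw_headers.split("\r\n")
print(via_crlf)
```

A parser that expects one header per element would see `right` as a separate, malformed line in the first case.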

Either the documentation should state that splitlines also splits on other characters, or the implementation should follow the documentation and split only on \r, \n, and \r\n.
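For callers that need the second behaviour today, one possible workaround is a small helper built on a regular expression. This is only a sketch: `split_universal_newlines` is a hypothetical name, not an existing standard library function.

```python
import re

# Match \r\n first so a CRLF pair counts as one boundary, not two.
_UNIVERSAL_NEWLINE = re.compile(r"\r\n|\r|\n")


def split_universal_newlines(text):
    """Split text on \\r, \\n and \\r\\n only (hypothetical helper)."""
    lines = _UNIVERSAL_NEWLINE.split(text)
    # Mirror str.splitlines: a trailing newline does not produce a
    # final empty string, and an empty input gives an empty list.
    if lines and lines[-1] == "":
        lines.pop()
    return lines


sample = "one\x0bstill one\r\ntwo\rthree\n"
print(split_universal_newlines(sample))
print(sample.splitlines())
```

On input that contains only \r and \n boundaries, the helper is meant to agree with str.splitlines; the two differ only when one of the extra characters is present.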

If it is meant to split on other characters, the list could be improved, since the Unicode reference lists the mandatory characters for line breaking: http://www.unicode.org/reports/tr14/tr14-32.html#BK
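To compare against the TR14 list by hand, the code points that the running interpreter actually treats as line boundaries can be enumerated by brute force. This is a rough sketch: it scans every code point, so it takes a moment, and the result may depend on the Python version in use.

```python
import sys
import unicodedata

# Collect every code point that makes str.splitlines break a string.
breakers = [
    cp
    for cp in range(sys.maxunicode + 1)
    if len(("a" + chr(cp) + "b").splitlines()) == 2
]

for cp in breakers:
    # Control characters have no Unicode name, hence the fallback.
    name = unicodedata.name(chr(cp), "<unnamed control>")
    print(f"U+{cp:04X}  {name}")
```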

History
Date	User	Action	Args
2014-08-20 10:01:51	scharron	set	recipients: + scharron, vstinner, ezio.melotti
2014-08-20 10:01:51	scharron	set	messageid: <1408528911.37.0.452679392827.issue22232@psf.upfronthosting.co.za>
2014-08-20 10:01:51	scharron	link	issue22232 messages
2014-08-20 10:01:50	scharron	create