What is a Unicode line break character? #51892
Comments
Bytes objects and Unicode objects do not agree on ASCII linebreaks.

## Python 2
for s in '\x0a\x0d\x1c\x1d\x1e':
    print u'a{}b'.format(s).splitlines(1), 'a{}b'.format(s).splitlines(1)
# [u'a\n', u'b'] ['a\n', 'b']

## Python 3
for s in '\x0a\x0d\x1c\x1d\x1e':
    print('a{}b'.format(s).splitlines(1),
          bytes('a{}b'.format(s), 'utf-8').splitlines(1))
# ['a\n', 'b'] [b'a\n', b'b']
|
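For readers who want to reproduce the report on a current interpreter, here is a minimal sketch that scans the whole ASCII range instead of five hand-picked characters. The helper name `differing_ascii_breaks` is made up for illustration, and the result depends on the Python version: at the time of the report only FS, GS and RS (0x1c to 0x1e) differed, while recent CPython 3 releases should also report VT and FF.

```python
# Find the ASCII code points where str.splitlines() and bytes.splitlines()
# disagree. Assumes Python 3; the helper name is hypothetical.

def differing_ascii_breaks():
    """Return ASCII code points that only one of the two types splits on."""
    differing = []
    for code in range(128):
        text = 'a{}b'.format(chr(code))
        as_str = text.splitlines(True)
        as_bytes = text.encode('ascii').splitlines(True)
        # Encode the str pieces so both results can be compared directly.
        if [part.encode('ascii') for part in as_str] != as_bytes:
            differing.append(code)
    return differing


if __name__ == '__main__':
    for code in differing_ascii_breaks():
        print('0x{:02x}'.format(code))
```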
Florent Xicluna wrote:
Unicode has more line break characters defined than ASCII, which See e.g. http://en.wikipedia.org/wiki/Ascii#ASCII_control_characters The three extra code points Unicode defines for line breaks are |
'\x85', when decoded using latin-1, is simply transcoded to u'\x85', which is treated as NEL (a C1 control code equivalent to end of line). This changes iteration over the file once you decode, and it actually broke our CSV parsing code when we got some latin-1 encoded data containing \x85 from a customer. |
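The effect described above can be sketched as follows. The sample data and the helper name `compare_splits` are invented for illustration; the sketch only shows that the same input yields a different number of lines before and after decoding, and that splitting on '\n' explicitly is one way to keep NEL inside a field.

```python
# Hypothetical sample: two newline-terminated records, the second of which
# contains the byte 0x85 (NEL once decoded as latin-1).
RAW = b'id,comment\n1,wait\x85 what\n'


def compare_splits(raw):
    """Split the same data as bytes, as str, and on '\\n' only."""
    text = raw.decode('latin-1')
    return {
        'bytes.splitlines': raw.splitlines(),
        'str.splitlines': text.splitlines(),
        'str.split': text.split('\n'),
    }


if __name__ == '__main__':
    for name, parts in compare_splits(RAW).items():
        print(name, parts)
```

Note that `str.split('\n')` leaves a trailing empty string when the data ends with a newline, so it is not a drop-in replacement for `splitlines()`.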
Some technical background.

== Unicode ==

According to the Unicode Standard Annex #9, a character with
As a consequence, there are 3 reasons to identify a character as a
There are 8 linebreaks in the current Unicode Database (5.2):

== ASCII ==

The Standard ASCII control codes (C0) are in the range 00-1F.

In conclusion, it may be better to keep things unchanged.
|
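The count of eight can be checked against the `unicodedata` module. This is a sketch under the assumption that a linebreak is any character with general category Zl or Zp, or bidirectional type B (the rule quoted in the old comment on the linebreak check later in this thread); the function name `category_or_bidi_linebreaks` is hypothetical, and the result reflects the Unicode database bundled with the running interpreter rather than 5.2 specifically.

```python
import sys
import unicodedata


def category_or_bidi_linebreaks():
    """Code points with category Zl/Zp or bidirectional type 'B'."""
    found = []
    for code in range(sys.maxunicode + 1):
        ch = chr(code)
        if (unicodedata.category(ch) in ('Zl', 'Zp')
                or unicodedata.bidirectional(ch) == 'B'):
            found.append(code)
    return found


if __name__ == '__main__':
    for code in category_or_bidi_linebreaks():
        ch = chr(code)
        print('U+{:04X} category={} bidi={}'.format(
            code, unicodedata.category(ch), unicodedata.bidirectional(ch)))
```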
Documenting the characters that splitlines treats as newlines for Unicode should definitely be done. |
It's confusing. There's a specific annex UAX #14 which defines "Line Breaking Properties". And the resulting list is different: Differences:
According to this Annex, the current splitlines() implementation violates the Unicode standard.
|
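A small sketch of how the two lists can be compared. Python's `unicodedata` module does not expose the Line_Break property, so the UAX #14 table below is hard-coded from my reading of LineBreak.txt (classes BK, CR, LF and NL) and should be treated as an assumption; the second set is the eight code points selected by category Zl/Zp or bidirectional type B. All names are illustrative.

```python
# Assumption: code points whose UAX #14 line break class forces a break,
# transcribed by hand from LineBreak.txt.
UAX14_MANDATORY = {
    0x000A: 'LF',
    0x000B: 'BK',   # VT
    0x000C: 'BK',   # FF
    0x000D: 'CR',
    0x0085: 'NL',
    0x2028: 'BK',   # LINE SEPARATOR
    0x2029: 'BK',   # PARAGRAPH SEPARATOR
}

# Code points with category Zl/Zp or bidirectional type B.
CATEGORY_OR_BIDI = {0x0A, 0x0D, 0x1C, 0x1D, 0x1E, 0x85, 0x2028, 0x2029}

# Present in one list but not in the other.
only_in_uax14 = sorted(set(UAX14_MANDATORY) - CATEGORY_OR_BIDI)
only_in_category_or_bidi = sorted(CATEGORY_OR_BIDI - set(UAX14_MANDATORY))

if __name__ == '__main__':
    print('only UAX #14:', ['U+{:04X}'.format(c) for c in only_in_uax14])
    print('only Zl/Zp/B:', ['U+{:04X}'.format(c)
                            for c in only_in_category_or_bidi])
```

Under these assumptions, VT and FF appear only in the UAX #14 list, while FS, GS and RS appear only in the category/bidi list.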
Florent Xicluna wrote:
This definition is what we use in Python for Py_UNICODE_ISLINEBREAK(ch).
And that's the list we're currently using.
Agreed.
For ASCII we should make the list of characters explicit.
|
Florent Xicluna wrote:
Note that a line breaking algorithm is something different than .splitlines() implements a line splitting algorithm, not a line
It appears so and I guess that's an oversight on my part when Since we only used the main UnicodeData.txt file as basis for I guess we'll have to add FF and VT to the generator makeunicodedata.py
Thanks, Marc-Andre Lemburg |
Here is a draft of the patch to do what Marc-Andre proposed in msg97440 (add VT and FF). The implementation uses field 16 as defined in the "py3k" implementation of "makeunicodedata.py". It should minimize differences between the Py2 and Py3 implementations. Documentation and tests are missing.

- /* Returns 1 for Unicode characters having the category 'Zl',
-  * 'Zp' or type 'B', 0 otherwise.
+ /* Returns 1 for Unicode characters having the line break
+  * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
+  * type 'B', 0 otherwise.
   */

Note: "remove_deprecation" should be applied first to remove the "-3" warnings. |
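The rule in the new comment can be modelled in Python and compared with what `str.splitlines()` does in the running interpreter. This is only a sketch, not the C implementation: `is_linebreak` and `splits_line` are hypothetical helpers, the BK/CR/LF/NL set is hard-coded because `unicodedata` has no Line_Break lookup, and the scan is limited to the Basic Multilingual Plane to keep it fast.

```python
import unicodedata

# Assumption: code points with line break class BK, CR, LF or NL,
# transcribed by hand from LineBreak.txt.
MANDATORY_BREAK = frozenset({0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029})


def is_linebreak(ch):
    """Python model of the rule described in the patched comment."""
    return ord(ch) in MANDATORY_BREAK or unicodedata.bidirectional(ch) == 'B'


def splits_line(ch):
    """True if str.splitlines() treats ch as a line boundary."""
    return len(('a' + ch + 'b').splitlines()) == 2


# Code points in the BMP where the model and splitlines() disagree.
mismatches = [code for code in range(0x10000)
              if is_linebreak(chr(code)) != splits_line(chr(code))]

if __name__ == '__main__':
    print('mismatches:', ['U+{:04X}'.format(c) for c in mismatches])
```

An empty mismatch list suggests the interpreter follows the patched rule; on an interpreter that predates the change, VT and FF would be expected to show up.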
I don't know what to do about this:
I know they are not commonly used. So we can keep them as line breaks. |
Florent Xicluna wrote:
Right. The only update we'd have to do is add FF and VT. I am a little worried about the possible breakage this may cause, VTs are probably a non-issue, since they are not in common use. |
Then I must ask, why did the string attribute behave differently? I added it to allow for that, and the behavior seems inconsistent. |
My bad, wrong bug. |
Cleanup committed as r78982. Patch for LineBreak.txt updated after the UCD upgrade to 5.2. Tests added to test_unicodedata. Backward compatibility concern:
The choice is either to preserve backward compatibility, or to comply with the specification (UAX #14). |
unwatched |
Which functions are affected by this change? |
Committed to trunk: r79494 and r79496. Afaict, it changes Py_UNICODE_ISLINEBREAK, _PyUnicode_IsLinebreak and the Unicode functions which depend on it (splitlines(), _sre module). |
Ported to 3.x with r79506 |