Issue 8859: split() splits on non-whitespace char when there is no separator given.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53105

classification

Title:	split() splits on non-whitespace char when there is no separator given.
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 2.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	PeterL, ezio.melotti, pitrou, rhettinger
Priority:	normal	Keywords:

Created on 2010-05-30 18:54 by PeterL, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (8)
msg106773	Author: Peter Landgren (PeterL)	Date: 2010-05-30 18:54
When the variable label is equal to '\xc5\xa0 Z\nX W' this line sequence

    label = " ".join(label.split())
    label = unicode(label)

results in:

    7347: ERROR: gramps.py: line 138: Unhandled exception
    Traceback (most recent call last):
      File "C:\Program Files (x86)\gramps\gui\views\listview.py", line 660, in row_changed
        self.uistate.modify_statusbar(self.dbstate)
      File "C:\Program Files (x86)\gramps\DisplayState.py", line 521, in modify_statusbar
        name, obj = navigation_label(dbstate.db, nav_type, active_handle)
      File "C:\Program Files (x86)\gramps\Utils.py", line 1358, in navigation_label
        label = unicode(label)
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

While this line sequence:

    label = unicode(label)
    label = " ".join(label.split())

gives correct result and no error.

With the error the variable label changes from '\xc5\xa0 Z\nX W' to '\xc5 Z X W' by the line:

    label = " ".join(label.split())

Note '\xa0' has been dropped, interpreted as "whitespace"? This happens on Windows. It works perfectly well on Linux.
msg106774	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-05-30 19:12
Both on Linux and Windows I get:

    >>> '\xa0'.isspace()
    False
    >>> u'\xa0'.isspace()
    True

The Unicode char u'\xa0' is U+00A0 NO-BREAK SPACE, so unicode.split correctly considers it a whitespace. However '\xa0' is not a whitespace, so str.split ignores it. The correct solution is to convert your string to Unicode and then split.

I'd close this as invalid but I'd like you to confirm that the example I posted and that 'split' return the same result on both Linux and Windows before doing so (the fact that on Linux works it's probably caused by something else -- e.g. the label is already Unicode).
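The distinction described above can be tried in Python 3, where bytes roughly takes the place of Python 2's str and str takes the place of unicode. This is only a sketch under that assumption; the variable names are made up for illustration and the thread itself used Python 2.6.

```python
# Python 3 sketch: bytes stands in for Python 2's str,
# and str stands in for Python 2's unicode.
raw = b"\xc5\xa0 Z\nX W"  # UTF-8 bytes for "Š Z\nX W"

# Byte strings only recognise ASCII whitespace, so 0xA0 is kept.
byte_is_space = b"\xa0".isspace()
byte_parts = raw.split()

# Text strings follow Unicode rules, where U+00A0 counts as whitespace.
text_is_space = "\xa0".isspace()
text_parts = raw.decode("utf-8").split()

print(byte_is_space, byte_parts)
print(text_is_space, text_parts)
```

Splitting the raw bytes leaves the two-byte sequence for Š intact, and splitting after decoding never sees a lone 0xA0 because that byte has already become part of U+0160.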
msg106775	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-30 19:13
What do you mean, "works perfectly well under Linux"? The error also happens under Linux here, and is expected: you can't call unicode() without an encoding and expect it to decode properly non-ASCII chars (and \xa0 is a non-ASCII char).
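A rough Python 3 illustration of this point: decoding non-ASCII bytes only works when the codec is named. The ASCII codec below stands in for Python 2's usual default encoding, which is an assumption about the setup rather than something shown in the report.

```python
raw = b"\xc5\xa0"  # UTF-8 encoding of U+0160 (Š)

# Decoding with the ASCII codec fails on any byte >= 0x80.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the actual encoding yields the intended single character.
text = raw.decode("utf-8")

print(ascii_ok, ascii(text))
```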
msg106776	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-30 19:16
Oh, and I agree with Ezio, this is most likely not a bug at all and should probably be closed.
msg106778	Author: Peter Landgren (PeterL)	Date: 2010-05-30 20:03
I am not sure I can follow you. I will try to be more specific. The test string consists originally of one character; the Czech Š.

1. On Linux with Python 2.6.4

1.1 If I keep the original code line order:

    label = obj.get()
    print type(label), repr(label)
    label = " ".join(label.split())
    print type(label), repr(label)
    label = unicode(label)
    if len(label) > 40:
        label = label[:40] + "..."

Both lines print type(label), repr(label) gives:

    <type 'str'> '\xc5\xa0'

1.2 If I change order and take the unicode conversion first:

    label = obj.get()
    label = unicode(label)
    print type(label), repr(label)
    label = " ".join(label.split())
    print type(label), repr(label)
    if len(label) > 40:
        label = label[:40] + "..."

Both lines print type(label), repr(label) gives:

    <type 'unicode'> u'\u0160'

2. On Windows with Python 2.6.5

2.1 The original code line order: The lines print type(label), repr(label) gives

    <type 'str'> '\xc5\xa0'
    <type 'str'> '\xc5'
    8217: ERROR: gramps.py: line 138: Unhandled exception ....

2.2 If I change order and take the unicode conversion first: Both lines print type(label), repr(label) gives:

    <type 'unicode'> u'\u0160'

3. If I use this little code:

    # -- coding: utf-8 --
    label = 'Š'
    print type(label), repr(label)
    label = " ".join(label.split())
    print type(label), repr(label)

I get

    <type 'str'> '\xc5\xa0'
    <type 'str'> '\xc5\xa0'

on both Linux and Windows.

The examples above under 1. and 2. comes from an application, Gramps. There is still something I don't understand.
msg106779	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-05-30 20:20
I think the problem is in the default encoding used when you call unicode() without specifying any encoding.

    >>> '\xc5\xa0'.decode('iso-8859-1').split()
    [u'\xc5']
    >>> '\xc5\xa0'.decode('utf-8').split()
    [u'\u0160']
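The same comparison can be sketched in Python 3, together with the follow-on effect: once the bytes are read as ISO-8859-1 and split, re-encoding leaves only the lead byte 0xC5, which is no longer valid UTF-8. The variable names are illustrative, and this is a reconstruction of the symptom, not a confirmed account of what Gramps did on Windows.

```python
raw = b"\xc5\xa0"  # UTF-8 bytes for Š

# Read as ISO-8859-1, the second byte becomes U+00A0 and is split away.
as_latin1 = raw.decode("iso-8859-1").split()
# Read as UTF-8, both bytes form the single character U+0160.
as_utf8 = raw.decode("utf-8").split()

# Turning the Latin-1 result back into bytes leaves a dangling lead byte.
damaged = " ".join(as_latin1).encode("iso-8859-1")
try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print(as_latin1, as_utf8, damaged, still_valid)
```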
msg106785	Author: Raymond Hettinger (rhettinger) *	Date: 2010-05-31 04:13
I also agree this should be closed.
msg106787	Author: Peter Landgren (PeterL)	Date: 2010-05-31 07:13
So as a summary to what Ezio Melotti said: I should always specify encoding when calling split() to be sure nothing nasty happens? (I believe Ezio Melotti meant "calling split()" not "calling unicode()" in his last answer?) Thanks for pointing this out.
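For reference, the ordering recommended earlier in the thread (convert to Unicode with an explicit encoding, then split) might look like this in Python 3. Note that split() takes no encoding argument; the encoding is supplied at the decoding step. The helper name normalize_label and the UTF-8 default are assumptions for illustration and are not taken from Gramps.

```python
def normalize_label(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode first, then collapse each run of whitespace to one space."""
    text = raw.decode(encoding)  # the encoding is given here, not to split()
    return " ".join(text.split())


label = normalize_label(b"\xc5\xa0 Z\nX W")
# Same truncation rule as in the snippets quoted in the thread.
if len(label) > 40:
    label = label[:40] + "..."
print(label)
```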

History
Date	User	Action	Args
2022-04-11 14:57:01	admin	set	github: 53105
2010-05-31 07:13:58	PeterL	set	messages: + msg106787
2010-05-31 04:13:06	rhettinger	set	status: open -> closed nosy: + rhettinger messages: + msg106785
2010-05-30 20:20:52	ezio.melotti	set	messages: + msg106779
2010-05-30 20:03:47	PeterL	set	messages: + msg106778
2010-05-30 19:16:22	pitrou	set	messages: + msg106776
2010-05-30 19:13:22	pitrou	set	status: pending -> open nosy: + pitrou messages: + msg106775
2010-05-30 19:12:34	ezio.melotti	set	status: open -> pending nosy: + ezio.melotti messages: + msg106774 resolution: not a bug
2010-05-30 18:54:12	PeterL	create