classification
Title: split() splits on non-whitespace char when there is no separator given.
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 2.6
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: PeterL, ezio.melotti, pitrou, rhettinger
Priority: normal
Keywords:

Created on 2010-05-30 18:54 by PeterL, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (8)
msg106773 - (view) Author: Peter Landgren (PeterL) Date: 2010-05-30 18:54
When the variable label is equal to '\xc5\xa0 Z\nX W'
this line sequence
label = " ".join(label.split())
label = unicode(label)
results in:
7347: ERROR: gramps.py: line 138: Unhandled exception
Traceback (most recent call last):
  File "C:\Program Files (x86)\gramps\gui\views\listview.py", line 660, in row_changed
    self.uistate.modify_statusbar(self.dbstate)
  File "C:\Program Files (x86)\gramps\DisplayState.py", line 521, in modify_statusbar
    name, obj = navigation_label(dbstate.db, nav_type, active_handle)
  File "C:\Program Files (x86)\gramps\Utils.py", line 1358, in navigation_label
    label = unicode(label)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data

While this line sequence:
label = unicode(label)
label = " ".join(label.split())
gives the correct result and no error.

With the error the variable label changes from
'\xc5\xa0 Z\nX W'
to
'\xc5 Z X W'
by the line:
label = " ".join(label.split())
Note '\xa0' has been dropped, interpreted as "whitespace"?
This happens on Windows. It works perfectly well on Linux.
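
A minimal sketch of why losing that one byte breaks the later conversion. It is written for Python 3 (an assumption; the report uses Python 2.6), where `bytes` plays the role of Python 2's `str`. The `replace` call only simulates the byte being treated as whitespace; it is not a claim about what `split()` does internally.

```python
# Python 3 sketch: bytes stands in for Python 2's str.
original = b'\xc5\xa0 Z\nX W'   # UTF-8 bytes for u'\u0160 Z\nX W'

# Simulate the reported behaviour: the second byte of the two-byte
# UTF-8 sequence for U+0160 is treated as whitespace and dropped.
damaged = b" ".join(original.replace(b'\xa0', b' ').split())

try:
    damaged.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    # A lone 0xC5 lead byte followed by a space is not valid UTF-8.
    decode_failed = True

# Decoding the untouched bytes works.
intact = original.decode('utf-8')
```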
msg106774 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-05-30 19:12
Both on Linux and Windows I get:
>>> '\xa0'.isspace()
False
>>> u'\xa0'.isspace()
True

The Unicode char u'\xa0' is U+00A0 NO-BREAK SPACE, so unicode.split correctly considers it whitespace.
However, '\xa0' is not whitespace, so str.split ignores it.
The correct solution is to convert your string to Unicode and then split.
I'd close this as invalid, but before doing so I'd like you to confirm that the example I posted and 'split' return the same result on both Linux and Windows (the fact that it works on Linux is probably caused by something else, e.g. the label is already Unicode).
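
The distinction above can be sketched in Python 3, assuming the usual correspondence: `b'...'` for Python 2's `'...'` and `'...'` for `u'...'`. This is an illustration of the byte-versus-text rule, not a reproduction of the Python 2.6 session.

```python
# Python 3 sketch: b'...' corresponds to Python 2 '...', '...' to u'...'.
raw = b'\xc5\xa0 Z\nX W'

# Byte strings only recognise ASCII whitespace, so the 0xA0 byte is kept.
byte_is_space = b'\xa0'.isspace()
byte_parts = raw.split()

# Text strings follow Unicode rules, so U+00A0 NO-BREAK SPACE is whitespace.
text_is_space = '\xa0'.isspace()
text_parts = 'a\xa0b'.split()
```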
msg106775 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-30 19:13
What do you mean, "works perfectly well under Linux"?
The error also happens under Linux here, and is expected: you can't call unicode() without an encoding and expect it to properly decode non-ASCII chars (and \xa0 is a non-ASCII char).
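
A rough Python 3 illustration of that point. It assumes the default codec is ASCII, so `'ascii'` is passed explicitly to stand in for a bare `unicode()` call; the traceback in the report names the 'utf8' codec, so the default inside that application may well differ.

```python
raw = b'\xc5\xa0'   # UTF-8 bytes for U+0160

# Stand-in for unicode(raw) with an ASCII default codec (assumption).
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    # Both bytes are outside the 7-bit ASCII range.
    ascii_ok = False

# Naming the encoding the bytes were written in decodes them cleanly.
text = raw.decode('utf-8')
```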
msg106776 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-30 19:16
Oh, and I agree with Ezio, this is most likely not a bug at all and should probably be closed.
msg106778 - (view) Author: Peter Landgren (PeterL) Date: 2010-05-30 20:03
I am not sure I can follow you. I will try to be more specific.

The test string originally consists of one character: the Czech Š.

1. On Linux with Python 2.6.4
1.1 If I keep the original code line order:
label = obj.get()
print type(label), repr(label)
label = " ".join(label.split())
print type(label), repr(label)
label = unicode(label)
if len(label) > 40:
    label = label[:40] + "..."

Both print type(label), repr(label) lines give:
<type 'str'> '\xc5\xa0'

1.2 If I change order and take the unicode conversion first:
label = obj.get()
label = unicode(label)
print type(label), repr(label)
label = " ".join(label.split())
print type(label), repr(label)
if len(label) > 40:
    label = label[:40] + "..."

Both print type(label), repr(label) lines give:
<type 'unicode'> u'\u0160'

2. On Windows with Python 2.6.5
2.1 The original code line order:
The print type(label), repr(label) lines give:
<type 'str'> '\xc5\xa0'
<type 'str'> '\xc5'
 8217: ERROR: gramps.py: line 138: Unhandled exception
 ....

2.2 If I change order and take the unicode conversion first:
Both print type(label), repr(label) lines give:
<type 'unicode'> u'\u0160'

3.
If I use this little code:
# -*- coding: utf-8 -*-
label = 'Š'
print type(label), repr(label)
label = " ".join(label.split())
print type(label), repr(label)
I get 
<type 'str'> '\xc5\xa0'
<type 'str'> '\xc5\xa0'
on both Linux and Windows.

The examples above under 1. and 2. come from an application, Gramps.

There is still something I don't understand.
msg106779 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-05-30 20:20
I think the problem is in the default encoding used when you call unicode() without specifying any encoding.
>>> '\xc5\xa0'.decode('iso-8859-1').split()
[u'\xc5']
>>> '\xc5\xa0'.decode('utf-8').split()
[u'\u0160']
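
The same comparison as a Python 3 sketch (an assumption; the session above is Python 2), with the intermediate strings kept so the difference in length is visible:

```python
raw = b'\xc5\xa0'

# ISO-8859-1 maps every byte to one character: U+00C5 and U+00A0.
latin1_text = raw.decode('iso-8859-1')
# UTF-8 reads the two bytes as a single character: U+0160.
utf8_text = raw.decode('utf-8')

# U+00A0 is Unicode whitespace, so split() drops it from the Latin-1 text.
latin1_parts = latin1_text.split()
utf8_parts = utf8_text.split()
```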
msg106785 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2010-05-31 04:13
I also agree this should be closed.
msg106787 - (view) Author: Peter Landgren (PeterL) Date: 2010-05-31 07:13
So as a summary to what Ezio Melotti said:
Should I always specify the encoding when calling split() to be sure nothing nasty happens? (I believe Ezio Melotti meant "calling split()", not "calling unicode()", in his last answer?)

Thanks for pointing this out.
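
One way to read the advice in this thread is that the encoding is named at the decoding step, and split() then runs on the decoded text. A minimal Python 3 sketch under that reading follows; `normalize_label`, its parameters, and the UTF-8 default are hypothetical and are not taken from Gramps.

```python
def normalize_label(raw, encoding='utf-8', max_len=40):
    """Hypothetical helper: decode first, then collapse whitespace."""
    # Decode with an explicit encoding before any whitespace handling.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # Splitting text (not bytes) cannot cut a multi-byte sequence in half.
    text = " ".join(text.split())
    # Same truncation rule as the snippet quoted earlier in the thread.
    if len(text) > max_len:
        text = text[:max_len] + "..."
    return text


label = normalize_label(b'\xc5\xa0 Z\nX W')
```

With this ordering the label comes out as a single text string with the newline collapsed to a space.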
History
Date                 User          Action  Args
2022-04-11 14:57:01  admin         set     github: 53105
2010-05-31 07:13:58  PeterL        set     messages: + msg106787
2010-05-31 04:13:06  rhettinger    set     status: open -> closed; nosy: + rhettinger; messages: + msg106785
2010-05-30 20:20:52  ezio.melotti  set     messages: + msg106779
2010-05-30 20:03:47  PeterL        set     messages: + msg106778
2010-05-30 19:16:22  pitrou        set     messages: + msg106776
2010-05-30 19:13:22  pitrou        set     status: pending -> open; nosy: + pitrou; messages: + msg106775
2010-05-30 19:12:34  ezio.melotti  set     status: open -> pending; nosy: + ezio.melotti; messages: + msg106774; resolution: not a bug
2010-05-30 18:54:12  PeterL        create