classification
Title: behavior for finding an empty string is inconsistent with documentation
Type: behavior Stage:
Components: Documentation, Interpreter Core, Unicode Versions: Python 3.5, Python 3.4, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: benjamin.peterson, docs@python, ezio.melotti, haypo, lemburg, martin.panter, pitrou, r.david.murray, rhettinger, serhiy.storchaka, swanson, veky
Priority: normal Keywords:

Created on 2015-05-20 04:02 by swanson, last changed 2017-09-18 21:10 by r.david.murray.

Messages (12)
msg243640 - (view) Author: (swanson) Date: 2015-05-20 04:02
Background:
-----------
Perhaps this has been addressed elsewhere, but I couldn't find it.  To me, semantically, the whole idea of finding nothing, whether in something or in nothing is complete and utter nonsense.  I'm a fail-quickly, fail-loudly kind of guy, and I'd prefer that any attempt to find nothing would result in an exception being raised.

But for whatever reason, the following behavior has long existed:
>>> "".index("") == "A".index("") == 0
True
>>> "" in "" and b"" in b"" and b"" in bytearray(b"")
True
>>> "" in "A" and b"" in b"A" and b"" in bytearray(b"A")
True


The Problem:
------------
The various string types (str, bytes, bytearray) all have the following functions: count, find, rfind, index, rindex
Each of these functions accepts optional parameters "start" and "end".  The documentation for each says something like "Optional arguments start and end are interpreted as in slice notation."  This is not the case.

On the one hand:
>>> "".find("") == ""[0:0].find("") == "".find("", 0, 0) == 0
True

Consistent so far, however:
>>> "".find("") == ""[1:0].find("") == 0 and "".find("", 1, 0) == -1
True

So, you see that 'start' and 'end' are NOT in all cases interpreted the same way as slicing.  This problem has been around forever, affecting both Python 3 and 2, so I don't know how many people's code you'd break if you changed the behavior to make it consistent with the docs.  But if it's not going to be changed, it should at least be a well-documented "feature" of the functions with a line or two of explanation in the relevant docs:
https://docs.python.org/3/library/stdtypes.html#bytes-and-bytearray-operations
https://docs.python.org/3/library/stdtypes.html#string-methods

The built-in types bytes, bytearray, and str also have the functions "startswith" and "endswith" that also take optional 'start' and 'end' arguments.  The documentation does not specifically say (as for count, (r)find, and (r)index) that these are "interpreted as in slice notation".  Instead, it says: "With optional start, test string beginning at that position. With optional end, stop comparing string at that position."  That wording is equally confusing and erroneous.  The natural interpretation of that would lead you to believe that, unlike slicing:
"A".startswith("A",0,0) == True
however it's really == False because the 'end' index is like slicing.

Now, as to the interpretation of finding nothing, it's a mixed bag:

For str:
>>> "".startswith("",0,0)
True
>>> "".startswith("",1,0)
True
>>> "".endswith("",0,0)
True
>>> "".endswith("",1,0)
True

For bytes: (and the same for bytearray)
>>> b"".startswith(b"",0,0)
True
>>> b"".startswith(b"",1,0)
False
>>> b"".endswith(b"",0,0)
True
>>> b"".endswith(b"",1,0)
False
msg243656 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-05-20 12:27
There are a few related issues here I think:

1. Empty string search: I think it is completely valid to be able to find an empty string inside another string, at least as long as the slice indexes are within range. Although I remember it was a bit of a revelation the first time I considered this. Anyway, the documentation should explicitly mention this case.

2. "".find("", 1, 0) -> -1: This is inconsistent with startswith() and slicing, but the only reasonable alternative would be to return 1, which would be a strange index for an empty string. Certainly needs documenting though.

3. starts/endswith() slice arguments: I think they should be described equivalently to find() etc. So “start” points to the first character, “end” points after the last character, negative indexes from the end, etc.

4. "".startswith("", 1, 0): This is True for str, and False for bytes, which is terribly inconsistent. My gut says both should be False, to match how find() works.

Some other related undocumented quirks:

>>> "abcd".count("")  # Infinity? End of the universe? No!
5
>>> "abcd".replace("", "_", 10)  # Might expect 10 underscores
'_a_b_c_d_'
>>> "abcd".split("", 10)  # Might expect a list of 11 strings
ValueError: empty separator
>>> "abcd".partition("")  # Why not ("", "", "abcd")?
ValueError: empty separator
msg243658 - (view) Author: (swanson) Date: 2015-05-20 13:18
Thanks for pointing out how count and replace operate.  I don't mind the "ValueError: empty separator" on split and partition - makes sense to me.

re: "at least as long as the slice indexes are within range"
If the slices indexes had to be in range, that would be inconsistent with the behavior of slicing, which is inconsistent with the docs, which was my main point.
msg243666 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-05-20 14:00
"If the slices indexes had to be in range, that would be inconsistent with the behavior of slicing"

No, it wouldn't.  Your slice example is two operations: the slice returns an empty string (because that's how the *substring* operation is defined to behave for an out-of-range slice), and *then* the search operation is called on it; but in the call with index arguments, the indicies are specifying the slice to search in using the slice semantics of the indicies, but that substring is invalid for the *search* operation.

I agree that the startswith/endswith difference between string a bytes looks like a bug, and that the bytes case looks to be the correct one, in terms of consistency with the other search operations.  Those operations are a bit different from the other search operations, though, so I could see it argued the other way.

The point is that the slice notation specifies how to compute the substring, but what happens if the substring is out of range depends on the *operation*.
msg243668 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-20 14:03
Changing behavior of such base methods is dangerous and can break existing code. Even if the behavior is wrong, some code can depends on it. We should be very careful with this.

There are several simple invariants for these methods.

s1.index(s2, start, end) (for non-negative indices) returns minimal index i such that start <= i and i + len(s2) <= end and s1[i: i + len(s2)] = s2. Or raise an exception if such index doesn't exist. find() returns -1 instead of an exception. Therefore "".find("", 1, 0) should be -1 and it is. All right.

The only bug is in inconsistency in startswith/endswith between str and bytes (unicode and str in Python 2). The worse, the behavior of str in Python 2 and Python 3 is different.

s1.startswith(s2, start, end) (for non-negative indices and non-tuple s2) is equivalent to start + len(s2) <= end and s2[start: start + len(s2)] == s2. Or to s1.find(s2, start, end) == start. Therefore "".startswith("", 1, 0) should be False, and it is for bytes and str in Python 2. The behavior of "".startswith("", 1, 0) in Python 3 and u"".startswith(u"", 1, 0) in Python 3 is wrong.

The question is can we fix this behavior in default branch (I think rather yes) and should we fix it in maintained releases (doubt)?
msg243670 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-05-20 14:13
Serhiy: I agree.  I think the consistency with python2's string is the deciding factor, and we ought to fix it in default.  But not 3.4, because it is a behavior change.  I'd like other people's opinions, though.
msg243710 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015-05-21 01:12
> Changing behavior of such base methods is dangerous 
> and can break existing code.

I agree.  The time to argue an API design is before it released not years later after people have written code that relies on the behavior.  

> To me, semantically, the whole idea of finding nothing, 
> whether in something or in nothing is complete and utter nonsense. 

It might not be something you don't agree with, but it isn't nonsense.  There are a number of ways to view it sensibly (empty sets always a subset of other sets).  There a reason that any() and all() are well-defined for empty iterables.  There a reason that math.factorial(0) returns 1.  Those are useful corner cases that save us from writing special case code.

> Empty string search: I think it is completely valid to be able to find an empty string inside another string,

I concur.

I am closing this one as "not a bug".

Primum non nocere.  First Do No Harm".  Don't break existing code because the OP would "prefer that any attempt to find nothing would result in an exception being raised".  That is a terrible reason to break code.  For better or worse, this state of affairs has existed for a long time and no harm seems to have arisen.

Also keep in mind that the current behaviors are tested, meaning that it is not accidental and is guaranteed across implementations.  See Lib/test/string_tests.py.

Though this is closed as not a bug, feel free to add an example or a mention in the documentation.  Keep it short though.  The purpose of the docs for str.count and str.index are to teach how to use the common case rather than getting lost in a discussion of corner case behavior.  Also, the docs should be affirmatively worded (here is how something works and how to use it) rather than negatively worded (omg, finding nothing is something is nonsense).
msg243722 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-05-21 03:17
Raymond: this is currently marked as a documentation bug, and documentation is mostly what the OP was asking for.  So I'm reopening it.

That said, I think it is very appropriate to discuss fixing the behavior of string's endswith and startswith, because people are still (and will be for a while yet) porting from python2 and supporting single-code-base code, where string behaves the way bytes does in python3.  That should be addressed in a separate issue.  So in fact, history and "code written for the current behavior" comes down, IMO, on the side of fixing this edge case.  I'm willing to be argued out of that, though :)
msg244028 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-05-25 11:42
Opened separate issue24284 for inconsistency in startswith/endswith.
msg302488 - (view) Author: Vedran Čačić (veky) * Date: 2017-09-18 20:50
Raymond, with respect, I think you're either wrong here, or misleading with a purpose.

There is a big difference between any(()) returning False, all(()) returning True, '' in '' returning True, math.factorial(0) returning 1,  and set() <= set() returning True, on one hand, and ''.rindex('') returning 5, on the other. You might argue the latter is convenient (though I haven't seen any argument in favor of that), but it's simply not in the same category as other phenomena you're equating it with.

Those in the first group are mathematical definitions. They cannot be different without breaking various mathematical properties. 0! is 1 in the same way, and with the same reason, that 3! is 6. If 0! were 5, then 3! would _have_ to be 30. Both the specification (as the order of the symmetry group), and various algorithms for calculating factorial, simply give 1 when given 0 as an input. You can't really get 5 unless you explicitly treat 0 as a special case.

In the second group, there is an answer that might be convenient, and it probably has some obvious algorithm that produces it (I haven't read the code, but I doubt that someone treated '' as a special case there), but it doesn't fit the specification (maximal index doesn't exist), and you can easily write another algorithm that doesn't obviously treat '' as a special case, gives the same answers for all the other "sensible" cases, and gives something completely different for this case. The proof is that in many very similar cases, those different algorithms _have_ (inadvertently) been written.

One question is _whether_ '' is in some other string. Quite another is _where_ it is. First one is (greatly simplified, since it doesn't require contiguousness) all(c in other_string for c in ''), and that's obviously True for the same reason that set() <= any_set is True. But all(1/0 for c in '') is _also_ True, which shows that it really doesn't matter _what_ we test, as long as we test it on an empty collection. It shouldn't give us (by design) _any_ information that we can extract about "that particular occurence of ''" because there is in fact _no_ particular occurence to talk about.

In the face of ambiguity, refuse the temptation to guess. Yes, I see the convenience of not rasing ValueErrors for various string operations so some algorithms can say "ok, give me any value in case of '', I don't really care what it is", but we already do raise it for some operations - e.g. when splitting on an empty separator. We should do the same here. Either change the specification, or if the specification tells you to calculate something that doesn't exist, raise an exception.
msg302489 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-09-18 21:03
I'm not sure why we're arguing over this.  Is this actually something that matters in the real world?
msg302491 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-09-18 21:10
I'm not Raymond, but he is correct.  This is an example of "taking advantage of the corner cases", and is something Python does a lot of, especially around strings and slices.  The current behavior was carefully considered and has useful properties.
History
Date User Action Args
2017-09-18 21:10:09r.david.murraysetmessages: + msg302491
2017-09-18 21:03:20pitrousetmessages: + msg302489
2017-09-18 20:50:03vekysetnosy: + veky
messages: + msg302488
2015-05-25 11:42:16serhiy.storchakasetmessages: + msg244028
2015-05-21 03:17:41r.david.murraysetstatus: closed -> open
resolution: not a bug ->
messages: + msg243722
2015-05-21 01:12:25rhettingersetstatus: open -> closed

nosy: + rhettinger
messages: + msg243710

resolution: not a bug
2015-05-20 14:13:01r.david.murraysetmessages: + msg243670
2015-05-20 14:03:18serhiy.storchakasetversions: + Python 2.7, Python 3.4, Python 3.5
nosy: + haypo, pitrou, serhiy.storchaka, ezio.melotti, lemburg, benjamin.peterson

messages: + msg243668

components: + Interpreter Core, Unicode
2015-05-20 14:00:55r.david.murraysetnosy: + r.david.murray
messages: + msg243666
2015-05-20 13:18:42swansonsetmessages: + msg243658
2015-05-20 12:27:05martin.pantersetnosy: + martin.panter
messages: + msg243656
2015-05-20 04:02:42swansoncreate