Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

behavior for finding an empty string is inconsistent with documentation #68431

Open
swanson mannequin opened this issue May 20, 2015 · 12 comments
Open

behavior for finding an empty string is inconsistent with documentation #68431

swanson mannequin opened this issue May 20, 2015 · 12 comments
Labels
docs Documentation in the Doc dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@swanson
Copy link
Mannequin

swanson mannequin commented May 20, 2015

BPO 24243
Nosy @malemburg, @rhettinger, @pitrou, @vstinner, @benjaminp, @ezio-melotti, @bitdancer, @vadmium, @serhiy-storchaka, @vedgar

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2015-05-20.04:02:42.226>
labels = ['interpreter-core', 'type-bug', 'expert-unicode', 'docs']
title = 'behavior for finding an empty string is inconsistent with documentation'
updated_at = <Date 2017-09-18.21:10:09.592>
user = 'https://bugs.python.org/swanson'

bugs.python.org fields:

activity = <Date 2017-09-18.21:10:09.592>
actor = 'r.david.murray'
assignee = 'docs@python'
closed = False
closed_date = None
closer = None
components = ['Documentation', 'Interpreter Core', 'Unicode']
creation = <Date 2015-05-20.04:02:42.226>
creator = 'swanson'
dependencies = []
files = []
hgrepos = []
issue_num = 24243
keywords = []
message_count = 12.0
messages = ['243640', '243656', '243658', '243666', '243668', '243670', '243710', '243722', '244028', '302488', '302489', '302491']
nosy_count = 12.0
nosy_names = ['lemburg', 'rhettinger', 'pitrou', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'r.david.murray', 'docs@python', 'martin.panter', 'serhiy.storchaka', 'veky', 'swanson']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue24243'
versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

@swanson
Copy link
Mannequin Author

swanson mannequin commented May 20, 2015

Background:
-----------
Perhaps this has been addressed elsewhere, but I couldn't find it. To me, semantically, the whole idea of finding nothing, whether in something or in nothing is complete and utter nonsense. I'm a fail-quickly, fail-loudly kind of guy, and I'd prefer that any attempt to find nothing would result in an exception being raised.

But for whatever reason, the following behavior has long existed:
>>> "".index("") == "A".index("") == 0
True
>>> "" in "" and b"" in b"" and b"" in bytearray(b"")
True
>>> "" in "A" and b"" in b"A" and b"" in bytearray(b"A")
True

The Problem:
------------
The various string types (str, bytes, bytearray) all have the following functions: count, find, rfind, index, rindex
Each of these functions accepts optional parameters "start" and "end". The documentation for each says something like "Optional arguments start and end are interpreted as in slice notation." This is not the case.

On the one hand:
>>> "".find("") == ""[0:0].find("") == "".find("", 0, 0) == 0
True

Consistent so far, however:
>>> "".find("") == ""[1:0].find("") == 0 and "".find("", 1, 0) == -1
True

So, you see that 'start' and 'end' are NOT in all cases interpreted the same way as slicing. This problem has been around forever, affecting both Python 3 and 2, so I don't know how many people's code you'd break if you changed the behavior to make it consistent with the docs. But if it's not going to be changed, it should at least be a well-documented "feature" of the functions with a line or two of explanation in the relevant docs:
https://docs.python.org/3/library/stdtypes.html#bytes-and-bytearray-operations
https://docs.python.org/3/library/stdtypes.html#string-methods

The built-in types bytes, bytearray, and str also have the functions "startswith" and "endswith" that also take optional 'start' and 'end' arguments. The documentation does not specifically say (as for count, (r)find, and (r)index) that these are "interpreted as in slice notation". Instead, it says: "With optional start, test string beginning at that position. With optional end, stop comparing string at that position." That wording is equally confusing and erroneous. The natural interpretation of that would lead you to believe that, unlike slicing:
"A".startswith("A",0,0) == True
however it's really == False because the 'end' index is like slicing.

Now, as to the interpretation of finding nothing, it's a mixed bag:

For str:
>>> "".startswith("",0,0)
True
>>> "".startswith("",1,0)
True
>>> "".endswith("",0,0)
True
>>> "".endswith("",1,0)
True

For bytes: (and the same for bytearray)
>>> b"".startswith(b"",0,0)
True
>>> b"".startswith(b"",1,0)
False
>>> b"".endswith(b"",0,0)
True
>>> b"".endswith(b"",1,0)
False

@swanson swanson mannequin assigned docspython May 20, 2015
@swanson swanson mannequin added docs Documentation in the Doc dir type-bug An unexpected behavior, bug, or error labels May 20, 2015
@vadmium
Copy link
Member

vadmium commented May 20, 2015

There are a few related issues here I think:

  1. Empty string search: I think it is completely valid to be able to find an empty string inside another string, at least as long as the slice indexes are within range. Although I remember it was a bit of a revelation the first time I considered this. Anyway, the documentation should explicitly mention this case.

  2. "".find("", 1, 0) -> -1: This is inconsistent with startswith() and slicing, but the only reasonable alternative would be to return 1, which would be a strange index for an empty string. Certainly needs documenting though.

  3. starts/endswith() slice arguments: I think they should be described equivalently to find() etc. So “start” points to the first character, “end” points after the last character, negative indexes from the end, etc.

  4. "".startswith("", 1, 0): This is True for str, and False for bytes, which is terribly inconsistent. My gut says both should be False, to match how find() works.

Some other related undocumented quirks:

>>> "abcd".count("")  # Infinity? End of the universe? No!
5
>>> "abcd".replace("", "_", 10)  # Might expect 10 underscores
'_a_b_c_d_'
>>> "abcd".split("", 10)  # Might expect a list of 11 strings
ValueError: empty separator
>>> "abcd".partition("")  # Why not ("", "", "abcd")?
ValueError: empty separator

@swanson
Copy link
Mannequin Author

swanson mannequin commented May 20, 2015

Thanks for pointing out how count and replace operate. I don't mind the "ValueError: empty separator" on split and partition - makes sense to me.

re: "at least as long as the slice indexes are within range"
If the slices indexes had to be in range, that would be inconsistent with the behavior of slicing, which is inconsistent with the docs, which was my main point.

@bitdancer
Copy link
Member

"If the slices indexes had to be in range, that would be inconsistent with the behavior of slicing"

No, it wouldn't. Your slice example is two operations: the slice returns an empty string (because that's how the *substring* operation is defined to behave for an out-of-range slice), and *then* the search operation is called on it; but in the call with index arguments, the indicies are specifying the slice to search in using the slice semantics of the indicies, but that substring is invalid for the *search* operation.

I agree that the startswith/endswith difference between string a bytes looks like a bug, and that the bytes case looks to be the correct one, in terms of consistency with the other search operations. Those operations are a bit different from the other search operations, though, so I could see it argued the other way.

The point is that the slice notation specifies how to compute the substring, but what happens if the substring is out of range depends on the *operation*.

@serhiy-storchaka
Copy link
Member

Changing behavior of such base methods is dangerous and can break existing code. Even if the behavior is wrong, some code can depends on it. We should be very careful with this.

There are several simple invariants for these methods.

s1.index(s2, start, end) (for non-negative indices) returns minimal index i such that start <= i and i + len(s2) <= end and s1[i: i + len(s2)] = s2. Or raise an exception if such index doesn't exist. find() returns -1 instead of an exception. Therefore "".find("", 1, 0) should be -1 and it is. All right.

The only bug is in inconsistency in startswith/endswith between str and bytes (unicode and str in Python 2). The worse, the behavior of str in Python 2 and Python 3 is different.

s1.startswith(s2, start, end) (for non-negative indices and non-tuple s2) is equivalent to start + len(s2) <= end and s2[start: start + len(s2)] == s2. Or to s1.find(s2, start, end) == start. Therefore "".startswith("", 1, 0) should be False, and it is for bytes and str in Python 2. The behavior of "".startswith("", 1, 0) in Python 3 and u"".startswith(u"", 1, 0) in Python 3 is wrong.

The question is can we fix this behavior in default branch (I think rather yes) and should we fix it in maintained releases (doubt)?

@serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels May 20, 2015
@bitdancer
Copy link
Member

Serhiy: I agree. I think the consistency with python2's string is the deciding factor, and we ought to fix it in default. But not 3.4, because it is a behavior change. I'd like other people's opinions, though.

@rhettinger
Copy link
Contributor

Changing behavior of such base methods is dangerous
and can break existing code.

I agree. The time to argue an API design is before it released not years later after people have written code that relies on the behavior.

To me, semantically, the whole idea of finding nothing,
whether in something or in nothing is complete and utter nonsense.

It might not be something you don't agree with, but it isn't nonsense. There are a number of ways to view it sensibly (empty sets always a subset of other sets). There a reason that any() and all() are well-defined for empty iterables. There a reason that math.factorial(0) returns 1. Those are useful corner cases that save us from writing special case code.

Empty string search: I think it is completely valid to be able to find an empty string inside another string,

I concur.

I am closing this one as "not a bug".

Primum non nocere. First Do No Harm". Don't break existing code because the OP would "prefer that any attempt to find nothing would result in an exception being raised". That is a terrible reason to break code. For better or worse, this state of affairs has existed for a long time and no harm seems to have arisen.

Also keep in mind that the current behaviors are tested, meaning that it is not accidental and is guaranteed across implementations. See Lib/test/string_tests.py.

Though this is closed as not a bug, feel free to add an example or a mention in the documentation. Keep it short though. The purpose of the docs for str.count and str.index are to teach how to use the common case rather than getting lost in a discussion of corner case behavior. Also, the docs should be affirmatively worded (here is how something works and how to use it) rather than negatively worded (omg, finding nothing is something is nonsense).

@bitdancer
Copy link
Member

Raymond: this is currently marked as a documentation bug, and documentation is mostly what the OP was asking for. So I'm reopening it.

That said, I think it is very appropriate to discuss fixing the behavior of string's endswith and startswith, because people are still (and will be for a while yet) porting from python2 and supporting single-code-base code, where string behaves the way bytes does in python3. That should be addressed in a separate issue. So in fact, history and "code written for the current behavior" comes down, IMO, on the side of fixing this edge case. I'm willing to be argued out of that, though :)

@bitdancer bitdancer reopened this May 21, 2015
@bitdancer bitdancer removed the invalid label May 21, 2015
@serhiy-storchaka
Copy link
Member

Opened separate bpo-24284 for inconsistency in startswith/endswith.

@vedgar
Copy link
Mannequin

vedgar mannequin commented Sep 18, 2017

Raymond, with respect, I think you're either wrong here, or misleading with a purpose.

There is a big difference between any(()) returning False, all(()) returning True, '' in '' returning True, math.factorial(0) returning 1, and set() <= set() returning True, on one hand, and ''.rindex('') returning 5, on the other. You might argue the latter is convenient (though I haven't seen any argument in favor of that), but it's simply not in the same category as other phenomena you're equating it with.

Those in the first group are mathematical definitions. They cannot be different without breaking various mathematical properties. 0! is 1 in the same way, and with the same reason, that 3! is 6. If 0! were 5, then 3! would _have_ to be 30. Both the specification (as the order of the symmetry group), and various algorithms for calculating factorial, simply give 1 when given 0 as an input. You can't really get 5 unless you explicitly treat 0 as a special case.

In the second group, there is an answer that might be convenient, and it probably has some obvious algorithm that produces it (I haven't read the code, but I doubt that someone treated '' as a special case there), but it doesn't fit the specification (maximal index doesn't exist), and you can easily write another algorithm that doesn't obviously treat '' as a special case, gives the same answers for all the other "sensible" cases, and gives something completely different for this case. The proof is that in many very similar cases, those different algorithms _have_ (inadvertently) been written.

One question is _whether_ '' is in some other string. Quite another is _where_ it is. First one is (greatly simplified, since it doesn't require contiguousness) all(c in other_string for c in ''), and that's obviously True for the same reason that set() <= any_set is True. But all(1/0 for c in '') is _also_ True, which shows that it really doesn't matter _what_ we test, as long as we test it on an empty collection. It shouldn't give us (by design) _any_ information that we can extract about "that particular occurence of ''" because there is in fact _no_ particular occurence to talk about.

In the face of ambiguity, refuse the temptation to guess. Yes, I see the convenience of not rasing ValueErrors for various string operations so some algorithms can say "ok, give me any value in case of '', I don't really care what it is", but we already do raise it for some operations - e.g. when splitting on an empty separator. We should do the same here. Either change the specification, or if the specification tells you to calculate something that doesn't exist, raise an exception.

@pitrou
Copy link
Member

pitrou commented Sep 18, 2017

I'm not sure why we're arguing over this. Is this actually something that matters in the real world?

@bitdancer
Copy link
Member

I'm not Raymond, but he is correct. This is an example of "taking advantage of the corner cases", and is something Python does a lot of, especially around strings and slices. The current behavior was carefully considered and has useful properties.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

5 participants