classification
Title: startswith and endswith leak implementation details
Type: behavior Stage:
Components: Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ronan.Lamy, barry, r.david.murray, serhiy.storchaka, steven.daprano
Priority: normal Keywords:

Created on 2017-11-08 16:59 by Ronan.Lamy, last changed 2017-11-09 00:03 by steven.daprano.

Messages (8)
msg305881 - (view) Author: Ronan Lamy (Ronan.Lamy) * Date: 2017-11-08 16:59
One would think that u.startswith(v, start, end) would be equivalent to u[start: end].startswith(v), but one would be wrong. And the same goes for endswith(). Here is the actual spec (for bytes, but str and bytearray are the same), in the form of passing pytest+hypothesis tests:


from hypothesis import strategies as st, given

def adjust_indices(u, start, end):
    if end < 0:
        end = max(end + len(u), 0)
    else:
        end = min(end, len(u))
    if start < 0:
        start = max(start + len(u), 0)
    return start, end

@given(st.binary(), st.binary(), st.integers(), st.integers())
def test_startswith_3(u, v, start, end):
    if v:
        expected = u[start:end].startswith(v)
    else:
        start0, end0 = adjust_indices(u, start, end)
        expected = start0 <= len(u) and start0 <= end0
    assert u.startswith(v, start, end) is expected

@given(st.binary(), st.binary(), st.integers(), st.integers())
def test_endswith_3(u, v, start, end):
    if v:
        expected = u[start:end].endswith(v)
    else:
        start0, end0 = adjust_indices(u, start, end)
        expected = start0 <= len(u) and start0 <= end0
    assert u.endswith(v, start, end) is expected

Fixing this behaviour to work in the "obvious" way would be simple: just add a check for len(v) == 0 and always return True in that case.
msg305882 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-11-08 17:28
Can you please give examples of what you think the problem is?
msg305886 - (view) Author: Ronan Lamy (Ronan.Lamy) * Date: 2017-11-08 17:57
The problem is the complexity of the actual behaviour of these methods. 

It is impossible to get it right without looking at the source (at least, it was for me), and I doubt any ordinary user can correctly make use of the v='' behaviour, or predict what the return value will be in all cases.
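
A few concrete cases may make the discrepancy easier to see. This is a minimal sketch, assuming a CPython 3 interpreter that behaves as the tests above describe; the sample value `b"alpha"` and the index pairs are chosen purely for illustration.

```python
u = b"alpha"

# Non-empty v: the three-argument form agrees with slicing first.
nonempty_direct = u.startswith(b"ph", 2, 4)
nonempty_sliced = u[2:4].startswith(b"ph")

# Empty v, start beyond len(u): slicing yields b"" (which matches),
# while the three-argument form rejects because start > len(u).
past_end_direct = u.startswith(b"", 20, 30)
past_end_sliced = u[20:30].startswith(b"")

# Empty v, start > end inside the string: the same disagreement.
reversed_direct = u.endswith(b"", 3, 1)
reversed_sliced = u[3:1].endswith(b"")

print(nonempty_direct, nonempty_sliced)
print(past_end_direct, past_end_sliced)
print(reversed_direct, reversed_sliced)
```

In each empty-v pair the sliced form reports a match and the three-argument form does not, which is the behaviour the `else` branches of the tests encode.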
msg305887 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-08 18:05
See issue24284. `s1.startswith(s2, start, end)` for non-negative indices and non-tuple s2 is equivalent to the expressions

    start + len(s2) <= end and s2[start: start + len(s2)] == s2

or
    s1.find(s2, start, end) == start
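
The `find()` equivalence can be spot-checked by brute force. The following is only a sketch over a small, arbitrarily chosen set of strings and non-negative indices, not a proof; the helper name `via_find` and the sample lists are made up for this example.

```python
from itertools import product


def via_find(s1, s2, start, end):
    """startswith() expressed through find(), for non-negative indices."""
    return s1.find(s2, start, end) == start


samples = ["", "a", "ab", "abc", "aab"]
needles = ["", "a", "ab", "b"]
indices = range(0, 6)

# Collect every combination where the two spellings disagree.
mismatches = [
    (s1, s2, start, end)
    for s1, s2, start, end in product(samples, needles, indices, indices)
    if s1.startswith(s2, start, end) != via_find(s1, s2, start, end)
]
print(mismatches)
```

An empty list of mismatches is consistent with the claimed equivalence for the inputs tried; it says nothing about negative indices or tuple arguments, which the statement above excludes.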
msg305891 - (view) Author: Ronan Lamy (Ronan.Lamy) * Date: 2017-11-08 19:17
Ah, thanks, I noticed the discrepancy between unicode and str in 2.7, but wondered when it was fixed. I guess I'm arguing that it was resolved in the wrong direction, then.

Now, your first expression is wrong, even after fixing the obvious typo. The correct version is:
    start + len(s2) <= min(len(s1), end) and s1[start: start + len(s2)] == s2

If the person who implemented the behaviour can't get it right, who will? ;-)

The second expression is correct, but I'll argue that it shows that find() also suffers from a discrepancy between its basic one-argument form and the extended ones.
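
As far as I can tell, the two versions of the first expression differ only for an empty s2 with start past the end of s1. A small sketch, assuming non-negative indices and reading the `s2[...]` in the original expression as `s1[...]`; the helper names `uncapped` and `capped` are hypothetical.

```python
def uncapped(s1, s2, start, end):
    # Expression from msg305887 with the typo read as s1[...].
    return start + len(s2) <= end and s1[start:start + len(s2)] == s2


def capped(s1, s2, start, end):
    # Corrected expression: end is capped at len(s1).
    return (start + len(s2) <= min(len(s1), end)
            and s1[start:start + len(s2)] == s2)


# Empty s2, start beyond len(s1): the case where the two differ.
case = ("abc", "", 4, 10)
actual = case[0].startswith(*case[1:])
print(uncapped(*case), capped(*case), actual)
```

Only `capped` agrees with what `startswith()` returns for this input.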
msg305901 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-08 20:37
For the justification of the find() behavior see msg243668.

But the strongest argument for this behavior is that find() has had it for a long time. Changing it would break existing code that depends on it.

This argument is weaker in the case of startswith() and endswith() because their behavior for bytes and Unicode was inconsistent. But the consistency with find() plays a role.
msg305922 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-11-08 23:55
Thank you for the bug report Ronan, but I'm afraid that I have no idea what you think the problematic behaviour is. I'm not going to spend the time installing the third-party hypothesis module, and learning how to use it, just to decipher your "actual spec". Where did this spec come from? The documentation is fairly sparse:

https://docs.python.org/3/library/stdtypes.html#str.startswith

so I'm not sure where your spec comes from. The title of this ticket is uninformative: what implementation details are being leaked? 

Saying "The problem is the complexity of the actual behaviour of these methods." explains nothing. Which actual behaviour? Please provide simple examples that contrast expected behaviour with actual behaviour, and a justification for the expected behaviour.
msg305923 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-11-09 00:03
I don't have Python 3.7 available to me, but in 3.5 the behaviour of u.startswith(v) with an empty v seems consistent to me:


py> "alpha".startswith("", 20, 30)
True
py> "alpha"[20:30].startswith("")
True

py> "".startswith("", 20, 30)
True
py> ""[20:30].startswith("")
True

So I can't see any inconsistency that might be fixed by always returning True in the case v="", as that appears to already be the case.
History
Date                 User              Action  Args
2017-11-09 00:03:12  steven.daprano    set     messages: + msg305923
2017-11-08 23:55:48  steven.daprano    set     nosy: + steven.daprano; messages: + msg305922
2017-11-08 20:37:48  serhiy.storchaka  set     messages: + msg305901
2017-11-08 19:17:49  Ronan.Lamy        set     messages: + msg305891
2017-11-08 18:45:36  barry             set     nosy: + barry
2017-11-08 18:05:54  serhiy.storchaka  set     nosy: + serhiy.storchaka; messages: + msg305887
2017-11-08 17:57:39  Ronan.Lamy        set     messages: + msg305886
2017-11-08 17:28:14  r.david.murray    set     nosy: + r.david.murray; messages: + msg305882
2017-11-08 16:59:43  Ronan.Lamy        create