classification
Title: remove redundant check in unicode_hash(PyObject *self)
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Ma Lin, scoder, serhiy.storchaka
Priority: normal Keywords: patch, patch, patch

Created on 2019-01-02 07:20 by Ma Lin, last changed 2019-01-05 19:14 by serhiy.storchaka. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 11402 merged Ma Lin, 2019-01-02 07:25
Messages (11)
msg332857 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-02 07:20
Please see the PR
msg332858 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-02 07:30
Every non-empty str will be checked twice at present.
msg332861 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-02 11:08
This redundant exists since Python 3.4 or earlier.
msg332864 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-01-02 11:25
Unlikely to get changed in Py3.4/5 anymore, since this is not even a bug fix. I wouldn't even fight for backporting, although 3.7 seems ok for it.

I agree that this code duplication is worth removing. I don't consider hashing the empty string important enough for leaving it in, especially because the net performance effect can at most be zero, if not negative, for the normal case of non-empty strings.
msg332866 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-01-02 12:16
New changeset a1d14253066f7dd60cfb465c6511fa565f312b42 by Serhiy Storchaka (animalize) in branch 'master':
bpo-35636: Remove redundant check in unicode_hash(). (GH-11402)
https://github.com/python/cpython/commit/a1d14253066f7dd60cfb465c6511fa565f312b42
msg332908 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-03 05:10
Thanks for review.

Don't know why bytes and str generates the same hash value for ASCII sequence.
>>> hash('abc') == hash(b'abc')
True

This may brings some hash collisions, does it affect performance slightly?
msg332909 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-01-03 07:02
> why bytes and str generates the same hash value for ASCII sequence

Probably mostly for historical Py2 reasons. These days, both are somewhat unlikely to appear in the same dict. But still, I'd advise against changing the hash function without a very good reason. You never know how much code relies on it in one way or another.
msg332910 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-03 07:13
> I'd advise against changing the hash function without a very good reason. You never know how much code relies on it in one way or another.

ok, maybe this can be changed in Python 4.0
msg332911 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-01-03 07:20
> maybe this can be changed in Python 4.0

Well, if you find a *very* good reason for changing it, as I said. Py4 won't be special in that regard, I suppose.
msg332916 - (view) Author: Ma Lin (Ma Lin) * Date: 2019-01-03 07:45
One scene is caching regular expresses, b'[a-z]', '[a-z]' may exist in the same dict. Any way, it's trivial on the whole.
msg333068 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-01-05 19:14
For historical reasons. In Python 2, str and unicode consisting of ASCII characters can be equal. Equal values should have the same hash. In Python 3, bytes and str are always different. This can cause subtle bugs in the code ported from Python 2. Options -b and -bb were added to help to catch such bugs. For increasing a chance of catching such bugs, hashes of bytes and str consisting of ASCII characters with same codes, should be equal.
History
Date User Action Args
2019-01-05 19:14:26serhiy.storchakasetstatus: open -> closed
messages: + msg333068

keywords: patch, patch, patch
resolution: fixed
stage: patch review -> resolved
2019-01-03 07:45:46Ma Linsetmessages: + msg332916
2019-01-03 07:20:48scodersetmessages: + msg332911
2019-01-03 07:13:32Ma Linsetmessages: + msg332910
2019-01-03 07:02:11scodersetmessages: + msg332909
2019-01-03 05:10:25Ma Linsetmessages: + msg332908
2019-01-02 12:16:10serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg332866
2019-01-02 11:33:15serhiy.storchakasetpull_requests: - pull_request10785
2019-01-02 11:33:03serhiy.storchakasetpull_requests: - pull_request10784
2019-01-02 11:32:47serhiy.storchakasetkeywords: patch, patch, patch
versions: - Python 3.6, Python 3.7
2019-01-02 11:25:50scodersetnosy: + scoder

messages: + msg332864
versions: - Python 3.4, Python 3.5
2019-01-02 11:08:15Ma Linsettype: enhancement -> performance
versions: + Python 3.4, Python 3.5
messages: + msg332861
title: remove redundant code in unicode_hash(PyObject *self) -> remove redundant check in unicode_hash(PyObject *self)
2019-01-02 07:49:36Ma Linsetversions: + Python 3.6, Python 3.7
2019-01-02 07:30:20Ma Linsettype: enhancement
messages: + msg332858
components: + Interpreter Core
versions: + Python 3.8
2019-01-02 07:25:14Ma Linsetkeywords: + patch
stage: patch review
pull_requests: + pull_request10785
2019-01-02 07:25:10Ma Linsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request10784
2019-01-02 07:25:06Ma Linsetkeywords: + patch
stage: (no value)
pull_requests: + pull_request10783
2019-01-02 07:20:57Ma Lincreate