This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId
Type: behavior Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: ezio.melotti, josh.r, python-dev, serhiy.storchaka, vstinner, xiang.zhang
Priority: normal Keywords: patch

Created on 2014-05-06 21:35 by josh.r, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
comparewithidequals.patch josh.r, 2014-05-06 21:35 Replaces _PyUnicode_CompareWithId with _PyUnicode_CompareWithIdEqual and changes all call sites review
_PyUnicode_EqualToId.patch xiang.zhang, 2016-11-14 05:20 review
_PyUnicode_EqualToId_v2.patch xiang.zhang, 2016-11-14 06:53 More short paths by Serhiy. review
_PyUnicode_EqualToASCIIId.patch xiang.zhang, 2016-11-16 11:28 review
Messages (21)
msg218022 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-05-06 21:35
_PyUnicode_CompareWithId is used exclusively for equality comparisons (after all, identifiers aren't really sortable in a meaningful way; they're isolated values, not a continuum). But because _PyUnicode_CompareWithId maintains the general comparison behavior, not just ==/!=, it serves little purpose; while it checks the return of _PyUnicode_FromId, none of its callers check for failure anyway, so every use could just as well have been:

PyUnicode_Compare(left, _PyUnicode_FromId(right));

I've attached a patch that replaces _PyUnicode_CompareWithId with _PyUnicode_CompareWithIdEqual, that:

1. Only check equality vs. inequality
2. Can optimize for the case where left is an interned string by performing direct pointer comparison
3. Even when left is not interned, it can use the optimized unicode_compare_eq worker function instead of the slower generalized unicode_compare function

I've replaced all the uses of the old function I could find, and all unit tests pass. I don't expect to see any meaningful speed ups as a result of the change (the most commonly traversed code that would benefit appears to be the code that creates new classes, and the code that creates reprs for objects), but the goal here is not immediate speed ups, but enabling future speed ups.

I am looking into writing a PyDict_GetItem fastpath for looking up identifiers (that would remove the need to perform memory comparisons when the dictionary, as in keyword argument passing, is usually composed of interned keys), possibly in combination with making an identifier based version of PyArg_ParseTupleAndKeywords; with ArgumentClinic, it might become practical to swap in a new argument parser without having to manually change thousands of lines of code, and one of the simplest ways to improve speed would be to remove the overhead of constantly constructing, hashing, and comparing the same keyword strings every time a C function is called.

Adding haypo as nosy since he created the original function in #19512.
msg229840 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2014-10-22 22:22
Is there someone else who should be looking at this? Having a fast path for identifier comparisons makes sense (and the concept of ordering between essentially unique identifiers makes no sense). It's not part of the public API (limited or not) so I don't think compatibility concerns apply, so it seems like this should be a simple change...
msg229842 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-22 22:41
Note that PyUnicode_CompareWithASCII should be quite fast in most cases (it uses memcmp() on UCS1 strings).
msg229843 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-22 22:43
That said, I think it's quite a good idea.
msg280701 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-13 16:21
This issue is not just about readability or performance. It is about correctness, since none of callers check for failure of _PyUnicode_CompareWithId.

I just came to the same problem from other side (replacing PyUnicode_CompareWithASCII with PyUnicode_EqualToASCII or _PyUnicode_CompareWithId). Josh's idea in general LGTM, but there are few details:

1. None of callers check for failure of new functions as well. In boolean context the failure is interpreted as true. What is even worse, an exception is set, and this can cause a crash in debug build.

If we don't want to rewrite all the uses of _PyUnicode_CompareWithId by adding the code for checking return value for error, we should make new function never failing. _PyUnicode_CompareWithId can fail if the first argument is not valid string or if there is no memory for allocating a Unicode object for identifier. I think it is better to return false value in these circumstances.

If the first argument is not string, it isn't equal to an identifier. If it is an invalid string, it can't be equal too. If there is no memory for allocating a Unicode object for identifier, we should fallback to comparing the raw content (like in PyUnicode_CompareWithASCII).

2. It may be worth to replace _PyUnicode_FromId with the macro:

#define _PyUnicode_FROM_ID(id) ((id)->object ? (id)->object : _PyUnicode_FromId(id))

Or rename _PyUnicode_FromId to _PyUnicode_FromIdInternal and _PyUnicode_FROM_ID to _PyUnicode_FromId? This would save us from rewriting a lot of code that uses _PyUnicode_FromId.

3. The name _PyUnicode_CompareWithIdEqual looks too long to me. What about _PyUnicode_EqualToId?

4. Pointers could be checked for equality before checking PyUnicode_CHECK_INTERNED(left). The former check is much cheaper than the latter.
msg280730 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-11-14 05:20
> The name _PyUnicode_CompareWithIdEqual looks too long to me. What about _PyUnicode_EqualToId?

+1. I think this name is more clear.

Serhiy's idea on the implementation sounds good. As for _PyUnicode_FROM_ID, I think it's better for another issue.
msg280742 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-14 09:20
There are yet few subtle details.

1. _PyUnicode_FromId() uses UTF-8 for decoding from C string, but PyUnicode_CompareWithASCIIString() uses Latin1. Two ways of comparison can return different results. Currently all identifiers are ASCII, thus perhaps we can ignore this issue for a time. Perhaps the simplest solution is to make PyUnicode_FromId() using ASCII or Latin1.

2. PyUnicode_READY() can fail either because Unicode object is misformed or due to MemoryError. The former case is unavoidable error and returning false is good. But the latter can be temporary error and we should add a fallback, compare wchar_t * representation of non-ready Unicode object with char * representation of identifier.
msg280746 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-11-14 10:38
_PyUnicode_FromId could fail due to memoryerror or bad encoded data. They should be treated differently like PyUnicode_READY.

Could we treat it like PyDict_GetItem? If memoryerror happens it's highly possible other parts will fail too.
msg280834 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-15 12:56
_PyUnicode_FromId would not fail due to bad encoded data if use the Latin1 encoding. Seems encoded data always is ASCII. We could use PyErr_WriteUnraisable() for output a warning if it is not.
msg280843 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-15 14:09
Please don't modify _PyUnicode_FromId(), I prefer to keep it as it is, decode from UTF-8.
msg280855 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-15 14:47
> Please don't modify _PyUnicode_FromId(), I prefer to keep it as it is, decode from UTF-8.

Then we should add handling of three special cases: PyUnicode_READY() fails, _PyUnicode_FromId() fails and both fail due to memory error. This means that should be implemented character-by-character encoding of UCS1, UCS2, UCS4 or wchar_t (with possible surrogate pairs) to UTF-8 and comparing with UTF-8 encoded data.
msg280857 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-15 15:04
> This issue is not just about readability or performance. It is about correctness, since none of callers check for failure of _PyUnicode_CompareWithId.

Callers should be fixed to handle errors.
msg280858 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-11-15 15:31
Since currently _PyUnicode_CompareWithId is used to compare a unicode with ascii identifiers for all cases, how about introduce a more specific function like _PyUnicode_EqualToASCIIId for this case? We can preserve _PyUnicode_CompareWithId for more general purpose usage. It's much easier to write a error free _PyUnicode_EqualToASCIIId.
msg280871 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-15 18:08
> Callers should be fixed to handle errors.

This would make the code of callers more complex for small benefit. And perhaps we will add more callers (if replace PyUnicode_CompareWithASCII with new function).

> Since currently _PyUnicode_CompareWithId is used to compare a unicode with ascii identifiers for all cases, how about introduce a more specific function like _PyUnicode_EqualToASCIIId for this case?

Great idea Xiang!
msg280883 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-15 19:53
_PyUnicode_EqualToASCIIString() added in issue28701 would help in the patch for this issue.
msg280923 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-16 07:16
_PyUnicode_CompareWithId is a private function. We can remove it if it has
issues.
msg280936 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2016-11-16 11:28
> _PyUnicode_CompareWithId is a private function. We can remove it if it has issues.

It doesn't. But once there is _PyUnicode_EqualToASCIIId, it's can be rarely used.

The new patch implements a version of _PyUnicode_EqualToASCIIId.
msg280942 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 12:59
LGTM. Except that _PyUnicode_EqualToASCIIString() could be used for simplicity.

> _PyUnicode_CompareWithId is a private function. We can remove it if it has issues.

I would left it in maintained releases and removed it in 3.7 (or 3.6?).
msg280948 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 13:43
New changeset faf04a995031 by Serhiy Storchaka in branch '3.5':
Issue #28701: Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId.
https://hg.python.org/cpython/rev/faf04a995031

New changeset ff3dacc98b3a by Serhiy Storchaka in branch '3.6':
Issue #28701: Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId.
https://hg.python.org/cpython/rev/ff3dacc98b3a

New changeset 765013f71bc4 by Serhiy Storchaka in branch 'default':
Issue #28701: Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId.
https://hg.python.org/cpython/rev/765013f71bc4
msg280950 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-11-16 13:57
New changeset b995a6950139 by Serhiy Storchaka in branch '3.6':
Issue #21449: Removed private function _PyUnicode_CompareWithId.
https://hg.python.org/cpython/rev/b995a6950139

New changeset 9b053d3c74dc by Serhiy Storchaka in branch 'default':
Issue #21449: Removed private function _PyUnicode_CompareWithId.
https://hg.python.org/cpython/rev/9b053d3c74dc
msg280953 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-11-16 13:59
Thank you Josh and Xiang for your contribution.
History
Date User Action Args
2022-04-11 14:58:03adminsetgithub: 65648
2016-11-16 18:15:47serhiy.storchakasettitle: Replace _PyUnicode_CompareWithId with _PyUnicode_CompareWithIdEqual -> Replace _PyUnicode_CompareWithId with _PyUnicode_EqualToASCIIId
2016-11-16 13:59:58serhiy.storchakasetstatus: open -> closed
stage: patch review -> resolved
2016-11-16 13:59:39serhiy.storchakasetresolution: fixed
dependencies: - Replace PyUnicode_CompareWithASCIIString with _PyUnicode_EqualToASCIIString
messages: + msg280953
2016-11-16 13:57:13python-devsetnosy: + python-dev
messages: + msg280950
2016-11-16 13:43:38serhiy.storchakasetmessages: + msg280948
2016-11-16 12:59:24serhiy.storchakasetassignee: serhiy.storchaka
messages: + msg280942
2016-11-16 11:28:04xiang.zhangsetfiles: + _PyUnicode_EqualToASCIIId.patch

messages: + msg280936
2016-11-16 07:16:21vstinnersetmessages: + msg280923
2016-11-15 19:53:34serhiy.storchakasetdependencies: + Replace PyUnicode_CompareWithASCIIString with _PyUnicode_EqualToASCIIString
messages: + msg280883
2016-11-15 18:08:30serhiy.storchakasetmessages: + msg280871
2016-11-15 15:31:35xiang.zhangsetmessages: + msg280858
2016-11-15 15:04:28vstinnersetmessages: + msg280857
2016-11-15 14:47:35serhiy.storchakasetmessages: + msg280855
2016-11-15 14:09:08vstinnersetmessages: + msg280843
2016-11-15 13:02:12pitrousetnosy: - pitrou
2016-11-15 12:56:59serhiy.storchakasetmessages: + msg280834
2016-11-14 10:38:20xiang.zhangsetmessages: + msg280746
2016-11-14 09:20:05serhiy.storchakasetmessages: + msg280742
2016-11-14 06:53:23xiang.zhangsetfiles: + _PyUnicode_EqualToId_v2.patch
2016-11-14 05:20:38xiang.zhangsetfiles: + _PyUnicode_EqualToId.patch
nosy: + xiang.zhang
messages: + msg280730

2016-11-13 16:21:13serhiy.storchakasetversions: + Python 3.6, Python 3.7
nosy: + ezio.melotti, serhiy.storchaka

messages: + msg280701

components: + Interpreter Core, Unicode
type: behavior
2014-10-22 22:43:04pitrousetmessages: + msg229843
2014-10-22 22:41:51pitrousetversions: - Python 3.6
nosy: + pitrou

messages: + msg229842

stage: patch review
2014-10-22 22:22:28josh.rsetmessages: + msg229840
versions: + Python 3.5, Python 3.6
2014-05-06 21:35:24josh.rcreate