Issue 23550: Add to unicodedata a function to query the "Quick_Check" property for a character

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/67738

classification

Title:	Add to unicodedata a function to query the "Quick_Check" property for a character
Type:	enhancement	Stage:	patch review
Components:	Library (Lib), Unicode	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Hammerite, benjamin.peterson, berker.peksag, ezio.melotti, lemburg, loewis, pitrou, vstinner
Priority:	normal	Keywords:	patch

Created on 2015-02-28 18:21 by Hammerite, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description	Edit
quick_check_x.patch	Hammerite, 2015-03-03 00:09	Initial attempt at implementation	review
quick_check_2.patch	Hammerite, 2015-03-03 22:08	Better patch that includes changes to unicodedata_db.h	review
quick_check_3.patch	Hammerite, 2015-03-26 22:21	Patch addressing some issues raised by reviewers	review
quick_check_4.diff	Hammerite, 2015-03-28 19:25	Patch that makes all requested stylistic changes	review

Messages (12)
msg236901 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-02-28 18:21
Unicode Standard Annex #15 (http://unicode.org/reports/tr15/#Stable_Code_Points) describes how each character in Unicode, for each of the four normalisation forms, has a "Quick_Check" value that aids in determining whether a given string is in that normalisation form. It goes on to describe, in section 9.1, how this "Quick_Check" value may be used to optimise the concatenation of a string onto a normalised string to produce another normalised string: normalisation need only be performed from the last "stable" character in the left-hand string onwards, where a character is "stable" if it has the "Quick_Check" property and has a canonical combining class of 0. This will generally be more efficient than re-running the normalisation algorithm on the entire concatenated string, if the strings involved are long.

The unicodedata standard-library module does not, in my understanding, expose this information. I would like to see a new function added that allows us to determine whether a given character has the "Quick_Check" property for a given normalisation form. This function might accept two parameters, the first of which is a string indicating the normalisation form and the second of which is the character being tested (similar to unicodedata.normalize()).

Suppose we have a need to accept text data, receiving chunks of it at a time, and every time we receive a new chunk we need to append it to the string so far and also make sure that the resulting string is normalised to a particular normalisation form (NFD say). This implies that we would like to be able to concatenate the new chunk (which may not be normalised) onto the string "so far" (which is) and have the result be normalised - but without re-doing normalisation of the whole string over again, as this might be inefficient.
From the linked UAX, this might be achieved like this, where unicodedata.quick_check() is the requested function:

    def concat(s1, s2):
        LSCP = len(s1)  # Last stable character position
        while LSCP > 0:
            LSCP -= 1
            if unicodedata.combining(s1[LSCP]) == 0 and unicodedata.quick_check('NFD', s1[LSCP]):
                break
        return s1[:LSCP] + unicodedata.normalize('NFD', s1[LSCP:] + s2)
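The concatenation idea above can be tried today for NFD without the requested function. The following is only a sketch under stated assumptions: is_stable_nfd() and concat_nfd() are hypothetical names, and the Quick_Check lookup is replaced by a stand-in test (the character is left unchanged by NFD normalisation on its own). That substitution is reasonable for NFD, where the Quick_Check value is only ever Yes or No, but it does not carry over to NFC or NFKC, where "Maybe" values and the extra composition conditions of UAX #15 apply.

```python
import unicodedata


def is_stable_nfd(ch):
    """Stand-in for the requested check (hypothetical helper, NFD only).

    A character is treated as stable if its canonical combining class is 0
    and normalising it to NFD on its own leaves it unchanged.
    """
    return (unicodedata.combining(ch) == 0
            and unicodedata.normalize('NFD', ch) == ch)


def concat_nfd(s1, s2):
    """Append s2 to s1 (assumed already NFD), renormalising only the tail."""
    lscp = len(s1)  # last stable character position
    while lscp > 0:
        lscp -= 1
        if is_stable_nfd(s1[lscp]):
            break
    # Everything before the last stable character is left untouched.
    return s1[:lscp] + unicodedata.normalize('NFD', s1[lscp:] + s2)
```

For example, appending a combining cedilla to an NFD string that ends in "e" plus a combining acute accent reorders the two marks, and only the tail starting at "e" is renormalised.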
msg236910 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2015-02-28 20:54
Can you provide a patch for this ?
msg236913 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-02-28 21:37
No, I haven't done any work on it. Is that the "done" thing when suggesting something? I'm sorry, I wasn't aware. I could look into it. I am unfamiliar with the CPython codebase, but I can have a go.
msg237094 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-03 00:09
Here is an initial attempt at a patch that implements the new function. Notes on this patch:

- The function as implemented here returns a string: "Yes", "No", or "Maybe". In light of the fact that Python now has enums, it is probably more appropriate that unicodedata contain an enum (QuickCheckValue, say) that has Yes, No and Maybe as values, and that the function return values from the enum. However, I do not know how to implement this in C and I do not know where I might look for an example of where something like this is done (in C) already in the CPython codebase, so I have not done that.
- My example code in the initial message on this issue was in error, as it assumed that the new function would return a Boolean value. In fact, there are three possible return values (Yes, No and Maybe), and corrected code would test the return value for equality with a string (as I have implemented the function) or for equality or identity with an enum value.
- When I first generated a patch, it was very long and consisted mostly of changes to three generated header files; these changes accounted for a total of about 60,000 lines' worth of patch file and appeared to be almost entirely spurious. Although one or more of these header files will genuinely be generated differently on account of my changes (a new string array _PyUnicode_QuickCheckNames is now present, along the same lines as the pre-existing 'Names' arrays), I have omitted them from the diff for being extremely bulky and for being generated code in any case.
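For illustration, the enum-based return value discussed above might look roughly like the following in pure Python. This is a sketch, not the patch's implementation: QuickCheckValue follows the name floated above, while this quick_check() is a hypothetical stand-in that handles only the decomposed forms (NFD and NFKD), whose Quick_Check values are always Yes or No. It derives the answer by checking whether normalisation changes the character, rather than by reading the Unicode database property directly.

```python
import enum
import unicodedata


class QuickCheckValue(enum.Enum):
    """The three possible Quick_Check values."""
    YES = "Yes"
    NO = "No"
    MAYBE = "Maybe"


def quick_check(form, ch):
    """Hypothetical stand-in for the proposed function (NFD/NFKD only)."""
    if len(ch) != 1:
        raise TypeError("need a single Unicode character as parameter")
    if form not in ("NFD", "NFKD"):
        # NFC and NFKC can yield MAYBE, which this stand-in cannot derive.
        raise ValueError("this sketch supports only NFD and NFKD")
    unchanged = unicodedata.normalize(form, ch) == ch
    return QuickCheckValue.YES if unchanged else QuickCheckValue.NO
```

Callers would then compare against an enum member, for example `quick_check('NFD', ch) is QuickCheckValue.YES`, instead of relying on the truthiness of the return value.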
msg237165 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-03 22:08
Here is a better patch that includes the changes to unicodedata_db.h. The problem before was that the diff tool can't cope with line ending differences. I just fixed the line endings manually.
msg238484 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-19 00:32
I'm a registered contributor now, in case that is a roadblock.
msg238487 - (view)	Author: Berker Peksag (berker.peksag) *	Date: 2015-03-19 02:05
Thanks for the patch, Hammerite. Did you see the review comments on Rietveld?
msg238556 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-19 20:51
Berker, I had not seen your or Ezio's review comments. The user interface here is new to me and I did not know to seek them out separately from the issue page. Thank you for pointing them out to me. I shall go through and look at the suggestions that were made.
msg239185 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-24 21:42
I tried to add these responses within the code review section of the site, but I am unable to do so; when I click "Send Message", I am taken to a page that consists of the text "500 Server Error" and no other content. Therefore I am responding to the code review comments here.

Ezio: The form comes before the chr because that is the order unicodedata.normalize() takes its arguments (though it takes a string rather than a single character). To write normalize(form, str) but quick_check(chr, form) would be quite inconsistent, I think, and would invite confusion. Agreed that a brief description of what the property is useful for would be good; I will make another patch that includes this. I will delete the space after the function name. I will also eliminate the column arrangement of the test code, although I think that this will make the code much less readable.

Berker: Regarding whatsnew: Since I have your invitation to do so, I will. Regarding putting the normalisation form comparison into helper functions, I will follow your suggestion, and incorporate the improvement for normalize() as well. For the error message, this is the same error message that is given by unicodedata.normalize(). I could change it, but if I do, I think I should change it for normalize() at the same time.

For two of your other points I would like to observe that the way I have done things is in mimicry of existing code that relates to the unicodedata module. Elsewhere in unicodedata.c multiline strings are continued using \, so I have done the same; in makeunicodedata.py the other arrays are not static so the new one is not static either. I could make these changes regardless, but the result would be inconsistency in the code. Or I could change these things everywhere in the relevant files, but this would mean I would be changing parts of the files that are not properly related to this issue.

For the remaining stylistic points, I will make the changes you asked for.
msg239356 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-26 22:21
Here is a new patch. This version addresses several issues raised in review by Ezio Melotti and Berker Peksag. Where I have not (yet) addressed an issue, I have explained why not in my previous post. I attempted to preview my changes to the documentation that are new in this patch, but processing the documentation seems to be quite an involved process, so I have not done this in the end. Therefore there might be syntax errors or other issues with the documentation changes in unicodedata.rst.
msg239441 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-27 21:48
My remark about the arrays not being static is not entirely accurate. _PyUnicode_CategoryNames, _PyUnicode_BidirectionalNames and _PyUnicode_EastAsianWidthNames are not static; however, decomp_prefix is static. It would not add any significant bulk to the patch to make these three static in addition to the new array, but I will await another review before submitting a further patch, since I anticipate there will be a need for it.
msg239463 - (view)	Author: Hammerite (Hammerite) *	Date: 2015-03-28 19:25
For good measure, here is a patch that makes all of Berker's suggested stylistic changes, but applies them to the pre-existing code in makeunicodedata.py and unicodedata.c. So all of the existing docstrings are converted from '"...\n\' to '"...\n"' form, and the other property name arrays that were non-static are made static.

History
Date	User	Action	Args
2022-04-11 14:58:13	admin	set	github: 67738
2015-03-28 19:25:31	Hammerite	set	files: + quick_check_4.diff messages: + msg239463
2015-03-27 21:48:40	Hammerite	set	messages: + msg239441
2015-03-26 22:21:11	Hammerite	set	files: + quick_check_3.patch messages: + msg239356
2015-03-24 21:42:22	Hammerite	set	messages: + msg239185
2015-03-19 20:51:49	Hammerite	set	messages: + msg238556
2015-03-19 02:05:06	berker.peksag	set	nosy: + berker.peksag messages: + msg238487
2015-03-19 01:46:58	berker.peksag	set	stage: patch review
2015-03-19 00:32:51	Hammerite	set	messages: + msg238484
2015-03-03 22:08:15	Hammerite	set	files: + quick_check_2.patch messages: + msg237165
2015-03-03 00:09:29	Hammerite	set	files: + quick_check_x.patch keywords: + patch messages: + msg237094
2015-02-28 21:37:24	Hammerite	set	messages: + msg236913
2015-02-28 20:54:24	lemburg	set	messages: + msg236910
2015-02-28 18:28:48	SilentGhost	set	nosy: + lemburg, loewis, pitrou, benjamin.peterson versions: + Python 3.5
2015-02-28 18:21:10	Hammerite	create