Message 236901 - Python tracker

Author	Hammerite
Recipients	Hammerite, ezio.melotti, vstinner
Date	2015-02-28.18:21:09
Message-id	<1425147670.05.0.762030398058.issue23550@psf.upfronthosting.co.za>
In-reply-to

Content
Unicode Standard Annex #15 (http://unicode.org/reports/tr15/#Stable_Code_Points) describes how each character in Unicode has, for each of the four normalisation forms, a "Quick_Check" value that helps determine whether a given string is in that normalisation form. In section 9.1 it goes on to describe how this value can be used to optimise concatenating a string onto a normalised string to produce another normalised string: normalisation need only be performed from the last "stable" character in the left-hand string onwards, where a character is "stable" if its "Quick_Check" value for that form is Yes and its canonical combining class is 0. When the strings involved are long, this will generally be more efficient than re-running the normalisation algorithm on the entire concatenated string.

The unicodedata standard-library module does not, in my understanding, expose this information. I would like to see a new function added that allows us to determine whether a given character has the "Quick_Check" property for a given normalisation form. This function might accept two parameters, the first of which is a string indicating the normalisation form and the second of which is the character being tested (similar to unicodedata.normalize()).

Suppose we need to accept text data in chunks, and each time a new chunk arrives we must append it to the string so far and ensure that the result is normalised to a particular normalisation form (say NFD). We would therefore like to concatenate the new chunk (which may not be normalised) onto the string "so far" (which is) and have the result be normalised, without normalising the whole string again, as that might be inefficient. Following the linked UAX, this might be achieved as follows, where unicodedata.quick_check() is the requested function:

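    import unicodedata

    def concat(s1, s2):
        # s1 is assumed to be in NFD already; s2 may be unnormalised.
        # unicodedata.quick_check() is the requested function; it does not exist yet.
        LSCP = len(s1)  # Last stable character position
        while LSCP > 0:
            LSCP -= 1
            if unicodedata.combining(s1[LSCP]) == 0 and unicodedata.quick_check('NFD', s1[LSCP]):
                break
        # Everything before the last stable character is unaffected by s2.
        return s1[:LSCP] + unicodedata.normalize('NFD', s1[LSCP:] + s2)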
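
Until such a function exists, the idea can be tried out for NFD alone with what unicodedata already provides. The sketch below is only an approximation under stated assumptions: is_stable_nfd() and concat_nfd() are hypothetical names, and the per-character test "NFD leaves this character unchanged" is used as a stand-in for a Quick_Check value of Yes. That substitution should hold for NFD, where the value is only ever Yes or No, but it would not carry over to NFC or NFKC, where the value can also be Maybe.

```python
import unicodedata

def is_stable_nfd(ch):
    """Hypothetical stand-in for the requested quick_check(); NFD only."""
    # Assumption: a single character that NFD leaves unchanged has no
    # canonical decomposition, which is what NFD Quick_Check=Yes means.
    unchanged = unicodedata.normalize('NFD', ch) == ch
    return unicodedata.combining(ch) == 0 and unchanged

def concat_nfd(s1, s2):
    """Append s2 to s1 (already in NFD), re-normalising only the tail."""
    pos = len(s1)
    while pos > 0:
        pos -= 1
        if is_stable_nfd(s1[pos]):
            break
    # If no stable character was found, pos is 0 and everything is normalised.
    return s1[:pos] + unicodedata.normalize('NFD', s1[pos:] + s2)

# Chunked input: the accumulated string stays in NFD after every append.
so_far = ''
for chunk in ['caf\u00e9', '\u0323 au', ' lait']:
    so_far = concat_nfd(so_far, chunk)
```

Here the second chunk begins with a combining mark (U+0323) that has to be reordered before the U+0301 already at the end of the accumulated string, so the tail from the last stable character ("e") onwards is normalised again while the prefix "caf" is left alone. Calling normalize() on single characters is itself not free, which is part of why a direct lookup of the Quick_Check value would be preferable.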

History
Date	User	Action	Args
2015-02-28 18:21:10	Hammerite	set	recipients: + Hammerite, vstinner, ezio.melotti
2015-02-28 18:21:10	Hammerite	set	messageid: <1425147670.05.0.762030398058.issue23550@psf.upfronthosting.co.za>
2015-02-28 18:21:10	Hammerite	link	issue23550 messages
2015-02-28 18:21:09	Hammerite	create