
Author Hammerite
Recipients Hammerite, ezio.melotti, vstinner
Date 2015-02-28.18:21:09
Message-id <1425147670.05.0.762030398058.issue23550@psf.upfronthosting.co.za>
In-reply-to
Content
Unicode Standard Annex #15 (http://unicode.org/reports/tr15/#Stable_Code_Points) describes how every Unicode character has, for each of the four normalisation forms, a "Quick_Check" value that helps determine whether a given string is already in that normalisation form. In section 9.1 it goes on to describe how this value can be used to optimise appending a string to an already-normalised string so that the result is also normalised: normalisation need only be performed from the last "stable" character of the left-hand string onwards, where a character is "stable" if its "Quick_Check" value for that form is Yes and its canonical combining class is 0. For long strings this will generally be more efficient than re-running the normalisation algorithm on the entire concatenated string.

The unicodedata standard-library module does not, in my understanding, expose this information. I would like to see a new function added that allows us to determine whether a given character has the "Quick_Check" property for a given normalisation form. This function might accept two parameters, the first of which is a string indicating the normalisation form and the second of which is the character being tested (similar to unicodedata.normalize()).
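
For the two decomposition forms only, the answer can be approximated with calls that already exist, at the cost of one normalize() call per character. The sketch below assumes that a single character has Quick_Check=Yes for NFD or NFKD exactly when normalising it leaves it unchanged. The name quick_check_decomposed() is hypothetical and not part of unicodedata. The same trick does not carry over to NFC or NFKC, where the value can also be Maybe.

```python
import unicodedata


def quick_check_decomposed(form, ch):
    """Hypothetical helper: True if ch has Quick_Check=Yes for NFD or NFKD.

    Emulation only: relies on the assumption that a lone character is
    unchanged by a decomposition form exactly when its Quick_Check is Yes.
    """
    if form not in ("NFD", "NFKD"):
        raise ValueError("only the decomposition forms can be emulated this way")
    if len(ch) != 1:
        raise TypeError("expected a single character")
    return unicodedata.normalize(form, ch) == ch


print(quick_check_decomposed("NFD", "e"))        # plain ASCII letter
print(quick_check_decomposed("NFD", "\u00e9"))   # precomposed e-acute
print(quick_check_decomposed("NFD", "\ufb01"))   # "fi" ligature, compatibility mapping only
print(quick_check_decomposed("NFKD", "\ufb01"))  # same ligature under NFKD
```

A built-in function would presumably read the value straight from the module's database rather than normalising each character, which is the point of the request.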

Suppose we need to accept text data in chunks and, each time a new chunk arrives, append it to the string so far while keeping the result normalised to a particular normalisation form (NFD, say). We would therefore like to concatenate the new chunk (which may not be normalised) onto the string "so far" (which is) and have the result be normalised, without re-normalising the whole string, as this might be inefficient. Following the linked UAX, this might be achieved like this, where unicodedata.quick_check() is the requested function:

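    import unicodedata

    def concat(s1, s2):
        # s1 is assumed to be in NFD already; s2 may not be.
        # unicodedata.quick_check() is the requested function; it does not exist yet.
        LSCP = len(s1)  # Last stable character position
        while LSCP > 0:
            LSCP -= 1
            if unicodedata.combining(s1[LSCP]) == 0 and unicodedata.quick_check('NFD', s1[LSCP]):
                break
        # Everything before the last stable character is unaffected by s2.
        return s1[:LSCP] + unicodedata.normalize('NFD', s1[LSCP:] + s2)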