upgrade to Unicode 5.2 #52272

Closed
florentx mannequin opened this issue Feb 26, 2010 · 15 comments
Labels
topic-unicode type-feature A feature request or enhancement

Comments

@florentx
Mannequin

florentx mannequin commented Feb 26, 2010

BPO 8024
Nosy @malemburg, @amauryfa, @ezio-melotti, @florentx
Dependencies
  • bpo-7783: test_normalization fails when NormalizationTest.txt is outdated
Files
  • issue8024_UCD_py3k.diff: Patch, apply to 3.x

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-03-19.14:45:56.423>
    created_at = <Date 2010-02-26.14:28:47.873>
    labels = ['type-feature', 'expert-unicode']
    title = 'upgrade to Unicode 5.2'
    updated_at = <Date 2010-03-19.14:45:56.421>
    user = 'https://github.com/florentx'

    bugs.python.org fields:

    activity = <Date 2010-03-19.14:45:56.421>
    actor = 'flox'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-03-19.14:45:56.423>
    closer = 'flox'
    components = ['Unicode']
    creation = <Date 2010-02-26.14:28:47.873>
    creator = 'flox'
    dependencies = ['7783']
    files = ['16580']
    hgrepos = []
    issue_num = 8024
    keywords = ['patch']
    message_count = 15.0
    messages = ['100151', '100153', '100155', '101114', '101121', '101124', '101126', '101287', '101297', '101309', '101311', '101314', '101315', '101316', '101328']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'ezio.melotti', 'flox']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue8024'
    versions = ['Python 2.7', 'Python 3.2']

    @florentx
    Mannequin Author

    florentx mannequin commented Feb 26, 2010

    Is there any benefit to upgrading the UCD in trunk?

    @florentx florentx mannequin added topic-unicode type-feature A feature request or enhancement labels Feb 26, 2010
    @florentx
    Mannequin Author

    florentx mannequin commented Feb 26, 2010

    Excerpt of the release note:
    http://www.unicode.org/versions/Unicode5.2.0/

    The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.

    Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam, and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language.

    The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.

    Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols.

    The current version in Python 2.6 is 5.1.
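To see which UCD version a given interpreter ships, the standard `unicodedata` module exposes `unidata_version`. The following is a minimal sketch that assumes a Python 3 interpreter; the helper name `ucd_version_tuple` is made up for illustration and is not part of any library.

```python
import unicodedata


def ucd_version_tuple():
    """Return the bundled UCD version as a tuple of ints, e.g. (5, 2, 0)."""
    # Hypothetical helper: only unicodedata.unidata_version is a real API.
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if __name__ == "__main__":
    print("UCD version:", unicodedata.unidata_version)
    # U+AAAA was added in Unicode 5.2, so its name only resolves on 5.2 or later.
    if ucd_version_tuple() >= (5, 2, 0):
        print(unicodedata.name("\uaaaa"))
```

Comparing tuples rather than strings avoids mis-ordering versions such as "10.0.0" and "5.2.0".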

    @malemburg
    Member

    Have you checked how big the structural changes are between 5.2 and 5.1?

    If we only have to rerun the makeunicodedata.py script, then I'd be +1 on going with 5.2.

    Otherwise, I think it's better to wait another release before upgrading to the then latest Unicode version.

    @florentx
    Mannequin Author

    florentx mannequin commented Mar 15, 2010

    It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".

    It generates the 3 db files:

    Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".

    I have been using UCD 5.2 since January, and everything works fine.
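As background on why "expectedchecksum" has to be adjusted: a checksum test of this kind hashes per-character properties, so any change to the database changes the digest. The sketch below only illustrates that idea and is not the code in "Lib/test/test_unicodedata.py"; the function name `database_checksum`, the `limit` parameter, and the choice of properties are assumptions.

```python
import hashlib
import unicodedata


def database_checksum(limit=0x10000):
    """Hash a few per-character properties for code points below `limit`."""
    digest = hashlib.sha1()
    for code_point in range(limit):
        ch = chr(code_point)
        # Every value here is ASCII text, so the encoding step cannot fail.
        record = [
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            unicodedata.decomposition(ch),
            str(unicodedata.mirrored(ch)),
            str(unicodedata.combining(ch)),
        ]
        digest.update("".join(record).encode("ascii"))
    return digest.hexdigest()


if __name__ == "__main__":
    # The digest depends on the UCD version bundled with the interpreter.
    print(unicodedata.unidata_version, database_checksum())
```

Because newly assigned code points change their category from "Cn" to something else, a digest recorded under one UCD version will generally not match the one computed under the next.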

    @malemburg
    Member

    Florent Xicluna wrote:

    Florent Xicluna <florent.xicluna@gmail.com> added the comment:

    It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".

    It generates the 3 db files:

    Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".

    I have been using UCD 5.2 since January, and everything works fine.

    So the Unicode database format itself has not changed?

    @florentx
    Mannequin Author

    florentx mannequin commented Mar 15, 2010

    So the Unicode database format itself has not changed?

    No. As far as I have tested, the changes listed below have no impact.

    --------- --------- --------- --------- --------- --------- ---------
    F. Unicode Character Database Changes

    The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in UAX #44, Unicode Character Database. The most significant changes include:

    * There are new case-related properties in DerivedCoreProperties.txt and DerivedNormalizationProps.txt. The new case-related derived properties are NFKC_Casefold, Case_Ignorable, Cased, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casemapped, Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
    * Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. The status of all character properties is listed in the property table in UAX #44, Unicode Character Database.
    * Two new joining groups, FARSI YEH and NYA, were added. These new joining groups may require an update to implementations of Arabic shaping rules.
    * There is a new data file in the Unicode Character Database, CJKRadicals.txt, which maps the radical numbers used in the Unicode Radical-Stroke Index to the actual Unicode code points for the corresponding radicals. Unlike other files, the first field is not a code point number.
    * The Unihan.txt file in Unihan.zip is split into 8 separate files within the zip file, organized by category. See UAX #38, Unicode Han Database (Unihan) for details.
    

    --------- --------- --------- --------- --------- --------- ---------

    See also:
    http://www.unicode.org/reports/tr44/tr44-4.html#Change_History

    @malemburg
    Member

    Florent Xicluna wrote:

    Florent Xicluna <florent.xicluna@gmail.com> added the comment:

    > So the Unicode database format itself has not changed?

    No. As far as I have tested, the changes listed below have no impact.

    Ok, so +1 for updating to 5.2.

    The files that have changed are not used by Python (yet), so there's
    no impact of those changes for the unicodedata module.

    Thanks for checking.

    @florentx florentx mannequin changed the title from "upgrade to Unicode 5.2?" to "upgrade to Unicode 5.2" Mar 15, 2010
    @florentx
    Mannequin Author

    florentx mannequin commented Mar 18, 2010

    Done with r79059 and r79062.

    @florentx florentx mannequin closed this as completed Mar 18, 2010
    @florentx
    Mannequin Author

    florentx mannequin commented Mar 19, 2010

    Reverted in 3.x: it triggers some failures.

    Symptoms:

    • repr('\uaaa') gives an empty string
    • test_bigmem fails

    @florentx florentx mannequin reopened this Mar 19, 2010
    @malemburg
    Member

    Florent Xicluna wrote:

    Florent Xicluna <florent.xicluna@gmail.com> added the comment:

    Reverted in 3.x: it triggers some failures.

    Symptoms:

    • repr('\uaaa') gives an empty string
    • test_bigmem fails

    repr() for Unicode doesn't use the Unicode database. Are you sure that
    those errors are related to the upgrade?

    Looking closer at the patch, you also changed the unicodetype mappings
    and since this removes a lot of entries, it looks like the Unicode
    consortium either moved some mappings out of the UCD file into a
    separate file or made some massive changes to the code point properties
    (which is unlikely).

    If that's the case, please also revert the Python 2.7 checkin.

    Thanks,

    Marc-Andre Lemburg

    @florentx
    Mannequin Author

    florentx mannequin commented Mar 19, 2010

    The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:

    AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;

    And test_bigmem relies on this code point being non-printable.
    I replaced it with a character in the low surrogate range, which is guaranteed to be non-printable. See attached patch.

    The regression test suite passes flawlessly.

    I will do further tests before merging back in 3.x
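The effect described in this comment can be checked from Python 3. The sketch below parses the UnicodeData.txt record quoted above and compares a low surrogate against it; the helper name `parse_unicodedata_line` and the choice of U+DC00 as the surrogate are assumptions for illustration, not what the attached patch uses.

```python
import unicodedata

# The record quoted in the comment above (semicolon-separated fields).
UCD_LINE = "AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;"


def parse_unicodedata_line(line):
    """Split one UnicodeData.txt record into (code point, name, category)."""
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2]


code_point, name, category = parse_unicodedata_line(UCD_LINE)
ch = chr(code_point)

# With UCD 5.2 or later, U+AAAA is an assigned letter and therefore printable.
print(unicodedata.name(ch, "<unassigned>"), unicodedata.category(ch))
print("U+AAAA printable:", ch.isprintable())

# Surrogates have category Cs, which str.isprintable() never treats as printable.
low_surrogate = "\udc00"  # arbitrary pick from the low surrogate range
print(unicodedata.category(low_surrogate), low_surrogate.isprintable())
```

Since `repr()` in 3.x escapes only non-printable characters, a test that expects an escaped form needs a code point whose non-printable status will not change in a later Unicode version.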

    @amauryfa
    Member

    Looking closer at the patch, you also changed the unicodetype mappings
    and since this removes a lot of entries, it looks like the Unicode
    consortium either moved some mappings out of the UCD file into a
    separate file or made some massive changes to the code point
    properties (which is unlikely).

    Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h has grown by 200 lines.

    @malemburg
    Member

    Florent Xicluna wrote:

    Florent Xicluna <florent.xicluna@gmail.com> added the comment:

    The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:

    AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;

    And test_bigmem relies on this code point being non-printable.
    I replaced it with a character in the low surrogate range, which is guaranteed to be non-printable. See attached patch.

    That's better.

    You wrote about '\uaaa' (3 'a's) in your previous post
    on the ticket and I didn't understand why that would change with the
    patch, since it's basically a SyntaxError which doesn't have anything
    to do with the Unicode types or database.

    The regression test suite passes flawlessly.

    I will do further tests before merging back in 3.x

    Please also check what happened to all those code points that were
    removed by the patch in unicodetype_db.h.

    Thanks.
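The point about the three-'a' escape can be reproduced without touching the Unicode database at all. A minimal sketch, assuming Python 3; `is_valid_literal` is a hypothetical helper name:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# A \u escape needs exactly four hex digits; three is a truncated escape.
print(is_valid_literal(r"'\uaaa'"))
print(is_valid_literal(r"'\uaaaa'"))
```

The failure happens while the source is compiled, so it is independent of which UCD version the interpreter was built with.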

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:

    Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

    > Looking closer at the patch, you also changed the unicodetype mappings
    > and since this removes a lot of entries, it looks like the Unicode
    > consortium either moved some mappings out of the UCD file into a
    > separate file or made some massive changes to the code point
    > properties (which is unlikely).

    Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h has grown by 200 lines.

    Oops :-) I now realize that I was looking at the patch reverting
    the change.

    Sorry about that.

    @florentx
    Mannequin Author

    florentx mannequin commented Mar 19, 2010

    Merged with r79093

    @florentx florentx mannequin closed this as completed Mar 19, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022