classification
Title: upgrade to Unicode 5.2
Type: enhancement Stage: resolved
Components: Unicode Versions: Python 3.2, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: 7783 Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, ezio.melotti, flox, lemburg
Priority: normal Keywords: patch

Created on 2010-02-26 14:28 by flox, last changed 2010-03-19 14:45 by flox. This issue is now closed.

Files
File name Uploaded Description Edit
issue8024_UCD_py3k.diff flox, 2010-03-19 08:48 Patch, apply to 3.x
Messages (15)
msg100151 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-26 14:28
Is there any benefit to upgrading the UCD in trunk?
msg100153 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-26 14:43
Excerpt of the release note:
http://www.unicode.org/versions/Unicode5.2.0/

The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.

Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam, and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language.

The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.

Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols. 


The current UCD version in Python 2.6 is 5.1.
msg100155 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-02-26 15:59
Have you checked how big the structural changes are between 5.1 and 5.2?

If we only have to rerun the makeunicodedata.py script, then I'd be +1 on going with 5.2.

Otherwise, I think it's better to wait another release before upgrading to the then latest Unicode version.
msg101114 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-15 14:15
It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".

It generates the 3 db files:
 * Modules/unicodedata_db.h
 * Modules/unicodename_db.h
 * Objects/unicodetype_db.h

Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".

I have been using UCD 5.2 since January, and everything works fine.
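
As a rough check after regenerating the tables, the interpreter can be asked which UCD version it was built against. The sketch below is an illustration only: the helper names `ucd_at_least` and `property_checksum` are hypothetical, and the checksum merely shows the general idea of hashing property values so that any table change is detected. The real test in Lib/test/test_unicodedata.py may hash different properties over a different range.

```python
import hashlib
import unicodedata

def ucd_at_least(major, minor):
    """Return True if the interpreter's UCD is at least major.minor."""
    parts = unicodedata.unidata_version.split(".")
    return (int(parts[0]), int(parts[1])) >= (major, minor)

def property_checksum(code_points):
    """Hash a few per-character properties (hypothetical helper).

    Any change to these properties in the regenerated tables changes the
    digest, which is why an expected checksum must be updated by hand.
    """
    h = hashlib.sha1()
    for cp in code_points:
        ch = chr(cp)
        data = (unicodedata.category(ch)
                + unicodedata.bidirectional(ch)
                + str(unicodedata.combining(ch)))
        h.update(data.encode("ascii"))
    return h.hexdigest()

if ucd_at_least(5, 2):
    # U+AAAA was unassigned in 5.1 and has a name from 5.2 onwards.
    print(unicodedata.name("\uaaaa"))
else:
    print("UCD older than 5.2:", unicodedata.unidata_version)

print(property_checksum(range(0x10000)))
```

The printed digest depends on the UCD version of the interpreter running the sketch, so it is not comparable to the value stored in the test suite.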
msg101121 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-15 16:22
Florent Xicluna wrote:
> 
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
> 
> It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".
> 
> It generates the 3 db files:
>  * Modules/unicodedata_db.h
>  * Modules/unicodename_db.h
>  * Objects/unicodetype_db.h
> 
> Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".
> 
> I have been using UCD 5.2 since January, and everything works fine.

So the Unicode database format itself has not changed ?
msg101124 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-15 16:34
> So the Unicode database format itself has not changed ?

No. The changes listed below have no impact afai-have-tested.

--------- --------- --------- --------- --------- --------- ---------
F. Unicode Character Database Changes

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in UAX #44, Unicode Character Database. The most significant changes include:

    * There are new case-related properties in DerivedCoreProperties.txt and DerivedNormalizationProps.txt. The new case-related derived properties are NFKC_Casefold, Case_Ignorable, Cased, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casemapped, Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
    * Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. The status of all character properties is listed in the property table in UAX #44, Unicode Character Database.
    * Two new joining groups, FARSI YEH and NYA, were added. These new joining groups may require an update to implementations of Arabic shaping rules.
    * There is a new data file in the Unicode Character Database, CJKRadicals.txt, which maps the radical numbers used in the Unicode Radical-Stroke Index to the actual Unicode code points for the corresponding radicals. Unlike other files, the first field is not a code point number.
    * The Unihan.txt file in Unihan.zip is split into 8 separate files within the zip file, organized by category. See UAX #38, Unicode Han Database (Unihan) for details.
--------- --------- --------- --------- --------- --------- ---------

See also:
http://www.unicode.org/reports/tr44/tr44-4.html#Change_History
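
For readers unfamiliar with the derived property files mentioned above, here is a minimal sketch of how one line of such a file could be read. It assumes the usual UCD layout of semicolon-separated fields with an optional `#` comment. The function name `parse_property_line` is hypothetical, and the sample lines are illustrative rather than copied from the real DerivedCoreProperties.txt.

```python
def parse_property_line(line):
    """Parse 'XXXX[..YYYY] ; Property # comment' into (first, last, prop).

    Returns None for blank lines and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    code_range, prop = fields[0], fields[1]
    if ".." in code_range:
        first, last = code_range.split("..")
    else:
        first = last = code_range
    return int(first, 16), int(last, 16), prop

# Illustrative input only, written in the style of a derived property file.
sample = [
    "# comment-only line",
    "0041..005A    ; Changes_When_Lowercased # LATIN CAPITAL LETTER A..Z",
    "00B5          ; Cased # MICRO SIGN",
    "",
]

for line in sample:
    parsed = parse_property_line(line)
    if parsed is not None:
        print("%04X..%04X %s" % parsed)
```

A new property in such a file is just a new value in the second field, which is consistent with the observation that the file format itself did not change.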
msg101126 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-15 16:51
Florent Xicluna wrote:
> 
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
> 
>> So the Unicode database format itself has not changed ?
> 
> No. The changes listed below have no impact afai-have-tested.

Ok, so +1 for updating to 5.2.

The files that have changed are not used by Python (yet), so there's
no impact of those changes for the unicodedata module.

Thanks for checking.
msg101287 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-18 22:13
Done with r79059 and r79062.
msg101297 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-19 01:34
Reverted in 3.x: it triggers some failures.

Symptoms:
 * repr('\uaaa') gives an empty string
 * test_bigmem fails
msg101309 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-19 08:30
Florent Xicluna wrote:
> 
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
> 
> Reverted in 3.x: it triggers some failures.
> 
> Symptoms:
>  * repr('\uaaa') gives an empty string
>  * test_bigmem fails

repr() for Unicode doesn't use the Unicode database. Are you sure that
those errors are related to the upgrade ?

Looking closer at the patch, you also changed the unicodetype mappings
and since this removes a lot of entries, it looks like the Unicode
consortium either moved some mappings out of the UCD file into a
separate file or made some massive changes to the code point properties
(which is unlikely).

If that's the case, please also revert the Python 2.7 checkin.

Thanks,
msg101311 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-19 08:48
The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:

  AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;

And test_bigmem relies on this code point being non-printable.
I replaced it with a character in the low surrogates range, which is guaranteed to be non-printable. See attached patch.

The regression test suite passes flawlessly.

I will do further tests before merging back in 3.x
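
The difference can be reproduced on any Python 3 interpreter whose UCD is 5.2 or later. This is a minimal sketch, not the code from the attached patch: the surrogate "\udc80" is an arbitrary choice from the low surrogates range, and the record string is the UnicodeData.txt line quoted above.

```python
import unicodedata

# The UnicodeData.txt record quoted above: code point, name, category, ...
record = "AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;"
code, name, category = record.split(";")[:3]

tai_viet = chr(int(code, 16))   # assigned since Unicode 5.2
low_surrogate = "\udc80"        # arbitrary low surrogate (assumption)

for ch in (tai_viet, low_surrogate):
    # Avoid printing the characters themselves: a lone surrogate cannot
    # be encoded to most output encodings.
    print("U+%04X category=%s printable=%s"
          % (ord(ch), unicodedata.category(ch), ch.isprintable()))
```

Surrogates have general category Cs, which str.isprintable() treats as non-printable, so a test that needs a non-printable character does not depend on which code points a future UCD release assigns.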
msg101314 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-03-19 09:21
> Looking closer at the patch, you also changed the unicodetype mappings
> and since this removes a lot of entries, it looks like the Unicode
> consortium either moved some mappings out of the UCD file into a
> separate file or made some massive changes to the code point
> properties (which is unlikely).

Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h has grown by 200 lines.
msg101315 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-19 09:25
Florent Xicluna wrote:
> 
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
> 
> The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:
> 
>   AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;
> 
> And test_bigmem relies on this code point being non-printable.
> I replaced it with a character in the low surrogates range, which is guaranteed to be non-printable. See attached patch.

That's better.

You wrote about '\uaaa' (3 'a's) in your previous post
on the ticket and I didn't understand why that would change with the
patch, since it's basically a SyntaxError which doesn't have anything
to do with the Unicode types or database.
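
The point about the three-digit escape can be checked without touching the Unicode database at all. The sketch below assumes Python 3 string literal rules; the helper name `compiles` is hypothetical.

```python
def compiles(source):
    """Return True if the source text compiles as an expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True

# Raw strings keep the backslash, so the compiler sees the escape itself.
print(compiles(r"'\uaaaa'"))  # \u followed by four hex digits
print(compiles(r"'\uaaa'"))   # \u followed by only three hex digits
```

A \u escape needs exactly four hexadecimal digits, so the second literal is rejected at compile time regardless of which UCD version the interpreter ships.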

> The regression test suite passes flawlessly.
> 
> I will do further tests before merging back in 3.x

Please also check what happened to all those code points that were
removed by the patch in unicodetype_db.h.

Thanks.
msg101316 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-03-19 09:27
Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
>> Looking closer at the patch, you also changed the unicodetype mappings
>> and since this removes a lot of entries, it looks like the Unicode
>> consortium either moved some mappings out of the UCD file into a
>> separate file or made some massive changes to the code point
>> properties (which is unlikely).
> 
> Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h has grown by 200 lines.

Ooops :-) I now realize that I was looking at the patch reverting
the change.

Sorry about that.
msg101328 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-03-19 14:45
Merged with r79093
History
Date User Action Args
2010-03-19 14:45:56 flox set status: open -> closed; resolution: accepted -> fixed; messages: + msg101328; stage: commit review -> resolved
2010-03-19 09:27:09 lemburg set messages: + msg101316
2010-03-19 09:25:35 lemburg set messages: + msg101315
2010-03-19 09:21:44 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg101314
2010-03-19 08:48:41 flox set files: + issue8024_UCD_py3k.diff; keywords: + patch; messages: + msg101311
2010-03-19 08:31:00 lemburg set messages: + msg101309
2010-03-19 01:34:03 flox set status: closed -> open; resolution: fixed -> accepted; messages: + msg101297; stage: resolved -> commit review
2010-03-18 22:13:01 flox set status: open -> closed; resolution: fixed; messages: + msg101287; stage: patch review -> resolved
2010-03-15 17:16:38 flox set title: upgrade to Unicode 5.2? -> upgrade to Unicode 5.2
2010-03-15 16:51:51 lemburg set messages: + msg101126
2010-03-15 16:34:02 flox set messages: + msg101124
2010-03-15 16:22:20 lemburg set messages: + msg101121
2010-03-15 14:15:30 flox set messages: + msg101114; stage: needs patch -> patch review
2010-02-26 16:01:12 flox set dependencies: + test_normalization fails when NormalizationTest.txt is outdated, - Add an argument to test_support.open_urlresource to invalidate the cache
2010-02-26 15:59:30 lemburg set nosy: + lemburg; messages: + msg100155
2010-02-26 14:43:48 flox set dependencies: + Add an argument to test_support.open_urlresource to invalidate the cache; messages: + msg100153
2010-02-26 14:37:02 ezio.melotti set nosy: + ezio.melotti; stage: needs patch
2010-02-26 14:28:47 flox create