msg100151 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-02-26 14:28 |
Is there any benefit to upgrading the UCD in trunk?
|
msg100153 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-02-26 14:43 |
Excerpt of the release note:
http://www.unicode.org/versions/Unicode5.2.0/
The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.
Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam, and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language.
The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.
Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols.
Current version is 5.1 in Python 2.6
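
For reference, the UCD version an interpreter was built with can be read at runtime from `unicodedata.unidata_version`. A minimal sketch follows; the helper name `ucd_version_tuple` is hypothetical and only there to make version comparisons easy.

```python
import unicodedata


def ucd_version_tuple():
    """Return the interpreter's UCD version as a tuple of ints, e.g. (5, 2, 0)."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if __name__ == "__main__":
    # Prints the raw version string the unicodedata module was generated from.
    print(unicodedata.unidata_version)
    # True on any interpreter whose database is at least Unicode 5.2.
    print(ucd_version_tuple() >= (5, 2))
```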
|
msg100155 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-02-26 15:59 |
Have you checked how big the structural changes are between 5.1 and 5.2?
If we only have to rerun the makeunicodedata.py script, then I'd be +1 on going with 5.2.
Otherwise, I think it's better to wait another release before upgrading to the then latest Unicode version.
|
msg101114 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-03-15 14:15 |
It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".
It generates the 3 db files:
* Modules/unicodedata_db.h
* Modules/unicodename_db.h
* Objects/unicodetype_db.h
Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".
I have been using UCD 5.2 since January, and everything works fine.
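
To illustrate why an "expectedchecksum" has to be adjusted after regenerating the database, here is a rough sketch of a property checksum. It is not the code from "Lib/test/test_unicodedata.py"; the function name `property_checksum` and the choice of properties are assumptions made for illustration. The idea is only that any change in per-code-point data changes the digest.

```python
import hashlib
import unicodedata


def property_checksum(start=0, stop=0x110000):
    """Hash a few unicodedata properties for every code point in [start, stop).

    Hypothetical helper: the real test suite hashes its own set of properties.
    """
    digest = hashlib.sha1()
    for codepoint in range(start, stop):
        ch = chr(codepoint)
        record = "|".join([
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            str(unicodedata.combining(ch)),
            unicodedata.decomposition(ch),
            str(unicodedata.mirrored(ch)),
        ])
        # All of these property values are plain ASCII strings.
        digest.update(record.encode("ascii"))
    return digest.hexdigest()


if __name__ == "__main__":
    # The value depends on the UCD version the interpreter was built with,
    # which is why a stored checksum must be updated after an upgrade.
    print(unicodedata.unidata_version, property_checksum())
```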
|
msg101121 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-03-15 16:22 |
Florent Xicluna wrote:
>
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
>
> It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".
>
> It generates the 3 db files:
> * Modules/unicodedata_db.h
> * Modules/unicodename_db.h
> * Objects/unicodetype_db.h
>
> Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".
>
> I have been using UCD 5.2 since January, and everything works fine.
So the Unicode database format itself has not changed ?
|
msg101124 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-03-15 16:34 |
> So the Unicode database format itself has not changed ?
No. The changes listed below have no impact, as far as I have tested.
--------- --------- --------- --------- --------- --------- ---------
F. Unicode Character Database Changes
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in UAX #44, Unicode Character Database. The most significant changes include:
* There are new case-related properties in DerivedCoreProperties.txt and DerivedNormalizationProps.txt. The new case-related derived properties are NFKC_Casefold, Case_Ignorable, Cased, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casemapped, Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
* Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. The status of all character properties is listed in the property table in UAX #44, Unicode Character Database.
* Two new joining groups, FARSI YEH and NYA, were added. These new joining groups may require an update to implementations of Arabic shaping rules.
* There is a new data file in the Unicode Character Database, CJKRadicals.txt, which maps the radical numbers used in the Unicode Radical-Stroke Index to the actual Unicode code points for the corresponding radicals. Unlike other files, the first field is not a code point number.
* The Unihan.txt file in Unihan.zip is split into 8 separate files within the zip file, organized by category. See UAX #38, Unicode Han Database (Unihan) for details.
--------- --------- --------- --------- --------- --------- ---------
See also:
http://www.unicode.org/reports/tr44/tr44-4.html#Change_History
|
msg101126 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-03-15 16:51 |
Florent Xicluna wrote:
>
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
>
>> So the Unicode database format itself has not changed ?
>
> No. The changes listed below have no impact, as far as I have tested.
Ok, so +1 for updating to 5.2.
The files that have changed are not used by Python (yet), so there's
no impact of those changes for the unicodedata module.
Thanks for checking.
|
msg101287 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-03-18 22:13 |
Done with r79059 and r79062.
|
msg101297 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-03-19 01:34 |
Reverted in 3.x: it triggers some failures.
Symptoms:
* repr('\uaaa') gives an empty string
* test_bigmem fails
|
msg101309 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-03-19 08:30 |
Florent Xicluna wrote:
>
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
>
> Reverted in 3.x: it triggers some failures.
>
> Symptoms:
> * repr('\uaaa') gives an empty string
> * test_bigmem fails
repr() for Unicode doesn't use the Unicode database. Are you sure that
those errors are related to the upgrade ?
Looking closer at the patch, you also changed the unicodetype mappings
and since this removes a lot of entries, it looks like the Unicode
consortium either moved some mappings out of the UCD file into a
separate file or made some massive changes to the code point properties
(which is unlikely).
If that's the case, please also revert the Python 2.7 checkin.
Thanks.
|
msg101311 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-03-19 08:48 |
The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:
AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;
And test_bigmem relies on this code point being non-printable.
I replaced it with a character in the low surrogates range, which is guaranteed to be non-printable. See attached patch.
The regression test suite passes flawlessly.
I will do further tests before merging back in 3.x
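
The behaviour described above can be checked on a Python 3 interpreter built with UCD 5.2 or later. This is a small sketch, not the attached patch: the helper `parse_unicodedata_line` is hypothetical, and `"\udc00"` is just one example of a low surrogate.

```python
import unicodedata

# The UnicodeData.txt record quoted in this message (semicolon-separated fields).
UCD_LINE = "AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;"


def parse_unicodedata_line(line):
    """Extract code point, name and general category from one UnicodeData.txt line."""
    fields = line.split(";")
    return {
        "codepoint": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
    }


entry = parse_unicodedata_line(UCD_LINE)
ch = chr(entry["codepoint"])
low_surrogate = "\udc00"  # any code point in U+DC00..U+DFFF would do

if __name__ == "__main__":
    # With UCD >= 5.2 the first line reports an assigned, printable letter.
    print(unicodedata.category(ch), ch.isprintable())
    # Surrogates have category Cs and are never printable.
    print(unicodedata.category(low_surrogate), low_surrogate.isprintable())
```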
|
msg101314 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * |
Date: 2010-03-19 09:21 |
> Looking closer at the patch, you also changed the unicodetype mappings
> and since this removes a lot of entries, it looks like the Unicode
> consortium either moved some mappings out of the UCD file into a
> separate file or made some massive changes to the code point
> properties (which is unlikely).
Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h grew by 200 lines.
|
msg101315 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-03-19 09:25 |
Florent Xicluna wrote:
>
> Florent Xicluna <florent.xicluna@gmail.com> added the comment:
>
> The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:
>
> AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;
>
> And test_bigmem relies on this code point being non-printable.
> I replaced it with a character in the low surrogates range, which is guaranteed to be non-printable. See attached patch.
That's better.
You wrote about '\uaaa' (3 'a's) in your previous post
on the ticket and I didn't understand why that would change with the
patch, since it's basically a SyntaxError which doesn't have anything
to do with the Unicode types or database.
> The regression test suite passes flawlessly.
>
> I will do further tests before merging back in 3.x
Please also check what happened to all those code points that were
removed by the patch in unicodetype_db.h.
Thanks.
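
The point about the three-'a' escape can be verified directly: in Python 3 a `\u` escape needs exactly four hex digits, so the truncated form is rejected at compile time before any Unicode database lookup happens. A minimal sketch, with the hypothetical helper `is_valid_literal`:

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression, False on SyntaxError."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    # Truncated escape (three hex digits): rejected by the compiler.
    print(is_valid_literal(r"'\uaaa'"))
    # Complete escape (four hex digits): accepted.
    print(is_valid_literal(r"'\uaaaa'"))
```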
|
msg101316 - (view) |
Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-03-19 09:27 |
Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
>> Looking closer at the patch, you also changed the unicodetype mappings
>> and since this removes a lot of entries, it looks like the Unicode
>> consortium either moved some mappings out of the UCD file into a
>> separate file or made some massive changes to the code point
>> properties (which is unlikely).
>
> Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h grew by 200 lines.
Oops :-) I just realized that I was looking at the patch reverting
the change.
Sorry about that.
|
msg101328 - (view) |
Author: Florent Xicluna (flox) * |
Date: 2010-03-19 14:45 |
Merged with r79093
|
|
Date | User | Action | Args |
2022-04-11 14:56:58 | admin | set | github: 52272 |
2010-03-19 14:45:56 | flox | set | status: open -> closed resolution: accepted -> fixed messages:
+ msg101328
stage: commit review -> resolved |
2010-03-19 09:27:09 | lemburg | set | messages:
+ msg101316 |
2010-03-19 09:25:35 | lemburg | set | messages:
+ msg101315 |
2010-03-19 09:21:44 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages:
+ msg101314
|
2010-03-19 08:48:41 | flox | set | files:
+ issue8024_UCD_py3k.diff keywords:
+ patch messages:
+ msg101311
|
2010-03-19 08:31:00 | lemburg | set | messages:
+ msg101309 |
2010-03-19 01:34:03 | flox | set | status: closed -> open resolution: fixed -> accepted messages:
+ msg101297
stage: resolved -> commit review |
2010-03-18 22:13:01 | flox | set | status: open -> closed resolution: fixed messages:
+ msg101287
stage: patch review -> resolved |
2010-03-15 17:16:38 | flox | set | title: upgrade to Unicode 5.2? -> upgrade to Unicode 5.2 |
2010-03-15 16:51:51 | lemburg | set | messages:
+ msg101126 |
2010-03-15 16:34:02 | flox | set | messages:
+ msg101124 |
2010-03-15 16:22:20 | lemburg | set | messages:
+ msg101121 |
2010-03-15 14:15:30 | flox | set | messages:
+ msg101114 stage: needs patch -> patch review |
2010-02-26 16:01:12 | flox | set | dependencies:
+ test_normalization fails when NormalizationTest.txt is outdated, - Add an argument to test_support.open_urlresource to invalidate the cache |
2010-02-26 15:59:30 | lemburg | set | nosy:
+ lemburg messages:
+ msg100155
|
2010-02-26 14:43:48 | flox | set | dependencies:
+ Add an argument to test_support.open_urlresource to invalidate the cache messages:
+ msg100153 |
2010-02-26 14:37:02 | ezio.melotti | set | nosy:
+ ezio.melotti
stage: needs patch |
2010-02-26 14:28:47 | flox | create | |