Issue 8024: upgrade to Unicode 5.2

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/52272

classification

Title:	upgrade to Unicode 5.2
Type:	enhancement	Stage:	resolved
Components:	Unicode	Versions:	Python 3.2, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:	7783	Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, ezio.melotti, flox, lemburg
Priority:	normal	Keywords:	patch

Created on 2010-02-26 14:28 by flox, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue8024_UCD_py3k.diff	flox, 2010-03-19 08:48	Patch, apply to 3.x

Messages (15)
msg100151 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-02-26 14:28
Is there any benefit to upgrading the UCD in trunk?
msg100153 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-02-26 14:43
Excerpt from the release notes: http://www.unicode.org/versions/Unicode5.2.0/

The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.

Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam, and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language.

The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.

Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols.

The current version in Python 2.6 is 5.1.
msg100155 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-02-26 15:59
Have you checked how big the structural changes are between 5.1 and 5.2? If we only have to rerun the makeunicodedata.py script, then I'd be +1 on going with 5.2. Otherwise, I think it's better to wait another release before upgrading to the then latest Unicode version.
msg101114 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-15 14:15
It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".

It generates the 3 db files:
* Modules/unicodedata_db.h
* Modules/unicodename_db.h
* Objects/unicodetype_db.h

Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".

I have been using UCD 5.2 since January, and everything works fine.
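To confirm which UCD version an interpreter was built with after regenerating the db files, the standard `unicodedata.unidata_version` attribute can be inspected. The following is only a minimal sketch; the helper name `ucd_version_tuple` is hypothetical and not part of the patch.

```python
import unicodedata


def ucd_version_tuple(version=unicodedata.unidata_version):
    """Split a version string such as '5.2.0' into a tuple of ints.

    Hypothetical helper for illustration; not part of CPython.
    """
    return tuple(int(part) for part in version.split("."))


# Version of the Unicode Character Database compiled into this interpreter.
print(unicodedata.unidata_version)

# True when the running interpreter bundles UCD 5.2.0 or newer.
print(ucd_version_tuple() >= (5, 2, 0))
```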
msg101121 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-03-15 16:22
Florent Xicluna wrote:
> It is just a matter of running "makeunicodedata" after changing "5.1" -> "5.2".
>
> It generates the 3 db files:
> * Modules/unicodedata_db.h
> * Modules/unicodename_db.h
> * Objects/unicodetype_db.h
>
> Then you adjust the "expectedchecksum" in "Lib/test/test_unicodedata.py".
>
> I use UCD 5.2 since January, and everything works fine.

So the Unicode database format itself has not changed?
msg101124 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-15 16:34
> So the Unicode database format itself has not changed ?

No. The changes listed below have no impact afai-have-tested.

--------- --------- --------- --------- --------- --------- ---------

F. Unicode Character Database Changes

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in UAX #44, Unicode Character Database. The most significant changes include:

* There are new case-related properties in DerivedCoreProperties.txt and DerivedNormalizationProps.txt. The new case-related derived properties are NFKC_Casefold, Case_Ignorable, Cased, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casemapped, Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
* Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. The status of all character properties is listed in the property table in UAX #44, Unicode Character Database.
* Two new joining groups, FARSI YEH and NYA, were added. These new joining groups may require an update to implementations of Arabic shaping rules.
* There is a new data file in the Unicode Character Database, CJKRadicals.txt, which maps the radical numbers used in the Unicode Radical-Stroke Index to the actual Unicode code points for the corresponding radicals. Unlike other files, the first field is not a code point number.
* The Unihan.txt file in Unihan.zip is split into 8 separate files within the zip file, organized by category. See UAX #38, Unicode Han Database (Unihan) for details.

--------- --------- --------- --------- --------- --------- ---------

See also: http://www.unicode.org/reports/tr44/tr44-4.html#Change_History
msg101126 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-03-15 16:51
Florent Xicluna wrote:
>> So the Unicode database format itself has not changed ?
>
> No. The changes listed below have no impact afai-have-tested.

Ok, so +1 for updating to 5.2. The files that have changed are not used by Python (yet), so there's no impact of those changes for the unicodedata module.

Thanks for checking.
msg101287 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-18 22:13
Done with r79059 and r79062.
msg101297 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-19 01:34
Reverted in 3.x: it triggers some failures.

Symptoms:
* repr('\uaaa') gives an empty string
* test_bigmem fails
msg101309 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-03-19 08:30
Florent Xicluna wrote:
> Reverted in 3.x: it triggers some failures.
>
> Symptoms:
> * repr('\uaaa') gives an empty string
> * test_bigmem fails

repr() for Unicode doesn't use the Unicode database. Are you sure that those errors are related to the upgrade?

Looking closer at the patch, you also changed the unicodetype mappings and since this removes a lot of entries, it looks like the Unicode consortium either moved some mappings out of the UCD file into a separate file or made some massive changes to the code point properties (which is unlikely).

If that's the case, please also revert the Python 2.7 checkin.

Thanks,
Marc-Andre Lemburg
msg101311 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-19 08:48
The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:

AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;

And test_bigmem relies on this code point being non-printable. I replaced it with a character in the low surrogates range, which is guaranteed to be non-printable. See attached patch.

The regression test suite passes flawlessly.

I will do further tests before merging back in 3.x.
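The effect described above can be checked on any Python 3 interpreter whose database is UCD 5.2 or newer. This is only an illustrative sketch: the helper `describe` is hypothetical, and "\udc00" is an arbitrary pick from the low surrogates range (U+DC00..U+DFFF), not necessarily the character used in the attached patch.

```python
import unicodedata

tai_viet = "\uaaaa"        # assigned as TAI VIET LETTER LOW VO in Unicode 5.2
low_surrogate = "\udc00"   # assumption: arbitrary low surrogate, category Cs


def describe(ch):
    """Return (general category, printable?) for a single character."""
    return unicodedata.category(ch), ch.isprintable()


print(unicodedata.name(tai_viet))
print(describe(tai_viet))       # a letter (Lo) is printable
print(describe(low_surrogate))  # a surrogate (Cs) is never printable

# repr() escapes non-printable characters, so the result is plain ASCII.
print(repr(low_surrogate))
```

Because str.isprintable() is driven by the general category, a code point that moves from unassigned (Cn) to a letter category (Lo) changes what repr() produces, which is why a test that depended on "\uaaaa" being escaped had to pick a different character.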
msg101314 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-03-19 09:21
> Looking closer at the patch, you also changed the unicodetype mappings
> and since this removes a lot of entries, it looks like the Unicode
> consortium either moved some mappings out of the UCD file into a
> separate file or made some massive changes to the code point
> properties (which is unlikely).

Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h has grown by 200 lines.
msg101315 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-03-19 09:25
Florent Xicluna wrote:
> The bug was a side-effect of the update. Code point "\uAAAA" is now assigned to a printable character:
>
> AAAA;TAI VIET LETTER LOW VO;Lo;0;L;;;;;N;;;;;
>
> And test_bigmem relies on this code point being non-printable.
> I changed it for a char in the Low surrogates range, which is guaranteed not printable. See attached patch.

That's better. You wrote about '\uaaa' (3 'a's) in your previous post on the ticket and I didn't understand why that would change with the patch, since it's basically a SyntaxError which doesn't have anything to do with the Unicode types or database.

> The regression test suite passes flawlessly.
>
> I will do further tests before merging back in 3.x

Please also check what happened to all those code points that were removed by the patch in unicodetype_db.h.

Thanks.
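The point about the three-digit escape can be illustrated with a small sketch for Python 3. The helper name `is_valid_literal` is hypothetical; it only relies on the built-in compile() raising SyntaxError for a truncated \uXXXX escape.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# A \u escape needs exactly four hex digits, so three digits are rejected
# by the compiler before any Unicode database lookup takes place.
print(is_valid_literal(r"'\uaaa'"))

# With four hex digits the literal compiles normally.
print(is_valid_literal(r"'\uaaaa'"))
```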
msg101316 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-03-19 09:27
Amaury Forgeot d'Arc wrote:
>> Looking closer at the patch, you also changed the unicodetype mappings
>> and since this removes a lot of entries, it looks like the Unicode
>> consortium either moved some mappings out of the UCD file into a
>> separate file or made some massive changes to the code point
>> properties (which is unlikely).
>
> Does it? On the contrary, it seems to me that with r79059, unicodetype_db.h grown by 200 lines.

Ooops :-) I now realize that I was looking at the patch reverting the change. Sorry about that.
msg101328 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-03-19 14:45
Merged with r79093

History
Date	User	Action	Args
2022-04-11 14:56:58	admin	set	github: 52272
2010-03-19 14:45:56	flox	set	status: open -> closed resolution: accepted -> fixed messages: + msg101328 stage: commit review -> resolved
2010-03-19 09:27:09	lemburg	set	messages: + msg101316
2010-03-19 09:25:35	lemburg	set	messages: + msg101315
2010-03-19 09:21:44	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg101314
2010-03-19 08:48:41	flox	set	files: + issue8024_UCD_py3k.diff keywords: + patch messages: + msg101311
2010-03-19 08:31:00	lemburg	set	messages: + msg101309
2010-03-19 01:34:03	flox	set	status: closed -> open resolution: fixed -> accepted messages: + msg101297 stage: resolved -> commit review
2010-03-18 22:13:01	flox	set	status: open -> closed resolution: fixed messages: + msg101287 stage: patch review -> resolved
2010-03-15 17:16:38	flox	set	title: upgrade to Unicode 5.2? -> upgrade to Unicode 5.2
2010-03-15 16:51:51	lemburg	set	messages: + msg101126
2010-03-15 16:34:02	flox	set	messages: + msg101124
2010-03-15 16:22:20	lemburg	set	messages: + msg101121
2010-03-15 14:15:30	flox	set	messages: + msg101114 stage: needs patch -> patch review
2010-02-26 16:01:12	flox	set	dependencies: + test_normalization fails when NormalizationTest.txt is outdated, - Add an argument to test_support.open_urlresource to invalidate the cache
2010-02-26 15:59:30	lemburg	set	nosy: + lemburg messages: + msg100155
2010-02-26 14:43:48	flox	set	dependencies: + Add an argument to test_support.open_urlresource to invalidate the cache messages: + msg100153
2010-02-26 14:37:02	ezio.melotti	set	nosy: + ezio.melotti stage: needs patch
2010-02-26 14:28:47	flox	create