classification
Title: Update Unicode database to 5.1.0
Type:
Stage:
Components:
Versions: Python 3.0, Python 2.6
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: gvanrossum
Nosy List: ajaksu2, amaury.forgeotdarc, effbot, gvanrossum, lemburg, loewis
Priority: normal
Keywords: needs review, patch

Created on 2008-09-09 05:37 by loewis, last changed 2008-09-11 06:05 by loewis. This issue is now closed.

Files
File name, Uploaded, Description
ucd51.diff.bz2 loewis, 2008-09-09 05:37
Messages (11)
msg72821 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-09 05:37
This is a patch to update the Unicode database. It's mostly the imported
data, but there were two code changes:
- 5.1 changes the "mirrored" property for a character (U+0F3A), and the
delta-to-3.2 code did not support that. I added a field into
change_record to support that kind of change.
- 5.1 also added a character (U+1d79) whose upper-case version is far
off (U+A77D), triggering a complaint that the delta can't be represented
in 16 bits. I fixed that by adding a flag into the ctype record indicating
that deltas aren't used for that record.
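
To make the two code changes above concrete, here is a minimal sketch. It assumes the case-mapping delta is stored as a signed 16-bit offset (the message only says "16 bits"), and the helper names case_delta and fits_int16 are hypothetical, not taken from the patch. The last line compares the "mirrored" property of U+0F3A in the current database and in the 3.2.0 view exposed as unicodedata.ucd_3_2_0.

```python
import unicodedata

# Assumed range of a signed 16-bit delta field.
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def case_delta(lower_cp, upper_cp):
    """Signed offset from a lowercase code point to its uppercase form."""
    return upper_cp - lower_cp


def fits_int16(delta):
    """True if the offset can be stored in a signed 16-bit field."""
    return INT16_MIN <= delta <= INT16_MAX


ordinary = case_delta(ord("a"), ord("A"))   # -32
insular_g = case_delta(0x1D79, 0xA77D)      # 35332

print(ordinary, fits_int16(ordinary))       # -32 True
print(insular_g, fits_int16(insular_g))     # 35332 False

# "mirrored" for U+0F3A: current database vs. the 3.2.0 delta view.
print(unicodedata.mirrored("\u0f3a"),
      unicodedata.ucd_3_2_0.mirrored("\u0f3a"))
```

The second delta falls outside the assumed range, which is the situation the new flag is meant to signal.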

Fredrik, can you please review these changes?
msg72941 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-10 04:51
Guido, would you like to review?
msg72946 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2008-09-10 07:06
The patch looks fine to me (assuming that I didn't miss something 
critical hidden among the large table diffs).

(I'd probably have named the "NODELTA" flag after what it is rather than what 
it isn't, but I cannot think of a short replacement right now, so let's 
leave it as it is.)
msg72950 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-09-10 09:34
Reviewed the patch: looks fine to me. 

One nit: the unicodedata module doc-string must be updated to 5.1.0 as
well. Ditto for the documentation.
msg72962 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-10 14:11
I have now committed the change as r66362 (including the missing
documentation updates), and ported it to 3.0 as r66363 (where I had to
change the flag value and regenerate the data, as the flag 0x100 was
already taken).
msg72973 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-10 16:11
2008/9/10 Martin v. Löwis <report@bugs.python.org>:
> I have now committed the change as r66362 (including the missing
> documentation updates), and ported it to 3.0 as r66363 (where I had to
> change the flag value and regenerate the data, as the flag 0x100 was
> already taken).

That's unfortunate -- perhaps the 2.6 flag and data can be brought in line,
to make future merges easier?
msg72979 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-10 18:09
> That's unfortunate -- perhaps the 2.6 flag and data can be brought in
> line, to make future merges easier?

I thought of that; however, merging the databases themselves would still
not be possible: the 3.0 database has the flags set in many records,
which causes merge conflicts (as the 2.x database has different flag
values). So regenerating the database is necessary, anyway.

In future changes, it might be useful to have new flags with the same
values, so that such patches can be merged without conflicts in the
generator.
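
One way to pick a flag value that is free in both branches could look like the sketch below. The flag tables and their names are hypothetical placeholders, not the real generator's flag set; the only detail taken from this thread is that 0x100 was already taken on the 3.0 side.

```python
def used_bits(table):
    """OR together every flag value in one branch's table."""
    mask = 0
    for value in table.values():
        mask |= value
    return mask


def first_shared_free_flag(*tables, width=16):
    """Return the lowest single-bit flag unused in all given tables."""
    taken = 0
    for table in tables:
        taken |= used_bits(table)
    for bit in range(width):
        flag = 1 << bit
        if not taken & flag:
            return flag
    raise ValueError("no free flag bit left in %d bits" % width)


# Hypothetical tables: eight flags shared by both branches, plus one
# extra flag at 0x100 that exists only on the 3.0 side.
BRANCH_2X = {f"FLAG_{i}": 1 << i for i in range(8)}
BRANCH_3X = dict(BRANCH_2X, EXTRA_FLAG=0x100)

print(hex(first_shared_free_flag(BRANCH_2X)))             # 0x100
print(hex(first_shared_free_flag(BRANCH_2X, BRANCH_3X)))  # 0x200
```

Choosing the value against both tables avoids the clash described above, although the generated data would still differ between branches.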
msg72987 - (view) Author: Daniel Diniz (ajaksu2) Date: 2008-09-10 21:31
r66363 breaks test_unicode and test_format on 3.0.
msg72997 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-10 23:54
Code point 0x0370 is now a printable character.
r66381 corrected the failures by simply changing it to 0x0378, until the 
next unicodedata upgrade...
I wonder if there is a value that is guaranteed to stay non-printable.
msg73000 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-11 01:08
2008/9/10 Amaury Forgeot d'Arc <report@bugs.python.org>:
> Code point 0x0370 is now a printable character.
> r66381 corrected the failures by simply changing it to 0x0378, until the
> next unicodedata upgrade...
> I wonder if there is a value that is guaranteed to stay non-printable.

The control characters?
msg73005 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-11 06:05
> The control characters?

Indeed, also the private-use characters. test_unicode explicitly
comments that the test is about unassigned characters, although
I don't understand the purpose of that test (it then also tests
a surrogate character, which is also guaranteed to remain
unprintable).

One of the characters that is guaranteed to remain unassigned is
U+FFFE (and its mirrors in other planes, e.g. U+1FFFE, ...).
This guarantee is made to support the BOM. Along with U+FFFF,
these are non-characters. #765036 once suggested that Python should
refuse to represent them at all, but that proposal was rejected.
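
The categories mentioned in this exchange can be checked with a short sketch like the following. It assumes a Python 3 interpreter, where str.isprintable() is available; the SAMPLES table and the describe helper are illustrative names only.

```python
import unicodedata

# One representative code point per class discussed above.
SAMPLES = {
    "control": 0x0001,
    "private use": 0xE000,
    "surrogate": 0xD800,
    "noncharacter": 0xFFFE,
    "noncharacter (plane 1)": 0x1FFFE,
}


def describe(cp):
    """Return (general category, isprintable()) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), ch.isprintable()


for label, cp in SAMPLES.items():
    category, printable = describe(cp)
    print(f"U+{cp:04X} {label}: category={category}, printable={printable}")
```

Any of these classes should give a test a code point that stays non-printable, unlike U+0370, which became an assigned letter in 5.1.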
History
Date, User, Action, Args
2008-09-11 06:05:22  loewis  set  messages: + msg73005
2008-09-11 01:09:44  gvanrossum  set  files: - unnamed
2008-09-11 01:08:53  gvanrossum  set  files: + unnamed
    messages: + msg73000
2008-09-10 23:54:54  amaury.forgeotdarc  set  nosy: + amaury.forgeotdarc
    messages: + msg72997
2008-09-10 21:31:01  ajaksu2  set  nosy: + ajaksu2
    messages: + msg72987
    versions: + Python 3.0
2008-09-10 18:09:23  loewis  set  messages: + msg72979
2008-09-10 16:18:10  gvanrossum  set  files: - unnamed
2008-09-10 16:11:42  gvanrossum  set  files: + unnamed
    messages: + msg72973
2008-09-10 14:11:27  loewis  set  status: open -> closed
    resolution: accepted
    messages: + msg72962
2008-09-10 09:34:27  lemburg  set  nosy: + lemburg
    messages: + msg72950
2008-09-10 07:06:13  effbot  set  messages: + msg72946
2008-09-10 04:51:42  loewis  set  assignee: effbot -> gvanrossum
    messages: + msg72941
    nosy: + gvanrossum
2008-09-09 05:39:59  loewis  set  keywords: + needs review
2008-09-09 05:39:54  loewis  set  keywords: + patch, - needs review
2008-09-09 05:37:53  loewis  create