classification
Title: Generate numeric/space/linebreak from Unicode database.
Type: enhancement Stage: test needed
Components: Interpreter Core Versions: Python 3.0, Python 3.1, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: ajaksu2, amaury.forgeotdarc, andersch, ezio.melotti, lemburg, vernondcole
Priority: normal Keywords: patch

Created on 2006-10-05 07:57 by andersch, last changed 2009-10-06 21:35 by amaury.forgeotdarc. This issue is now closed.

Files
File name Uploaded Description
Unicodedata_part1.patch andersch, 2006-10-05 07:57 Generate unicodedata part1
Unicodedata_part2.patch andersch, 2006-10-05 08:00 Generate unicodedata part2
Unicodedata.patch andersch, 2006-10-06 09:44
unicodedata-2.7.patch amaury.forgeotdarc, 2009-07-01 00:03
Messages (9)
msg51199 - Author: Anders Chrigström (andersch) Date: 2006-10-05 07:57
This patch changes the functions _PyUnicode_ToNumeric,
_PyUnicode_IsLinebreak and _PyUnicode_IsWhitespace from
having to be manually updated into being generated from
data in the unicode database.

It will also read numeric values for characters whose
numeric type is defined in the Unihan.txt file and not
in the UnicodeData.txt file.

The patch should work for both the release25-maint
branch as well as the trunk.

The patch is so big I had to split it into two files
for SourceForge to accept it.
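
For readers following along, the Python-level effect of these three functions can be observed from the interpreter. This is a minimal sketch, assuming Python 3 and assuming that unicodedata.numeric(), str.isspace() and str.splitlines() are the visible front ends of _PyUnicode_ToNumeric, _PyUnicode_IsWhitespace and _PyUnicode_IsLinebreak. The sample characters are chosen for illustration and are not taken from the patch.

```python
import unicodedata

# VULGAR FRACTION ONE HALF: its numeric value comes from UnicodeData.txt.
half = unicodedata.numeric("\u00bd")

# CJK ideograph for "five": its numeric value comes from the Unihan data.
five = unicodedata.numeric("\u4e94")

# NO-BREAK SPACE counts as whitespace according to the database.
nbsp_is_space = "\u00a0".isspace()

# LINE SEPARATOR is treated as a line break by splitlines().
lines = "a\u2028b".splitlines()

print(half, five, nbsp_is_space, lines)
```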

msg51200 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-10-05 10:45

Instead of attaching the patch with the generated code,
could you please just attach the script that generates the
files and/or any patch needed to support the new generation
of the above three functions?

That makes reviewing this a lot easier.

Thanks.
msg51201 - Author: Anders Chrigström (andersch) Date: 2006-10-06 09:44

Here is a patch without the generated files.

msg84457 - Author: Daniel Diniz (ajaksu2) Date: 2009-03-30 02:04
I believe this one is out of date, but without a sample test to check
against, verifying that is harder...
msg89954 - Author: Vernon Cole (vernondcole) Date: 2009-06-30 22:39
Adding Python 2.6 to the list of affected versions - as that is where I
found the bug reported in issue 6383 (now superseded by this one.)
msg89959 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-07-01 00:03
Here is a refreshed version of the patch, without the generated files.
The patch combines several changes which are fairly independent from 
each other:

- Using the unicode database to generate the functions adds 143 new 
codepoints to PyUnicode_ToNumeric, and one codepoint to 
PyUnicode_IsWhitespace.

- In addition, PyUnicode_ToNumeric now contains code for all numerics; 
previously those which are also digits fell in the 'default:' case and 
were converted with PyUnicode_ToDigit(). This adds 468 new codepoints, 
but removes the need to call PyUnicode_ToDigit()

- The Unihan.txt files (two files to download, 25 MB each) are now 
parsed, and this adds 73 more codepoints to PyUnicode_ToNumeric. (There 
are now 1009 entries in this function.)
The 3.2.0 version of this file contains two huge numbers, 1e16 and 1e20, 
so I had to widen the type of 'change_record.numeric_changed' from 'int' to 
'double'.  It is possible that these were removed from the Unicode 
database between versions 4.1 and 5.1.

- the database has a new flag, NUMERIC_MASK, used by 
PyUnicode_IsNumeric.  This adds ~350 lines in the arrays of numbers in 
unicodetype_db.h
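
The effect of the points above can be checked from Python. This is a rough sketch, assuming Python 3 and treating the unicodedata module as the front end to the generated tables. The exact count depends on the Unicode version bundled with the interpreter, so no figure is asserted here.

```python
import sys
import unicodedata

# ARABIC-INDIC DIGIT THREE is both a digit and a numeric character,
# so the digit lookup and the numeric lookup should agree on its value.
ch = "\u0663"
digit_value = unicodedata.digit(ch)
numeric_value = unicodedata.numeric(ch)

# Count every code point that has a numeric value in the bundled database.
numeric_count = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.numeric(chr(cp), None) is not None
)

print(digit_value, numeric_value, ch.isnumeric(), numeric_count)
```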

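As an illustration of why an 'int' field cannot hold the two huge numbers mentioned above: neither 1e16 nor 1e20 fits in a C int, while both are exactly representable as a double. The sketch below uses the struct module and assumes the usual 4-byte native int. The helper name fits is hypothetical, and the code does not touch the actual change_record structure.

```python
import struct

def fits(fmt, value):
    """Return True if value survives a round trip through the C type fmt."""
    try:
        return struct.unpack(fmt, struct.pack(fmt, value))[0] == value
    except struct.error:
        # struct refuses to pack integers that are out of range for the type.
        return False

for exponent in (16, 20):
    as_int = fits("i", 10 ** exponent)             # C int
    as_double = fits("d", float(10 ** exponent))   # C double
    print(exponent, as_int, as_double)
```
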
If this patch is accepted, the md5 checksum in test_unicodedata.py will 
need to change.
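
The reason the checksum has to change can be sketched as follows. This is a hypothetical helper, not the code in test_unicodedata.py: it hashes the numeric value reported for every code point in a range, so any code point that gains a value alters the digest. The names numeric_digest, current_lookup and patched_lookup, the range limit, and the way values are encoded before hashing are all assumptions made for illustration.

```python
import hashlib
import unicodedata

def numeric_digest(lookup, limit=0x3000):
    """Hash the numeric value that lookup reports for each code point below limit."""
    h = hashlib.md5()
    for cp in range(limit):
        h.update(repr(lookup(chr(cp))).encode("ascii"))
    return h.hexdigest()

def current_lookup(ch):
    # Numeric value from the bundled database, or None if there is none.
    return unicodedata.numeric(ch, None)

def patched_lookup(ch):
    # Pretend the database gained a numeric value for one extra character.
    return 1.0 if ch == "A" else current_lookup(ch)

before = numeric_digest(current_lookup)
after = numeric_digest(patched_lookup)
print(before != after)
```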
msg93597 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-10-05 12:33
Marc-Andre, could you comment on this patch?
The comments above were made by inspecting the generated code, comparing
with the previous version.
IMO the only drawback is the increased memory usage.
msg93600 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-10-05 12:55
I haven't tried applying the patch, but from reading it, it looks
good.
msg93663 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-10-06 21:35
Patch applied with r75272.
Merged to py3k, adapted and regenerated files with r75274.
History
Date User Action Args
2010-04-01 11:58:25 flox link issue1498930 superseder
2009-10-06 21:35:55 amaury.forgeotdarc set status: open -> closed; resolution: fixed; messages: + msg93663
2009-10-05 12:55:02 lemburg set messages: + msg93600
2009-10-05 12:38:07 amaury.forgeotdarc set files: - unicodectype_ucs4-2.patch
2009-10-05 12:37:39 amaury.forgeotdarc set files: + unicodectype_ucs4-2.patch
2009-10-05 12:33:21 amaury.forgeotdarc set messages: + msg93597
2009-07-01 00:03:30 amaury.forgeotdarc set files: + unicodedata-2.7.patch; nosy: + amaury.forgeotdarc; messages: + msg89959
2009-06-30 23:01:22 ezio.melotti set nosy: + ezio.melotti
2009-06-30 22:39:43 vernondcole set nosy: + vernondcole; messages: + msg89954; versions: + Python 2.6, Python 3.0
2009-06-30 21:29:30 amaury.forgeotdarc link issue6383 superseder
2009-06-30 21:29:30 amaury.forgeotdarc unlink issue6383 dependencies
2009-06-30 19:11:47 loewis link issue6383 dependencies
2009-03-30 02:04:52 ajaksu2 link issue1571170 dependencies
2009-03-30 02:04:07 ajaksu2 set versions: + Python 3.1, Python 2.7; nosy: + ajaksu2; messages: + msg84457; type: enhancement; stage: test needed
2006-10-05 07:57:32 andersch create