Message 89959 - Python tracker

Author	amaury.forgeotdarc
Recipients	ajaksu2, amaury.forgeotdarc, andersch, ezio.melotti, lemburg, vernondcole
Date	2009-07-01.00:03:25
Message-id	<1246406612.38.0.0754513900135.issue1571184@psf.upfronthosting.co.za>

Content

Here is a refreshed version of the patch, without the generated files.
The patch combines several changes which are fairly independent from 
each other:

- Using the Unicode database to generate the functions adds 143 new 
codepoints to PyUnicode_ToNumeric, and one codepoint to 
PyUnicode_IsWhitespace.

- In addition, PyUnicode_ToNumeric now contains code for all numerics; 
previously those which are also digits fell into the 'default:' case and 
were converted with PyUnicode_ToDigit(). This adds 468 new codepoints, 
but removes the need to call PyUnicode_ToDigit().

- The Unihan.txt files (two files to download, 25 MB each) are now 
parsed, and this adds 73 more codepoints to PyUnicode_ToNumeric. (There 
are now 1009 entries in this function.)
The 3.2.0 version of this file contains two huge numbers, 1e16 and 1e20, 
so I had to widen the type of 'change_record.numeric_changed' from 'int' 
to 'double'.  It is possible that these were removed from the Unicode 
database between versions 4.1 and 5.1.

- The database has a new flag, NUMERIC_MASK, used by 
PyUnicode_IsNumeric.  This adds ~350 lines to the arrays of numbers in 
unicodetype_db.h.
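
To illustrate the digit/numeric distinction and the 'int' to 'double' 
widening discussed above, here is a minimal sketch from the Python side. 
It assumes a current Python 3, where unicodedata.numeric() and 
unicodedata.digit() expose these values, and it assumes a 32-bit C 'int'. 
The helper name describe() is made up for this example and is not part of 
the patch.

```python
import unicodedata


def describe(ch):
    """Return (numeric value, digit value or None) for one character."""
    return unicodedata.numeric(ch, None), unicodedata.digit(ch, None)


# ASCII digit: it is both a digit and a numeric.
print(describe("5"))        # (5.0, 5)
# VULGAR FRACTION ONE HALF: numeric, but not a digit.
print(describe("\u00bd"))   # (0.5, None)
# CJK ideograph for "four": its value comes from the Unihan data.
print(describe("\u56db"))   # (4.0, None)

# Assumption: a 32-bit C int. Neither huge value fits in it, but both are
# exactly representable as a double.
INT_MAX = 2**31 - 1
for big in (10**16, 10**20):
    print(big > INT_MAX, float(big) == big)   # True True
```

The three sample characters are only representatives of each category; 
they are not taken from the lists of codepoints added by the patch.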

If this patch is accepted, the md5 checksum in test_unicodedata.py will 
need to change.
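
As a rough illustration of why the checksum would change, the sketch 
below hashes the numeric value of a range of codepoints and shows that 
giving a single codepoint a new value alters the digest. This is not the 
code from test_unicodedata.py; the function names, the 0x3000 limit and 
the pretend value for "A" are all assumptions made for the example.

```python
import hashlib
import unicodedata


def numeric_checksum(lookup=unicodedata.numeric, limit=0x3000):
    """Hash the numeric value of every codepoint below `limit`."""
    h = hashlib.md5()
    for cp in range(limit):
        # -1 stands in for "no numeric value".
        value = lookup(chr(cp), -1)
        h.update(repr(value).encode("ascii"))
    return h.hexdigest()


def patched_lookup(ch, default):
    # Hypothetical change: pretend "A" newly gained a numeric value.
    if ch == "A":
        return 10.0
    return unicodedata.numeric(ch, default)


before = numeric_checksum()
after = numeric_checksum(patched_lookup)
print(before != after)   # True
```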

History
Date	User	Action	Args
2009-07-01 00:03:32	amaury.forgeotdarc	set	recipients: + amaury.forgeotdarc, lemburg, ajaksu2, andersch, ezio.melotti, vernondcole
2009-07-01 00:03:32	amaury.forgeotdarc	set	messageid: <1246406612.38.0.0754513900135.issue1571184@psf.upfronthosting.co.za>
2009-07-01 00:03:30	amaury.forgeotdarc	link	issue1571184 messages
2009-07-01 00:03:30	amaury.forgeotdarc	create