Generate numeric/space/linebreak from Unicode database. #44085

Closed
andersch mannequin opened this issue Oct 5, 2006 · 9 comments
Labels
  • interpreter-core (Objects, Python, Grammar, and Parser dirs)
  • type-feature (A feature request or enhancement)

Comments

@andersch
Mannequin

andersch mannequin commented Oct 5, 2006

BPO 1571184
Nosy @malemburg, @amauryfa, @devdanzin, @ezio-melotti
Files
  • Unicodedata_part1.patch: Generate unicodedata part1
  • Unicodedata_part2.patch: Generate unicodedata part2
  • Unicodedata.patch
  • unicodedata-2.7.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-10-06.21:35:55.116>
    created_at = <Date 2006-10-05.07:57:32.000>
    labels = ['interpreter-core', 'type-feature']
    title = 'Generate numeric/space/linebreak from Unicode database.'
    updated_at = <Date 2009-10-06.21:35:55.114>
    user = 'https://bugs.python.org/andersch'

    bugs.python.org fields:

    activity = <Date 2009-10-06.21:35:55.114>
    actor = 'amaury.forgeotdarc'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-10-06.21:35:55.116>
    closer = 'amaury.forgeotdarc'
    components = ['Interpreter Core']
    creation = <Date 2006-10-05.07:57:32.000>
    creator = 'andersch'
    dependencies = []
    files = ['7564', '7565', '7566', '14413']
    hgrepos = []
    issue_num = 1571184
    keywords = ['patch']
    message_count = 9.0
    messages = ['51199', '51200', '51201', '84457', '89954', '89959', '93597', '93600', '93663']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'amaury.forgeotdarc', 'ajaksu2', 'andersch', 'ezio.melotti', 'vernondcole']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue1571184'
    versions = ['Python 2.6', 'Python 3.0', 'Python 3.1', 'Python 2.7']

    @andersch
    Mannequin Author

    andersch mannequin commented Oct 5, 2006

    This patch changes the functions _PyUnicode_ToNumeric,
    _PyUnicode_IsLinebreak and _PyUnicode_IsWhitespace so that
    they are generated from data in the Unicode database
    instead of having to be updated manually.

    It will also read numeric values for characters whose
    numeric type is defined in the Unihan.txt file and not
    in the UnicodeData.txt file.

    The patch should work for both the release25-maint
    branch and the trunk.

    The patch is so big that I had to split it into two files
    for SourceForge to accept it.
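
To illustrate the idea of generating such a function from database records, here is a minimal Python sketch. It is not the script from the patch: the function names `parse_numeric`, `emit_c_switch` and `example_to_numeric` are hypothetical, and the three sample records are only a tiny excerpt in the UnicodeData.txt layout (field 0 is the code point, field 8 the numeric value).

```python
from fractions import Fraction

# A few records in the semicolon-separated UnicodeData.txt layout.
# Field 0 is the code point in hex; field 8 is the numeric value, if any.
SAMPLE_RECORDS = """\
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
"""

def parse_numeric(records):
    """Map code point -> numeric value (a Fraction) for records that have one."""
    table = {}
    for line in records.splitlines():
        fields = line.split(";")
        if len(fields) > 8 and fields[8]:
            table[int(fields[0], 16)] = Fraction(fields[8])
    return table

def emit_c_switch(table, func_name="example_to_numeric"):
    """Render the table as a C function built around a switch statement.

    func_name is a hypothetical name, not the one used in CPython.
    """
    by_value = {}
    for code, value in table.items():
        by_value.setdefault(value, []).append(code)
    lines = [f"double {func_name}(unsigned int ch)", "{", "    switch (ch) {"]
    for value in sorted(by_value):
        # All code points sharing a value fall through to one return.
        for code in sorted(by_value[value]):
            lines.append(f"    case 0x{code:04X}:")
        lines.append(
            f"        return (double) {value.numerator}.0 / {value.denominator}.0;"
        )
    lines += ["    default:", "        return -1.0;", "    }", "}"]
    return "\n".join(lines)

numeric_table = parse_numeric(SAMPLE_RECORDS)
c_source = emit_c_switch(numeric_table)
print(c_source)
```

Grouping code points by value keeps the generated switch short, since many characters share the same numeric value.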

    @andersch andersch mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 5, 2006
    @malemburg
    Member

    Instead of attaching the patch with the generated code,
    could you please just attach the script that generates the
    files and/or any patch needed to support the new generation
    of the above three functions ?

    That makes reviewing this a lot easier.

    Thanks.

    @andersch
    Mannequin Author

    andersch mannequin commented Oct 6, 2006

    Here is a patch without the generated files.

    @devdanzin
    Mannequin

    devdanzin mannequin commented Mar 30, 2009

    I believe this one is out of date, but without a sample test to check
    against, verifying that is harder...

    @devdanzin devdanzin mannequin added type-feature A feature request or enhancement labels Mar 30, 2009
    @vernondcole
    Mannequin

    vernondcole mannequin commented Jun 30, 2009

    Adding Python 2.6 to the list of affected versions, as that is where I
    found the bug reported in bpo-6383 (now superseded by this one).

    @amauryfa
    Member

    amauryfa commented Jul 1, 2009

    Here is a refreshed version of the patch, without the generated files.
    The patch combines several changes which are fairly independent from
    each other:

    • Using the unicode database to generate the functions adds 143 new
      codepoints to PyUnicode_ToNumeric, and one codepoint to
      PyUnicode_IsWhitespace.

    • In addition, PyUnicode_ToNumeric now contains code for all numerics;
      previously those which are also digits fell in the 'default:' case and
      were converted with PyUnicode_ToDigit(). This adds 468 new codepoints,
      but removes the need to call PyUnicode_ToDigit().

    • The Unihan.txt files (two files to download, 25 MB each) are now
      parsed, which adds 73 more codepoints to PyUnicode_ToNumeric. (There
      are now 1009 entries in this function.)
      The 3.2.0 version of this file contains two huge numbers, 1e16 and 1e20,
      so I had to widen the type of 'change_record.numeric_changed' from 'int'
      to 'double'. It is possible that these were removed from the Unicode
      database between versions 4.1 and 5.1.

    • The database has a new flag, NUMERIC_MASK, used by
      PyUnicode_IsNumeric. This adds ~350 lines in the arrays of numbers in
      unicodetype_db.h.
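
The difference between "numeric" and "digit" characters, and the reason for widening 'int' to 'double', can be illustrated from Python with the standard `unicodedata` module. This is only a sketch of the concepts, not code from the patch; the 32-bit limit `INT32_MAX` is an assumption about the width of a C 'int' on common platforms.

```python
import unicodedata

# Numeric but not a digit: the vulgar fraction one half (U+00BD).
half = "\u00bd"
print(unicodedata.numeric(half))      # a numeric value is defined
print(unicodedata.digit(half, None))  # no digit value, so the default is returned

# Both numeric and a digit: ASCII "5".
five = "5"
print(unicodedata.numeric(five), unicodedata.digit(five))

# The two large values mentioned above do not fit in a 32-bit C int,
# but both are exactly representable as IEEE 754 doubles.
INT32_MAX = 2**31 - 1  # assumption: a 32-bit C int
for value in (10**16, 10**20):
    print(value > INT32_MAX, float(value) == value)
```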

    If this patch is accepted, the md5 checksum in test_unicodedata.py will
    need to change.
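
As a rough illustration of why a checksum over database values has to change when new numeric values are added, the sketch below hashes the numeric value of a range of code points. It is not the code in test_unicodedata.py: `numeric_checksum` and `patched_lookup` are hypothetical names, and the "database change" is simulated.

```python
import hashlib
import unicodedata

def numeric_checksum(code_points, lookup=unicodedata.numeric):
    """Return an MD5 hex digest over the numeric value of each code point."""
    digest = hashlib.md5()
    for cp in code_points:
        value = lookup(chr(cp), -1.0)  # -1.0 marks "no numeric value"
        digest.update(f"{cp:04X}={value!r};".encode("ascii"))
    return digest.hexdigest()

def patched_lookup(ch, default):
    """Hypothetical database change: pretend 'A' gained the numeric value 10."""
    if ch == "A":
        return 10.0
    return unicodedata.numeric(ch, default)

code_points = range(0x80)
before = numeric_checksum(code_points)
after = numeric_checksum(code_points, patched_lookup)
print(before)
print(after)
print(before != after)
```

A single changed value is enough to produce a different digest, so any patch that adds numeric entries would be expected to invalidate a stored checksum of this kind.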

    @amauryfa
    Member

    amauryfa commented Oct 5, 2009

    Marc-Andre, could you comment on this patch?
    The comments above were made by inspecting the generated code, comparing
    with the previous version.
    IMO the only drawback is the increased memory usage.

    @malemburg
    Member

    Amaury Forgeot d'Arc wrote:

    > Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
    >
    > Marc-Andre, could you comment on this patch?
    > The comments above were made by inspecting the generated code, comparing
    > with the previous version.
    > IMO the only drawback is the increased memory usage.

    I haven't tried applying the patch, but from reading it, it looks
    good.

    @amauryfa
    Member

    amauryfa commented Oct 6, 2009

    Patch applied with r75272.
    Merged to py3k, adapted and regenerated files with r75274.

    @amauryfa amauryfa closed this as completed Oct 6, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022