Update Unicode database to 5.1.0 #48061

Closed
loewis mannequin opened this issue Sep 9, 2008 · 11 comments
@loewis
Mannequin

loewis mannequin commented Sep 9, 2008

BPO 3811
Nosy @malemburg, @gvanrossum, @loewis, @amauryfa, @devdanzin
Files
  • ucd51.diff.bz2
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/gvanrossum'
    closed_at = <Date 2008-09-10.14:11:27.403>
    created_at = <Date 2008-09-09.05:37:53.238>
    labels = []
    title = 'Update Unicode database to 5.1.0'
    updated_at = <Date 2008-09-11.06:05:22.713>
    user = 'https://github.com/loewis'

    bugs.python.org fields:

    activity = <Date 2008-09-11.06:05:22.713>
    actor = 'loewis'
    assignee = 'gvanrossum'
    closed = True
    closed_date = <Date 2008-09-10.14:11:27.403>
    closer = 'loewis'
    components = []
    creation = <Date 2008-09-09.05:37:53.238>
    creator = 'loewis'
    dependencies = []
    files = ['11429']
    hgrepos = []
    issue_num = 3811
    keywords = ['patch', 'needs review']
    message_count = 11.0
    messages = ['72821', '72941', '72946', '72950', '72962', '72973', '72979', '72987', '72997', '73000', '73005']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'gvanrossum', 'loewis', 'effbot', 'amaury.forgeotdarc', 'ajaksu2']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue3811'
    versions = ['Python 2.6', 'Python 3.0']

    @loewis
    Mannequin Author

    loewis mannequin commented Sep 9, 2008

    This is a patch to update the Unicode database. It's mostly the imported
    data, but there were two code changes:

    • 5.1 changes the "mirrored" property for a character (U+0F3A), and the
      delta-to-3.2 code did not support that. I added a field to
      change_record to support that kind of change.
    • 5.1 also added a character (U+1D79) whose upper-case version is far
      off (U+A77D), triggering a complaint that the delta can't be represented
      in 16 bits. I fixed that by adding a flag to the ctype record indicating
      that deltas aren't used for that record.
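
To illustrate the second point, here is a minimal sketch of a delta encoding with an escape flag. It is not the code from the patch: the function names, the signed 16-bit range, and the flag value `0x100` are assumptions made for illustration only.

```python
# Hypothetical sketch: store a case mapping as a 16-bit delta when it fits,
# otherwise store the target code point itself and set a "no delta" flag.
INT16_MIN, INT16_MAX = -32768, 32767
NODELTA_MASK = 0x100  # assumed value, chosen only for this example


def encode_case_mapping(char, mapped, flags=0):
    """Return (stored_value, flags) for the mapping char -> mapped."""
    delta = mapped - char
    if INT16_MIN <= delta <= INT16_MAX:
        # Delta fits: keep it as a 16-bit two's-complement value.
        return delta & 0xFFFF, flags
    # Delta does not fit: store the absolute code point and mark the record.
    return mapped, flags | NODELTA_MASK


def decode_case_mapping(char, stored, flags):
    """Invert encode_case_mapping."""
    if flags & NODELTA_MASK:
        return stored
    # Sign-extend the 16-bit delta before applying it.
    delta = stored - 0x10000 if stored & 0x8000 else stored
    return char + delta


# U+1D79 maps to U+A77D; the delta 0x8A04 exceeds the signed 16-bit range.
src = 0x1D79
dst = ord(chr(src).upper())
stored, flags = encode_case_mapping(src, dst)
print(hex(dst - src), hex(stored), hex(flags))
```

An ordinary pair such as "a" and "A" still takes the delta path in this sketch, so only the rare far-away mappings pay for the flag.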

    Fredrik, can you please review these changes?

    @loewis loewis mannequin assigned effbot Sep 9, 2008
    @loewis
    Mannequin Author

    loewis mannequin commented Sep 10, 2008

    Guido, would you like to review?

    @loewis loewis mannequin assigned gvanrossum and unassigned effbot Sep 10, 2008
    @effbot
    Mannequin

    effbot mannequin commented Sep 10, 2008

    The patch looks fine to me (assuming that I didn't miss something
    critical hidden among the large table diffs).

    (I'd probably have named the "NODELTA" flag after what it is rather than
    what it isn't, but I cannot think of a short replacement right now, so
    let's leave it as it is.)

    @malemburg
    Member

    Reviewed the patch: looks fine to me.

    One nit: the unicodedata module doc-string must be updated to 5.1.0 as
    well. Ditto for the documentation.

    @loewis
    Mannequin Author

    loewis mannequin commented Sep 10, 2008

    I have now committed the change as r66362 (including the missing
    documentation updates), and ported it to 3.0 as r66363 (where I had to
    change the flag value and regenerate the data, as the flag 0x100 was
    already taken).

    @loewis loewis mannequin closed this as completed Sep 10, 2008
    @gvanrossum
    Member

    2008/9/10 Martin v. Löwis <report@bugs.python.org>:

    > I have now committed the change as r66362 (including the missing
    > documentation updates), and ported it to 3.0 as r66363 (where I had to
    > change the flag value and regenerate the data, as the flag 0x100 was
    > already taken).

    That's unfortunate -- perhaps the 2.6 flag and data can be brought in line,
    to make future merges easier?

    @loewis
    Mannequin Author

    loewis mannequin commented Sep 10, 2008

    > That's unfortunate -- perhaps the 2.6 flag and data can be brought in
    > line, to make future merges easier?

    I thought of that, however, merging the databases themselves would still
    not be possible: the 3.0 database has the flags set in many records,
    which causes merge conflicts (as the 2.x database has different flag
    values). So regenerating the database is necessary, anyway.

    In future changes, it might be useful to have new flags with the same
    values, so that such patches can be merged without conflicts in the
    generator.
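
As a rough illustration of picking a flag value that is free on both branches, the sketch below uses made-up flag tables. The names `FLAG_0` through `FLAG_7`, `BRANCH_ONLY_MASK`, and both helper functions are hypothetical and are not taken from either source tree.

```python
# Hypothetical flag tables for two branches; names and values are illustrative.
branch_a = {"FLAG_%d" % i: 1 << i for i in range(8)}   # uses bits 0x01..0x80
branch_b = dict(branch_a, BRANCH_ONLY_MASK=0x100)      # one extra flag


def find_collisions(flags):
    """Return pairs of flag names that share at least one bit."""
    names = sorted(flags)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if flags[a] & flags[b]
    ]


def next_free_bit(*flag_tables):
    """Return the lowest single-bit value unused in every given table."""
    used = 0
    for table in flag_tables:
        for value in table.values():
            used |= value
    bit = 1
    while used & bit:
        bit <<= 1
    return bit


# Choosing the value from one branch alone collides on the other branch.
naive = next_free_bit(branch_a)
print(hex(naive), find_collisions(dict(branch_b, NODELTA_MASK=naive)))

# Choosing it against both branches gives a value usable in each.
shared = next_free_bit(branch_a, branch_b)
print(hex(shared), find_collisions(dict(branch_b, NODELTA_MASK=shared)))
```

The generated tables would still differ between branches, as noted above; a shared value only keeps the generator change itself mergeable.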

    @devdanzin
    Mannequin

    devdanzin mannequin commented Sep 10, 2008

    r66363 breaks test_unicode and test_format on 3.0.

    @amauryfa
    Member

    Code point 0x0370 is now a printable character.
    r66381 corrected the failures by simply changing it to 0x0378, until the
    next unicodedata upgrade...
    I wonder if there is a value that is guaranteed to stay non-printable.
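
A quick way to see why the test broke is to ask the running interpreter about the two code points. This is a small sketch for Python 3; the helper name `describe` is made up here, and what it reports for 0x0378 depends on the Unicode version the interpreter ships with.

```python
import unicodedata


def describe(cp):
    """Return (category, isprintable, repr) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), ch.isprintable(), repr(ch)


print("Unicode database version:", unicodedata.unidata_version)
for cp in (0x0370, 0x0378):
    # 0x0370 was assigned in Unicode 5.1; 0x0378 may be assigned some day too.
    print(hex(cp), describe(cp))
```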

    @gvanrossum
    Member

    2008/9/10 Amaury Forgeot d'Arc <report@bugs.python.org>:

    > Code point 0x0370 is now a printable character.
    > r66381 corrected the failures by simply changing it to 0x0378, until the
    > next unicodedata upgrade...
    > I wonder if there is a value that is guaranteed to stay non-printable.

    The control characters?

    @loewis
    Mannequin Author

    loewis mannequin commented Sep 11, 2008

    > The control characters?

    Indeed, also the private-use characters. test_unicode explicitly
    comments that the test is about unassigned characters, although
    I don't understand the purpose of that test (it then also tests
    a surrogate character, which is also guaranteed to remain
    unprintable).

    One of the characters that is guaranteed to remain unassigned is
    U+FFFE (and its mirrors in other planes, e.g. U+1FFFE, ...).
    This guarantee is made to support the BOM. Along with U+FFFF,
    these are non-characters. bpo-765036 once suggested that Python should
    refuse to represent them at all, but that proposal was rejected.
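
The candidates mentioned in this thread can be checked with a short Python 3 sketch. The sample table and the helper `is_noncharacter` are written for this example; the helper also covers U+FDD0..U+FDEF, which Unicode likewise defines as noncharacters.

```python
import unicodedata

# One sample code point per class discussed above.
SAMPLES = {
    "control": 0x0001,
    "surrogate": 0xD800,
    "private use": 0xE000,
    "noncharacter": 0xFFFE,
    "noncharacter (plane 1)": 0x1FFFE,
}


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def report(samples=SAMPLES):
    """Map each label to (general category, str.isprintable())."""
    return {
        label: (unicodedata.category(chr(cp)), chr(cp).isprintable())
        for label, cp in samples.items()
    }


for label, (category, printable) in report().items():
    print(f"{label:24} category={category} printable={printable}")
```

A test that wants a stable non-printable value could pick from any of these classes rather than from a currently unassigned code point in an allocated block.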

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022