bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #73642

Closed
Pusnow mannequin opened this issue Feb 6, 2017 · 23 comments
Labels
3.7 (EOL): end of life; 3.8: only security fixes; topic-unicode; type-bug: An unexpected behavior, bug, or error

Comments

@Pusnow
Mannequin

Pusnow mannequin commented Feb 6, 2017

BPO 29456
Nosy @malemburg, @loewis, @vstinner, @ezio-melotti, @animalize, @zhangyangyu, @Pusnow, @miss-islington
PRs
  • bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3 #1958
  • [3.7] bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) #7702
  • [3.6] bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) #7703
  • [2.7] bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) #7704
Files
  • u1176.patch
  • u11a7u11c3.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2018-06-15.13:28:49.407>
    created_at = <Date 2017-02-06.04:27:52.177>
    labels = ['3.8', 'type-bug', '3.7', 'expert-unicode']
    title = 'bugs in unicodedata.normalize: u1176, u11a7 and u11c3'
    updated_at = <Date 2018-06-18.14:21:55.246>
    user = 'https://github.com/Pusnow'

    bugs.python.org fields:

    activity = <Date 2018-06-18.14:21:55.246>
    actor = 'xiang.zhang'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-06-15.13:28:49.407>
    closer = 'xiang.zhang'
    components = ['Unicode']
    creation = <Date 2017-02-06.04:27:52.177>
    creator = 'pusnow'
    dependencies = []
    files = ['46535', '46536']
    hgrepos = []
    issue_num = 29456
    keywords = ['patch']
    message_count = 23.0
    messages = ['287077', '287078', '287079', '295123', '295171', '295172', '299214', '299657', '300039', '300046', '300576', '300933', '313056', '315214', '319591', '319608', '319609', '319610', '319615', '319701', '319719', '319802', '319886']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'vstinner', 'ezio.melotti', 'malin', 'xiang.zhang', 'pusnow', 'miss-islington']
    pr_nums = ['1958', '7702', '7703', '7704']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue29456'
    versions = ['Python 2.7', 'Python 3.6', 'Python 3.7', 'Python 3.8']

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Feb 6, 2017

    unicodedata.normalize("NFC", ...) gives the wrong result for Hangul strings that contain \u1176 (HANGUL JUNGSEONG A-O).

    >>> from unicodedata import normalize
    >>> normalize("NFC", "\u1100\u1176\u11a8")
    '깍'

    => The result should be "\u1100\u1176\u11a8", not '깍' (\uae4d).

    I attached a patch for this issue. It fixes the upper boundary of the modern medial vowels.

    @Pusnow Pusnow mannequin added the topic-unicode label Feb 6, 2017
    @zhangyangyu
    Member

    How about the third character's range? The code seems to assume it's [11a7..11c3], while the spec says [11a8..11c2].

    >>> unicodedata.normalize("NFC", "\u1100\u1175\u11a7")
    '기'

    while it should be '기ᆧ'?

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Feb 6, 2017

    I think you are right. The range of the modern final consonants is [11a8..11c2].
    I attached another patch for this issue.
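
For reference, the two range fixes discussed so far can be sketched in Python. This is only an illustration of the arithmetic Hangul composition described in the Unicode Standard, not the C code in unicodedata. The function name `compose_hangul` is made up for this example, it only handles sequences of conjoining jamo, and the constants are the standard ones (SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7, LCount = 19, VCount = 21, TCount = 28).

```python
# Constants from the Unicode Standard's Hangul syllable composition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_hangul(text):
    """Compose modern Hangul jamo sequences into syllables (illustrative only)."""
    out = []
    i = 0
    while i < len(text):
        l_index = ord(text[i]) - L_BASE
        v_index = ord(text[i + 1]) - V_BASE if i + 1 < len(text) else -1
        # Leading consonant in [1100..1112], vowel in [1161..1175].
        if 0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT:
            code = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
            i += 2
            if i < len(text):
                t_index = ord(text[i]) - T_BASE
                # Trailing consonant in [11a8..11c2]: both bounds are strict,
                # because TBase itself (11a7) stands for "no trailing consonant".
                if 0 < t_index < T_COUNT:
                    code += t_index
                    i += 1
            out.append(chr(code))
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

With these bounds, "\u1100\u1161\u11a8" composes to '각' (\uac01), while the three code points from the issue title (\u1176 as a vowel, \u11a7 and \u11c3 as a final consonant) are left uncomposed.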

    @Pusnow Pusnow mannequin changed the title from "bug in unicodedata.normalize: u1176" to "bug in unicodedata.normalize: u1176, u11a7 and u11c3" Mar 11, 2017
    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life type-bug An unexpected behavior, bug, or error labels Mar 11, 2017
    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Jun 4, 2017

    Is anything more needed?

    @zhangyangyu
    Member

    We have moved our code hosting to GitHub. Would you mind turning your patch into a GitHub PR first, Wonsup?

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Jun 5, 2017

    Ok, I'll do it.

    @Pusnow Pusnow mannequin changed the title from "bug in unicodedata.normalize: u1176, u11a7 and u11c3" to "bugs in unicodedata.normalize: u1176, u11a7 and u11c3" Jun 5, 2017
    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Jul 26, 2017

    Any updates? I need this fix for my project.

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Aug 2, 2017

    I added some test cases for this issue. Could someone please review them?

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Aug 10, 2017

    I think it can be merged. Is there anything I need to do?

    @zhangyangyu
    Member

    Hi Wonsup, sorry for the delay. I have been really busy with work these days. If no one else gets involved, I'll try to find time to review your patch this week.

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Aug 19, 2017

    This patch brings the code in line with the changes made in Unicode 4.1.0.
    I think it has been well reviewed and it is time to merge.
    Who can commit this patch?

    @animalize says:
    Let me add some background:

    Before Unicode 4.1.0 (draft), the check was: TBase <= code <= TBase+TCount
    see: http://www.unicode.org/reports/tr15/tr15-24.html#hangul_composition

    From Unicode 4.1.0 on, the check is: TBase < code < TBase+TCount, which is in line with the latest version (Unicode 10.0)
    see: http://www.unicode.org/reports/tr15/tr15-25.html#hangul_composition

    This change happened in 2005.
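
The difference between the two checks can be made concrete with a small Python sketch. The helper names `old_check` and `new_check` are hypothetical and only mirror the two inequalities quoted above. TBase = 0x11A7 and TCount = 28 are the constants from the Unicode Standard.

```python
T_BASE, T_COUNT = 0x11A7, 28


def old_check(code):
    # Inclusive bounds, as in the pre-4.1.0 text: TBase <= code <= TBase+TCount
    return T_BASE <= code <= T_BASE + T_COUNT


def new_check(code):
    # Strict bounds, as in 4.1.0 and later: TBase < code < TBase+TCount
    return T_BASE < code < T_BASE + T_COUNT


# Code points accepted by the old check but rejected by the new one.
differing = [hex(c) for c in range(0x11A0, 0x11D0) if old_check(c) != new_check(c)]
print(differing)  # ['0x11a7', '0x11c3']
```

Those two code points are exactly the ones named in the issue title for the third character.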

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Aug 28, 2017

    Hello?

    @animalize
    Mannequin

    animalize mannequin commented Feb 28, 2018

    ping, this was forgotten.

    @Pusnow
    Mannequin Author

    Pusnow mannequin commented Apr 12, 2018

    Hello!

    @zhangyangyu
    Member

    Sorry for the absence and late response. I just reviewed it and think it's ready. I think the change in the Unicode standard looks more like a fix for a bug in the sample implementation than an intentional change in behavior. Unicode 3.0 already states that the third character is out of bounds when TIndex <= 0 or TIndex >= TCount. We have a ucd_3_2_0 in unicodedata.

    I'll merge it after sorting out the CI bot.

    @zhangyangyu zhangyangyu added the 3.8 only security fixes label Jun 15, 2018
    @zhangyangyu
    Member

    New changeset d134809 by Xiang Zhang (Wonsup Yoon) in branch 'master':
    bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
    d134809

    @miss-islington
    Contributor

    New changeset 0e2b76e by Miss Islington (bot) in branch '3.7':
    bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
    0e2b76e

    @miss-islington
    Contributor

    New changeset e2e7ff0 by Miss Islington (bot) in branch '3.6':
    bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958)
    e2e7ff0

    @zhangyangyu
    Member

    New changeset 1889c4c by Xiang Zhang in branch '2.7':
    bpo-29456: Fix bugs in unicodedata.normalize: u1176, u11a7 and u11c3 (GH-1958) (GH-7704)
    1889c4c

    @zhangyangyu zhangyangyu added stdlib Python modules in the Lib dir and removed topic-unicode labels Jun 15, 2018
    @animalize
    Mannequin

    animalize mannequin commented Jun 16, 2018

    > We have a ucd_3_2_0 in unicodedata.

    This 3.2 unicodedata is probably used for IDNA2003.
    IDNA2003 includes a step that normalizes the domain_name string to Unicode Normalization Form C.

    We have now changed the Hangul composition code to match Unicode Standard 4.1+, so the fix also applies to versions of the standard before 4.1.
    Could this (the Unicode Standard 4.1+ behavior) cause a security vulnerability for someone who is using IDNA2003 via ucd_3_2_0?

    @zhangyangyu
    Member

    As I said, I checked the Hangul composition algorithm in Unicode 3.0. It looks consistent with Unicode 4.1+. Unicode 3.0 only has a description and no sample implementation. So I think the changed code also applies to Unicode 3.0+.

    @animalize
    Mannequin

    animalize mannequin commented Jun 17, 2018

    You are right.

    I found a Normalization Test Suite for Unicode 3.2
    http://www.unicode.org/Public/3.2-Update/NormalizationTest-3.2.0.txt

    \u1176 is not in the range of the second character.
    \u11a7, \u11c3 are not in the range of the third character.
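
A quick way to check an interpreter for this behavior is sketched below. It assumes a Python version that already includes the fix from this issue; on an unpatched interpreter the assertions are expected to fail. `unicodedata.ucd_3_2_0` is the Unicode 3.2.0 database object discussed above, and the `cases` dictionary is just a name chosen for this example.

```python
import unicodedata

# Sequences from this issue that must not be composed any further.
cases = {
    "\u1100\u1176\u11a8": "\u1100\u1176\u11a8",  # U+1176 is not a modern vowel
    "\u1100\u1175\u11a7": "\uae30\u11a7",        # U+11A7 equals TBase, not a final consonant
    "\u1100\u1175\u11c3": "\uae30\u11c3",        # U+11C3 is past the modern final range
}

for source, expected in cases.items():
    assert unicodedata.normalize("NFC", source) == expected
    # The Unicode 3.2.0 database should agree with the current one here.
    assert unicodedata.ucd_3_2_0.normalize("NFC", source) == expected
```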

    @zhangyangyu
    Member

    Thanks for your confirmation, Ma Lin. Thanks to Wonsup as well!

    @zhangyangyu zhangyangyu added topic-unicode and removed stdlib Python modules in the Lib dir labels Jun 18, 2018
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022