idna encoding fails for Cherokee symbols #85022

Closed
RomanAkopov mannequin opened this issue Jun 2, 2020 · 5 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@RomanAkopov
Mannequin

RomanAkopov mannequin commented Jun 2, 2020

BPO 40845
Nosy @vstinner, @tiran, @ezio-melotti

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2020-06-02.21:49:50.369>
created_at = <Date 2020-06-02.17:53:00.347>
labels = ['3.8', 'type-bug', '3.7', 'expert-unicode']
title = 'idna encoding fails for Cherokee symbols'
updated_at = <Date 2020-06-02.21:49:50.368>
user = 'https://bugs.python.org/RomanAkopov'

bugs.python.org fields:

activity = <Date 2020-06-02.21:49:50.368>
actor = 'christian.heimes'
assignee = 'none'
closed = True
closed_date = <Date 2020-06-02.21:49:50.369>
closer = 'christian.heimes'
components = ['Unicode']
creation = <Date 2020-06-02.17:53:00.347>
creator = 'Roman Akopov'
dependencies = []
files = []
hgrepos = []
issue_num = 40845
keywords = []
message_count = 5.0
messages = ['370615', '370617', '370628', '370629', '370634']
nosy_count = 5.0
nosy_names = ['vstinner', 'christian.heimes', 'ezio.melotti', 'SilentGhost', 'Roman Akopov']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40845'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

@RomanAkopov
Mannequin Author

RomanAkopov mannequin commented Jun 2, 2020

For a specific three-character Cherokee string, b'\\u13e3\\u13b3\\u13a9', generating the punycode representation fails.

What steps will reproduce the problem?

Execute 'ꮳꮃꭹ'.encode('idna')
or, even more reliably,
Execute '\u13e3\u13b3\u13a9'.encode('idna')

What is the expected result?

'xn--f9dt7l'

What happens instead?

'xn--tz9ata7l'

Version affected.

Tested on Python 3.8.3 Windows and Python 3.6.8 CentOS.

Other information.

I was testing whether our product supports internationalized domain names. I wrote a Python script that generates a DNS zone file with punycode-encoded names and a JavaScript file that makes a browser send requests to URLs containing internationalized domain names. The strings were taken from the Common Locale Data Repository: 193 URLs, one per language.

When executed in Google Chrome, Mozilla Firefox and Microsoft Edge, the domain name 'ꮳꮃꭹ.myhost.local' is converted to 'xn--f9dt7l.myhost.local', but we have 'xn--tz9ata7l.myhost.local' in the DNS zone file, and this is how I found the bug. For the 192 other languages I have tested, everything works just fine. These are Afrikaans, Aghem, Akan, Amharic, Arabic, Assamese, Asu, Asturian, Azerbaijani, Basaa, Belarusian, Bemba, Bena, Bulgarian, Bambara, Bangla, Tibetan, Breton, Bodo, Bosnian, Catalan, Chakma, Chechen, Cebuano, Chiga, Czech, Church Slavic, Welsh, Danish, Taita, German, Zarma, Lower Sorbian, Duala, Jola-Fonyi, Dzongkha, Embu, Ewe, Greek, English, Esperanto, Spanish, Estonian, Basque, Ewondo, Persian, Fulah, Finnish, Filipino, Faroese, French, Friulian, Western Frisian, Irish, Scottish Gaelic, Galician, Swiss German, Gujarati, Gusii, Manx, Hausa, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Interlingua, Indonesian, Sichuan Yi, Icelandic, Italian, Japanese, Ngomba, Machame, Javanese, Georgian, Kabyle, Kamba, Makonde, Kabuverdianu, Kikuyu, Kako, Kalaallisut, Kalenjin, Khmer, Kannada, Korean, Konkani, Kashmiri, Shambala, Bafia, Colognian, Kurdish, Cornish, Kyrgyz, Langi, Luxembourgish, Ganda, Lakota, Lingala, Lao, Lithuanian, Luba-Katanga, Luo, Luyia, Latvian, Maithili, Masai, Meru, Malagasy, Makhuwa-Meetto, Metaʼ, Maori, Macedonian, Malayalam, Mongolian, Manipuri, Marathi, Malay, Maltese, Mundang, Burmese, Mazanderani, Nama, North Ndebele, Low German, Nepali, Dutch, Kwasio, Norwegian Nynorsk, Nyankole, Oromo, Odia, Ossetic, Punjabi, Polish, Prussian, Pashto, Portuguese, Quechua, Romansh, Rundi, Romanian, Rombo, Russian, Kinyarwanda, Rwa, Samburu, Santali, Sangu, Sindhi, Northern Sami, Sena, Sango, Tachelhit, Sinhala, Slovak, Slovenian, Inari Sami, Shona, Somali, Albanian, Serbian, Swedish, Swahili, Tamil, Telugu, Teso, Tajik, Thai, Tigrinya, Turkish, Tatar, Uyghur, Ukrainian, Urdu, Uzbek, Vai, Volapük, Vunjo, Walser, Wolof, Xhosa, Soga, Yangben, Yiddish, Cantonese, Standard Moroccan Tamazight, Chinese, Traditional Chinese, Zulu.

Somehow specifically Cherokee code points trigger the bug.

On top of that, https://www.punycoder.com/ converts 'ꮳꮃꭹ' into 'xn--f9dt7l' and back. However, it reports 'xn--tz9ata7l' as invalid punycode.

@RomanAkopov RomanAkopov mannequin added 3.7 (EOL) end of life 3.8 only security fixes topic-unicode type-bug An unexpected behavior, bug, or error labels Jun 2, 2020
@SilentGhost
Mannequin

SilentGhost mannequin commented Jun 2, 2020

For the record:

>>> 'ꮳꮃꭹ'.encode('punycode')
b'tz9ata7l'
>>> '\u13e3\u13b3\u13a9'.encode('punycode')
b'f9dt7l'

Also, your unicode-escaped string is an upper-cased version of the first string.
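The case relationship between the two strings can be checked directly. The following is a minimal sketch, assuming a Python build whose Unicode database includes the Cherokee small letters (the outputs elsewhere in this thread suggest that 3.6 and 3.8 do). The names `lower_text` and `upper_text` are illustrative only.

```python
import unicodedata

# The literal pasted in the report (small letters) and the escaped
# string from the report (the upper-cased form of the same text).
lower_text = '\uabb3\uab83\uab79'
upper_text = '\u13e3\u13b3\u13a9'

for label, text in (('lower', lower_text), ('upper', upper_text)):
    for ch in text:
        # Show each code point together with its Unicode character name.
        print(label, 'U+%04X' % ord(ch), unicodedata.name(ch))

# str.lower() maps the second string onto the first.
print(upper_text.lower() == lower_text)

# The raw punycode codec does no case mapping, so the two strings
# encode to different ASCII forms.
print(lower_text.encode('punycode'))
print(upper_text.encode('punycode'))
```

The two punycode results should match the ones quoted above, which is why the report's two "equivalent" reproduction steps are not actually feeding the same code points to the encoder.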

@RomanAkopov
Mannequin Author

RomanAkopov mannequin commented Jun 2, 2020

This is how I extract data from the Common Locale Data Repository v37.
The script assumes common\main as the working directory.

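from os import walk
from os.path import join
from xml.etree import ElementTree

en_root = ElementTree.parse('en.xml')

for (dirpath, dirnames, filenames) in walk('.'):
    for filename in filenames:
        if filename.endswith('.xml'):
            code = filename[:-4]
            # Join with dirpath so files found in subdirectories can be opened too.
            xx_root = ElementTree.parse(join(dirpath, filename))
            xx_lang = xx_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')
            en_lang = en_root.find('localeDisplayNames/languages/language[@type=\'' + code + '\']')

            # find() returns None when a file has no entry for this code, so skip those.
            if en_lang is not None and xx_lang is not None and en_lang.text == 'Cherokee':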
                print(en_lang.text)
                print(xx_lang.text)
                print(xx_lang.text.encode("unicode_escape"))
                print(xx_lang.text.encode('idna'))
                print(ord(xx_lang.text[0]))
                print(ord(xx_lang.text[1]))
                print(ord(xx_lang.text[2]))

script outputs

Cherokee
ᏣᎳᎩ
b'\\u13e3\\u13b3\\u13a9'
b'xn--tz9ata7l'
5091
5043
5033

If I change text to lower case

                print(en_lang.text.lower())
                print(xx_lang.text.lower())
                print(xx_lang.text.lower().encode("unicode_escape"))
                print(xx_lang.text.lower().encode('idna'))
                print(ord(xx_lang.text.lower()[0]))
                print(ord(xx_lang.text.lower()[1]))
                print(ord(xx_lang.text.lower()[2]))

then script outputs

cherokee
ꮳꮃꭹ
b'\\uabb3\\uab83\\uab79'
b'xn--tz9ata7l'
43955
43907
43897

I am not sure where you got the '\u13e3\u13b3\u13a9' string. '\u13e3\u13b3\u13a9'.lower().encode('unicode_escape') gives b'\\uabb3\\uab83\\uab79'.

@SilentGhost
Mannequin

SilentGhost mannequin commented Jun 2, 2020

I took it from your msg370615:

or, even more reliably,
Execute '\u13e3\u13b3\u13a9'.encode('idna')

@tiran
Member

tiran commented Jun 2, 2020

There are two IDNA standards. Python's standard library only provides IDNA 2003 and does not support IDNA 2008.

# IDNA 2003
>>> '\u13e3\u13b3\u13a9'.encode('idna')
b'xn--tz9ata7l'
# idna package with IDNA 2008
>>> idna.encode('\u13e3\u13b3\u13a9')
b'xn--f9dt7l'

The bug report is a duplicate of bpo-17305.
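The difference can be reproduced side by side. This is a sketch only: it assumes the third-party `idna` package (the one used in the comment above) may or may not be installed, and it skips that comparison when the package is missing. The variable names are hypothetical.

```python
upper_text = '\u13e3\u13b3\u13a9'
lower_text = upper_text.lower()

# Standard-library codec (IDNA 2003). Per the outputs in this thread,
# both case forms end up as the same ASCII-compatible encoding.
stdlib_upper = upper_text.encode('idna')
stdlib_lower = lower_text.encode('idna')
print(stdlib_upper, stdlib_lower)

# Optional comparison with the third-party "idna" package (IDNA 2008).
# It is not part of the standard library, so treat it as optional here.
try:
    import idna
except ImportError:
    idna = None

if idna is not None:
    print(idna.encode(upper_text))
```

Code that must match what browsers send may need the IDNA 2008 result rather than the standard-library one, as the zone-file mismatch in the original report illustrates.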

@tiran tiran closed this as completed Jun 2, 2020
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022