Message 370615 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Roman Akopov
Recipients	Roman Akopov, ezio.melotti, vstinner
Date	2020-06-02.17:52:59
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1591120380.37.0.839523789408.issue40845@roundup.psfhosted.org>
In-reply-to

Content
For a specific Cherokee string of three symbols b'\\u13e3\\u13b3\\u13a9' generating punycode representation fails. What steps will reproduce the problem? Execute 'ꮳꮃꭹ'.encode('idna') of even more reliable Execute '\u13e3\u13b3\u13a9'.encode('idna') What is the expected result? 'xn--f9dt7l' What happens instead? 'xn--tz9ata7l' Version affected. Tested on Python 3.8.3 Windows and Python 3.6.8 CentOS. Other information. I was testing if our product supports internationalized domain names. So I had written a Python script which generated DNS zone file with punycode encoded names and JavaScript file for a browser to send requests to URLs containing internationalized domain names. Strings were taken from Common Locale Data Repository. 193 various URL, one per language. When executed in Google Chrome, Mozilla Firefox and Microsoft EDGE, domain name 'ꮳꮃꭹ.myhost.local' is converted to 'xn--f9dt7l.myhost.local', but we have 'xn--tz9ata7l.myhost.local' in DNS zone file and this is how I had found the bug. For 192 other languages I have tested everything works just fine. hese are Afrikaans, Aghem, Akan, Amharic, Arabic, Assamese, Asu, Asturian, Azerbaijani, Basaa, Belarusian, Bemba, Bena, Bulgarian, Bambara, Bangla, Tibetan, Breton, Bodo, Bosnian, Catalan, Chakma, Chechen, Cebuano, Chiga, Czech, Church Slavic, Welsh, Danish, Taita, German, Zarma, Lower Sorbian, Duala, Jola-Fonyi, Dzongkha, Embu, Ewe, Greek, English, Esperanto, Spanish, Estonian, Basque, Ewondo, Persian, Fulah, Finnish, Filipino, Faroese, French, Friulian, Western Frisian, Irish, Scottish Gaelic, Galician, Swiss German, Gujarati, Gusii, Manx, Hausa, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Interlingua, Indonesian, Sichuan Yi, Icelandic, Italian, Japanese, Ngomba, Machame, Javanese, Georgian, Kabyle, Kamba, Makonde, Kabuverdianu, Kikuyu, Kako, Kalaallisut, Kalenjin, Khmer, Kannada, Korean, Konkani, Kashmiri, Shambala, Bafia, Colognian, Kurdish, Cornish, Kyrgyz, Langi, Luxembourgish, Ganda, Lakota, Lingala, Lao, Lithuanian, Luba-Katanga, Luo, Luyia, Latvian, Maithili, Masai, Meru, Malagasy, Makhuwa-Meetto, Metaʼ, Maori, Macedonian, Malayalam, Mongolian, Manipuri, Marathi, Malay, Maltese, Mundang, Burmese, Mazanderani, Nama, North Ndebele, Low German, Nepali, Dutch, Kwasio, Norwegian Nynorsk, Nyankole, Oromo, Odia, Ossetic, Punjabi, Polish, Prussian, Pashto, Portuguese, Quechua, Romansh, Rundi, Romanian, Rombo, Russian, Kinyarwanda, Rwa, Samburu, Santali, Sangu, Sindhi, Northern Sami, Sena, Sango, Tachelhit, Sinhala, Slovak, Slovenian, Inari Sami, Shona, Somali, Albanian, Serbian, Swedish, Swahili, Tamil, Telugu, Teso, Tajik, Thai, Tigrinya, Turkish, Tatar, Uyghur, Ukrainian, Urdu, Uzbek, Vai, Volapük, Vunjo, Walser, Wolof, Xhosa, Soga, Yangben, Yiddish, Cantonese, Standard Moroccan Tamazight, Chinese, Traditional Chinese, Zulu. Somehow specifically Cherokee code points trigger the bug. On top of that, https://www.punycoder.com/ converts 'ꮳꮃꭹ' into 'xn--f9dt7l' and back. However 'xn--tz9ata7l' is reported as an invalid punycode.

For a specific Cherokee string of three symbols b'\\u13e3\\u13b3\\u13a9' generating punycode representation fails.

What steps will reproduce the problem?

Execute 'ꮳꮃꭹ'.encode('idna')
of even more reliable
Execute '\u13e3\u13b3\u13a9'.encode('idna')

What is the expected result?

'xn--f9dt7l'

What happens instead?

'xn--tz9ata7l'

Version affected.

Tested on Python 3.8.3 Windows and Python 3.6.8 CentOS.

Other information.

I was testing if our product supports internationalized domain names. So I had written a Python script which generated DNS zone file with punycode encoded names and JavaScript file for a browser to send requests to URLs containing internationalized domain names. Strings were taken from Common Locale Data Repository. 193 various URL, one per language.
 
When executed in Google Chrome, Mozilla Firefox and Microsoft EDGE, domain name 'ꮳꮃꭹ.myhost.local' is converted to 'xn--f9dt7l.myhost.local', but we have 'xn--tz9ata7l.myhost.local' in DNS zone file and this is how I had found the bug. For 192 other languages I have tested everything works just fine. hese are Afrikaans, Aghem, Akan, Amharic, Arabic, Assamese, Asu, Asturian, Azerbaijani, Basaa, Belarusian, Bemba, Bena, Bulgarian, Bambara, Bangla, Tibetan, Breton, Bodo, Bosnian, Catalan, Chakma, Chechen, Cebuano, Chiga, Czech, Church Slavic, Welsh, Danish, Taita, German, Zarma, Lower Sorbian, Duala, Jola-Fonyi, Dzongkha, Embu, Ewe, Greek, English, Esperanto, Spanish, Estonian, Basque, Ewondo, Persian, Fulah, Finnish, Filipino, Faroese, French, Friulian, Western Frisian, Irish, Scottish Gaelic, Galician, Swiss German, Gujarati, Gusii, Manx, Hausa, Hebrew, Hindi, Croatian, Upper Sorbian, Hungarian, Armenian, Interlingua, Indonesian, Sichuan Yi, Icelandic, Italian, Japanese, Ngomba, Machame, Javanese, Georgian, Kabyle, Kamba, Makonde, Kabuverdianu, Kikuyu, Kako, Kalaallisut, Kalenjin, Khmer, Kannada, Korean, Konkani, Kashmiri, Shambala, Bafia, Colognian, Kurdish, Cornish, Kyrgyz, Langi, Luxembourgish, Ganda, Lakota, Lingala, Lao, Lithuanian, Luba-Katanga, Luo, Luyia, Latvian, Maithili, Masai, Meru, Malagasy, Makhuwa-Meetto, Metaʼ, Maori, Macedonian, Malayalam, Mongolian, Manipuri, Marathi, Malay, Maltese, Mundang, Burmese, Mazanderani, Nama, North Ndebele, Low German, Nepali, Dutch, Kwasio, Norwegian Nynorsk, Nyankole, Oromo, Odia, Ossetic, Punjabi, Polish, Prussian, Pashto, Portuguese, Quechua, Romansh, Rundi, Romanian, Rombo, Russian, Kinyarwanda, Rwa, Samburu, Santali, Sangu, Sindhi, Northern Sami, Sena, Sango, Tachelhit, Sinhala, Slovak, Slovenian, Inari Sami, Shona, Somali, Albanian, Serbian, Swedish, Swahili, Tamil, Telugu, Teso, Tajik, Thai, Tigrinya, Turkish, Tatar, Uyghur, Ukrainian, Urdu, Uzbek, Vai, Volapük, Vunjo, Walser, Wolof, Xhosa, Soga, Yangben, Yiddish, Cantonese, Standard Moroccan Tamazight, Chinese, Traditional Chinese, Zulu.

Somehow specifically Cherokee code points trigger the bug.

On top of that, https://www.punycoder.com/ converts 'ꮳꮃꭹ' into 'xn--f9dt7l' and back. However 'xn--tz9ata7l' is reported as an invalid punycode.

History
Date	User	Action	Args
2020-06-02 17:53:00	Roman Akopov	set	recipients: + Roman Akopov, vstinner, ezio.melotti
2020-06-02 17:53:00	Roman Akopov	set	messageid: <1591120380.37.0.839523789408.issue40845@roundup.psfhosted.org>
2020-06-02 17:53:00	Roman Akopov	link	issue40845 messages
2020-06-02 17:52:59	Roman Akopov	create