
classification
Title: Tools/i18n/pygettext.py doesn't parse unicode string.
Type: behavior
Stage: resolved
Components: Demos and Tools, Unicode
Versions: Python 3.2, Python 3.3, Python 3.4, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: serhiy.storchaka
Nosy List: ezio.melotti, flipmcf, loewis, python-dev, serhiy.storchaka, umedoblock
Priority: normal
Keywords: patch

Created on 2013-02-08 01:31 by umedoblock, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
konnichiha.py umedoblock, 2013-02-08 01:31 konnichiha.py
pygettext.py.patch umedoblock, 2013-02-08 01:37 pygettext.py.patch
pygettext_unicode.patch serhiy.storchaka, 2013-02-08 13:29 review
pygettext_unicode-2.7.patch serhiy.storchaka, 2013-02-08 13:44 review
konnichiha.tar.gz umedoblock, 2013-02-09 00:19 konnichiha.tar.gz
konnichiha.2.tar.gz umedoblock, 2013-02-09 15:06 konnichiha.2.tar.gz
Messages (10)
msg181651 - (view) Author: umedoblock (umedoblock) Date: 2013-02-08 01:31
I'd like to extract _('こんにちは') with pygettext.py.
However, pygettext.py cannot handle _('こんにちは') and fails with an IndexError.
I have attached pygettext.py.patch to fix the bug.
Here is the command history.

$ pygettext.py -o - --verbose konnichiha.py
...
#: konnichiha.py:6
msgid "konnichiha"
msgstr ""

#: konnichiha.py:7
Traceback (most recent call last):
  File "/home/umetaro/local/bin/pygettext.py", line 664, in <module>
    main()
  File "/home/umetaro/local/bin/pygettext.py", line 657, in main
    eater.write(fp)
  File "/home/umetaro/local/bin/pygettext.py", line 497, in write
    print('msgid', normalize(k), file=fp)
  File "/home/umetaro/local/bin/pygettext.py", line 250, in normalize
    s = '"' + escape(s) + '"'
  File "/home/umetaro/local/bin/pygettext.py", line 236, in escape
    s[i] = escapes[ord(s[i])]
IndexError: list index out of range

With pygettext.py.patch applied, the same command works:

$ pygettext.py -o - --verbose konnichiha.py
...
#: konnichiha.py:6
msgid "konnichiha"
msgstr ""

#: konnichiha.py:7
msgid "こんにちは"
msgstr ""
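
The last frame of the traceback above, `escapes[ord(s[i])]`, suggests a lookup table indexed by code point. The following is a minimal sketch of that failure mode, assuming a 256-entry table; the names `build_escape_table`, `escape_with_table` and `try_escape` are illustrative and are not taken from pygettext.py.

```python
def build_escape_table():
    """Build a 256-entry table mapping code points 0..255 to escaped text."""
    table = []
    for i in range(256):
        if 32 <= i <= 126:
            table.append(chr(i))          # printable ASCII passes through
        else:
            table.append("\\%03o" % i)    # everything else becomes an octal escape
    # Characters that need their own escape inside a quoted PO string.
    table[ord("\\")] = "\\\\"
    table[ord('"')] = '\\"'
    table[ord("\n")] = "\\n"
    table[ord("\t")] = "\\t"
    table[ord("\r")] = "\\r"
    return table


TABLE = build_escape_table()


def escape_with_table(s):
    """Escape s one character at a time via the table (assumed design)."""
    return "".join(TABLE[ord(ch)] for ch in s)


def try_escape(s):
    """Return the escaped string, or a short error description."""
    try:
        return escape_with_table(s)
    except IndexError as exc:
        return "IndexError: %s" % exc


print(try_escape("konnichiha"))
print(try_escape("こんにちは"))
```

Under this assumption, any character whose code point is 256 or above indexes past the end of the table, which matches the reported error for the Japanese string while the ASCII string passes through unchanged.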
msg181652 - (view) Author: umedoblock (umedoblock) Date: 2013-02-08 01:37
Very sorry.

The file

pygettext.py.patch 	umedoblock, 2013-02-08 10:32

is the wrong patch.

Please disregard it.
msg181668 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-08 13:29
Here is a patch for 3.x, which correctly detects the input file encoding and escapes non-ASCII characters in the output file if, and only if, -E is specified.

For 2.7 we should just negate an argument for make_escapes.
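
The behavior described above (non-ASCII text is escaped only when -E is given) could look roughly like the sketch below. It is an illustration under stated assumptions, not the attached patch: the function name `escape`, its `escape_nonascii` flag (standing in for -E) and the `encoding` parameter are all hypothetical.

```python
# Escapes for ASCII characters that cannot appear literally in a PO string.
SPECIAL = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\t": "\\t",
    "\r": "\\r",
}


def escape(s, escape_nonascii=False, encoding="utf-8"):
    """Escape s for a msgid line.

    escape_nonascii plays the role of the -E option (an assumption):
    when false, non-ASCII characters are written through unchanged;
    when true, each is encoded and written as octal byte escapes.
    """
    out = []
    for ch in s:
        code = ord(ch)
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
        elif 32 <= code <= 126:
            out.append(ch)                       # printable ASCII
        elif code < 128:
            out.append("\\%03o" % code)          # ASCII control character
        elif escape_nonascii:
            out.extend("\\%03o" % b for b in ch.encode(encoding))
        else:
            out.append(ch)                       # keep the character as-is
    return "".join(out)


print(escape("hello"))
print(escape("こ", escape_nonascii=True))
```

With the flag off, `escape("こんにちは")` returns the string unchanged, which corresponds to the `msgid "こんにちは"` output shown earlier in the thread.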
msg181669 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-08 13:44
Here is a patch for 2.7. There, pygettext doesn't try to detect the input encoding and works transparently with bytes, but it no longer escapes non-ASCII bytes unless -E is specified.
msg181708 - (view) Author: umedoblock (umedoblock) Date: 2013-02-09 00:19
Thanks, serhiy.storchaka.
I tried Shift_JIS, UTF-8, ISO-2022-JP and EUC-JP.

Your patch detects UTF-8.

However, it doesn't detect Shift_JIS, ISO-2022-JP or EUC-JP.
It mistakes the ISO-2022-JP charset for UTF-8.
It raises UnicodeDecodeError for Shift_JIS and EUC-JP.

Please check my test by running konnichiha.sh.
msg181709 - (view) Author: umedoblock (umedoblock) Date: 2013-02-09 00:21
I used only pygettext_unicode.patch.
I did not use pygettext_unicode-2.7.patch.
msg181731 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-02-09 10:35
The default source encoding in Python 3 is UTF-8. You should declare your encoding at the top of the file if it differs from UTF-8 or ASCII (e.g. "# -*- coding: euc-jp -*-"). Otherwise Python will reject your file (for Shift_JIS and EUC-JP) or produce an incorrect result (for ISO-2022-JP).

$ python3 konnichiha.Shift_JIS.py
  File "konnichiha.Shift_JIS.py", line 5
SyntaxError: Non-UTF-8 code starting with '\x82' in file konnichiha.Shift_JIS.py on line 5, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
$ python3 konnichiha.ISO-2022-JP.py
konnichiha
B$3$s$K$A$O
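
The difference between the three encodings can be reproduced without any files. The sketch below encodes the same text with each codec and then decodes the bytes as UTF-8, which is what happens when no coding declaration is present; the helper name `decode_as_utf8` is illustrative.

```python
TEXT = "こんにちは"


def decode_as_utf8(raw):
    """Decode raw bytes as UTF-8, returning None if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


for codec in ("shift_jis", "euc_jp", "iso2022_jp", "utf-8"):
    decoded = decode_as_utf8(TEXT.encode(codec))
    if decoded is None:
        status = "UnicodeDecodeError"
    elif decoded == TEXT:
        status = "round-trips correctly"
    else:
        # ascii() keeps the output printable on any terminal.
        status = "decodes silently to the wrong text: " + ascii(decoded)
    print(codec, "->", status)
```

Shift_JIS and EUC-JP use bytes with the high bit set that do not form valid UTF-8, so decoding fails loudly. ISO-2022-JP uses only 7-bit bytes plus escape sequences, so the bytes are valid UTF-8 and the error goes unnoticed.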
msg181740 - (view) Author: umedoblock (umedoblock) Date: 2013-02-09 15:06
python3 now outputs the translated Japanese text when using gettext.install().

EVERYTHING IS OK!

Please check it using konnichiha.2.tar.gz.
==============================================
Please run the shell command below.

$ for f in `find . -name 'konnichiha.*.py'` ; do echo f=$f ; python3 $f ; echo -- ; done
f=./konnichiha.Shift_JIS.py
HELLO ハローで、今日は
日本語をUTF8にしてコンニチハ
--
f=./konnichiha.UTF-8.py
HELLO ハローで、今日は
日本語をUTF8にしてコンニチハ
--
f=./konnichiha.ISO-2022-JP.py
HELLO ハローで、今日は
日本語をUTF8にしてコンニチハ
--
f=./konnichiha.EUC-JP.py
HELLO ハローで、今日は
日本語をUTF8にしてコンニチハ
--

==============================================
The encodings of the konnichiha scripts are OK!

$ nkf -g ./konnichiha.*.py                   
./konnichiha.EUC-JP.py: EUC-JP
./konnichiha.ISO-2022-JP.py: ISO-2022-JP
./konnichiha.Shift_JIS.py: Shift_JIS
./konnichiha.UTF-8.py: UTF-8

==============================================
The coding: declarations are also OK!
$ head -2 konnichiha.*.py             
==> konnichiha.EUC-JP.py <==
# coding: euc-jp
import gettext

==> konnichiha.ISO-2022-JP.py <==
# coding: iso-2022-jp
import gettext

==> konnichiha.Shift_JIS.py <==
# coding: shift-jis
import gettext

==> konnichiha.UTF-8.py <==
# coding: utf-8
import gettext

==============================================
Thank you, serhiy.storchaka!
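
For readers who want to check how a `# coding:` line is picked up, the standard library exposes `tokenize.detect_encoding`. The sketch below uses it on in-memory sources; whether the committed pygettext.py change uses this exact call is not stated in this thread, and the helper names `source_encoding`, `read_source` and `make_source` are illustrative.

```python
import io
import tokenize


def source_encoding(raw):
    """Return the encoding declared in raw source bytes (default: utf-8)."""
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return encoding


def read_source(raw):
    """Decode raw source bytes using the declared encoding."""
    return raw.decode(source_encoding(raw))


def make_source(declared, codec):
    """Build a tiny script with a coding line, encoded with the given codec."""
    text = "# coding: %s\nimport gettext\nprint(_('こんにちは'))\n" % declared
    return text.encode(codec)


for declared, codec in (
    ("euc-jp", "euc_jp"),
    ("iso-2022-jp", "iso2022_jp"),
    ("shift-jis", "shift_jis"),
    ("utf-8", "utf-8"),
):
    raw = make_source(declared, codec)
    ok = "こんにちは" in read_source(raw)
    print(declared, "->", source_encoding(raw), "decoded correctly:", ok)
```

Because the declaration line itself is plain ASCII in all four encodings, it can be read before the rest of the file is decoded.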
msg181758 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2013-02-09 20:41
New changeset 49b1fde510a6 by Serhiy Storchaka in branch '2.7':
Issue #17156: pygettext.py now correctly escapes non-ascii characters.
http://hg.python.org/cpython/rev/49b1fde510a6

New changeset cd59b398907d by Serhiy Storchaka in branch '3.2':
Issue #17156: pygettext.py now uses an encoding of source file and correctly
http://hg.python.org/cpython/rev/cd59b398907d

New changeset 062406c06cc1 by Serhiy Storchaka in branch '3.3':
Issue #17156: pygettext.py now uses an encoding of source file and correctly
http://hg.python.org/cpython/rev/062406c06cc1

New changeset 99795d711a40 by Serhiy Storchaka in branch 'default':
Issue #17156: pygettext.py now uses an encoding of source file and correctly
http://hg.python.org/cpython/rev/99795d711a40
msg241126 - (view) Author: Michael McFadden (flipmcf) * Date: 2015-04-15 16:27
Also fixes 19907?
History
Date User Action Args
2022-04-11 14:57:41 admin set github: 61358
2015-04-15 16:27:16 flipmcf set nosy: + flipmcf; messages: + msg241126
2013-02-09 20:42:16 serhiy.storchaka set status: open -> closed; resolution: fixed; stage: patch review -> resolved
2013-02-09 20:41:23 python-dev set nosy: + python-dev; messages: + msg181758
2013-02-09 15:06:52 umedoblock set files: + konnichiha.2.tar.gz; messages: + msg181740
2013-02-09 10:35:43 serhiy.storchaka set messages: + msg181731
2013-02-09 00:21:10 umedoblock set messages: + msg181709
2013-02-09 00:19:27 umedoblock set files: + konnichiha.tar.gz; messages: + msg181708
2013-02-08 13:44:55 serhiy.storchaka set files: + pygettext_unicode-2.7.patch; messages: + msg181669; versions: + Python 2.7
2013-02-08 13:29:47 serhiy.storchaka set files: + pygettext_unicode.patch; components: + Unicode; versions: + Python 3.3, Python 3.4; nosy: + loewis, ezio.melotti; messages: + msg181668; stage: patch review
2013-02-08 10:45:52 serhiy.storchaka set assignee: serhiy.storchaka; nosy: + serhiy.storchaka
2013-02-08 08:56:04 serhiy.storchaka set files: - pygettext.py.patch
2013-02-08 01:37:15 umedoblock set files: + pygettext.py.patch; messages: + msg181652
2013-02-08 01:32:34 umedoblock set files: + pygettext.py.patch; keywords: + patch
2013-02-08 01:31:18 umedoblock create