gettext: GNUTranslations doesn't parse properly comments in description #80420

Closed
vstinner opened this issue Mar 8, 2019 · 12 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir

Comments

@vstinner
Member

vstinner commented Mar 8, 2019

BPO 36239
Nosy @vstinner, @serhiy-storchaka, @JulienPalard
PRs
  • bpo-36239: Skip comments in gettext infos #12255
  • [3.7] bpo-36239: Skip comments in gettext infos (GH-12255) #13218
Files
  • parse.py
  • comments.po
  • messages.mo
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2019-05-09.19:31:19.825>
    created_at = <Date 2019-03-08.13:58:19.463>
    labels = ['3.8', '3.7', 'library']
    title = "gettext: GNUTranslations doesn't parse properly comments in description"
    updated_at = <Date 2019-05-09.22:24:38.531>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2019-05-09.22:24:38.531>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-05-09.19:31:19.825>
    closer = 'mdk'
    components = ['Library (Lib)']
    creation = <Date 2019-03-08.13:58:19.463>
    creator = 'vstinner'
    dependencies = []
    files = ['48195', '48196', '48197']
    hgrepos = []
    issue_num = 36239
    keywords = ['patch']
    message_count = 12.0
    messages = ['337476', '337477', '337486', '337490', '337491', '337492', '337493', '337494', '337495', '337497', '341981', '342002']
    nosy_count = 3.0
    nosy_names = ['vstinner', 'serhiy.storchaka', 'mdk']
    pr_nums = ['12255', '13218']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue36239'
    versions = ['Python 2.7', 'Python 3.7', 'Python 3.8']

    @vstinner
    Member Author

    vstinner commented Mar 8, 2019

    When a translation .po file contains a comment in headers, it's kept when compiled as .mo by msgfmt.

    Example with test.po:
    ---
    msgid ""
    msgstr ""
    "Content-Type: text/plain; charset=UTF-8\n"
    "Plural-Forms: nplurals=2; plural=(n != 1);\n"
    "#-#-#-#-# plo.po (PACKAGE VERSION) #-#-#-#-#\n"
    ---

    Compile it with "msgfmt". Parse the output file messages.mo using test.py script:
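    ---
    import gettext, pprint

    # Parse the compiled catalog and dump the header fields it exposes.
    with open("messages.mo", "rb") as fp:
        t = gettext.GNUTranslations()
        t._parse(fp)  # private API, called directly to inspect the parsed header
        pprint.pprint(t._info)
    ---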

    Output on Python 3.7.2:
    ---
    {'content-type': 'text/plain; charset=UTF-8',
    'plural-forms': 'nplurals=2; plural=(n != 1);\n'
    '#-#-#-#-# plo.po (PACKAGE VERSION) #-#-#-#-#'}
    ---

    Output of Fedora Python 2.7.15 which contains a fix:
    ---
    {'content-type': 'text/plain; charset=UTF-8',
    'plural-forms': 'nplurals=2; plural=(n != 1);'}
    ---

    I'm not sure that keeping the comment as part of the plural forms is correct. Shouldn't comments be ignored?

    I made my test on Fedora 29: msgfmt 0.19.8.1, Python 3.7.2.

    Links:

    Fedora has carried a patch to ignore comments since 2007:
    https://src.fedoraproject.org/rpms/python2/blob/master/f/python-2.5.1-plural-fix.patch

    I can easily convert the patch to a PR, maybe with a test. The real question is whether the fix is correct.
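
For anyone who wants to reproduce this without msgfmt, here is a minimal sketch that builds an equivalent catalog in memory and reads it back through the public `GNUTranslations(fp).info()` API. The `build_mo` helper is hypothetical (it is not part of gettext or of the attached files) and assumes the usual little-endian .mo layout with a single header entry. What `plural-forms` contains depends on whether the Python version running it ignores the comment line.

```python
import gettext
import io
import struct


def build_mo(header: bytes) -> bytes:
    """Hypothetical helper: build a .mo file holding only the header entry."""
    # File header: magic, revision, number of strings, offset of the msgid
    # table, offset of the msgstr table, hash table size, hash table offset.
    file_header = struct.pack("<7I", 0x950412DE, 0, 1, 28, 36, 0, 0)
    msgid_off = 28 + 8 + 8       # string data starts right after both tables
    msgstr_off = msgid_off + 1   # skip the NUL that terminates the empty msgid
    # One (length, offset) pair per table: empty msgid, then the header text.
    tables = struct.pack("<4I", 0, msgid_off, len(header), msgstr_off)
    return file_header + tables + b"\x00" + header + b"\x00"


HEADER = (
    b"Content-Type: text/plain; charset=UTF-8\n"
    b"Plural-Forms: nplurals=2; plural=(n != 1);\n"
    b"#-#-#-#-# plo.po (PACKAGE VERSION) #-#-#-#-#\n"
)

translations = gettext.GNUTranslations(io.BytesIO(build_mo(HEADER)))
info = translations.info()
print(info)
```

On an affected version, the marker line should show up appended to the `plural-forms` value, as in the 3.7.2 output above.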

    @vstinner vstinner added 3.7 (EOL) end of life 3.8 only security fixes stdlib Python modules in the Lib dir labels Mar 8, 2019
    @vstinner
    Member Author

    vstinner commented Mar 8, 2019

    Attached files:

    • comments.po: PO file with a comment in headers
    • messages.mo: comments.po compiled with msgfmt
    • parse.py: Python script to parse messages.mo

    @JulienPalard
    Member

    After some research, I found a few references to comments being marked as starting with #-#-#-#-# and ending with #-#-#-#-#, not just starting with #.

    In gettext-0.19.8.1 sources for example:

    $ grep -r '#-#-#-#-' | head
    gettext-tools/misc/po-mode.el:#-#-#-#-#  file name reference  #-#-#-#-#
    gettext-tools/misc/po-mode.el:  (let* ((marker-regex "^#-#-#-#-#  \\(.*\\)  #-#-#-#-#\n")
    gettext-tools/src/msgl-cat.c:                  char *id = xasprintf ("#-#-#-#-#  %s  #-#-#-#-#",

    Or, more precisely, in gettext-tools/tests/msgcat-10:

    # Verify msgcat of two files, when the header entries have different comments
    # but the same contents. The resulting header entry is not marked fuzzy,
    # because the #-#-#-#-# are only in comments and do not necessarily require
    # translator attention; in other words, an msgstr which is valid in both input
    # files is also valid in the result.

    I'm however surprised not to find much "#-#-#-#-#" in the source code, as if they were just looking for a single #, like you do here.

    Not sure which one is better: eliminating lines with a pair of #-#-#-#-#, or lines starting with a #. Both look OK to me (we're only talking about the header here, not the msgstr, so it won't have much impact).

    Personally I'd go for eliminating #-#-#-#-# lines, as this is the only case we've seen, and it is the "documented" one in the GNU gettext test cases.
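
To make the two candidate rules concrete, here is a small sketch comparing them on a few header lines. The function names `is_marker` and `starts_with_hash` are made up for illustration, and the sample lines are taken from this issue plus one invented plain comment.

```python
MARKER = "#-#-#-#-#"


def is_marker(line: str) -> bool:
    """Strict rule: the line both starts and ends with #-#-#-#-#."""
    item = line.strip()
    return item.startswith(MARKER) and item.endswith(MARKER)


def starts_with_hash(line: str) -> bool:
    """Loose rule: any line whose first non-blank character is '#'."""
    return line.lstrip().startswith("#")


samples = [
    "#-#-#-#-# plo.po (PACKAGE VERSION) #-#-#-#-#",
    "# a plain comment",  # invented example, not from a real file
    "Plural-Forms: nplurals=2; plural=(n != 1);",
]
for line in samples:
    print(is_marker(line), starts_with_hash(line), line)
```

The strict rule only drops the msgcat-style marker, while the loose rule would also drop any other header line that happens to begin with #.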

    @vstinner
    Member Author

    vstinner commented Mar 8, 2019

    I found a .po file with "#" in headers on the Internet, from the Sympa mailing list project:
    https://www.sympa.org/distribution/sympa-6.0.10/po-wwsympa/et.po:

    # #-#-#-#-# blank_web_help_et.po (sympa) #-#-#-#-#
    # Sympa online help internationalisation.
    # Copyright (C) 2007
    # This file is distributed under the same license as Sympa.
    # FIRST AUTHOR <david.verdin@cru.fr>, 2007.

    # #-#-#-#-# tmp_web_help_et.po (et) #-#-#-#-#
    # translation of et.po to
    # translation of et.po to
    # #-#-#-#-# et.po (PACKAGE VERSION) #-#-#-#-#
    # Copyright (C) 2005 Free Software Foundation, Inc.
    # #-#-#-#-# et.po (PACKAGE VERSION) #-#-#-#-#
    # #-#-#-#-# et.po (PACKAGE VERSION) #-#-#-#-#
    # This file is distributed under the same license as the PACKAGE package.
    # FIRST AUTHOR <EMAIL>, YEAR.
    # Copyright (C) YEAR Free Software Foundation, Inc.
    # FIRST AUTHOR <EMAIL>, YEAR.#.
    # Copyright (C) YEAR THE PACKAGE'S COPYRIGHT HOLDER.
    # root <root@vykk.vil.ee>, 2005.

    #, fuzzy
    msgid ""
    msgstr ""
    "Project-Id-Version: et\n"
    "POT-Creation-Date: 2007-11-13 14:50+0200\n"
    "PO-Revision-Date: 2007-10-22 00:03+0200\n"
    "Last-Translator: Alar Sing <alar.sing@etv.ee>\n"
    "Language-Team: Estonian\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "#-#-#-#-# blank_web_help_et.po (sympa) #-#-#-#-#\n"
    "Plural-Forms: nplurals=2; plural=(n != 1);\n"
    "#-#-#-#-# tmp_web_help_et.po (et) #-#-#-#-#\n"
    "X-Generator: Pootle 1.0.2\n"

    There are 2 header lines starting with >"#-#-#-#-# < and ending with > #-#-#-#-#\n"<.

    @vstinner
    Member Author

    vstinner commented Mar 8, 2019

    I hacked gettext.py to parse all the .mo files on my system. I found 3 .mo files which contain "#" in headers:

    /usr/share/locale/fa/LC_MESSAGES/digikam.mo:

    {'content-transfer-encoding': '8bit\n'
    '#-#-#-#-# digikamimageplugin_channelmixer.po '
    '(digikamimageplugin_channelmixer) #-#-#-#-#',
    'content-type': 'text/plain; charset=UTF-8',
    'language': 'fa',
    'language-team': 'Farsi (Persian) <>',
    'last-translator': 'Mohammad Reza Mirdamadi <mohi@ubuntu.ir>',
    'mime-version': '1.0',
    'plural-forms': 'nplurals=1; plural=0;',
    'po-revision-date': '2012-01-13 15:00+0330',
    'pot-creation-date': '2018-03-18 03:11+0100',
    'project-id-version': 'digikam',
    'report-msgid-bugs-to': 'http://bugs.kde.org',
    'x-generator': 'KBabel 1.11.4'}

    /usr/share/locale/ia/LC_MESSAGES/akonadicontact5-serializer.mo:

    {'content-transfer-encoding': '8bit\n'
    '#-#-#-#-# akonadi_kalarm_resource.po '
    '#-#-#-#-#',
    'content-type': 'text/plain; charset=UTF-8',
    'language': 'ia',
    'language-team': 'Interlingua <kde-i18n-it@kde.org>',
    'last-translator': 'g.sora <g.sora@tiscali.it>',
    'mime-version': '1.0',
    'plural-forms': 'nplurals=2; plural=n != 1;',
    'po-revision-date': '2011-11-29 19:38+0100',
    'pot-creation-date': '2018-11-12 06:56+0100',
    'project-id-version': '',
    'report-msgid-bugs-to': 'http://bugs.kde.org',
    'x-generator': 'Lokalize 1.2'}

    /usr/share/locale/ml/LC_MESSAGES/ktraderclient5.mo:

    {'content-transfer-encoding': '8bit',
    'content-type': 'text/plain; charset=UTF-8',
    'language': 'ml',
    'language-team': 'Swathanthra|സ്വതന്ത്ര Malayalam|മലയാളം '
    'Computing|കമ്പ്യൂട്ടിങ്ങ് <smc-discuss@googlegroups.com>',
    'last-translator': '# ANI PETER|അനി പീറ്റര്\u200d <peter.ani@gmail.com>',
    'mime-version': '1.0',
    'plural-forms': 'nplurals=2; plural=(n != 1);',
    'po-revision-date': '2008-07-10 22:04+0530',
    'pot-creation-date': '2018-09-14 06:47+0200',
    'project-id-version': 'ktraderclient',
    'report-msgid-bugs-to': 'http://bugs.kde.org',
    'x-generator': 'KBabel 1.11.4'}
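
A similar survey can be sketched without patching gettext.py, using only the public API. This is an approximation of the scan described above, not the hack that was actually used; `headers_with_hash` is a hypothetical name, and catalogs that fail to parse are simply skipped.

```python
import gettext
from pathlib import Path


def headers_with_hash(root):
    """Map each .mo path under root to the header fields containing '#'."""
    found = {}
    for path in sorted(Path(root).rglob("*.mo")):
        try:
            with path.open("rb") as fp:
                info = gettext.GNUTranslations(fp).info()
        except Exception:
            # Corrupt or unusual catalogs are ignored in this sketch.
            continue
        hits = {key: value for key, value in info.items() if "#" in value}
        if hits:
            found[str(path)] = hits
    return found
```

Calling `headers_with_hash("/usr/share/locale")` and pretty-printing the result should give a report along the lines of the listings above. On a Python version that already skips the marker lines, only values such as the `last-translator` one would still be reported.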

    @JulienPalard
    Member

    The

    'last-translator': '# ANI PETER|അനി പീറ്റര്\u200d <peter.ani@gmail.com>',

    case does not look like an issue: it does *not* start with #. The # is in the middle of the line, and the line starts with "Last-Translator".

    @vstinner
    Member Author

    vstinner commented Mar 8, 2019

    /usr/share/locale/fa/LC_MESSAGES/digikam.mo:

    I downloaded the .po file using:

    svn cat svn://anonsvn.kde.org/home/kde/trunk/l10n-kf5/fa/messages/extragear-graphics/digikam.po > fa_digikam.po

    It contains many comments in headers. Extract:

    (...)
    # MaryamSadat Razavi <razavi@itland.ir>, 2007.
    # Nasim Daniarzadeh <daniarzadeh@itland.ir>, 2007.
    # Nazanin Kazemi <kazemi@itland.ir>, 2007.
    # Mohammad Reza Mirdamadi <mohi@ubuntu.ir>, 2011, 2012.
    msgid ""
    msgstr ""
    "Project-Id-Version: digikam\n"
    "Report-Msgid-Bugs-To: http://bugs.kde.org\\n"
    "POT-Creation-Date: 2019-03-08 03:08+0100\n"
    "PO-Revision-Date: 2012-01-13 15:00+0330\n"
    "Last-Translator: Mohammad Reza Mirdamadi <mohi@ubuntu.ir>\n"
    "Language-Team: Farsi (Persian) <>\n"
    "Language: fa\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "#-#-#-#-# digikamimageplugin_channelmixer.po "
    "(digikamimageplugin_channelmixer) #-#-#-#-#\n"
    "X-Generator: Lokalize 1.2\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_refocus.po (digikamimageplugin_refocus) #-#-#-"
    "#-#\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_oilpaint.po (digikamimageplugin_oilpaint) #-#-"
    "#-#-#\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_perspective.po "
    "(digikamimageplugin_perspective) #-#-#-#-#\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_freerotation.po "
    "(digikamimageplugin_freerotation) #-#-#-#-#\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugins.po (digikamimageplugins) #-#-#-#-#\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_raindrop.po (digikamimageplugin_raindrop) #-#-"
    "#-#-#\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_blowup.po (digikamimageplugin_blowup) #-#-#-#-"
    "#\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_charcoal.po (digikamimageplugin_charcoal) #-#-"
    "#-#-#\n"
    (...)

    @vstinner
    Member Author

    vstinner commented Mar 8, 2019

    /usr/share/locale/ml/LC_MESSAGES/ktraderclient5.mo:

    svn cat svn://anonsvn.kde.org/home/kde/trunk/l10n-kf5/ml/messages/kde-workspace/ktraderclient5.po > ml_ktraderclient5.po

    Extract:

    msgid ""
    msgstr ""
    "Project-Id-Version: ktraderclient\n"
    "Report-Msgid-Bugs-To: http://bugs.kde.org\\n"
    "POT-Creation-Date: 2018-08-16 09:14+0200\n"
    "PO-Revision-Date: 2008-07-10 22:04+0530\n"
    "Last-Translator: # ANI PETER|അനി പീറ്റര്<200d> <peter.ani@gmail.com>\n"
    "Language-Team: Swathanthra|സ്വതന്ത്ര Malayalam|മലയാളം Computing|കമ്പ്യൂട്ടിങ്ങ് <smc-"
    "discuss@googlegroups.com>\n"
    "Language: ml\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "X-Generator: KBabel 1.11.4\n"
    "Plural-Forms: nplurals=2; plural=(n != 1);\n"

    @JulienPalard
    Member

    That's literally sick. Looks like we have to trust the "\n", not the file wrapping, but this means that:

    msgstr ""
    "Pro"
    "jec"
    "t-I"
    "d-V"
    "ers"
    "ion"
    ": "
    "dig"
    "ika"
    "m\n"
    "Report-Msgid-Bugs-To: http://bugs.kde.org\\n"

    is valid, too? I have to try it!

    HAHA it is:

    $ cat ~/clones/python-docs-fr/glossary.po | head -n 20
    # Copyright (C) 2001-2018, Python Software Foundation
    # For licence information, see README file.
    #
    msgid ""
    msgstr ""
    "Pr"
    "oj"
    "ec"
    "t-"
    "Id"
    "-V"
    "er"
    "si"
    "on"
    ":"
    " P"
    "ython 3.6\n"
    "Report-Msgid-Bugs-To: \n"
    "POT-Creation-Date: 2018-12-21 09:48+0100\n"
    "PO-Revision-Date: 2019-03-08 14:48+0100\n"
    
    $ msgcat ~/clones/python-docs-fr/glossary.po | head -n 20
    # Copyright (C) 2001-2018, Python Software Foundation
    # For licence information, see README file.
    #
    msgid ""
    msgstr ""
    "Project-Id-Version: Python 3.6\n"
    "Report-Msgid-Bugs-To: \n"
    "POT-Creation-Date: 2018-12-21 09:48+0100\n"
    "PO-Revision-Date: 2019-03-08 14:48+0100\n"
    "Last-Translator: Jules Lasne <jules.lasne@gmail.com>\n"
    "Language-Team: FRENCH <traductions@lists.afpy.org>\n"
    "Language: fr\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "X-Generator: Poedit 2.0.2\n"
    "# Pouette\n"
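
The behaviour seen here (adjacent quoted fragments are concatenated, and only "\n" ends a logical header line) can be sketched in a few lines. This is a simplified illustration, not how msgcat or msgfmt are implemented: `unescape` and `join_po_fragments` are hypothetical helpers, and only a handful of escapes are handled.

```python
# Simplified escape table: only the sequences that matter for this example.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unescape(fragment):
    """Decode the backslash escapes of one quoted PO fragment."""
    out = []
    chars = iter(fragment)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append(ESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


def join_po_fragments(lines):
    """Concatenate adjacent quoted fragments, ignoring the file wrapping."""
    parts = []
    for line in lines:
        line = line.strip()
        if len(line) >= 2 and line.startswith('"') and line.endswith('"'):
            parts.append(unescape(line[1:-1]))
    return "".join(parts)


wrapped = [
    r'"Pro"', r'"jec"', r'"t-I"', r'"d-V"', r'"ers"', r'"ion"',
    r'": "', r'"dig"', r'"ika"', r'"m\n"',
    r'"#-#-#-#-# digikamimageplugin_raindrop.po (digikamimageplugin_raindrop) #-#-"',
    r'"#-#-#\n"',
]
logical_lines = join_po_fragments(wrapped).splitlines()
print(logical_lines)
```

Splitting the joined string on newlines gives back one clean `Project-Id-Version` line and one complete marker line, whatever the wrapping in the .po file was.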

    @JulienPalard
    Member

    I tested further, and when we have this horrible mess in the po files:

    msgstr ""
    "Pro"
    "jec"
    "t-I"
    "d-V"
    "ers"
    "ion"
    ": "
    "dig"
    "ika"
    "m\n"

    We have a clean string in the .mo file.

    So there is nothing to fear from:

    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_raindrop.po (digikamimageplugin_raindrop) #-#-"
    "#-#-#\n"
    "X-Generator: KBabel 1.11.4\n"

    It will be nicely stored in the .mo as:

    Plural-Forms: nplurals=1; plural=0;
    #-#-#-#-# digikamimageplugin_raindrop.po (digikamimageplugin_raindrop) #-#-#-#-#
    X-Generator: KBabel 1.11.4

    So you can safely remove lines starting and ending with #-#-#-#-#.
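
As a rough sketch of what that could look like when turning the header text into a dict: the code below is a standalone illustration, not the code of gettext.py or of any patch. `parse_header` and its `skip_markers` flag are hypothetical names, and the continuation handling (a line without ":" is appended to the previous field) is assumed from the outputs shown earlier in this issue.

```python
MARKER = "#-#-#-#-#"


def parse_header(raw, skip_markers=True):
    """Parse 'Key: value' header lines into a dict with lowercase keys."""
    info = {}
    last_key = None
    for line in raw.split("\n"):
        item = line.strip()
        if not item:
            continue
        # Drop lines that both start and end with the #-#-#-#-# marker.
        if skip_markers and item.startswith(MARKER) and item.endswith(MARKER):
            continue
        if ":" in item:
            key, value = item.split(":", 1)
            last_key = key.strip().lower()
            info[last_key] = value.strip()
        elif last_key:
            # No "Key:" prefix: treat the line as a continuation.
            info[last_key] += "\n" + item
    return info


raw = (
    "Plural-Forms: nplurals=1; plural=0;\n"
    "#-#-#-#-# digikamimageplugin_raindrop.po (digikamimageplugin_raindrop) #-#-#-#-#\n"
    "X-Generator: KBabel 1.11.4\n"
)
print(parse_header(raw))
print(parse_header(raw, skip_markers=False))
```

With `skip_markers=False` the marker ends up glued to `plural-forms`, which is the behaviour reported at the top of this issue. With the default, the two fields come out clean.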

    @JulienPalard
    Member

    New changeset afd1e6d by Julien Palard in branch 'master':
    bpo-36239: Skip comments in gettext infos (GH-12255)

    @vstinner
    Member Author

    vstinner commented May 9, 2019

    Julien: Why not fix Python 3.7?

    You approved #13218 (Python 3.7 backport) but then you closed it. Only the Azure Pipelines PR check failed, on "ERROR: test_drain_raises (test.test_asyncio.test_streams.StreamTests)", which is unrelated.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022