Add cp65001 to encodings/aliases.py #50308

Closed
tzot mannequin opened this issue May 19, 2009 · 20 comments
Labels
OS-windows, stdlib (Python modules in the Lib dir), topic-unicode, type-feature (A feature request or enhancement)

Comments

@tzot
Mannequin

tzot mannequin commented May 19, 2009

BPO 6058
Nosy @malemburg, @loewis, @pitrou, @vstinner, @ezio-melotti, @skrah
Files
  • alias_cp65001.diff: One-line addition of cp65001 aliased to utf_8
  • testnetcodecs.py
  • gen65001.c: Generate multibyte characters with cp65001
  • check65001.py: Check output of gen65001.exe
  • export-encodings.py
  • check-encodings.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2010-11-08.04:11:51.005>
    created_at = <Date 2009-05-19.00:21:32.287>
    labels = ['invalid', 'type-feature', 'library', 'expert-unicode', 'OS-windows']
    title = 'Add cp65001 to encodings/aliases.py'
    updated_at = <Date 2011-10-26.23:48:07.643>
    user = 'https://bugs.python.org/tzot'

    bugs.python.org fields:

    activity = <Date 2011-10-26.23:48:07.643>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-11-08.04:11:51.005>
    closer = 'vstinner'
    components = ['Library (Lib)', 'Unicode', 'Windows']
    creation = <Date 2009-05-19.00:21:32.287>
    creator = 'tzot'
    dependencies = []
    files = ['14014', '15477', '15661', '15662', '15858', '15859']
    hgrepos = []
    issue_num = 6058
    keywords = ['patch']
    message_count = 20.0
    messages = ['88060', '96065', '96066', '96076', '96077', '96080', '96758', '96796', '96807', '96809', '96815', '97731', '97732', '106274', '119440', '119441', '119444', '119447', '120712', '146467']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'loewis', 'tzot', 'pitrou', 'vstinner', 'ezio.melotti', 'skrah', 'davidsarah', 'David.Sankel']
    pr_nums = []
    priority = 'high'
    resolution = 'not a bug'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue6058'
    versions = ['Python 3.2']

    @tzot
    Mannequin Author

    tzot mannequin commented May 19, 2009

    Add 'cp65001' (Microsoft term for UTF-8) as an alias to 'utf_8'

    tzot (mannequin) added the stdlib, topic-unicode, type-feature, and OS-windows labels on May 19, 2009
    @malemburg
    Member

    Could you provide an official reference defining the alias?

    Thanks.

    @malemburg
    Member

    Never mind, I found this reference:

    http://msdn.microsoft.com/en-us/library/system.text.encoding(VS.80).aspx

    Looks like we could add a few more aliases for other encodings as well.

    @loewis
    Mannequin

    loewis mannequin commented Dec 7, 2009

    > http://msdn.microsoft.com/en-us/library/system.text.encoding(VS.80).aspx
    >
    > Looks like we could add a few more aliases for other encodings as well.

    I wouldn't trust this table. Microsoft is on record as implementing the
    code pages with slight variations compared to other references for some
    encodings (in particular the Asian ones). So unless there is an actual
    documented need for a certain alias (and preferably a demonstration that
    Microsoft's interpretation of the code page is the same as Python's),
    I would advise against adding such aliases.

    @malemburg
    Member

    Martin v. Löwis wrote:

    > > http://msdn.microsoft.com/en-us/library/system.text.encoding(VS.80).aspx
    > >
    > > Looks like we could add a few more aliases for other encodings as well.
    >
    > I wouldn't trust this table. Microsoft is on record as implementing the
    > code pages with slight variations compared to other references for some
    > encodings (in particular the Asian ones). So unless there is an actual
    > documented need for a certain alias (and preferably a demonstration that
    > Microsoft's interpretation of the code page is the same as Python's),
    > I would advise against adding such aliases.

    Fair enough.

    Could someone with some IronPython/.NET foo check whether the
    code pages are the same as the Python codecs?

    The above page has some sample code to get started and IronPython
    provides easy access to both the .NET codecs and the Python ones.

    Thanks,

    Marc-Andre Lemburg
    eGenix.com



    @malemburg
    Member

    Here's a script for IronPython 2.6 that checks a few encoders.

    Since IronPython doesn't appear to come with the full set of Python
    codecs and it's also not clear whether the implemented codecs actually
    match the default Python ones, I'm not sure how reliable this output is.

    It's probably better to dump the encoded data to a file and compare
    against a CPython run.

    Anyway, here's the output:

    Code Page 65000 vs. encoding 'utf-7'

    0 errors

    Code Page 65001 vs. encoding 'utf-8'

    0 errors

    Code Page 1200 vs. encoding 'utf-16-le'

    0 errors

    Code Page 1201 vs. encoding 'utf-16-be'

    0 errors

    Code Page 28591 vs. encoding 'iso-8859-1'

    0 errors

    @pitrou
    Member

    pitrou commented Dec 21, 2009

    (I tried running your script under IronPython 2.6 with Mono but I got a
    bunch of errors; since I don't know IronPython at all I can't really
    investigate)

    @skrah
    Mannequin

    skrah mannequin commented Dec 22, 2009

    I wrote a small C application that converts all possible
    wchar_t to multibyte strings, using code page 65001.

    Usage:

    cl.exe gen65001.c
    python check65001.py

    Except for the newline character and the range 55296-57343
    (the surrogate code points U+D800 to U+DFFF), this code page matches UTF-8.

    Note, however, that cp65001 is a pseudo code page:

    http://www.postgresql.org/docs/faqs.FAQ_windows.html#2.6

    For instance, setlocale will not work:

    http://blogs.msdn.com/michkap/archive/2006/03/13/550191.aspx
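
The decimal range 55296-57343 quoted above is the UTF-16 surrogate block. A small sketch of how CPython's own UTF-8 codec treats one of those code points, assuming Python 3; `surrogatepass` is a standard error handler, but what the Windows code page emits for the same input is not shown here.

```python
FIRST_SURROGATE = 0xD800  # 55296
LAST_SURROGATE = 0xDFFF   # 57343

# Size of the range that did not match.
surrogate_count = LAST_SURROGATE - FIRST_SURROGATE + 1

lone = chr(FIRST_SURROGATE)
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    # Strict UTF-8 refuses to encode a lone surrogate.
    strict_ok = False

# With surrogatepass the code point is encoded as an ordinary 3-byte sequence.
lenient = lone.encode("utf-8", "surrogatepass")
print(surrogate_count, strict_ok, lenient)
```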

    @loewis
    Mannequin

    loewis mannequin commented Dec 22, 2009

    This report is really about the issues reported in bpo-1602 and bpo-7441, i.e.
    where console output fails if the terminal encoding is 65001. Rather
    than adding the alias, I would prefer to find out why terminal output
    fails in that code page.

    @tzot
    Mannequin Author

    tzot mannequin commented Dec 22, 2009

    re Martin's question, I can offer the indirect wisdom of Michael Kaplan
    in this blog post:

    http://blogs.msdn.com/michkap/archive/2008/03/18/8306597.aspx

    where he mentions that the easiest way to output Unicode text in the
    Windows console is:

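    #include <fcntl.h>  /* _O_U16TEXT */
    #include <io.h>     /* _setmode */
    #include <stdio.h>  /* _fileno, stdout */
    #include <wchar.h>  /* wprintf */

    int main(void) {
        _setmode(_fileno(stdout), _O_U16TEXT);
        wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
        return 0;
    }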

    _setmode being the special call needed.

    I haven't tested with _O_U8TEXT (if such a thing exists). I don't do
    Windows anymore, so I can't provide a patch.

    It also seems that Python, when stdin/stdout/stderr is under control of
    a Windows console, doesn't use plain *printf functions. The example code
    I offered in one of the other issues (dumb stdout doing plain .write as
    UTF-8) runs and displays fine.

    @loewis
    Mannequin

    loewis mannequin commented Dec 22, 2009

    I also wonder whether stdin/stdout/stderr should be streams on Windows
    that use WriteConsole instead of WriteFile. Then the entire issue of
    console CP would go away for Unicode output.

    @malemburg
    Member

    I created two scripts for exporting the IronPython findings and checking them in CPython.

    These are the results:

    Checking code Page 28591 against encoding 'iso-8859-1' using file 'iso-8859-1.map'

    0 errors

    Checking code Page 28592 against encoding 'iso-8859-2' using file 'iso-8859-2.map'

    0 errors

    Checking code Page 28593 against encoding 'iso-8859-3' using file 'iso-8859-3.map'

    0 errors

    Checking code Page 28594 against encoding 'iso-8859-4' using file 'iso-8859-4.map'

    0 errors

    Checking code Page 28595 against encoding 'iso-8859-5' using file 'iso-8859-5.map'

    0 errors

    Checking code Page 1201 against encoding 'utf-16-be' using file 'utf-16-be.map'

    2048 errors

    Checking code Page 1200 against encoding 'utf-16-le' using file 'utf-16-le.map'

    2048 errors

    Checking code Page 65000 against encoding 'utf-7' using file 'utf-7.map'

    21 errors

    Checking code Page 65001 against encoding 'utf-8' using file 'utf-8.map'

    2048 errors

    Result:

    We can add aliases for the various ISO mappings, but not for the UTF ones. .NET encodes the surrogates differently than Python's codecs and
    it also produces different results for UTF-7 than Python's codec.
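
The figure of 2048 equals the size of the surrogate block, which suggests (an inference, not something the scripts above report directly) that the UTF-8 and UTF-16 mismatches are all lone surrogates. A rough sketch that counts the BMP code points a Python 3 codec refuses in strict mode; `count_strict_failures` is a hypothetical helper and is not part of the attached scripts.

```python
def count_strict_failures(encoding):
    """Count BMP code points that `encoding` rejects with errors='strict'."""
    failures = 0
    for code_point in range(0x10000):
        try:
            chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            failures += 1
    return failures


utf8_failures = count_strict_failures("utf-8")
print("utf-8:", utf8_failures)
```

The same helper can be pointed at any codec name to see how much of the BMP it covers.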

    @malemburg
    Member

    What we could do is add new codecs based on the .NET tables for cp65000 et al.

    However, before doing this, I'd like to know where these code page settings can occur and what exact names are used for them. If they only appear in .NET and IronPython, I don't think it's worth adding extra codecs for the MS UTF variants.

    @vstinner
    Member

    Would it be possible to implement a "cp65001" codec in Python using MultiByteToWideChar() / WideCharToMultiByte() with codepage=CP_UTF8?

    @davidsarah
    Mannequin

    davidsarah mannequin commented Oct 23, 2010

    This problem causes {{{os.getcwdu()}}} to fail when the console code page is set to 65001 (always, I think):
    {{{
    t:\>ver

    Microsoft Windows [Version 6.0.6002]

    t:\>chcp
    Active code page: 65001

    t:\>python -c "import os; print os.getcwdu()"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    LookupError: unknown encoding: cp65001

    t:\>chcp 1252
    Active code page: 1252

    t:\>python -c "import os; print os.getcwdu()"
    t:\
    }}}
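
The traceback above is the generic failure for any codec name the interpreter does not know. A minimal sketch of probing for a codec before relying on it, assuming Python 3; `pick_encoding` is a hypothetical helper, and the UTF-8 fallback is an assumption that will not suit every console.

```python
import codecs


def pick_encoding(preferred, fallback="utf-8"):
    """Return `preferred` if a codec is registered for it, else `fallback`."""
    try:
        codecs.lookup(preferred)
    except LookupError:
        # Same exception type as in the traceback: the name is unknown.
        return fallback
    return preferred


print(pick_encoding("no-such-codec-xyz"))
print(pick_encoding("latin-1"))
```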

    Incidentally, I don't agree that this codepage needs to be distinguished from UTF-8. The deviations in the Microsoft codec are just their bugs. There is only one correct way to encode/decode UTF-8, and cp65001 is supposed to be UTF-8 according to Microsoft (e.g. http://msdn.microsoft.com/en-us/library/86hf4sb8%28en-US,VS.80%29.aspx ).

    @davidsarah
    Mannequin

    davidsarah mannequin commented Oct 23, 2010

    I said: "There is only one correct way to encode/decode UTF-8". This is true modulo differences in the treatment of initial byte order marks.
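
Python exposes that byte order mark difference as two codec names. A short illustration using the standard `utf-8` and `utf-8-sig` codecs, assuming Python 3; how cp65001 itself handles a BOM is not shown here.

```python
text = "abc"

plain = text.encode("utf-8")         # no byte order mark
with_bom = text.encode("utf-8-sig")  # prefixed with the bytes EF BB BF

# utf-8 keeps a leading BOM as U+FEFF; utf-8-sig strips it on decode.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
print(plain, with_bom, repr(kept), repr(stripped))
```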

    @davidsarah
    Mannequin

    davidsarah mannequin commented Oct 23, 2010

    I meant to say that the os.getcwdu() test in msg119440 was done with Windows native Python 2.6.2.

    @davidsarah
    Mannequin

    davidsarah mannequin commented Oct 23, 2010

    Oops, false alarm. python -c "import os; print repr(os.getcwdu())" works as expected, so the exception is part of bpo-1602.

    (My comment about there being no need to distinguish this codepage from UTF-8 stands.)

    @vstinner
    Member

    vstinner commented Nov 8, 2010

    Different tests proved that cp65001 can *not* be set as an alias to utf-8, and that's why I'm closing this issue.

    Anyway, I don't think that cp65001 is configured by default on any Windows setup. It is only set by the user, using the chcp command, to try to display Unicode characters in the Windows console, but it is not possible to display any Unicode character in this console (see issue bpo-1602). The chcp command should not be used in the Windows console because it does not only change the ANSI code page: it also changes the console code page, which is wrong (the console still expects text encoded in the previous code page).

    It is possible to implement a codec for cp65001 using utf-8 existing codec in surrogatepass mode, or by using MultiByteToWideChar() / WideCharToMultiByte() with codepage=CP_UTF8. But I don't think that we need cp65001 at all.

    If you need cp65001 for a good reason and you would like to implement a cp65001 Python codec, open a new issue.

    If you consider that we should use _O_U8TEXT or _O_U16TEXT, open another new issue.

    _O_U8TEXT or _O_U16TEXT might improve Unicode support if Python output is redirected to a pipe, but I don't think that it would help to display Unicode characters in the Windows console. I also fear that it would break existing code and any function not aware of this special mode.
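
The `surrogatepass` option mentioned above can be sketched as a small custom codec. This is only an illustration, assuming Python 3: the name `cp65001_sketch` and the helper functions are hypothetical, the caller's `errors` argument is deliberately ignored, and this is not the codec that was later added in bpo-13216.

```python
import codecs


def _encode(text, errors="strict"):
    # Ignore `errors`: always let lone surrogates through.
    return text.encode("utf-8", "surrogatepass"), len(text)


def _decode(data, errors="strict"):
    # The codec machinery may hand over a memoryview, so copy to bytes first.
    data = bytes(data)
    return data.decode("utf-8", "surrogatepass"), len(data)


def _search(name):
    """Hypothetical search function for the illustrative codec name."""
    if name == "cp65001_sketch":
        return codecs.CodecInfo(name="cp65001_sketch", encode=_encode, decode=_decode)
    return None


codecs.register(_search)

round_trip = "\ud800".encode("cp65001_sketch").decode("cp65001_sketch")
print(repr(round_trip))
```

Registering under a separate name keeps the built-in `utf-8` codec strict for everything else.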

    @vstinner
    Member

    I added a cp65001 codec to Python 3.3: see issue bpo-13216.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022