
Issues in Unicode HOWTO #65105

Closed
gwideman mannequin opened this issue Mar 13, 2014 · 21 comments
Assignees
Labels
docs (Documentation in the Doc dir), type-feature (A feature request or enhancement)

Comments

@gwideman
Mannequin

gwideman mannequin commented Mar 13, 2014

BPO 20906
Nosy @malemburg, @loewis, @akuchling, @pitrou, @vstinner, @benjaminp, @ezio-melotti, @merwok, @bitdancer, @miss-islington
PRs
  • bpo-20906: Various revisions to the Unicode howto  #8394
  • [3.7] bpo-20906: Various revisions to the Unicode howto (GH-8394) #12155
  • bpo-35393: Fix typo #10876
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/akuchling'
    closed_at = <Date 2019-03-04.14:51:31.020>
    created_at = <Date 2014-03-13.09:16:27.753>
    labels = ['type-feature', 'docs']
    title = 'Issues in Unicode HOWTO'
    updated_at = <Date 2019-05-06.13:44:29.142>
    user = 'https://bugs.python.org/gwideman'

    bugs.python.org fields:

    activity = <Date 2019-05-06.13:44:29.142>
    actor = 'autom'
    assignee = 'akuchling'
    closed = True
    closed_date = <Date 2019-03-04.14:51:31.020>
    closer = 'vstinner'
    components = ['Documentation']
    creation = <Date 2014-03-13.09:16:27.753>
    creator = 'gwideman'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 20906
    keywords = ['patch']
    message_count = 21.0
    messages = ['213367', '213741', '213783', '213784', '214025', '214153', '214179', '214197', '214205', '214321', '214324', '214426', '214458', '214470', '214475', '214476', '214509', '222414', '337068', '337104', '337124']
    nosy_count = 13.0
    nosy_names = ['lemburg', 'loewis', 'akuchling', 'pitrou', 'vstinner', 'benjamin.peterson', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'docs@python', 'tshepang', 'gwideman', 'miss-islington']
    pr_nums = ['8394', '12155', '10876']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue20906'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 13, 2014

    The Unicode HOWTO article is an attempt to help users wrap their minds around Unicode. There are some opportunities for improvement. Issues presented in order of the narrative:

    http://docs.python.org/3.3/howto/unicode.html

    History of Character Codes
    ---------------------------

    References to the 1980s are a bit off.

    "In the mid-1980s an Apple II BASIC program..."

    Assuming the comment is about the state of play in the mid-1980s: the Apple II appeared in 1977. By 1985 we already had Macs, and PCs running DOS, which were capable of various character sets (not to mention lowercase letters!)

    "In the 1980s, almost all personal computers were 8-bit"

    Both the PC (1983) and Mac (1984) had 16-bit processors.

    Definitions:
    ------------
    "Characters are abstractions": Not helpful unless one already knows what "abstraction" means in this specific context.

    "the symbol for ohms (Ω) is usually drawn much like the capital letter omega (Ω) in the Greek alphabet [...] but these are two different characters that have different meanings."

    Omega is a poor example for this concept. Omega is used as the identifier for a unit in the same way as "m" is used for meter, or "A" is used for ampere. Each is a specific use of a character, which, like any specific use, has a particular meaning. However, having a particular meaning doesn't necessarily require a separate character, and in the case of omega, the Unicode standard now says that the separate "ohm" character is deprecated.

    "The ohm sign is canonically equivalent to the capital omega, and normalization would remove any distinction."

    http://www.unicode.org/versions/Unicode4.0.0/ch07.pdf#search=%22character%20U%2B2126%20maps%20OR%20map%20OR%20mapping%22

    A better example might be the roman numerals, code points U+2160 and subsequent.
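For reference, the canonical equivalence mentioned above is easy to check with Python's unicodedata module (a quick sketch, not part of the HOWTO text itself):

```python
import unicodedata

ohm = '\u2126'    # OHM SIGN
omega = '\u03a9'  # GREEK CAPITAL LETTER OMEGA

print(ohm == omega)                                # False: two distinct code points
print(unicodedata.normalize('NFC', ohm) == omega)  # True: normalization folds the
                                                   # deprecated ohm sign into omega
```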

    Definitions
    ------------

    "A code point is an integer value, usually denoted in base 16."

    When trying to convey clearly the distinction between character, code point, and byte representation, the topic of "how it's denoted" is a potential distraction for the reader, so I suggest this point be a bit more explicitly parenthetical, and less confusable with "16 bit". Like:

    "A code point value is an integer in the range 0 to over 0x10FFFF (about 1.1 million, with some 110 thousand assigned so far). In a narrative such as the current article, a code point value is usually written in hexadecimal. The Unicode standard displays code points with the notation U+265E to mean the character with value 0x265e (9822 decimal; "Black Chess Knight" character)."

    (Also revise subsequent para to use same example character. I suggest not using "Ethiopic Syllable WI", because it's unfamiliar to most readers, and it muddies the topic by suggesting that Unicode in general captures _syllables_ rather than _characters_.)
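The suggested example character can be verified interactively (a sketch, assuming the standard library's unicodedata module):

```python
import unicodedata

ch = '\u265E'
print(ord(ch))               # 9822: the code point as a plain integer
print(hex(ord(ch)))          # 0x265e: the same integer in hexadecimal
print(unicodedata.name(ch))  # BLACK CHESS KNIGHT
print(chr(0x265E) == ch)     # True: chr() maps a code point back to the character
```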

    Encodings:
    -----------
    "This sequence needs to be represented as a set of bytes"
    --> ""This code point sequence needs to be represented as a sequence of bytes"

    "4. Many Internet standards are defined in terms of textual data"

    This is a vague claim. Probably what was intended was: "Many Internet standards define protocols in which the data must contain no zero bytes, or zero bytes have special meaning." Is this actually true? Are there "many" such standards?

    "Generally people don’t use this encoding,"
    Probably "people" per se don't use any encoding, computers do. --> "Because of these problems, other more efficient and convenient encodings have been devised and are commonly used.

    For continuity, directly after that para should come the later paras starting with "UTF-8 is one of the most common".

    "2. A Unicode string is turned into a string of bytes..."
    --> "2. A Unicode string is turned into a sequence of bytes..." (Ie: don't overload "string" in and article about strings and encodings.).

    Create a new subhead "Converting from Unicode to non-Unicode encodings", and move under it the paras:

    "Encodings don't have to..."
    "Latin-1, also known as..."
    "Encodings don't have to..."

    But also revise:

    "Encodings don’t have to handle every possible Unicode character, and most encodings don’t."

    --> "Non-Unicode code systems usually don't handle all of the characters to be found in Unicode."

    @gwideman gwideman mannequin assigned docspython Mar 13, 2014
    @gwideman gwideman mannequin added docs Documentation in the Doc dir type-feature A feature request or enhancement labels Mar 13, 2014
    @merwok merwok changed the title Unicode HOWTO Issues in Unicode HOWTO Mar 13, 2014
    @pitrou
    Member

    pitrou commented Mar 16, 2014

    Do you want to provide a patch?

    In a narrative such as the current article, a code point value is usually written in hexadecimal.

    I find use of the word "narrative" intimidating in the context of technical documentation.

    In general, I find it disappointing that the Unicode HOWTO only gives hexadecimal representations of non-ASCII characters and (almost) never represents them in their true form. This makes things more abstract than necessary.

    This is a vague claim. Probably what was intended was: "Many Internet standards define protocols in which the data must contain no zero bytes, or zero bytes have special meaning." Is this actually true? Are there "many" such standards?

    I think it actually means that Internet protocols assume an ASCII-compatible encoding (which UTF-8 is, but not UTF-16 or UTF-32 - nor EBCDIC :-)).

    --> "Non-Unicode code systems usually don't handle all of the characters to be found in Unicode."

    The term *encoding* is used pervasively when dealing with the transformation of unicode to/from bytes, so I find it confusing to introduce another term here ("code systems"). I prefer the original sentence.

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 17, 2014

    Do you want to provide a patch?

    I would be happy to, but I'm not currently set up to create a patch. Also, I hoped that an author who has more history with this article would supervise, especially where I don't know what the original intent was.

    I find use of the word "narrative" intimidating in the context of a technical documentation.

    Agreed. How about "In documentation such as the current article..."

    In general, I find it disappointing that the Unicode HOWTO only gives
    hexadecimal representations of non-ASCII characters and (almost) never
    represents them in their true form. This makes things more abstract
    than necessary.

    I concur with reducing unnecessary abstraction. Not sure what you mean by "true form". Do you mean show the glyph which the code point represents? Or the sequence of bytes? Or display the code point value in decimal?

    > This is a vague claim. Probably what was intended was: "Many
    > Internet standards define protocols in which the data must
    > contain no zero bytes, or zero bytes have special meaning."
    > Is this actually true? Are there "many" such standards?

    I think it actually means that Internet protocols assume an ASCII-compatible
    encoding (which UTF-8 is, but not UTF-16 or UTF-32 - nor EBCDIC :-)).

    Ah -- yes that makes sense.

    > --> "Non-Unicode code systems usually don't handle all of
    > the characters to be found in Unicode."

    The term *encoding* is used pervasively when dealing with the transformation
    of unicode to/from bytes, so I find it confusing to introduce another term here
    ("code systems"). I prefer the original sentence.

    I see that my revision missed the target. There is a problem, but it is wider than this sentence.

    One of the most essential points this article should make clear is the distinction between older schemes with a single mapping:

    Characters <--> numbers in particular binary format. (eg: ASCII)

    ... versus Unicode with two levels of mapping...

    Characters <--> code point numbers <--> particular binary format of the number data and sequences thereof.

    In the older schemes, "encoding" referred to the one mapping: chars <--> numbers in particular binary format. In Unicode, "encoding" refers only to the mapping: code point numbers <--> binary format. It does not refer to the chars <--> code point mapping. (At least, I think that's the case. Regardless, the two mappings need to be rigorously distinguished.)

    On review, there are many points in the article that muddy this up. For example, "Unicode started out using 16-bit characters instead of 8-bit characters". Saying "so-and-so-bit characters" about Unicode, in the current article, is either wrong, or very confusing. Unicode characters are associated with code points, NOT with any _particular_ bit level representation.

    If I'm right about the preceding, then it would be good for that to be spelled out more explicitly, and used consistently throughout the article. (I won't try to list all the examples of this problem here -- too messy.)
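The two-level mapping described above is visible directly in Python (a sketch to make the distinction concrete; the character is chosen arbitrarily):

```python
ch = '\u00e9'                  # LATIN SMALL LETTER E WITH ACUTE
cp = ord(ch)                   # character <--> code point: just an integer,
print(hex(cp))                 # 0xe9, with no particular bit width implied

# Only this second mapping, code point <--> bytes, is an "encoding":
print(ch.encode('utf-8'))      # b'\xc3\xa9'
print(ch.encode('utf-16-le'))  # b'\xe9\x00'
print(ch.encode('utf-32-le'))  # b'\xe9\x00\x00\x00'
```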

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 17, 2014

    A further issue regarding "one-to-one mappings".

    Article: "Encodings don’t have to be simple one-to-one mappings like Latin-1. Consider IBM’s EBCDIC, which was used on IBM mainframes."

    I don't think this paragraph is about one-to-one mappings per se. (ie: one character to one code.) It seems to be about whether ranges of characters whose code values are contiguous in one coding system are also contiguous in another coding system. The EBCDIC encoding is still one-to-one, I believe.

    The subject of one-character-to-one-code mapping is important (normalization etc), though perhaps beyond the current article. But I think the article should avoid suggesting that many-to-one or one-to-many scenarios are common.

    @pitrou
    Member

    pitrou commented Mar 18, 2014

    Agreed. How about "In documentation such as the current article..."

    It's better, but how about simply "In this article"?

    I concur with reducing unnecessary abstraction. Not sure what you mean
    by "true form". Do you mean show the glyph which the code point
    represents? Or the sequence of bytes? Or display the code point value
    in decimal?

    I mean the glyph.

    In the older schemes, "encoding" referred to the one mapping: chars <-->
    numbers in particular binary format. In Unicode, "encoding" refers only to
    the mapping: code point numbers <--> binary format. It does not refer to
    the chars <--> code point mapping. (At least, I think that's the case.
    Regardless, the two mappings need to be rigorously distinguished.)

    This is true, but in this HOWTO's context the term "code system" is a confusing distraction, IMHO. For all intents and purposes, iso-8859-1 and friends *are* encodings (and this is how Python actually names them).

    On review, there are many points in the article that muddy this up. For
    example, "Unicode started out using 16-bit characters instead of 8-bit
    characters". Saying "so-an-so-bit characters" about Unicode, in the
    current article, is either wrong, or very confusing.

    So it should say "16-bit code points" instead, right?

    The subject of one-character-to-one-code mapping is important
    (normalization etc), though perhaps beyond the current article. But I
    think the article should avoid suggesting that many-to-one or one-to-many
    scenarios are common.

    Agreed.

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 19, 2014

    Antoine:

    Thanks for your comments -- this is slippery stuff.

    It's better, but how about simply "In this article"?

    I was hoping to inform the reader that the hex representations are found in many articles, not just special to this one.

    [ showing the glyph ]

    Agreed -- it would be good to show the glyphs mentioned. But in a way that isn't confusing if the user's web browser doesn't show it correctly.

    For all intents and purposes, iso-8859-1 and friends *are* encodings
    (and this is how Python actually names them).

    I am still mulling this over. iso-8859-1 is most literally an "encoding" in the old sense of the word (character <--> byte representation), and is not, per se, a unicode-related concept.

    I think part of the ambiguity problem here is that there are two subtly but importantly different ideas here:

    1. Python string (capable of representing any unicode text) --> some full-fidelity and industry recognized unicode byte stream, like utf-8, or utf-32. I think this is legitimately described as an "encoding" of the unicode string.

    versus:

    2. Python string --> some other code system, such as ASCII, cp1250, etc. The destination code system doesn't necessarily have anything to do with unicode, and whole ranges of unicode's characters either result in an exception, or get translated as escape sequences. Ie: This is more usefully seen as a translation operation, than "merely" encoding.

    In 1, the encoding process results in data that stays within concepts defined within Unicode. In 2, encoding produces data that would be described by some code system outside of Unicode.

    At the moment I think Python muddles these two ideas together, and I'm not sure how to clarify this.
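The distinction between the two ideas can be seen in behavior (a sketch; the snowman character stands in for any non-ASCII text):

```python
s = 'snowman \u2603'

# 1. UTF encodings round-trip any Python string losslessly:
assert s.encode('utf-8').decode('utf-8') == s

# 2. Encoding into a non-Unicode code system can raise, or must substitute:
try:
    s.encode('ascii')
except UnicodeEncodeError:
    print('ASCII cannot represent U+2603')
print(s.encode('ascii', errors='backslashreplace'))  # b'snowman \\u2603'
```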

    So it should say "16-bit code points" instead, right?

    I don't think Unicode code points should ever be described as having a particular number of bits. I think this is a core concept: Unicode separates the character <--> code point, and code point <--> bits/bytes mappings.

    At most, one might want to distinguish different ranges of unicode code points. Even if there is a need to distinguish code points <= 65535, I don't think this should be described as "16-bit", as it muddies the distinction between Unicode's two mappings.

    @malemburg
    Member

    Just to clarify a few things:

    On 20.03.2014 00:50, Graham Wideman wrote:

    I think part of the ambiguity problem here is that there are two subtly but importantly different ideas here:

    1. Python string (capable of representing any unicode text) --> some full-fidelity and industry recognized unicode byte stream, like utf-8, or utf-32. I think this is legitimately described as an "encoding" of the unicode string.

    Right, those are Unicode transformation format (UTF) encodings which are
    capable of representing all Unicode code points.

    versus:

    2. Python string --> some other code system, such as ASCII, cp1250, etc. The destination code system doesn't necessarily have anything to do with unicode, and whole ranges of unicode's characters either result in an exception, or get translated as escape sequences. Ie: This is more usefully seen as a translation operation, than "merely" encoding.

    Those are encodings as well. The operation going from Unicode to one of
    these encodings is called "encode" in Python. The other way around
    "decode".

    In 1, the encoding process results in data that stays within concepts defined within Unicode. In 2, encoding produces data that would be described by some code system outside of Unicode.

    At the moment I think Python muddles these two ideas together, and I'm not sure how to clarify this.

    An encoding is a mapping of characters to ordinals, nothing more or
    less. Unicode is such an encoding, but all others are as well. They
    just happen to have different ranges of ordinals.

    You are viewing all this from a Unicode point of view, but please
    realize that Unicode is rather new in the business and the many
    other encodings Python supports have been around for decades.

    > So it should say "16-bit code points" instead, right?

    I don't think Unicode code points should ever be described as having a particular number of bits. I think this is a core concept: Unicode separates the character <--> code point, and code point <--> bits/bytes mappings.

    At most, one might want to distinguish different ranges of unicode code points. Even if there is a need to distinguish code points <= 65535, I don't think this should be described as "16-bit", as it muddies the distinction between Unicode's two mappings.

    You have UCS-2 and UCS-4. UCS-2 is representable in 16 bits; UCS-4
    needs 21 bits, but is typically stored in 32 bits. Still,
    you're right: it's better to use the correct terms UCS-2 vs. UCS-4
    rather than refer to the number of bits.

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 20, 2014

    Marc-Andre:

    Thanks for commenting:

    > 2. Python string --> some other code system, such as
    > ASCII, cp1250, etc. The destination code system doesn't
    > necessarily have anything to do with unicode, and whole
    > ranges of unicode's characters either result in an
    > exception, or get translated as escape sequences.
    > Ie: This is more usefully seen as a translation
    > operation, than "merely" encoding.

    Those are encodings as well. The operation going from Unicode to one of
    these encodings is called "encode" in Python.

    Yes I am certainly aware that in Python parlance these are also called "encode" (and achieved with encode()), which, I am arguing, is one reason we have confusion. These are not encoding into a recognized Unicode-defined byte stream, they entail translation and filtering into the allowed character set of a different code system and encoding into that code system's byte representation (encoding).

    > In 1, the encoding process results in data that stays within concepts
    > defined within Unicode. In 2, encoding produces data that would be
    > described by some code system outside of Unicode.
    > At the moment I think Python muddles these two ideas together,
    > and I'm not sure how to clarify this.

    An encoding is a mapping of characters to ordinals, nothing more or less.

    In unicode, the mapping from characters to ordinals (code points) is not the encoding. It's the mapping from code points to bytes that's the encoding. While I wish this was a distinction reserved for pedants, unfortunately it's an aspect that's important for users of unicode to understand in order to make sense of how it works, and what the literature and the web says (correct and otherwise).

    You are viewing all this from a Unicode point of view, but please
    realize that Unicode is rather new in the business and the many
    other encodings Python supports have been around for decades.

    I'm advocating that the concepts be clear enough to understand that Unicode (UTF-whatever) works differently (two mappings) than non-Unicode systems (single mapping), so that users have some hope of understanding what happens in moving from one to the other.

    > > So it should say "16-bit code points" instead, right?

    > I don't think Unicode code points should ever be described as
    > having a particular number of bits. I think this is a
    > core concept: Unicode separates the character <--> code point,
    > and code point <--> bits/bytes mappings.

    You have UCS-2 and UCS-4. UCS-2 is representable in 16 bits; UCS-4
    needs 21 bits, but is typically stored in 32 bits. Still,
    you're right: it's better to use the correct terms UCS-2 vs. UCS-4
    rather than refer to the number of bits.

    I think mixing in UCS just adds confusion here. The Unicode consortium has declared "UCS" obsolete, and even wants people to stop using that term:
    http://www.unicode.org/faq/utf_bom.html
    "UCS-2 is obsolete terminology... the term should now be avoided."
    (That's a somewhat silly position -- we must still use the term to talk about legacy stuff. But probably not necessary here.)

    So my point wasn't about UCS. It was about referring to code points as having a particular bit width. Fundamentally, code points are numbers, without regard to some particular computer number format. It is a separate matter that they can be encoded in 8, 16 or 32 bit encoding schemes (utf-8, 16, 32), and that is independent of the magnitude of the code point number.

    It _is_ the case that some code points are large enough integers that when encoded they _require_, say, 3 bytes in utf-8, or two 16-bit words in utf-16 and so on. But the number of bits used in the encoding does not necessarily correspond to the number of bits that would be required to represent the integer code point number in plain binary. (Only in UTF-32 is the encoded value simply the binary version of the code point value.)
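The point about encoded sizes varying with code point magnitude can be demonstrated (a sketch; characters chosen to hit each UTF-8 length):

```python
# A (ASCII), é (Latin-1 range), € (BMP), snake emoji (above the BMP)
for ch in ('A', '\u00e9', '\u20ac', '\U0001f40d'):
    print(f'U+{ord(ch):04X}',
          len(ch.encode('utf-8')),      # 1, 2, 3, 4 bytes respectively
          len(ch.encode('utf-16-le')),  # 2, 2, 2, 4 bytes (surrogate pair for the last)
          len(ch.encode('utf-32-le')))  # always 4 bytes: the code point in plain binary
```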

    @malemburg
    Member

    On 20.03.2014 11:49, Graham Wideman wrote:

    > An encoding is a mapping of characters to ordinals, nothing more or less.

    In unicode, the mapping from characters to ordinals (code points) is not the encoding. It's the mapping from code points to bytes that's the encoding. While I wish this was a distinction reserved for pedants, unfortunately it's an aspect that's important for users of unicode to understand in order to make sense of how it works, and what the literature and the web says (correct and otherwise).

    I know that Unicode terminology provides all kinds of ways to name
    things and we can be arbitrarily pedantic about any of them and
    the fact that the Unicode consortium changes its mind every few
    years isn't helpful either :-)

    We could also have called encodings: "character set", "code page",
    "character encoding", "transformation", etc.

    In Python we keep it simple: you have Unicode (code points) and 8-bit strings
    or bytes (code units).

    Whenever you go from Unicode to bytes, you encode Unicode into some encoding.
    Going back, you decode the encoding back into Unicode. This operation is
    defined by the codec implementing the encoding and it's *not* guaranteed
    to be lossless.

    See here for how we ended up having Unicode support in Python:

    http://www.egenix.com/library/presentations/#PythonAndUnicode

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 21, 2014

    Marc-Andre: Thanks for your latest comments.

    We could also have called encodings: "character set", "code page",
    "character encoding", "transformation", etc.

    I concur with you that things _could_ be called all sorts of names, and the choices may be arbitrary. However, creating a clear explanation requires figuring out the distinct things of interest in the domain, picking terms for those things that are distinct, and then using those terms rigorously. (Usage in the field may vary, which in itself may warrant comment.)

    I read your slide deck/time-capsule-from-2002, with interest, on a number of points. (I realize that you were involved in the Python 2.x implementation of Unicode. Not sure about 3.x?)

    Page 8 "What is a Character?" is lovely, showing very explicitly Unicode's two levels of mapping, and giving names to the separate parts. It strongly suggests this HOWTO page needs a similar figure.

    That said, there are a few notes to make on that slide, useful in trying to arrive at consistent terms:

    1. The figure shows a more precise word for "what users regard as a character", namely "grapheme". I'd forgotten that.

    2. It shows e-accent-acute to demonstrate a pair of code points representing a single grapheme. That's important, but should avoid suggesting this as the only way to form e-accent-acute (canonical equivalence, U+00E9).

    3. The illustration identifies the series of code points (the middle row) as "the Unicode encoding of the string". Ie: The grapheme-to-code-points mapping is described as an encoding. Not a wrong use of general language. But inconsistent with the mapping that encode() pertains to. (And I don't think that the code-point-to-grapheme transform is ever called "decoding", but I could be wrong.)

    4. The illustration of Code Units (in the third row) shows graphemes for the Code Units (byte values). This confusingly glosses over the fact that those graphemes correspond to what you would see if you _decoded_ these byte values using CP1252 or ISO 8859-1, suggesting that the result is reasonable or useful. It certainly happens that people do this, deliberately or accidentally, but it is a misuse of the data, and should be warned against, or at least explained as a confusion.
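Point 2 above, about the two ways of forming e-accent-acute, can be checked with unicodedata (a sketch, not from the slide deck):

```python
import unicodedata

precomposed = '\u00e9'  # é as a single code point
combining = 'e\u0301'   # e + COMBINING ACUTE ACCENT: same grapheme, two code points

print(precomposed == combining)                                # False: different code
                                                               # point sequences
print(unicodedata.normalize('NFC', combining) == precomposed)  # True: NFC composes
print(unicodedata.normalize('NFD', precomposed) == combining)  # True: NFD decomposes
```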

    Returning to your most recent message:

    In Python keep it simple: you have Unicode (code points) and
    8-bit strings or bytes (code units).

    I wish it _were_ that simple. And I agree that, in principle (assuming Python 3+), there should be an "inside your program" where you have the str type, which always acts as a sequence of Unicode code points and has string functions. And then there's "outside your program", where text is represented by sequences of bytes in some specified or implied encoding. And your program should use supplied library functions to mostly automatically convert on the way in and on the way out.

    But there are enough situations where the Python programmer, having adopted Python 3's string = Unicode approach, sees unexpected results. That prompts reading this page, which is called upon to make the fine distinctions to allow figuring out what's going on.

    I'm not sure what you mean by "8-bit strings" but I'm pretty sure that's not an available type in Python 3+. Ie: Some functions (eg: encode()) produce sequences of bytes, but those don't work entirely like strs.

    -----------
    This discussion to try to revise the article piecemeal has become pretty diffuse, with perhaps competing notions of purpose, and what level of detail and precision are needed etc. I will try to suggest something productive in a subsequent message.

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 21, 2014

    At the moment I've run out of time to exert much forward push on this.

    By way of temporary summary/suggestion for regrouping: Focus on what this page is intending to deliver. What concepts should readers of this page be able to distinguish and understand when they are finished?

    To scope out the needed concepts, I suggest identifying representative unicode-related stumbling blocks (possibly from stackoverflow questions).

    Here's an example case: just trying to get trivial "beyond ASCII" functionality to work on Windows (Win7, Python 3.3):

    --------------------

    s = 'knight \u265E'
    print('Hello ' + s)

    ... which fails with:

    "UnicodeEncodeError: 'charmap' codec can't encode character '\u265e' in position 13: character maps to undefined".

    A naive attempt to fix this by using s.encode() results in the "+" operation failing.

    What paths forward do programmers explore in an effort to have this code (a) not throw an exception, and produce at least some output, and (b) make it produce the correct output?

    And why does it work as intended on linux?

    The set of concepts identified and explained in this article needs to be sufficient to underpin an understanding of the distinct data types, encodings, decodings, translations, settings etc relevant to this problem, and how to use them to get a desired result.

    There are similar problems that occur at other Python-system boundaries, which would further illuminate the set of necessary concepts.

    Thanks for all comments.

    -- Graham

    @pitrou
    Member

    pitrou commented Mar 22, 2014

    "UnicodeEncodeError: 'charmap' codec can't encode character '\u265e' in position 13: character maps to undefined".

    That's because stdout is treated as a regular bytestream under Windows (as it is under POSIX), and it therefore uses the current "codepage" to encode unicode strings. See bpo-1602.

    And why does it work as intended on linux?

    Because under most current Linux systems, stdout's encoding will be utf-8, and therefore it will be able to represent the given unicode chars.

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 22, 2014

    Antoine:

    _I_ know more or less the explanations behind all this. I am just putting it forward as an example which touches several concepts which are needed to explain it, and that a programmer might reason with to change a program (or the environment) to produce some output (instead of an exception), and possibly even the intended output.

    For example, behind the brief explanation you provide, here are some of the related concepts:

    1. print(s) sends output to stdout, which sends data to windows console (cmd.exe).

    2. In the process, the output machinery that print() invokes via stdout attempts to encode s according to the encoding that the destination, cmd.exe, reports that it expects.

    3. On Windows (in English, or perhaps it's US locale), cmd.exe defaults to expecting encoding cp437.

    4. cp437 is an encoding containing only 256 characters. Many Unicode code points obviously have no corresponding character in cp437.

    5. The encoding step used by print() is set to raise an exception on characters that have no mapping in the encoding stdout expects.

    6. Consequently, print() throws an exception on code points outside of those representable in cp437.

    Based on that, there are a number of moves the programmer might make, with varying results... possibly involving:

    -- s.encode([various choices of options here]) --> s_as_bytes
    -- print(s_as_bytes) (noting that 'Hello ' + s_as_bytes doesn't work)
    -- Or maybe ascii(s)
    -- Or possibly sys.stdout.buffer.write()

    -- Pros and cons of the above, which require careful tracking of what the resulting strings or byte sequences "really mean" at each juncture.

    -- cmd.exe chcp 65001 --> so print(unicode) won't raise an exception, but many chars will still show as [?]
    -- various font choices in cmd.exe which might be able to show the needed graphemes.
    -- Automatic font substitution that occurs in some contexts when the selected font doesn't contain a requested code point and its grapheme.

    ... and probably more concepts that I've missed.
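    The first few moves above can be sketched like this (a sketch only; the '\u265e' sample character is carried over from the earlier error message, and cp437 stands in for whatever encoding the console reports):

    ```python
    import sys

    s = 'Hello \u265e'

    # Encode explicitly, substituting '?' for characters cp437 cannot represent.
    print(s.encode('cp437', errors='replace'))   # prints the repr: b'Hello ?'

    # ascii() produces an all-ASCII repr with escapes, safe on any stream.
    print(ascii(s))                              # 'Hello \u265e'

    # Bypass the text layer and write bytes directly, tracking for yourself
    # what encoding the byte sequence "really means".
    sys.stdout.buffer.write(s.encode('utf-8') + b'\n')
    sys.stdout.flush()
    ```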

    -- Graham

    @loewis
    Mannequin

    loewis mannequin commented Mar 22, 2014

    "4. Many Internet standards are defined in terms of textual data"

    I believe the author was thinking of the "old" TCP-based protocols (ftp, smtp, RFC 822, HTTP), which express their commands/messages as ASCII strings, with variable-length records (often terminated by a line end).

    I think bringing this up as an argument against UTF-32 is somewhat flawed, for two reasons:

    1. Historically, many of these protocols restricted themselves to pure ASCII, so using UTF-8 is as much a protocol violation as is using UTF-32.
    2. The tricky part in these protocols is often not the risk of embedding NUL, but of embedding CRLF (as the bytes 0D 0A might well appear inside an encoded character, e.g. MALAYALAM LETTER UU)

    OTOH, it is a fact that several of these protocols got revised to support Unicode, often by re-interpreting the data as UTF-8 (with MIME being the notable exception that actually allows for UTF-32 on the wire if somebody chooses to).
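    The CRLF point is easy to verify: MALAYALAM LETTER UU is U+0D0A, so its UTF-16/UTF-32 encodings contain the CR LF byte pair (and UTF-32 embeds NUL bytes as well), while its UTF-8 encoding avoids all of them:

    ```python
    uu = '\u0d0a'  # MALAYALAM LETTER UU

    print(uu.encode('utf-16-be'))   # b'\r\n' -- exactly CR LF, fatal to a line-oriented protocol
    print(uu.encode('utf-32-be'))   # b'\x00\x00\r\n' -- embeds NUL bytes as well
    print(uu.encode('utf-8'))       # b'\xe0\xb4\x8a' -- no NUL, CR, or LF bytes can appear
    ```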

    @bitdancer
    Member

    Although I agree that the Unicode Howto needs to provide enough information for someone to reason correctly about python3 unicode, I'd like to note that someone running into the encoding error on windows is *not* going to reach for the unicode howto to solve their problem. Instead they will google the error message, and will find many helpful and unhelpful explanations and solutions. But they currently won't find this document (at least not on the first page of results).

    So, if you really want to help someone with this problem, you need to specifically include that error message in the text as an example of a commonly encountered problem, and then give a directed solution.

    @bitdancer
    Member

    On the other hand, I wonder if such problem/solution pairs should go in the FAQ list rather than the howto, perhaps with a pointer to the howto for those wanting more general information. Specifically the Python on Windows section in this case.

    I realize that you were using it as an example to tease out the concepts needed to reason correctly about a problem, but I think approaching it from the point of view of how the user will reason about it is not optimal. Instead, write the FAQ answer, and figure out what concepts you need to use to *explain* the problem, that you then feel the desire to further expand upon in the howto for those users who reach for a deeper understanding instead of just an immediate solution.

    @gwideman
    Mannequin Author

    gwideman mannequin commented Mar 22, 2014

    @r David: I agree with you. Thanks for extending the line of thinking I outlined.

    @ezio-melotti
    Member

    See also bpo-1581182.

    @akuchling akuchling assigned akuchling and unassigned docspython Sep 15, 2018
    @akuchling
    Member

    New changeset 97c288d by Andrew Kuchling in branch 'master':
    bpo-20906: Various revisions to the Unicode howto (GH-8394)
    97c288d

    @miss-islington
    Contributor

    New changeset 84fa6b9 by Miss Islington (bot) in branch '3.7':
    bpo-20906: Various revisions to the Unicode howto (GH-8394)
    84fa6b9

    @vstinner
    Member

    vstinner commented Mar 4, 2019

    I see a change, so I guess that this old issue can now be marked as fixed. Anyway, the issue didn't get much activity in the last years.

    @vstinner vstinner closed this as completed Mar 4, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022