Document the encoding of functions bytes arguments of the C API #53947

Closed
vstinner opened this issue Sep 1, 2010 · 12 comments
Labels
docs Documentation in the Doc dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode

Comments

@vstinner
Member

vstinner commented Sep 1, 2010

BPO 9738
Nosy @terryjreedy, @abalkin, @vstinner, @merwok, @davidmalcolm
Files
  • encodings.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2011-05-30.21:13:23.317>
    created_at = <Date 2010-09-01.22:41:34.740>
    labels = ['interpreter-core', 'expert-unicode', 'docs']
    title = 'Document the encoding of functions bytes arguments of the C API'
    updated_at = <Date 2011-05-30.21:13:23.316>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2011-05-30.21:13:23.316>
    actor = 'vstinner'
    assignee = 'docs@python'
    closed = True
    closed_date = <Date 2011-05-30.21:13:23.317>
    closer = 'vstinner'
    components = ['Documentation', 'Interpreter Core', 'Unicode']
    creation = <Date 2010-09-01.22:41:34.740>
    creator = 'vstinner'
    dependencies = []
    files = ['18705']
    hgrepos = []
    issue_num = 9738
    keywords = ['patch']
    message_count = 12.0
    messages = ['115339', '115404', '115405', '115523', '115543', '115942', '123655', '123659', '124692', '124696', '125359', '137331']
    nosy_count = 6.0
    nosy_names = ['terry.reedy', 'belopolsky', 'vstinner', 'eric.araujo', 'dmalcolm', 'docs@python']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue9738'
    versions = ['Python 3.2']

    @vstinner
    Member Author

    vstinner commented Sep 1, 2010

    Many C functions take a bytes argument (char* type), but the encoding is not documented. It would not be a problem if the encoding were always the same, but it is not. Examples:

    • format of PyUnicode_FromFormat() should be encoded as ISO-8859-1
    • filename of PyParser_ASTFromString() should be encoded as utf-8
    • filename of PyErr_SetFromErrnoWithFilename() should be encoded to the filesystem encoding (with strict error handler, and not surrogateescape)
    • 's' argument of PyParser_ASTFromString() should be encoded as utf-8 if PyPARSE_IGNORE_COOKIE flag is set, otherwise the parser checks for #coding:xxx cookie (if there is no cookie, utf-8 is used)

    The attached patch is an attempt to document most low-level functions. I chose to add the names of function arguments to the headers because I consider that a header can be used as quick documentation. I only touched .c files to change argument names.

    It is hard to get the right encoding, so I cannot ensure that my patch is correct. My patch is just a draft.

    I don't know if "encoded to utf-8" is the right expression. Or should it be "decoded as utf-8"?
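
To illustrate why the undocumented encoding matters, here is a minimal, self-contained C sketch (it does not use the Python C API). The helper names `decode_latin1` and `decode_utf8_basic` are hypothetical, and the UTF-8 decoder deliberately handles only 1- and 2-byte sequences. The same two bytes yield different text depending on which encoding the callee assumes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ISO-8859-1: every byte maps to the code point with the same value. */
static size_t decode_latin1(const unsigned char *buf, size_t len,
                            uint32_t *out)
{
    assert(buf != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = buf[i];
    return len;
}

/* Illustrative UTF-8 decoder: accepts only 1- and 2-byte sequences and
   returns (size_t)-1 for anything else (longer or malformed input). */
static size_t decode_utf8_basic(const unsigned char *buf, size_t len,
                                uint32_t *out)
{
    size_t i = 0, n = 0;
    assert(buf != NULL && out != NULL);
    while (i < len) {
        unsigned char b = buf[i];
        if (b < 0x80) {
            out[n++] = b;                       /* ASCII byte */
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF && i + 1 < len
                   && (buf[i + 1] & 0xC0) == 0x80) {
            /* 2-byte sequence: 110xxxxx 10xxxxxx */
            out[n++] = ((uint32_t)(b & 0x1F) << 6) | (buf[i + 1] & 0x3F);
            i += 2;
        } else {
            return (size_t)-1;
        }
    }
    return n;
}
```

Given the bytes 0xC3 0xA9, `decode_latin1` produces two code points (U+00C3, U+00A9) while `decode_utf8_basic` produces one (U+00E9), so a caller who guesses the wrong encoding silently gets different text.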

    @vstinner vstinner added docs Documentation in the Doc dir interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels Sep 1, 2010
    @merwok
    Member

    merwok commented Sep 2, 2010

    I think either of these is correct:

    • a UTF-8-encoded string
    • a string encoded in UTF-8

    @davidmalcolm
    Member

    I think either of these is correct:

    • a UTF-8-encoded string
    • a string encoded in UTF-8

    Possibly use the word "buffer" here, rather than "string", as "string" may suggest the "str" type.

    Or even: "NUL-terminated buffer of UTF-8-encoded bytes", or whatnot.

    (sorry for bikeshedding)

    @terryjreedy
    Member

    Better specifying requirements is good. A few comments:

    -   The second argument is an error message; it is converted to a string object.
    +   The second argument is an error message; it is decoded to a string object
    +   with ``'utf-8'`` encoding.
     
    I would write the change as
    +   The second argument is a utf-8 encoded error message; it is decoded to a string object. 

    Is the second part (what the function will do with the arg) really needed? I think in the current version, it serves to indirectly specify that the arg is not to be a string, but bytes. If the specific encoding required is specified, that also says bytes, making 'will be decoded' redundant and irrelevant.
    -------------------------------

    + a Python exception (class, not an instance). *format* should be a string
    + encoded to ISO-8859-1, containing format codes,

    *format* should be ISO-8859-1 encoded bytes containing format codes,

    although I am not clear about the implications of that. Are not all format codes ASCII chars?
    --------------------------------

    I do not really like 'encoded to', but 'decoded to' is wrong. 'will be decoded from xxx bytes' is better. I think there should be a general discussion somewhere about bytes arguments and the terminology that will be used.

    @vstinner
    Member Author

    vstinner commented Sep 3, 2010

    About PyErr_Format() and PyUnicode_FromFormat*() encoding: it's not exactly ISO-8859-1... there is a bug => issue bpo-9769.

    @vstinner
    Member Author

    vstinner commented Sep 9, 2010

    bpo-6543 changed the encoding of the filename argument of PyRun_SimpleFileExFlags() (and all functions based on PyRun_SimpleFileExFlags) and c_filename attribute of the compiler (private) structure in Python 3.1.3: use utf-8 in strict mode instead of filesystem encoding with surrogateescape.

    @davidmalcolm
    Member

    A (probably crazy) idea that just occurred to me:
      typedef char utf8_bytes;
      typedef char iso8859_1_bytes;
      typedef char fsenc_bytes;

    then specify the encoding in the type signature of the API e.g.:

    - int PyRun_SimpleFile(FILE *fp, const char *filename)
    + int PyRun_SimpleFile(FILE *fp, const fsenc_bytes *filename)
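
A small sketch of what such typedefs would and would not give you. The function `count_utf8_bytes` and the helper `demo` are hypothetical names used only for illustration. Because a C typedef is an alias rather than a distinct type, the signature documents the expected encoding, but the compiler does not reject a buffer declared with a different alias:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef char utf8_bytes;   /* documents: buffer holds UTF-8 */
typedef char fsenc_bytes;  /* documents: buffer holds filesystem-encoded bytes */

/* Hypothetical API: the parameter type states the expected encoding. */
static size_t count_utf8_bytes(const utf8_bytes *name)
{
    assert(name != NULL);
    return strlen(name);
}

static size_t demo(void)
{
    const fsenc_bytes *path = "data.txt";
    /* Accepted without a diagnostic: both aliases are plain char, so the
       typedefs act as documentation only, not as type checking. */
    return count_utf8_bytes(path);
}
```

Getting compile-time enforcement would need distinct types (for example, a struct wrapping the pointer), which is a heavier change than the typedefs proposed here.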

    @abalkin
    Member

    abalkin commented Dec 8, 2010

    A (probably crazy) idea that just occurred to me:
    typedef char utf8_bytes;
    typedef char iso8859_1_bytes;
    typedef char fsenc_bytes;

    I like it! Let's see how far we can get without iso8859_1_bytes, though. (It is likely to be locale_bytes anyways.) There are a few places where we'll need ascii_bytes.

    The added benefit is that we can make these typedefs unsigned char and avoid the ambiguity of char signedness. We will also need to give the typedefs the Py_ prefix.

    And an obligatory bikeshedding comment: if we typedef char, we should use the singular form. Or we can typedef char* Py_utf8_bytes.
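
The signedness point can be shown with a short sketch; the helper names `plain_char_value` and `unsigned_char_value` are hypothetical. Whether plain `char` is signed is implementation-defined in C, so a non-ASCII byte such as 0xE9 may read back as a negative number unless it is viewed as `unsigned char`:

```c
#include <assert.h>
#include <stddef.h>

/* First byte viewed as plain char: for bytes >= 0x80 the result is
   negative on platforms where char is signed, positive otherwise. */
static int plain_char_value(const char *s)
{
    assert(s != NULL);
    return s[0];
}

/* First byte viewed as unsigned char: always in the range 0..255. */
static int unsigned_char_value(const char *s)
{
    assert(s != NULL);
    return (unsigned char)s[0];
}
```

With the one-byte buffer `"\xE9"`, `unsigned_char_value` returns 233 everywhere, while `plain_char_value` returns either 233 or -23 depending on the platform. That matters for range comparisons and table lookups on encoded bytes.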

    @vstinner
    Member Author

    r87504 documents encodings of error functions.
    r87505 documents encodings of unicode functions.
    r87506 documents encodings of AST, compiler, parser and PyRun functions.

    @vstinner
    Member Author

    While documenting encodings, I found two issues: bpo-10778 and bpo-10779.

    @abalkin
    Member

    abalkin commented Jan 4, 2011

    Victor,

    Here is an interesting case for your collection: PyDict_GetItemString. Note that it is documented as not setting an error, but in fact it may if decoding the key fails. This is rarely an issue because most uses of PyDict_GetItemString are with an ASCII string literal.
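
Why ASCII literals are safe here: every pure-ASCII byte sequence is also valid UTF-8, so decoding such a key cannot fail. A caller handling keys from untrusted input could apply a check like the one below before passing them on; `is_ascii_key` is a hypothetical helper and not part of the Python C API:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every byte before the terminating NUL is in 0x00..0x7F,
   0 otherwise. Bytes are compared as unsigned char so the result does
   not depend on whether plain char is signed. */
static int is_ascii_key(const char *key)
{
    const unsigned char *p;
    assert(key != NULL);
    for (p = (const unsigned char *)key; *p != '\0'; p++) {
        if (*p >= 0x80)
            return 0;
    }
    return 1;
}
```

A key that fails this check is not necessarily invalid; it just needs real UTF-8 validation, which this sketch does not attempt.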

    @vstinner
    Member Author

    Here is an interesting case for your collection: PyDict_GetItemString.

    It's easier to guess the encoding of such a function: Python 3 always uses UTF-8, but yes, the encoding should be documented.

    I documented many functions, directly in the header files, and sometimes also in the reST documentation.

    I close this issue because I consider it as done. If you would like to document the encoding of some specific functions, please open new issues.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022