Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new PyRun_xxx() functions to not encode the filename #63717

Closed
vstinner opened this issue Nov 7, 2013 · 32 comments
Closed

Add new PyRun_xxx() functions to not encode the filename #63717

vstinner opened this issue Nov 7, 2013 · 32 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@vstinner
Copy link
Member

vstinner commented Nov 7, 2013

BPO 19518
Nosy @birkenfeld, @ncoghlan, @larryhastings, @ericsnowcurrently, @serhiy-storchaka
Files
  • pyrun_object.patch
  • pyrun_object-2.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2015-10-02.21:09:44.761>
    created_at = <Date 2013-11-07.11:47:59.991>
    labels = ['interpreter-core', 'type-feature']
    title = 'Add new PyRun_xxx() functions to not encode the filename'
    updated_at = <Date 2015-10-02.21:09:44.761>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2015-10-02.21:09:44.761>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2015-10-02.21:09:44.761>
    closer = 'vstinner'
    components = ['Interpreter Core']
    creation = <Date 2013-11-07.11:47:59.991>
    creator = 'vstinner'
    dependencies = []
    files = ['32524', '32538']
    hgrepos = []
    issue_num = 19518
    keywords = ['patch']
    message_count = 32.0
    messages = ['202326', '202329', '202335', '202338', '202392', '202393', '202397', '202398', '202399', '202411', '203447', '203464', '203474', '203476', '203480', '203481', '203489', '203490', '203592', '203593', '203608', '203618', '206391', '206396', '206449', '206453', '206456', '206460', '206462', '206464', '206466', '247988']
    nosy_count = 7.0
    nosy_names = ['georg.brandl', 'ncoghlan', 'larry', 'Arfrever', 'eric.snow', 'serhiy.storchaka', 'Drekin']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'test needed'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue19518'
    versions = ['Python 3.5']

    @vstinner
    Copy link
    Member Author

    vstinner commented Nov 7, 2013

    The changeset af822a6c9faf of the issue bpo-19512 added the function PyRun_InteractiveOneObject(). By the way, I forgot to document this function. This issue is also a reminder for that. The purpose of the new function is to avoid creation of temporary Unicode strings and useless call to Unicode encoder/decoder.

    I propose to generalize the change to other PyRun_xxx() functions. Attached patch adds the following functions:

    • PyRun_AnyFileObject()
    • PyRun_SimpleFileObject()
    • PyRun_InteractiveLoopObject()
    • PyRun_FileObject()

    On Windows, these changes should allow to pass an unencodable filename on the command line (ex: japanese script name on an english setup).

    TODO: I should document all these new functions.

    @serhiy-storchaka serhiy-storchaka added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Nov 7, 2013
    @serhiy-storchaka
    Copy link
    Member

    On Windows, these changes should allow to pass an unencodable filename on the command line (ex: japanese script name on an english setup).

    Doesn't the surrogateescape error handler solve this issue?

    @vstinner
    Copy link
    Member Author

    vstinner commented Nov 7, 2013

    2013/11/7 Serhiy Storchaka <report@bugs.python.org>:

    > On Windows, these changes should allow to pass an unencodable filename on the command line (ex: japanese script name on an english setup).

    Doesn't the surrogateescape error handler solve this issue?

    surrogateescape is very specific to UNIX, or more generally systems
    using bytes filenames. Windows native type for filename is Unicode. To
    support any Unicode filename on Windows, you must never encode a
    filename.

    surrogateescape avoids decoding errors, here is the problem is an
    encoding error.

    For example, "abé" cannot be encoded to ASCII. "abé".encode("ascii",
    "surrogateescape") doesn't help here.

    @serhiy-storchaka
    Copy link
    Member

    I added some comments on Rietveld.

    Please do not commit without documentation and tests.

    @vstinner
    Copy link
    Member Author

    vstinner commented Nov 7, 2013

    Updated patch addressing some remarks of Serhiy and adding documentation.

    @vstinner
    Copy link
    Member Author

    vstinner commented Nov 7, 2013

    Updated patch addressing some remarks of Serhiy and adding documentation.

    Oh, and it adds also an unit test. I didn't run the unit test on Windows yet.

    @ericsnowcurrently
    Copy link
    Member

    PEP-432 relates pretty closely here.

    @vstinner
    Copy link
    Member Author

    vstinner commented Nov 8, 2013

    PEP-432 relates pretty closely here.

    What is the relation between this issue and the PEP-432?

    @ericsnowcurrently
    Copy link
    Member

    PEP-432 is all about the PyRun_* API and especially relates to refactoring it with the goal of improving extensibility and maintainability. I'm sure Nick could expound, but the PEP is a response to the cruft that has accumulated over the years in Python/pythonrun.c. The result of that organic growth makes it harder than necessary to do things like adding new commandline options. While I haven't looked closely at the new function you added, I expect PEP-432 would have simplified things or even removed the need for a new function.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Nov 8, 2013

    PEP-432 doesn't really touch the PyRun_* APIs - it's all about refactoring
    Py_Initialize so you can use most of the C API during the latter parts of
    the configuration process (e.g. setting up the path for the import system).

    pythonrun.c is just a monstrous beast that covers the entire interpreter
    lifecycle from initialisation through script execution through to
    termination.

    @vstinner
    Copy link
    Member Author

    Updated patch addressing some remarks of Serhiy and adding documentation.

    Anyone for a new review?

    @serhiy-storchaka
    Copy link
    Member

    PyRun_FileObject() looks misleading, because it works with FILE*, not with a file object.

    @vstinner
    Copy link
    Member Author

    PyRun_FileObject() looks misleading, because it works with FILE*, not with a file object.

    I simply replaced the current suffix with Object(). Only filename is converted from char* to PyObject*. Do you have a better suggestion for the new name?

    @serhiy-storchaka
    Copy link
    Member

    No I have not a better suggestion. But I afraid that one day you will wanted to extend PyRun_File*() function to work with a general Python file object (perhaps there is such issue already) and then you will encountered a problem.

    @ncoghlan
    Copy link
    Contributor

    Perhaps we could we use the suffix "Unicode" rather than "Object"? These don't work with arbitrary objects, they expect a unicode string.

    PyRun_InteractiveOneObject would be updated to use the new suffix as well.

    That would both be clearer for the user, and address Serhiy's concern about the possible ambiguity: PyRun_FileUnicode still isn't crystal clear, but it's clearer than PyRun_FileObject.

    @vstinner
    Copy link
    Member Author

    FYI I already added a bunch of new functions with Object suffix when I replaced char* with PyObject*.

    Example:

    http://hg.python.org/cpython/rev/df2fdd42b375
    http://bugs.python.org/issue11619

    @ncoghlan
    Copy link
    Contributor

    Hmm, reading more of those and I think Serhiy is definitely right -
    Object is the wrong suffix. Unicode isn't right either, since the main
    problem is that ambiguity around *which* parameter is a Python Unicode
    object. The API names that end in *StringObject or *FileObject don't
    give the right idea at all.

    The shortest accurate suffix I can come up with at the moment is the
    verbose "WithUnicodeFilename":

    PyParser_ParseStringObject vs
    PyParser_ParseStringWithUnicodeFilename
    

    Other possibilities:

    PyParser_ParseStringUnicode # Huh?
    PyParser_ParseStringDecodedFilename # Slight fib on Windows, but
    

    mostly accurate
    PyParser_ParseStringAnyFilename

    Inserting an underscore before the suffix is another option (although
    I don't think it much matters either way).

    @serhiy-storchaka
    Copy link
    Member

    FYI I already added a bunch of new functions with Object suffix when I replaced char* with PyObject*.

    Most of them were added in 3.4. Unfortunately several functions were added earlier (e.g. PyImport_ExecCodeModuleObject, PyErr_SetFromErrnoWithFilenameObject).

    @vstinner
    Copy link
    Member Author

    So, which suffix should be used?

    @serhiy-storchaka
    Copy link
    Member

    "*Unicode" suffix in existing functions means Py_UNICODE* argument.

    May be "*Ex2"? It can't be misinterpreted but looks ugly.

    @vstinner
    Copy link
    Member Author

    "*Unicode" suffix in existing functions means Py_UNICODE* argument.

    Yes, this is why I chose Object() suffix. Are you still opposed to
    "Object" suffix?

    (Yes, "*Ex2" is really ugly.)

    @ncoghlan
    Copy link
    Contributor

    How about "ExName"?

    This patch:
    PyRun_AnyFileExName
    PyRun_SimpleFileExName
    PyRun_InteractiveOneExName
    PyRun_InteractiveLoopExName
    PyRun_FileExName

    Previous patch:
    Py_CompileStringExName
    PyAST_FromNodeExName
    PyAST_CompileExName
    PyFuture_FromASTExName
    PyParser_ParseFileExName
    PyParser_ParseStringExName
    PyErr_SyntaxLocationExName
    PyErr_ProgramTextExName
    PyParser_ASTFromStringExName
    PyParser_ASTFromFileExName

    • "Ex" has precedent as indicating a largely functionally equivalent API with a different signature
    • "Name" suggests strongly that we're tinkering with the filename (since this APIs don't accept another name)
    • "ExName" is the same length as "Object" but far more explicit

    Thoughts?

    @vstinner
    Copy link
    Member Author

    Sorry, but because of the bikeshedding, I'm not more interested to work on this issue. Don't hesitate to re-work my patch if you want to fix the bug ("On Windows, these changes should allow to pass an unencodable filename on the command line").

    @ncoghlan
    Copy link
    Contributor

    Just getting this on Larry's radar and summarising the current position.

    The original problem: using "char *" to pass filenames around doesn't work properly on Windows, we need to use Unicode objects.

    The solution: parallel APIs that accept PyObject * rather than char * for the filename parameters.

    The new problem: both Serhiy and I find the *Object() suffix currently used for those "filename as Unicode object instead of C string" parallel APIs to be ambiguous and confusing. However, the problem the parallel APIs solve is real, and reverting or excessively modifying any of the work Victor has already done would be silly.

    That means we're now in a situation where we have to either:

    • accept *Object as the suffix for all of these APIs indefinitely, even though it's ambiguous and confusing
    • choose a new suffix and use that for the APIs already added in 3.4 and add compatibility aliases for the older APIs to make them consistent
    • change the public API additions already made for 3.4 to new private APIs by adding an underscore prefix, and then reconsider the public API naming question for 3.5
    • accept *Object as the suffix for the moment, but aim to replace it with something more descriptive in Python 3.5

    Neither Serhiy nor I are comfortable with the first option, and making a decision in haste for the second option doesn't seem like a good idea. Option 3 seems like far too much work to make things less useful (a capability that works, but has an ambiguous and confusing name, is better than a capability that isn't provided at all)

    That leaves option number 4: don't change anything further now, but revisit it for 3.5, including changing the preferred name of the existing APIs.

    I like that approach, so I'm assigning to myself to take a closer look at how some of the suggestions above read in the docs once 3.4 is out the door.

    @ncoghlan ncoghlan self-assigned this Dec 17, 2013
    @larryhastings
    Copy link
    Contributor

    So all the PyRun_*Object functions are new in 3.4, and none of them are documented yet?

    Option 4 is silly--I don't think we should ship them as public APIs in 3.4 if we're planning to rename them. I prefer the previous options.

    p.s. fwiw I hate "ExName".

    @serhiy-storchaka
    Copy link
    Member

    So all the PyRun_*Object functions are new in 3.4, and none of them are documented yet?

    Not all. Only following functions are new in 3.4:

    Parser/parsetok.c:PyParser_ParseStringObject
    Parser/parsetok.c:PyParser_ParseFileObject
    Python/future.c:PyFuture_FromASTObject
    Python/symtable.c:PySymtable_BuildObject
    Python/compile.c:PyAST_CompileObject
    Python/_warnings.c:PyErr_WarnExplicitObject
    Python/ast.c:PyAST_FromNodeObject
    Python/errors.c:PyErr_SyntaxLocationObject
    Python/errors.c:PyErr_ProgramTextObject
    Python/pythonrun.c:PyRun_InteractiveOneObject
    Python/pythonrun.c:Py_CompileStringObject
    Python/pythonrun.c:Py_SymtableStringObject
    Python/pythonrun.c:PyParser_ASTFromStringObject
    Python/pythonrun.c:PyParser_ASTFromFileObject

    Following functions existed in 3.3:

    Objects/moduleobject.c:PyModule_NewObject
    Objects/moduleobject.c:PyModule_GetNameObject
    Objects/moduleobject.c:PyModule_GetFilenameObject
    Objects/abstract.c:PyObject_CallObject
    Objects/bytesobject.c:PyBytes_FromObject
    Objects/fileobject.c:PyFile_WriteObject
    Objects/memoryobject.c:PyMemoryView_FromObject
    Objects/longobject.c:PyLong_FromUnicodeObject
    Objects/weakrefobject.c:PyWeakref_GetObject
    Objects/exceptions.c:PyUnicodeEncodeError_GetObject
    Objects/exceptions.c:PyUnicodeDecodeError_GetObject
    Objects/exceptions.c:PyUnicodeTranslateError_GetObject
    Objects/unicodeobject.c:PyUnicode_FromObject
    Objects/unicodeobject.c:PyUnicode_FromEncodedObject
    Objects/unicodeobject.c:PyUnicode_AsDecodedObject
    Objects/unicodeobject.c:PyUnicode_AsEncodedObject
    Objects/bytearrayobject.c:PyByteArray_FromObject
    Python/sysmodule.c:PySys_GetObject
    Python/sysmodule.c:PySys_SetObject
    Python/errors.c:PyErr_SetObject
    Python/errors.c:PyErr_SetFromErrnoWithFilenameObject
    Python/import.c:_PyImport_FixupExtensionObject
    Python/import.c:_PyImport_FindExtensionObject
    Python/import.c:PyImport_AddModuleObject
    Python/import.c:PyImport_ExecCodeModuleObject
    Python/import.c:PyImport_ImportFrozenModuleObject
    Python/import.c:PyImport_ImportModuleLevelObject
    Python/modsupport.c:PyModule_AddObject
    Python/pyarena.c:PyArena_AddPyObject

    @larryhastings
    Copy link
    Contributor

    Are all the functions that use "Object" to indicate "Unicode object instead of string" new in 3.4? Of those, how many are undocumented?

    @serhiy-storchaka
    Copy link
    Member

    Are all the functions that use "Object" to indicate "Unicode object instead
    of string" new in 3.4? Of those, how many are undocumented?

    Following 5 functions work with PyObject* filenames and have Object-less
    variants which works with char * filenames:

    Python/errors.c:PyErr_SetFromErrnoWithFilenameObject
    Python/import.c:PyImport_AddModuleObject
    Python/import.c:PyImport_ExecCodeModuleObject
    Python/import.c:PyImport_ImportFrozenModuleObject
    Python/import.c:PyImport_ImportModuleLevelObject

    Private _PyImport_FixupExtensionObject and _PyImport_FindExtensionObject have
    no Object-less variants.

    All other *Object functions are unrelated.

    @larryhastings
    Copy link
    Contributor

    Are those five functions new in 3.4 and undocumented?

    @larryhastings
    Copy link
    Contributor

    Are we proposing renaming any functions that are either
    a) not new in 3.4, or
    b) were documented as of 3.4 beta 1?

    @serhiy-storchaka
    Copy link
    Member

    Are those five functions new in 3.4 and undocumented?

    PyErr_SetFromErrnoWithFilenameObject exists even in 2.7. Other 4
    PyImport_*Object functions all added in 3.3 (see bpo-3080). All 5 functions
    are documented.

    14 new functions were added in 3.4.

    @ncoghlan ncoghlan removed their assignment Jun 28, 2015
    @Drekin
    Copy link
    Mannequin

    Drekin mannequin commented Aug 4, 2015

    I'm not sure this is the right issue. The support for Unicode filenames is not (at least on Windows) ideal.

    Let α.py be a Python script with invalid syntax.

    py α.py
    File "<encoding error>", line 2
    as as compile error
    ^
    SyntaxError: invalid syntax

    On the other hand, if run.py is does something like

    path = sys.argv[1]
    with tokenize.open(path) as f:
        source = f.read()
    code = compile(source, path, "exec")
    exec(code, __main__.__dict__)

    we get

    py run.py α.py
    File "Python Unicode\\u03b1.py", line 2
    as as compile error
    ^
    SyntaxError: invalid syntax

    (or 'File "Python Unicode\α.py", line 2' depending on whether sys.stdout can encode the string).

    So the "<encoding error>" in the first example is unfortunate as it is easy to get better result even by a simple pure Python approach.

    @vstinner vstinner closed this as completed Oct 2, 2015
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants