
Blacklist base64, hex, ... codecs from bytes.decode() and str.encode() #63818

Closed
vstinner opened this issue Nov 16, 2013 · 71 comments
Labels
release-blocker topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@vstinner
Member

BPO 19619
Nosy @malemburg, @doerwalter, @birkenfeld, @ncoghlan, @vstinner, @jwilk, @ezio-melotti, @vadmium, @serhiy-storchaka
Files
  • issue19619_blacklist_proof_of_concept.diff: Simple proof of concept (missing test fixes)
  • issue19619_blacklist_transforms_py34.diff: Implementation using a private attribute on CodecInfo
  • issue19619_blacklist_transforms_py34_postreview.diff: With review comments incorporated
  • issue19619_blacklist_transforms_py34_keyword_only_param.diff: Switched to a keyword only parameter for _is_text_encoding
  • issue19619_blacklist_transforms_py34_refleak_hunting.diff: Refleaks and a new comment in codecs.py
  • issue19619_blacklist_transforms_py33.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2014-03-02.08:22:09.258>
    created_at = <Date 2013-11-16.00:47:09.694>
    labels = ['type-bug', 'expert-unicode', 'release-blocker']
    title = 'Blacklist base64, hex, ... codecs from bytes.decode() and str.encode()'
    updated_at = <Date 2015-05-31.17:21:40.658>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2015-05-31.17:21:40.658>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-03-02.08:22:09.258>
    closer = 'georg.brandl'
    components = ['Unicode']
    creation = <Date 2013-11-16.00:47:09.694>
    creator = 'vstinner'
    dependencies = []
    files = ['32702', '32755', '32772', '32773', '32774', '34213']
    hgrepos = []
    issue_num = 19619
    keywords = ['patch']
    message_count = 71.0
    messages = ['202996', '203016', '203018', '203029', '203036', '203037', '203039', '203041', '203042', '203046', '203068', '203361', '203364', '203377', '203380', '203386', '203391', '203401', '203631', '203655', '203683', '203699', '203706', '203708', '203730', '203737', '203738', '203739', '203740', '203741', '203742', '203744', '203747', '203748', '203749', '203750', '203752', '203755', '203756', '203785', '203786', '203788', '203822', '203841', '203896', '203897', '203927', '203939', '203955', '203956', '206506', '206514', '206515', '206517', '207111', '210216', '210992', '211012', '211021', '211034', '211060', '211432', '212080', '212081', '212084', '212090', '212097', '212191', '212536', '212537', '244548']
    nosy_count = 10.0
    nosy_names = ['lemburg', 'doerwalter', 'georg.brandl', 'ncoghlan', 'vstinner', 'jwilk', 'ezio.melotti', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue19619'
    versions = ['Python 3.3']

    @vstinner
    Member Author

    I propose to add new input_type and output_type attributes to CodecInfo. These attributes would only be defined for the base64, hex, ... codecs, which are not classic encode: str=>bytes, decode: bytes=>str codecs.

    I also propose to modify str.encode() and bytes.decode() to only accept codecs using the right types. If the types don't match, a LookupError is raised.

    This change should prevent the denial of service attack that is possible when a compression codec is used, see:
    https://mail.python.org/pipermail/python-dev/2013-November/130188.html
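
    For context, the amplification is easy to demonstrate through the codecs-module functions (which still accept these codecs): a tiny zlib-compressed payload expands by orders of magnitude on decode.

```python
import codecs
import zlib

# A ~1 MB run of zeros compresses down to roughly 1 KB; feeding the
# compressed payload to the zlib codec re-inflates it.  This asymmetry
# is the amplification behind the reported denial of service.
bomb = zlib.compress(b"\x00" * 1000000)
expanded = codecs.decode(bomb, "zlib_codec")
print(len(bomb), len(expanded))  # small input, huge output
```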

    @ncoghlan
    Contributor

    The full input/output type specifications can't be implemented sensibly without also defining at least a ByteSequence ABC. While I think it's a good idea in the long run, there's no feasible way to design such a system in the time remaining before the Python 3.4 feature freeze.

    However, we could do something much simpler as a blacklist API:

        def is_unicode_codec(name):
            """Returns true if this is the name of a known Unicode text encoding"""
    
        def set_as_non_unicode(name):
            """Indicates that the named codec is not a Unicode codec"""

    And then the codecs module would just maintain a set internally of all the names explicitly flagged as non-unicode.

    Such an API remains useful even if the input/output type support is added in Python 3.5 (since "codecs.is_unicode_codec(name)" is a bit simpler thing to explain than the exact type restrictions).

    Alternatively, implementing just the "encodes_to" and "decodes_to" attributes would be enough for str.encode, bytes.decode and bytearray.decode to reject known bad encodings early, leaving the input type checks to the codecs for now (since it is correctly defining "encode_from" and "decode_from" for many stdlib codecs that would need the ByteSequence ABC).
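
    A minimal pure-Python sketch of that blacklist API (the internal set name and the name normalisation are illustrative, not part of any actual proposal):

```python
# Illustrative sketch only: the codecs module would keep an internal set
# of codec names explicitly flagged as non-Unicode.
_non_unicode_codecs = set()

def _normalize(name):
    # Mirrors the spirit of encodings.normalize_encoding: lowercase,
    # spaces and hyphens collapsed to underscores.
    return name.lower().replace(" ", "_").replace("-", "_")

def set_as_non_unicode(name):
    """Indicates that the named codec is not a Unicode codec"""
    _non_unicode_codecs.add(_normalize(name))

def is_unicode_codec(name):
    """Returns true if this is the name of a known Unicode text encoding"""
    return _normalize(name) not in _non_unicode_codecs
```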

    @serhiy-storchaka
    Member

    I think an internal blacklist of all standard non-Unicode codecs will be enough to prevent the denial of service attack in maintenance releases.

    @serhiy-storchaka serhiy-storchaka added topic-unicode type-bug An unexpected behavior, bug, or error labels Nov 16, 2013
    @malemburg
    Member

    On 16.11.2013 10:16, Nick Coghlan wrote:

    > Nick Coghlan added the comment:
    >
    > The full input/output type specifications can't be implemented sensibly without also defining at least a ByteSequence ABC. While I think it's a good idea in the long run, there's no feasible way to design such a system in the time remaining before the Python 3.4 feature freeze.
    >
    > However, we could do something much simpler as a blacklist API:
    >
    >     def is_unicode_codec(name):
    >         """Returns true if this is the name of a known Unicode text encoding"""
    >
    >     def set_as_non_unicode(name):
    >         """Indicates that the named codec is not a Unicode codec"""
    >
    > And then the codecs module would just maintain a set internally of all the names explicitly flagged as non-unicode.

    That doesn't look flexible enough to cover the various different
    input/output types.

    > Such an API remains useful even if the input/output type support is added in Python 3.5 (since "codecs.is_unicode_codec(name)" is a bit simpler thing to explain than the exact type restrictions).
    >
    > Alternatively, implementing just the "encodes_to" and "decodes_to" attributes would be enough for str.encode, bytes.decode and bytearray.decode to reject known bad encodings early, leaving the input type checks to the codecs for now (since it is correctly defining "encode_from" and "decode_from" for many stdlib codecs that would need the ByteSequence ABC).

    The original idea we discussed some time ago was to add a mapping
    or list attribute to CodecInfo which lists all supported type
    combinations.

    The codecs module could then make this information available through
    a simple type check API (which also caches the lookups for performance
    reasons), e.g.

    codecs.types_supported(encoding, input_type, output_type) -> boolean.

    Returns True/False depending on whether the codec for
    encoding supports the given input and output types.
    

    Usage:

    if not codecs.types_supported(encoding, str, bytes):
        # not a Unicode -> 8-bit codec
        ...

    @ncoghlan
    Contributor

    Note that users can completely blacklist any codec that hasn't been imported yet by preventing imports of that codec definition:

    >>> import sys, encodings
    >>> blocked_codecs = "bz2_codec", "zlib_codec"
    >>> for name in blocked_codecs:
    ...     sys.modules["encodings." + name] = None
    ...     setattr(encodings, name, None)
    ... 
    >>> b"payload".decode("bz2_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    LookupError: unknown encoding: bz2_codec
    >>> b"payload".decode("zlib_codec")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    LookupError: unknown encoding: zlib_codec

    Add in an "encodings._cache.clear()" and you can also block the use of previously used codecs.

    Regardless of what else we do, we should document this so that users know how to do it.

    This means the case we're handling in this issue is just the one where we want to block a codec from the builtin method APIs, while still allowing it in the codecs module APIs.

    @ncoghlan
    Contributor

    Now that I understand Victor's proposal better, I actually agree with it, I just think the attribute names need to be "encodes_to" and "decodes_to".

    With Victor's proposal, *input* validity checks (including type checks) would remain the responsibility of the codec itself. What the new attributes would enable is *output* type checks *without having to perform the encoding or decoding operation first*. codecs will be free to leave these as None to retain the current behaviour of "try it and see".

    The specific field names "input_type" and "output_type" aren't accurate, since the acceptable input types for encoding or decoding are likely to be more permissive than the specific output type for the other operation. Most of the binary codecs, for example, accept any bytes-like object as input, but produce bytes objects as output for both encoding and decoding. For Unicode encodings, encoding is strictly str->bytes, but decoding is generally the more permissive bytes-like object -> str.

    I would still suggest providing the following helper function in the codecs module (the name has changed from my earlier suggestion and I now suggest implementing it in terms of Victor's suggestion with more appropriate field names):

        def is_text_encoding(name):
            """Returns true if the named encoding is a Unicode text encoding"""
            info = codecs.lookup(name)
            return info.encodes_to is bytes and info.decodes_to is str

    This approach covers all the current stdlib codecs:

    • the text encodings encode to bytes and decode to str
    • the binary transforms encode to bytes and also decode to bytes
    • the lone text transform (rot_13) encodes and decodes to str

    This approach also makes it possible for a type inference engine (like mypy) to potentially analyse codec use, and could be expanded in 3.5 to offer type checked binary and text transform APIs that filtered codecs appropriately according to their output types.
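
    That three-way classification is observable in the implementation that eventually shipped in Python 3.4, which records it as a private flag on CodecInfo rather than as the proposed encodes_to/decodes_to attributes (private API, so subject to change):

```python
import codecs

# Python 3.4+: text encodings carry a true _is_text_encoding flag, while
# the binary transforms and the lone text transform do not.
print(codecs.lookup("utf-8")._is_text_encoding)    # text encoding
print(codecs.lookup("base64")._is_text_encoding)   # binary transform
print(codecs.lookup("rot13")._is_text_encoding)    # text transform
```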

    @malemburg
    Member

    On 16.11.2013 13:44, Nick Coghlan wrote:

    > Nick Coghlan added the comment:
    >
    > Now that I understand Victor's proposal better, I actually agree with it, I just think the attribute names need to be "encodes_to" and "decodes_to".
    >
    > With Victor's proposal, *input* validity checks (including type checks) would remain the responsibility of the codec itself. What the new attributes would enable is *output* type checks *without having to perform the encoding or decoding operation first*. codecs will be free to leave these as None to retain the current behaviour of "try it and see".
    >
    > The specific field names "input_type" and "output_type" aren't accurate, since the acceptable input types for encoding or decoding are likely to be more permissive than the specific output type for the other operation. Most of the binary codecs, for example, accept any bytes-like object as input, but produce bytes objects as output for both encoding and decoding. For Unicode encodings, encoding is strictly str->bytes, but decoding is generally the more permissive bytes-like object -> str.
    >
    > I would still suggest providing the following helper function in the codecs module (the name has changed from my earlier suggestion and I now suggest implementing it in terms of Victor's suggestion with more appropriate field names):
    >
    >     def is_text_encoding(name):
    >         """Returns true if the named encoding is a Unicode text encoding"""
    >         info = codecs.lookup(name)
    >         return info.encodes_to is bytes and info.decodes_to is str
    >
    > This approach covers all the current stdlib codecs:
    >
    > • the text encodings encode to bytes and decode to str
    > • the binary transforms encode to bytes and also decode to bytes
    > • the lone text transform (rot_13) encodes and decodes to str
    >
    > This approach also makes it possible for a type inference engine (like mypy) to potentially analyse codec use, and could be expanded in 3.5 to offer type checked binary and text transform APIs that filtered codecs appropriately according to their output types.

    Nick, you are missing an important point: codecs can have any
    number of input/output type combinations, e.g. they may
    convert bytes -> str and str->str (output type depends on
    input type).

    For this reason the simplistic approach with just one type
    conversion will not work. Codecs will have to provide a
    *mapping* of input to output types for each direction
    (encoding and decoding) - either as Python mapping or
    as list of mapping tuples.

    @ncoghlan
    Contributor

    Such codecs can be represented (for 3.4) by simply not setting the attribute and leaving the output types unspecified. We don't need that complexity for the standard library, and the "not specified" escape hatch means complex codecs will be no worse off than they are now.

    The elegance of Victor's proposal is that it doesn't lock us out of solving the more complex cases later (where the codec's output type depends on the input type) by permitting a tuple or dict mapping input types to output types for "encodes_to" and "decodes_to", while still solving all of our immediate problems.

    This is especially relevant since we can't effectively represent codec input types until we have a ByteSequence ABC to cover the "bytes-like object" case, so demanding that the general case be handled immediately is the same as requesting that the feature be postponed completely to Python 3.5.

    @malemburg
    Member

    On 16.11.2013 14:37, Nick Coghlan wrote:

    > Such codecs can be represented (for 3.4) by simply not setting the attribute and leaving the output types unspecified. We don't need that complexity for the standard library, and the "not specified" escape hatch means complex codecs will be no worse off than they are now.
    >
    > The elegance of Victor's proposal is that it doesn't lock us out of solving the more complex cases later (where the codec's output type depends on the input type) by permitting a tuple or dict mapping input types to output types for "encodes_to" and "decodes_to", while still solving all of our immediate problems.
    >
    > This is especially relevant since we can't effectively represent codec input types until we have a ByteSequence ABC to cover the "bytes-like object" case, so demanding that the general case be handled immediately is the same as requesting that the feature be postponed completely to Python 3.5.

    I don't agree.

    The mapping API is not much more complex than
    the single type combination proposal and it could well handle the
    case for which you'd have to add a ByteSequence ABC now to be
    able to define this single type combination using one ABC.

    Rather than adding the ABC now, you could simply add all
    relevant types to the mappings and then replace those mappings
    with an ABC in 3.5.

    BTW: I don't see a need to rush any of this. If more discussion
    is needed, then it's better to have a more complete proposal
    implemented in 3.5 than to try to do patchwork this late in the
    3.4 release process.

    @ncoghlan
    Contributor

    The only reasonable way to accurately represent "anything that exposes a buffer memoryview can read" as a type check is to write an appropriately duck-typed ABC. You can't enumerate all the types that the binary codecs accept as input, because that list of types isn't finite (unlike the output types, which are far more tightly constrained).

    I'd also be fine with Serhiy's suggestion of a private "non Unicode codec" set that is maintained by hand and checked *before* the codec operations in the codec methods - that then just becomes an internal implementation detail to improve the efficiency of the output type checks where we have the additional info needed to save the interpreter some work.

    @malemburg
    Member

    On 16.11.2013 15:52, Nick Coghlan wrote:

    > The only reasonable way to accurately represent "anything that exposes a buffer memoryview can read" as a type check is to write an appropriately duck-typed ABC. You can't enumerate all the types that the binary codecs accept as input, because that list of types isn't finite (unlike the output types, which are far more tightly constrained).

    Theoretically, yes. However, in practice, you'd only be interested
    in a few type combinations (until the ABC is available).

    > I'd also be fine with Serhiy's suggestion of a private "non Unicode codec" set that is maintained by hand and checked *before* the codec operations in the codec methods - that then just becomes an internal implementation detail to improve the efficiency of the output type checks where we have the additional info needed to save the interpreter some work.

    For 3.4 that would also do fine :-)

    @ncoghlan
    Contributor

    Given the time frame, how about we just go with Serhiy's suggestion of a "known non-Unicode codec" internal blacklist for both 3.3 and 3.4?

    I still like the idea of exposing codec type maps for introspection, but designing a decent API for that which also handles type preserving codecs is going to take some work, and can't realistically be included in 3.4.

    By my count, if we delay the blacklisting until after we do the codec lookup, there are only seven names we need to block:

    >>> from codecs import lookup
    >>> blacklist = "base64 uu quopri hex bz2 zlib rot13".split()
    >>> for name in blacklist:
    ...     print(lookup(name).name)
    ... 
    base64
    uu
    quopri
    hex
    bz2
    zlib
    rot-13
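
    With the blacklist in place (as shipped in Python 3.4+), the text-model methods reject these codecs with a LookupError, while the general-purpose codecs-module functions keep working:

```python
import codecs

# The text-model method refuses the transform codec...
try:
    b"payload".decode("base64")
except LookupError as exc:
    print(exc)  # the message points the user at codecs.decode() instead

# ...but the general-purpose API is unaffected.
print(codecs.decode(b"cGF5bG9hZA==", "base64"))
```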

    @malemburg
    Member

    On 19.11.2013 12:38, Nick Coghlan wrote:

    > Given the time frame, how about we just go with Serhiy's suggestion of a "known non-Unicode codec" internal blacklist for both 3.3 and 3.4?

    +1

    @ncoghlan
    Contributor

    Attached is a proof of concept for the blacklist approach (for 3.4, but without the fixes needed for the transform codec handling tests in test_codecs)

    This does have the potential to add a reasonable amount of additional overhead to encoding and decoding for short strings. Since it isn't obvious where to store a set for faster checking against the blacklist, it may be worth benchmarking this naive approach before doing something more complicated.

    Regardless, I don't plan to take this further myself any time soon - I just wanted to give it a firm nudge in the direction of the blacklist approach by providing a proof of concept.

    @vstinner
    Member Author

    + /* A set would be faster, but when to build it, where to store it? */
    + if (_PyUnicode_CompareWithId(codec_name, &PyId_base64) == 0 ||
    +     _PyUnicode_CompareWithId(codec_name, &PyId_uu) == 0 ||
    +     _PyUnicode_CompareWithId(codec_name, &PyId_quopri) == 0 ||
    +     _PyUnicode_CompareWithId(codec_name, &PyId_hex) == 0 ||
    +     _PyUnicode_CompareWithId(codec_name, &PyId_bz2) == 0 ||
    +     _PyUnicode_CompareWithId(codec_name, &PyId_zlib) == 0 ||
    +     PyUnicode_CompareWithASCIIString(codec_name, "rot-13") == 0)
    + {
    +     is_text_codec = 0;
    + }

    This is slow and not future proof. It would be faster and simpler to have two registries: a registry only for bytes.decode()/str.encode() and another for "custom codecs" used by codecs.encode/decode (or (bytes|str).transform()/untransform()).

    So "abc".encode("rot13") would simply fail with a LookupError.

    @ncoghlan
    Contributor

    Future proofing is irrelevant at this point - this is just about what
    can realistically be implemented in 3.4, not what can be implemented
    with the luxury of several months to rearchitect the codec system (and
    if we were going to do that, we'd just fix the type mapping
    introspection problem).

    There is no "codec registry" - there is only the default codec search
    function, the encodings import namespace, the normalisation algorithm
    and the alias dictionary.

    It sounds to me like you still believe it is possible to stick the
    genie back in the bottle and limit the codec system to what *you*
    think is a good idea. It doesn't work like that - the codecs module
    already provides a fully general data transformation system backed by
    lazy imports, and that isn't going to change due to backwards
    compatibility constraints. The only option we have is whether or not
    we file off the rough edges and try to ease the transition for users
    migrating from Python 2, where all of the standard library codecs fit
    within the limits of the text model, so the general purpose codec
    infrastructure almost never came into play. Getting rid of it is no
    longer a realistic option - documenting it, improving the failure
    modes and potentially adding some features (in Python 3.5+) are the
    only improvements that are genuinely feasible.

    @serhiy-storchaka
    Member

    Blacklisting by name is slow, and it prevents a user from defining a codec with a blacklisted name.

    What if we just add a private attribute ("_not_text"?) to unsafe codecs? If a codec has this attribute, then it should not be used in text encoding/decoding. Checking an attribute is much faster than comparing with a couple of strings.

    Another possibility is inheriting all unsafe codecs from a special class.

    @ncoghlan
    Contributor

    Yes, a private attribute on CodecInfo is probably better - the rest of the
    patch would stay the same, it would just check for that attribute instead
    of particular names.

    @ncoghlan ncoghlan self-assigned this Nov 21, 2013
    @ncoghlan
    Contributor

    New patch for 3.4 that uses a private attribute on CodecInfo and a private class method to set it appropriately (as I believe that is a better approach than changing the signature of CodecInfo.__init__ at this point, especially if we end up pursuing the codec type map idea in 3.5)

    This version also updates the tests to check for the appropriate error messages.

    The integration into the text model related methods is the same as in the proof of concept: a parallel private text-encoding-only C API that is used in preference to the general purpose codec machinery where appropriate.

    If there aren't any objections to this approach, I'll commit this one tomorrow.

    @serhiy-storchaka
    Member

    Why

        return codecs.CodecInfo._declare_transform(
            name='base64',
            encode=base64_encode,
            decode=base64_decode,
            incrementalencoder=IncrementalEncoder,
            incrementaldecoder=IncrementalDecoder,
            streamwriter=StreamWriter,
            streamreader=StreamReader,
        )

    instead of

        codec = codecs.CodecInfo(
            name='base64',
            encode=base64_encode,
            decode=base64_decode,
            incrementalencoder=IncrementalEncoder,
            incrementaldecoder=IncrementalDecoder,
            streamwriter=StreamWriter,
            streamreader=StreamReader,
        )
        codec._is_text_encoding = False
        return codec

    ?

    I have added other minor comments on Rietveld.

    @ncoghlan
    Contributor

    I used the private class method to minimise the per-codec impact (1
    modified/added line per codec rather than 3).

    Your other suggestions look good, so I'll apply those before committing.

    @vstinner
    Member Author

    > return codecs.CodecInfo._declare_transform()

    I also prefer the private attribute option.

    @ncoghlan
    Contributor

    Victor pointed out this should now raise LookupError rather than TypeError.

    However, I'm not going to duplicate the manipulation of the private
    attribute across seven different codecs when a private alternate
    constructor solves that problem far more cleanly.

    @vstinner
    Member Author

    > There is no "codec registry" - there is only the default codec search
    > function, the encodings import namespace, the normalisation algorithm
    > and the alias dictionary.

    interp->codec_search_cache can be seen as the "registry". If you store codecs in two different registries depending on a property, attribute, whatever, you keep O(1) complexity (no extra strcmp or attribute check at each lookup). The overhead is only incurred when you load a codec for the first time.

    It should not be so hard to add a second dictionary.

    You don't need to touch all parts of the codecs machinery, only interp->codec_search_cache.

    It would not be possible to have the same name in both registries. So codecs.lookup() would still return any kind of codec; it would just look in two dictionaries instead of one. So codecs.encode/decode would be unchanged too (if you want to keep these functions ;-)).

    Only bytes.decode/str.encode would be modified to only lookup in the "text codecs" only registry.

    Yet another option: add a new dictionary, but leave interp->codec_search_cache unchanged. Text codecs would also be registered twice: once in interp->codec_search_cache, once in the second dictionary. So bytes.decode/str.encode would only lookup in the text codecs dictionary, instead of interp->codec_search_cache. That's all ;-)

    > Victor pointed out this should now raise LookupError rather than TypeError.

    If you accept raising a LookupError, the "two registries" option becomes more appealing, doesn't it?

    @ncoghlan
    Contributor

    Victor, you can propose whatever you like for 3.5, but I'm not adding new
    interpreter state two days before feature freeze when we don't have to.
    Looking up the private CodecInfo attribute is still O(1) anyway.

    @ncoghlan
    Contributor

    • switched to LookupError for the dedicated exception
    • default value moved to a CodecInfo class attribute
    • new private APIs guarded by PY_LIMITED_API
    • used repr formatting where appropriate in the tests
    • cleaned up the tests a bit by using encodings.normalize_encoding
    • new test to ensure the codec output type checks are still exercised
    • backwards compatibility tweaks for raw tuples returned from the codec registry lookup (uncovered by the full test suite run)

    I'll be committing this version after a final local run of "make test" and a refleak check on test_codecs, test_charmapcodec and test_unicode (the latter two are the ones that found the backwards compatibility issue with the attribute lookup).

    @malemburg
    Member

    Nick: I've had a look at your second patch. A couple of notes:

    • I think you should add the flag to the constructor of the CodecInfo
      tuple subclass and then set it in the respective codecs. The extra
      constructor class method looks too much like a hack and is
      not needed.

    • The comment in codecs.h should read:

      """
      Checks the encoding against a list of codecs which do not
      implement a str<->bytes encoding before attempting the
      operation.

      Please note that these APIs are internal and should not
      be used in Python C extensions.
      """

    Regarding Victor's suggestion to use a separate registry dict
    for this: I'm definitely -1 on this.

    The black listing is a very special case only used for the
    .encode()/.decode() methods and otherwise doesn't have anything to do
    with the codecs sub-system. It doesn't make sense to change the design
    of the registry just to implement this one special case.

    @vstinner
    Member Author

    With transform codecs blacklisted, I'm fine with the idea of restoring the codec aliases for transform codecs in Python 3.4. Go ahead Nick.

    --

    For Python 3.5, a better solution should be found to declare transform codecs.

    And I would also like to add transform()/untransform() methods on the bytes and str types. So you would have 4 APIs:

    • Unicode text codecs: str.encode/str.decode, str=>bytes
    • bytes transform codecs: bytes.transform/untransform, bytes-like object=>bytes
    • Unicode transform codecs: str.transform/untransform, str=>str
    • all codecs: codecs.encode/codecs.decode, something=>something else

    But only a few developers (only me?) are interested in transform/untransform, so codecs.encode/codecs.decode might be enough.
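
    The codecs-module functions mentioned above already cover both transform flavours today:

```python
import codecs

# Binary transform: bytes-like object -> bytes in both directions.
data = codecs.encode(b"hello", "base64")
assert codecs.decode(data, "base64") == b"hello"

# Text transform: str -> str in both directions.
assert codecs.encode("hello", "rot13") == "uryyb"
assert codecs.decode("uryyb", "rot13") == "hello"
```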

    @ncoghlan
    Contributor

    Yay, thanks Victor!

    Regarding UTF-8 et al, the existing shortcuts in unicodeobject.c already
    bypass the full codec machinery, and that includes the exception wrapping
    on failures.

    @vstinner
    Member Author

    > Regarding UTF-8 et al, the existing shortcuts in unicodeobject.c already
    > bypass the full codec machinery, and that includes the exception wrapping
    > on failures.

    There are still platforms using locale encodings other than ascii,
    latin1 or utf8. For example, Windows never uses these encodings for
    the ANSI or OEM code page. The ANSI code page is used as the Python
    filesystem encoding, which is used in a lot of places. The OEM code
    page is used for the stdio streams (stdin, stdout, stderr). There are
    some users using locales with the latin9 encoding.

    I proposed to remove the new code for exception handling to simplify
    the code (because the error should not occur anymore).

    @ncoghlan
    Contributor

    Just noting the exact list of codecs that currently bypass the full codec machinery and go direct to the C implementation by normalising the codec name (which includes forcing to lowercase) and then using strcmp to check against a specific set of known encodings.

    In PyUnicode_Decode (and hence bytes.decode and bytearray.decode):

    utf-8
    utf8
    latin-1
    latin1
    iso-8859-1
    iso8859-1
    mbcs (Windows only)
    ascii
    utf-16
    utf-32

    In PyUnicode_AsEncodedString (and hence str.encode), the list is mostly the same, but utf-16 and utf-32 are not accelerated (i.e. they're currently still looked up through the codec machinery).

    It may be worth opening a separate issue to restore the consistency between the lists by adding utf-16 and utf-32 to the fast path for encoding as well.

    As far as the wrapping mechanism from issue bpo-17828 itself goes:

    • it only triggers if PyEval_CallObject on the encoder or decoder returns NULL
    • stateful exceptions (which includes UnicodeEncodeError and UnicodeDecodeError) and those with custom __init__ or __new__ implementations don't get wrapped
    • the actual wrapping process is just the C equivalent of "raise type(exc)(new_msg) from exc", plus the initial checks to determine if the current exception can be wrapped safely
    • it applies to the *general purpose* codec machinery, not just to the text model related convenience methods

    @ncoghlan
    Copy link
    Contributor

    The examples from the 3.4 What's New may make it clearer why the exception wrapping in the codec machinery is much broader in scope than just detecting output type errors (even though handling that case was the original motivation for the idea):

        >>> import codecs
        >>> codecs.decode(b"abcdefgh", "hex")
        binascii.Error: Non-hexadecimal digit found
    
        The above exception was the direct cause of the following exception:
    
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        binascii.Error: decoding with 'hex' codec failed (Error: Non-hexadecimal digit found)
    
        >>> codecs.encode("hello", "bz2")
        TypeError: 'str' does not support the buffer interface
    
        The above exception was the direct cause of the following exception:
    
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: encoding with 'bz2' codec failed (TypeError: 'str' does not support the buffer interface)

    @ncoghlan
    Copy link
    Contributor

    Unassigning this one - I don't think the solution we used for 3.4 is appropriate in a maintenance release, but I'm not sure how else to resolve it.

    @ncoghlan ncoghlan removed their assignment Dec 18, 2013
    @serhiy-storchaka
    Copy link
    Member

    What about my comments in msg203841?

    See http://comments.gmane.org/gmane.comp.python.devel/143943 for example how hard to get rid of private arguments in public functions.

    @ncoghlan
    Copy link
    Contributor

    I was planning to fix pydoc to not show private keyword-only arguments.

    @serhiy-storchaka
    Copy link
    Member

    If people don't pay attention to an explicit warning not to use certain
    parameters, will the lack of documentation stop them?
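
    For reference, the marker that ended up in 3.4 is observable through codecs.lookup() (a private attribute, shown here only to illustrate the mechanism):

```python
import codecs

# Text codecs are flagged; binary transforms like hex are not (3.4+).
assert codecs.lookup("utf-8")._is_text_encoding
assert not codecs.lookup("hex_codec")._is_text_encoding

# The convenience methods refuse blacklisted codecs...
try:
    b"616263".decode("hex")
except LookupError as exc:
    print(exc)  # 'hex' is not a text encoding; ...

# ...while the general purpose codecs API still supports them.
assert codecs.decode(b"616263", "hex") == b"abc"
```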

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Dec 30, 2013

    New changeset 0e10367c88ce by Zachary Ware in branch 'default':
    bpo-19619: skip zlib error test when zlib not available
    http://hg.python.org/cpython/rev/0e10367c88ce

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 4, 2014

    New changeset f3ec00d2b75e by Nick Coghlan in branch 'default':
    Close bpo-20404: blacklist non-text encodings in io.TextIOWrapper
    http://hg.python.org/cpython/rev/f3ec00d2b75e

    @vstinner
    Copy link
    Member Author

    Is there still something to do? The initial issue looks to be fixed.

    Feel free to open a new issue if you see more things to do.

    @serhiy-storchaka
    Copy link
    Member

    It isn't fixed in 3.3 yet.

    @ncoghlan
    Copy link
    Contributor

    As Serhiy noted, 3.3 is still affected. On the other hand, the approach I
    used for 3.4 is a pretty invasive fix, so I'm not sure it's a good idea to
    implement something like that in a maintenance release.

    @vstinner
    Copy link
    Member Author

    It isn't fixed in 3.3 yet.

    I'm not sure that the issue should be fixed in 3.3 because the patch is quite large. Do you consider that the bug is important enough?

    @serhiy-storchaka
    Copy link
    Member

    Is a bug which makes DoS attacks easy important enough? Every Python HTTP server, mail client or news client is affected.
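
    To illustrate the amplification involved (a sketch using the bz2 transform through the general purpose codecs API; any decompressing codec reachable via an attacker-supplied encoding name has the same property):

```python
import codecs

# A tiny payload that expands to 10 MB when decoded: if a server blindly
# passed an attacker-supplied "charset" to bytes.decode(), this kind of
# decompression bomb was one easy way to exhaust its memory.
bomb = codecs.encode(b"\0" * 10_000_000, "bz2_codec")
print(len(bomb))  # well under 1 KB
assert len(codecs.decode(bomb, "bz2_codec")) == 10_000_000
```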

    @serhiy-storchaka
    Copy link
    Member

    I don't think that adding an underscored parameter to a public API is the best solution, but we need the fix for 3.3. So here is a patch backporting d68df99d7a57 to 3.3.

    @serhiy-storchaka
    Copy link
    Member

    Sorry, I missed the patch.

    @birkenfeld
    Copy link
    Member

    That looks ok to me.

    @ncoghlan
    Copy link
    Contributor

    Backporting just the encode/decode changes sounds reasonable to me, since open() isn't likely to be a DoS vector in this particular case.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 24, 2014

    New changeset 95386bbf9471 by Serhiy Storchaka in branch '3.3':
    Issue bpo-19619: Blacklist non-text codecs in method API
    http://hg.python.org/cpython/rev/95386bbf9471

    @serhiy-storchaka
    Copy link
    Member

    I think that bpo-20404 should be backported too. It is common to wrap a socket stream with io.TextIOWrapper, and socket.makefile() returns a text stream by default.
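
    With that fix applied (behaviour in 3.4+), a non-text encoding is rejected as soon as the wrapper is constructed:

```python
import io

# Constructing a TextIOWrapper with a binary transform fails up front.
try:
    io.TextIOWrapper(io.BytesIO(b"616263"), encoding="hex")
except LookupError as exc:
    print(exc)  # 'hex' is not a text encoding; ...

# Real text encodings are unaffected.
wrapper = io.TextIOWrapper(io.BytesIO("héllo".encode("utf-8")),
                           encoding="utf-8")
assert wrapper.read() == "héllo"
```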

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Feb 25, 2014

    New changeset 9975f827eefd by Serhiy Storchaka in branch '3.3':
    Fix typo (issue bpo-19619).
    http://hg.python.org/cpython/rev/9975f827eefd

    @birkenfeld
    Copy link
    Member

    This is fixed now, right?

    @birkenfeld
    Copy link
    Member

    Both this backport and bpo-20404 will make it into 3.3.5rc2.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 31, 2015

    New changeset cf6e782a7f94 by Serhiy Storchaka in branch '2.7':
    Issue bpo-19543: Emit deprecation warning for known non-text encodings.
    https://hg.python.org/cpython/rev/cf6e782a7f94

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022