Message 332319 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	djhoulihan
Recipients	djhoulihan
Date	2018-12-22.02:49:42
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1545446982.53.0.98272194251.issue35557@roundup.psfhosted.org>
In-reply-to

Content
Currently, the `base64` method `b16decode` does not decode a hexadecimal string with lowercase characters by default. To do so requires passing `casefold=True` as the second argument. I propose a change to the `b16decode` method to allow it to accept hexadecimal strings containing lowercase characters without requiring the `casefold` argument. The revision itself is straightforward. We simply have to amend the regular expression to match the lowercase characters a-f in addition to A-F. Likewise the corresponding tests in Lib/base64.py also need to be changed to account for the lack of a second argument. Therefore there are two files total which need to be refactored. In my view, there are several compelling reasons for this change: 1. There is a nontrivial performance improvement. I've already made the changes on my own test branch[1] and I see a mean decoding performance improvement of approximately 9.4% (tested by taking the average of 1,000,000 hexadecimal string encodings). The testing details are included in a file attached to this issue. 2. Hexadecimal strings are case insensitive, i.e. 8DEF is equivalent to 8def. This is the particularly motivating reason why I've written the patch - there have been many times when I've been momentarily confounded by a hexadecimal string that won't decode only to realize I'm yet again passing in a lowercase string. 3. The behavior of the underlying method in `binascii`, `unhexlify`, accepts both uppercase and lowercase characters by default without requiring a second parameter. From the perspective of code hygiene and language consistency, I think it's both more practical and more elegant for the language to behave in the same, predictable manner (particularly because `base64.py` simply calls `binascii.c` under the hood). Additionally, the `binascii` method `hexlify` actually outputs strings in lowercase encoding, meaning that any use of both `binascii` and `base64` in the same application will have to consistently do a `casefold` conversion if output from `binascii.hexlify` is fed back as input to `base64.b16decode` for some reason. There are two arguments against this patch, as far as I can see it: 1. In the relevant IETF reference documentation (RFC3548[2], referenced directly in the `b16decode` docstring; and RFC4648[3] with supersedes it), under Security Considerations the author Simon Josefsson claims that there exists a potential side channel security issue intrinsic to accepting case insensitive hexadecimal strings in a decoding function. While I'm not dismissing this out of hand, I personally do not find the claimed vulnerability compelling, and Josefsson does not clarify a real world attack scenario or threat model. I think it's important we challenge this assumption in light of the potential nontrivial improvements to both language consistency and performance. I would be very interested in hearing a real threat model here that would practically exist outside of a very contrived scenario. Moreover if this is such a security issue, why is the behavior already evident in `binascii.unhexlify`? 2. The other reason may be that there's simply no reason to make such a change. An argument can be put forward that a developer won't frequently have to deal with this issue because the opposite method, `b16encode`, produces hexadecimal strings with uppercase characters. However, in my experience hexadecimal strings with lowercase characters are extremely common in situations where developers haven't produced the strings themselves in the language. As I mentioned, I have already written the changes on my own patch branch. I'll open a pull request once this issue has been created and reference the issue in the pull request on GitHub. References: 1. https://github.com/djhoulihan/cpython/tree/base64_case_sensitivity 2. https://tools.ietf.org/html/rfc3548 3. https://tools.ietf.org/html/rfc4648

Currently, the `base64` method `b16decode` does not decode a hexadecimal string with lowercase characters by default. To do so requires passing `casefold=True` as the second argument. I propose a change to the `b16decode` method to allow it to accept hexadecimal strings containing lowercase characters without requiring the `casefold` argument.

The revision itself is straightforward. We simply have to amend the regular expression to match the lowercase characters a-f in addition to A-F. Likewise the corresponding tests in Lib/base64.py also need to be changed to account for the lack of a second argument. Therefore there are two files total which need to be refactored.

In my view, there are several compelling reasons for this change:

1. There is a nontrivial performance improvement. I've already made the changes on my own test branch[1] and I see a mean decoding performance improvement of approximately 9.4% (tested by taking the average of 1,000,000 hexadecimal string encodings). The testing details are included in a file attached to this issue.

2. Hexadecimal strings are case insensitive, i.e. 8DEF is equivalent to 8def. This is the particularly motivating reason why I've written the patch - there have been many times when I've been momentarily confounded by a hexadecimal string that won't decode only to realize I'm yet again passing in a lowercase string.

3. The behavior of the underlying method in `binascii`, `unhexlify`, accepts both uppercase and lowercase characters by default without requiring a second parameter. From the perspective of code hygiene and language consistency, I think it's both more practical and more elegant for the language to behave in the same, predictable manner (particularly because `base64.py` simply calls `binascii.c` under the hood). Additionally, the `binascii` method `hexlify` actually outputs strings in lowercase encoding, meaning that any use of both `binascii` and `base64` in the same application will have to consistently do a `casefold` conversion if output from `binascii.hexlify` is fed back as input to `base64.b16decode` for some reason.

There are two arguments against this patch, as far as I can see it:

1. In the relevant IETF reference documentation (RFC3548[2], referenced directly in the `b16decode` docstring; and RFC4648[3] with supersedes it), under Security Considerations the author Simon Josefsson claims that there exists a potential side channel security issue intrinsic to accepting case insensitive hexadecimal strings in a decoding function. While I'm not dismissing this out of hand, I personally do not find the claimed vulnerability compelling, and Josefsson does not clarify a real world attack scenario or threat model. I think it's important we challenge this assumption in light of the potential nontrivial improvements to both language consistency and performance. I would be very interested in hearing a real threat model here that would practically exist outside of a very contrived scenario. Moreover if this is such a security issue, why is the behavior already evident in `binascii.unhexlify`?

2. The other reason may be that there's simply no reason to make such a change. An argument can be put forward that a developer won't frequently have to deal with this issue because the opposite method, `b16encode`, produces hexadecimal strings with uppercase characters. However, in my experience hexadecimal strings with lowercase characters are extremely common in situations where developers haven't produced the strings themselves in the language.

As I mentioned, I have already written the changes on my own patch branch. I'll open a pull request once this issue has been created and reference the issue in the pull request on GitHub.

References:

1. https://github.com/djhoulihan/cpython/tree/base64_case_sensitivity

2. https://tools.ietf.org/html/rfc3548

3. https://tools.ietf.org/html/rfc4648

History
Date	User	Action	Args
2018-12-22 02:49:44	djhoulihan	set	recipients: + djhoulihan
2018-12-22 02:49:42	djhoulihan	set	messageid: <1545446982.53.0.98272194251.issue35557@roundup.psfhosted.org>
2018-12-22 02:49:42	djhoulihan	link	issue35557 messages
2018-12-22 02:49:42	djhoulihan	create