classification
Title: Allow lowercase hexadecimal characters in base64.b16decode()
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: barry, djhoulihan, serhiy.storchaka, xtreak
Priority: normal Keywords: patch

Created on 2018-12-22 02:49 by djhoulihan, last changed 2018-12-23 00:33 by djhoulihan.

Files
File name Uploaded Description Edit
testing_data.txt djhoulihan, 2018-12-22 02:49 Testing functions used to demonstrate a ~9.4% improvement in hexadecimal decoding performance.
Pull Requests
URL Status Linked Edit
PR 11285 closed djhoulihan, 2018-12-22 03:01
Messages (4)
msg332319 - (view) Author: Dylan Houlihan (djhoulihan) * Date: 2018-12-22 02:49
Currently, the `base64` method `b16decode` does not decode a hexadecimal string with lowercase characters by default. To do so requires passing `casefold=True` as the second argument. I propose a change to the `b16decode` method to allow it to accept hexadecimal strings containing lowercase characters without requiring the `casefold` argument.

The revision itself is straightforward. We simply have to amend the regular expression to match the lowercase characters a-f in addition to A-F. Likewise the corresponding tests in Lib/base64.py also need to be changed to account for the lack of a second argument. Therefore there are two files total which need to be refactored.

In my view, there are several compelling reasons for this change:

1. There is a nontrivial performance improvement. I've already made the changes on my own test branch[1] and I see a mean decoding performance improvement of approximately 9.4% (tested by taking the average of 1,000,000 hexadecimal string encodings). The testing details are included in a file attached to this issue.

2. Hexadecimal strings are case insensitive, i.e. 8DEF is equivalent to 8def. This is the particularly motivating reason why I've written the patch - there have been many times when I've been momentarily confounded by a hexadecimal string that won't decode only to realize I'm yet again passing in a lowercase string.

3. The behavior of the underlying method in `binascii`, `unhexlify`, accepts both uppercase and lowercase characters by default without requiring a second parameter. From the perspective of code hygiene and language consistency, I think it's both more practical and more elegant for the language to behave in the same, predictable manner (particularly because `base64.py` simply calls `binascii.c` under the hood). Additionally, the `binascii` method `hexlify` actually outputs strings in lowercase encoding, meaning that any use of both `binascii` and `base64` in the same application will have to consistently do a `casefold` conversion if output from `binascii.hexlify` is fed back as input to `base64.b16decode` for some reason.

There are two arguments against this patch, as far as I can see it:

1. In the relevant IETF reference documentation (RFC3548[2], referenced directly in the `b16decode` docstring; and RFC4648[3] with supersedes it), under Security Considerations the author Simon Josefsson claims that there exists a potential side channel security issue intrinsic to accepting case insensitive hexadecimal strings in a decoding function. While I'm not dismissing this out of hand, I personally do not find the claimed vulnerability compelling, and Josefsson does not clarify a real world attack scenario or threat model. I think it's important we challenge this assumption in light of the potential nontrivial improvements to both language consistency and performance. I would be very interested in hearing a real threat model here that would practically exist outside of a very contrived scenario. Moreover if this is such a security issue, why is the behavior already evident in `binascii.unhexlify`?

2. The other reason may be that there's simply no reason to make such a change. An argument can be put forward that a developer won't frequently have to deal with this issue because the opposite method, `b16encode`, produces hexadecimal strings with uppercase characters. However, in my experience hexadecimal strings with lowercase characters are extremely common in situations where developers haven't produced the strings themselves in the language.

As I mentioned, I have already written the changes on my own patch branch. I'll open a pull request once this issue has been created and reference the issue in the pull request on GitHub.

References:

1. https://github.com/djhoulihan/cpython/tree/base64_case_sensitivity

2. https://tools.ietf.org/html/rfc3548

3. https://tools.ietf.org/html/rfc4648
msg332326 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-12-22 06:31
Thanks for the report. A couple of points as below : 

* This changes the interface of the function by removing a parameter. Thus it will break compatibility with Python 2 and also earlier versions of Python 3. Removing a parameter in the signature has to go through a deprecation cycle if this is going to be accepted.
* Please don't use time.time and mean for benchmarks that can be misleading. There are modules like timeit and perf (https://pypi.org/project/perf/) that are more reliable.

I looked for some more inefficiencies and I can see re.search for every run. Perhaps re.compile can be used to store the compiled regex at module level and then to match against the string. This makes the function 25% faster without changing the interface. In case casefold=False then an extra call to make the string upper case is avoided giving some more benefit.

With re.search inside the function

$ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca"' 'base64.b16decode(hex_data, casefold=True)'
.....................
Mean +- std dev: 3.08 us +- 0.22 us
$ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca".upper()' 'base64.b16decode(hex_data)'
.....................
Mean +- std dev: 2.93 us +- 0.20 us

With the regex compiled to a variable at the module level

$ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca"' 'base64.b16decode(hex_data, casefold=True)'
.....................
Mean +- std dev: 2.08 us +- 0.15 us
$ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca".upper()' 'base64.b16decode(hex_data)'
.....................
Mean +- std dev: 1.98 us +- 0.17 us


Since this is a comparison of fixed set of elements I tried using a set of elements and any to short-circuit but it seems to be slower

$ python3.7 -m perf timeit -s 'import base64; hex_data="806903d098eb50957b1b376385f233bb3a5d54f54191c8536aefee21fc9ba3ca"' 'base64.b16decode(hex_data, casefold=True)'
.....................
Mean +- std dev: 8.21 us +- 0.66 us


I am opening a PR to use the compiled regex at the module level since I see it as a net win of 25-30% without any interface change or test case changes required.
msg332329 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-12-22 07:13
This is compatibility breaking change. Furthermore, it changes the purposed behavior that was from the initial version (4c904d1bf71d01bc22fbb123493f975050560e9c).

Since currently there is an option which allows to accept lowercase hexadecimal characters, I do not see a need in this change. You can also use bytes.fromhex().
msg332377 - (view) Author: Dylan Houlihan (djhoulihan) * Date: 2018-12-23 00:33
Karthikeyan,

Thank you for taking the time to respond so thoroughly. In particular, in the future I'll be more careful to qualify and quantify potential performance improvements using `timeit` or `perf`.

That being said, as I mentioned the primary motivation for this is not a performance improvement - I just felt that was a nice potential side effect. Rather, this enhancement brings `base64.b16decode` into behavioral consistency with `binascii.unhexlify`. The `binascii` method already accepts both uppercase and lowercase hexadecimal characters by default.

However I can definitely understand what you and Serhiy are saying about this being a breaking change. Therefore I'd like to amend my proposal to the following:

1. Keep the `casefold` argument and corresponding logic, but keep the revised regex that will match against both uppercase and lowercase characters; and

2. Only put this change in for Python 3.8, this way existing code which uses the explicit argument on versions <= 3.7 does not break (but will still function normally).

I've altered this issue to reflect my amended proposal, targeting only version 3.8 and editing the type to be behavior instead of performance. In this way, the change will still make `base64.b16decode` consistent with `binascii.unhexlify` (and the case insensitivity of hexadecimal encoding more generally) without breaking existing code or requiring developers to change workflows using `casefold`.

Naturally there would be additional logic that *enforces* the case sensitivity if `casefold=false` is passed to the method, and this will likewise not break existing workflows either. 

If this change is considered agreeable, I will amend my open pull request to roll back the breaking change and refactor the way `casefold` is processed. From my perspective this amended proposal offers an upside in language consistency without any downside.
History
Date User Action Args
2018-12-23 00:33:11djhoulihansetmessages: + msg332377
2018-12-23 00:21:31djhoulihansettype: performance -> behavior
2018-12-23 00:19:30djhoulihansetversions: - Python 3.4, Python 3.5, Python 3.6, Python 3.7
2018-12-22 07:13:42serhiy.storchakasetnosy: + barry, serhiy.storchaka
messages: + msg332329
2018-12-22 06:31:51xtreaksetnosy: + xtreak
messages: + msg332326
2018-12-22 03:01:23djhoulihansetkeywords: + patch
stage: patch review
pull_requests: + pull_request10512
2018-12-22 02:49:42djhoulihancreate