Issue 34832: "Short circuiting" in base64's b64decode, decode, decodebytes


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/79013

classification

Title:	"Short circuiting" in base64's b64decode, decode, decodebytes
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	fbidu, pw.michael.harris
Priority:	normal	Keywords:

Created on 2018-09-28 12:20 by pw.michael.harris, last changed 2022-04-11 14:59 by admin.

Messages (4)
msg326630 - (view)	Author: Michael Harris (pw.michael.harris)	Date: 2018-09-28 12:20
When given an invalid base64 string that starts with a valid base64 substring, the functions return the decoded bytes only up to the end of that substring rather than ignoring the non-alphabet character.

Examples:

>>> base64.b64decode("AAAAAAAA")
b'\x00\x00\x00\x00\x00\x00'
>>> base64.b64decode("AA=AAAAAA")
b'\x00\x00\x00\x00\x00\x00'
>>> base64.b64decode("AAA=AAAAA")
b'\x00\x00'
msg326661 - (view)	Author: Felipe Rodrigues (fbidu) *	Date: 2018-09-28 23:37
I am not sure that simply ignoring the invalid character is the best way to go. It feels like silencing errors. b64decode does accept a 'validate' flag, which defaults to False, that halts execution and raises an error on invalid input. What might be a good idea is to implement an 'errors' argument that accepts 'ignore' as a value, like we do for bytes.decode (https://docs.python.org/3/library/stdtypes.html#bytes.decode)
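To illustrate the 'validate' flag mentioned above, here is a minimal sketch that runs the reporter's three inputs through b64decode with validate=True. The helper name strict_decode is made up for this example and is not part of the base64 module.

```python
import base64
import binascii


def strict_decode(data):
    """Hypothetical helper: decode with validate=True, return None on rejection."""
    try:
        return base64.b64decode(data, validate=True)
    except binascii.Error:
        # validate=True rejects input with characters outside the alphabet
        # or with a pad character that is not at the end.
        return None


print(strict_decode("AAAAAAAA"))   # well-formed input decodes normally
print(strict_decode("AA=AAAAAA"))  # '=' in the middle: rejected, prints None
print(strict_decode("AAA=AAAAA"))  # '=' in the middle: rejected, prints None
```

With validate=True the two malformed inputs are rejected outright instead of being partially decoded.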
msg326662 - (view)	Author: Felipe Rodrigues (fbidu) *	Date: 2018-09-28 23:49
Actually, I'm not even sure if it makes sense to decode the 'first valid substring'... IMHO, we should warn the user
msg326680 - (view)	Author: Felipe Rodrigues (fbidu) *	Date: 2018-09-29 16:05
For reference in future discussions, Python's base64 module implements RFC 3548 (https://tools.ietf.org/html/rfc3548), whose section 2.3 (https://tools.ietf.org/html/rfc3548#section-2.3) discusses the "Interpretation of non-alphabet characters in encoded data". The section's content is:

Base encodings use a specific, reduced, alphabet to encode binary data. Non alphabet characters could exist within base encoded data, caused by data corruption or by design. Non alphabet characters may be exploited as a "covert channel", where non-protocol data can be sent for nefarious purposes. Non alphabet characters might also be sent in order to exploit implementation errors leading to, e.g., buffer overflow attacks.

Implementations MUST reject the encoding if it contains characters outside the base alphabet when interpreting base encoded data, unless the specification referring to this document explicitly states otherwise. Such specifications may, as MIME does, instead state that characters outside the base encoding alphabet should simply be ignored when interpreting data ("be liberal in what you accept"). Note that this means that any CRLF constitute "non alphabet characters" and are ignored. Furthermore, such specifications may consider the pad character, "=", as not part of the base alphabet until the end of the string. If more than the allowed number of pad characters are found at the end of the string, e.g., a base 64 string terminated with "===", the excess pad characters could be ignored.

In my opinion, the RFC is rather permissive about strange characters in the encoded data. It refers to the MIME specification, which ignores such characters, and hints at the possibility of rejecting the pad symbol '=' unless it is found at the end of the string.

I think that our best option, if we would like to address this issue, is to add an 'errors' argument whose default value keeps the current behavior for backwards compatibility but which accepts more options: one to ignore the strange characters and carry on with the processing, like bytes.decode's errors='ignore', and one to raise an error in such situations, like bytes.decode's errors='strict'.
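A rough sketch of what such an 'errors' argument could look like, written as a standalone wrapper. Everything here is an assumption made for illustration: the function name b64decode_with_errors, the None default for the legacy behavior, and the choice to drop misplaced '=' characters in 'ignore' mode are not part of the base64 module or of any agreed design.

```python
import base64
import re

# Anything outside the standard base64 alphabet, including the pad "=".
_NON_ALPHABET = re.compile(rb"[^A-Za-z0-9+/]")


def b64decode_with_errors(data, errors=None):
    """Hypothetical wrapper; not part of the standard library."""
    if isinstance(data, str):
        data = data.encode("ascii")
    if errors is None:
        # Default: keep whatever b64decode currently does.
        return base64.b64decode(data)
    if errors == "strict":
        # Raises binascii.Error on any non-alphabet or misplaced pad character.
        return base64.b64decode(data, validate=True)
    if errors == "ignore":
        # Drop every non-alphabet character, then restore trailing padding.
        cleaned = _NON_ALPHABET.sub(b"", data)
        cleaned += b"=" * (-len(cleaned) % 4)
        # Input that is still undecodable is left for b64decode to reject.
        return base64.b64decode(cleaned, validate=True)
    raise ValueError("unknown errors value: %r" % (errors,))


print(b64decode_with_errors("AAA=AAAAA", errors="ignore"))
```

Under these assumptions, errors='ignore' decodes "AAA=AAAAA" as if the stray '=' were absent, while errors='strict' raises binascii.Error for the same input.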

History
Date	User	Action	Args
2022-04-11 14:59:06	admin	set	github: 79013
2018-09-29 16:05:02	fbidu	set	messages: + msg326680
2018-09-28 23:49:43	fbidu	set	messages: + msg326662
2018-09-28 23:37:28	fbidu	set	nosy: + fbidu messages: + msg326661
2018-09-28 12:20:13	pw.michael.harris	create