classification
Title: base64 module ignores non-alphabet characters
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: r.david.murray Nosy List: Julian, Red HamsterX, ajaksu2, amaury.forgeotdarc, eric.araujo, ltaczuk, pitrou, r.david.murray, sanxiyn
Priority: normal Keywords: patch

Created on 2006-04-07 01:19 by sanxiyn, last changed 2014-07-26 22:44 by Julian. This issue is now closed.

Files
File name Uploaded Description Edit
1466065[3.2-complete].diff Red HamsterX, 2009-07-29 19:36 Unit-test/solution/documentation update for Python 3.2 review
1466065[2.7-complete].diff Red HamsterX, 2009-07-29 19:38 Unit-test/solution/documentation update for Python 2.7 review
binascii-validate.patch r.david.murray, 2010-11-08 21:29
Messages (33)
msg60903 - (view) Author: Seo Sanghyeon (sanxiyn) Date: 2006-04-07 01:19
See:
http://mail.python.org/pipermail/python-dev/2006-April/063504.html

base64.b64decode function ignores non-alphabet
characters, which contradicts its documentation: see
http://docs.python.org/lib/module-base64.html

Also, the exception must be ValueError instead of
TypeError.
msg91023 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-28 22:38
Attached a patch that modifies Modules/binascii.c to raise a TypeError
on any unrecognized base64 input, rather than failing silently. Also
includes unit tests.

Patch built against Python 3.2, r74246.

TypeError was preserved over ValueError, despite the lack of related
logic, for legacy purposes.
msg91024 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-28 22:39
Attached a patch that modifies Lib/base64.py to raise a TypeError on any
unrecognized base64 input, rather than failing silently. Also includes
unit tests.

Patch built against Python 2.7, r74246.
msg91025 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-28 22:40
Attached a patch that modifies Lib/base64.py to raise a TypeError on any
unrecognized base64 input, rather than continuing silently. Also
includes unit tests.

Patch built against Python 3.2, r74246.

Correction to previous text: base64 continues silently, discarding
unrecognized characters, rather than failing.
msg91039 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-07-29 13:44
The new unit tests pass without modifying the library.
Could you include a case that fails with the current version?
msg91043 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 16:31
I can't add a test for that without changing unrelated behaviour in the
library that is already known to work fine or checking the string value
of the raised exception, which seems like a bad idea, even though it
would work.

If a character is ignored and this leads to a padding-length issue,
TypeError is raised in both 2.7 and 3.2: 2.7 because everything is a
TypeError, and 3.2 because binascii.Error is converted to TypeError for
legacy purposes.

If a character is ignored and the string's length is still acceptable,
then no error is reported because this was a silent problem.

Post-library-modification, both of these cases will uniformly produce
the proper error, although it is, through type-checking alone,
indistinguishable from the errors that would have existed before -- the
value is in the fact that it will tell the user the nature of the
failure, and it will be noisy when it may have been silent before.
msg91045 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-07-29 16:40
If "it may be noisy where it was silent before", then add one of those
cases and make sure the noise doesn't happen before your fix, and does
happen after.

If you have to check the value of the exception string for other tests,
then do so.  There are plenty of examples of this in the existing tests,
(see the pydoc tests, for example).  If you can limit what you test for
so that the test will be resitent to changes in the exact text, so much
the better.  You can use  assertRaisesRegexp for this in 2.7.
msg91046 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-07-29 16:44
What is the correct behavior for something like this?
   base64.b64decode('!')
2.6 silently returns ''.
msg91047 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 16:58
According to the documentation cited by Seo Sanghyeon in the first post,
"A TypeError is raised if s were incorrectly padded or if there are
non-alphabet characters present in the string."

The valid range of characters is [A-Za-z0-9], and one or two '='s may
appear at the end of the input to signify dimension-padding. Therefore,
'!' should fail with a TypeError alerting the user to the presence of
unrecognized data, rather than being discarded.

(Additionally, it looks like newline characters in the input aren't
unheard of, and those are probably the reason behind the silent ignores)
msg91048 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 16:58
R. David Murray, should I update the patches for both the pure-Python
solution and the C solution, or is one domain preferable here? The
Python-based solution keeps all of the invalid-character TypeErrors
collected in the same module, but the C-based solution allows this
problem to be caught more efficiently.
msg91049 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-07-29 17:00
> Therefore, '!' should fail with a TypeError
Here is your test case!
"Errors should never pass silently."
msg91050 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-07-29 17:05
Amaury is probably better qualified to answer that question, but I would
think the C code version is preferable.
msg91052 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-07-29 17:23
I've dig into the history of the file and found this change:
http://svn.python.org/view?view=rev&revision=13939
"- Illegal padding is now ignored.  (Recommendation by GvR.)"

The motivation at the time was based on "the general Internet philosophy":
http://mail.python.org/pipermail/python-bugs-list/1999-October/000234.html

I don't know if this is still valid 10 years later, though.
msg91053 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-07-29 17:35
Perhaps Guido remembers why the decision was made.
msg91054 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 17:40
I'll hold off on uploading new patches until someone makes a decision, then.

It seems like, perhaps, simply amending the documentation would be
sufficient, since this behaviour shouldn't break any valid messages that
might reach this module. At worst, it'll just treat gibberish as valid,
and that's what it's been doing for a decade. (Although the other decode
routines are all strict by comparison)
msg91055 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2009-07-29 17:55
It strikes me as simply a documentation bug. Maybe whoever wrote the
docs just assumed it would work that way. I expect that changing it to
insist on valid input would break some use cases. If you want a
validating API, perhaps a new API could be added that validates as well
as converts.
msg91056 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-07-29 18:04
It turns out that the RFC is unambiguous on this point.  Quoting the
base64 section of RFC 2045:

   The encoded output stream must be represented in lines of no more
   than 76 characters each.  All line breaks or other characters not
   found in Table 1 must be ignored by decoding software.  In base64
   data, characters other than those in Table 1, line breaks, and other
   white space probably indicate a transmission error, about which a
   warning message or even a message rejection might be appropriate
   under some circumstances.

Since "some circumstances" is not something the base64 decoder can
decide, that has to be left to a higher level ap.  So if unexpected
characters are to generate an error, it would need to be enabled via a
flag that defaults to not raising the error, IMO.  Unless someone has a
use case for rejecting an improperly encoded message, we should probably
just fix the docs (perhaps noting that this behavior is in accordance
with the RFC).
msg91057 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 18:14
RFC 3548, referenced by the base64 module's docs, has a rather different
statement on how invalid characters should be treated.

From 2.3 Interpretation of non-alphabet characters in encoded data:
   Implementations MUST reject the encoding if it contains characters
   outside the base alphabet when interpreting base encoded data, unless
   the specification referring to this document explicitly states
   otherwise.  Such specifications may, as MIME does, instead state that
   characters outside the base encoding alphabet should simply be
   ignored when interpreting data ("be liberal in what you accept").

So it looks like we can safely just say that invalid characters are
ignored in the docs, as long as it's explicit, but that's probably not
what people will expect.


I'll add doc patches in a moment, and someone who's actually a developer
(i.e., not me) can decide whether they're good enough.
msg91060 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 18:23
Attached a documentation patch indicating that the
ignore-invalid-characters behaviour is intentional, citing relevant RFCs
for support.

Patch built against Python 3.2, r74261.
msg91061 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-07-29 18:39
Hmm.  But if the module is used outside of MIME (which it can be, and in
fact is in the stdlib itself), then an error must be raised in order to
comply with that RFC.  So it sounds like we really ought to have that
flag.  And I was even wrong about the appropriate default.
msg91062 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 18:53
It isn't written that only MIME may ignore such content. The key terms
there are 'may' and 'explicitly states otherwise'.

If the documentation is clear, then all future application developers
will know to check for validity using a regular expression like
'^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which
validity matters should already have a similar workaround.

While I do agree that standards are always good and that workarounds are
bad, Guido does have a very valid point: "changing it to
insist on valid input would break some use cases", and I think we
already missed the 2.x -> 3.x window where that would have been acceptable.
msg91064 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-07-29 19:05
> If the documentation is clear, then all future application developers
> will know to check for validity using a regular expression like
> '^[A-Za-z0-9+/\r\n]+={0,2}$'. Any existing applications in which
> validity matters should already have a similar workaround.

But having to validate input manually kinds of defeats the point of
having a decoder in the stdlib, therefore I agree with MRAB that a
validation flag would be useful.
msg91065 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-07-29 19:07
And if the flag defaults to the current behavior that should satisfy the
backward compatiblity issues.
msg91067 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 19:36
Attached a documentation/unit-test/solution patch that adds a
validate=False parameter to the b64decode function of Lib/base64.py,
which may be set to True to have invalid base64 content be rejected with
a TypeError.

Patch built against Python 3.2, r74261.
msg91068 - (view) Author: Neil Tallim (Red HamsterX) Date: 2009-07-29 19:38
Attached a documentation/unit-test/solution patch that adds a
validate=False parameter to the b64decode function of Lib/base64.py,
which may be set to True to have invalid base64 content be rejected with
a TypeError.

Patch built against Python 2.7, r74261.


Note: Sorry this went on for so long. However, I think I understand the
patch-submission process a lot better now.
msg114650 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-22 09:05
Attached patches against 2.7 and 3.2 contain doc and unit test changes so can someone please review and commit if acceptable.
msg115687 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-06 02:29
Comments on patch:

1) if I'm reading the RFC correctly, to be validating strictly in compliance with the RFC \r and \n should also raise an error.  Do you agree?

2) We've pretty much dropped the convention of adding history notes to the file itself.

3) The original code documented a TypeError on incorrect padding but in py3k in fact raises a binascii.Error, as you noted in the patch.  I wonder if it would be better to raise a binascii.Error on invalid data as well, since it isn't, strictly speaking, a TypeError.  That would also make it easier to move the code into the C module later if that is deemed desirable.

4) The negative in the doc language ("If validate is not set to True...") is awkward and confusing.  Better would be "If validate is False (the default)..."

Since the patch does add an API change (AKA a feature) I think this can only go into 3.2.

I have some additional concerns when considering this in the context of email6.  email6 will have a 'strict' mode where it would be sensible for invalid base64 to raise an error.  But in non-strict mode, it would be ideal to have a way to (a) know if there is invalid input, but still decode it, and (b) decode it even if the padding is off after ignoring the invalid data.  I'm not saying that this patch should try to address those issues, I just want to put them on record in case I want to do something about them later.
msg116049 - (view) Author: Neil Tallim (Red HamsterX) Date: 2010-09-10 22:08
I agree that it does look like RFC 3548 implicitly denies the use of \r and \n. A few *very* simple tests also makes it look like ommitting them breaks nothing, but we'd need to add a test that includes them before applying the patch.

Other than that, I agree with everything and find the email6 notes intruiging.

Would you like me to update the patch, or do you want to do it?
msg116072 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-11 01:35
If you could update it that would be great.
msg120803 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-08 21:29
Here is an updated patch that addresses the concerns I noted.  I modified the tests: given that I've changed the code to raise binascii.Error as discussed, we don't really care from an API point of view what the error text is, just that the error is raised in the cases given.  Nor do we care (again, from an API point of view) if the first error detected is the invalid character or the invalid padding.  

I made the new test test the validate=False case, so all of the invalid character tests are now correctly padded if you ignore the invalid characters.
msg120966 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-11-11 20:12
Committed in r86414.
msg176409 - (view) Author: Lukasz Taczuk (ltaczuk) Date: 2012-11-26 12:30
Could someone update the docs for python 2.7.3?

This ticket is marked as closed and b64decode still silently ignores non base64 chars, but the documentation (for 2.7.3) still states that while calling b64decode "A TypeError is raised if (...) or if there are non-alphabet characters present in the string.".
http://docs.python.org/2.7/library/base64.html
msg224090 - (view) Author: Julian Berman (Julian) * Date: 2014-07-26 22:44
Created issue22088 to address not having fixed Py2 here.
History
Date User Action Args
2014-07-26 22:44:38Juliansetnosy: + Julian
messages: + msg224090
2012-11-26 12:30:22ltaczuksetnosy: + ltaczuk
messages: + msg176409
2010-11-11 20:12:52r.david.murraysetstatus: open -> closed

nosy: - BreamoreBoy
messages: + msg120966

resolution: accepted
stage: patch review -> resolved
2010-11-08 21:29:50r.david.murraysetfiles: + binascii-validate.patch

messages: + msg120803
2010-09-11 01:35:02r.david.murraysetmessages: + msg116072
2010-09-10 22:08:01Red HamsterXsetmessages: + msg116049
2010-09-06 02:29:49r.david.murraysetassignee: r.david.murray
messages: + msg115687
versions: - Python 3.1, Python 2.7
2010-08-22 16:57:23gvanrossumsetnosy: - gvanrossum
2010-08-22 09:05:54BreamoreBoysetversions: + Python 3.1
nosy: + BreamoreBoy

messages: + msg114650

stage: test needed -> patch review
2010-02-16 05:50:30eric.araujosetnosy: + eric.araujo
2009-07-29 22:25:28rhettingersetmessages: - msg91059
2009-07-29 19:38:23Red HamsterXsetfiles: + 1466065[2.7-complete].diff

messages: + msg91068
2009-07-29 19:36:54Red HamsterXsetfiles: + 1466065[3.2-complete].diff

messages: + msg91067
2009-07-29 19:33:05Red HamsterXsetfiles: - 1466065[3.2].diff
2009-07-29 19:33:01Red HamsterXsetfiles: - 1466065[2.7].diff
2009-07-29 19:07:54r.david.murraysetmessages: + msg91065
2009-07-29 19:05:03pitrousetmessages: + msg91064
2009-07-29 18:53:24Red HamsterXsetmessages: + msg91062
2009-07-29 18:39:26r.david.murraysetmessages: + msg91061
2009-07-29 18:28:05rhettingersetmessages: - msg91022
2009-07-29 18:23:55Red HamsterXsetfiles: + 1466065[3.2].diff

messages: + msg91060
2009-07-29 18:23:26Red HamsterXsetfiles: + 1466065[2.7].diff

messages: + msg91059
2009-07-29 18:14:04Red HamsterXsetmessages: + msg91057
2009-07-29 18:04:57r.david.murraysetmessages: + msg91056
2009-07-29 17:55:46gvanrossumsetmessages: + msg91055
2009-07-29 17:40:17Red HamsterXsetmessages: + msg91054
2009-07-29 17:35:27pitrousetnosy: + gvanrossum, pitrou

messages: + msg91053
versions: + Python 2.7, Python 3.2, - Python 2.6, Python 3.0
2009-07-29 17:23:20amaury.forgeotdarcsetmessages: + msg91052
2009-07-29 17:14:34Red HamsterXsetfiles: - 1466065[3.2-pure-python].diff
2009-07-29 17:14:30Red HamsterXsetfiles: - 1466065[2.7-pure-python].diff
2009-07-29 17:14:26Red HamsterXsetfiles: - 1466065[3.2].diff
2009-07-29 17:14:21Red HamsterXsetfiles: - 1466065[2.7].diff
2009-07-29 17:05:10r.david.murraysetmessages: + msg91050
2009-07-29 17:00:08amaury.forgeotdarcsetmessages: + msg91049
2009-07-29 16:58:32Red HamsterXsetmessages: + msg91048
2009-07-29 16:58:04Red HamsterXsetmessages: + msg91047
2009-07-29 16:44:24amaury.forgeotdarcsetmessages: + msg91046
2009-07-29 16:40:21r.david.murraysetnosy: + r.david.murray
messages: + msg91045
2009-07-29 16:31:00Red HamsterXsetmessages: + msg91043
2009-07-29 13:44:01amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg91039
2009-07-28 22:40:57Red HamsterXsetfiles: + 1466065[3.2-pure-python].diff

messages: + msg91025
2009-07-28 22:39:11Red HamsterXsetfiles: + 1466065[2.7-pure-python].diff

messages: + msg91024
2009-07-28 22:38:07Red HamsterXsetfiles: + 1466065[3.2].diff

messages: + msg91023
2009-07-28 22:36:56Red HamsterXsetfiles: + 1466065[2.7].diff

nosy: + ajaksu2, Red HamsterX
messages: + msg91022

keywords: + patch
2009-03-21 02:03:38ajaksu2setstage: test needed
type: behavior
versions: + Python 2.6, Python 3.0
2006-04-07 01:19:53sanxiyncreate