Author v+python
Recipients Arach, Arfrever, Huzaifa.Sidhpurwala, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, georg.brandl, gvanrossum, gz, jcea, lemburg, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner
Date 2012-01-08.05:18:53
Message-id <4F092727.2010200@g.nevcal.com>
In-reply-to <CAO_YWRUVxD8Sw148ezEfxqV2b-0hhMqYDC8MdjU3pAjh0NxK2w@mail.gmail.com>
Content
[offlist]
Paul, thanks for the enumeration and response.  Some folks have more 
experience, but the rest of us need to learn.  Having the proposal in 
the ticket, with an explanation of its deficiencies, is not all bad; 
perhaps others can learn from it.  On the other hand, I'm willing to 
learn more, if you are willing to address my concerns below.

I had read the whole thread and issue, but it still seemed like a leap 
of faith to conclude that the only, or at least best, solution is 
changing the hash.  Yet, changing the hash still doesn't seem like a 
sufficient solution, due to long-lived processes.

On 1/7/2012 6:40 PM, Paul McMillan wrote:
> Paul McMillan<paul@mcmillan.ws>  added the comment:
>
>> Alex, I agree the issue has to do with the origin of the data, but the modules listed are the ones that deal with the data supplied by this particular attack.
> They deal directly with the data. Do any of them pass the data
> further, or does the data stop with them?

For web forms and requests, which is the claimed vulnerability, I would 
expect that most of them do not pass the data further without 
validation or selection, and it is unlikely that a form actually 
expects data with colliding strings, so it seems very unlikely that 
such strings would be passed on.  At least that is how I code my web 
apps: just select the data I expect from my form.  At present I do not 
reject data I do not expect, but I'll have to consider either using 
SafeDict (which I can start using ASAP, without waiting for a new 
release of Python to be installed on my Web Server, currently running 
Python 2.4) or rejecting data I do not expect prior to putting it in a 
dict.  That might require tweaking urllib.parse a bit, or cgi, or both.
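
Concretely, the selection I have in mind looks something like the 
following minimal sketch (the field names are assumed for 
illustration; note that FieldStorage itself must still parse every 
submitted field first, which is why urllib.parse or cgi might need 
tweaking too):

    import cgi

    EXPECTED_FIELDS = ("name", "email", "comment")  # assumed field names

    def read_form():
        form = cgi.FieldStorage()   # parses the query string / POST body
        params = {}
        for name in EXPECTED_FIELDS:
            value = form.getfirst(name)   # None if the field is absent
            if value is not None:
                params[name] = value
        # Anything the application did not ask for is never touched
        # again after the initial parse.
        return params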

> A short and very incomplete
> list of vulnerable standard lib modules includes: every single parsing
> library (json, xml, html, plus all the third party libraries that do
> that), all of numpy (because it processes data which probably came
> from a user [yes, integers can trigger the vulnerability]), difflib,
> the math module, most database adaptors, anything that parses metadata
> (including commonly used third party libs like PIL), the tarfile lib
> along with other compressed format handlers, the csv module,
> robotparser, plistlib, argparse, pretty much everything under the
> heading of "18. Internet Data Handling" (email, mailbox, mimetypes,
> etc.), "19. Structured Markup Processing Tools", "20. Internet
> Protocols and Support", "21. Multimedia Services", "22.
> Internationalization", Tkinter, and all the os calls that handle
> filenames. The list is impossibly large, even if we completely ignore
> user code. This MUST be fixed at a language level.
>
> I challenge you to find me 15 standard lib components that are certain
> to never handle user-controlled input.

I do appreciate your enumeration, but I'll decline the challenge.  While 
all of them can be interesting exploits of naïve applications (written 
by programmers who may be quite experienced in some things, but can 
naïvely overlook other things), most of them probably do not apply to 
the documented vulnerability. Many I had thought of, but rejected for 
this context; some I had not.  So while there are many possible 
situations where happily stuffing things into a dict may be an easy 
solution, there are many possible cases where it should be prechecked on 
the way in.  And there is another restriction: if the user-controlled 
input enters a user-run program, it is unlikely to be attacked in the 
same manner as web servers are attacked.  A user, for example, is 
unlikely to contrive colliding file names for the purpose of making his 
file listing program run slow.
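
That said, the quoted point about parsing libraries is easy to make 
concrete: a parser inserts every attacker-supplied key into a dict 
before the application gets any chance to filter.  A minimal 
illustration (the payload here is benign; a real attack would use 
thousands of keys crafted to collide):

    import json

    payload = '{"a": 1, "b": 2}'   # the attacker controls these keys
    data = json.loads(payload)     # every key is hashed into a dict
                                   # during the parse itself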

So it is really system services and web services that need to be 
particularly careful. Randomizing the hash seed might reduce the problem 
from any system/web services to only long-running system/web services, 
but doesn't really solve the complete problem, as far as I can tell... 
only proper care in writing the application (and the stdlib code) will 
solve the complete problem.  Sadly, beefing up the stdlib code will 
probably reduce performance for things that will not be exploited to be 
careful enough in the cases that could be exploited.
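
One example of such care, as a purely hypothetical sketch of my own 
(nothing here is a proposed stdlib API): bound the number of fields 
before ever building the dict, so a flood of colliding keys is 
rejected up front.

    MAX_FIELDS = 100   # assumed application-specific limit

    def checked_dict(pairs):
        pairs = list(pairs)
        if len(pairs) > MAX_FIELDS:
            raise ValueError("too many fields submitted")
        return dict(pairs)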

>> Note that changing the hash algorithm for a persistent process, even though each process may have a different seed or randomized source, allows attacks for the life of that process, if an attack vector can be created during its lifetime. This is not a problem for systems where each request is handled by a different process, but is a problem for systems where processes are long-running and handle many requests.
> This point has been made many times now. I urge you to read the entire
> thread on the mailing list. Your implementation is impractical because
> your "safe" implementation completely ignores all hash caching (each
> entry must be re-hashed for that dict). Your implementation is still
> vulnerable in exactly the way you mentioned if you ever have any kind
> of long-lived dict in your program thread.

I had read the entire thread, or at least the part that made it to my 
inbox. It is a very interesting discussion.

I have to admit I was aware that hashes got cached, but the fact that 
strings keep getting recreated, which causes the cache to be 
sidestepped, did slip past me.  On the other hand, the only strings 
that will get recreated in a CGI situation are a few FORM parameters 
per request... the expected ones.  And only if the application keeps 
referring to the CGI dict to find them, instead of binding the values 
locally.  For JSON data structures, it seems more likely that they 
would be iterated rather than looked up point by point.  But how 
severe missing the hash cache would be does depend on the application.
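
For what it's worth, my understanding of the caching point, as a small 
sketch (this is CPython-specific behavior):

    key = "expected_field"
    d = {key: 1}
    d[key]     # reuses the hash CPython cached on the `key` object

    fresh = "".join(["expected", "_", "field"])   # new, equal string
    d[fresh]   # fresh's hash is computed once here, because the cache
               # lives on the string object, not in the dict

A per-dict salted hash, as in the SafeDict idea, cannot reuse that 
per-object cache at all, which as I understand it is exactly the 
re-hashing cost Paul describes.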

A long-lived dict is only vulnerable if it continues to accept updates 
during its lifetime.  For web requests, the vulnerability of concern 
here, that is generally not the case... the dicts built for a request 
are generally discarded when the request completes.
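
A hypothetical sketch of that lifetime distinction (the names are 
illustrative, not from any real framework):

    long_lived_cache = {}   # survives across requests; colliding keys
                            # inserted here keep hurting every lookup

    def handle_request(form_fields):
        # Short-lived dict: even if form_fields holds colliding keys,
        # the cost is bounded to this one request, then discarded.
        params = dict(form_fields)
        # Long-lived dict accepting updates: a patient attacker can
        # accumulate collisions over many requests.
        for key, value in form_fields:
            long_lived_cache[key] = value
        return params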

>> You have entered the class of people that claim lots of vulnerabilities, without enumeration.
> I have enumerated. Stop making this argument.

I appreciate the enumeration, even though you were not the person to 
whom that statement was addressed.

And, I see that a randomized hash per process helps reduce a large class 
of potential vulnerabilities for short-lived processes, but I still have 
concerns that simply randomizing the hash per process adds startup 
overhead and gratuitous incompatibility, and still doesn't fully solve 
the problem, as collisions remain possible, and attackers are clever.
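
To be clear about what per-process randomization does buy: with a 
randomized seed, the same string hashes differently in separate 
interpreter processes, so precomputed collisions don't transfer 
between them.  A small sketch, assuming an interpreter where the 
randomization being discussed here is enabled (and Python 2.7+ for 
subprocess.check_output):

    import subprocess, sys

    cmd = [sys.executable, "-c", "print(hash('abc'))"]
    print(subprocess.check_output(cmd))   # one process's hash of 'abc'
    print(subprocess.check_output(cmd))   # a fresh process: usually a
                                          # different value

Within a single long-lived process, though, the seed never changes, so 
a collision discovered during its lifetime stays good until restart.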