Message 244752 - Python tracker


Author	mgdelmonte
Recipients	Lukasa, barry, demian.brecht, icordasc, martin.panter, mgdelmonte, piotr.dobrogost, r.david.murray
Date	2015-06-03.14:36:26
Message-id	<1433342186.84.0.340444077082.issue24363@psf.upfronthosting.co.za>
In-reply-to

Content
Given that obs-fold is technically valid, may I recommend reading the entire header block first (up to the first blank line) and then tokenizing the individual headers with a regular expression, rather than parsing line by line?  That would solve the problem more elegantly and easily than reading lines on the fly and then "unreading" the nonconforming ones.

Here's my recommendation:

    def readheaders(self):
        self.dict = {}
        self.unixfrom = ''
        self.headers = hlist = []
        self.status = ''
        # read entire header (read until first blank line)
        while True:
            line = self.fp.readline(_MAXLINE+1)
            if not line:
                self.status = 'EOF in headers'
                break
            if len(line) > _MAXLINE:
                raise LineTooLong("header line")
            hlist.append(line)
            if line in ('\n', '\r\n'):
                break
            if len(hlist) > _MAXHEADERS:
                raise HTTPException("got more than %d headers" % _MAXHEADERS)
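        # reproduce and parse as string; each line already ends with its own
        # newline, so join with no separator (an extra "\n" would detach
        # continuation lines from the header they belong to)
        header = "".join(hlist)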
        self.headers = re.findall(r"[^ \n][^\n]+\n(?: +[^\n]+\n)*", header)
        firstline = True
        for line in self.headers:
            if firstline and line.startswith('From '):
                self.unixfrom = self.unixfrom + line
                continue
            firstline = False
            if ':' in line:
                k,v = line.split(':',1)
                self.addheader(k, re.sub("\n +", " ", v.strip()))
            else:
                self.status = 'Non-header line where header expected' if self.dict else 'No headers'


I think this handles everything you're trying to do.  I don't understand the unixfrom bit, but I think I have it right.

As for Cory's concern re: smuggling, _MAXLINE and _MAXHEADERS should help with that.  The regexp guarantees that every line plus continuation appears as a single header.
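To illustrate the grouping claim, here is a minimal standalone sketch that applies the same pattern to a small sample. The names `HEADER_RE`, `split_headers` and `sample` are made up for this illustration and are not part of httplib; the sample uses bare "\n" line endings and space-only continuations, which is all the pattern as written is designed to recognise.

```python
import re

# Same pattern as in readheaders() above: one line that does not start with a
# space, followed by zero or more continuation lines that start with spaces.
HEADER_RE = re.compile(r"[^ \n][^\n]+\n(?: +[^\n]+\n)*")


def split_headers(raw):
    """Return each header, together with its continuation lines, as one string.

    Hypothetical helper for illustration only.
    """
    return HEADER_RE.findall(raw)


# Hypothetical sample: one folded header, one plain header, then the blank
# line that terminates the header block.
sample = (
    "Subject: a folded\n"
    "  header value\n"
    "Host: example.com\n"
    "\n"
)

headers = split_headers(sample)
for h in headers:
    print(repr(h))
```

The folded Subject header comes back as a single two-line string, and the terminating blank line is not matched at all, so nothing after the header block can be picked up by this step.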

I use re.sub("\n +", " ", v.strip()) to clean the value and remove the continuation.
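A small sketch of that clean-up step in isolation may help. The helper names `unfold` and `parse_header` are hypothetical and exist only for this illustration; the substitution itself is the one quoted above.

```python
import re


def unfold(value):
    """Strip the value, then collapse each newline-plus-spaces fold to one space.

    Hypothetical helper wrapping the re.sub() call quoted above.
    """
    return re.sub("\n +", " ", value.strip())


def parse_header(raw_header):
    """Split one tokenized header (continuations included) into (name, value)."""
    name, value = raw_header.split(":", 1)
    return name, unfold(value)


name, value = parse_header("Subject: a folded\n  header value\n")
print(name, "=>", repr(value))

# With CRLF line endings the "\r" before the fold is not consumed by this
# pattern, so it stays in the cleaned value.
crlf_value = unfold(" a folded\r\n  header value\r\n")
print(repr(crlf_value))
```

One thing this shows is that a "\r" survives in the middle of the value when the lines end in "\r\n". A pattern that also consumes an optional "\r" (and tabs) would be a possible refinement, but that goes beyond what is proposed here.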

History
Date	User	Action	Args
2015-06-03 14:36:26	mgdelmonte	set	recipients: + mgdelmonte, barry, r.david.murray, martin.panter, piotr.dobrogost, icordasc, demian.brecht, Lukasa
2015-06-03 14:36:26	mgdelmonte	set	messageid: <1433342186.84.0.340444077082.issue24363@psf.upfronthosting.co.za>
2015-06-03 14:36:26	mgdelmonte	link	issue24363 messages
2015-06-03 14:36:26	mgdelmonte	create