Side by Side Diff: Lib/email/quoprimime.py

Issue 5803: email/quoprimime: encode and decode are very slow on large messages
Patch Set: Created 6 years, 8 months ago
OLD | NEW
1 # Copyright (C) 2001-2006 Python Software Foundation 1 # Copyright (C) 2001-2006 Python Software Foundation
2 # Author: Ben Gertzfield 2 # Author: Ben Gertzfield
3 # Contact: email-sig@python.org 3 # Contact: email-sig@python.org
4 4
5 """Quoted-printable content transfer encoding per RFCs 2045-2047. 5 """Quoted-printable content transfer encoding per RFCs 2045-2047.
6 6
7 This module handles the content transfer encoding method defined in RFC 2045 7 This module handles the content transfer encoding method defined in RFC 2045
8 to encode US ASCII-like 8-bit data called `quoted-printable'. It is used to 8 to encode US ASCII-like 8-bit data called `quoted-printable'. It is used to
9 safely encode text that is in a character set similar to the 7-bit US ASCII 9 safely encode text that is in a character set similar to the 7-bit US ASCII
10 character set, but that includes some 8-bit characters that are normally not 10 character set, but that includes some 8-bit characters that are normally not
(...skipping 35 matching lines...)
46 46
47 CRLF = '\r\n' 47 CRLF = '\r\n'
48 NL = '\n' 48 NL = '\n'
49 EMPTYSTRING = '' 49 EMPTYSTRING = ''
50 50
51 # Build a mapping of octets to the expansion of that octet. Since we're only 51 # Build a mapping of octets to the expansion of that octet. Since we're only
52 # going to have 256 of these things, this isn't terribly inefficient 52 # going to have 256 of these things, this isn't terribly inefficient
53 # space-wise. Remember that headers and bodies have different sets of safe 53 # space-wise. Remember that headers and bodies have different sets of safe
54 # characters. Initialize both maps with the full expansion, and then override 54 # characters. Initialize both maps with the full expansion, and then override
55 # the safe bytes with the more compact form. 55 # the safe bytes with the more compact form.
56 _QUOPRI_HEADER_MAP = dict((c, '=%02X' % c) for c in range(256)) 56 _QUOPRI_MAP = ['=%02X' % c for c in range(256)]
57 _QUOPRI_BODY_MAP = _QUOPRI_HEADER_MAP.copy() 57 _QUOPRI_HEADER_MAP = _QUOPRI_MAP[:]
58 _QUOPRI_BODY_MAP = _QUOPRI_MAP[:]
58 59
59 # Safe header bytes which need no encoding. 60 # Safe header bytes which need no encoding.
60 for c in b'-!*+/' + ascii_letters.encode('ascii') + digits.encode('ascii'): 61 for c in b'-!*+/' + ascii_letters.encode('ascii') + digits.encode('ascii'):
61 _QUOPRI_HEADER_MAP[c] = chr(c) 62 _QUOPRI_HEADER_MAP[c] = chr(c)
62 # Headers have one other special encoding; spaces become underscores. 63 # Headers have one other special encoding; spaces become underscores.
63 _QUOPRI_HEADER_MAP[ord(' ')] = '_' 64 _QUOPRI_HEADER_MAP[ord(' ')] = '_'
64 65
65 # Safe body bytes which need no encoding. 66 # Safe body bytes which need no encoding.
66 for c in (b' !"#$%&\'()*+,-./0123456789:;<>' 67 for c in (b' !"#$%&\'()*+,-./0123456789:;<>'
67 b'?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`' 68 b'?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`'
(...skipping 46 matching lines...)
114 else: 115 else:
115 L.append(s.lstrip()) 116 L.append(s.lstrip())
116 117
117 118
118 def unquote(s): 119 def unquote(s):
119 """Turn a string in the form =AB to the ASCII character with value 0xab""" 120 """Turn a string in the form =AB to the ASCII character with value 0xab"""
120 return chr(int(s[1:3], 16)) 121 return chr(int(s[1:3], 16))
121 122
122 123
123 def quote(c): 124 def quote(c):
124 return '=%02X' % ord(c) 125 return _QUOPRI_MAP[ord(c)]
125
126 126
127 127
128 def header_encode(header_bytes, charset='iso-8859-1'): 128 def header_encode(header_bytes, charset='iso-8859-1'):
129 """Encode a single header line with quoted-printable (like) encoding. 129 """Encode a single header line with quoted-printable (like) encoding.
130 130
131 Defined in RFC 2045, this `Q' encoding is similar to quoted-printable, but 131 Defined in RFC 2045, this `Q' encoding is similar to quoted-printable, but
132 used specifically for email header fields to allow charsets with mostly 7 132 used specifically for email header fields to allow charsets with mostly 7
133 bit characters (and some 8 bit) to remain more or less readable in non-RFC 133 bit characters (and some 8 bit) to remain more or less readable in non-RFC
134 2045 aware mail clients. 134 2045 aware mail clients.
135 135
136 charset names the character set to use in the RFC 2046 header. It 136 charset names the character set to use in the RFC 2046 header. It
137 defaults to iso-8859-1. 137 defaults to iso-8859-1.
138 """ 138 """
139 # Return empty headers as an empty string. 139 # Return empty headers as an empty string.
140 if not header_bytes: 140 if not header_bytes:
141 return '' 141 return ''
142 # Iterate over every byte, encoding if necessary. 142 # Iterate over every byte, encoding if necessary.
143 encoded = [] 143 encoded = header_bytes.decode('latin1').translate(_QUOPRI_HEADER_MAP)
144 for octet in header_bytes:
145 encoded.append(_QUOPRI_HEADER_MAP[octet])
146 # Now add the RFC chrome to each encoded chunk and glue the chunks 144 # Now add the RFC chrome to each encoded chunk and glue the chunks
147 # together. 145 # together.
148 return '=?%s?q?%s?=' % (charset, EMPTYSTRING.join(encoded)) 146 return '=?%s?q?%s?=' % (charset, encoded)
149 147
150 148
151 class _body_accumulator(io.StringIO): 149 _QUOPRI_BODY_ENCODE_MAP = _QUOPRI_BODY_MAP[:]
152 150 for c in b'\r\n':
153 def __init__(self, maxlinelen, eol, *args, **kw): 151 _QUOPRI_BODY_ENCODE_MAP[c] = chr(c)
154 super().__init__(*args, **kw)
155 self.eol = eol
156 self.maxlinelen = self.room = maxlinelen
157
158 def write_str(self, s):
159 """Add string s to the accumulated body."""
160 self.write(s)
161 self.room -= len(s)
162
163 def newline(self):
164 """Write eol, then start new line."""
165 self.write_str(self.eol)
166 self.room = self.maxlinelen
167
168 def write_soft_break(self):
169 """Write a soft break, then start a new line."""
170 self.write_str('=')
171 self.newline()
172
173 def write_wrapped(self, s, extra_room=0):
174 """Add a soft line break if needed, then write s."""
175 if self.room < len(s) + extra_room:
176 self.write_soft_break()
177 self.write_str(s)
178
179 def write_char(self, c, is_last_char):
180 if not is_last_char:
181 # Another character follows on this line, so we must leave
182 # extra room, either for it or a soft break, and whitespace
183 # need not be quoted.
184 self.write_wrapped(c, extra_room=1)
185 elif c not in ' \t':
186 # For this and remaining cases, no more characters follow,
187 # so there is no need to reserve extra room (since a hard
188 # break will immediately follow).
189 self.write_wrapped(c)
190 elif self.room >= 3:
191 # It's a whitespace character at end-of-line, and we have room
192 # for the three-character quoted encoding.
193 self.write(quote(c))
194 elif self.room == 2:
195 # There's room for the whitespace character and a soft break.
196 self.write(c)
197 self.write_soft_break()
198 else:
199 # There's room only for a soft break. The quoted whitespace
200 # will be the only content on the subsequent line.
201 self.write_soft_break()
202 self.write(quote(c))
203
204 152
205 def body_encode(body, maxlinelen=76, eol=NL): 153 def body_encode(body, maxlinelen=76, eol=NL):
206 """Encode with quoted-printable, wrapping at maxlinelen characters. 154 """Encode with quoted-printable, wrapping at maxlinelen characters.
207 155
208 Each line of encoded text will end with eol, which defaults to "\\n". Set 156 Each line of encoded text will end with eol, which defaults to "\\n". Set
209 this to "\\r\\n" if you will be using the result of this function directly 157 this to "\\r\\n" if you will be using the result of this function directly
210 in an email. 158 in an email.
211 159
212 Each line will be wrapped at, at most, maxlinelen characters before the 160 Each line will be wrapped at, at most, maxlinelen characters before the
213 eol string (maxlinelen defaults to 76 characters, the maximum value 161 eol string (maxlinelen defaults to 76 characters, the maximum value
214 permitted by RFC 2045). Long lines will have the 'soft line break' 162 permitted by RFC 2045). Long lines will have the 'soft line break'
215 quoted-printable character "=" appended to them, so the decoded text will 163 quoted-printable character "=" appended to them, so the decoded text will
216 be identical to the original text. 164 be identical to the original text.
217 165
218 The minimum maxlinelen is 4 to have room for a quoted character ("=XX") 166 The minimum maxlinelen is 4 to have room for a quoted character ("=XX")
219 followed by a soft line break. Smaller values will generate a 167 followed by a soft line break. Smaller values will generate a
220 ValueError. 168 ValueError.
221 169
222 """ 170 """
223 171
224 if maxlinelen < 4: 172 if maxlinelen < 4:
225 raise ValueError("maxlinelen must be at least 4") 173 raise ValueError("maxlinelen must be at least 4")
226 if not body: 174 if not body:
227 return body 175 return body
228 176
229 # The last line may or may not end in eol, but all other lines do. 177 # quote special characters
230 last_has_eol = (body[-1] in '\r\n') 178 body = body.translate(_QUOPRI_BODY_ENCODE_MAP)
231 179
232 # This accumulator will make it easier to build the encoded body. 180 soft_break = '=' + eol
233 encoded_body = _body_accumulator(maxlinelen, eol) 181 # leave space for the '=' at the end of a line
182 maxlinelen1 = maxlinelen - 1
234 183
235 lines = body.splitlines() 184 encoded_body = []
236 last_line_no = len(lines) - 1 185 append = encoded_body.append
237 for line_no, line in enumerate(lines):
238 last_char_index = len(line) - 1
239 for i, c in enumerate(line):
240 if body_check(ord(c)):
241 c = quote(c)
242 encoded_body.write_char(c, i==last_char_index)
243 # Add an eol if input line had eol. All input lines have eol except
244 # possibly the last one.
245 if line_no < last_line_no or last_has_eol:
246 encoded_body.newline()
247 186
248 return encoded_body.getvalue() 187 for line in body.splitlines():
188 # break up the line into pieces no longer than maxlinelen - 1
189 start = 0
190 laststart = len(line) - 1 - maxlinelen
191 while start <= laststart:
192 stop = start + maxlinelen1
193 # make sure we don't break up an escape sequence
194 if line[stop - 2] == '=':
195 append(line[start:stop - 1])
196 start = stop - 2
197 elif line[stop - 1] == '=':
198 append(line[start:stop])
199 start = stop - 1
200 else:
201 append(line[start:stop] + '=')
202 start = stop
203
204 # handle rest of line, special case if line ends in whitespace
205 if line and line[-1] in ' \t':
206 room = start - laststart
207 if room >= 3:
208 # It's a whitespace character at end-of-line, and we have room
209 # for the three-character quoted encoding.
210 q = quote(line[-1])
211 elif room == 2:
212 # There's room for the whitespace character and a soft break.
213 q = line[-1] + soft_break
214 else:
215 # There's room only for a soft break. The quoted whitespace
216 # will be the only content on the subsequent line.
217 q = soft_break + quote(line[-1])
218 append(line[start:-1] + q)
219 else:
220 append(line[start:])
221
222 # add back final newline if present
223 if body[-1] in CRLF:
224 append('')
225
226 return eol.join(encoded_body)
249 227
250 228
251 229
252 # BAW: I'm not sure if the intent was for the signature of this function to be 230 # BAW: I'm not sure if the intent was for the signature of this function to be
253 # the same as base64MIME.decode() or not... 231 # the same as base64MIME.decode() or not...
254 def decode(encoded, eol=NL): 232 def decode(encoded, eol=NL):
255 """Decode a quoted-printable string. 233 """Decode a quoted-printable string.
256 234
257 Lines are separated with eol, which defaults to \\n. 235 Lines are separated with eol, which defaults to \\n.
258 """ 236 """
(...skipping 54 matching lines...)
313 # Header decoding is done a bit differently 291 # Header decoding is done a bit differently
314 def header_decode(s): 292 def header_decode(s):
315 """Decode a string encoded with RFC 2045 MIME header `Q' encoding. 293 """Decode a string encoded with RFC 2045 MIME header `Q' encoding.
316 294
317 This function does not parse a full MIME header value encoded with 295 This function does not parse a full MIME header value encoded with
318 quoted-printable (like =?iso-8859-1?q?Hello_World?=) -- please use 296 quoted-printable (like =?iso-8859-1?q?Hello_World?=) -- please use
319 the high level email.header class for that functionality. 297 the high level email.header class for that functionality.
320 """ 298 """
321 s = s.replace('_', ' ') 299 s = s.replace('_', ' ')
322 return re.sub(r'=[a-fA-F0-9]{2}', _unquote_match, s, flags=re.ASCII) 300 return re.sub(r'=[a-fA-F0-9]{2}', _unquote_match, s, flags=re.ASCII)
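
Note on the header_encode change: the NEW side replaces the per-byte loop with one str.translate() call over a 256-entry list. The following is a minimal standalone sketch of that idea, not the patch itself. The names QUOPRI_MAP, HEADER_MAP, header_encode_loop and header_encode_translate are hypothetical and chosen only for illustration; the RFC chrome ('=?charset?q?...?=') is left out.

```python
import string

# Hypothetical names; this mirrors the idea of the patch, not its exact code.
# Start with the full "=XX" expansion for all 256 octets.
QUOPRI_MAP = ['=%02X' % c for c in range(256)]

# Header table: safe bytes map to themselves, space maps to underscore.
HEADER_MAP = QUOPRI_MAP[:]
for c in (b'-!*+/' + string.ascii_letters.encode('ascii')
          + string.digits.encode('ascii')):
    HEADER_MAP[c] = chr(c)
HEADER_MAP[ord(' ')] = '_'


def header_encode_loop(header_bytes):
    """Look up each byte individually, as on the OLD side of the diff."""
    return ''.join(HEADER_MAP[octet] for octet in header_bytes)


def header_encode_translate(header_bytes):
    """Encode with a single str.translate() call, as on the NEW side."""
    # latin-1 maps each byte 0..255 to the code point with the same value,
    # so every decoded character indexes directly into the 256-entry list.
    return header_bytes.decode('latin1').translate(HEADER_MAP)


sample = 'Caf\xe9 au lait'.encode('latin1')
print(header_encode_translate(sample))  # prints Caf=E9_au_lait
```

Both functions should return the same string for any bytes input. The translate version moves the per-character loop into the interpreter's C code, which is presumably where the speedup reported in the issue comes from.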