
classification
Title: utf8 in BytesGenerator
Type: behavior
Stage:
Components: email
Versions: Python 3.9
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barry, chrisstaunton1990, darcy.beurle, iritkatriel, r.david.murray
Priority: normal
Keywords:

Created on 2021-02-26 21:50 by darcy.beurle, last changed 2022-04-11 14:59 by admin.

Files
File name: sample.py
Uploaded: chrisstaunton1990, 2022-02-24 13:20
Description: a sample script containing a sample email string in MIME format
Messages (4)
msg387749 - Author: Darcy Beurle (darcy.beurle) Date: 2021-02-26 21:50
I have some emails, formatted according to RFC 822, that I'm importing from an XML file. Some of them use an encoding other than ASCII. I create the message with the default policy:

import email
import email.policy

message = email.message_from_string(
    # Extract the message text from the XML element
    message_name.find("property_string").text,
    policy=email.policy.default)

Then I want to convert this to bytes so I can append it to an IMAP folder using the imap_tools package:

mailbox.append(email.as_bytes(),
               "INBOX",
               dt=None,
               flag_set=(imap_tools.MailMessageFlags.SEEN))

This then leads to the following traceback:

line 405, in parse_goldmine_output
    email.as_bytes(),
  File "/usr/lib64/python3.9/email/message.py", line 178, in as_bytes
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib64/python3.9/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib64/python3.9/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib64/python3.9/email/generator.py", line 218, in _dispatch
    meth(msg)
  File "/usr/lib64/python3.9/email/generator.py", line 276, in _handle_multipart
    g.flatten(part, unixfrom=False, linesep=self._NL)
  File "/usr/lib64/python3.9/email/generator.py", line 116, in flatten
    self._write(msg)
  File "/usr/lib64/python3.9/email/generator.py", line 181, in _write
    self._dispatch(msg)
  File "/usr/lib64/python3.9/email/generator.py", line 218, in _dispatch
    meth(msg)
  File "/usr/lib64/python3.9/email/generator.py", line 436, in _handle_text
    super(BytesGenerator,self)._handle_text(msg)
  File "/usr/lib64/python3.9/email/generator.py", line 253, in _handle_text
    self._write_lines(payload)
  File "/usr/lib64/python3.9/email/generator.py", line 155, in _write_lines
    self.write(line)
  File "/usr/lib64/python3.9/email/generator.py", line 410, in write
    self._fp.write(s.encode('ascii', 'surrogateescape'))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-43: ordinal not in range(128)


If I change the line:

self._fp.write(s.encode('ascii', 'surrogateescape'))

to:

self._fp.write(s.encode('utf8', 'surrogateescape'))

then the email body is written with the non-ASCII characters intact (the same as in the XML). I'm not sure how to proceed. It should be possible to process these emails, but the bytes generator doesn't seem to pick up the UTF-8 encoding from anywhere (e.g. when a utf8 policy is used).
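
For context, here is a minimal sketch of why the 'ascii' plus 'surrogateescape' combination on that line rejects real non-ASCII characters. It uses only built-in str/bytes methods; the sample word is an arbitrary choice for illustration. The surrogateescape handler only round-trips bytes that were smuggled into a string as lone surrogates when decoding; it does not encode ordinary Unicode characters.

```python
# Raw UTF-8 bytes, as they might arrive from a socket or a file.
original = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'

# Decoding as ASCII with surrogateescape maps each non-ASCII byte
# to a lone surrogate (0xC3 -> U+DCC3, 0xA9 -> U+DCA9).
escaped = original.decode("ascii", "surrogateescape")

# Encoding with the same handler turns the surrogates back into bytes.
round_tripped = escaped.encode("ascii", "surrogateescape")

# A genuine non-ASCII character is not a surrogate, so the handler
# cannot help and the ascii codec raises.
try:
    "caf\u00e9".encode("ascii", "surrogateescape")
    real_char_failed = False
except UnicodeEncodeError:
    real_char_failed = True

print(round_tripped == original, real_char_failed)
```

This suggests the generator expects a payload that originated as bytes (and so contains surrogates), rather than a str holding real non-ASCII characters.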
msg411464 - Author: Irit Katriel (iritkatriel) (Python committer) Date: 2022-01-24 11:05
Are you able to provide a runnable script that reproduces the error?
msg411513 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2022-01-24 20:13
Yeah, I think we need a complete example here.

Note that in the general case there is no such thing as an RFC-valid email in Unicode (which is what Python strings are), though with utf8=True and an email involving only text you might get away with it. I assume you've tried policy=policy.default.clone(utf8=True) when creating the email?

It will probably help to encode the 'text' to utf8 and use message_from_bytes to read it, but that may not be your only problem. Whether this works in the general case depends on exactly what is in the message and how the message was recorded in your XML. The XML conversion may already have lost information, but hopefully not.
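
A minimal sketch of that suggestion, assuming the extracted text is a complete message whose body really is UTF-8 and is declared as such. The message text below is a hypothetical stand-in for whatever the XML contains, not data from this report:

```python
import email
import email.policy

# Hypothetical stand-in for the text extracted from the XML.
text = (
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "caf\u00e9 and a joiner\u200d\n"
)

# Encode first, then let the bytes parser read the message. The parser
# keeps the non-ASCII bytes as surrogate escapes, which the bytes
# generator knows how to write back out.
raw_bytes = text.encode("utf-8")
message = email.message_from_bytes(raw_bytes, policy=email.policy.default)

flattened = message.as_bytes()  # bytes suitable for e.g. an IMAP append
body = message.get_content()    # body decoded using the declared charset

print(flattened)
print(repr(body))
```

If the headers in the real data declare a different charset or transfer encoding than the text actually uses, this sketch may not be enough on its own.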
msg413906 - Author: Chris (chrisstaunton1990) Date: 2022-02-24 13:20
I found this issue while googling the error. I'm having the same problem with as_bytes() breaking on non-ASCII characters.

I've tried policy=policy.default.clone(utf8=True) but it gives the same error. 

The attached sample.py contains a sample email as a string, with the character \u200d (Zero Width Joiner, https://unicode-table.com/en/200D/) in the body.

UnicodeEncodeError: 'ascii' codec can't encode character '\u200d' in position 70: ordinal not in range(128)

Any assistance on what I can do to solve it would be great. It seems I can parse 99% of the emails I've tried but this one has me confused.
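
A minimal sketch of the kind of script that exercises this path, for anyone who wants to try it. The message text is hypothetical (it is not the attached sample.py) and try_as_bytes is a helper name made up for illustration. On the Python 3.9 reported here both policies are expected to fail as described above; other versions may behave differently, so the script reports the outcome rather than assuming it.

```python
import email
import email.policy

# Hypothetical message with a Zero Width Joiner in the body.
RAW = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: zero width joiner\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "body with a joiner\u200d here\n"
)


def try_as_bytes(raw, policy):
    """Parse raw as a str message and return the bytes or the error."""
    msg = email.message_from_string(raw, policy=policy)
    try:
        return msg.as_bytes()
    except UnicodeEncodeError as exc:
        return exc


policies = {
    "default": email.policy.default,
    "default, utf8=True": email.policy.default.clone(utf8=True),
}
results = {name: try_as_bytes(RAW, pol) for name, pol in policies.items()}

for name, result in results.items():
    print(name, "->", type(result).__name__, result)
```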
History
Date                 User               Action  Args
2022-04-11 14:59:42  admin              set     github: 87499
2022-02-24 13:20:32  chrisstaunton1990  set     files: + sample.py; nosy: + chrisstaunton1990; messages: + msg413906
2022-01-24 20:13:44  r.david.murray     set     messages: + msg411513
2022-01-24 11:05:54  iritkatriel        set     type: crash -> behavior; messages: + msg411464; nosy: + iritkatriel
2021-02-26 21:50:13  darcy.beurle       create