http.server Header Unicode Bug #55189

Closed
mitsuhiko opened this issue Jan 22, 2011 · 9 comments
Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

@mitsuhiko
Member

BPO 10980
Nosy @birkenfeld, @orsenthil, @vstinner, @benjaminp, @mitsuhiko, @ezio-melotti, @merwok
Files
  • http-server-unicode.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/birkenfeld'
    closed_at = <Date 2012-09-25.12:45:02.240>
    created_at = <Date 2011-01-22.12:45:50.503>
    labels = ['type-bug', 'library']
    title = 'http.server Header Unicode Bug'
    updated_at = <Date 2012-09-25.12:45:02.239>
    user = 'https://github.com/mitsuhiko'

    bugs.python.org fields:

    activity = <Date 2012-09-25.12:45:02.239>
    actor = 'pitrou'
    assignee = 'georg.brandl'
    closed = True
    closed_date = <Date 2012-09-25.12:45:02.240>
    closer = 'pitrou'
    components = ['Library (Lib)']
    creation = <Date 2011-01-22.12:45:50.503>
    creator = 'aronacher'
    dependencies = []
    files = ['20486']
    hgrepos = []
    issue_num = 10980
    keywords = ['patch']
    message_count = 9.0
    messages = ['126832', '126834', '126835', '126836', '126840', '126845', '147895', '147896', '147908']
    nosy_count = 7.0
    nosy_names = ['georg.brandl', 'orsenthil', 'vstinner', 'benjamin.peterson', 'aronacher', 'ezio.melotti', 'eric.araujo']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue10980'
    versions = ['Python 2.7', 'Python 3.2']

    @mitsuhiko
    Member Author

    I have a critical bugfix that should make it into Python 3.2 even though it is in the release candidate stage. Currently http.server.BaseHTTPServer encodes headers with the ASCII charset. This is at least in violation of PEP-3333, which demands that latin1 be used.

    Because HTTP itself suggests latin1 (iso-8859-1) I strongly recommend changing this in BaseHTTPServer and not wsgiref.

    The attached patch fixes that in a backwards compatible fashion.
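
To illustrate the difference being described, here is a minimal sketch (not the attached patch) that encodes a single header line with either charset. The helper name `encode_header_line` and the `X-Author` header are assumptions made up for this example.

```python
def encode_header_line(name: str, value: str, charset: str = "latin-1") -> bytes:
    """Encode one 'Name: value' header line with the given charset (strict mode)."""
    return f"{name}: {value}\r\n".encode(charset, "strict")


# 'ü' is U+00FC, which Latin-1 maps to the single byte 0xFC.
line = encode_header_line("X-Author", "Jürgen")

# The same value cannot be represented in ASCII, so strict encoding fails.
try:
    encode_header_line("X-Author", "Jürgen", charset="ascii")
except UnicodeEncodeError:
    ascii_ok = False
else:
    ascii_ok = True
```

With ASCII, any header value containing a character above U+007F raises `UnicodeEncodeError`; with Latin-1, everything up to U+00FF passes through as one byte per character.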

    @mitsuhiko mitsuhiko added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 22, 2011
    @vstinner
    Member

    Excerpt from PEP-3333: << Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding. >>

    What is the best choice for portability (HTTP servers and web browsers): latin1 or MIME encoding? Latin1 is a small subset of Unicode: only U+0000..U+00FF.

    Maybe we should let the user choose between Latin1, MIME, or something else (e.g. UTF-8, cp1252, ...). Or at least, you should try something like:

    try:
        data = text.encode('latin1')
    except UnicodeEncodeError:
        # encodeMIME is a placeholder for an RFC 2047 encoder
        data = encodeMIME(text, 'utf-8')
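
A runnable variant of the try/except idea above might look like the sketch below. It swaps the MIME fallback for `email.header.Header` from the standard library, which produces RFC 2047 encoded-words; the function name `encode_header_value` is an assumption for this example, and this is not what was eventually committed.

```python
from email.header import Header, decode_header


def encode_header_value(text: str) -> bytes:
    """Encode a header value as Latin-1, falling back to an RFC 2047 encoded-word."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        # Header(...).encode() returns a pure-ASCII string of the form =?charset?x?...?=
        return Header(text, "utf-8").encode().encode("ascii")


latin = encode_header_value("café")   # fits in Latin-1
mime = encode_header_value("日本")     # outside Latin-1, so it becomes an encoded-word

# Round-trip the encoded-word to check that nothing was lost.
decoded_bytes, charset = decode_header(mime.decode("ascii"))[0]
```

Whether receivers actually decode the encoded-word form is a separate question, which is what the rest of this comment is about.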

    Would it be a good idea to accept raw bytes headers? HTTP is *supposed* to be correctly encoded according to various RFCs, but in practice, anyone is free to do whatever they want.

    Sentence extracted randomly from the WWW (dec. 2008): "it seems that neither Tomcat 5.5 or 6 properly decodes HTTP headers as per RFC 2047! The Tomcat code assumes everywhere that header values use ISO-8859-1."

    Finally, why do you consider that this issue has to be fixed before Python 3.2?

    @vstinner
    Member

    RFC 5987 (Character Set and Language Encoding for Hypertext Transfer Protocol (HTTP) Header Field Parameters), August 2010:
    http://greenbytes.de/tech/webdav/rfc5987.html#language.specification.in.encoded.words

    << 3.3 Language Specification in Encoded Words

    Section 5 of [RFC2231] extends the encoding defined in [RFC2047] to also support language specification in encoded words. Although the HTTP/1.1 specification does refer to RFC 2047 ([RFC2616], Section 2.2), it's not clear to which header field exactly it applies, and whether it is implemented in practice (see <http://tools.ietf.org/wg/httpbis/trac/ticket/111> for details).

    Thus, this specification does not include this feature. >>

    Hm, OK, Latin1 looks safe and sufficient.
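
For context, RFC 5987 covers header field *parameters* (such as `title*=` or `filename*=`) rather than whole header values. A minimal sketch of its `ext-value` form, using only `urllib.parse` and supporting only the UTF-8 charset with an empty language tag, could look like this; the helper name `encode_ext_value` is an assumption for this example.

```python
from urllib.parse import quote, unquote

# RFC 5987 attr-char allows letters, digits and  ! # $ & + - . ^ _ ` | ~
# quote() already leaves letters, digits, "_", ".", "-" (and "~" on recent
# Python versions) unescaped, so only the remaining characters are listed here.
_ATTR_CHAR_EXTRA = "!#$&+^`|"


def encode_ext_value(value: str) -> str:
    """Build an RFC 5987 ext-value: charset, empty language tag, percent-encoded text."""
    return "UTF-8''" + quote(value, safe=_ATTR_CHAR_EXTRA, encoding="utf-8")


param = encode_ext_value("€ rates")
# Decoding reverses the percent-encoding after the second apostrophe.
original = unquote(param.split("''", 1)[1], encoding="utf-8")
```

The result is plain ASCII, so it survives regardless of whether the server encodes header lines as ASCII or Latin-1.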

    @mitsuhiko
    Member Author

    Georg Brandl signed off on the commit, and Python 3.2 will ship with the HTTP server accepting latin1 bytes.

    @birkenfeld
    Member

    Armin committed the patch in r88142 and followed up with r88143 for the http.client library.

    Needs backporting?

    @merwok
    Member

    merwok commented Jan 22, 2011

    I think so.

    @ezio-melotti
    Member

    Now it's too late for 3.1, should this still go to 2.7?

    @benjaminp
    Contributor

    Please.

    @mitsuhiko
    Member Author

    2.7 does not suffer from this since 2.7 does not support unicode in headers.

    @pitrou pitrou closed this as completed Sep 25, 2012
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022