Title: HTMLParser: subsequent duplicate attributes should be ignored
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.10
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: ezio.melotti, karlcow
Priority: normal Keywords:

Created on 2021-01-04 08:00 by karlcow, last changed 2021-01-06 06:46 by ezio.melotti. This issue is now closed.

Messages (2)
msg384308 - Author: karl (karlcow) Date: 2021-01-04 08:00
This came up while working on issue 41748.

browser input 
data:text/html,<!doctype html><div class="bar" class="foo">text</div>

browser output
<div class="bar">text</div>

Actual HTMLParser output
('starttag', 'div', [('class', 'bar'), ('class', 'foo')])

Expected HTMLParser output
('starttag', 'div', [('class', 'bar')])
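
The actual output above can be reproduced with a small collector subclass. This is only a sketch: the class name `EventCollector` and the `events` list are illustrative names, not part of `html.parser`; only `HTMLParser`, `feed`, `close`, and `handle_starttag` come from the standard library.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record start tags in the same tuple shape used in this report."""

    def __init__(self):
        super().__init__()
        self.events = []  # illustrative attribute, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs in source order;
        # duplicate attribute names are passed through unchanged.
        self.events.append(('starttag', tag, attrs))


parser = EventCollector()
parser.feed('<!doctype html><div class="bar" class="foo">text</div>')
parser.close()
print(parser.events)
```

Both `class` pairs appear in the recorded event, in the order they occur in the markup.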
msg384475 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2021-01-06 06:46
If we follow the behavior of the browser, we will have to pick one of the two values and discard the other, making the discarded value inaccessible. If we provide both, scripts and libraries that use HTMLParser will have access to both and can decide what to do.

For example, BeautifulSoup already does the right thing:
>>> bs4.BeautifulSoup('<!doctype html><div class="bar" class="foo">text</div>')
<!DOCTYPE html>
<html><body><div class="bar">text</div></body></html>

Changing this might also break code that relies on this behavior. I'm therefore going to close this as "not a bug".
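
Code that wants the browser-like result can drop later duplicates itself, since HTMLParser hands over every pair. A minimal sketch, assuming a first-value-wins policy; `first_wins` and `FirstAttrParser` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


def first_wins(attrs):
    """Keep only the first value seen for each attribute name."""
    seen = {}
    for name, value in attrs:
        # setdefault leaves an existing entry alone, so later duplicates are ignored.
        seen.setdefault(name, value)
    return list(seen.items())


class FirstAttrParser(HTMLParser):
    """Illustrative subclass that de-duplicates attributes on start tags."""

    def __init__(self):
        super().__init__()
        self.events = []  # illustrative attribute, not part of HTMLParser

    def handle_starttag(self, tag, attrs):
        self.events.append(('starttag', tag, first_wins(attrs)))


dedup_parser = FirstAttrParser()
dedup_parser.feed('<!doctype html><div class="bar" class="foo">text</div>')
dedup_parser.close()
print(dedup_parser.events)
```

Note that a plain `dict(attrs)` would keep the last value instead of the first, so the choice of policy stays with the caller.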
Date                 User          Action  Args
2021-01-06 06:46:27  ezio.melotti  set     status: open -> closed; type: behavior; assignee: ezio.melotti; nosy: + ezio.melotti; messages: + msg384475; resolution: not a bug; stage: resolved
2021-01-04 08:00:54  karlcow       create