Message359181
As per the current Robots Exclusion Protocol internet draft (https://tools.ietf.org/html/draft-koster-rep-00#section-3.2), a robot should apply the rule with the longest match. urllib.robotparser instead relies on the order of the rules in the robots.txt file. Here is the relevant section of the spec:
===================
3.2. Longest Match
The following example shows that in the case of a two rules, the
longest one MUST be used for matching. In the following case,
/example/page/disallowed.gif MUST be used for the URI
example.com/example/page/disallow.gif .
<CODE BEGINS>
User-Agent : foobot
Allow : /example/page/
Disallow : /example/page/disallowed.gif
<CODE ENDS>
===================
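To make the intended semantics concrete, here is a minimal sketch of longest-match rule selection as described in the quoted section. This is an illustration of the draft's rule, not urllib.robotparser's actual implementation; the function name and the (allow, path) tuple representation are my own.

```python
def longest_match_allowed(path, rules):
    """Decide access per longest-match: rules is a list of
    (allow: bool, rule_path: str) pairs; the matching rule with the
    longest path wins. No matching rule means the path is allowed."""
    best = None  # (match_length, allow) of the longest match so far
    for allow, rule_path in rules:
        if path.startswith(rule_path):
            if best is None or len(rule_path) > best[0]:
                best = (len(rule_path), allow)
    return best[1] if best else True

# The example from the draft: the longer Disallow rule must win.
rules = [(True, "/example/page/"), (False, "/example/page/disallowed.gif")]
print(longest_match_allowed("/example/page/disallowed.gif", rules))  # False
```

Note that the result is independent of the order of `rules`, which is the property the draft requires.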
I'm attaching a simple test file "test_robot.py".
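The order dependence can be reproduced directly with urllib.robotparser. This is a minimal sketch along the lines of the attached test (the attachment itself is not reproduced here):

```python
import urllib.robotparser

# The example from the draft: Allow listed before Disallow.
SPEC_ORDER = [
    "User-agent: foobot",
    "Allow: /example/page/",
    "Disallow: /example/page/disallowed.gif",
]
# The same rules with Disallow listed first.
REVERSED_ORDER = [SPEC_ORDER[0], SPEC_ORDER[2], SPEC_ORDER[1]]

URL = "http://example.com/example/page/disallowed.gif"

for rules in (SPEC_ORDER, REVERSED_ORDER):
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)
    print(rp.can_fetch("foobot", URL))
```

This prints True, then False: robotparser applies the first rule whose path is a prefix of the URL path, so the answer flips with rule order. Under longest-match the Disallow rule would win in both cases.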
Date | User | Action | Args
2020-01-02 04:14:05 | gallicrooster | set | recipients: + gallicrooster
2020-01-02 04:14:05 | gallicrooster | set | messageid: <1577938445.91.0.743054392693.issue39187@roundup.psfhosted.org>
2020-01-02 04:14:05 | gallicrooster | link | issue39187 messages
2020-01-02 04:14:05 | gallicrooster | create |