classification
Title: robotparser deny all with some rules
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 3.5

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Jmgray47, Patrick Valibus 410 Gone, arnaud
Priority: normal
Keywords:

Created on 2019-03-06 09:42 by quentin-maire, last changed 2020-07-31 13:24 by arnaud.

Messages (10)
msg337285 - (view) Author: wats0ns (quentin-maire) Date: 2019-03-06 09:42
RobotFileParser parses a "Disallow: ?" rule as deny-all, but this is a valid rule that should instead be interpreted as "Disallow: /?*" or "Disallow: /*?*".
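
A minimal sketch of the reported behavior, feeding the rule to urllib.robotparser directly instead of fetching a live robots.txt (the example URL is illustrative):

from urllib.robotparser import RobotFileParser

# Feed the parser a robots.txt containing only the rule in question.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: ?",
])

# The parser normalizes the bare "?" to an empty path, and an empty
# path prefix matches every URL, so even an ordinary page with no
# query string is reported as disallowed.
print(rp.can_fetch("*", "https://example.com/index.html"))  # False (deny-all)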
msg338293 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-03-18 22:13
Can you provide a link to documentation showing that "Disallow: ?" shouldn't be the same as deny all?  Thanks!
msg338298 - (view) Author: wats0ns (quentin-maire) Date: 2019-03-18 23:20
I can't find documentation about it, but all of the robots.txt checkers I have found behave this way. You can test against this site: http://www.eskimoz.fr/robots.txt. I believe this is how it's implemented in most parsers now.
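
For comparison, a hypothetical sketch of the interpretation described above, treating "Disallow: ?" like "Disallow: /*?*", i.e. blocking only URLs that carry a query string (the helper blocks_query_urls is made up for illustration):

from urllib.parse import urlparse

def blocks_query_urls(url):
    # Hypothetical reading of "Disallow: ?": block only URLs that
    # contain a query string, as "Disallow: /*?*" would.
    return urlparse(url).query != ""

print(blocks_query_urls("http://www.eskimoz.fr/"))         # False: no query, crawlable
print(blocks_query_urls("http://www.eskimoz.fr/?page=2"))  # True: query present, blocked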
msg365770 - (view) Author: Rodriguez (lagustais) Date: 2020-04-04 16:46
I can't display my robots.txt. I want to ban robots:
 https://melwynn-rodriguez.fr/robots.txt
msg366509 - (view) Author: asca (artasca) Date: 2020-04-15 12:57
I thought it was going to work, but apparently when I try https://www.actusite.fr/robots.txt, it doesn't.
msg367546 - (view) Author: Fred AYERS (Fred AYERS) Date: 2020-04-28 17:20
I tried this one, http://gtxgamer.fr/robots.txt, and it seems to work.
msg370275 - (view) Author: mathias44 (mathias44) Date: 2020-05-28 23:54
I can't display my robots.txt. I want to ban robots: https://ereputation-dereferencement.fr/
msg372112 - (view) Author: Patrick Valibus 410 Gone (Patrick Valibus 410 Gone) Date: 2020-06-22 20:35
Hello, we couldn't get it to work. We used it as part of an SEO test, since we are trying to reproduce alternatives to Scrapy. For example, the robot should crawl our SEO agency's page https://www.410-gone.fr/seo.html but should not accept pages ending in /*.php$; yet it does, even though they are blocked in our robots.txt. Thanks.
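
For context: urllib.robotparser implements the original robots.txt convention, which has no wildcard (*) or end-anchor ($) syntax; those are later extensions, so a rule like "Disallow: /*.php$" is compared as a literal path prefix and never matches. A short sketch (URLs taken from the message above, used only for illustration):

from urllib.robotparser import RobotFileParser

# "/*.php$" is stored as a literal, percent-quoted prefix rather
# than a wildcard pattern, so it matches no real path.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /*.php$",
])

print(rp.can_fetch("*", "https://www.410-gone.fr/index.php"))  # True: the rule has no effect
print(rp.can_fetch("*", "https://www.410-gone.fr/seo.html"))   # True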
msg374629 - (view) Author: James Gray (Jmgray47) Date: 2020-07-31 04:34
Hello, I see we are not the only ones in this situation. We need robots to index our HTML pages but not the ones matching /*.php$, nor the PDF resources. We tried several solutions via the robots.txt at the root of our domain https://demolinux.org/, but without success. RobotFileParser does not take these rules into account. Thanks.
msg374642 - (view) Author: arnaud (arnaud) Date: 2020-07-31 13:24
Do you have documentation about robotparser? The robots.txt of this website works fine: https://vauros.com/
History
Date                 User                      Action  Args
2020-07-31 13:24:17  arnaud                    set     nosy: + arnaud; messages: + msg374642
2020-07-31 04:34:49  Jmgray47                  set     nosy: + Jmgray47; messages: + msg374629
2020-06-22 20:35:43  Patrick Valibus 410 Gone  set     nosy: + Patrick Valibus 410 Gone, - cheryl.sabella, quentin-maire, lagustais, artasca, Fred AYERS, mathias44; messages: + msg372112
2020-05-28 23:54:56  mathias44                 set     nosy: + mathias44; messages: + msg370275
2020-04-28 17:20:52  Fred AYERS                set     nosy: + Fred AYERS; messages: + msg367546
2020-04-15 12:57:20  artasca                   set     nosy: + artasca; messages: + msg366509
2020-04-04 16:46:51  lagustais                 set     nosy: + lagustais; messages: + msg365770
2019-03-18 23:20:00  quentin-maire             set     messages: + msg338298
2019-03-18 22:13:37  cheryl.sabella            set     nosy: + cheryl.sabella; messages: + msg338293
2019-03-06 09:42:01  quentin-maire             create