bpo-39187: robotparser does not respect longest match #17794
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA. For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. If you have recently signed the CLA, please wait at least one business day. You can check yourself to see if the CLA has been received.

Thanks again for the contribution; we look forward to reviewing it!
serhiy-storchaka left a comment
Thank you for your PR, @andreburgaud.
Unfortunately, this approach cannot be used when the patterns contain the special characters * or $. But since robotparser does not currently support them, at least we will get more correct behavior for robots.txt files that don't contain them.
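For readers following the discussion, the longest-match rule being reviewed can be sketched in a few lines of plain Python. This is a hedged illustration, not the PR's actual code: the function name `is_allowed` and the `(allow, path)` rule format are invented here, and it handles only literal path prefixes, matching the `*`/`$` limitation noted above.

```python
# Illustrative sketch of longest-match rule selection per the REP draft
# (now RFC 9309). Not the PR's code: is_allowed and the (allow, path)
# rule format are invented for this example. Only literal prefixes are
# handled -- no '*' or '$' wildcards.

def is_allowed(rules, url_path):
    """Return True if url_path is allowed under the given rules.

    rules is a list of (allow, path) pairs. The matching rule with the
    longest path wins; on a length tie between an Allow and a Disallow,
    the less restrictive rule (Allow) wins, because the comparison key
    (n, True) sorts above (n, False).
    """
    best = None
    for allow, path in rules:
        if url_path.startswith(path):
            key = (len(path), allow)
            if best is None or key > best:
                best = key
    return True if best is None else best[1]
```

For example, with `Disallow: /folder1/` and `Allow: /folder1/myfile.html`, the longer Allow wins for `/folder1/myfile.html`, while `/folder1/anotherfile.html` only matches the Disallow.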
class LongestMatchUserAgentTest(BaseRobotTest, unittest.TestCase):
    # https://tools.ietf.org/html/draft-koster-rep-00#section-3.2
This document is now available as https://datatracker.ietf.org/doc/html/rfc9309. Please update all links.

But this test also passes if all Allow rules have higher priority than any Disallow rule. So please also add reversed rules here (short Allow, long Disallow).
Thank you for the suggestion, @serhiy-storchaka. Working on it.
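The reversed case the reviewer asks for can be sketched as follows. This is a standalone, hedged re-implementation of longest-match for illustration only — the helper name and the `/folder1/private.html` path are invented here, not taken from the test suite:

```python
# Why reversed rules matter: a naive "Allow always beats Disallow" policy
# and a correct longest-match policy agree on (long Allow, short Disallow)
# but disagree on (short Allow, long Disallow). Standalone sketch; the
# function name and paths are invented for this example.

def longest_match_allowed(rules, path):
    # rules: (allow, prefix) pairs; the longest matching prefix wins,
    # and Allow wins equal-length ties per RFC 9309.
    matches = [(len(p), a) for a, p in rules if path.startswith(p)]
    return max(matches)[1] if matches else True

reversed_rules = [
    (True, '/folder1/'),               # short Allow
    (False, '/folder1/private.html'),  # long Disallow (hypothetical path)
]
```

Under these rules, `/folder1/private.html` must be disallowed (the longer Disallow wins), which is exactly the case an "Allow always wins" implementation would get wrong.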
bad = ['/folder1/anotherfile.html']

class LongestMatchUserAgentTest(BaseRobotTest, unittest.TestCase):
This test replaces GoogleURLOrderingTest. I think GoogleURLOrderingTest should now be removed, because it has lost its meaning.
Agreed. I'm removing it.
Disallow: /folder1/
Allow: /folder1/
"""
good = ['/folder1/myfile.html', '/folder1', '/folder1']
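Worth noting about the snippet above: `/folder1/myfile.html` matches both the Disallow line and the Allow line with equal prefix length, and RFC 9309 resolves such ties in favor of the least restrictive rule (Allow), while `/folder1` without a trailing slash matches neither rule. A small hedged check of that tie-break (standalone helper invented for this sketch, not a stdlib API):

```python
# Equal-length tie between Allow and Disallow: the least restrictive
# rule (Allow) wins under RFC 9309. Standalone illustration; tie_allowed
# is invented for this sketch.

def tie_allowed(rules, path):
    matches = [(len(p), a) for a, p in rules if path.startswith(p)]
    # On a length tie, (n, True) sorts above (n, False), so Allow wins.
    return max(matches)[1] if matches else True

tie_rules = [(False, '/folder1/'), (True, '/folder1/')]
```

So both entries in `good` are consistent with longest-match semantics: one via the Allow tie-break, the other because no rule matches it at all.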
Double '/folder1'.
Stupid mistake on my end. Thx. Removing the unnecessary double 'folder1'.
https://bugs.python.org/issue39187