Proper etiquette for a web crawler's HTTP requests
I have a simple web crawler that requests all the pages listed in a website's sitemap, which I need to cache and index. After a number of requests, the website starts serving blank pages.
There is nothing in their robots.txt except the link to their sitemap, so I assume I am not breaking their "rules". I send a descriptive header that links to an explanation of my intentions, and the only pages I crawl are the ones in their sitemap.
The HTTP status codes are all still OK, so I can only imagine they are blocking a large number of HTTP requests made in a short period of time. What is considered a reasonable delay between requests?
Are there any other considerations I have overlooked that could potentially cause this problem?
Every site looks for different crawler and abuse characteristics.
The key for any crawler is to emulate human activity and to obey robots.txt.
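As a rough illustration of the robots.txt part, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleCrawler/1.0` user agent, and the `example.com` URLs are made up for the example; a real crawler would load the file from the target site. Note that `Crawl-delay` is a non-standard directive that only some sites publish.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

# Hypothetical user-agent string; use one that identifies your crawler.
USER_AGENT = "ExampleCrawler/1.0"

rules = load_rules(EXAMPLE_ROBOTS)

# True when the rules permit this user agent to fetch the URL.
allowed = rules.can_fetch(USER_AGENT, "https://example.com/page.html")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/x.html")

# Seconds requested by the site via Crawl-delay, or None if not specified.
delay = rules.crawl_delay(USER_AGENT)
```

If `crawl_delay` returns `None`, as it would for a robots.txt that contains only a sitemap link, the site has not stated a preference and the crawler has to choose a conservative delay itself.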
An exhaustive crawl will trip the defenses of some websites, and they will shut you out no matter how slowly you go, while others do not mind crawlers zipping along and sucking up everything in one pass. As a general guideline, do not request more than about 6 pages per minute.
If all else fails, do not request faster than one page per minute. If a website blocks you at this rate, contact them directly; they evidently do not want their content used in this way.
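A fixed delay between requests could be sketched roughly as follows. This is only an illustration, not a definitive crawler: `crawl_politely`, `fetch_with_urllib`, and the user-agent string are hypothetical names, and the 60-second default simply mirrors the one-page-per-minute fallback above. The `sleep` and `clock` parameters exist so the throttle can be exercised without actually waiting.

```python
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    min_delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[Tuple[str, bytes]]:
    """Fetch URLs one at a time, keeping at least min_delay_seconds
    between the start of consecutive requests."""
    last_request = None
    for url in urls:
        if last_request is not None:
            # Sleep only for whatever part of the delay has not yet elapsed.
            remaining = min_delay_seconds - (clock() - last_request)
            if remaining > 0:
                sleep(remaining)
        last_request = clock()
        yield url, fetch(url)


def fetch_with_urllib(url: str) -> bytes:
    """One possible fetch function; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "ExampleCrawler/1.0 (+https://example.com/bot)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()
```

A call such as `for url, body in crawl_politely(sitemap_urls, fetch_with_urllib): ...` would then process one page per minute, where `sitemap_urls` is whatever list was extracted from the sitemap. Because the status codes stay OK in the situation described, it may also be worth checking whether `body` is empty and pausing the crawl when it is.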