Proper etiquette for a web crawler's HTTP requests

September 15, 2013

I have a simple web crawler that requests all the pages listed in a website's sitemap, which I need to cache and index. After a number of requests, the website starts serving blank pages.

There is nothing in their robots.txt except the link to their sitemap, so I assume I am not breaking their "rules". I send a descriptive header that links to an explanation of my intentions, and the only pages I crawl are the ones listed in their sitemap.

The HTTP status codes are all still OK, so I can only imagine they are blocking a large number of HTTP requests made in a short period of time. What is considered a reasonable delay between requests?

Are there any other considerations I have overlooked that could potentially cause this problem?
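Since the status codes stay OK while the pages come back blank, the crawler cannot rely on the status code alone to notice the problem. A minimal sketch of a check for this symptom follows; the function name `looks_blank` and the `min_chars` threshold are hypothetical and not part of the original crawler.

```python
def looks_blank(status, body, min_chars=1):
    """Return True for a 2xx response whose body is empty or whitespace only.

    `min_chars` is an assumed threshold; raise it if the site serves a
    small stub page instead of a truly empty body.
    """
    # Accept either raw bytes or already-decoded text.
    if isinstance(body, bytes):
        text = body.decode("utf-8", errors="replace")
    else:
        text = body
    return 200 <= status < 300 and len(text.strip()) < min_chars
```

A crawler could call this on every response and slow down or pause as soon as it returns True, rather than caching blank pages.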

Every site looks for different crawler and abuse characteristics.

The key for any crawler is to simulate human activity and obey robots.txt.
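As a rough illustration of obeying robots.txt, the sketch below uses Python's standard `urllib.robotparser` on an already-downloaded file. The crawler name `ExampleCrawler`, the `example.com` URLs, and the robots.txt content are all made up for the example; the site in the question reportedly lists only a sitemap.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name

def build_parser(robots_lines):
    """Build a parser from the lines of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Hypothetical robots.txt content, used only for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
""".splitlines()

parser = build_parser(robots_txt)

# Check each URL before requesting it.
allowed = parser.can_fetch(USER_AGENT, "https://example.com/articles/1")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# crawl_delay() returns None when the file gives no Crawl-delay.
delay = parser.crawl_delay(USER_AGENT)
```

If a site does publish a `Crawl-delay`, using it as the minimum pause between requests is the simplest way to stay within what the site has asked for.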

An exhaustive crawl will trip some websites, and they will shut you down no matter how slowly you go, while other hosts do not mind crawlers zipping along and pulling everything in one go. As a general guideline, do not request more than about 6 pages per minute.
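A minimal sketch of such a rate limit, assuming the figure of 6 refers to pages per minute (one request every 10 seconds). The `Throttle` class and its parameters are hypothetical names; the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, pages_per_minute=6, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / pages_per_minute  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each HTTP request keeps the crawler at or below the chosen rate, regardless of how long each download takes.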

You will be safer following links in order of their visibility on the webpage.

Try to ignore links that do not show up on the webpage (many people use honeypots).
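One rough way to approximate this is to collect links in document order and skip anchors that are hidden inline. The sketch below uses the standard `html.parser` module; the class and function names are hypothetical. It only inspects the `hidden` attribute and inline `style` of the `<a>` tag itself, so links hidden through stylesheets, scripts, or hidden parent elements would still slip through.

```python
from html.parser import HTMLParser

class VisibleLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that are not hidden inline, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attr
            or "display:none" in style
            or "visibility:hidden" in style
        )
        if not hidden:
            self.links.append(href)

def visible_links(html):
    """Return the non-hidden link targets found in an HTML string."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

Document order is used here as a stand-in for visual prominence, which is an assumption; a page's layout can place later markup higher on the screen.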

If all else fails, do not request faster than one page per minute. If a website blocks you at this rate, then contact them directly; they evidently do not want you to use their content in this way.
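The fallback rule can be sketched as a small policy: slow down to one page per minute once a response looks blocked, and stop crawling if the site still blocks at that rate. The names and the 10-second starting delay below are assumptions for illustration only.

```python
NORMAL_DELAY = 10.0    # seconds; an assumed starting pace
FALLBACK_DELAY = 60.0  # seconds; one page per minute

def next_delay(current_delay, looked_blocked):
    """Return the delay to use before the next request."""
    if looked_blocked:
        # Never go faster than the fallback once a block is suspected.
        return max(current_delay, FALLBACK_DELAY)
    return current_delay

def should_stop(current_delay, looked_blocked):
    """True when the site still blocks us at the slowest rate."""
    return looked_blocked and current_delay >= FALLBACK_DELAY
```

When `should_stop` returns True, the crawler should halt and the operator should contact the site owner instead of retrying.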

Raj T
