Proper etiquette for a web crawler's HTTP requests


I have a simple web crawler that requests all the pages from a website's sitemap, which I need to cache and index. After many requests, the website starts serving blank pages.

There is nothing in their robots.txt except the link to their sitemap, so I assume I am not breaking their "rules". I send a descriptive header that links to an explanation of my intentions, and the only pages I crawl are the ones listed in their sitemap.

The HTTP status codes are all still OK, so I can only imagine they are blocking what they see as a large number of HTTP requests in a short period of time. What is considered a reasonable delay between requests?

Are there any other considerations I have overlooked that could cause this problem?


Every site looks for different crawler and abuse characteristics.

The key for any crawler is to emulate human-like activity and to obey robots.txt.
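As a rough illustration of obeying robots.txt, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, and the crawler name are all made up for the example; a real crawler would download the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name


def load_rules(robots_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Check each URL before requesting it.
print(rules.can_fetch(USER_AGENT, "https://example.com/index.html"))
print(rules.can_fetch(USER_AGENT, "https://example.com/private/data.html"))

# Crawl-delay is a non-standard directive; None means the site did not set one.
print(rules.crawl_delay(USER_AGENT))
```

Note that a robots.txt containing only a Sitemap line, as in the question, allows everything and sets no delay, so it says nothing about how fast the site is willing to be crawled.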

An exhaustive crawl will trip the defenses of some websites, and they will shut you down no matter how slowly you go, whereas other hosts do not mind crawlers zipping along and sucking everything up in one go. In general, you should not request pages faster than about 6 per minute (roughly human speed).
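One way to keep a crawler at a steady, human-like pace is to enforce a minimum interval between consecutive requests. The sketch below is only an illustration: the Throttle class is a made-up helper, and the interval is a parameter you would tune per site (10 seconds would correspond to about six requests per minute).

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep               # injectable for testing
        self._last = None                 # time of the previous request, if any

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        # Record when this request is released so the next one is spaced from it.
        self._last = self._clock()
        return slept
```

In a crawl loop, you would create one Throttle per host, for example Throttle(10.0), and call wait() immediately before each request. Time spent downloading and processing a page counts toward the interval, so the crawler only sleeps for whatever is left.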

  • You will be safer following links in order of their visibility on the webpage.
  • Try to ignore links that do not show up on the webpage (many people use them as honeypots).
  • If all else fails, do not request faster than one page per minute. If a website still blocks you at this rate, then contact them directly - they obviously do not want you to use their content in this way.
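The first two points can be approximated with a small link extractor. The following is a crude sketch built on Python's standard html.parser: it assumes that document order is a reasonable stand-in for visibility order, and it only detects links hidden through the hidden attribute or an inline style on the link itself. Links hidden by external CSS or by a hidden parent element will slip through. The class name and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class VisibleLinkParser(HTMLParser):
    """Collect href values in document order, skipping links that look hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "display:none" match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        hidden = (
            "hidden" in attr
            or "display:none" in style
            or "visibility:hidden" in style
        )
        # Keep the first occurrence of each visible link, in page order.
        if not hidden and href not in self.links:
            self.links.append(href)


# Hypothetical page fragment with two honeypot-style links.
SAMPLE_HTML = """
<a href="/first">First</a>
<a href="/trap" style="display: none">trap</a>
<a href="/second">Second</a>
<a href="/trap2" hidden>trap</a>
<a href="/first">First again</a>
"""

parser = VisibleLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links)  # ['/first', '/second']
```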

