webserver - What is the optimum duration for a web crawler to wait between repeated requests to a web server?


Is there a standard time period that a crawler should wait between repeated hits on the same server, so that it does not overload the server?

If not, what would be a good wait time for the crawler to be regarded as polite? Any suggestions?

Does this value vary from server to server, and if so, how can the crawler determine it?

This article on IBM goes into some detail. To quote:

The first time a page is crawled, the crawler uses the date and time that the page is crawled and the average of the specified minimum and maximum recrawl intervals to set a recrawl date. The page will not be recrawled before that date. When the page is actually recrawled after that date depends on the crawler load and the balance of new and old URLs in the crawl space.

Every time the page is recrawled, the crawler checks whether the content has changed. If the content has changed, the next recrawl interval will be shorter than the previous one, but never shorter than the specified minimum recrawl interval. If the content has not changed, the next recrawl interval will be longer than the previous one, but never longer than the specified maximum recrawl interval.
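The scheme quoted above can be sketched roughly as follows. This is only an illustration, not IBM's implementation: the function names and the `factor` parameter are made up here, and the quote does not say by how much the interval shrinks or grows, so halving and doubling are assumptions.

```python
from datetime import timedelta


def initial_interval(min_interval: timedelta, max_interval: timedelta) -> timedelta:
    """First recrawl interval: the average of the configured minimum and maximum."""
    return (min_interval + max_interval) / 2


def next_interval(
    previous: timedelta,
    content_changed: bool,
    min_interval: timedelta,
    max_interval: timedelta,
    factor: float = 2.0,  # assumed step size; the quoted text does not specify one
) -> timedelta:
    """Shrink the interval if the page changed, grow it if it did not.

    The result is always clamped to [min_interval, max_interval].
    """
    if content_changed:
        candidate = previous / factor
    else:
        candidate = previous * factor
    return max(min_interval, min(candidate, max_interval))


if __name__ == "__main__":
    lo = timedelta(hours=1)   # example minimum recrawl interval
    hi = timedelta(days=7)    # example maximum recrawl interval

    interval = initial_interval(lo, hi)
    print("first interval:", interval)

    # Simulate a page that changes twice and then stays the same.
    for changed in (True, True, False, False, False):
        interval = next_interval(interval, changed, lo, hi)
        print("changed" if changed else "unchanged", "->", interval)
```

With these example bounds, a page that keeps changing drifts toward the one-hour minimum, while a static page backs off toward the seven-day maximum, so per-page request frequency tracks how often the content actually changes.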

This describes their own web crawler, but it is very useful reading when building your own tool.

