BAR-CODE opened this issue on Dec 03, 2006 · 35 posts
TerraDreamer posted Fri, 08 December 2006 at 9:58 PM
Quote - **Hello, JenniferC:**
[shakes head slowly, dabs away welling tears] Oh gawd...poor Jennifer :ohmy:
Quote - This is my first computer. Would you be so kind as to explain the last paragraph to me? Talk to me like a 'third' grader. I'm not the brightest bulb in the room. Are you saying that as you increase servers, etc., you get more traffic and it slows down to a sum, net gain of zero? Or even a net loss of site speed?
Wikipedia is your friend, and if you managed to complete 6th grade, you should be able to understand the following excerpt...
"Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers.
As noted by Koster (Koster, 1995), the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community. The costs of using Web crawlers include:
A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol (Koster, 1996), which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently, commercial search engines like Ask Jeeves, MSN and Yahoo are able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests.
The first proposal for the interval between connections was given in (Koster, 1993) and was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download that entire website alone; also, only a fraction of the resources from that Web server would be used. This does not seem acceptable.
Cho (Cho and Garcia-Molina, 2003) uses 10 seconds as an interval for accesses, and the WIRE crawler (Baeza-Yates and Castillo, 2002) uses 15 seconds as the default. The Mercator Web crawler (Heydon and Najork, 1999) follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page. Dill et al. (Dill et al., 2002) use 1 second.
Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes. It is worth noting that even when a crawler is very polite and takes every safeguard to avoid overloading Web servers, some complaints from Web server administrators are still received. Brin and Page note that: "... running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen." (Brin and Page, 1998)."
See en.wikipedia.org/wiki/Web_crawler
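To make that "Crawl-delay:" bit concrete, here's roughly what a robots.txt using it might look like. The paths and the 10-second value are made-up examples, and Crawl-delay is a non-standard extension that only some crawlers honor:

```
# robots.txt, served from the site root (e.g. example.com/robots.txt)
# The disallowed paths and the delay value below are illustrative only.

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Crawl-delay: 10     # ask crawlers to wait 10 seconds between requests
```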
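The "more than 2 months" figure, by the way, is just arithmetic on the 60-second proposal (a quick sanity check, assuming exactly one page per minute and nothing else in the way):

```python
pages = 100000      # the hypothetical site size from the excerpt
interval = 60       # seconds between requests, per Koster's 1993 proposal

total_seconds = pages * interval
print(total_seconds / 86400)  # 6,000,000 s / 86,400 s per day = ~69.4 days, i.e. over 2 months
```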
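And if the Mercator-style "wait 10t seconds" rule sounds abstract, here's a rough sketch of how a crawler might implement it. The polite_fetch function and its URL list are hypothetical, not anybody's actual crawler code:

```python
import time
import urllib.request

POLITENESS_FACTOR = 10  # Mercator-style rule: wait 10x however long the last download took

def polite_fetch(urls):
    """Fetch URLs from one server, sleeping 10*t seconds after each download that took t seconds."""
    for url in urls:
        start = time.monotonic()
        with urllib.request.urlopen(url) as resp:  # no error handling in this sketch
            body = resp.read()
        elapsed = time.monotonic() - start
        time.sleep(POLITENESS_FACTOR * elapsed)    # adaptive politeness delay before the next request
        yield url, body
```

A real crawler would typically also cap the delay and keep a per-host queue, but the core idea is just "the slower the server responds, the longer you back off."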
Quote - Maybe, until you can actually increase your site speeds at this most wonderful site, you might think of limiting membership here until you can obtain enough capacity and technical know how necessary to serve those who are already here in a better fashion? Just a thought.
I gotta tell ya, you've come up with some real zingers during your endless crusade in this forum, but this one takes first place!