Strategies for Crawling Sites
Step 1 - Eyeball the Site
Before kicking off a crawl of a site, visit the site in a browser. Familiarise yourself with the homepage and the links off it.
- Did the URL you typed in redirect you elsewhere?
- Is there a lot of dynamic/JavaScript content?
- Are there password-protected areas of the site?
- Is the site merely a landing page for other domains?
All that you're doing is getting a sense of what the site is about, so that you can trust some of your gut instincts when you analyse the crawler's behaviour and results.
Determining what Root URL to use
Often you'll be asked to crawl a domain with a request such as: "can you run a crawl on www.telstraenterprise.com.au".
You type www.telstraenterprise.com.au into the browser and it redirects to http://www.telstraenterprise.com.au/Pages/Home.aspx.
In this case the webserver is redirecting you to a new URL. Use the redirected-to URL as the root URL in the host configuration screen on Discovery, i.e.
http://www.telstraenterprise.com.au/Pages/Home.aspx
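If you'd rather confirm the redirect from a script than from the browser, the check can be sketched in Python. This is a minimal sketch, not part of Discovery: the function name resolve_root_url and its opener parameter are hypothetical, and it only follows HTTP redirects (a redirect done in JavaScript or with a meta refresh tag will not show up here, so the browser check is still worth doing).

```python
import urllib.request


def resolve_root_url(url, opener=urllib.request.urlopen):
    """Return the URL that the server finally serves after HTTP redirects.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib's urlopen.
    """
    # Requests often arrive as a bare host name, so add a scheme if missing.
    if "://" not in url:
        url = "http://" + url
    # urlopen follows HTTP redirects by default; geturl() reports where it ended up.
    with opener(url) as response:
        return response.geturl()


# Example call (needs network access, so it is left commented out):
# print(resolve_root_url("www.telstraenterprise.com.au"))
```

Whatever URL comes back is the candidate root URL for the host configuration screen.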
Starting Small
Unless you maintain and manage the content on the site, you'll have no idea what types of content, or how much of it, are hidden behind the home page. You have no idea whether you're going to run into an events calendar that causes the crawler to disappear to March 3929 A.D. before you notice and interrupt it.
So start small and restrict the crawler to an upper page limit of 3,000 pages.
Once the initial crawl of 3,000 pages has completed, run the built-in Analysis task and check the report for any surprises, such as applications identifiable by the URL parameters being used.
You can exclude applications from being crawled by editing the robots.txt setting in the host configuration screen.
Increase the upper page limit once you're happy with the results of the intermediate crawl.
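To illustrate what "applications identifiable by the URL parameters" can look like, here is a rough Python sketch that groups a list of crawled URLs by path and query parameter names. It is not how the Analysis task works internally; the function name summarise_url_parameters and the example.com URLs are made up for illustration, and it assumes you can export the crawled URLs as plain strings.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def summarise_url_parameters(urls):
    """Count URLs per (path, sorted parameter names) signature."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        # Collect the distinct parameter names, ignoring their values.
        names = tuple(sorted({name for name, _ in
                              parse_qsl(parts.query, keep_blank_values=True)}))
        if names:  # URLs without a query string are not interesting here
            counts[(parts.path, names)] += 1
    return counts


# Hypothetical sample of crawled URLs.
crawled = [
    "http://www.example.com/events/calendar.aspx?month=3&year=3929",
    "http://www.example.com/events/calendar.aspx?month=4&year=3929",
    "http://www.example.com/events/calendar.aspx?year=3930&month=1",
    "http://www.example.com/about.html",
]
summary = summarise_url_parameters(crawled)
for (path, names), count in summary.most_common():
    print(count, path, ", ".join(names))
```

A path that accounts for a large share of the crawl with the same handful of parameters, like the calendar above, is a likely candidate for exclusion.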
Check the Site's robots.txt File
By default Discovery will crawl every URL belonging to the site unless you specify otherwise through the robots.txt setting on the host configuration screen.
The site that you're trying to crawl may already have a robots.txt file for you to use. You can find it by adding /robots.txt to the domain name, for example:
http://www.telstraenterprise.com.au/robots.txt
The robots.txt file will also give you an early indication of what parts of the site are applications and should be avoided.
All you need to do is copy and paste (or amend) the contents into the robots.txt setting on the host configuration screen.
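Before pasting amended rules into the setting, it can help to check that they block what you expect. The sketch below uses Python's standard urllib.robotparser module to test a few URLs against a set of rules. The rules and example.com URLs are invented for illustration, the helper is_allowed is hypothetical, and Discovery's own interpretation of robots.txt may differ in details from Python's.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the paths are illustrative, not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /events/calendar.aspx
Disallow: /search/
"""

parser = RobotFileParser()
# parse() takes the file's lines, so no network request is made here.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the rules above permit fetching `url`."""
    return parser.can_fetch(user_agent, url)


for url in (
    "http://www.example.com/Pages/Home.aspx",
    "http://www.example.com/events/calendar.aspx?month=3&year=3929",
    "http://www.example.com/search/results",
):
    print("allowed" if is_allowed(url) else "blocked", url)
```

Rules match on the start of the path, so a single Disallow line for the calendar page covers every month and year combination behind it.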
Be Nice, Lessen the Load on Servers
Discovery can be used to crawl large, powerful websites in a very short space of time. A typical single Discovery server may have 10 simultaneous threads at its disposal, and several servers can be clustered together to create tens or even hundreds of simultaneous crawl threads.
Doing so is usually a bad idea if you're crawling a single site.
By launching a large number of threads at a site (rule of thumb: more than 10) you're likely to put undue load on the site's webservers and slow them down.
Besides potentially provoking the wrath of the site's owners, you're also jeopardising the quality of the data being collected by causing time-outs and affecting response time data.
Other than the number of crawl threads, which is controlled in a Discovery configuration file, you can control the crawl rate by setting the Host Avoidance Interval in the site's host configuration screen.
By setting the Host Avoidance Interval to, say, 2 seconds, you're preventing the crawler from revisiting the site within 2 seconds of its last visit.
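The idea behind an avoidance interval can be sketched in a few lines of Python. This is an illustration of the concept only, not Discovery's implementation: the class HostAvoidance and the function minimum_crawl_seconds are hypothetical names, the sketch is single-threaded, and it assumes the interval applies to the host as a whole rather than to each thread separately.

```python
import time


class HostAvoidance:
    """Enforce a minimum delay between successive visits to each host."""

    def __init__(self, interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval_seconds
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last_visit = {}    # host name -> time of the last visit

    def wait_for(self, host):
        """Block until `host` may be visited again, then record the visit."""
        now = self._clock()
        last = self._last_visit.get(host)
        if last is not None:
            remaining = self.interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_visit[host] = now


def minimum_crawl_seconds(pages, interval_seconds):
    """Lower bound on crawl time when one host is visited one page at a time."""
    return max(pages - 1, 0) * interval_seconds


# 3,000 pages at a 2-second interval: 5998 seconds, roughly 100 minutes.
print(minimum_crawl_seconds(3000, 2))
```

Under that assumption, the interval also gives you a quick estimate of how long a crawl must take at minimum, which is useful when deciding on the upper page limit.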
Remember: long-running crawls over 12-, 24-, or 48-hour periods have the benefit of collecting response-time trend data, something a short crawl is incapable of doing.