Setting up a Crawl
To set up a new crawl of a website, visit the Discovery server's homepage and click the New Host link in the Actions section near the bottom of the page.
You'll be presented with a form asking you to specify the details of the website you're about to crawl.
Host & Crawl Frequency Details
Title
The title can be anything descriptive to you, for example T-Systems Corporate Site.
Base URL
This is the first URL, pointing to an HTML page, that the crawler visits before moving on to the links harvested from that page. For example:
http://www.t-systems.com/
Tags
You can assign keywords as tags to divide a long list of hosts into groups. Tags should be comma-separated, for example:
t-systems.fr     french, corporatesites, phase1
t-systems.com    english, corporatesites, phase2
You can then filter the list of hosts by attaching a tags parameter to the URL of your Discovery server, e.g.:
http://discovery.local:5679/?tags=french
Adding the tag to the URL filters the list of hosts to show only those tagged as french.
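If you build such filter URLs from a script, the query string can be assembled with the Python standard library. This is only a small sketch: the helper name host_list_url is made up for this example, and the server address is the one used above.

```python
from urllib.parse import urlencode


def host_list_url(server: str, tag: str) -> str:
    """Return the host-list URL filtered by a single tag (hypothetical helper)."""
    # urlencode takes care of escaping characters that are not URL-safe.
    return f"{server.rstrip('/')}/?{urlencode({'tags': tag})}"


print(host_list_url("http://discovery.local:5679", "french"))
```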
Max no. of Pages
Restricts the number of HTML web pages being crawled to the set limit, e.g. 3000 pages.
Note: The limit refers to HTML web pages. Once the limit is reached the crawler will most likely continue crawling URLs pointing to assets such as JPEGs and PDFs that are referenced by the crawled HTML pages.
Crawl Interval
When the scheduler is enabled, tells Discovery how frequently to crawl the specified host.
Host Avoidance Interval
Tells Discovery to not retrieve URLs from the specified host if Discovery has already done so within the specified time interval.
Specify a value in whole seconds, e.g. 5
Entering a value of 1 or greater will artificially slow down the crawler, lessening the load on the host servers and spreading the crawl out over a longer period.
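The idea behind such an interval can be sketched as a per-host delay. The Python class below is purely illustrative and is not Discovery's actual implementation; the class name HostThrottle and its methods are invented for this example.

```python
import time


class HostThrottle:
    """Hypothetical per-host politeness delay, not Discovery's actual code."""

    def __init__(self, interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_fetch = {}  # host -> time of the most recent request

    def wait_for(self, host):
        """Block until `host` may be contacted again; return the seconds waited."""
        now = self.clock()
        last = self.last_fetch.get(host)
        delay = 0.0
        if last is not None:
            # Wait only for the part of the interval that has not yet elapsed.
            delay = max(0.0, self.interval - (now - last))
        if delay > 0:
            self.sleep(delay)
        self.last_fetch[host] = now + delay
        return delay
```

A crawler loop would call wait_for(host) before each request. Requests to other hosts are not delayed, which is why the interval slows the crawl of one site without holding up the others.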
Crawl Restrictions
Storage options - crawl image links
Enables/disables the retrieval of images on web pages, i.e. content referred to through <IMG> tags.
Storage options - store content
Enables/disables the storage of the crawled content, i.e. the
HTML pages, images, PDFs, CSS, etc.
This setting is useful if you intend to perform any further
processing on the content using either WebWorxx or Content
Migrator.
You do not need to enable this setting to perform the
Analysis task available in Discovery, which uses the harvested
metadata rather than stored content.
Host's robots.txt
The robots.txt standard is a facility employed by websites to tell the web spiders used by search engines such as Google and Microsoft Bing to ignore certain groups of URLs encountered on the site, typically to avoid crawling web applications such as calendars or staff directories.
Discovery uses the same syntax to restrict the URLs that it will and will not crawl into. You can therefore copy & paste robots.txt settings directly from the to-be-crawled site.
For example, here are the T-Systems settings available at http://www.t-systems.com/robots.txt:
User-agent: *
Disallow: /tsip/servlet/search
Disallow: /*nextstep*
Disallow: /*searchTerm*
To find out more about the syntax, consult http://en.wikipedia.org/wiki/Robots_exclusion_standard.
Discovery extends the capabilities of the robots.txt standard by providing full support for regular expressions, using the Disallow-regex statement:
User-agent: *
Disallow-regex: /archives/201\d/.*
The above statement would prevent Discovery from crawling any archive URLs for the years 2010 to 2019.
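To check which URLs a regular expression like the one above covers before starting a crawl, you can try it out in a few lines of Python. This sketch assumes that a Disallow-regex rule excludes matching URLs, as Disallow does in robots.txt, and that the pattern is matched from the start of the URL path; the sample paths are invented.

```python
import re

# The pattern from the Disallow-regex example above.
DISALLOW_REGEX = re.compile(r"/archives/201\d/.*")


def is_disallowed(path: str) -> bool:
    """Return True if the URL path matches the disallow pattern.

    Assumption: the pattern is matched from the start of the path.
    """
    return DISALLOW_REGEX.match(path) is not None


# Made-up sample paths for illustration.
for path in ["/archives/2015/05/post.html", "/archives/2009/12/post.html", "/products/"]:
    print(path, "->", "skipped" if is_disallowed(path) else "crawled")
```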
Link Harvesting Patterns
By default Discovery will parse the HTML content of a web page and extract all links contained within it, irrespective of where those links are located on the page or their semantic meaning, e.g. main body links vs footer links.
Discovery gives you fine-grained control over the areas of the page that it harvests links from. For example, on an e-commerce site, if you're only interested in crawling products and their categories, you might choose to ignore everything other than the centre pane of the web page.
The areas you choose to harvest links from are specified using CSS selectors, which are expressions used to identify elements of a web page. For example, given the following web page:
<html>
<body>
<div id="header">…</div>
<div id="main-body">…</div>
<div id="footer">…</div>
</body>
</html>
#main-body refers to the DIV element with id="main-body".
#footer refers to the DIV below #main-body, identified by id="footer".
Here's a good summary of all the CSS selectors available to Discovery: http://www.w3schools.com/cssref/css_selectors.asp
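To illustrate what harvesting links from a single area means, here is a minimal Python sketch using only the standard library. It supports nothing beyond a plain #id selector, whereas a real CSS selector engine is far richer, and it is not how Discovery itself is implemented. The class name, the function name and the links in the sample page are all invented for this example.

```python
from html.parser import HTMLParser


class RegionLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the element with a given id."""

    # Tags without a closing tag; they must not change the nesting depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, region_id):
        super().__init__()
        self.region_id = region_id
        self.depth = 0  # nesting depth inside the region; 0 means outside
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0:
            if tag not in self.VOID_TAGS:
                self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif attrs.get("id") == self.region_id:
            self.depth = 1  # entered the region

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID_TAGS:
            self.depth -= 1


# Sample page modelled on the example above; the links are made up.
PAGE = """
<html><body>
<div id="header"><a href="/login">Login</a></div>
<div id="main-body"><p><a href="/product/1">Shoe</a><br><a href="/product/2">Boot</a></p></div>
<div id="footer"><a href="/imprint">Imprint</a></div>
</body></html>
"""


def harvest_links(html, region_id):
    parser = RegionLinkParser(region_id)
    parser.feed(html)
    return parser.links


print(harvest_links(PAGE, "main-body"))  # ['/product/1', '/product/2']
```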
Sites often employ several page templates, each containing links in different areas of the page that are identified by different CSS selectors. To accommodate this, we can tell Discovery to use different CSS selectors when it encounters different URLs.
For example, an e-commerce site might consist mainly of two page templates: a product-category template, which lists all the shoes of a certain type (e.g. running shoes), and a product template, which shows the details of a particular running shoe.
So we might want to apply the following link harvesting patterns as an example:
__URL Pattern__     __CSS Selectors__
/.*category=.*      #category-items
/product/.*         #main-body, #related-accessories
With the example above, Discovery would harvest links only from
the #category-items section when encountering URLs containing a
category parameter. Discovery would harvest links from
#main-body and #related-accessories sections when encountering URLs
pointing to products.
Discovery uses regular expressions to identify URLs with different
page templates.
To store link harvesting patterns you need to encode them in JSON notation. Here's the above example encoded in JSON:
[
["/.*category=.*",["#category-items"]],
["/product/.*",["#main-body", "#related-accessories"]],
[".*",["html > body"]]
]
The last pattern is a catch-all: if Discovery encounters a page that is neither a product category page nor a product page, it will simply grab all URLs in the HTML body.
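The way such a pattern list resolves to a set of selectors can be sketched in Python as follows. Two assumptions are made here that the text above does not confirm: that the patterns are tried in order with the first match winning, and that the regular expression is matched from the start of the URL path including the query string. The function name selectors_for and the sample paths are invented.

```python
import json
import re

# The link harvesting patterns from the example above.
PATTERNS_JSON = """
[
  ["/.*category=.*", ["#category-items"]],
  ["/product/.*", ["#main-body", "#related-accessories"]],
  [".*", ["html > body"]]
]
"""


def selectors_for(path, patterns):
    """Return the CSS selectors of the first pattern that matches the path.

    Assumptions: patterns are tried in order and matched from the start
    of the URL path (including the query string).
    """
    for url_regex, selectors in patterns:
        if re.match(url_regex, path):
            return selectors
    return []


patterns = json.loads(PATTERNS_JSON)
# Made-up sample paths for illustration.
for path in ["/shoes/?category=running", "/product/nike-air-max-2011", "/about-us"]:
    print(path, "->", selectors_for(path, patterns))
```

Running the snippet through json.loads is also a quick way to confirm that a configuration is valid JSON before pasting it into the form.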
Duplicate Detection Patterns
Duplicate Detection Patterns are used by Discovery to determine what parts of a web page make that page unique.
Without specifying any patterns, Discovery will create a digital fingerprint of the whole page (an MD5 digest in technical parlance). This default behaviour works for most sites, hence does not need any tweaking.
However if you find yourself crawling an e-commerce website where there are typically multiple versions of the same basic product page available and you want to treat those versions as all belonging to the same page, then Duplicate Detection Patterns are needed.
An e-commerce site might have 3 product pages for the same pair of running shoes whereby each of those pages only differs in their breadcrumbs, e.g.:
<div id="product">
<div id="bc">Home > Shoes > Running > Nike Air Max 2011</div>
<div id="product-description">…</div>
</div>
<div id="product">
<div id="bc">Home > Shoes > Nike > Nike Air Max 2011</div>
<div id="product-description">…</div>
</div>
<div id="product">
<div id="bc">Home > Nike > Nike Air Max 2011</div>
<div id="product-description">…</div>
</div>
By default, Discovery would generate three distinct digital fingerprints for these three pages, because the fingerprinting would include three different breadcrumbs.
Discovery can instead generate the same digital fingerprint for all three occurrences by ignoring the breadcrumb (<div id="bc">) and focusing on the product description (<div id="product-description">).
Here's how Discovery is configured for the goal above:
[
["/product/.*",["#product-description"]],
["/.*category=.*",["#category-items"]],
[".*",["html > body"]]
]
As you can see, we use the same syntax as Link Harvesting Patterns; however, the meaning is very different. URL regular expressions are used to identify page types, followed by a list of CSS selectors identifying the areas of the page to be used in digital fingerprinting.
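The effect of restricting the fingerprint to one area of the page can be demonstrated with a short Python sketch. It is illustrative only: what exactly Discovery feeds into the MD5 digest (text or markup) is an assumption, the XML parser is used only because the sample fragments happen to be well formed (real HTML needs an HTML parser), and "Sample description" is placeholder text standing in for the elided product description.

```python
import hashlib
import xml.etree.ElementTree as ET


def fingerprint(fragment, element_ids):
    """MD5 digest over the text of the elements with the given ids, in order."""
    root = ET.fromstring(fragment)
    digest = hashlib.md5()
    for element_id in element_ids:
        for element in root.iter():
            if element.get("id") == element_id:
                digest.update("".join(element.itertext()).encode("utf-8"))
    return digest.hexdigest()


BREADCRUMBS = [
    "Home > Shoes > Running > Nike Air Max 2011",
    "Home > Shoes > Nike > Nike Air Max 2011",
    "Home > Nike > Nike Air Max 2011",
]
# "Sample description" is placeholder text standing in for the real content.
PAGES = [
    f'<div id="product"><div id="bc">{bc}</div>'
    '<div id="product-description">Sample description</div></div>'
    for bc in BREADCRUMBS
]

whole_page = {fingerprint(page, ["product"]) for page in PAGES}
description_only = {fingerprint(page, ["product-description"]) for page in PAGES}
print(len(whole_page), "distinct fingerprints for the whole product block")
print(len(description_only), "distinct fingerprint(s) for the description only")
```

Hashing the whole product block yields three fingerprints, while hashing only the description yields one, so the three pages are treated as the same page.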
Drop URL Parameter Patterns
Drop URL Parameter Patterns tell Discovery which URL parameters are inconsequential to a successful crawl of a website.
You may find yourself crawling an e-commerce website that has
multiple renderings of the same basic information. For example a
product category page for running shoes may provide different sort
orders (price ascending, price descending, popularity descending,
etc.) by clicking on various links. Each of these links would
present the same information but ordered differently, and would
therefore not be caught by Duplicate Detection Patterns.
You simply want one rendering of the product category page, the
default sort order, which gives you access to all the product pages
- dropping any other sort orders.
The sort order is typically embedded in URL parameters, for example:
http://retailer.com/shoes/?category=running&sort=price-desc&page=2
However, you want Discovery to turn the above URL into one without the sort parameter:
http://retailer.com/shoes/?category=running&page=2
The following fragment shows how Drop URL Parameter Patterns are configured to achieve the above result:
[
["/.*category=",["sort", "viewAll"]]
]
Once again, a URL regular expression is used to map the action of
dropping parameters to certain page types, in this case product
categories. In the configuration example, Discovery would remove the
sort and/or viewAll parameters if the URL matched the regular
expression. If viewAll didn't occur in the URL but sort did, then
sort would still be removed. So the configuration line should read like:
*if you encounter product category URLs, remove the sort and viewAll
parameters if they are used*
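The rewriting described above can be sketched in Python with the standard library. This is an illustration rather than Discovery's own code. It assumes that the regular expression is matched from the start of the URL path plus query string and that every matching pattern contributes its parameters; the function name drop_parameters is invented.

```python
import json
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The configuration from the example above.
DROP_PATTERNS = json.loads('[["/.*category=", ["sort", "viewAll"]]]')


def drop_parameters(url, patterns):
    """Remove the configured query parameters from URLs matching a pattern.

    Assumption: the regular expression is matched from the start of the
    path plus query string, and all matching patterns are applied.
    """
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    to_drop = set()
    for url_regex, names in patterns:
        if re.match(url_regex, target):
            to_drop.update(names)
    # Keep the remaining parameters in their original order.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in to_drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(drop_parameters(
    "http://retailer.com/shoes/?category=running&sort=price-desc&page=2",
    DROP_PATTERNS,
))
# prints http://retailer.com/shoes/?category=running&page=2
```

URLs that do not match the pattern, such as a product page that happens to carry a sort parameter, pass through unchanged.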
Tip: A great way to find out which parameters are used by a website
is to perform a small 3000 page crawl, run the standard audit and
check the URL parameters report, which will show you what parameters
are used and in what combinations.