Setting up a Crawl

To set up a new crawl of a website, visit the Discovery server's homepage and click the New Host link in the Actions section near the bottom of the page.

You'll be presented with a form asking you to specify the details of the web site you're about to crawl.

Host & Crawl Frequency Details

Title

The title can be anything that is descriptive to you, for example T-Systems Corporate Site.

Base URL

The Base URL is the first HTML page the crawler visits. The crawler retrieves this page and then moves on to the links harvested from it. For example:

http://www.t-systems.com/

Tags

You can assign keywords as tags to divide a long list of hosts into groups. Tags should be comma-separated, for example:

t-systems.fr     french, corporatesites, phase1
t-systems.com    english, corporatesites, phase2

You can then filter the list of hosts by appending a tags parameter to the URL of your Discovery server, e.g.:

http://discovery.local:5679/?tags=french

Adding the tag to the URL filters the list of hosts to show only those tagged as french.
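
As a rough illustration of the filtering behaviour described above (not Discovery's actual implementation), the sketch below splits the comma-separated tag strings from the example and selects the hosts carrying a given tag. The function names parse_tags and filter_hosts are hypothetical.

```python
def parse_tags(raw):
    """Split a comma-separated tag string into a set of trimmed tags."""
    return {tag.strip() for tag in raw.split(",") if tag.strip()}


# Hosts and tags taken from the example above.
hosts = {
    "t-systems.fr": parse_tags("french, corporatesites, phase1"),
    "t-systems.com": parse_tags("english, corporatesites, phase2"),
}


def filter_hosts(hosts, tag):
    """Return the names of all hosts that carry the given tag, sorted."""
    return sorted(name for name, tags in hosts.items() if tag in tags)


print(filter_hosts(hosts, "french"))
```

Filtering on a tag shared by both hosts, such as corporatesites, would return both entries.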

Max no. of Pages

Restricts the number of HTML web pages being crawled to the set limit, e.g. 3000 pages.

Note: The limit refers to HTML web pages. Once the limit is reached the crawler will most likely continue crawling URLs pointing to assets such as JPEGs and PDFs that are referenced by the crawled HTML pages.

Crawl Interval

When the scheduler is enabled, tells Discovery how frequently to crawl the specified host.

Host Avoidance Interval

Tells Discovery to not retrieve URLs from the specified host if Discovery has already done so within the specified time interval.

Specify a value in whole seconds, e.g. 5.

Entering a value of 1 or greater will artificially slow down the crawler, lessening the load on the host servers and spreading out the crawl time.
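
The idea behind a host avoidance interval can be sketched as follows. This is a minimal illustration under assumptions, not Discovery's code: the class HostAvoidance and its methods are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from urllib.parse import urlsplit


class HostAvoidance:
    """Tracks the last retrieval per host and reports how long to wait."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last_fetch = {}  # host name -> timestamp of the latest retrieval

    def seconds_to_wait(self, url, now):
        """Return how many seconds must pass before this URL may be fetched."""
        last = self.last_fetch.get(urlsplit(url).hostname)
        if last is None:
            return 0.0  # host not visited yet, no need to wait
        return max(0.0, self.interval - (now - last))

    def record_fetch(self, url, now):
        """Remember that a URL on this host was retrieved at time `now`."""
        self.last_fetch[urlsplit(url).hostname] = now


avoidance = HostAvoidance(interval_seconds=5)
avoidance.record_fetch("http://www.t-systems.com/", now=100.0)
print(avoidance.seconds_to_wait("http://www.t-systems.com/page2", now=102.0))
```

With an interval of 5 seconds, a second request to the same host two seconds after the first would have to wait another three seconds, while requests to other hosts are unaffected.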

Crawl Restrictions

Storage options - crawl image links

Enables/disables the retrieval of images on web pages, i.e. content referred to through <IMG> tags.

Storage options - store content

Enables/disables the storage of the crawled content, i.e. the HTML pages, images, PDFs, CSS, etc.
This setting is useful if you intend to perform any further processing on the content using either WebWorxx or Content Migrator.
You do not need to enable this setting to perform the Analysis task available in Discovery, which uses the harvested metadata rather than stored content.

Host's robots.txt

The robots.txt standard is a facility that websites use to tell the web spiders of search engines such as Google and Microsoft Bing to ignore certain groups of URLs on the site, typically to avoid crawling web applications such as calendars or staff directories.

Discovery uses the same syntax to restrict the URLs that it will crawl. You can therefore copy and paste robots.txt settings directly from the site to be crawled.

For example, here are the T-Systems settings available at http://www.t-systems.com/robots.txt:

User-agent: *
Disallow: /tsip/servlet/search
Disallow: /*nextstep*
Disallow: /*searchTerm*

To find out more about the syntax, consult http://en.wikipedia.org/wiki/Robots_exclusion_standard.

Discovery extends the capabilities of the robots.txt standard by providing full support for regular expressions, using the Disallow-regex statement:

User-agent: *
Disallow-regex: /archives/201\d/.*

The above statement would stop Discovery from crawling any archive URLs for the years 2010 to 2019.
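
To see which URLs such a rule covers, the regular expression can be tried out in isolation. The sketch below is only an illustration: it assumes the expression is matched against the URL path from its start, which the text above does not confirm, and the function is_blocked is hypothetical.

```python
import re

# The Disallow-regex value from the example above.
DISALLOW_REGEXES = [re.compile(r"/archives/201\d/.*")]


def is_blocked(path):
    """Return True if any Disallow-regex matches the path from its start."""
    return any(regex.match(path) for regex in DISALLOW_REGEXES)


for path in ["/archives/2015/05/post.html", "/archives/2009/index.html"]:
    print(path, "blocked" if is_blocked(path) else "allowed")
```

Here \d matches a single digit, so the pattern covers /archives/2010/ through /archives/2019/ and nothing outside that range.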

Link Harvesting Patterns

By default Discovery will parse the HTML content of a web page and extract all links contained within it, irrespective of where those links are located on the page or their semantic meaning, e.g. main body links vs footer links.

Discovery gives you fine-grained control over the areas of the page that it harvests links from. For example, on an e-commerce site, if you're only interested in crawling products and their categories, you might choose to ignore everything other than the centre pane of the web page.

The areas you choose to harvest links from are specified using CSS selectors, which are expressions used to identify elements of a web page. For example, given the following web page:

<html>
  <body>
    <div id="header">…</div>
    <div id="main-body">…</div>
    <div id="footer">…</div>
  </body>
</html>

#main-body refers to the DIV element with id="main-body". 
#footer    refers to the DIV below #main-body, identified by id="footer".

Here's a good summary of all the CSS selectors available to Discovery: http://www.w3schools.com/cssref/css_selectors.asp

Sites often employ page templates, and those templates contain links in different areas of the page, identified by different CSS selectors. To accommodate this, you can tell Discovery to use different CSS selectors for different URLs.

For example, an e-commerce site might consist mainly of two page templates: a product-category template and a product template. The product-category page lists all the shoes of a certain type, e.g. running shoes, while the product page shows the details of a particular running shoe.

In that case, we might apply the following link harvesting patterns:

__URL Pattern__     __CSS Selectors__
/.*category=.*      #category-items
/product/.*         #main-body, #related-accessories

With the example above, Discovery would harvest links only from the #category-items section when encountering URLs containing a category parameter, and from the #main-body and #related-accessories sections when encountering URLs pointing to products.
The URL patterns are regular expressions, which Discovery uses to match URLs to the different page templates.

To store link harvesting patterns you need to encode them in JSON notation. Here's the above example encoded in JSON:

[
  ["/.*category=.*",["#category-items"]],
  ["/product/.*",["#main-body", "#related-accessories"]],
  [".*",["html > body"]]
]

The last pattern is a catch-all: if Discovery encounters a page that is neither a product category page nor a product page, it will simply grab all URLs in the HTML body.
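
The way such a configuration might be evaluated can be sketched in a few lines. This is a hedged illustration, not Discovery's implementation: it assumes the rules are tried in order, that the first matching rule wins (which is what makes a trailing catch-all useful), and that each regular expression is matched against the URL path and query string from its start. The function selectors_for is hypothetical.

```python
import json
import re

# The JSON configuration from the example above.
CONFIG = json.loads("""
[
  ["/.*category=.*", ["#category-items"]],
  ["/product/.*", ["#main-body", "#related-accessories"]],
  [".*", ["html > body"]]
]
""")

# Compile each URL regular expression once, keeping the configured order.
RULES = [(re.compile(pattern), selectors) for pattern, selectors in CONFIG]


def selectors_for(path_and_query):
    """Return the CSS selectors of the first rule whose regex matches."""
    for regex, selectors in RULES:
        if regex.match(path_and_query):
            return selectors
    return []  # only reachable if the configuration has no catch-all


print(selectors_for("/shoes/?category=running"))
print(selectors_for("/product/nike-air-max-2011"))
print(selectors_for("/about-us"))
```

Under the first-match assumption, the order of the rules matters: placing the catch-all first would hide every rule after it.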

Duplicate Detection Patterns

Duplicate Detection Patterns are used by Discovery to determine what parts of a web page make that page unique.

Without any patterns specified, Discovery will create a digital fingerprint of the whole page (an MD5 digest in technical parlance). This default behaviour works for most sites and usually needs no tweaking.

However, if you are crawling an e-commerce website where there are typically multiple versions of the same basic product page, and you want to treat those versions as one and the same page, then Duplicate Detection Patterns are needed.

An e-commerce site might have three product pages for the same pair of running shoes, where the pages differ only in their breadcrumbs, e.g.:

<div id="product">
  <div id="bc">Home > Shoes > Running > Nike Air Max 2011</div>
  <div id="product-description">…</div>
</div>

<div id="product">
  <div id="bc">Home > Shoes > Nike > Nike Air Max 2011</div>
  <div id="product-description">…</div>
</div>

<div id="product">
  <div id="bc">Home > Nike > Nike Air Max 2011</div>
  <div id="product-description">…</div>
</div>

By default, Discovery would generate three distinct digital fingerprints for these three pages, because each fingerprint would include a different breadcrumb.

With Duplicate Detection Patterns, Discovery generates the same digital fingerprint for all three occurrences by ignoring the breadcrumb (<div id="bc">) and focusing on the product description instead (<div id="product-description">).

Here's how Discovery is configured for the goal above:

[
  ["/product/.*",["#product-description"]],
  ["/.*category=.*",["#category-items"]],
  [".*",["html > body"]]
]

As you can see, we use the same syntax as Link Harvesting Patterns; however, the meaning is very different. URL regular expressions are used to identify page types, followed by a list of CSS selectors identifying the areas of the page to be used in digital fingerprinting.
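
The effect of fingerprinting only part of a page can be demonstrated with a small sketch. It is an illustration under assumptions rather than Discovery's implementation: it only handles selectors of the form #id, it expects every opened tag to be closed, and the class IdRegionExtractor, the function fingerprint and the description text are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class IdRegionExtractor(HTMLParser):
    """Collects the text inside the element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # nesting depth inside the target element, 0 = outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.depth += 1  # nested element inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1  # entering the target element

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def fingerprint(html, element_ids):
    """Return the MD5 digest of the text found in the given elements."""
    parts = []
    for element_id in element_ids:
        parser = IdRegionExtractor(element_id)
        parser.feed(html)
        parser.close()
        parts.append("".join(parser.chunks))
    return hashlib.md5("\n".join(parts).encode("utf-8")).hexdigest()


PAGE = ('<div id="product"><div id="bc">{bc}</div>'
        '<div id="product-description">{desc}</div></div>')
page_a = PAGE.format(bc="Home > Shoes > Running > Nike Air Max 2011",
                     desc="Example description")  # placeholder text
page_b = PAGE.format(bc="Home > Nike > Nike Air Max 2011",
                     desc="Example description")

print(fingerprint(page_a, ["product-description"])
      == fingerprint(page_b, ["product-description"]))
```

Hashing the two complete pages gives different digests because the breadcrumbs differ, while hashing only the product description gives identical digests.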

Drop URL Parameter Patterns

Drop URL Parameter Patterns tell Discovery which URL parameters are inconsequential to a successful crawl of a website.

You may find yourself crawling an e-commerce website that has multiple renderings of the same basic information. For example, a product category page for running shoes may offer different sort orders (price ascending, price descending, popularity descending, etc.) through various links. Each of these links presents the same information in a different order, so the resulting pages would not be caught by Duplicate Detection Patterns.
You simply want one rendering of the product category page, in the default sort order, which gives you access to all the product pages, and you want to drop all other sort orders.
The sort order is typically embedded in URL parameters, for example:

http://retailer.com/shoes/?category=running&sort=price-desc&page=2

However, you want Discovery to turn the above URL into one without the sort parameter:

http://retailer.com/shoes/?category=running&page=2

The following fragment shows how Drop URL Parameter Patterns are configured to achieve the above result:

[
  ["/.*category=",["sort", "viewAll"]]
]

Once again, a URL regular expression is used to map the action of dropping parameters to certain page types, in this case product categories. In the configuration example, Discovery would remove the sort and/or viewAll parameters if the URL matched the regular expression. If viewAll didn't occur in the URL but sort did, then sort would still be removed. So the configuration line should read like:

*if you encounter product category URLs, remove sort and viewAll
parameters if they are used*
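
The rewriting step can be sketched with Python's standard URL functions. This is a minimal illustration, not Discovery's implementation: it assumes the regular expression is matched against the URL path and query string from its start, and the function drop_parameters is hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The configuration from the example above: URL regex -> parameters to drop.
DROP_RULES = [(re.compile("/.*category="), {"sort", "viewAll"})]


def drop_parameters(url):
    """Remove the configured parameters from URLs that match a rule."""
    parts = urlsplit(url)
    target = parts.path + "?" + parts.query
    to_drop = set()
    for regex, names in DROP_RULES:
        if regex.match(target):
            to_drop |= names
    # Keep the remaining parameters in their original order.
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in to_drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(drop_parameters(
    "http://retailer.com/shoes/?category=running&sort=price-desc&page=2"))
```

URLs that do not match any rule keep all of their parameters, so a sort parameter on a page that is not a product category is left alone.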

Tip: A great way to find out which parameters a website uses is to perform a small 3000-page crawl, run the standard audit and check the URL parameters report, which shows which parameters are used and in what combinations.