Customizing the Report Data View

The Report Data view shows the detailed results of crawling a website, i.e. the list of URLs crawled and their related metadata.

The results often span many pages, so you can filter them with a query to narrow down what is shown.

Clicking the "Refine Query" button reveals a query editor. Queries are written inside curly braces, {}, and are best explained through examples.

Filter to show only URLs that returned a 200 status code

{"status_code":200}

In this example we filter on the status_code field and return only those URLs where that field equals 200.

Filter to show only URLs that returned a 404 status code, i.e. URLs causing broken links

{"status_code":404}

Similar to the previous example, we're still querying the same field but for a different value.

Filter to show only URLs that returned PDF documents

{"content_type":"application/pdf"}

In this example we chose a different field, namely content_type and matched on a text value "application/pdf". Make sure you wrap all text values in double-quotes: ".

Filter to show only URLs that return a PDF document with a 404 status code

{"content_type":"application/pdf", "status_code":404}

The above query combines two previously used criteria (content_type & status_code) by separating the field name and value pairs with a comma: ,
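To make the matching rule concrete, here is a minimal Python sketch of how a query with several field and value pairs could be evaluated against crawl results. The rows sample data and the matches helper are hypothetical and only illustrate the behaviour described above (every pair has to match); they are not the tool's actual implementation.

```python
import json

# Hypothetical crawl results; field names follow the examples above.
rows = [
    {"url": "/a.pdf", "content_type": "application/pdf", "status_code": 200},
    {"url": "/b.pdf", "content_type": "application/pdf", "status_code": 404},
    {"url": "/c.html", "content_type": "text/html", "status_code": 404},
]

def matches(row, query):
    """Return True when every field/value pair in the query equals the row's value."""
    return all(row.get(field) == value for field, value in query.items())

# The query text is plain JSON, so it can be parsed as such.
query = json.loads('{"content_type":"application/pdf", "status_code":404}')
result = [row["url"] for row in rows if matches(row, query)]
print(result)  # ['/b.pdf']
```

Only /b.pdf is listed because it is the only row that satisfies both criteria at once.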

Filter to show only URLs that return all kinds of HTML documents

Sometimes you want to match on a fragment of a value, such as when listing the URLs pointing to HTML content, irrespective of whether their content_type is "text/html, charset=ISO-8859-1", "text/html, charset=UTF-8", simply "text/html" or one of a multitude of other variations on that theme. Rather than writing a very complex query, we can use a regular expression:

{"content_type":"/html/"}

A regular expression is a text value that is prefixed and suffixed by a forward slash: /. Regular expressions are a complex subject and beyond the scope of this text. However, here's a good starter tutorial.
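As a rough illustration of fragment matching, the following Python sketch treats a criterion wrapped in forward slashes as a regular expression that may match anywhere in the value. The value_matches helper and the sample content_types list are hypothetical, and the assumption that the pattern is searched anywhere in the value (rather than anchored) is inferred from the "fragment" wording above.

```python
import re

# Hypothetical content_type values, as they might appear in the report.
content_types = [
    "text/html",
    "text/html, charset=UTF-8",
    "application/pdf",
    "image/gif",
]

def value_matches(value, criterion):
    """Match a value against a plain text criterion or a /regex/ criterion."""
    if len(criterion) >= 2 and criterion.startswith("/") and criterion.endswith("/"):
        # Drop the surrounding slashes and search anywhere in the value.
        return re.search(criterion[1:-1], value) is not None
    # Plain text values must match exactly.
    return value == criterion

html_like = [ct for ct in content_types if value_matches(ct, "/html/")]
print(html_like)  # ['text/html', 'text/html, charset=UTF-8']
```

Note that the plain text criterion "text/html" would match only the first entry, which is why the regular expression form is the more convenient choice here.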

Filter all images that are redirected

Regular expressions can be combined with other criteria as before.

{"content_type":"/image/", "status_code":302}

In this case we filter content_type to retrieve all image types (irrespective of whether they're JPEG, GIF or otherwise) and filter that list further using status_code, wanting to see those URLs that have been temporarily redirected - a 302 status code.

Filtering on multiple redirect codes

Redirect status codes usually come in two flavours: 301 (moved permanently) and 302 (moved temporarily). The wrong way to query for both would be:

{"status_code":301, "status_code":302}    # WRONG - would only return 302s
{"status_code":301, 302}                  # WRONG - bad syntax
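Both failures can be reproduced with an ordinary JSON parser. The sketch below uses Python's standard json module purely as an illustration; the query editor's own parser may report these problems differently.

```python
import json

# Duplicate keys: Python's json module keeps the last value it sees,
# so the 301 criterion is silently lost.
duplicate = json.loads('{"status_code":301, "status_code":302}')
print(duplicate)  # {'status_code': 302}

# A bare value after a comma is not valid JSON and is rejected.
try:
    json.loads('{"status_code":301, 302}')
    is_valid = True
except json.JSONDecodeError:
    is_valid = False
print(is_valid)  # False
```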

In this case we need to query using a special operator, $in, like so:

{"status_code": {"$in": [301,302]} }

We're still filtering on the status_code field, but in this case we're nesting another set of curly braces, {}, and asking that the field value match one of the values $in the [301, 302] list.
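The effect of $in can be sketched in Python as a membership test. The field_matches helper and the sample rows are hypothetical and only mirror the behaviour described above; they are not the tool's actual implementation.

```python
# Hypothetical crawl results with a mix of status codes.
rows = [
    {"url": "/old", "status_code": 301},
    {"url": "/tmp", "status_code": 302},
    {"url": "/ok", "status_code": 200},
]

def field_matches(value, criterion):
    """Support a plain value or a nested {"$in": [...]} criterion."""
    if isinstance(criterion, dict) and "$in" in criterion:
        # The value matches when it equals any entry in the list.
        return value in criterion["$in"]
    return value == criterion

query = {"status_code": {"$in": [301, 302]}}
redirects = [
    row["url"]
    for row in rows
    if all(field_matches(row.get(f), c) for f, c in query.items())
]
print(redirects)  # ['/old', '/tmp']
```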