Customizing the Report Data View
The Report Data view shows the detailed results of crawling a website, i.e. the list of URLs crawled and their related metadata.
The results often span many pages and hence can be filtered by a query to provide more clarity.
A query editor is revealed by clicking on the "Refine Query" button. Queries are defined within curly-braces {} and are best explained through some examples.
Filter to show only URLs that returned a 200 status code
{"status_code":200}
In this example we filter on the status_code field and return only those URLs whose value for that field is 200.
Filter to show only URLs that returned a 404 status code, i.e. URLs causing broken links
{"status_code":404}
Similar to the previous example, we're still querying the same field but for a different value.
Filter to show only URLs that returned PDF documents
{"content_type":"application/pdf"}
In this example we chose a different field, namely content_type, and matched on the text value "application/pdf". Make sure you wrap all text values in double quotes: ".
Filter to show only URLs that return a PDF document with a 404 status code
{"content_type":"application/pdf", "status_code":404}
The above query combines two previously used criteria (content_type and status_code) by separating the field name and field value pairs with a comma: ,
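The effect of combining criteria can be sketched in Python. This is only an illustration, assuming that a URL is returned when every field in the query matches; the records and the matches helper are hypothetical and are not part of the product.

```python
# Hypothetical crawl records; field names follow the examples above.
records = [
    {"url": "/a.pdf", "content_type": "application/pdf", "status_code": 200},
    {"url": "/b.pdf", "content_type": "application/pdf", "status_code": 404},
    {"url": "/c.html", "content_type": "text/html", "status_code": 404},
]

def matches(record, query):
    """Return True only if every field in the query equals the record's value."""
    return all(record.get(field) == value for field, value in query.items())

query = {"content_type": "application/pdf", "status_code": 404}

# Keep only the URLs that satisfy both criteria.
filtered = [r["url"] for r in records if matches(r, query)]
print(filtered)
```

Only the record that is both a PDF and a 404 survives the filter; a PDF that returned 200, or a 404 that is not a PDF, is dropped.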
Filter to show only URLs that return all kinds of HTML documents
Sometimes you want to match on a fragment of a value, such as when returning the list of URLs pointing to HTML content, irrespective of whether their content_type is "text/html, charset=ISO-8859-1", "text/html, charset=UTF-8", simply "text/html" or one of a multitude of other variations on that theme. Rather than writing a very complex query we can use a regular expression:
{"content_type":"/html/"}
A regular expression is a text value that is prefixed and suffixed by a forward slash: /. Regular expressions are a complex subject and beyond the scope of this text. However, here's a good starter tutorial.
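To see why a single pattern covers all of these variations, here is a small Python sketch. It assumes the pattern between the slashes may match anywhere in the value; the parse_pattern helper and the sample values are hypothetical, and the query engine's own regular expression dialect may differ from Python's in details.

```python
import re

# Hypothetical content_type values as they might appear in a report.
content_types = [
    "text/html",
    "text/html, charset=UTF-8",
    "text/html, charset=ISO-8859-1",
    "application/pdf",
    "image/png",
]

def parse_pattern(value):
    """Return a compiled regex if value is written as /.../, otherwise None."""
    if (isinstance(value, str) and len(value) >= 2
            and value.startswith("/") and value.endswith("/")):
        return re.compile(value[1:-1])
    return None

pattern = parse_pattern("/html/")

# search() matches the fragment anywhere in the value, not just at the start.
html_types = [ct for ct in content_types if pattern.search(ct)]
print(html_types)
```

A plain text value such as "application/pdf" is not wrapped in slashes, so parse_pattern returns None for it and it would be compared literally instead.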
Filter all images that are redirected
Regular expressions can be combined with other criteria as before.
{"content_type":"/image/", "status_code":302}
In this case we filter content_type to retrieve all image types (irrespective of whether they're JPEG, GIF or otherwise) and filter that list further using status_code, to see only those URLs that have been temporarily redirected, i.e. returned a 302 status code.
Filtering on multiple redirect codes
Redirect status codes usually come in two flavours: 301 (moved permanently) and 302 (moved temporarily). The wrong ways to query for both would be:
{"status_code":301, "status_code":302} # WRONG - would only return 302s
{"status_code":301, 302} # WRONG - bad syntax
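Both failures can be reproduced with Python's standard json module. This is only an illustration: the parser behind the query editor is not necessarily Python's, but many JSON parsers treat a repeated field name the same way.

```python
import json

# A repeated field name: Python's json module keeps only the last value,
# so the 301 criterion is silently lost.
duplicate = json.loads('{"status_code":301, "status_code":302}')
print(duplicate)

# A bare value without a field name is not valid JSON and is rejected.
try:
    json.loads('{"status_code":301, 302}')
    bad_syntax_rejected = False
except json.JSONDecodeError:
    bad_syntax_rejected = True
print(bad_syntax_rejected)
```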
In this case we need to query using a special operator, $in, like so:
{"status_code": {"$in": [301,302]} }
We're still filtering on the status_code field, but in this case we're nesting another set of curly braces, {}, and asking that the field value match one of the values $in the [301, 302] list.
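The behaviour of $in can be sketched in Python as a membership test. This is a minimal illustration under the assumption that $in matches when the field value equals any item in the list; the records and helper functions are hypothetical and only handle plain values and $in.

```python
# Hypothetical crawl records.
records = [
    {"url": "/old", "status_code": 301},
    {"url": "/tmp", "status_code": 302},
    {"url": "/ok", "status_code": 200},
    {"url": "/gone", "status_code": 404},
]

def field_matches(actual, condition):
    """Compare one field against a plain value or an {"$in": [...]} condition."""
    if isinstance(condition, dict) and "$in" in condition:
        return actual in condition["$in"]
    return actual == condition

def matches(record, query):
    """Return True only if every field in the query matches the record."""
    return all(field_matches(record.get(f), c) for f, c in query.items())

query = {"status_code": {"$in": [301, 302]}}
redirected = [r["url"] for r in records if matches(r, query)]
print(redirected)
```

Because the list sits under a single status_code key, neither value overwrites the other, which is what distinguishes this query from the repeated-key attempt.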