Designing Task Pipelines

This document explains the power and flexibility of Vamosa Content Migrator's task pipeline processing system. Task pipelines have been a key defining feature of the software since the early days of version 1.0. You might say they form the heart of the migration software and give it its powerful capabilities, namely:

  • flexibility
  • repeatability
  • expandability
  • performance

In this guide you'll learn about the different types of tasks you can build and assemble in task pipelines, the 'pros and cons' of each type, and how you can optimise your task pipelines for best performance and flexibility.


What is a Task Pipeline?

A task pipeline describes a process consisting of individual tasks, each performing a step in that process. In the case of Vamosa Content Migrator that process is content migration and task pipelines are assembled to automate the migration of content from source to target content management system.

Task pipelines can contain as many tasks as you feel are necessary to accomplish your content migration goals. You can have small task pipelines during the development and testing phases of your migration and then quickly construct larger ones during the production runs.

Here's a sample task pipeline used to crawl websites and store the content inside the Vamosa repository. Along the way the content is processed by various tasks.

taskpipeline.jpg

Let's take a look at each task in a little more detail.

  1. Clear Project will delete the current content and associated metadata from the project so that the next step in the pipeline, the Web Selector, doesn't double-up the content. It's a 'per project' task as it operates at project level.
  2. Web Selector is a content selector task which crawls a specified website and, as it finds an HTML page or an image, passes that content and its captured metadata on to the remainder of the task pipeline.
  3. Tidy Content is applied to each piece of content as it is grabbed off the web by the Web Selector. Tidy Content cleans up the HTML markup and converts the content to XHTML.
  4. Feature Extract and Classify are two tasks that classify an XHTML page against page template definitions.
  5. Store Content will store the content and metadata that has been captured and transformed in previous tasks in the Vamosa Content Migrator repository.

Tidy Content, Feature Extract, Classify, and Store Content are 'per object' tasks and are the predominant type of task on the system. All three task types ('per object', 'per project' and content selectors) will be explained in further detail in the sections that follow.


Executing a Task Pipeline

Task pipelines can be executed from within Vamosa Content Migrator in two different manners, namely in an ad-hoc fashion or as part of a schedule.

Executing a task pipeline, either scheduled or in ad-hoc fashion, will lock the project for the duration that it takes the task pipeline to complete. During this time you cannot execute any other task pipeline on the same project. The project unlocks when the task pipeline completes, ready for further task pipelines to run.

Schedule-based Execution

The Schedules feature allows you to execute your pipeline at a given time and date in the future. You can also use it to repeatedly execute a task pipeline every minute, or at a given time every day.

You access the Schedules by clicking the orange 'Schedules' tab at the top of the Content Migrator screen.

You can create a new schedule by clicking the 'Create New Schedule' button from the silver navigation bar at the top of the screen. You can edit an existing schedule by double-clicking on the entry in the Schedules list.

Each schedule entry consists of

Schedules-1.jpg

  • a name that makes sense to you, such as "Daily Recrawl of vamosa.com"
  • a start date and time, when the task pipeline will be first executed
  • an optional end date and time, after which the task pipeline will cease to be executed. Any scheduled task pipelines executing at that particular time will be allowed to complete normally.
  • an interval ranging from seconds to days.
  • an enabled flag, allowing you to disable or re-enable a schedule.
  • the pipeline and project that you would like to execute
  • the timezone of the schedule entry's start and end times.
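
To picture how the start time, end time and interval work together, here's a minimal plain-Ruby sketch. It is not Content Migrator's scheduler: the ScheduleEntry struct, the run_times method and the example dates are hypothetical, and we assume a run that falls exactly on the end time still counts.

```ruby
# Hypothetical model of a schedule entry; field names follow the list above,
# they are not a real Vamosa API.
ScheduleEntry = Struct.new(:name, :start_time, :end_time, :interval_seconds, :enabled)

# Returns up to `limit` run times that fall inside the entry's window.
# Assumption: a run landing exactly on end_time is still allowed.
def run_times(entry, limit)
  return [] unless entry.enabled
  times = []
  next_run = entry.start_time
  while times.size < limit && (entry.end_time.nil? || next_run <= entry.end_time)
    times << next_run
    next_run += entry.interval_seconds
  end
  times
end

# Illustrative entry: hourly runs between 09:00 and 11:00 UTC.
entry = ScheduleEntry.new("Daily Recrawl of vamosa.com",
                          Time.utc(2024, 1, 1, 9), Time.utc(2024, 1, 1, 11),
                          3600, true)
run_times(entry, 10).each { |time| puts time }
```

Leaving the end time empty (nil in the sketch) means the only limit is the one you pass in, which mirrors an open-ended schedule.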

Content Migrator will quietly skip execution of a task pipeline if that task pipeline is scheduled to run, say every hour, and the previous execution has not completed yet (leaving the project in a locked state).
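
The skip behaviour can be pictured with a small plain-Ruby sketch. This is not how Content Migrator is implemented; the ProjectLock class and its run_if_free method are hypothetical names, used only to show a run being skipped while an earlier one still holds the project lock.

```ruby
# Hypothetical model of a per-project lock; not part of the Vamosa API.
class ProjectLock
  def initialize
    @mutex = Mutex.new
  end

  # Runs the block only if no other run currently holds the lock.
  # Returns :executed if the block ran, :skipped otherwise.
  def run_if_free
    return :skipped unless @mutex.try_lock
    begin
      yield
      :executed
    ensure
      @mutex.unlock
    end
  end
end

lock = ProjectLock.new
second_run = nil
first_run = lock.run_if_free do
  # The first run is still busy when the next scheduled run fires.
  second_run = lock.run_if_free { :never_reached }
end

puts first_run    # prints executed
puts second_run   # prints skipped
```

Once the first run has released the lock, the next call to run_if_free executes normally again.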

Ad-hoc Execution

During migration the more common form of running pipelines is by clicking the green 'play' button found next to a task pipeline on each project panel:

projectpanel.jpg

or at the top of the task pipeline itself on the task pipeline design screen:

taskpipeline_designer.jpg

As with Schedule-based execution, ad-hoc execution will lock the project preventing other task pipelines from running on the same project and its related content and metadata.

The mechanism used for executing a task pipeline in an ad-hoc fashion is identical to its schedule-based counterpart. With ad-hoc execution a schedule entry is placed in the schedule telling it to execute the task pipeline 'now'.

Both methods of triggering execution result in the processing of tasks across content. Both methods create logging information that can then be queried, either during execution or after the pipeline has completed its run (see below).

Stopping Execution

You can stop execution of a task pipeline that is currently running by visiting the Home screen and hitting the red 'stop' button next to the task pipeline on the list of running pipelines, located on the right hand side of the screen.

Once the red 'stop' button is clicked, the task pipeline will gracefully stop execution. 'Graceful' means Content Migrator will complete processing the content that is currently in memory but then stop.

For example, a pipeline is processing through 10,000 HTML pages performing the Tidy and Store Content tasks. After 7,000 pages you hit the stop button. Content Migrator will complete both Tidy and Store Content for page 7,000 but will decline to process pages 7,001 to 10,000.
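
Here's the same example scaled down to ten pages as a plain-Ruby sketch. It only illustrates the 'graceful' semantics described above under the assumption of a single worker; run_until_stopped and its stop_clicked_during parameter are hypothetical and do not exist in Content Migrator.

```ruby
# Hypothetical stand-in for the pipeline runtime (single worker, for clarity).
# The stop request is only honoured between content objects, so every task
# still completes for the object that is in memory when stop is clicked.
def run_until_stopped(pages, tasks, stop_clicked_during:)
  completed = []
  stop_requested = false
  pages.each do |page|
    break if stop_requested
    stop_requested = true if page == stop_clicked_during  # simulates the click
    tasks.each { |task| task.call(page) }
    completed << page
  end
  completed
end

tidied = []
stored = []
tasks = [->(page) { tidied << page }, ->(page) { stored << page }]

completed = run_until_stopped((1..10).to_a, tasks, stop_clicked_during: 7)
puts completed.last   # prints 7
```

Both the tidy and the store stand-ins have seen page 7, and neither has seen page 8.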

To stop a scheduled task pipeline from executing, you need to disable the schedule entry, by visiting the Schedules screen, double-clicking on the entry and un-ticking the 'Enable' button, before hitting 'Save'.


Task Types

As noted above, there are three task types in Content Migrator:

  • content selectors
  • per object tasks
  • per project tasks

Each helps you solve a piece of the migration puzzle in a different way.

Content Selectors

Content Selectors help you get content into a task pipeline so that it can be processed by the 'per object' tasks that follow a content selector. Examples of content selectors are

  • Web Selector, used to crawl websites
  • RSS/Atom Selector, used to ingest content from RSS feeds etc.
  • Repository Selector, used to query content held in Content Migrator's repository

Each content selector will provide a content object and related metadata to the 'per object' tasks that follow it, one content object at a time.

You can craft your own Content Selector tasks to solve your specific problems of getting content into Content Migrator. You may want to write a

  • file system crawler
  • task to read content from a bespoke XML format
  • task to extract content from an in-house CMS system

To write your own you need to create a new task in a task library. Task libraries are accessible via the orange 'Tasks' tab at the top of the screen.

Here's a sample content selector

require "java"

module Vamosa
  include_package 'com.vamosa.content'
end

class SampleSelector < com.vamosa.tasks.ParameterisedTask
  def usage()
    requiresProject("project", "the default to select content from")
  end

  def iterator(project)
    puts "iterator is called"
    @content_descriptors = java.util.ArrayList.new
    @content_descriptors.add Vamosa::ContentDescriptor.new("http://www.vamosa.com/a.html", project)
    @content_descriptors.add Vamosa::ContentDescriptor.new("http://www.vamosa.com/b.html", project)
    @content_descriptors.add Vamosa::ContentDescriptor.new("http://www.vamosa.com/c.html", project)
    @content_descriptors.add Vamosa::ContentDescriptor.new("http://www.vamosa.com/d.html", project)
    @content_descriptors.add Vamosa::ContentDescriptor.new("http://www.vamosa.com/e.html", project)
    puts("@content_descriptors contains #{@content_descriptors}")
    @content_descriptors.iterator
  end
end

The key component of a content selector that makes it differ from the other two task types is the def iterator(project) method, which needs to return a java.util.Iterator object. The task pipeline runtime environment will call this method and use the iterator object that you return to feed objects into tasks that follow.
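
If you were writing the file system crawler mentioned earlier, the discovery part might look something like the plain-Ruby sketch below. The html_files helper is hypothetical and uses only Ruby's standard library. It returns a Ruby Enumerator, not a java.util.Iterator, so inside a real content selector you would still need to wrap each path in a content descriptor and hand back a Java iterator as shown in the sample above.

```ruby
require "find"
require "tmpdir"

# Hypothetical helper, not part of the Vamosa API.
# Yields the path of every .html file below root, one at a time.
def html_files(root)
  Enumerator.new do |yielder|
    Find.find(root) do |path|
      yielder << path if File.file?(path) && path.end_with?(".html")
    end
  end
end

# Demo against a throwaway directory.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "a.html"), "<html></html>")
  File.write(File.join(dir, "notes.txt"), "ignore me")
  puts html_files(dir).map { |path| File.basename(path) }.inspect   # prints ["a.html"]
end
```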

'Per Object' Tasks

'Per object' tasks are tasks that operate on a single content object at a time. You would use tasks of this type for most purposes during a content migration. Below are a few common ones

  • extract fragments of HTML such as the contents of the title tags, and store them as metadata
  • transform metadata such as authoring permissions & ACLs from the source CMS to those of the target
  • transform content from one layout to another
  • enrich content by dispatching to a 3rd party classification system and storing results as metadata
  • mark objects as 'to be deleted' based on a predefined condition

To define a new 'per object' task, you create a new task in a task library. Task libraries are accessible via the orange 'Tasks' tab at the top of the screen.

Below is a sample 'per object' task

require "java"

class Echo < com.vamosa.tasks.ParameterisedTask
  def usage()
    requiresContentDescriptor("contentDescriptor", "the default content descriptor")
    requiresContent("content", "the default content")
  end

  def enhance( contentDescriptor, content )
    $logger.info "Echoing: #{contentDescriptor}"
  end
end

'Per Object' tasks differ from Content Selector tasks in the method that you are required to define. In the case of 'per object' tasks you define a def enhance( contentDescriptor, content ) method. You'll use the body of this method to perform the work on the content that is passed in by the task pipeline runtime environment.

Any modifications made to content objects are passed forward to the next task in the pipeline, so the modifications are cumulative.

Note: make sure you add a Store Content task at the end of the pipeline if you want to save the changes you make to a content object.
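
To see why the modifications are cumulative, and why nothing is saved without a final store step, consider this plain-Ruby sketch. It models the idea only: a Hash stands in for a content object, lambdas stand in for tasks, and run_tasks and the repository Hash are hypothetical names, not Vamosa classes.

```ruby
# Hypothetical model: each task receives the output of the previous one.
def run_tasks(content, tasks)
  tasks.reduce(content) { |current, task| task.call(current) }
end

repository = {}   # stand-in for the Content Migrator repository

tidy          = ->(c) { c.merge(body: c[:body].strip) }
extract_title = ->(c) { c.merge(title: c[:body][%r{<title>(.*?)</title>}, 1]) }
store         = ->(c) { repository[c[:url]] = c; c }

page = { url: "http://www.vamosa.com/a.html", body: "  <title>Home</title>  " }

# Without a store step the changes exist only in the returned object.
run_tasks(page, [tidy, extract_title])
puts repository.size   # prints 0

# With a store step at the end the tidied body and the title are saved.
run_tasks(page, [tidy, extract_title, store])
puts repository["http://www.vamosa.com/a.html"][:title]   # prints Home
```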

'Per Project' Tasks

In previous versions of Content Migrator 'per project' tasks were used for numerous purposes such as importing content into the repository or setting up projects in bulk. They were used as a general purpose task when it was too difficult to solve the problem with 'per object' tasks.

The performance gains enjoyed by using a combination of a content selector and 'per object' tasks (explained in the previous section) now outweigh the general purpose flexibility of 'per project' tasks. Our advice is to use 'per project' tasks only as a last resort.

With version 3 of Content Migrator the main reasons for using 'Per Project' tasks are

  • setting up projects in bulk, e.g. when automating the setup of 50 projects.
  • exporting content to XML or 3rd party database

Here's a sample 'per project' task

require "java"

class EchoProject < com.vamosa.tasks.ParameterisedTask
  def usage()
    requiresProject("project", "the default to select content from")
  end

  def enhance( project )
    $logger.info "You are currently executing against #{project.name}"
  end
end

Similar to 'per object' tasks, a 'per project' task requires a def enhance(project) method to be defined. This method is called by the task pipeline runtime environment and is passed the project object that the task pipeline is running against. Importing of project resources, setting of project properties, or creating whole new projects can be done from the body of this method.


Designing Task Pipelines for Best Performance

Performance improvements have been introduced into Content Migrator version 3 that will change the design of your task pipelines if you're used to using version 2. In version 2 you could mix and match 'per object' and 'per project' tasks across the pipeline. No matter which way you ordered your 'per object' and 'per project' tasks, there was no real performance gain to be made. The engine was also single-threaded and was not able to make use of modern multi-core processors.

Version 3 allows you to take full advantage of multi-core processors, lots of RAM, and 64-bit operating systems. However you need to focus on solving your migration problems using 'per object' tasks as these are capable of processing multiple pieces of content in parallel. 'Per project' tasks continue to be executed in a single-threaded manner.

Here's a process diagram of our task pipeline on a dual-core server.

task_pipeline_execution.jpg

In Content Migrator version 3 'per project' tasks cannot follow a content selector or 'per object' tasks in a task pipeline. This means you need to place all your 'per project' tasks at the start of the task pipeline. Looking at the diagram above, we see that processing is single-threaded to the left of the web selector and multi-threaded to the right of it. Content Migrator is configured to use four threads on a server with a dual-core processor, i.e. twice as many threads as there are processor cores on the server. This means that once the task pipeline hits the web selector it immediately starts to populate the available threads, four in our case, with content and tasks. This is another improvement from previous versions of Content Migrator, where processing of content wouldn't commence until the web crawler had completed.

So to make the most of the performance improvements in version 3: write 'per object' tasks and use multi-core hardware. For example on an 8-core server you would have 16 content objects being processed in parallel by the task pipeline runtime environment.
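
The general shape of this, one selector feeding a pool of worker threads through a small staging queue, can be sketched in plain Ruby. This is an illustration of the pattern only and says nothing about Content Migrator's internals; process_in_parallel is a hypothetical name and the URLs are just sample data.

```ruby
# Hypothetical model: the caller plays the content selector, the block plays
# the 'per object' tasks, and worker_count threads run the block in parallel.
def process_in_parallel(items, worker_count)
  queue   = SizedQueue.new(worker_count)  # small staging area for content objects
  results = Queue.new                     # thread-safe collection of finished work

  workers = Array.new(worker_count) do
    Thread.new do
      while (item = queue.pop) != :done
        results << yield(item)
      end
    end
  end

  items.each { |item| queue << item }     # workers start as soon as the first item arrives
  worker_count.times { queue << :done }   # one stop marker per worker
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

urls = %w[a.html b.html c.html d.html e.html]
processed = process_in_parallel(urls, 4) { |url| "tidied #{url}" }
puts processed.sort
```

The order in which results come back is not guaranteed, which is why the example sorts them before printing.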

You can tune the number of threads by editing the 'executor' section of your vamosa-runtime.properties file (typically found in your Vamosa application install folder):

# low watermark of the number of threads, equates to the number of cores
vamosa.executor.corePoolSize=2
# high watermark of the number of threads, equates to twice the number of cores
vamosa.executor.maxPoolSize=4
# leave as default, no impact on performance
vamosa.executor.idleThreadKeepAlive=120
# size of the staging area for content objects, should always be equal to maxPoolSize
vamosa.executor.queueSize=4
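
If you'd rather not work out the numbers by hand, the sizing rule above is easy to script. The sketch below is a convenience of our own, not a Vamosa tool: executor_settings is a hypothetical helper that applies the 'cores, twice the cores, same as maxPoolSize' rule to the three property names shown above.

```ruby
require "etc"

# Hypothetical helper: derives the executor settings from a core count
# using the rule of thumb described in this guide.
def executor_settings(cores)
  max_threads = cores * 2
  {
    "vamosa.executor.corePoolSize" => cores,
    "vamosa.executor.maxPoolSize"  => max_threads,
    "vamosa.executor.queueSize"    => max_threads
  }
end

# Print suggested values for the machine this script runs on.
executor_settings(Etc.nprocessors).each do |key, value|
  puts "#{key}=#{value}"
end
```

For an 8-core server this gives a maxPoolSize and queueSize of 16, matching the example earlier in this section.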


Logging

Task pipelines generate logging information when they are executed. This logging information is generated partly by the task pipeline runtime environment and partly by your own scripts, using the logger object that is available in each script.

All logging information is collected into Vamosa Content Migrator's underlying MongoDB and can be queried at any time, including whilst the task pipeline is being executed.

Job Log

Logging information is available in two formats. Firstly there's the job log, which is similar to every log file you've probably encountered: flat, ordered by date & time, with each line containing some information.

joblog.jpg

You can view the log either in chronological or reverse chronological order by double-clicking on the Timestamp column header. Large logs are broken up into pages, which you can skip through by clicking the green arrow buttons or by entering the page number in the Page text box and pressing "Enter".

Each log entry is assigned a log level, starting with Debug at the lowest, all the way up to Fatal. You can filter out low-level debug information by clicking one of the Log Level filters, e.g. clicking Info will only show entries marked Info, Error, or Fatal.
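
The filter works as a threshold: pick a level and you see that level and everything above it. Here's a minimal plain-Ruby sketch of that rule. It assumes only the four levels named in this guide, in that order; LEVELS and filter_by_level are hypothetical names and the log entries are made up for illustration.

```ruby
# Assumption: only the levels named in this guide, lowest first.
LEVELS = %w[Debug Info Error Fatal].freeze

# Keeps the entries whose level is at or above the chosen threshold.
def filter_by_level(entries, threshold)
  minimum = LEVELS.index(threshold)
  raise ArgumentError, "unknown level: #{threshold}" if minimum.nil?
  entries.select { |entry| LEVELS.index(entry[:level]) >= minimum }
end

# Made-up log entries for the demo.
log = [
  { level: "Debug", message: "fetching a.html" },
  { level: "Info",  message: "stored a.html" },
  { level: "Error", message: "could not tidy b.html" },
  { level: "Fatal", message: "repository unavailable" }
]

filter_by_level(log, "Info").each { |entry| puts "#{entry[:level]}: #{entry[:message]}" }
```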

Pipeline Report

Pipeline reports provide a high-level summary of the execution of every task against every content object in your task pipeline.

pipelinereport.jpg

The top of the report provides a high-level summary of the pipeline's execution. Start time and duration are provided, as are performance counts relating to the number of objects processed.

The bottom part of the report shows all the URLs that were processed and the tasks that were applied to them. Red-coloured tasks indicate failure and for those content objects no further tasks are executed.
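
The 'stop at the first failure' rule for a single content object can be sketched in plain Ruby as follows. This is a model of the behaviour described above, not the report's actual implementation; report_row is a hypothetical name and the failing task is simulated.

```ruby
# Hypothetical model of one row of the report: runs the tasks in order for
# one URL and records an outcome per task, stopping at the first failure.
def report_row(url, tasks)
  outcomes = {}
  tasks.each do |name, task|
    begin
      task.call(url)
      outcomes[name] = :ok
    rescue StandardError
      outcomes[name] = :failed
      break   # no further tasks are executed for this content object
    end
  end
  outcomes
end

tasks = {
  "Tidy Content"  => ->(url) { url },
  "Classify"      => ->(url) { raise "no matching template for #{url}" },  # simulated failure
  "Store Content" => ->(url) { url }
}

puts report_row("http://www.vamosa.com/a.html", tasks).inspect
```

In the printed row Store Content does not appear at all, because it never ran for that object.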

Similar to the job log you can skip through the URL list by using the page-left and page-right arrows at the top of the pipeline report. You can also skip to a specific page by typing its number in the text box next to the page-right arrow.

Lastly you can filter the URL list to only show those that failed, as well as search on the URL field by typing a fragment of the URL into the "Search..." box. You can use regular expressions to build more powerful search queries. MongoDB uses JavaScript-style regular expressions for which there are many cheat sheets available for download.
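
To get a feel for what a regular expression buys you over a plain URL fragment, here's a small plain-Ruby sketch. It uses Ruby's own regular expression engine, which may differ in detail from the one behind the "Search..." box, so treat the pattern as an illustration. The search_urls helper and the URLs are hypothetical.

```ruby
# Hypothetical helper: keeps the URLs that match the given pattern.
def search_urls(urls, pattern)
  regex = Regexp.new(pattern)
  urls.select { |url| regex.match?(url) }
end

# Made-up URLs for the demo.
urls = [
  "http://www.vamosa.com/news/2009/launch.html",
  "http://www.vamosa.com/news/archive/index.html",
  "http://www.vamosa.com/products/index.html"
]

# A plain fragment matches anything containing it.
puts search_urls(urls, "news").size                 # prints 2

# A regular expression can be more selective: news pages under a four-digit year.
puts search_urls(urls, "/news/\\d{4}/").inspect
```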

Both the job log and the pipeline reports are accessible from the project panels on the Home screen and on the Projects screen.