Web Crawl Failing to Finish due to Errors
One of our partners is having trouble crawling a site that contains sections of JavaScript, as shown below:
<script language="javascript">
OAS_sitepage="www.accaglobal.com/uk/members/support/public_practice/resource_centre/it/facilities";
OAS_url ='http://ads.accaglobal.com/RealMedia/ads/';
OAS_listpos = 'Top,Left';
OAS_query = '';
OAS_version = 10;
OAS_rn = '001234567890'; OAS_rns = '1234567890';
OAS_rn = new String (Math.random()); OAS_rns = OAS_rn.substring (2, 11);
function OAS_NORMAL(pos) {
document.write('<A HREF="' + OAS_url + 'click_nx.ads/' + OAS_sitepage + '/1' +
OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + '" TARGET=_top>');
document.write('<IMG SRC="' + OAS_url + 'adstream_nx.ads/' + OAS_sitepage + '/1' +
OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + '" BORDER=0></A>');
}
</script>
The error returned by the crawl is as follows:
15:04:05,514 ERROR [Job] Error in URL syntax ' + OAS_url + 'click_nx.ads/' + OAS_sitepage + '/1' +
OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + ' on page http://www.accaglobal.com/uk/members/support/public_practice/resource_centre/it/facilities?fffffffeffffffff
15:04:05,514 ERROR [Job] Error in URL syntax ' + OAS_url + 'adstream_nx.ads/' + OAS_sitepage + '/1' +
OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + ' on page http://www.accaglobal.com/uk/members/support/public_practice/resource_centre/it/facilities?fffffffeffffffff
The crawl then gives the following error after throwing a duplicate key violation on dbo.contentdescriptor, before finally giving up:
15:04:05,546 INFO [Job] Completed crawl of batch [2c967a402597ea85012597f08c2c0004] in [1300] secs. with cQ:vL:xpl:md5 = 62575:3912:0:60
15:04:06,405 ERROR [DefaultMessageListenerContainer] Execution of JMS message listener failed
org.springframework.transaction.UnexpectedRollbackException: JTA transaction unexpectedly rolled back (maybe due to a timeout); nested exception is javax.transaction.RollbackException: Already marked for rollback TransactionImpl:XidImpl[FormatId=257, GlobalId=vam-dev-vm2/104, BranchQual=, localId=104]
Caused by:
javax.transaction.RollbackException: Already marked for rollback TransactionImpl:XidImpl[FormatId=257, GlobalId=vam-dev-vm2/104, BranchQual=, localId=104]
at org.jboss.tm.TransactionImpl.checkBeforeStatus(TransactionImpl.java:1164)
at org.jboss.tm.TransactionImpl.checkIntegrity(TransactionImpl.java:1107)
at org.jboss.tm.TransactionImpl.beforePrepare(TransactionImpl.java:1090)
at org.jboss.tm.TransactionImpl.commit(TransactionImpl.java:306)
at org.jboss.tm.TxManager.commit(TxManager.java:224)
at org.jboss.tm.usertx.client.ServerVMClientUserTransaction.commit(ServerVMClientUserTransaction.java:126)
at org.springframework.transaction.jta.JtaTransactionManager.doCommit(JtaTransactionManager.java:773)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.processCommit(AbstractPlatformTransactionManager.java:654)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.commit(AbstractPlatformTransactionManager.java:624)
at org.springframework.transaction.interceptor.TransactionAspectSupport.commitTransactionAfterReturning(TransactionAspectSupport.java:307)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:117)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:176)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:210)
at $Proxy61.onMessage(Unknown Source)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:856)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:795)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:765)
at org.springframework.jms.listener.DefaultMessageListenerContainer.doReceiveAndExecute(DefaultMessageListenerContainer.java:546)
at org.springframework.jms.listener.DefaultMessageListenerContainer.receiveAndExecute(DefaultMessageListenerContainer.java:474)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:904)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:857)
at org.springframework.core.task.SimpleAsyncTaskExecutor$ConcurrencyThrottlingRunnable.run(SimpleAsyncTaskExecutor.java:203)
at java.lang.Thread.run(Thread.java:619)
I have provided a workaround in the meantime, using a proxy server to remove the <script> tag that appears to be causing the problem.
Is this a problem with the way that the crawler is parsing links within the HTML pages?
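As a rough illustration of how this could happen, here is a minimal Python sketch of a naive link extractor that scans raw HTML for HREF/SRC attributes without skipping <script> bodies. The crawler's real parser is not shown in this thread, so the regex and the function names (extract_candidates, looks_like_url) are assumptions made purely for illustration. Run against a cut-down copy of the snippet above, it picks up the same string fragment that appears in the "Error in URL syntax" log line.

```python
import re

# Hypothetical naive extractor (NOT the crawler's actual code): it looks for
# href/src attribute values anywhere in the page, including inside <script>.
ATTR_RE = re.compile(r'(?:href|src)\s*=\s*"([^"]*)"', re.IGNORECASE)

# Cut-down copy of the page from the original post, plus one ordinary link.
PAGE = '''<script language="javascript">
OAS_url ='http://ads.accaglobal.com/RealMedia/ads/';
function OAS_NORMAL(pos) {
document.write('<A HREF="' + OAS_url + 'click_nx.ads/' + OAS_sitepage + '/1' +
OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + '" TARGET=_top>');
}
</script>
<a href="http://www.accaglobal.com/uk/members">Members</a>'''


def extract_candidates(html):
    """Return every double-quoted href/src value found in the raw HTML."""
    return ATTR_RE.findall(html)


def looks_like_url(candidate):
    """Crude sanity check: reject values containing whitespace or quotes."""
    return bool(candidate) and re.search(r"[\s']", candidate) is None


for candidate in extract_candidates(PAGE):
    status = "ok" if looks_like_url(candidate) else "Error in URL syntax"
    print(status, repr(candidate))
```

If the crawler does behave like this, the syntax errors themselves would be harmless noise from string concatenation inside document.write, which could be avoided by skipping <script> bodies before extracting links.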
Cheers,
Paul
Support Staff 2 Posted by Ijonas Kisselbach on 17 Dec, 2009 07:14 AM
Really hard to tell without being able to recreate the error condition based on the info given so far.
The thing that makes the crawler blow up is the duplicate key violation, i.e. the same URL has already been stored against the same project.
The trick is finding out what causes this situation to occur.
3 Posted by Riaz Ahmed on 17 Dec, 2009 03:23 PM
Hi guys,
I am still getting the same error after putting this regexp rule on the proxy client (privoxy):
This rule ‘should’ replace anything in the script tag with a blank space...right?!?
So the following culprit JavaScript shouldn’t be causing any problems....
www.accaglobal.com/uk/members/support/public_practice/resource_cent...http://ads.accaglobal.com/RealMedia/ads/'); document.write('
But the crawler still stops and times out with the following in the log:
15:12:39,058 INFO [Job] Identified http://www.accaglobal.com/uk/members/support/public_practice/resource_centre/general/tools?fffffffeffffffff0a0121394b2a10332a842b69b170babb9c816de52734df8c [200]
15:12:39,058 INFO [Job] Retrieved http://www.accaglobal.com/uk/members/support/public_practice/resource_centre/general/tools?fffffffeffffffff0a0121394b2a10332a842b69b170babb9c816de52734df8c [200]
15:12:39,058 ERROR [Job] Error in URL syntax ' + OAS_url + 'click_nx.ads/' + OAS_sitepage + '/1' +
OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + ' on page http://www.accaglobal.com/uk/members/support/public_practice/resource_centre/general/tools?fffffffeffffffff0a0121394b2a10332a842b69b170babb9c816de52734df8c
15:12:39,058 ERROR [Job] Error in URL syntax ' + OAS_url + 'adstream_nx.ads/' + OAS_sitepage + '/1' +
OAS_rns + '@' + OAS_listpos + '!' + pos + OAS_query + ' on page http://www.accaglobal.com/uk/members/support/public_practice/resource_centre/general/tools?fffffffeffffffff0a0121394b2a10332a842b69b170babb9c816de52734df8c
15:12:39,058 ERROR [Job] Error in URL syntax http://search.accaglobal.com/search?ie=&q=&site=acca-collection&output=xml_no_dtd&client=acca-collection&access=p&lr=&ip=10.10.13.12&proxystylesheet=acca-collection&search=search&oe=&proxycustom= on page http://www.accaglobal.com/uk/members/support/public_practice/resource_centre/general/tools?fffffffeffffffff0a0121394b2a10332a842b69b170babb9c816de52734df8c
15:12:39,089 ERROR [ContentManagerServiceImpl] Unable to store content for Content Descriptor
org.springframework.dao.DataIntegrityViolationException: PreparedStatementCallback; SQL []; Violation of UNIQUE KEY constraint 'UQcontentdescripto2C3393D0'. Cannot insert duplicate key in object 'dbo.contentdescriptor'.; nested exception is com.microsoft.sqlserver.jdbc.SQLServerException: Violation of UNIQUE KEY constraint 'UQcontentdescripto2C3393D0'. Cannot insert duplicate key in object 'dbo.contentdescriptor'.
Caused by:
com.microsoft.sqlserver.jdbc.SQLServerException: Violation of UNIQUE KEY constraint 'UQcontentdescripto2C3393D0'. Cannot insert duplicate key in object 'dbo.contentdescriptor'.
15:12:39,120 INFO [Job] Completed crawl of batch [2c967a40259d2dd101259d3067d10004] in [144] secs. with cQ:vL:xpl:md5 = 61961:3929:0:0
15:12:44,355 INFO [Job] Checking in job[2c967a40259d2dd101259d3007330002], batch[2c967a40259d2dd101259d3067d10004] with crawlQ{61961}, vL{3929}, xpl{0}
15:17:40,126 WARN [TransactionImpl] Transaction TransactionImpl:XidImpl[FormatId=257, GlobalId=vam-dev-vm2/144, BranchQual=, localId=144] timed out. status=STATUS_ACTIVE
any ideas?
Cheers
Riaz
4 Posted by Riaz Ahmed on 17 Dec, 2009 03:26 PM
the regexp rule got stripped out...trying again....
s|<script language=".*?</script>| |ig
Support Staff 5 Posted by paul.henderson on 22 Dec, 2009 05:43 PM
Hi Riaz,
I managed to strip out all of the <script> tags using the following regex:
s|<script[^>]*>.*?<\/script>||igs
I did notice however, within my version of Privoxy that the following line was commented out within the config.txt which prevented the user.filter from being applied.
filterfile user.filter # User customizations
I removed the '#' from the start of this line and the user.filter file was applied which removed the <script> tags and all of their contents from the page.
Hope this helps,
Paul
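One visible difference between the two filter rules quoted in this thread is the trailing "s" flag. The sketch below uses Python's re module, where re.DOTALL plays the role of that flag, to show why a pattern without it may fail to match a <script> block that spans several lines. This only demonstrates Python's regex semantics on a made-up sample page; it is an assumption that Privoxy's filter engine treats the flag the same way.

```python
import re

# Made-up sample page with a multi-line <script> block, as on the crawled site.
PAGE = (
    '<p>before</p>\n'
    '<script language="javascript">\n'
    'OAS_version = 10;\n'
    '</script>\n'
    '<p>after</p>'
)

# First rule from the thread, s|<script language=".*?</script>| |ig
# No "s" flag, so "." does not match newlines here.
without_dotall = re.sub(
    r'<script language=".*?</script>', ' ', PAGE, flags=re.IGNORECASE
)

# Second rule from the thread, s|<script[^>]*>.*?<\/script>||igs
# re.DOTALL stands in for the "s" flag, so "." also matches newlines.
with_dotall = re.sub(
    r'<script[^>]*>.*?</script>', '', PAGE, flags=re.IGNORECASE | re.DOTALL
)

print('script left in place without DOTALL:', '<script' in without_dotall)
print('script left in place with DOTALL:', '<script' in with_dotall)
```

In this sample the first pattern leaves the page unchanged because the closing tag is never on the same line as the opening tag, while the second removes the whole block.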
6 Posted by Patric DelCioppo on 22 Dec, 2009 11:02 PM
Hey Paul,
I think the JavaScript is a red herring.
We've run into the unique key constraint violation recently on a few crawls. Although I haven't proven this, my theory is that redirect targets don't get saved to the crawl state and/or purged from the crawl queue, and therefore if the crawler then attempts to capture the already stored redirect target DIRECTLY, it throws this error.
Given a url "A" which redirects to "B", and another page "C" which links to "B":
1) Crawl queue contains "A" and "C".
2) Capture of "A" results in creation of contentdescriptor "B".
3) "A" is removed from crawl queue.
4) Capture of "C" causes "B" to be added to the crawl queue.
5) Attempted capture of "B" throws exception.
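The five steps above can be sketched as a toy simulation. Everything here is hypothetical: ContentStore stands in for dbo.contentdescriptor with a unique key on the URL, and the REDIRECTS and LINKS tables encode the "A", "B", "C" example. It is not the crawler's code, only a model of the theory, with a record_redirect_targets switch to show what changes if redirect targets are added to the crawl state.

```python
from collections import deque


class DuplicateKeyError(Exception):
    """Stands in for the UNIQUE KEY constraint violation in the log."""


class ContentStore:
    """Toy stand-in for dbo.contentdescriptor: one row per URL, unique key."""

    def __init__(self):
        self.urls = set()

    def insert(self, url):
        if url in self.urls:
            raise DuplicateKeyError(url)
        self.urls.add(url)


REDIRECTS = {"A": "B"}   # "A" redirects to "B"
LINKS = {"C": ["B"]}     # page "C" links to "B"


def crawl(seeds, record_redirect_targets):
    store = ContentStore()
    queue = deque(seeds)
    seen = set(seeds)                       # crawl state: URLs already known
    while queue:
        url = queue.popleft()
        final = REDIRECTS.get(url, url)     # follow a redirect, if any
        if record_redirect_targets:
            seen.add(final)                 # proposed fix: remember the target
        store.insert(final)                 # content is stored under final URL
        for link in LINKS.get(final, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


try:
    crawl(["A", "C"], record_redirect_targets=False)
except DuplicateKeyError as exc:
    print("duplicate key for", exc)

print(sorted(crawl(["A", "C"], record_redirect_targets=True).urls))
```

With the switch off, "B" is stored once via the redirect from "A" and then queued again via the link from "C", so the second insert fails. With it on, "B" is already in the crawl state when "C" is processed and is never re-queued.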
Ijonas - is this plausible?
Cheers,
Patric
7 Posted by stewartmckee on 04 Mar, 2010 02:52 PM
Has this issue been resolved or is it still ongoing? It's been a while since the last post.
Stewart.
stewartmckee closed this discussion on 09 Mar, 2010 03:41 PM.