Tuesday, March 23, 2010

BPEL 10g: Clustering with JGroups on OC4J and WebLogic

Clustering Oracle SOA Suite with BPEL is more or less straightforward. BPEL uses JGroups for clustering: it propagates process deployments, process state changes, and process undeployments. Most users will rely on the out-of-the-box JGroups configuration, which is based on multicast. I wrote an earlier cluster document that was based on multicast.

Multicast can be a problem on the network: it is based on UDP and is often blocked. I have run into this issue at a few customers. The solution is rather simple: instead of UDP/multicast, we use TCP and point explicitly to the nodes in the cluster.

In the next example we have two nodes, node1 and node2. They use port 7900 with a port range of 3 to broadcast the BPEL process changes.

On each node:
# Back up the default (multicast) configuration.
cd /u01/appl/p1bplpe/product/OracleAS_1/bpel/system/config
mv jgroups-protocol.xml jgroups-protocol.xml.old
# Write the TCP-based configuration; the closing EOF must be on a line of its own.
cat > jgroups-protocol.xml << 'EOF'
<config>
  <TCP start_port="7900" loopback="true"
       send_buf_size="32000" recv_buf_size="64000"/>
  <TCPPING timeout="3000" initial_hosts="node1[7900],node2[7900]"
           port_range="3" num_initial_members="3"/>
  <FD timeout="2000" max_tries="4"/>
  <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
  <pbcast.NAKACK gc_lag="100" retransmit_timeout="600,1200,2400,4800"/>
  <pbcast.STABLE stability_delay="1000" desired_avg_gossip="20000"
                 down_thread="false" max_bytes="0" up_thread="false"/>
  <VIEW_SYNC avg_send_interval="60000" down_thread="false" up_thread="false"/>
  <pbcast.GMS print_local_addr="true" join_timeout="5000"
              join_retry_timeout="2000" shun="true"/>
</config>
EOF

# Keep a copy of the new configuration.
cp jgroups-protocol.xml jgroups-protocol.xml.new
Restart the nodes
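
After the restart it can be useful to check that the nodes can reach each other over TCP, since that is the point of this configuration. The following is only a sketch and not part of the original procedure. The function names are made up for illustration, it assumes that `bash` (for `/dev/tcp`) and the coreutils `timeout` command are available, and it treats the port range as three consecutive ports starting at 7900; whether JGroups counts the upper bound inclusively may depend on the JGroups version.

```shell
# Hypothetical helper: print "host port" pairs for every host and every
# port in the range start_port .. start_port + port_range - 1.
candidate_endpoints() {
  start_port=$1
  port_range=$2
  shift 2
  for host in "$@"; do
    offset=0
    while [ "$offset" -lt "$port_range" ]; do
      echo "$host $((start_port + offset))"
      offset=$((offset + 1))
    done
  done
}

# Hypothetical helper: read "host port" pairs from stdin and try a TCP
# connect to each one (3 second limit), reporting open or closed.
check_endpoints() {
  while read -r host port; do
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "open   $host:$port"
    else
      echo "closed $host:$port"
    fi
  done
}
```

Run it from each node as `candidate_endpoints 7900 3 node1 node2 | check_endpoints`. If every port of the other node is reported as closed, the BPEL instance there is probably not listening yet, or a firewall is blocking TCP as well.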

5 comments:

Tony said...

Hi Marc,

We have a BPEL 10g (10.1.3.4) cluster as well, using the standard JGroups configuration. Recently we encountered an issue with a BPEL flow the team had developed, which worked fine on dev and test (single instance) but failed on our clustered environment.
They had a front-end process that calls a back-end async process. The callback from the async process to the sync process never arrived and the sync process would time out, although the async process succeeded.
Problem: the sync BPEL process uses a local (JVM) mutex.

I don't know if it's good (or common) practice to call an async process/service from a sync process, but I (as an 'admin') had some big discussions with the developers about this 'bug', which Oracle qualifies as not a bug!
Do you have any information about these types of clustering issues?

regards,
Tony van Esch

Marc Kelderman SOA Blog said...

Calling an async process should not be a problem, as long as you send a reply to your caller within the sync timeout period (default 45 seconds).

Tony said...

Unfortunately, that's not the problem. Your colleague Tom H has confirmed this (he is sitting next to me).

Sync processes run in a thread, and this thread is blocked until a callback returns. Because threads are not shared among JVMs, when the async process rehydrates on another JVM the callback will never arrive (although the async process itself completes successfully). The sync process will time out after 45 seconds.

Ahmed Aboulnaga said...

Tony,

There are numerous limitations in an Oracle SOA Suite 10g cluster.

If your process includes an asynchronous invocation with callback to an external service, then you are possibly hitting one of the limitations.

See my blog post:
http://blog.thisisahmed.com/2008/10/behavior-of-bpel-processes-in-bpel.html

You may have to redesign your process. Sorry.

Tony said...

Hi Ahmed,

Thanks for the link to your blog. Very interesting. I'll be watching it from now on!
The problem we had is actually caused by the way BPEL is implemented. When an async BPEL process is dehydrated, you don't know which JVM in the cluster will rehydrate it. When you call the async BPEL process from a sync BPEL process (which waits for the async callback in a local thread, and thus blocks), and the async process is rehydrated in another JVM, the callback can never arrive. Thus you end up with a sync BPEL process that will time out.

BTW: the implementation is not flawed, but is actually designed this way for performance.

kind regards,
Tony van Esch
