Wednesday, March 23, 2011

Disaster Recovery with SOA Suite and OSB and other products

We did a disaster recovery test in a production environment. The production environment consists of two separate locations. Location #1 stayed up and running while we isolated location #2 from it. From the point of view of location #2, we assume that location #1 is down. The goal is to bring up location #2 as quickly as possible.

I want to share our experience from this exercise. All the application servers are based on Oracle technology:
  • Oracle SOA Suite 10g (10.1.3.4)
  • Oracle SOA Suite 11g, ps2
  • Oracle Service Bus 11g, ps3
  • Oracle WebCenter 11g, ps2
  • Oracle WebLogic + ADF 11g, ps3
All the application servers are installed in clusters, ranging from 2-node to 8-node clusters.

The databases run in active/standby mode, which means that Oracle Real Application Clusters is not used. The active databases run on location #1, and the databases on location #2 run in standby mode.

The trick is how to get the application clusters up and running while location #1 is not available. When location #1 is not available, the application servers on location #2 cannot reach the database.

Meanwhile, the DBAs are working to switch the database on location #2 from standby to active.

We were able to keep the application servers on location #2 running, without restarting them, while the databases were brought up to active. Only the Oracle SOA 10g application servers needed to be restarted.

What did we do? We changed the hostnames in all the data sources to a generic hostname in the DNS server.

Instead of the hostname of the active database on location #1, we use a generic hostname:
  • old: soaprod.node1.database.nl
  • new: soaprod.cluster.database.nl
While the database on location #2 is being brought up to active, the DNS entry for *.cluster.database.nl is changed to point to location #2.
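
The hostname switch can be illustrated with a small Python sketch. It is not taken from our configuration: the port 1521, the service name soaprod, the helper names and the IP addresses are assumptions made only for this example. It builds the data source URL in the Oracle thin driver's //host:port/service form and checks whether a hostname currently resolves to the address you expect after the DNS change.

```python
import socket


def build_jdbc_url(host, port=1521, service="soaprod"):
    """Build an Oracle thin-driver JDBC URL (port and service name are assumed)."""
    return f"jdbc:oracle:thin:@//{host}:{port}/{service}"


def points_to(host, expected_ip, resolver=socket.gethostbyname):
    """Return True if host currently resolves to expected_ip.

    The resolver is injectable so the check can be tried without real DNS.
    """
    try:
        return resolver(host) == expected_ip
    except OSError:
        # Covers socket.gaierror, for example when the name does not exist.
        return False


# Old: the data source is tied to the database host on location #1.
old_url = build_jdbc_url("soaprod.node1.database.nl")
# New: the data source follows whichever location the DNS entry points to.
new_url = build_jdbc_url("soaprod.cluster.database.nl")

print(old_url)
print(new_url)
```

After the DNS entry has been changed, a call such as points_to("soaprod.cluster.database.nl", "192.0.2.20") could confirm from an application server that the generic name now leads to location #2. The address 192.0.2.20 is a placeholder from the documentation range, not a real database address.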

When the database was active again, we saw:
  • Oracle SOA 10g did not run successfully. A restart was needed.
  • Oracle SOA 11g did run, but some data sources were in suspended mode. Resuming the data sources fixed the issue.
  • Oracle WebCenter 11g did run, but only on location #1. It was installed on a replication volume, and we brought up this server from the replication volume successfully.
  • Oracle Web Services Manager 10g did not run. A restart was needed.
  • Oracle WebLogic + ADF did run successfully.
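
Before going through checks like the ones above, it helps to know when the database is reachable again through the generic hostname. The following is a minimal Python sketch of such a probe, not something we used during the test. The function name, the port 1521 and the retry settings are assumptions. A successful TCP connection only shows that something is listening on the port; it does not prove that the database is open.

```python
import socket
import time


def wait_for_listener(host, port=1521, attempts=30, delay=10.0,
                      connect=socket.create_connection, sleep=time.sleep):
    """Poll host:port until a TCP connection succeeds.

    Returns the attempt number that succeeded, or 0 if every attempt failed.
    connect and sleep are injectable so the loop can be tried without a network.
    """
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=5)
        except OSError:
            # Not reachable yet: wait before the next attempt, unless it was the last.
            if attempt < attempts:
                sleep(delay)
            continue
        conn.close()
        return attempt
    return 0


if __name__ == "__main__":
    # Hypothetical usage from an application server on location #2.
    result = wait_for_listener("soaprod.cluster.database.nl")
    print("reachable after attempt", result if result else "none")
```
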
Notes:

All the administration servers are installed on a separate server on a replication volume. This makes it easy to bring up the administration server on the other location.

The log files will grow, because the cluster periodically checks whether the other node is available.


Using virtualisation to run your application servers saves a lot of time in case of a disaster.
