Connected On-Prem Disaster Recovery Strategy (3 Node Architecture)

This document describes the Disaster Recovery strategy for Harness Docker Connected On-Prem using a 3 node architecture.

In this topic:

  • 3 Node Architecture
  • Disaster Recovery Setup Options
  • Conclusion

3 Node Architecture

The following architecture uses 3 nodes for Harness Docker On-Prem:

In this design, all Harness microservices run on every node.

Some of the Harness microservices are stateful and share state across nodes, while others are stateless.

  • With 3 nodes, every microservice runs in an HA configuration.
  • The system is resilient to up to 2 failures per microservice, that is, the instances on any 2 of the 3 nodes can fail.
  • The system can function without any service interruption with up to 1 complete node failure (see the quorum sketch below).
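To see why the interruption-free limit is 1 node rather than 2, note that the stateful services (for example, the MongoDB replica set) need a majority of nodes to stay writable. The snippet below is only an illustrative sketch of that majority arithmetic; it assumes majority-based quorum and is not part of the Harness installation.

    # Minimal sketch: majority-quorum arithmetic for a 3 node cluster.
    # Assumption: stateful services stay writable only while a majority
    # of their 3 instances are healthy; stateless services just need 1.

    def quorum(total_nodes):
        """Smallest number of healthy nodes that still forms a majority."""
        return total_nodes // 2 + 1

    def keeps_quorum(total_nodes, failed_nodes):
        """True if the cluster keeps quorum after failed_nodes go down."""
        return (total_nodes - failed_nodes) >= quorum(total_nodes)

    for failed in range(4):
        print(f"3 nodes, {failed} failed -> quorum kept: {keeps_quorum(3, failed)}")
    # 0 or 1 failures keep quorum (no interruption); 2 or 3 failures lose it,
    # which is where the recovery procedures below come in.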

In addition to the HA within this On-Prem setup, you can also achieve a Disaster Recovery setup in case of a data center failure.

Disaster Recovery Setup Options

Following are the Disaster Recovery setup options for a Harness On-Prem installation.

Option 1: 3 Data Centers (Active/Active Strategy)

This option uses the following architecture:

The requirements below are in addition to the basic infrastructure and network requirements specified in Docker Connected On-Prem Setup.

Infrastructure Requirements
  • 3 separate data centers (DCs), with network connectivity between the nodes (1 VM in each DC).
  • Harness Ambassador should be running on the node in the primary DC.
  • GSLB/Load Balancer that can connect to each node in the 3 DCs.
Network Requirements
  • MUST have low network latency among the 3 DCs.
  • The Harness Ambassador should have SSH-based connectivity to the 2 nodes in the other 2 DCs (see the connectivity check sketch below).
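As a pre-flight check of these two requirements, you could run something like the sketch below from the Ambassador node in the primary DC. The hostnames, the 10 ms latency threshold, and the reliance on the local ssh and ping binaries are all assumptions for illustration, not Harness-documented values.

    # Pre-flight sketch run from the Ambassador node in the primary DC.
    # Assumptions: placeholder hostnames, SSH keys already distributed,
    # and Linux-style `ping` output.
    import subprocess

    DC_NODES = ["dc2-node.example.com", "dc3-node.example.com"]  # hypothetical hosts

    def ssh_reachable(host):
        """Non-interactive SSH check that fails fast instead of prompting."""
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"],
            capture_output=True,
        )
        return result.returncode == 0

    def latency_ok(host, max_avg_ms=10.0):
        """Rough latency check using 5 ICMP pings; the threshold is an assumption."""
        result = subprocess.run(["ping", "-c", "5", "-q", host],
                                capture_output=True, text=True)
        if result.returncode != 0:
            return False
        # Last line looks like: rtt min/avg/max/mdev = 0.4/0.6/0.9/0.1 ms
        avg_ms = float(result.stdout.strip().splitlines()[-1].split("=")[1].split("/")[1])
        return avg_ms <= max_avg_ms

    for node in DC_NODES:
        print(node, "ssh:", ssh_reachable(node), "latency ok:", latency_ok(node))
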
Failure Tolerance

The setup can tolerate the following:

  • Any microservice failure on up to 2 nodes.
  • Any single node failure (complete AZ/DC failure).
Recovery

When the DCs recover, there are two options to recover the Harness setup:

  1. Contact Harness Support. Harness Support will restart the microservices on each node using the Ambassador and restore the setup. If the primary DC has recovered, you will have to restart the Ambassador before Harness Support can help restore the setup.
  2. Run the startup scripts that are present at specified locations on each node, in a specific order. The microservices have to be brought up in the exact order specified below (see the startup sketch after this list):
    1. MongoDB.
    2. Proxy.
    3. The rest of the Harness microservices (Manager, UI, Verification Service, Learning Engine).
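Because the order matters more than the mechanism, the following sketch only illustrates driving that order from one node over SSH. Every hostname and script path in it is a placeholder assumption; the actual script names and locations come from your installation and from Harness Support.

    # Illustration only: enforce the documented startup order
    # (MongoDB -> Proxy -> remaining microservices) across the 3 nodes.
    # Hostnames and script paths are placeholders, not Harness-documented values.
    import subprocess

    NODES = ["node1.example.com", "node2.example.com", "node3.example.com"]
    STARTUP_ORDER = [
        "/opt/harness/scripts/start_mongodb.sh",   # hypothetical path
        "/opt/harness/scripts/start_proxy.sh",     # hypothetical path
        "/opt/harness/scripts/start_services.sh",  # Manager, UI, Verification Service, Learning Engine
    ]

    def run_on_node(node, script):
        """Run one startup script on one node; stop immediately on failure."""
        subprocess.run(["ssh", "-o", "BatchMode=yes", node, script], check=True)

    # Bring each tier up on every node before moving to the next tier.
    for script in STARTUP_ORDER:
        for node in NODES:
            run_on_node(node, script)
            print(f"started {script} on {node}")
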
Data Loss

There will not be any data loss in this setup for up to a 2 node failure. The system will resume from where it left off before the failure.

In case all 3 nodes fail, data loss depends on the cadence of the data backup that has been set up. For example, if you are taking backups every 15 mins, you can lose at most the last 15 mins of data; Harness can restore everything up to the most recent backup.

Option 2: 2 Data Centers (Active/Passive Strategy)

If the infrastructure or network requirements specified in the 3 DC architecture cannot be met, Harness can support an active/passive Disaster Recovery setup.

The requirements below are in addition to the basic infrastructure and network requirements specified in Docker Connected On-Prem Setup.

Infrastructure Requirements
  • 2 separate data centers (DCs) with 3 nodes in each DC.
  • Harness Ambassador should be running on 1 node in each DC.
  • Backup storage set up in a third location and configured as per the Connected On-Prem Backup and Restore Strategy. Both full and incremental backups are recommended (see the scheduling sketch after this list).
  • GSLB/Load Balancer that can connect to each node in the 2 DCs.
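The backup cadence you choose here directly bounds the worst-case data loss described later in this topic. The sketch below only illustrates the idea of interleaving a daily full backup with 15 minute incrementals; the script paths are placeholder assumptions, and in practice you would drive this from cron or from the tooling described in the Connected On-Prem Backup and Restore Strategy rather than from a loop.

    # Illustration only: daily full backup plus 15 minute incrementals.
    # The backup scripts below are hypothetical placeholders; follow the
    # Connected On-Prem Backup and Restore Strategy for the real procedure.
    import subprocess
    import time

    FULL_BACKUP_CMD = ["/opt/harness/scripts/full_backup.sh"]           # hypothetical
    INCREMENTAL_BACKUP_CMD = ["/opt/harness/scripts/incr_backup.sh"]    # hypothetical
    INTERVAL_SECONDS = 15 * 60          # incremental cadence = worst-case data loss
    FULLS_EVERY_N_INTERVALS = 96        # 96 * 15 min = 24 hours

    counter = 0
    while True:
        if counter % FULLS_EVERY_N_INTERVALS == 0:
            subprocess.run(FULL_BACKUP_CMD, check=True)         # daily full backup
        else:
            subprocess.run(INCREMENTAL_BACKUP_CMD, check=True)  # incremental backup
        counter += 1
        time.sleep(INTERVAL_SECONDS)
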
Failure Tolerance 

This setup can tolerate the following: 

  • Any node failure within a DC.
  • Any 1 DC failure.
Recovery for Individual Node Failure

This setup can tolerate 1 node failure in the primary DC. 

In the case of a 2 node failure in the primary DC, the system will become inoperable.

The recovery strategy mentioned in this scenario is the same as in the 3 DC architecture.

There are 2 options to recover the Harness setup:

  1. Contact Harness Support. Harness Support will restart the microservices on each node using the Ambassador and restore the setup. If the primary node has recovered, you will have to restart the Ambassador before Harness Support can help restore the setup.
  2. Run the startup scripts that are present at specified locations on each node, in a specific order. The microservices have to be brought up on the recovered nodes in the exact order specified below (the same order illustrated in the startup sketch for Option 1):
    1. MongoDB
    2. Proxy
    3. The rest of the Harness microservices (Manager, UI, Verification Service, Learning Engine).
Recovery for DC Failure

In the case of a primary DC failure, the following steps must be performed to bring the system back up:

  1. Copy the backup files to the specified directory locations on the primary node in DC2.
  2. After copying the backup files, there are 2 mechanisms to bring back Harness microservices:
    1. Contact Harness Support to bring back the microservices on DC2.
    2. Use the following steps to bring back the Harness microservices. There are scripts available on every node for each of these steps, but the scripts must be run in the exact order specified below (see the replica set sketch after these steps):
      1. Start MongoDB on the primary node.
      2. Reconfigure the MongoDB replica set to point to the new nodes.
      3. Start MongoDB on the other 2 nodes.
      4. Start the Proxy microservice on all 3 nodes.
      5. Start the rest of the Harness microservices (Manager, UI, Verification Service, Learning Engine) on all 3 nodes.
  3. Reconfigure the GSLB to point to the 3 nodes in DC2.
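Of these steps, reconfiguring the MongoDB replica set is the one that usually needs the most care, because the restored data still references the failed DC1 members. The sketch below, written against pymongo, shows the general shape of a forced reconfiguration onto the DC2 nodes; all hostnames are placeholders, and in practice this step should be performed together with Harness Support.

    # Illustrative sketch of the replica set reconfiguration step only.
    # Hostnames/ports are placeholders; a forced reconfig like this is
    # normally performed together with Harness Support.
    from pymongo import MongoClient

    NEW_MEMBERS = [
        "dc2-node1.example.com:27017",  # hypothetical DC2 hosts
        "dc2-node2.example.com:27017",
        "dc2-node3.example.com:27017",
    ]

    # Connect directly to the mongod just started on the DC2 primary node.
    client = MongoClient(NEW_MEMBERS[0], directConnection=True)

    # Fetch the current replica set config and rewrite the member host list.
    config = client.admin.command("replSetGetConfig")["config"]
    config["version"] += 1
    config["members"] = [{"_id": i, "host": host} for i, host in enumerate(NEW_MEMBERS)]

    # force=True is needed because a majority of the old members are unreachable.
    client.admin.command("replSetReconfig", config, force=True)
    print("replica set", client.admin.command("replSetGetStatus")["set"], "reconfigured")
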
Data Loss

The maximum data loss is determined by the backup strategy. For example, if backups have been configured to be taken every 15 mins, the maximum data loss will be 15 mins.

Conclusion

Both options mentioned above are available to achieve Disaster Recovery for the Harness Docker Connected On-Prem version.

An Active/Active Disaster Recovery strategy is always preferred over an Active/Passive strategy. Harness recommends Option 1: 3 Data Centers (Active/Active Strategy) for achieving High Availability across DCs; however, it carries a stringent requirement for low latency among the data centers.

