Connected On-Prem Disaster Recovery Strategy (3 Node Architecture)

Updated 2 weeks ago by Michael Cretzman

This document describes the Disaster Recovery strategy for Harness Docker Connected On-Prem using a 3 node architecture.


Harness Docker Connected On-Prem Architecture

Here is the high-level architecture of the Harness Docker Connected On-Prem (3 Node Architecture).

  • All Harness microservices are running on every node. 
  • Some of the Harness microservices are stateful and share state across nodes, while others are stateless.
    • With 3 nodes, every microservice runs in a high-availability (HA) configuration.
    • The system tolerates the failure of any given microservice on up to 2 of the 3 nodes.
    • The system continues to function without service interruption if 1 node fails.
  • In addition to HA, this On-Prem setup can be configured for Disaster Recovery in case of a data center (DC) failure.

Recovery Option 1: 3 Data Centers (Active/Active Strategy)

This section describes Disaster Recovery options for the Harness Docker Connected On-Prem (3 Node Architecture).

This option uses the following architecture:

The requirements below are in addition to the basic infrastructure and network requirements specified in Docker Connected On-Prem Setup.

Infrastructure Requirements 

  • 3 separate DCs with network connectivity between them.
  • 1 VM in every DC.
  • Harness Ambassador should be running on the node in the primary DC.
  • Global Server Load Balancing (GSLB)/Load Balancer that can connect to each node in the 3 DCs.

Network Requirements 

  • Network latency among the 3 DCs must be low.
  • Harness Ambassador should have SSH-based connectivity to the other 2 nodes in the other 2 DCs.
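
The SSH requirement above can be checked before a failure occurs. The following is a minimal sketch, meant to be run on the Ambassador node. The function name, the SSH_BIN override, and the host names in the example are illustrative assumptions, not part of the Harness tooling; only the standard OpenSSH options BatchMode and ConnectTimeout are used.

```shell
# Sketch: verify SSH reachability from the Ambassador node to the other nodes.
# SSH_BIN is a hypothetical override so the check can be stubbed in a dry run.
SSH_BIN="${SSH_BIN:-ssh}"

# Prints one line per node; returns non-zero if any node is unreachable.
check_ssh_connectivity() {
  local failed=0 node
  for node in "$@"; do
    # BatchMode=yes: fail instead of prompting for a password.
    # ConnectTimeout=5: give up after 5 seconds.
    if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
        </dev/null >/dev/null 2>&1; then
      echo "OK   $node"
    else
      echo "FAIL $node"
      failed=1
    fi
  done
  return "$failed"
}

# Example (hypothetical host names for the nodes in the other 2 DCs):
# check_ssh_connectivity harness-node-dc2 harness-node-dc3
```

This only confirms that a non-interactive SSH session can be opened; it does not measure latency between the DCs.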

Failure Tolerance 

This setup can tolerate: 

  • Any microservice failure on up to 2 nodes.
  • Any 1 node failure (complete AZ/DC failure). 

Recovery Options

When the DCs recover, there are 2 options to recover the Harness setup.

  • Contact Harness Support: Harness Support will restart the microservices on each box using the Harness Ambassador and restore the setup. If the primary DC has recovered, you must first restart the Harness Ambassador so that Harness Support can help bring up the setup.
  • Start/Stop Scripts: Use the local start/stop scripts on the Harness Ambassador to bring up Harness on the 3 boxes. See Docker Connected Start/Stop Scripts.

Data Loss

This setup prevents data loss when up to 2 nodes fail. The system resumes from where it left off before the failure.

If all 3 nodes fail, data loss depends on the cadence of the data backup that was set up.
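
The failure and data loss statements for the 3 DC setup can be summarized as a small lookup. This is an illustrative sketch only: the function name and the wording of the states are assumptions, and the "inoperative" state for 2 failed nodes is inferred from the 1 node failure tolerance stated above rather than spelled out for this option.

```shell
# Sketch: expected state of the 3 node setup for a given number of failed nodes.
# describe_state is a hypothetical helper, not a Harness command.
describe_state() {
  case "$1" in
    0) echo "operational; full HA" ;;
    1) echo "operational; no data loss" ;;
    # Inferred: service is only guaranteed for 1 node failure,
    # but data is preserved when up to 2 nodes fail.
    2) echo "inoperative; no data loss" ;;
    3) echo "inoperative; data loss depends on backup cadence" ;;
    *) echo "expected a number of failed nodes between 0 and 3" >&2
       return 2 ;;
  esac
}

# Example:
# describe_state 1
```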

Recovery Option 2: 2 Data Centers (Active/Passive Strategy)

If the infrastructure or network requirements of the 3 DC-based architecture cannot be met, Harness can support an active/passive Disaster Recovery setup.

The requirements below are in addition to the basic infrastructure and network requirements specified in Docker Connected On-Prem Setup.

Infrastructure Requirements

  • 2 separate data centers (DCs) with 3 nodes in each DC.
  • Harness Ambassador should be running on 1 node in each DC.
  • Backup storage in a third location, configured as described in Connected On-Prem Backup and Restore Strategy. Both full and incremental backups are recommended.
  • GSLB/Load Balancer that can connect to each node in the 2 DCs.

Failure Tolerance 

This setup can tolerate the following: 

  • Any node failure within a DC.
  • Any 1 DC failure.

Recovery Scenarios

This section describes common recovery scenarios.

Individual Node Failure

This setup can tolerate 1 node failure in the primary DC.

If 2 nodes fail in the primary DC, the system becomes inoperative.

The recovery strategy for this scenario is the same as in the 3 DC architecture.

There are 2 options to recover the Harness setup:

  • When restarting microservices on any particular node, use the following steps:
    • Ensure the Docker daemon is running on the box.
    • Run the following scripts, located in the $HOME/harness-scripts folder on the node:
      $ bash run_mongo.sh
      $ bash start_harness.sh
  • When restarting microservices on all boxes, use the following steps:
    • Ensure the Docker daemon is running on every box.
    • Run the following script in the harness-ambassador-scripts folder, located at $HOME/harness-ambassador-scripts/<CUSTOMER-NAME> on the Ambassador node:
      $ bash start_harness.sh

See Docker Connected Start/Stop Scripts.
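
The single-node restart steps above can be wrapped in one function, sketched below. The function name and the DOCKER_BIN and HARNESS_SCRIPTS_DIR overrides are assumptions added for illustration; the script names and the default folder come from the steps above. The sketch relies on `docker info` exiting with a non-zero status when the client cannot reach the daemon, which is worth confirming on the Docker version in use.

```shell
# Sketch: restart the Harness microservices on the current node.
# DOCKER_BIN and HARNESS_SCRIPTS_DIR are hypothetical overrides.
DOCKER_BIN="${DOCKER_BIN:-docker}"
HARNESS_SCRIPTS_DIR="${HARNESS_SCRIPTS_DIR:-$HOME/harness-scripts}"

restart_harness_on_node() {
  # Step 1: make sure the Docker daemon is reachable.
  if ! "$DOCKER_BIN" info >/dev/null 2>&1; then
    echo "Docker daemon is not running; start it before restarting Harness." >&2
    return 1
  fi
  # Step 2: run the scripts from their folder, in the documented order.
  # The subshell keeps the caller's working directory unchanged.
  (
    cd "$HARNESS_SCRIPTS_DIR" || exit 1
    bash run_mongo.sh && bash start_harness.sh
  )
}

# Example:
# restart_harness_on_node
```

start_harness.sh runs only if run_mongo.sh succeeds, so a failed database start is not masked by a later step.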

Planned Migration to Another DC

  1. For a planned crossover to another data center, a full backup is the preferred way to migrate the setup.
  2. Create the backup using the system's backup instructions.
  3. Next, gracefully shut down the system by running the following script in the harness-ambassador-scripts folder, located at $HOME/harness-ambassador-scripts/<CUSTOMER-NAME> on the Ambassador node:
    bash stop_harness.sh
    See Docker Connected Start/Stop Scripts.
  4. Restore on the target system using the restore steps described in Connected On-Prem Backup and Restore Strategy.
  5. Restart the system using the start/stop scripts.
  6. Reconfigure the GSLB to point to the 3 nodes in DC2.
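
The shutdown and restart steps in this procedure both run a script from the customer-specific folder on the Ambassador node. A minimal sketch of a wrapper is shown below. The function name and the AMBASSADOR_SCRIPTS_ROOT override are assumptions; the folder layout and the script names start_harness.sh and stop_harness.sh are taken from this document. The backup, restore, and GSLB steps are not covered by the sketch.

```shell
# Sketch: run the Ambassador start or stop script for one customer folder.
# AMBASSADOR_SCRIPTS_ROOT is a hypothetical override of the documented location.
AMBASSADOR_SCRIPTS_ROOT="${AMBASSADOR_SCRIPTS_ROOT:-$HOME/harness-ambassador-scripts}"

run_ambassador_script() {
  local action="$1" customer="$2" dir
  case "$action" in
    start|stop) ;;
    *) echo "usage: run_ambassador_script start|stop CUSTOMER_NAME" >&2
       return 2 ;;
  esac
  dir="$AMBASSADOR_SCRIPTS_ROOT/$customer"
  if [ ! -f "$dir/${action}_harness.sh" ]; then
    echo "missing script: $dir/${action}_harness.sh" >&2
    return 1
  fi
  # Run from the script's own folder without changing the caller's directory.
  ( cd "$dir" && bash "${action}_harness.sh" )
}

# Example (replace CUSTOMER_NAME with the actual folder name):
# run_ambassador_script stop CUSTOMER_NAME
```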

Recovery for DC Failure

For primary DC failure, you need to perform the following steps to bring back the system:

  1. Copy the backup files onto the specified directory locations on the primary node in DC2.
  2. Follow the restore steps mentioned in Connected On-Prem Backup and Restore Strategy.
  3. Restart the system using the start/stop scripts described in Docker Connected Start/Stop Scripts.
  4. Reconfigure the GSLB to point to the 3 nodes in DC2.

Data Loss

The maximum data loss depends on the backup strategy. If backups are configured to run every 3 hours, the maximum data loss is 3 hours of data.
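
The relationship between backup cadence and data loss can be worked through with simple arithmetic. The sketch below assumes that everything written after the last completed backup is lost and that the backup itself takes negligible time; both function names are illustrative.

```shell
# Worst-case data loss for a fixed backup cadence, in seconds.
# $1: backup interval in hours.
max_data_loss_seconds() {
  echo $(( $1 * 3600 ))
}

# Data written between the last completed backup and the failure, in seconds.
# $1: last backup time, $2: failure time (both Unix epoch timestamps).
data_loss_seconds() {
  if [ "$2" -lt "$1" ]; then
    echo "failure time precedes last backup time" >&2
    return 1
  fi
  echo $(( $2 - $1 ))
}

# Example: backups every 3 hours, failure 2.5 hours after the last backup.
max_data_loss_seconds 3     # prints 10800 (3 hours)
data_loss_seconds 0 9000    # prints 9000 (2.5 hours)
```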

Conclusion

All of the options mentioned above achieve Disaster Recovery for the Harness Docker Connected On-Prem.

Active/Active Disaster Recovery strategy is always preferred over an Active/Passive strategy. 

For achieving HA across data centers, Harness recommends Recovery Option 1: 3 Data Centers (Active/Active Strategy).

Note that this method strictly requires low network latency among the data centers.

