CV Strategies, Tuning, and Best Practices

Updated 2 months ago by Michael Cretzman

This topic helps you pick the best analysis strategy when setting up Harness Continuous Verification (CV) for deployments, and helps you tune the results using your expertise.

First, learn about the types of analysis strategies, and then learn about best practices and tuning.

Where are Analysis Strategies Set Up? 

When you set up a verification step in a Harness Workflow, each supported APM lists the available analysis strategies in its Baseline for Risk Analysis setting, and lets you specify how long the verification should run in its Analysis Time duration setting.

These two settings are used to tune the verification Harness performs. They are discussed in detail in this topic. 

Types of Analysis Strategies

Harness uses three types of analysis strategies, each with a different load (datasets) and granularity combination:

Analysis Strategy   Load                Granularity

Previous            Synthetic           Container level

Canary              Real user traffic   Container level

Predictive          Real user traffic   Service level

Each strategy is defined below. For each strategy, remember that verification steps are only used after you have deployed successfully at least once. In order to verify deployments and find anomalies, Harness needs data from previous deployments.

Previous Analysis

In Previous Analysis, Harness compares the metrics received for the nodes deployed in each Workflow Phase with metrics received for all the nodes during the previous deployment. 

For example, if Phase 1 deploys app version 1.2 to node A, the metrics received from the APM during this deployment are compared to the metrics for nodes A, B, and C (all the nodes) during the previous deployment (version 1.1). Previous Analysis is best used when you have predictable load, such as in a QA environment.

For Previous Analysis to be effective, the load on the application should be the same across deployments, such as a synthetic test load generated using Apache JMeter. If the load varies between deployments, Previous Analysis is not effective.

Canary Analysis

For Canary Analysis, Harness compares the metrics received for the nodes deployed in each Workflow Phase with metrics received for the rest of the nodes hosting the application. For example, if Phase 1 deploys to 25% of your nodes, the metrics received for these nodes from the APM during this phase are compared with metrics received for the other 75% of nodes during the period of time defined in Analysis Time duration.

A Prometheus verification step, for example, can use Canary Analysis to compare metrics from the newly deployed nodes with metrics from the nodes still running the previous version.
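The comparison at the heart of Canary Analysis can be sketched roughly as follows. This is a minimal illustration, not Harness's actual algorithm; the node names, metric values, and tolerance are invented for the example:

```python
# Illustrative sketch: Canary Analysis compares the canary nodes (e.g. the
# 25% receiving the new version) against the control nodes (the other 75%
# still running the old version) over the same time window, so both groups
# see the same real user traffic.
from statistics import mean

def canary_analysis(canary_metrics, control_metrics, tolerance=0.2):
    """Flag risk if the canary group's average metric deviates from the
    control group's average by more than `tolerance` (20% by default)."""
    control_avg = mean(v for node in control_metrics.values() for v in node)
    canary_avg = mean(v for node in canary_metrics.values() for v in node)
    deviation = abs(canary_avg - control_avg) / control_avg
    return "risk" if deviation > tolerance else "ok"

# Phase 1 deploys to 1 of 4 nodes (25%); the other 3 are the control group.
control = {"B": [52, 48], "C": [50, 51], "D": [49, 50]}
print(canary_analysis({"A": [51, 50]}, control))   # comparable load -> "ok"
print(canary_analysis({"A": [95, 102]}, control))  # degraded       -> "risk"
```

Because both groups are measured during the same Analysis Time duration, variations in overall traffic affect canary and control equally, which is why Canary Analysis tolerates unpredictable load better than Previous Analysis.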

Predictive Analysis

Predictive Analysis is performed at the service level, which is less granular than the container level where Previous and Canary Analysis are performed.

For Predictive Analysis, Harness takes previous logs over the length of time specified in Baseline for Predictive Analysis, sets those logs as the baseline, and then compares that baseline with future logs collected over the length of time specified in Analysis Time duration.

Harness then analyzes these past and future logs to detect anomalies, or unknown and unexpected frequencies, that were potentially triggered by the deployment.
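This past-vs-future log comparison can be sketched roughly as follows. This is a minimal illustration, not Harness's actual algorithm; the event strings and frequency threshold are invented for the example:

```python
# Illustrative sketch: Predictive Analysis builds a baseline from past log
# events and flags future events that are unknown (never seen in the
# baseline) or that occur at an unexpectedly high frequency.
from collections import Counter

def predictive_log_analysis(baseline_logs, future_logs, freq_factor=3.0):
    """Return (event, reason) pairs for anomalous future events."""
    baseline = Counter(baseline_logs)
    anomalies = []
    for event, count in Counter(future_logs).items():
        if event not in baseline:
            anomalies.append((event, "unknown event"))
        elif count > freq_factor * baseline[event]:
            anomalies.append((event, "frequency spike"))
    return anomalies

baseline = ["login ok"] * 50 + ["timeout"] * 2
future = ["login ok"] * 48 + ["timeout"] * 10 + ["OOMKilled"]
print(predictive_log_analysis(baseline, future))
# -> [('timeout', 'frequency spike'), ('OOMKilled', 'unknown event')]
```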

Predictive Analysis analyzes services but uses hosts for data collection.

When you use Predictive Analysis in a Harness verification step, you select a host for the Expression for Host/Container name field, even though Predictive Analysis analyzes services.

Baseline for Predictive Analysis

The Baseline for Predictive Analysis option appears if you select Predictive Analysis. Specify the time unit Harness should use to pull logs to set as the baseline for predictive analysis, such as Last 30 minutes.

A few notes about selecting the time unit for Baseline for Predictive Analysis:

  • The greater the length of time you specify for a Predictive Analysis baseline (in Baseline for Predictive Analysis), the longer the analysis takes. If you select Last 24 hours, the predictive analysis can take 15 minutes or more.
  • The greater the length of time you specify for a Predictive Analysis baseline, the more API calls Harness makes to the verification provider, because Harness collects logs in 15-minute batches. For example, if you select Last 24 hours as the baseline for Predictive Analysis, Harness makes 96 API calls to collect that data.
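The arithmetic behind the 96-call example can be expressed directly. The function name is ours; the 15-minute batch size comes from the behavior described above:

```python
# Harness collects logs from the verification provider in 15-minute batches,
# so the number of API calls grows linearly with the baseline window.
def baseline_api_calls(baseline_minutes, batch_minutes=15):
    return baseline_minutes // batch_minutes

print(baseline_api_calls(30))       # Last 30 minutes -> 2 calls
print(baseline_api_calls(24 * 60))  # Last 24 hours   -> 96 calls
```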

What about 24/7 Service Guard?

For 24/7 Service Guard, all verification providers (metrics and logs) use the Predictive Analysis strategy. For more information, see 24/7 Service Guard Overview.

Verification Best Practices

When picking an analysis strategy, there are several factors to consider, such as the type of deployment, which Phase of the Workflow to add verification to, and whether the number of instances or nodes is consistent between deployments.

This section provides help on selecting the right analysis strategy for your deployment.

Previous Analysis 

Use the following best practices with Previous Analysis.

Do
  • Use Previous Analysis in deployments where 100% of instances are deployed at once (single-phase deployments):
    • Basic deployment
    • Canary deployment with only one phase
    • Blue/Green deployment
    • Rolling deployment
  • Use Previous Analysis if the number of instances deployed remains the same between deployments.
Don't
  • Don't use Previous Analysis if the load varies between deployments, because the comparison is only meaningful when the load is consistent.

Canary Analysis

Use the following best practices with Canary Analysis.

Do
  • Use Canary Analysis in multiphase Canary Workflows only.
Don't
  • Don't use Canary Analysis if there is only one phase in the Canary Workflow.
  • Don't use Canary Analysis in the last phase of a Canary Workflow, because the final phase deploys to 100% of nodes and so there are no remaining nodes to compare against.
  • Don't use Canary Analysis when deploying 100% of instances at once.

Predictive Analysis

Use the following best practices with Predictive Analysis.

Do
  • Use Predictive Analysis when you want to analyze deployments at the service level, which is less granular than the container level analyzed by Previous and Canary Analysis. While you should only use Canary Analysis in the earlier phases of a Canary Workflow, you can use Predictive Analysis in the final phase to compare previous and future logs. This is not as granular, or likely as useful, as the Canary Analysis in the earlier phases.
Don't
  • Don't use Predictive Analysis when you can use Previous or Canary Analysis, because Predictive Analysis is less granular. There are scenarios, however, where Previous and Canary Analysis are not appropriate, such as when you want to compare past and future logs over a specified period of time at the service level.

Analysis Time Duration

The recommended Analysis Time Duration is 10 minutes for logging providers and 15 minutes for APM and infrastructure providers.

Harness waits 2-3 minutes before analyzing the data, to allow enough time for the data to reach the verification provider. This wait time is standard practice with monitoring tools. When you set the Analysis Time Duration to 10 minutes, the initial 2-3 minute wait is added to it, so the total run time is about 13 minutes.

Wait Before Execution

The Verify Service section of the Workflow has a Wait before execution setting.

As stated earlier, Harness waits 2-3 minutes before performing analysis to avoid initial noise. Use the Wait before execution setting only when your deployed application takes more than 3-4 minutes to reach a steady state. This helps avoid the initial noise, such as CPU spikes, that occurs when an application starts.

Algorithm Sensitivity and Failure Criteria

When adding a verification step to your Workflow, you can use the Algorithm Sensitivity setting to define the risk level that will be used as failure criteria during the deployment.

When the criteria is met, the Failure Strategy for the Workflow is executed.

The risk level is determined based on the following general guidelines:

  • For log analysis: The risk level is based on text similarity and frequency. Previously unseen new messages represent higher risk, while similar or identical messages represent low risk. In addition, common messages that suddenly appear in a much higher frequency are flagged as high risk.
  • For time-series analysis (APM): The risk level is determined using Standard Deviation. 5𝞼 (sigma) represents high risk, 4𝞼 represents medium risk, and 3𝞼 or below represents low risk. Harness also takes into account the proportion of points that deviated: 50% or more is high risk, 25%-50% is medium risk, and 25% or below is low risk.
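These time-series guidelines can be sketched roughly as follows. This is a simplified illustration of the ranges above, not Harness's exact algorithm:

```python
# Illustrative sketch: risk classification from (a) how many standard
# deviations a point deviates and (b) the fraction of points that deviated.
def sigma_risk(sigmas):
    """Risk level from the size of the deviation, in standard deviations."""
    if sigmas >= 5:
        return "high"
    if sigmas >= 4:
        return "medium"
    return "low"

def deviated_fraction_risk(fraction):
    """Risk level from the proportion of data points that deviated."""
    if fraction >= 0.50:
        return "high"
    if fraction > 0.25:
        return "medium"
    return "low"

print(sigma_risk(5.2), deviated_fraction_risk(0.6))  # high high
print(sigma_risk(3.0), deviated_fraction_risk(0.1))  # low low
```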

Every successful deployment contributes to creating and shaping a healthy baseline that tells Harness what a successful deployment looks like, and what should be flagged as a risk. If a deployment failed due to verification, Harness will not consider any of the metrics produced by that deployment as part of the baseline.

Tuning Your Verification

When you first start using Harness Continuous Verification, we recommend you examine the results and use the following features to tune your verification using your knowledge of your application and deployment environment:

Customize Threshold

In your deployment verification results, you can customize the threshold of each metric/transaction for a Harness Service in a Workflow.

You can tune each specific metric for each Harness Service to eliminate noise. 

The example above refines the response time threshold: if the response time is less than the value entered in Ignore if [95th Percentile Response Time (ms)] is [less], Harness will not mark it as a failure, even if it is an anomaly.

Let's say the response time was around 10ms and it went to 20ms. Harness's machine-learning engine will flag it as an anomaly because it jumped 100%. If you add a threshold configured to ignore response times of less than 100ms, Harness will not flag it.
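The effect of such a threshold can be sketched as follows. This is a hypothetical illustration; the function and cutoff logic are ours, not Harness's implementation:

```python
# Illustrative sketch of a custom threshold: ignore an anomaly when the
# absolute value is below the configured cutoff, even if the relative jump
# (e.g. 10 ms -> 20 ms, a 100% increase) would otherwise be flagged.
def flag_response_time(observed_ms, baseline_ms, ignore_below_ms=100):
    if observed_ms < ignore_below_ms:
        return "ignored"   # below the custom threshold, never a failure
    if observed_ms > 2 * baseline_ms:
        return "anomaly"   # e.g. response time more than doubled
    return "ok"

print(flag_response_time(20, 10))    # 100% jump, but < 100 ms -> "ignored"
print(flag_response_time(250, 100))  # > 100 ms and doubled    -> "anomaly"
```

The point of the threshold is to encode domain knowledge the machine-learning engine lacks: a doubling from 10ms to 20ms is statistically dramatic but operationally meaningless.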

You can adjust the threshold for any metric analysis. The following example shows how you can adjust the min and max of host memory comparisons.

3rd Party API Call History

You can view each API call and response between Harness and a verification provider by selecting View 3rd Party API Calls in the deployment's verification details.

The Request section shows the API call made by Harness and the Response section shows what the verification provider returned:

{
  "sdkResponseMetadata": {"requestId": "bd678748-f905-46bc-91e1-f17843f87ac2"},
  "sdkHttpMetadata": {
    "httpHeaders": {
      "Content-Length": "988",
      "Content-Type": "text/xml",
      "Date": "Wed, 07 Aug 2019 20:12:06 GMT",
      "x-amzn-RequestId": "bd678748-f905-46bc-91e1-f17843f87ac2"
    },
    "httpStatusCode": 200
  },
  "label": "MemoryUtilization",
  "datapoints": [
    {"timestamp": 1565208600000, "average": 10.512906283101966, "unit": "Percent"},
    {"timestamp": 1565208540000, "average": 10.50788872163684, "unit": "Percent"},
    {"timestamp": 1565208420000, "average": 10.477005777531302, "unit": "Percent"},
    {"timestamp": 1565208480000, "average": 10.493672297485643, "unit": "Percent"}
  ]
}

The API response details allow you to drill down to see the specific datapoints and the criteria used for comparison. Failures can also be examined in the Response section:

Event Distribution

You can view the event distribution for each event by clicking the graph icon:

The Event Distribution will show you the measured and baseline data, allowing you to see why the comparison resulted in an anomaly. 

Analysis Support for Providers

The following table lists which analysis strategies are supported for each Verification Provider.

Provider                     Previous   Canary   Predictive

AppDynamics                  Yes        Yes      Yes

NewRelic                     Yes        Yes      No

DynaTrace                    Yes        Yes      No

Prometheus                   Yes        Yes      No

SplunkV2                     Yes        Yes      No

ELK                          Yes        Yes      Yes

LogZ                         Yes        Yes      No

Sumo                         Yes        Yes      Yes

Datadog Metrics              Yes        Yes      Yes

Datadog Logs                 Yes        Yes      Yes

CloudWatch                   Yes        Yes      Yes

Custom Metric Verification   Yes        Yes      No

Custom Log Verification      Yes        Yes      No

BugSnag                      Yes        Yes      Yes

Stackdriver Metrics          Yes        Yes      No

Stackdriver Logs             Yes        Yes      No

Deployment Type Support

The following table lists which analysis strategies are supported in each deployment type.

Deployment Type   Analysis Supported

Basic             Previous, Predictive

Canary            Canary, Predictive

Blue/Green        Previous, Predictive

Rolling           Previous, Predictive

Multi-service     None

Build             None

Custom            None

