Docker Connected On-Prem Monitoring

This topic outlines how to monitor the health of a Harness Docker Connected On-Prem setup.

Overview

A Harness Docker Connected On-Prem (three-box) installation runs various microservices on different hosts. You can monitor Harness microservices using the monitoring service provided as part of the Harness On-Prem installation. As outlined below, the monitoring service uses four components (which are themselves microservices):

cAdvisor: Installed on each host, cAdvisor scrapes that host for the metrics that Harness needs for monitoring. (A quick way to verify that cAdvisor is serving metrics appears after this list.)

Prometheus: Prometheus serves as the persistent store for the metrics scraped from cAdvisor, and as the platform that triggers alerts. Within Prometheus, you can set up alert rules over the different metrics received from cAdvisor. Prometheus continuously polls cAdvisor for metrics and evaluates them against the configured alert rules.

Alertmanager: Alertmanager propagates the alerts created by Prometheus to your notification platform (Slack, email, etc.). Prometheus creates alerts based on the metrics it receives from cAdvisor and passes them to Alertmanager. As input, Alertmanager takes a configuration file that specifies how it connects to the different notification platforms.

Grafana: Grafana is a dashboarding platform that builds on Prometheus data. It provides time-series dashboards for CPU and memory usage, network traffic, I/O, and other critical metrics.
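
If you want to confirm that cAdvisor is serving metrics before Prometheus starts scraping it, a minimal check is shown below. This is a sketch that assumes curl is available and that cAdvisor listens on its default Harness port (7152, as listed under Access URLs); replace <HOST_1> with one of your box hostnames.

  # Fetch cAdvisor's Prometheus-format metrics from one of the three hosts and
  # show a few of the container_last_seen samples used by the default alerts.
  curl -s http://<HOST_1>:7152/metrics | grep container_last_seen | head -n 5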

Infrastructure Requirements

Harness recommends deploying the monitoring service on a separate host from the three boxes that it monitors. Typically, this can be the same machine on which the Ambassador is running. The cAdvisor, Prometheus, Grafana, and Alertmanager components are each configured to use 0.25 CPU cores and 500 MB of RAM.
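
If the limits were applied with Docker's --cpus and --memory options, you can confirm them by querying Docker directly. This is a sketch that uses the harness-prometheus container name shown later in this topic; the names of the other monitoring containers may differ in your installation.

  # NanoCpus is in billionths of a core (250000000 = 0.25 cores); Memory is in bytes.
  docker inspect --format 'CPU: {{.HostConfig.NanoCpus}}  Memory: {{.HostConfig.Memory}}' harness-prometheus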

Prerequisites

If you are running RHEL 7 on your hosts: RHEL 7 boxes have a default configuration that blocks cAdvisor from scraping system metrics. (For details, see this cAdvisor issue.) Use the following procedure to enable cAdvisor to scrape metrics.

You must execute these steps as a root user, on all three machines. (An optional verification check follows the procedure.)
  1. On a command line, execute these commands to mount the cgroup directory:
    mount -o remount,rw '/sys/fs/cgroup'
    ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
  2. Run the command to open the crontab:
    crontab -e
  3. Append these two lines to the bottom of the crontab file, to make the remount commands persistent upon host restarts:
    @reboot mount -o remount,rw '/sys/fs/cgroup'
    @reboot ln -s /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpuacct,cpu
  4. Save the crontab file.
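
To confirm that the remount and symlink from the steps above are in place, you can run the following check (as root, on each host):

  # The cgroup filesystem should show the rw option, and the symlink created in
  # step 1 should exist.
  mount | grep '/sys/fs/cgroup '
  ls -l /sys/fs/cgroup/cpuacct,cpu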

Access URLs

The four monitoring components can be accessed at the following URLs:

  • cAdvisor: http://<host>:7152/ (runs on all hosts)
  • Prometheus: http://monitoringhost:7149/alerts
  • Alertmanager: http://monitoringhost:7150/
  • Grafana: http://monitoringhost:7152/

Dashboards

Harness Monitoring installs a preconfigured Grafana setup, which can be used to monitor containers' health, along with other metrics like CPU and memory usage, network traffic, and I/O. The Grafana dashboard provides a combined graphical view of container metrics per box.

To add a basic set of graphs to Grafana, download the JSON template from this Docker monitoring dashboard, and import it into Grafana. This creates a dashboard with a basic set of time-series graphs for your containers.

Grafana provides advanced features like templatizing dashboards, changing dashboards' granularity, built-in authentication, and adding custom dashboards to match your requirements. For details about adding dashboards, see Grafana's documentation.

Default Alert Setup

As shipped, Harness' monitoring service provides the four above components, along with the following configuration:

  1. Alerts are configured on all the Harness microservices (Manager, UI, Verification Service, Learning Engine, etc.) running on the three hosts. If Prometheus doesn’t receive a heartbeat from a microservice for 5 minutes, it triggers an alert, which is sent to the configured notification method.
  2. All alerts are grouped by the host on which the affected microservices run. Once a grouped alert is sent for one or more microservices, the alert is repeated every 10 minutes. You can change this value in the Alertmanager configuration (a sketch of the relevant route settings follows this list).
  3. If any container goes down, the first alert is sent after 5 minutes.
  4. If the Alertmanager container goes down for any reason, any changes you have made in Alertmanager during the preceding 10 minutes will be lost. (Alertmanager writes data to disk on a 10-minute lag.)
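
These defaults map to Alertmanager's route settings. The sketch below shows roughly how the host grouping and 10-minute repeat might look in alertmanager.yaml; the exact keys, labels, and receiver name in your shipped file may differ, so treat this only as an illustration.

  route:
    # Group alerts coming from the same host (cAdvisor instance) into one notification.
    group_by: ['instance']
    # Resend a still-firing grouped alert every 10 minutes.
    repeat_interval: 10m
    # Hypothetical receiver name; the shipped configuration defines its own.
    receiver: 'harness-email'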

Responding to Alerts

Here are recommendations for responding to some common alert conditions.

All Microservices on a Box Are Down

This can happen if the box restarted, or if the Docker daemon on the box restarted. Use the Harness local start script to bring the system back to its healthy state.

A Few Microservices on a Box Are Down

Look at the state and logs of the container that has gone down; these should give some indication of why the service is down. For example, if mongoContainer is down, you can start with the command: docker inspect mongoContainer
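
A few related Docker commands can help narrow down the failure. This is a sketch using the mongoContainer example above; substitute the name of whichever container is down.

  # List all containers, including stopped ones, to see which service exited.
  docker ps -a
  # Show the exit code and whether the container was killed for exceeding its memory limit.
  docker inspect --format 'ExitCode: {{.State.ExitCode}}  OOMKilled: {{.State.OOMKilled}}' mongoContainer
  # Show the most recent log output from the container.
  docker logs --tail 100 mongoContainer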

In the case of harnessManager, the logs will be present in:

<HOME_DIR>/Connected on-prem docker installer/Manager/<CUSTOMER_NAME>/runtime/logs/portal.log

If the service is down because of an infrastructure issue (such as out of disk space), correct the infrastructure issue, and then use Harness' local start script to restore Harness services.

If the service is down because of some other issue, contact Harness Support at support@harness.io. Specify the issue you're seeing, and include the container logs. Harness Support will analyze the problem, fix it, and roll out an update using the Ambassador.

Customizing Alerts

You can customize the following aspects of Harness' monitoring service.

Set Up Custom Alert Rules

The Prometheus microservice controls alerts. To see existing alert rules, open the Prometheus Alerts page (the Prometheus URL listed above under Access URLs). One alert rule looks something like this:

  - alert: Manager_HOST1
    expr: absent(container_last_seen{name="harnessManager", instance="HOST_1:7152"})
    for: 1m
    labels:
      severity: error
    annotations:
      summary: "Harness container is down on HOST_1"
      description: "harnessManager is down on: HOST_1"

All alert rules reside in the ~/$INSTALLER_DIR/Prometheus/<CUSTOMER_NAME>/runtime/alert.rules file. To add new rules, append them to the end of this file, then restart Prometheus. For alert configuration details, see Prometheus' Alerting Configuration documentation.
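
For example, a custom rule appended to alert.rules might look like the sketch below. The rule name and 3 GB threshold are hypothetical and only illustrate the format; container_memory_usage_bytes is one of the standard metrics cAdvisor exposes.

  - alert: Manager_Memory_HOST1
    expr: container_memory_usage_bytes{name="harnessManager", instance="HOST_1:7152"} > 3e+09
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "harnessManager memory usage is high on HOST_1"
      description: "harnessManager has used more than 3 GB of memory for 5 minutes on HOST_1"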

Set Up Custom Alert Notification Platforms

The Alertmanager microservice handles alert delivery. Alertmanager supports multiple integrations, including email and Slack. Alertmanager’s default configuration provides email notifications, and resides in the ~/$INSTALLER_DIR/Prometheus/<CUSTOMER_NAME>/runtime/alertmanager.yaml file. For details on adding custom notification providers to this configuration file, refer back to Prometheus' Alerting Configuration documentation.
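
If you want Slack notifications in addition to (or instead of) email, the receiver definition in alertmanager.yaml would look roughly like the following sketch. The receiver name, webhook URL, and channel are placeholders; see the Alertmanager documentation for the full set of options.

  receivers:
    - name: 'harness-slack'            # hypothetical receiver name
      slack_configs:
        - api_url: 'https://hooks.slack.com/services/<YOUR_WEBHOOK_PATH>'
          channel: '#harness-alerts'   # placeholder channel
          send_resolved: true

  # The top-level route (or a sub-route) must reference the new receiver:
  route:
    receiver: 'harness-slack'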

Set Up Custom Alert Notification Frequency

You can configure the alerting frequency and grouping in this file:

~/Connected on-prem docker installer/Alert Manager/<CUSTOMER_NAME>/runtime/alertmanager.yaml

In the file’s route section, adjust the following settings as needed:

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h

Restart Services to Apply Configuration Updates

For any configuration update to take effect, you must restart the corresponding container.

If you have updated alert.rules, run the following command to restart Prometheus:

docker restart harness-prometheus

If you have updated alertmanager.yaml, run the following command to restart Alertmanager:

docker restart harness-alertmanager
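
After a restart, you can optionally confirm that the container came back up and loaded its configuration cleanly (shown here for Prometheus; use harness-alertmanager for Alertmanager):

  # Confirm the container is running again.
  docker ps --filter name=harness-prometheus
  # Check the most recent log lines for configuration-loading errors.
  docker logs --tail 20 harness-prometheus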

Customizing Prometheus and Grafana

If you have a preconfigured Prometheus and Grafana system, you can reconfigure these components to point to the three hosts running Harness microservices. Prometheus can scrape metrics from cAdvisor on each of the three hosts, on port 7152.

Configure Custom Prometheus

To customize Prometheus according to your needs, update the prometheus.yml file. For details, see Prometheus' Configuration documentation.

Below is Harness' recommended prometheus.yml configuration. Its scrape_configs section shows how to configure Prometheus to get information from cAdvisors on the three boxes.

  # my global config
  global:
    scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
    evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
    # scrape_timeout is set to the global default (10s).

    external_labels:
      monitor: 'harness-monitor'

  # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
  rule_files:
    - /etc/prometheus/alert.rules
    # - "second.rules"

  # A scrape configuration containing exactly one endpoint to scrape:
  # Here it's Prometheus itself.
  scrape_configs:
    # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
    - job_name: 'prometheus'

      # metrics_path defaults to '/metrics'
      # scheme defaults to 'http'.

      static_configs:
        - targets: ['<MONITORING_HOST>:7149']

    - job_name: 'cadvisor-metrics'
      # metrics_path defaults to '/metrics'
      # scheme defaults to 'http'.

      static_configs:
        - targets: ['<HOST_1>:7152', '<HOST_2>:7152', '<HOST_3>:7152']

  alerting:
    alertmanagers:
      - static_configs:
          - targets: ['<MONITORING_HOST>:7150']
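
Before restarting with an edited configuration, you can optionally validate it with promtool, which ships inside the Prometheus image. This is a sketch that assumes the configuration is mounted at /etc/prometheus/prometheus.yml in the harness-prometheus container; adjust the path (and the promtool syntax, which varies slightly across Prometheus versions) to match your installation.

  # Validate prometheus.yml and the rule files it references before a restart.
  docker exec harness-prometheus promtool check config /etc/prometheus/prometheus.yml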

Configure Custom Grafana

For details on configuring Grafana, see Prometheus' Grafana Support for Prometheus documentation.
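
As a starting point, your Grafana instance needs Prometheus added as a data source. You can do this through the Grafana UI, or, if your Grafana version supports provisioning (5.0 and later), with a data source file roughly like the sketch below. The file name, data source name, and URL are placeholders.

  # Hypothetical file placed in Grafana's provisioning/datasources directory.
  apiVersion: 1
  datasources:
    - name: Harness Prometheus
      type: prometheus
      access: proxy
      url: http://<MONITORING_HOST>:7149
      isDefault: true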

