Prometheus + Grafana + Slack + SNS + Ansible
- Create 3 Ubuntu VMs using Vagrant
- Install and configure Prometheus + Grafana + Alertmanager on a dedicated machine using Ansible
- Install Node Exporter on all machines using Ansible
- Set up SNS email notifications using Terraform
- Set up Slack notifications on a channel dedicated to alerts
- Run a CPU stress test to watch the dashboards react and receive notifications
 

The project
This GitHub project is composed of :
- vagrant : some Vagrant assets used to create 3 VMs
- terraform : Terraform templates used to create an SNS topic + an IAM user + …
- ansible : an Ansible playbook using 4 roles to install + configure all the monitoring tools
 
Setup variables
Let’s start by initializing the project
The env-create script creates a .env file at the root of the project :
# create .env file
make env-create
Now update the .env file to use your own variables :
SNS_EMAIL=[change-here]@gmail.com
SLACK_API_URL=https://hooks.slack.com/services/[change-here]
SLACK_CHANNEL=#[change-here]
Setup Slack
Create a Slack channel. I chose the name alerts :

Create a new Slack app :

Setup the name and choose a workspace :

Select the Incoming Webhooks functionality :

Activate Incoming Webhooks :

Enable the application :

Create a new Webhook :

Important : copy / paste the Hello, World! curl example into a terminal to test / activate the channel messaging :
curl -X POST -H 'Content-type: application/json' --data '{"text":"Hello, World!"}' https://hooks...

Copy the new Webhook URL :

Paste it in the .env file :
SLACK_API_URL=https://hooks.slack.com/services/[change-here]
Setup infrastructure
Initialize the terraform project :
# terraform init (upgrade) + validate
make terraform-init
Validate and apply the terraform project :
# terraform create sns topic + ssh key + iam user ...
make infra-create
The SNS Topic is created :

A confirmation email is received :

The email used is the one defined in the .env file :
SNS_EMAIL=[change-here]@gmail.com
Click the confirm subscription link :

This is confirmed :

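If the AWS CLI is configured locally, you can also double-check the subscription from a terminal (an optional check, not part of the project's Makefile) :
# list SNS subscriptions : SubscriptionArn stays "PendingConfirmation"
# until the email is confirmed, then becomes a full ARN
aws sns list-subscriptions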
Create 3 machines using Vagrant
A Vagrantfile is used to create 3 Ubuntu machines
Important : On Linux and macOS, VirtualBox will only allow IP addresses in the 192.168.56.0/21 range to be assigned
You MUST create a /etc/vbox/networks.conf file and add this line to be able to run this project :
* 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
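For example (a minimal sketch, assuming sudo rights on the host) :
# create the VirtualBox networks configuration file
sudo mkdir -p /etc/vbox
echo '* 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16' | sudo tee /etc/vbox/networks.conf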
Start the 3 machines :
# create monitoring + node1 + node2
make vagrant-up
The script also sets up ~/.ssh/known_hosts and ~/.ssh/config :
for ip in $MONITORING_IP $NODE1_IP $NODE2_IP
do
  # prevent SSH warning : 
  # @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!   @ 
  # @ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! @ 
  ssh-keygen -f "$HOME/.ssh/known_hosts" -R "$ip" 1>/dev/null 2>/dev/null
  # prevent SSH answer :
  # Are you sure you want to continue connecting (yes/no/[fingerprint])?
  # note : ssh-keyscan options (like `-p` for a custom port) must come before the ip
  ssh-keyscan $ip 2>/dev/null >> ~/.ssh/known_hosts
done
if [[ -z $(grep "Host monitoring $MONITORING_IP # $PROJECT_NAME" ~/.ssh/config) ]];
then
    echo "
Host monitoring $MONITORING_IP # $PROJECT_NAME
    HostName $MONITORING_IP
    User vagrant
    IdentityFile ~/.ssh/$PROJECT_NAME
    
    # ...
" >> ~/.ssh/config
fi
After that, we can connect directly to our machines with :
ssh monitoring
# or
ssh node1
# or
ssh node2
Install monitoring tools using Ansible
The following command installs + configures everything needed :
# install + configure prometheus + grafana + alert manager ...
make ansible-play
Our playbook contains 4 roles :
- name: install node_exporter
  hosts: all
  become: true
  roles:
    - node-exporter
- name: install prometheus + grafana
  hosts: monitoring
  become: true
  roles:
    - prometheus
    - alertmanager
    - grafana
- The node-exporter role is a wrapper for the cloudalchemy/ansible-node-exporter role
- The prometheus role is a wrapper for the cloudalchemy/ansible-prometheus role
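Once the play has run, you can verify that node_exporter answers on a node (an optional check, assuming the exporter's default port 9100) :
# query the node_exporter metrics endpoint on node1
ssh node1 'curl -s http://localhost:9100/metrics | head -n 5'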
The vars contain important setup directives :
prometheus_global:
  # demo intervals
  scrape_interval: 6s
  scrape_timeout: 3s
  evaluation_interval: 6s
prometheus_alertmanager_config:
  - scheme: http
    static_configs:
      - targets:
        - "{{ ansible_host }}:9093"
prometheus_alert_rules:
  - record: 'node_cpu_percentage'
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
  - alert: memory-warning
    expr: node_cpu_percentage > 50
    labels:
      severity: warning
    annotations:
      description: "{% raw %}CPU load is > 50%\n  VALUE = {{ $value }}{% endraw %}"
      url: "http://{{ ansible_host }}:9090/alerts?search=memory-warning"
  - alert: memory-critical
    expr: node_cpu_percentage > 75
    labels:
      severity: critical
    annotations:
      description: "{% raw %}Memory critical {{ $value }} !{% endraw %}"
      url: "http://{{ ansible_host }}:9090/alerts?search=memory-critical"
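The loaded rules can also be inspected through the Prometheus HTTP API (an optional check) :
# list the recording and alerting rules currently loaded by Prometheus
curl -s http://10.20.20.20:9090/api/v1/rules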
A must-see : the Awesome Prometheus alerts website
Targets are available at http://10.20.20.20:9090/targets :

Alerts are available at http://10.20.20.20:9090/alerts :

- The alertmanager role is a wrapper for the cloudalchemy/ansible-alertmanager role
 
Slack + email notifications are defined here :
- ansible.builtin.include_role:
    name: ansible-alertmanager
  vars: 
    alertmanager_version: "{{ alertmanager_latest_version }}"
    alertmanager_template_files:
      - "{{ current_role_path }}/templates/*.tmpl"
    alertmanager_slack_api_url: "{{ alertmanager_slack_url }}"
    alertmanager_receivers:
      - name: default-dummy
      - name: slack
        slack_configs:
          - send_resolved: true
            # The channel or user to send notifications to.
            channel:  "{{ alertmanager_slack_channel }}"
            title: '{% raw %}{{ template "slack-message-title" . }}{% endraw %}'
            text: '{% raw %}{{ template "slack-message-description" . }}{% endraw %}'
      - name: sns
        sns_configs:
        - send_resolved: true
          # SNS topic ARN, e.g. arn:aws:sns:us-east-2:698519295917:My-Topic
          topic_arn: "{{ sns_topic_arn }}"
          sigv4:
            access_key: "{{ sns_access_key }}"
            secret_key: "{{ sns_secret_key }}"
            region: "{{ sns_region }}"
    alertmanager_route:
      # default receiver (nothing behind, we'll use routes below)
      receiver: default-dummy
      # To aggregate by all possible labels use the special value '...' as the sole label name, for example:
      # group_by: ['...']
      group_by: ['...']
      group_interval: 6s # testing delay
      group_wait: 6s # testing delay
      repeat_interval: 3h
      # Zero or more child routes.
      routes:
      - receiver: sns
        continue: true
      - receiver: slack
        continue: true
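To confirm that Alertmanager picked up this configuration, you can query its HTTP API (an optional check, assuming the default port 9093 used above) :
# show the Alertmanager status, including the loaded configuration
curl -s http://10.20.20.20:9093/api/v2/status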
- The grafana role is a wrapper for the cloudalchemy/ansible-grafana role
 
The datasource and dashboards are defined here :
grafana_datasources:
- name: prometheus
  type: prometheus
  access: proxy
  url: 'http://127.0.0.1:9090'
  basicAuth: false
# https://grafana.com/grafana/dashboards/1860
# https://grafana.com/grafana/dashboards/11074
grafana_dashboards:
  # Node Exporter Full
  - dashboard_id: 1860
    revision_id: 30
    datasource: "{{ grafana_datasources.0.name }}"
  # Node Exporter for Prometheus Dashboard
  - dashboard_id: 11074
    revision_id: 9
    datasource: "{{ grafana_datasources.0.name }}"
Grafana is available at http://10.20.20.20:3000 :

Use admin / password to log in
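As a quick sanity check, the same credentials work against the Grafana HTTP API (optional, not part of the project) :
# check Grafana health, then list the provisioned dashboards
curl -s http://10.20.20.20:3000/api/health
curl -s -u admin:password http://10.20.20.20:3000/api/search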
Two dashboards are available :

Let’s see Node Exporter Full on node1 with Last 5 minutes and 10s refresh :

CPU Stress test
SSH login to node1 :
ssh node1
Start the CPU stress test :
vagrant@node1:~$ stress --cpu 2
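From the host machine, you can watch the recording rule climb through the Prometheus HTTP API while the stress test runs (an optional check) :
# query the node_cpu_percentage recording rule defined earlier
curl -s 'http://10.20.20.20:9090/api/v1/query?query=node_cpu_percentage'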
After a few seconds :

The table view of the 3 machines at http://10.20.20.20:9090/graph?g0.expr=node_cpu_percentage :

The graph view of the 3 machines at http://10.20.20.20:9090/graph?g0.expr=node_cpu_percentage&… :

The alerts are firing, as shown at http://10.20.20.20:9090/alerts :

Slack notifications are received :

Email notifications are received :

The other dashboard also shows critical values :

Cleaning
This demonstration is now over, let's destroy the resources :
# destroy the 3 machines
make vagrant-destroy
# terraform destroy sns topic + ssh key + iam user ...
make infra-destroy
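Optionally, confirm that nothing is left behind :
# no running VM should be listed anymore
vagrant global-status --prune
# the SNS topic should be gone
aws sns list-topics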