Prometheus + Grafana + Slack + SNS + Ansible
- Create 3 Ubuntu VMs using Vagrant
- Install and configure Prometheus + Grafana + Alertmanager on a dedicated machine using Ansible
- Install Node Exporter on all machines using Ansible
- Set up SNS email notifications using Terraform
- Set up Slack notifications on a channel dedicated to alerts
- Run a CPU stress test to watch the dashboards react and receive notifications
 

The project
This GitHub project is composed of :
- vagrant : some Vagrant assets used to create 3 VMs
- terraform : Terraform templates used to create an SNS topic + an IAM user + …
- ansible : an Ansible playbook using 4 roles to install + configure all the monitoring tools
 
Setup variables
Let’s start by initializing the project
The env-create script creates a .env file at the root of the project :
# create .env file
make env-create
Now update the .env file to use your own variables :
SNS_EMAIL=[change-here]@gmail.com
SLACK_API_URL=https://hooks.slack.com/services/[change-here]
SLACK_CHANNEL=#[change-here]
Setup Slack
Create a Slack channel. I chose the name alerts :

Create a new Slack app :

Setup the name and choose a workspace :

Select the Incoming Webhooks functionality :

Activate Incoming Webhooks :

Enable the application :

Create a new Webhook :

Important : copy / paste the Hello, World! curl example into a terminal to test / activate the channel messaging :
curl -X POST -H 'Content-type: application/json' --data '{"text":"Hello, World!"}' https://hooks...

Copy the new Webhook URL :

Paste it in the .env file :
SLACK_API_URL=https://hooks.slack.com/services/[change-here]
Setup infrastructure
Initialize the terraform project :
# terraform init (upgrade) + validate
make terraform-init
Validate and apply the terraform project :
# terraform create sns topic + ssh key + iam user ...
make infra-create
The SNS Topic is created :

A confirmation email is received :

The email used is the one defined in the .env file :
SNS_EMAIL=[change-here]@gmail.com
Click the confirm subscription link :

This is confirmed :

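If the AWS CLI is configured locally, you can also double-check the subscription from a terminal (an optional check, not part of the project's Makefile) :
# list SNS subscriptions : SubscriptionArn stays "PendingConfirmation"
# until the email is confirmed, then becomes a full ARN
aws sns list-subscriptions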
Create 3 machines using Vagrant
A Vagrantfile is used to create 3 Ubuntu machines
Important : On Linux and macOS, VirtualBox will only allow IP addresses in the 192.168.56.0/21 range to be assigned
You MUST create a /etc/vbox/networks.conf file and add this line to be able to run this project :
* 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
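For example (a minimal sketch, assuming sudo rights on the host) :
# create the VirtualBox networks configuration file
sudo mkdir -p /etc/vbox
echo '* 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16' | sudo tee /etc/vbox/networks.conf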
Start the 3 machines :
# create monitoring + node1 + node2
make vagrant-up
The script also sets up ~/.ssh/known_hosts and ~/.ssh/config :
for ip in $MONITORING_IP $NODE1_IP $NODE2_IP
do
  # prevent SSH warning : 
  # @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!   @ 
  # @ IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY! @ 
  ssh-keygen -f "$HOME/.ssh/known_hosts" -R "$ip" 1>/dev/null 2>/dev/null
  # prevent SSH answer :
  # Are you sure you want to continue connecting (yes/no/[fingerprint])?
  # note : ssh-keyscan options (like `-p` for a custom port) must come before the ip
  ssh-keyscan $ip 2>/dev/null >> ~/.ssh/known_hosts
done
if [[ -z $(grep "Host monitoring $MONITORING_IP # $PROJECT_NAME" ~/.ssh/config) ]];
then
    echo "
Host monitoring $MONITORING_IP # $PROJECT_NAME
    HostName $MONITORING_IP
    User vagrant
    IdentityFile ~/.ssh/$PROJECT_NAME
    
    # ...
" >> ~/.ssh/config
fi
After that, we can connect directly to our machines with :
ssh monitoring
# or
ssh node1
# or
ssh node2
Install monitoring tools using Ansible
The following command installs + configures everything needed :
# install + configure prometheus + grafana + alert manager ...
make ansible-play
Our playbook contains 4 roles :
- name: install node_exporter
  hosts: all
  become: true
  roles:
    - node-exporter
- name: install prometheus + grafana
  hosts: monitoring
  become: true
  roles:
    - prometheus
    - alertmanager
    - grafana
- The node-exporter role is a wrapper for the cloudalchemy/ansible-node-exporter role
- The prometheus role is a wrapper for the cloudalchemy/ansible-prometheus role
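Once the play has run, you can verify that node_exporter answers on a node (an optional check, assuming the exporter's default port 9100) :
# query the node_exporter metrics endpoint on node1
ssh node1 'curl -s http://localhost:9100/metrics | head -n 5'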
The vars contain important setup directives :
prometheus_global:
  # demo intervals
  scrape_interval: 6s
  scrape_timeout: 3s
  evaluation_interval: 6s
prometheus_alertmanager_config:
  - scheme: http
    static_configs:
      - targets:
        - "{{ ansible_host }}:9093"
prometheus_alert_rules:
  - record: 'node_cpu_percentage'
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100)
  - alert: memory-warning
    expr: node_cpu_percentage > 50
    labels:
      severity: warning
    annotations:
      description: "{% raw %}CPU load is > 50%\n  VALUE = {{ $value }}{% endraw %}"
      url: "http://{{ ansible_host }}:9090/alerts?search=memory-warning"
  - alert: memory-critical
    expr: node_cpu_percentage > 75
    labels:
      severity: critical
    annotations:
      description: "{% raw %}Memory critical {{ $value }} !{% endraw %}"
      url: "http://{{ ansible_host }}:9090/alerts?search=memory-critical"
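The loaded rules can also be inspected through the Prometheus HTTP API (an optional check) :
# list the recording and alerting rules currently loaded by Prometheus
curl -s http://10.20.20.20:9090/api/v1/rules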
A must-see : the Awesome Prometheus alerts website
Targets are available at http://10.20.20.20:9090/targets :

Alerts are available at http://10.20.20.20:9090/alerts :

- The alertmanager role is a wrapper for the cloudalchemy/ansible-alertmanager role
 
Slack + email notifications are defined here :
- ansible.builtin.include_role:
    name: ansible-alertmanager
  vars: 
    alertmanager_version: "{{ alertmanager_latest_version }}"
    alertmanager_template_files:
      - "{{ current_role_path }}/templates/*.tmpl"
    alertmanager_slack_api_url: "{{ alertmanager_slack_url }}"
    alertmanager_receivers:
      - name: default-dummy
      - name: slack
        slack_configs:
          - send_resolved: true
            # The channel or user to send notifications to.
            channel:  "{{ alertmanager_slack_channel }}"
            title: '{% raw %}{{ template "slack-message-title" . }}{% endraw %}'
            text: '{% raw %}{{ template "slack-message-description" . }}{% endraw %}'
      - name: sns
        sns_configs:
        - send_resolved: true
          # SNS topic ARN, e.g. arn:aws:sns:us-east-2:698519295917:My-Topic
          topic_arn: "{{ sns_topic_arn }}"
          sigv4:
            access_key: "{{ sns_access_key }}"
            secret_key: "{{ sns_secret_key }}"
            region: "{{ sns_region }}"
    alertmanager_route:
      # default receiver (nothing behind, we'll use routes below)
      receiver: default-dummy
      # To aggregate by all possible labels use the special value '...' as the sole label name, for example:
      # group_by: ['...']
      group_by: ['...']
      group_interval: 6s # testing delay
      group_wait: 6s # testing delay
      repeat_interval: 3h
      # Zero or more child routes.
      routes:
      - receiver: sns
        continue: true
      - receiver: slack
        continue: true
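To confirm that Alertmanager picked up this configuration, you can query its HTTP API (an optional check, assuming the default port 9093 used above) :
# show the Alertmanager status, including the loaded configuration
curl -s http://10.20.20.20:9093/api/v2/status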
- The grafana role is a wrapper for the cloudalchemy/ansible-grafana role
 
The datasource and dashboards are defined here :
grafana_datasources:
- name: prometheus
  type: prometheus
  access: proxy
  url: 'http://127.0.0.1:9090'
  basicAuth: false
# https://grafana.com/grafana/dashboards/1860
# https://grafana.com/grafana/dashboards/11074
grafana_dashboards:
  # Node Exporter Full
  - dashboard_id: 1860
    revision_id: 30
    datasource: "{{ grafana_datasources.0.name }}"
  # Node Exporter for Prometheus Dashboard
  - dashboard_id: 11074
    revision_id: 9
    datasource: "{{ grafana_datasources.0.name }}"
Grafana is available at http://10.20.20.20:3000 :

Use admin / password to log in
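As a quick sanity check, the same credentials work against the Grafana HTTP API (optional, not part of the project) :
# check Grafana health, then list the provisioned dashboards
curl -s http://10.20.20.20:3000/api/health
curl -s -u admin:password http://10.20.20.20:3000/api/search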
Two dashboards are available :

Let’s see Node Exporter Full on node1 with Last 5 minutes and 10s refresh :

CPU Stress test
SSH login to node1 :
ssh node1
Start the CPU stress test :
vagrant@node1:~$ stress --cpu 2
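From the host machine, you can watch the recording rule climb through the Prometheus HTTP API while the stress test runs (an optional check) :
# query the node_cpu_percentage recording rule defined earlier
curl -s 'http://10.20.20.20:9090/api/v1/query?query=node_cpu_percentage'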
After a few seconds :

The table view of the 3 machines at http://10.20.20.20:9090/graph?g0.expr=node_cpu_percentage :

The graph view of the 3 machines at http://10.20.20.20:9090/graph?g0.expr=node_cpu_percentage&… :

The alerts are firing, as shown at http://10.20.20.20:9090/alerts :

Slack notifications are received :

Email notifications are received :

The other dashboard also shows critical values :

Cleaning
This demonstration is now over, let's destroy the resources :
# destroy the 3 machines
make vagrant-destroy
# terraform destroy sns topic + ssh key + iam user ...
make infra-destroy
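Optionally, confirm that nothing is left behind :
# no running VM should be listed anymore
vagrant global-status --prune
# the SNS topic should be gone
aws sns list-topics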