Jérôme Decoster

3x AWS Certified - Architect, Developer, Cloud Practitioner

24 Sep 2020

Prometheus + Grafana + Node

The Goal
  • Create a website with node
  • Export custom metrics from the site to Prometheus
  • Also export system metrics with Node Exporter
  • Display metrics in Grafana
  • Create alert rules on specific metrics
  • Receive a Slack notification when an alert is fired
  • Redo everything with docker-compose

    architecture.svg

    Install, set up and explore the project

    Get the code from this github repository :

    # download the code
    $ git clone \
        --depth 1 \
        https://github.com/jeromedecoster/note-prometheus-grafana-node.git \
        /tmp/note
    
    # cd
    $ cd /tmp/note
    

    To set up the project, run the following command :

    # install stress + docker pull prometheus + node-exporter + alertmanager + grafana ...
    $ make setup
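
    Based on the comment above, this target roughly installs the stress tool and pulls the Docker images used later. A sketch of the manual equivalent (not the actual Makefile; the stress install assumes a Debian/Ubuntu host) :

    # rough manual equivalent of `make setup` (sketch)
    $ sudo apt-get install stress
    $ docker pull prom/prometheus
    $ docker pull prom/node-exporter
    $ docker pull prom/alertmanager
    $ docker pull grafana/grafana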
    

    Configuring Slack notifications

    The project uses Slack notifications.

    In my Slack account, I have two channels #my-channel and #another-channel.

    They are configured to receive notifications with Incoming Webhooks :

    slack-01-incoming-webhook.png

    I search for the Webhook URLs of each channel :

    slack-02-webhook-url.png

    I modify my two files local-alert.yaml and compose-alert.yaml to replace the api_url values with my Webhook URLs :

    receivers:
      - name: slack_default
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/<CHANGE_URL_HERE>'
            text: "{{ .CommonAnnotations.description }}"
            icon_url: 'https://avatars3.githubusercontent.com/u/3380462'
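
    For reference, the routing section of these files is what sends each severity level to a different channel. A minimal sketch (the slack_critical receiver name is illustrative, not necessarily the one used in the repository) :

    route:
      receiver: slack_default
      routes:
        - match:
            severity: warning
          receiver: slack_default     # -> #my-channel
        - match:
            severity: critical
          receiver: slack_critical    # -> #another-channel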
    

    The configuration is complete.

    Exploring the website

    The website uses the npm module prom-client.

    The server uses 3 types of metrics :

    • Counter : a counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
    • Gauge : a gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
    • Histogram : a histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.

    Here is how they are used in the server code :

    const client = require('prom-client')
    
    // ...
    
    // 
    // Counter
    // 
    
    const _counter = new client.Counter({
      name: 'request_count',
      help: 'Number of requests.'
    })
    
    app.get('/counter', (req, res) => {
      _counter.inc()
      res.send(`<b>${_counter.name}</b> increased`)
    })
    
    // 
    // Gauge
    // 
    
    const _queue = new client.Gauge({
      name: 'queue_size',
      help: 'The size of the queue.'
    })
    
    app.get('/push', (req, res) => {
      _queue.inc()
      res.send(`<b>${_queue.name}</b> increased`)
    })
    
    app.get('/pop', (req, res) => {
      _queue.dec()
      res.send(`<b>${_queue.name}</b> decreased`)
    })
    
    // 
    // Histogram
    // 
    
    const _histogram = new client.Histogram({
      name: 'request_duration',
      help: 'Time for HTTP request.',
      // buckets: [1, 2, 5, 6, 10]
    })
    
    app.get('/wait', (req, res) => {
      var max
      var rnd = Math.random()
      if (rnd < .4) { max = 1 } 
      else if (rnd < .8) { max = 3 } 
      else max = 10
    
      const ms = Math.floor(Math.random() * max * 1000)
      setTimeout(function () {
        // convert to seconds
        _histogram.observe(ms / 1000)
        res.send(`<b>${_histogram.name}_bucket</b> filled.<br/>
                  <b>${_histogram.name}_sum</b> computed.<br/>
                  <b>${_histogram.name}_count</b> increased.<br/>
                  I kept you waiting for ${ms} ms!`)
      }, ms)
    })
    
    
    // metrics endpoint
    app.get('/metrics', (req, res) => {
      res.set('Content-Type', client.register.contentType)
      res.end(client.register.metrics())
    })
    

    Let’s start the website :

    # local development (by calling npm script directly)
    $ make dev
    

    By opening the address http://localhost:5000 you can see this tiny website :

    website-01-homepage.png

    By displaying the counter page we increase the request_count metric :

    website-02-counter.png

    By displaying the push page we increase the queue_size metric :

    website-03-push.png

    By displaying the pop page we decrease the queue_size metric :

    website-04-pop.png

    By displaying the wait page we vary the request_duration_bucket, request_duration_sum and request_duration_count metrics :

    website-05-wait.png
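
    To feed the histogram with some data, the wait page can be hit in a loop, for example :

    # generate some observations
    $ for i in $(seq 1 20); do curl -s http://localhost:5000/wait > /dev/null; done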

    The metrics page displays all the metrics :

    website-06-metrics.png
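
    For illustration, in the Prometheus exposition format returned by the metrics page, the counter looks like this (the value depends on how many times the counter page was visited) :

    # HELP request_count Number of requests.
    # TYPE request_count counter
    request_count 3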

    Prometheus

    Let’s start Prometheus :

    # run local prometheus
    $ make local-prometheus
    

    This command does this :

    $ docker run --detach \
        --name=prometheus \
        --network host \
        --volume $(pwd)/local-prometheus.yaml:/etc/prometheus/prometheus.yaml \
        --volume $(pwd)/local-rules.yaml:/etc/prometheus/rules.yaml \
        prom/prometheus \
        --config.file=/etc/prometheus/prometheus.yaml
    

    Prometheus is configured with the local-prometheus.yaml file :

    scrape_configs:
      - job_name: 'local'
        scrape_interval: 10s
        static_configs:
        - targets:
          - '0.0.0.0:5000'
          - '0.0.0.0:9100'
    
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - '0.0.0.0:9093'
    
    rule_files:
      - '/etc/prometheus/rules.yaml'
    

    Let’s detail this configuration :

    • Prometheus retrieves metrics from 0.0.0.0:5000, those exposed by our website running on localhost.
    • It also retrieves metrics from 0.0.0.0:9100, those exposed by Node Exporter, which we will install later.
    • It uses Alertmanager, which listens on port 9093 and which we will also install later.
    • It declares rules via the rules.yaml file.

    Here is the content of the rules file :

    groups:
      - name: memory-rule
        rules:
          - record: node_cpu_seconds_total:avg
            expr: (((count(count(node_cpu_seconds_total{job="local"}) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle',job="local"}[1m])))) * 100) / count(count(node_cpu_seconds_total{job="local"}) by (cpu))
          
          - alert: memory-warning
            expr: node_cpu_seconds_total:avg > 45
            labels:
              severity: warning
            annotations:
              description: Memory warning {{ $value }} !
    
          - alert: memory-critical
            expr: node_cpu_seconds_total:avg > 80
            labels:
              severity: critical
            annotations:
              description: Memory critical {{ $value }} !
    

    Let’s detail this configuration :

    • We define a recording rule named node_cpu_seconds_total:avg.
    • Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.
    • The associated expression expr computes the overall CPU usage as a percentage : the number of CPUs minus the total idle time used per second across all CPUs, multiplied by 100 and divided by the number of CPUs.
    • We define 2 alerting rules that fire if node_cpu_seconds_total:avg > 45 or if node_cpu_seconds_total:avg > 80.
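
    These two files can be validated with promtool, which is shipped in the prom/prometheus image. A sketch, reusing the same volume mounts :

    # validate the prometheus configuration and the rules it references
    $ docker run --rm \
        --volume $(pwd)/local-prometheus.yaml:/etc/prometheus/prometheus.yaml \
        --volume $(pwd)/local-rules.yaml:/etc/prometheus/rules.yaml \
        --entrypoint promtool \
        prom/prometheus \
        check config /etc/prometheus/prometheus.yaml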

    Prometheus is launched. We can see the rules by opening http://localhost:9090/rules :

    prometheus-01-rules.png

    We can see the alerts by opening http://localhost:9090/alerts :

    prometheus-02-alerts.png

    We can now use Prometheus to display our metrics.

    The URL http://localhost:9090/graph?g0.range_input=5m&g0.expr=queue_size&g0.tab=0 displays the evolution of the queue_size metric that we varied by visiting the push and pop pages of our website :

    prometheus-03-queue-size.png
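
    The Prometheus HTTP API can also be queried directly. For example, the current value of the queue_size metric :

    # query the metric through the HTTP API
    $ curl 'http://localhost:9090/api/v1/query?query=queue_size'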

    Node Exporter

    We install Node Exporter :

    # run local node-exporter
    $ make local-node-exporter
    

    This command does this :

    $ docker run --detach \
        --name node-exporter \
        --restart=always \
        --network host \
        prom/node-exporter
    

    Once installed, the metrics are available at the address http://localhost:9100/metrics :

    node-exporter-01-metrics.png
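
    We can check the node_cpu_seconds_total metric, used by our rules, directly from the command line (the values below are just illustrative) :

    $ curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head -2
    node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
    node_cpu_seconds_total{cpu="0",mode="iowait"} 12.34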

    Now all these new metrics are available in Prometheus :

    The URL http://localhost:9090/graph?g0.range_input=5m&g0.expr=node_cpu_seconds_total&g0.tab=1 displays the evolution of the node_cpu_seconds_total metric :

    node-exporter-02-node-cpu-seconds.png

    Installing Alertmanager

    Let’s install Alertmanager :

    # run local alertmanager
    $ make local-alertmanager
    

    This command does this :

    $ docker run --detach \
        --name=alertmanager \
        --network host \
        --volume $(pwd)/local-alert.yaml:/etc/alertmanager/local-alert.yaml \
        prom/alertmanager \
        --config.file=/etc/alertmanager/local-alert.yaml
    

    Once installed, the Alertmanager is available at the address http://localhost:9093 :

    alertmanager-01-ui.png
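
    Alertmanager also exposes an HTTP API. For example, the currently active alerts can be listed with :

    # list the active alerts (empty for now)
    $ curl http://localhost:9093/api/v2/alerts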

    Stress test

    We will trigger an alert by stressing our CPU with the stress executable :

    # hot !
    $ stress --cpu 2
    

    The URL http://localhost:9090/graph?g0.range_input=1h&g0.expr=node_cpu_seconds_total%3Aavg&g0.tab=0 displays the evolution of our custom metric node_cpu_seconds_total:avg :

    prometheus-05-graph-node-cpu-seconds.png

    We see that the 45% CPU usage threshold is exceeded. An alert is triggered and displayed in Alertmanager :

    alertmanager-02-warning-fired.png

    You can also see this alert in the Prometheus interface :

    prometheus-04-warning-fired.png

    And our Slack channel has been notified :

    slack-03-warning-fired.png

    Installing Grafana

    Grafana allows us to display this data in the form of a pretty dashboard.

    Let’s start Grafana :

    $ make local-grafana
    

    This command does this :

    $ docker run --detach \
        --env GF_AUTH_BASIC_ENABLED=false \
        --env GF_AUTH_ANONYMOUS_ENABLED=true \
        --env GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
        --name=grafana \
        --network host \
        grafana/grafana
    

    After a few seconds of initialization, Grafana is visible at the address http://localhost:3000 :

    grafana-01-ui.png

    However, Grafana still needs to be configured :

    1. First, add a datasource
    2. Then add a dashboard

    We can configure it with this command:

    $ make local-grafana-configure
    

    This command adds Prometheus as a datasource like this :

    $ curl http://localhost:3000/api/datasources \
        --header 'Content-Type: application/json' \
        --data @local-datasource.json
    

    The local-datasource.json file is simple :

    {
        "name": "Prometheus",
        "type": "prometheus",
        "access": "proxy",
        "url": "http://localhost:9090",
        "basicAuth": false,
        "isDefault": true
    }
    

    This command then installs a dashboard specially designed to retrieve data from Node Exporter.

    The command retrieves the dashboard from this JSON data : https://grafana.com/api/dashboards/1860.

    # create dashboard-1860.json
    $ curl https://grafana.com/api/dashboards/1860 | jq '.json' > dashboard-1860.json
    

    To be imported into Grafana, this JSON needs to be wrapped like this :

    # wrap some JSON data
    $ ( echo '{ "overwrite": true, "dashboard" :'; \
        cat dashboard-1860.json; \
        echo '}' ) \
        | jq \
        > dashboard-1860-modified.json
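
    The resulting dashboard-1860-modified.json file therefore has this shape (the dashboard content itself is elided here) :

    {
      "overwrite": true,
      "dashboard": { ... }
    }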
    

    We now add it to Grafana :

    # add dashboard-1860-modified
    $ curl http://localhost:3000/api/dashboards/db \
        --header 'Content-Type: application/json' \
        --data @dashboard-1860-modified.json
    

    Another dashboard, specific to the metrics of our website, is also added.

    We reload our browser and see that the dashboards have been added :

    grafana-02-configured.png
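
    We can also check through the Grafana API that the datasource and the dashboards have been registered :

    # list the datasources and the dashboards
    $ curl http://localhost:3000/api/datasources
    $ curl http://localhost:3000/api/search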

    We display the Node Exporter Full dashboard :

    grafana-03-node-exporter-dashboard.png

    We display the My dashboard dashboard :

    grafana-04-my-dashboard.png
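
    The panels of My dashboard are defined in a JSON file from the repository. As an illustration, rate(request_count[1m]) gives the number of requests handled per second, and the average request duration in seconds can be computed with an expression like :

    rate(request_duration_sum[1m]) / rate(request_duration_count[1m])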

    A new stress test

    We are going to stress our CPU again. This time a little harder :

    $ stress --cpu 3
    

    We see that the 80% CPU usage threshold is exceeded :

    grafana-05-stress.png

    Our 2 alerts are triggered and displayed in Alertmanager :

    alertmanager-03-stress-fired.png

    We can also see our alerts triggered in the Prometheus interface :

    prometheus-06-stress-alerts.png

    The #my-channel Slack channel has received the warning notification :

    slack-04-stress-warning-fired.png

    The #another-channel Slack channel has received the critical notification :

    slack-05-stress-critical-fired.png

    Our tests are now complete. We can remove the running containers :

    # remove all running containers
    $ make rm
    

    Using Docker Compose

    The goal is to set up a similar environment using docker-compose.

    One command is enough :

    # docker-compose up
    $ make compose-up
    

    It is worth taking a look at the configuration files.

    The docker-compose.yaml file :

    version: "3"
    services:
      site:
        build:
          context: ./site
        container_name: site
        ports:
          - "5000:5000"
    
      node-exporter:
        container_name: node-exporter
        image: prom/node-exporter
        ports: 
          - "9100:9100"
    
      alertmanager:
        container_name: alertmanager
        image: prom/alertmanager
        ports: 
          - "9093:9093"
        command: --config.file=/etc/alertmanager/compose-alert.yaml
        volumes:
          - ./compose-alert.yaml:/etc/alertmanager/compose-alert.yaml
    
      prometheus:
        container_name: prometheus
        image: prom/prometheus
        ports: 
          - "9090:9090"
        command: --config.file=/etc/prometheus/prometheus.yaml
        volumes:
          - ./compose-prometheus.yaml:/etc/prometheus/prometheus.yaml
          - ./compose-rules.yaml:/etc/prometheus/rules.yaml
          
      grafana:
        container_name: grafana
        image: grafana/grafana
        ports:
          - "3000:3000"
        environment:
          - GF_AUTH_BASIC_ENABLED=false
          - GF_AUTH_ANONYMOUS_ENABLED=true
          - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    
      grafana-setup:
        container_name: grafana-setup
        image: alpine:3.10
        depends_on:
          - grafana
        volumes:
          - ./compose-datasource.json:/etc/grafana/compose-datasource.json
          - ./compose-dashboard.json:/etc/grafana/compose-dashboard.json
          - ./compose-my-dashboard.json:/etc/grafana/compose-my-dashboard.json
        command: >
          /bin/sh -c "
            apk add --no-cache curl
            echo 'waiting for grafana'
            sleep 7s
            cd /etc/grafana/
            curl http://grafana:3000/api/datasources \
              --header 'Content-Type: application/json' \
              --data @compose-datasource.json
            curl http://grafana:3000/api/dashboards/db \
              --header 'Content-Type: application/json' \
              --data @compose-dashboard.json
            curl http://grafana:3000/api/dashboards/db \
              --header 'Content-Type: application/json' \
              --data @compose-my-dashboard.json"
    

    Note how the Prometheus configuration has been modified. The targets now use the Docker Compose service names (site, node-exporter, alertmanager) instead of 0.0.0.0, since each container is reachable by its service name on the Compose network :

    scrape_configs:
      - job_name: 'compose'
        scrape_interval: 5s
        static_configs:
        - targets:
          - 'site:5000'
          - 'node-exporter:9100'
    
    rule_files:
      - '/etc/prometheus/rules.yaml'
    
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 'alertmanager:9093'
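
    When the test is finished, the whole stack can be stopped the usual way, for example :

    # stop and remove the containers created by docker-compose
    $ docker-compose down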