Monitoring and Logging with Prometheus: A Practical Guide
Monitoring and logging are critical components of any robust IT infrastructure. They ensure that systems run smoothly, issues are detected early, and performance is optimized. Prometheus has emerged as a popular tool for these tasks, offering powerful features for metrics collection, querying, and alerting. In this guide, we'll explore how to effectively use Prometheus for monitoring and logging, providing a practical, hands-on approach to get you started. Key topics include Prometheus architecture, installation, setup, metric collection, alerting, and integration with other tools.
If you are curious about who is using Prometheus globally, here is a brief list of notable companies that have adopted it: JPMorgan, Dice, Grafana Labs, Visa, and more.
What is Prometheus? Understanding the Basics
Prometheus is an open-source monitoring and alerting toolkit designed specifically for reliability and scalability. Originally developed at SoundCloud, Prometheus is now a part of the Cloud Native Computing Foundation (CNCF). This section will delve into Prometheus' main concepts.
Exporter
is a component or service that collects metrics from third-party systems or applications (such as databases, web servers, or hardware devices) and exposes them in a format that Prometheus can scrape. Exporters are commonly used for systems that don't have native Prometheus metric exposure.
Examples of exporters include:
- The Node Exporter is used to expose system metrics (e.g., CPU, memory, disk usage) from Linux systems.
- The MySQL Exporter collects MySQL database metrics and makes them available for Prometheus to scrape.
Service discovery
is a mechanism in Prometheus that automatically detects and configures scrape targets (applications, services, or exporters) without manual intervention. Instead of manually configuring each target, Prometheus can discover them through integrations with various platforms, such as Kubernetes, AWS EC2, or Consul. Service discovery allows Prometheus to keep its list of scrape targets up to date, even as services are added or removed dynamically. As an example, in Kubernetes, Prometheus can automatically discover and monitor new pods as they are deployed.
Scraping
refers to the process by which Prometheus collects metrics from monitored services or systems. Prometheus periodically makes HTTP requests to endpoints that expose metrics, retrieves the data, and stores it in its time-series database. The frequency of scraping and the list of targets are defined in the Prometheus configuration. For example, Prometheus might scrape the /metrics endpoint of a web server every 15 seconds to collect HTTP request statistics.
PromQL (Prometheus Query Language)
is the query language used in Prometheus to retrieve and manipulate time-series data. With PromQL, users can query the stored metrics, perform calculations, and create visualizations or alerts. PromQL is a powerful and flexible tool for analyzing data, supporting operations like filtering, aggregations, and rate calculations.
Collecting Metrics with Prometheus
Prometheus collects metrics by scraping targets at specified intervals. This section will explain how to define targets in the prometheus.yml configuration file and how to use exporters to expose metrics from different systems (like Node Exporter for Linux system metrics, Blackbox Exporter for probing endpoints, etc.). It will also cover writing custom exporters for your specific applications.
Callback
In the context of Prometheus, a callback refers to a function or handler that is invoked when Prometheus scrapes a target. This typically applies when metrics are not stored in memory directly but need to be computed or fetched on demand from an external source. The callback allows a target to dynamically return metrics at scrape time. This is common in custom instrumentation where specific logic is run to compute metrics only when needed.
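As a rough sketch of how callback-style metrics can be implemented with the Prometheus Java client library (the same library used later in this guide), a custom Collector computes its value only when Prometheus scrapes the endpoint. The class name, metric name, port, and fetchQueueDepth() helper below are illustrative assumptions, not part of any standard API:

import io.prometheus.client.Collector;
import io.prometheus.client.GaugeMetricFamily;
import io.prometheus.client.exporter.HTTPServer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector: the value is computed each time Prometheus scrapes /metrics,
// rather than being stored in memory between scrapes.
public class OnDemandCollector extends Collector {
    @Override
    public List<MetricFamilySamples> collect() {
        List<MetricFamilySamples> samples = new ArrayList<>();
        samples.add(new GaugeMetricFamily(
                "app_queue_depth",                       // illustrative metric name
                "Queue depth computed at scrape time",
                fetchQueueDepth()));
        return samples;
    }

    private double fetchQueueDepth() {
        return 42.0; // placeholder for a real lookup against an external source
    }

    public static void main(String[] args) throws Exception {
        new OnDemandCollector().register();  // register the callback-style collector
        new HTTPServer(8000);                // expose /metrics on port 8000 (assumed)
        Thread.currentThread().join();       // keep the process alive for scraping
    }
}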
Targets
In Prometheus, "targets" refer to the individual endpoints (typically applications or services) that Prometheus scrapes to collect metrics. These targets are defined in the Prometheus configuration file (prometheus.yml) and usually expose metrics via HTTP on a specific endpoint (e.g., /metrics). Targets are organized into groups and jobs, and Prometheus regularly queries them to fetch up-to-date data. A target could be a Node Exporter running on a server that Prometheus scrapes to gather system metrics.
Figure: Prometheus architecture, showing how Prometheus scrapes metrics from various sources, stores them in a time-series database, and provides alerting.
Installing and Setting Up Prometheus
This section provides a step-by-step guide on how to install Prometheus on a Linux-based operating system. The goal is to have a working Prometheus instance by the end of this section.
Software/Hardware Description
- Oracle VirtualBox: A powerful x86 and AMD64/Intel64 virtualization product.
- ISO image of a Linux distribution (Ubuntu Budgie 22.04.2 LTS 64bit).
- Memory: 4 GB
- Hard Disk: 60 GB
- CPU: 4 CPUs
Prometheus Installation
Download site: https://prometheus.io/download/
Download version: prometheus-2.54.1.linux-amd64.tar.gz
Run the following command in the terminal:
tar xvfz prometheus-*.tar.gz
cd prometheus-*
./prometheus
Launch your browser and navigate to the Prometheus page: http://localhost:9090
Figure: Prometheus initial page.
Figure: Prometheus catalog of accessible metrics.
process_resident_memory_bytes
This metric represents the amount of memory in bytes that a process is currently using in RAM (also called "resident memory"). It is a gauge-type metric that provides a real-time snapshot of memory consumption, which can help in identifying memory leaks or understanding the memory footprint of an application.
If you monitor a Prometheus server and see that process_resident_memory_bytes is consistently increasing, it could indicate a memory leak or inefficiency in the application.
Figure: Graph of resident process memory in bytes.
process_cpu_seconds_total
This metric shows the total amount of CPU time (in seconds) consumed by a process. It accumulates over time and represents the CPU usage since the start of the process. This metric is typically used to analyze CPU utilization and is useful when combined with the rate() function to monitor CPU usage over a specified time window.
rate(process_cpu_seconds_total[5m]) calculates the CPU usage rate over the last 5 minutes, giving insight into how intensively the application is using CPU resources.
Figure: rate(process_cpu_seconds_total[5m]) calculates the CPU usage rate over the last 5 minutes.
Prometheus can be set up using the YAML file prometheus.yml:
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: "prometheus"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9090"]
Install Node Exporter
The Node Exporter exposes hardware and operating-system metrics (such as CPU, memory, disk, and network usage) from Linux hosts in a format that Prometheus can scrape.
Node Exporter Installation
Download version: node_exporter-1.8.2.linux-amd64.tar.gz
Run the following command in the terminal:
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*
./node_exporter
Edit the Prometheus configuration file, prometheus.yml, to include the following lines:
- job_name: "node"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9100"]
The entire Prometheus configuration file is:
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: "prometheus"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9090"]
- job_name: "node"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9100"]
Restart Prometheus so that it picks up the new configuration.
Figure: On the Prometheus targets page, both the "node" and "prometheus" jobs are operational.
Figure: Graph depicting the resident memory usage of the job labeled "node."
Setting Up Alerts with Prometheus and Alertmanager
Prometheus provides robust alerting capabilities through its integration with Alertmanager. This section will guide you on how to define alert rules, configure Alertmanager, and route alerts to various channels such as email, Slack, or PagerDuty. We'll also discuss best practices for setting up alerts to avoid alert fatigue.
Alerting rules define conditions in Prometheus that trigger alerts when certain thresholds or patterns are met in the metric data. These rules are written in PromQL and evaluated periodically by Prometheus. If a rule's condition is true, it generates an alert, which is then sent to the Alertmanager for further processing (e.g., routing or notification).
Alertmanager
Alertmanager is a component in the Prometheus ecosystem responsible for handling alerts generated by Prometheus alerting rules. It manages the lifecycle of alerts, including deduplication, grouping, and routing them to various notification channels (e.g., email, Slack, PagerDuty). It also allows users to silence or inhibit alerts under certain conditions (e.g., during maintenance windows).
Alertmanager Installation
Download version: alertmanager-0.27.0.linux-amd64.tar.gz
Run from the command line:
tar xvfz alertmanager-*.tar.gz
cd alertmanager-*
./alertmanager
Revise the Prometheus configuration file, prometheus.yml, to match the following updated format:
# my global config
global:
scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
# scrape_timeout is set to the global default (10s).
# Alertmanager configuration
rule_files: [rules.yml]
alerting:
alertmanagers:
- static_configs:
- targets: [localhost:9093]
# - alertmanager:9093
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
# The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
- job_name: "prometheus"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9090"]
- job_name: "node"
# metrics_path defaults to '/metrics'
# scheme defaults to 'http'.
static_configs:
- targets: ["localhost:9100"]
Store a file named rules.yml in the Prometheus directory:
groups:
- name: test
rules:
- alert: InstanceDown
expr: up == 0
You can customize Alertmanager using the alertmanager.yml file located in the alertmanager directory.
global:
smtp_smarthost: 'localhost:25'
smtp_from: 'yourprometheus@text.org'
route:
# When a new group of alerts is created by an incoming alert, wait at
# least 'group_wait' to send the initial notification.
# This way ensures that you get multiple alerts for the same group that start
# firing shortly after another are batched together on the first
# notification.
group_wait: 10s
# When the first notification was sent, wait 'group_interval' to send a batch
# of new alerts that started firing for that group.
group_interval: 5m
# If an alert has successfully been sent, wait 'repeat_interval' to
# resend them.
repeat_interval: 30m
# A default receiver
receiver: "web.hook"
# All the above attributes are inherited by all child routes and can
# be overwritten on each.
routes:
- receiver: "web.hook"
group_wait: 30s
match_re:
severity: critical|warning
continue: true
- receiver: "test-email"
group_wait: 10s
match_re:
severity: critical
continue: true
receivers:
- name: 'web.hook'
webhook_configs:
- url: 'http://127.0.0.1:5001/'
- name: 'test-email'
email_configs:
- to: 'yourprometheus@text.org'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
Launch your browser and navigate to the Alertmanager page: http://localhost:9093/#/alerts
Figure: The Alertmanager alerts page.
Whenever a target goes down for any reason, Alertmanager will send an email notification through the newly configured system.
Figure: Prometheus alert page indicating that a target has gone offline.
Alerting Rules and Recording Rules
The rules.yml file in Prometheus is used to define alerting rules and recording rules. This file is part of Prometheus’ alerting and monitoring configuration. While it’s typically named rules.yml, you can name it anything, as long as it’s referenced properly in the Prometheus configuration file (prometheus.yml).
Here’s a breakdown of the two types of rules that can be configured in this file:
Alerting Rules
Alerting rules are used to trigger alerts based on conditions evaluated from the metrics collected by Prometheus. When the conditions defined in the rule are met, Prometheus creates alerts, which can be sent to Alertmanager (another Prometheus component) for processing and routing (e.g., sending notifications to email, Slack, PagerDuty, etc.).
Example of an alerting rule in rules.yml:
groups:
- name: example # each rule group needs a name
rules:
- alert: HighCPUUsage
expr: sum(rate(process_cpu_seconds_total[5m])) by (instance) > 0.8
for: 5m
labels:
severity: critical
annotations:
summary: "High CPU usage detected on instance {{ $labels.instance }}"
description: "CPU usage has exceeded 80% for the last 5 minutes on {{ $labels.instance }}."
alert: The name of the alert (e.g., HighCPUUsage).
expr: The PromQL expression that defines the condition for the alert (e.g., CPU usage above 80%).
for: The duration for which the condition must be true before an alert is triggered.
labels: Metadata to help identify and route the alert (e.g., severity level).
annotations: Extra information about the alert, like a summary and a description, which can be displayed in alert notifications.
Recording Rules
Recording rules allow you to precompute and store the results of Prometheus queries. This is useful when you want to run complex queries frequently or across large datasets, as it reduces the computational overhead by precomputing the result and storing it as a new metric.
Example of a recording rule in rules.yml:
groups:
- name: example # each rule group needs a name
rules:
- record: job:http_inprogress_requests:sum
expr: sum by (job) (http_inprogress_requests)
record: The name of the new metric to be recorded (e.g., job:http_inprogress_requests:sum).
expr: The PromQL query to compute the value of the metric (e.g., the sum of in-progress HTTP requests per job).
Structure of rules.yml
The rules in the rules.yml file are grouped together using groups. Each group can contain multiple alerting and recording rules. Grouping rules is useful because Prometheus evaluates the rules in a group together at the same time.
Example structure of rules.yml:
groups:
- name: example-group
interval: 30s # Evaluation interval for this group
rules:
- alert: HighMemoryUsage
expr: sum(container_memory_usage_bytes) > 1e+09
for: 5m
labels:
severity: warning
annotations:
summary: "High Memory Usage Detected"
description: "Memory usage is above 1GB for more than 5 minutes."
- record: instance:node_cpu:rate1m
expr: rate(node_cpu_seconds_total[1m])
Visualizing Data with Prometheus and the Java Client Library
A client library in Prometheus is a library provided for various programming languages (e.g., Go, Python, Java, Ruby) that allows applications to expose custom metrics to Prometheus. These libraries help developers instrument their code to define, create, and expose various metric types such as counters, gauges, histograms, and summaries. Once metrics are exposed, Prometheus can scrape the application's endpoint to collect the data for monitoring.
In my Java example code, I will demonstrate a straightforward application of Prometheus features.
Setting up the Maven application workspace
<dependencies>
<dependency>
<groupId>io.prometheus</groupId>
<artifactId>simpleclient</artifactId>
<version>0.16.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>io.prometheus</groupId>
<artifactId>simpleclient_hotspot</artifactId>
<version>0.16.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>io.prometheus</groupId>
<artifactId>simpleclient_httpserver</artifactId>
<version>0.16.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>3.8.1</version>
<scope>test</scope>
</dependency>
</dependencies>
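With these dependencies in place, a minimal sketch of an application that exposes default JVM metrics for Prometheus to scrape could look like the following; the class name and port 8000 are assumptions rather than the original example code:

import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

public class PrometheusExample {
    public static void main(String[] args) throws Exception {
        DefaultExports.initialize();   // register the default JVM metrics (memory, GC, threads, ...)
        new HTTPServer(8000);          // serve all registered metrics on http://localhost:8000/metrics
        Thread.currentThread().join(); // keep the process alive so Prometheus can scrape it
    }
}

To have Prometheus scrape it, add a job targeting localhost:8000 to prometheus.yml, just as was done for the Node Exporter earlier.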
Counter
A counter is a metric that represents a monotonically increasing value that can only go up (or reset to zero). Counters are often used to track events, such as the number of HTTP requests, number of errors, or bytes processed. A counter should never decrease, but it can reset (usually after a service restart). For example, http_requests_total might count the total number of HTTP requests processed by a server.
In the example below, the inc() method is called inside a loop simply to count the loop's iterations.
On the graph page, choose the my_counter_total metric. Clearly, the graph will display a steadily ascending ramp.
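The original counter code is not reproduced here; the sketch below shows one way the described behavior could look, with a counter registered as my_counter (which appears on the graph page as my_counter_total, as noted above) and inc() called once per loop pass. The class name, port, and sleep interval are assumptions:

import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class CounterExample {
    public static void main(String[] args) throws Exception {
        new HTTPServer(8000); // expose /metrics (port is an assumption)
        Counter myCounter = Counter.build()
                .name("my_counter")
                .help("Counts loop iterations.")
                .register();
        while (true) {
            myCounter.inc();    // increment once per loop iteration
            Thread.sleep(1000); // pause so the ramp is visible on the graph
        }
    }
}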
Rate
In Prometheus, rate is a function used to calculate the per-second average rate of increase of a given counter over a specified time range. It is most commonly applied to counters, such as the number of requests or errors, and helps in visualizing how fast something is changing over time. For example, using rate(my_counter_total[1m]) calculates the per-second increase of the my_counter counter over the last 1 minute.
Gauge
A gauge is a metric that represents a value that can go up and down over time, as opposed to a counter, which only increases. Gauges are used to track values like current memory usage, CPU utilization, the number of active connections, or the last time a request was processed. For example, a gauge might track the time needed for a loop to complete, where the value can increase or decrease based on the conditions.
In the Java example, inProgress is incremented on every loop iteration and last records the completion time of each iteration.
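A minimal sketch of such a gauge loop follows; the metric names, port, and timing are assumptions of my own, not the original code:

import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;

public class GaugeExample {
    public static void main(String[] args) throws Exception {
        new HTTPServer(8000); // expose /metrics (port is an assumption)
        Gauge inProgress = Gauge.build()
                .name("my_inprogress_loops")
                .help("Number of loop iterations currently in progress.")
                .register();
        Gauge last = Gauge.build()
                .name("my_last_loop_completion_timestamp")
                .help("Unix timestamp of the last completed loop iteration.")
                .register();
        while (true) {
            inProgress.inc();        // a gauge can go up when work starts ...
            Thread.sleep(500);       // simulated work
            inProgress.dec();        // ... and back down when it finishes
            last.setToCurrentTime(); // record when this iteration ended
        }
    }
}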
Bucket
A bucket refers to a range in which an observation value falls when using a histogram. Buckets allow the histogram to track how many observations fall within certain ranges of values. You can define buckets based on what makes sense for the metric being tracked (e.g., latencies in milliseconds).
For example, you might have buckets for request latencies: [0.1, 0.5, 1, 2, 5] seconds. A request that takes 0.7 seconds would fall into the 1-second bucket. By analyzing bucket counts, you can understand the distribution of latencies (or other measured values).
Summary
A summary is a type of metric in Prometheus that captures observations like response sizes or request durations. It can calculate quantiles (like the 50th percentile, 90th percentile, etc.) and also keeps a running total of observations and their sum. Summaries are useful for tracking how long operations take or the size of responses, providing insights into distribution and latency.
Summaries are calculated locally in each instance, making it challenging to aggregate across multiple instances or services.
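As a rough illustration with the Java client library, a summary can time an operation and expose configurable quantiles; the metric name, quantile/error targets, and port below are assumptions:

import io.prometheus.client.Summary;
import io.prometheus.client.exporter.HTTPServer;

public class SummaryExample {
    static final Summary loopDuration = Summary.build()
            .name("my_loop_duration_seconds")
            .help("Time taken by each loop iteration in seconds.")
            .quantile(0.5, 0.05)   // median, with 5% tolerated error
            .quantile(0.9, 0.01)   // 90th percentile, with 1% tolerated error
            .quantile(0.99, 0.001) // 99th percentile
            .register();

    public static void main(String[] args) throws Exception {
        new HTTPServer(8000); // expose /metrics (port is an assumption)
        while (true) {
            Summary.Timer timer = loopDuration.startTimer();
            Thread.sleep((long) (Math.random() * 200)); // simulated work
            timer.observeDuration();                    // record the elapsed seconds
        }
    }
}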
Histogram
A histogram is a type of metric that measures the distribution of observations (such as request durations or response sizes) by segmenting them into configurable buckets. Unlike summaries, histograms are designed to be aggregatable across multiple instances.
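The histogram behind the my_loop_event_duration_bucket series queried later in this section could be instrumented roughly as follows; the bucket boundaries, port, and timing are assumptions:

import io.prometheus.client.Histogram;
import io.prometheus.client.exporter.HTTPServer;

public class HistogramExample {
    static final Histogram loopDuration = Histogram.build()
            .name("my_loop_event_duration")
            .help("Duration of each loop iteration in seconds.")
            .buckets(0.1, 0.25, 0.5, 1, 2, 5) // cumulative buckets, plus an implicit +Inf
            .register();

    public static void main(String[] args) throws Exception {
        new HTTPServer(8000); // expose /metrics (port is an assumption)
        while (true) {
            Histogram.Timer timer = loopDuration.startTimer();
            Thread.sleep((long) (Math.random() * 1000)); // simulated work
            timer.observeDuration(); // increments every bucket whose upper bound covers the duration
        }
    }
}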
Figure: Table of bucket distribution.
Figure: Graph of bucket distribution.
A quantile is a statistical measure that divides a dataset into equal-sized, consecutive intervals. For example, the 0.5 quantile (also called the median) divides the dataset such that 50% of the data falls below this value, and 50% falls above it. In Prometheus, quantiles are used with histograms to measure response times or latency at different levels (e.g., 90th or 99th percentile), giving you insight into the distribution of a particular metric.
Example: The 99th percentile (0.99 quantile) of request duration tells you that 99% of all requests completed in this amount of time or less, while 1% took longer.
histogram_quantile() is a Prometheus function used to estimate quantiles (e.g., the 90th or 99th percentile) from histogram buckets. It is often used to analyze response-time distributions, giving insight into how long requests take at certain percentiles, which is particularly useful when you want to understand the latency experienced by the slowest users. A histogram tracks how many observations fall into each bucket (based on defined thresholds), making it useful for analyzing latency, response size, and request durations across different services.
Figure: histogram_quantile(0.95, rate(my_loop_event_duration_bucket[10m])) computes the 95th percentile of loop durations over the last 10 minutes.
Visualizing Data with Prometheus and Grafana
While Prometheus comes with a basic web UI for querying and visualizing data, Grafana is often used for more advanced dashboards and visualizations. This section will cover how to integrate Prometheus with Grafana, create custom dashboards, and use Prometheus queries (PromQL) to visualize key metrics.
Grafana download page: https://grafana.com/grafana/download
In the command line, enter:
sudo apt-get install -y adduser libfontconfig1 musl
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_11.2.0_amd64.deb
sudo dpkg -i grafana-enterprise_11.2.0_amd64.deb
To start Grafana, enter:
sudo grafana-server -homepath /usr/share/grafana
Steps to Connect Grafana to Prometheus
1. Access the Grafana UI
- Open your web browser and go to your Grafana instance (typically http://localhost:3000 if running locally).
- Log in with your credentials (default username and password are both admin for the first login).
2. Add Prometheus as a Data Source
- Once logged in, on the left sidebar, click on "Connections" and then select "Data Sources."
- Click the "Add Data Source" button.
3. Choose Prometheus
In the list of available data sources, click on "Prometheus."
4. Configure the Prometheus Data Source
- You'll be taken to a settings page to configure the Prometheus data source.
- URL: Enter the URL of your Prometheus server (e.g., http://localhost:9090 or the address of your remote Prometheus instance).
- Access: Set this to "Server" (default) if Grafana and Prometheus are running on the same network or use "Browser" if you want the browser to connect to Prometheus directly.
- Basic Authentication: If your Prometheus instance requires authentication, configure the necessary credentials here.
5. Test the Connection
- Scroll down to the bottom of the page and click "Save & Test."
- If the connection is successful, you'll see a green message that says "Data source is working."
6. Create a New Dashboard
- To create a new dashboard, click on the "+ New Dashboard" button from the left sidebar or go to Dashboards → Manage → New Dashboard.
- Add visualization
- Select Prometheus Data source
Figure: Grafana empty panel.
7. Query Prometheus Metrics
- In the "Query" section, select "Prometheus" as the data source (it should already be selected if it's the only data source).
- Set the label filter to the label job with the value app_test (job = app_test).
- In the Standard options panel, set Unit to bytes.
- Select Run queries.
- You can adjust the visualization type (graph, gauge, table, etc.), time range, and other settings.
- Add more panels as needed to visualize different metrics from Prometheus.
8. Save the Dashboard
Once you're satisfied with the panels, click "Save" in the top-right corner of the screen, give the dashboard a name, and save it for future use.
Conclusion
Prometheus offers a powerful, flexible solution for monitoring and logging in modern IT environments. By following this practical guide, you should be well-equipped to set up Prometheus, collect and visualize metrics, configure alerts, and integrate it with other tools in your observability stack. Whether you're new to Prometheus or looking to enhance your existing setup, this guide serves as a comprehensive resource to achieve your monitoring and logging goals.
References
Prometheus Documentation: The official documentation of Prometheus provides detailed insights into its architecture, setup, and usage. It’s a primary resource for understanding how Prometheus operates and how to use it effectively for monitoring.
Grafana Documentation: This guide from the Grafana website walks you through integrating Grafana with Prometheus to visualize metrics. It’s an essential resource for creating and managing dashboards with Grafana.
"Prometheus: Up & Running" by Brian Brazil: Published by O'Reilly Media, 2018. A comprehensive guide to understanding Prometheus from installation to advanced topics like PromQL and alerting. It’s a great resource for both beginners and advanced users.
Alertmanager Documentation: The official Alertmanager documentation, which covers how to configure and use Alertmanager for handling alerts generated by Prometheus.
Your Feedback Matters!
Have ideas or suggestions? Follow the blog and share your thoughts in the comments.
About Me
I am passionate about IT technologies. If you’re interested in learning more or staying updated with my latest articles, feel free to connect with me on:
Feel free to reach out through any of these platforms if you have any questions!