
VoIP Monitoring with Prometheus and Grafana

Build a production VoIP observability stack with Prometheus and Grafana: Kamailio stats, Asterisk metrics via snmp-exporter, RTP quality dashboards, and SLA alerting rules.

Tumarm Engineering · 10 min read


VoIP systems fail in ways that don't show up in generic infrastructure monitoring. CPU at 20%, memory healthy, network up — but calls are dropping because the SIP registrar is rejecting REGISTERs due to a certificate expiry, or RTP packet loss is hitting 8% on a specific carrier route, or the T.38 fax relay is silently failing. This post covers building a VoIP-specific observability stack that surfaces the metrics that matter: call quality, registration health, SIP error rates, and carrier route performance.

What to Monitor in VoIP Infrastructure

Before instrumenting anything, define the four signal types for VoIP:

Signal         | Examples                                     | Tool
SIP signaling  | INVITE rate, 4xx/5xx rate, REGISTER failures | kamailio_exporter, snmp_exporter
Media quality  | Packet loss, jitter, MOS score               | rtpengine metrics, Homer SIPcapture
Infrastructure | CPU, memory, network I/O, disk               | node_exporter
Business       | ASR (answer rate), ACD (avg call duration), NER (network effectiveness ratio) | CDR database queries

A complete monitoring setup scrapes all four. Infrastructure metrics alone give you uptime; the other three give you quality.

Kamailio Metrics with kamailio_exporter

Kamailio exposes internal statistics via the statistics module. The kamailio_exporter translates these to Prometheus metrics.

Install the exporter:

# Run kamailio_exporter as a sidecar. With host networking the exporter's
# :9494 port is already reachable, so no -p mapping is needed (and any
# -p flag would be ignored).
docker run -d \
  --name kamailio-exporter \
  --network host \
  -e KAMAILIO_HOST=127.0.0.1 \
  -e KAMAILIO_PORT=5060 \
  hunterlong/kamailio-exporter

Or configure Kamailio to expose a JSON stats endpoint:

loadmodule "xhttp.so"
loadmodule "statistics.so"

event_route[xhttp:request] {
    if ($hu =~ "^/metrics") {
        xhttp_reply("200", "OK", "text/plain", $stat(all));
        exit;
    }
}
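The statistics module emits lines like `core:rcv_requests = 1024`, while Prometheus expects `name value` exposition lines. A minimal sketch of that translation in Python — the exporter does this internally; the sample input and naming scheme here are illustrative:

```python
# Sketch: translate Kamailio "group:name = value" statistics output into
# Prometheus exposition format. Illustrative only; a real exporter also
# emits HELP/TYPE metadata and attaches labels.
def kamailio_stats_to_prometheus(stats_text: str) -> str:
    lines = []
    for raw in stats_text.splitlines():
        if "=" not in raw:
            continue
        key, _, value = raw.partition("=")
        # "core:rcv_requests" -> "kamailio_core_rcv_requests"
        metric = "kamailio_" + key.strip().replace(":", "_").replace(".", "_")
        lines.append(f"{metric} {value.strip()}")
    return "\n".join(lines)

sample = """core:rcv_requests = 1024
core:rcv_replies = 988
dialog:active_dialogs = 37"""
print(kamailio_stats_to_prometheus(sample))
```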

Key Kamailio metrics to track:

# prometheus/recording_rules.yml
groups:
  - name: kamailio_derived
    rules:
      - record: kamailio:sip_4xx_rate
        # status codes appear on reply counters, not on request counters
        expr: rate(kamailio_core_rcv_replies_total{method="INVITE",status=~"4.."}[5m])

      - record: kamailio:register_failure_rate
        expr: |
          rate(kamailio_core_rcv_replies_total{method="REGISTER",status="401"}[5m])
          / rate(kamailio_core_rcv_requests_total{method="REGISTER"}[5m])

      - record: kamailio:active_dialogs
        expr: kamailio_dialog_active

      - record: kamailio:invite_per_second
        expr: rate(kamailio_core_rcv_requests_total{method="INVITE"}[1m])
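The recorded series can be combined directly in dashboard queries. For example, the share of INVITE traffic ending in a 4xx — note the two rules use different windows (5m vs 1m), so either align the ranges or treat the ratio as an approximation:

```
# Approximate 4xx fraction of INVITE traffic, from the recorded series
kamailio:sip_4xx_rate / kamailio:invite_per_second
```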

Asterisk Metrics

Asterisk does not natively expose Prometheus metrics. Use one of two approaches:

Option 1: asterisk_exporter (AMI-based)

# /etc/asterisk_exporter/config.yml
ami:
  host: 127.0.0.1
  port: 5038
  username: prometheus
  password: secret

metrics:
  - active_channels
  - active_calls
  - active_agents
  - queue_waiting
  - queue_completed

# /etc/asterisk/manager.conf
[prometheus]
secret=secret
permit=127.0.0.1/255.255.255.255
read=system,call,agent,user,config,dtmf,reporting,cdr,dialplan
write=

Option 2: Business metrics from CDR/CEL via a custom exporter

Write CDR/CEL events to a PostgreSQL table and expose aggregates through a small custom exporter. This approach gives you business metrics (ASR, ACD, call volume by trunk) that the AMI exporter cannot provide:

# asterisk_business_exporter.py
from prometheus_client import Gauge, start_http_server
import psycopg2
import time

asr_gauge = Gauge('asterisk_asr_ratio', 'Answer-Seizure Ratio', ['trunk'])
acd_gauge = Gauge('asterisk_acd_seconds', 'Average Call Duration', ['trunk'])
def collect_metrics():
    # Short-lived connection per cycle, always closed so repeated scrapes
    # don't leak database connections.
    conn = psycopg2.connect("host=localhost dbname=asterisk_cdr user=monitor")
    try:
        with conn.cursor() as cur:
            # ASR and ACD per trunk over the last 5 minutes
            cur.execute("""
                SELECT
                    accountcode AS trunk,
                    ROUND(AVG(CASE WHEN disposition='ANSWERED' THEN 1.0 ELSE 0.0 END), 3) AS asr,
                    AVG(CASE WHEN disposition='ANSWERED' THEN billsec END) AS acd
                FROM cdr
                WHERE calldate > NOW() - INTERVAL '5 minutes'
                  AND accountcode IS NOT NULL
                GROUP BY accountcode
            """)
            for trunk, asr, acd in cur.fetchall():
                asr_gauge.labels(trunk=trunk).set(asr or 0)
                acd_gauge.labels(trunk=trunk).set(acd or 0)
    finally:
        conn.close()

if __name__ == '__main__':
    start_http_server(9200)
    while True:
        collect_metrics()
        time.sleep(30)
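The ASR/ACD logic is easy to get wrong around zero-call windows, and it can be unit-tested without a database. A sketch mirroring the SQL aggregation above, assuming a hypothetical row shape of (disposition, billsec):

```python
# Sketch: compute ASR (answered/total) and ACD (mean billsec of answered
# calls) from CDR-like rows. Row shape (disposition, billsec) is assumed
# for illustration; the SQL in the exporter does the same aggregation.
def asr_acd(rows):
    total = len(rows)
    answered = [billsec for disp, billsec in rows if disp == "ANSWERED"]
    asr = len(answered) / total if total else 0.0  # 0.0 for empty windows
    acd = sum(answered) / len(answered) if answered else 0.0
    return round(asr, 3), round(acd, 1)

rows = [("ANSWERED", 120), ("ANSWERED", 60), ("NO ANSWER", 0), ("BUSY", 0)]
print(asr_acd(rows))  # → (0.5, 90.0)
```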

rtpengine Metrics

Recent rtpengine releases expose Prometheus metrics on the built-in HTTP listener (the transcoding build flag is unrelated to this):

# /etc/rtpengine/rtpengine.conf
[rtpengine]
listen-http = 127.0.0.1:9900
# metrics are then served at http://127.0.0.1:9900/metrics

Key media quality metrics from rtpengine:

Metric                        | Alert threshold | Description
rtpengine_packet_loss_ratio   | > 0.03          | Packet loss above 3%
rtpengine_jitter_ms           | > 50            | Jitter above 50 ms
rtpengine_mos_score           | < 3.5           | MOS below acceptable range
rtpengine_active_sessions     | > 80% capacity  | Approaching session limit
rtpengine_transcoded_sessions | Rate spike      | Unexpected transcoding

MOS (Mean Opinion Score) ranges from 1 (unusable) to 5 (excellent). A score above 4.0 is toll-quality; 3.5–4.0 is acceptable; below 3.5 users notice degradation. Set your alert at 3.5.
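MOS is usually derived from the E-model R-factor (ITU-T G.107); the standard conversion is handy for sanity-checking exporter values:

```python
# E-model R-factor to MOS conversion (ITU-T G.107). R ~ 93.2 is the best
# case for G.711, which maps to the familiar narrowband MOS ceiling of ~4.4.
def r_to_mos(r: float) -> float:
    if r <= 0:
        return 1.0    # unusable
    if r >= 100:
        return 4.5    # scale maximum
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

print(round(r_to_mos(93.2), 2))  # → 4.41
```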

Prometheus Alerting Rules

# prometheus/alerts/voip.yml
groups:
  - name: voip_sip
    rules:
      - alert: HighSIP5xxRate
        expr: |
          rate(kamailio_core_rcv_replies_total{status=~"5.."}[5m])
          / rate(kamailio_core_rcv_replies_total[5m]) > 0.05
        for: 3m
        labels:
          severity: critical
          team: voip
        annotations:
          summary: "SIP 5xx rate {{ $value | humanizePercentage }} on {{ $labels.instance }}"
          runbook: "https://wiki.example.com/runbooks/sip-5xx"

      - alert: KamailioDialogsHigh
        expr: kamailio:active_dialogs > 8000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active dialogs approaching capacity: {{ $value }}"

      - alert: RegistrationFailureSpike
        expr: kamailio:register_failure_rate > 0.2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "20%+ of SIP registrations failing — possible auth issue or attack"

  - name: voip_media
    rules:
      - alert: MediaQualityDegraded
        expr: rtpengine_mos_score < 3.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MOS score {{ $value }} below 3.5 on {{ $labels.instance }}"

      - alert: MediaPacketLossHigh
        expr: rtpengine_packet_loss_ratio > 0.03
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "RTP packet loss {{ $value | humanizePercentage }} — calls impacted"

      - alert: rtpengineCapacityHigh
        expr: rtpengine_active_sessions / rtpengine_max_sessions > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "rtpengine at {{ $value | humanizePercentage }} capacity"

Grafana Dashboard Layout

Structure your Grafana dashboard in four rows:

Row 1: SIP Signaling Health

  • INVITE rate (calls/sec) — line graph, 1h window
  • SIP 4xx/5xx rate — stat panel with threshold coloring
  • Active dialogs — gauge panel
  • Registration success rate — stat panel

Row 2: Media Quality

  • MOS score distribution by trunk — heatmap
  • Packet loss % by carrier — time series
  • Jitter ms — time series with threshold line at 50ms
  • Active RTP sessions — gauge

Row 3: Infrastructure

  • CPU per VoIP node — multi-series line
  • Network I/O (bytes/sec) — time series
  • Memory usage — time series

Row 4: Business Metrics

  • ASR by trunk — bar gauge
  • ACD (average call duration) — stat panel
  • Total calls in last 24h — stat panel
  • Calls by outcome (Answered/No Answer/Busy) — pie chart
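So the dashboard survives rebuilds, provision it from disk rather than clicking it together each time. A minimal Grafana provisioning file (folder name and path are illustrative):

```yaml
# /etc/grafana/provisioning/dashboards/voip.yml
apiVersion: 1
providers:
  - name: voip
    folder: VoIP              # Grafana folder the dashboards appear in
    type: file
    options:
      path: /var/lib/grafana/dashboards/voip   # JSON dashboard files live here
```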

Prometheus Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'kamailio'
    static_configs:
      - targets: ['kamailio-1:9494', 'kamailio-2:9494']
    scrape_interval: 10s

  - job_name: 'asterisk'
    static_configs:
      - targets: ['asterisk-1:9200', 'asterisk-2:9200']
    scrape_interval: 30s

  - job_name: 'rtpengine'
    static_configs:
      - targets: ['rtpengine-1:9900', 'rtpengine-2:9900']
    scrape_interval: 10s

  - job_name: 'coturn'
    static_configs:
      - targets: ['turn-1:9641']
    scrape_interval: 30s

  - job_name: 'node'
    static_configs:
      - targets: ['kamailio-1:9100', 'asterisk-1:9100', 'rtpengine-1:9100']
    scrape_interval: 15s

Storage Sizing for VoIP Metrics

VoIP monitoring generates high-cardinality metrics — per-call, per-trunk, per-carrier labels multiply metric series. Calculate your Prometheus storage requirements:

  • Samples per scrape: ~500 (typical VoIP stack)
  • Scrape interval: 10s → 6 scrapes/minute
  • Samples/minute: 3,000
  • Samples/day: 4,320,000
  • Bytes per sample (Prometheus TSDB, compressed): ~1.5
  • Storage/day: ~6.5 MB
  • 90-day retention: ~580 MB

This fits comfortably on any VPS. For longer retention or higher cardinality (1,000+ trunks), use Thanos or Mimir to offload to object storage and query across retention windows.
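The back-of-the-envelope arithmetic can be parameterized into a quick estimator for your own scrape counts and retention:

```python
# Sketch: Prometheus TSDB storage estimate from samples per scrape,
# scrape interval, compressed bytes per sample, and retention days.
def tsdb_storage_mb(samples_per_scrape: int, scrape_interval_s: int,
                    bytes_per_sample: float, days: int) -> float:
    samples_per_day = samples_per_scrape * (86400 / scrape_interval_s)
    return samples_per_day * bytes_per_sample * days / 1e6

# The figures from this post: 500 samples, 10s interval, 1.5 B/sample, 90 days
print(round(tsdb_storage_mb(500, 10, 1.5, 90)))  # → 583 (MB)
```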

Tags: prometheus, grafana, voip-monitoring, observability, kamailio, asterisk, rtp