Monitoring

Complete guide for monitoring NCN Network services.

Overview

Effective monitoring includes:

Metrics: Numerical data (latency, throughput, errors)
Logs: Event records and debugging info
Alerts: Notifications for issues
Dashboards: Visual status overview

Prometheus Setup

Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
sudo mv prometheus-*/prometheus /usr/local/bin/
sudo mv prometheus-*/promtool /usr/local/bin/

Configure Prometheus

Create /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'ncn-gateway'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics

  - job_name: 'ncn-registry'
    static_configs:
      - targets: ['localhost:50050']
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

Start Prometheus

prometheus --config.file=/etc/prometheus/prometheus.yml
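
Once Prometheus is running, it helps to confirm that every scrape job in the configuration above is actually reachable. The following is a minimal sketch, assuming Prometheus listens on its default port 9090 and using its `/api/v1/targets` HTTP endpoint. The sample response is hand-written and trimmed to the fields the code reads; the function names are illustrative, not part of NCN.

```python
import json
import urllib.request


def unhealthy_targets(targets_response):
    """Return (job, instance, last_error) for each active target that is not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    problems = []
    for target in active:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                (labels.get("job", "?"), labels.get("instance", "?"), target.get("lastError", ""))
            )
    return problems


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch target status from a running Prometheus (assumed default port 9090)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hand-written sample of the response shape, so the parsing can be tried offline.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "ncn-gateway", "instance": "localhost:8080"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "ncn-registry", "instance": "localhost:50050"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for job, instance, error in unhealthy_targets(SAMPLE_RESPONSE):
    print(f"{job} ({instance}) is down: {error}")
```

Against a live server, `unhealthy_targets(fetch_targets())` returns an empty list when all three jobs are being scraped successfully.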

Key Metrics

Gateway Metrics

| Metric | Type | Description |
| --- | --- | --- |
| ncn_requests_total | Counter | Total requests |
| ncn_requests_failed_total | Counter | Failed requests |
| ncn_request_duration_seconds | Histogram | Request latency |
| ncn_active_connections | Gauge | Active connections |
| ncn_compute_nodes_registered | Gauge | Registered compute nodes |

Registry Metrics

| Metric | Type | Description |
| --- | --- | --- |
| ncn_validators_active | Gauge | Active validators |
| ncn_validations_total | Counter | Total validations |
| ncn_consensus_reached_total | Counter | Successful consensus |
| ncn_mempool_size | Gauge | Pending requests |
| ncn_p2p_peers | Gauge | Connected peers |

Compute Metrics

| Metric | Type | Description |
| --- | --- | --- |
| ncn_tasks_completed_total | Counter | Completed tasks |
| ncn_task_duration_seconds | Histogram | Task execution time |
| ncn_gpu_utilization | Gauge | GPU usage |
| ncn_memory_used_bytes | Gauge | Memory usage |
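
To check that a service really exposes the metrics listed above with the documented types, the `# TYPE` lines of its `/metrics` output can be compared against the tables. The sketch below assumes the classic Prometheus text exposition format; the sample text is hand-written for illustration and is not captured from a real NCN gateway.

```python
# Expected gateway metrics and types, copied from the Gateway Metrics table.
EXPECTED_GATEWAY = {
    "ncn_requests_total": "counter",
    "ncn_requests_failed_total": "counter",
    "ncn_request_duration_seconds": "histogram",
    "ncn_active_connections": "gauge",
    "ncn_compute_nodes_registered": "gauge",
}


def declared_types(metrics_text):
    """Map metric name -> type from '# TYPE <name> <type>' lines."""
    types = {}
    for line in metrics_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[0] == "#" and parts[1] == "TYPE":
            types[parts[2]] = parts[3]
    return types


def type_mismatches(metrics_text, expected):
    """Return {name: (expected_type, found_type_or_None)} for every deviation."""
    found = declared_types(metrics_text)
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


# Hand-written sample scrape body (illustrative only).
SAMPLE_METRICS = """\
# HELP ncn_requests_total Total requests
# TYPE ncn_requests_total counter
ncn_requests_total 1027
# TYPE ncn_requests_failed_total counter
ncn_requests_failed_total 3
# TYPE ncn_active_connections gauge
ncn_active_connections 12
"""

for name, (want, got) in type_mismatches(SAMPLE_METRICS, EXPECTED_GATEWAY).items():
    print(f"{name}: expected {want}, found {got}")
```

With the sample body, the two metrics absent from the scrape (the latency histogram and the compute node gauge) are reported as missing.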

Grafana Dashboards

Install Grafana

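# Install prerequisites
sudo apt-get install -y apt-transport-https software-properties-common wget

# Import the Grafana signing key into a dedicated keyring (apt-key is deprecated)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

# Add the repository
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana

# Start Grafana now and on every boot
sudo systemctl start grafana-server
sudo systemctl enable grafana-server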

Access Grafana

URL: http://localhost:3000
Default credentials: admin/admin (Grafana prompts for a new password on first login)

NCN Dashboard JSON

Create a dashboard with the following panels:

{
  "title": "NCN Network Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(ncn_requests_total[5m])",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(ncn_requests_failed_total[5m]) / rate(ncn_requests_total[5m]) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, rate(ncn_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Active Validators",
      "type": "stat",
      "targets": [
        {
          "expr": "ncn_validators_active"
        }
      ]
    },
    {
      "title": "Compute Nodes",
      "type": "stat",
      "targets": [
        {
          "expr": "ncn_compute_nodes_registered"
        }
      ]
    }
  ]
}
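
Instead of pasting the JSON into the UI, the dashboard can be pushed through Grafana's HTTP API (`POST /api/dashboards/db`). The following sketch only builds the request; the API token, the helper names, and the shortened dashboard are assumptions for illustration, so adapt them to your setup before sending anything.

```python
import json
import urllib.request


def build_dashboard_request(dashboard, grafana_url, api_token):
    """Build (but do not send) a request that creates or overwrites a dashboard."""
    payload = {
        # id=None tells Grafana to create the dashboard if it does not exist yet.
        "dashboard": {**dashboard, "id": None},
        "overwrite": True,
    }
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


# Shortened version of the dashboard above, with a single panel.
DASHBOARD = {
    "title": "NCN Network Overview",
    "panels": [
        {
            "title": "Request Rate",
            "type": "graph",
            "targets": [{"expr": "rate(ncn_requests_total[5m])", "legendFormat": "{{instance}}"}],
        }
    ],
}

# "REPLACE_ME" is a placeholder; a real service account token is required.
request = build_dashboard_request(DASHBOARD, "http://localhost:3000", "REPLACE_ME")
print(request.get_method(), request.full_url)

# To actually send it:
# with urllib.request.urlopen(request, timeout=10) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```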

Alerting

Alert Rules

Create /etc/prometheus/rules/ncn-alerts.yml:

groups:
- name: ncn-critical
  rules:
  - alert: HighErrorRate
    expr: rate(ncn_requests_failed_total[5m]) / rate(ncn_requests_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on NCN Gateway"
      description: "Error rate is {{ $value | humanizePercentage }}"

  - alert: GatewayDown
    expr: up{job="ncn-gateway"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "NCN Gateway is down"

  - alert: NoComputeNodes
    expr: ncn_compute_nodes_registered == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "No compute nodes registered"

- name: ncn-warning
  rules:
  - alert: HighLatency
    expr: histogram_quantile(0.99, rate(ncn_request_duration_seconds_bucket[5m])) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency on NCN Gateway"
      description: "P99 latency is {{ $value }}s"

  - alert: LowValidators
    expr: ncn_validators_active < 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low number of active validators"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage"
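
The HighLatency rule depends on how `histogram_quantile` estimates a percentile from bucket counts, which affects how the 5 second threshold should be read. The sketch below is a simplified re-implementation of the idea (linear interpolation inside the bucket that contains the target rank), not Prometheus's own code, and the bucket boundaries and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls beyond the last finite bucket (or in an empty one):
                # the best available answer is the previous upper bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative bucket counts for ncn_request_duration_seconds (seconds).
BUCKETS = [(0.5, 900), (1.0, 950), (5.0, 990), (10.0, 998), (math.inf, 1000)]

p99 = histogram_quantile(0.99, BUCKETS)
print(f"p99 is about {p99:.2f}s; HighLatency fires only if it stays above 5s for 5m")
```

Because the estimate is interpolated, its precision is limited by the bucket layout: a threshold that sits exactly on a bucket boundary, as 5 seconds does in this invented layout, is resolved more reliably than one in the middle of a wide bucket.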

Alertmanager Configuration

Create /etc/alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical'

receivers:
  - name: 'default'
    email_configs:
      - to: 'alerts@example.com'              # placeholder: set your recipient address
        from: 'alertmanager@example.com'      # placeholder: set your sender address
        smarthost: 'smtp.example.com:587'

  - name: 'critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-critical'
    pagerduty_configs:
      - service_key: 'xxx'
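
A quick way to verify the routing tree is to post a synthetic alert to Alertmanager and see which receiver is notified. The sketch below builds a payload for Alertmanager's `POST /api/v2/alerts` endpoint; the alert name `RoutingTest` and the helper names are made up for this example, and port 9093 is taken from the Prometheus configuration earlier on this page.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(severity, minutes=5, now=None):
    """Build a one-element alert list that expires after `minutes`."""
    now = now or datetime.now(timezone.utc)
    return [
        {
            "labels": {
                "alertname": "RoutingTest",  # hypothetical name, not a real rule
                "severity": severity,
                "job": "ncn-gateway",
            },
            "annotations": {"summary": "Synthetic alert to verify Alertmanager routing"},
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def post_alerts(alerts, base_url="http://localhost:9093"):
    """Send alerts to a running Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status


# severity=critical should match the 'critical' route; any other value falls
# through to the 'default' receiver.
print(json.dumps(build_test_alert("critical"), indent=2))
```

Calling `post_alerts(build_test_alert("warning"))` against a live Alertmanager should produce an email from the default receiver after `group_wait` has elapsed.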

Log Aggregation

Using Loki + Promtail

# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: ncn-gateway
    static_configs:
      - targets:
          - localhost
        labels:
          job: ncn-gateway
          __path__: /var/log/ncn/gateway.log

  - job_name: ncn-registry
    static_configs:
      - targets:
          - localhost
        labels:
          job: ncn-registry
          __path__: /var/log/ncn/registry.log

  - job_name: ncn-compute
    static_configs:
      - targets:
          - localhost
        labels:
          job: ncn-compute
          __path__: /var/log/ncn/compute.log

Log Queries (LogQL)

# Error logs
{job="ncn-gateway"} |= "error"

# Slow requests
{job="ncn-gateway"} | json | duration > 5s

# Failed validations
{job="ncn-registry"} |= "validation failed"
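
The same queries can be run outside Grafana through Loki's `GET /loki/api/v1/query_range` endpoint, which is handy for scripts and reports. The sketch below only constructs the URL; the base URL `http://loki:3100` comes from the Promtail client setting above, while the helper name and the one-hour window are assumptions.

```python
import time
from urllib.parse import urlencode


def query_range_url(base_url, logql, start_s, end_s, limit=100):
    """Build a Loki query_range URL; Loki expects nanosecond Unix timestamps."""
    params = {
        "query": logql,
        "start": int(start_s) * 1_000_000_000,
        "end": int(end_s) * 1_000_000_000,
        "limit": limit,
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


# Gateway error lines from the last hour.
now = int(time.time())
print(query_range_url("http://loki:3100", '{job="ncn-gateway"} |= "error"', now - 3600, now))
```

The resulting URL can be fetched with `urllib.request.urlopen` or `curl`; matching log lines are returned as JSON under `data.result`.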

Health Checks

HTTP Health Check

# Gateway health
curl -f http://localhost:8080/health || echo "Gateway unhealthy"

# Expected response
{
  "status": "healthy",
  "version": "0.1.0",
  "uptime_seconds": 3600
}

Custom Health Script

Create /opt/ncn/health-check.sh:

#!/bin/bash

# Check gateway
if ! curl -sf http://localhost:8080/health > /dev/null; then
  echo "Gateway unhealthy"
  exit 1
fi

# Check registry
if ! nc -z localhost 50050; then
  echo "Registry unhealthy"
  exit 1
fi

# Check compute (optional: warn, but do not fail, if it is not running)
if ! systemctl is-active --quiet ncn-compute; then
  echo "Warning: compute node not running"
fi

echo "All required services healthy"
exit 0

Systemd Health Integration

[Service]
ExecStartPost=/opt/ncn/health-check.sh
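
A check that runs immediately after start can fail simply because the service has not finished initializing. A readiness poll with a few retries avoids that. The following is a minimal sketch, assuming the `/health` response shape shown under HTTP Health Check; the function names, retry count, and delay are illustrative choices.

```python
import json
import time
import urllib.request


def is_healthy(body):
    """True if the body is JSON of the form {"status": "healthy", ...}."""
    try:
        return json.loads(body).get("status") == "healthy"
    except (ValueError, TypeError, AttributeError):
        return False


def wait_until_healthy(fetch, attempts=10, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a healthy body or attempts run out."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except OSError:
            # Connection refused, timeouts and HTTP errors all mean "not ready yet".
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False


def fetch_gateway_health(url="http://localhost:8080/health"):
    """Fetch the gateway health body as text."""
    with urllib.request.urlopen(url, timeout=3) as resp:
        return resp.read().decode("utf-8")


# Offline demonstration with the expected response from this page.
print(is_healthy('{"status": "healthy", "version": "0.1.0", "uptime_seconds": 3600}'))
```

A wrapper script could call `wait_until_healthy(fetch_gateway_health)` and exit non-zero when it returns `False`, which lets systemd treat a gateway that never becomes ready as a failed start.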

Uptime Monitoring

External Monitoring

Use services like:

Uptime Robot
Pingdom
Better Uptime

Configure to monitor:

https://api.ncn-network.io/health
Expected response code: 200
Check interval: 1 minute

Resource Monitoring

Node Exporter

# Install node exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/

# Run as a service (requires a node_exporter systemd unit; the tarball does not install one)
sudo systemctl start node_exporter

GPU Monitoring (NVIDIA)

# Run the DCGM exporter in Docker to expose GPU metrics on port 9400
docker run -d --gpus all \
  -p 9400:9400 \
  nvidia/dcgm-exporter

Dashboard Examples

Key Dashboard Panels

Overview:
- Request rate
- Error rate
- Active nodes

Performance:
- Latency percentiles
- Throughput
- Queue depth

Resources:
- CPU usage
- Memory usage
- Disk I/O
- Network I/O

Validators:
- Active validators
- Consensus success rate
- Reputation distribution

Best Practices

Retention:
- Metrics: 15-30 days
- Logs: 90 days
- Alerts history: 1 year

Alerting:
- Alert on symptoms, not causes
- Have actionable runbooks
- Avoid alert fatigue

Dashboards:
- Start with high-level overview
- Drill-down capability
- Include links to runbooks

On-Call:
- Define escalation paths
- Regular rotation
- Post-incident reviews
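
The metrics retention window chosen above largely determines how much disk Prometheus needs. Its documentation gives a rough rule: disk space is about retention time in seconds times ingested samples per second times bytes per sample, with 1 to 2 bytes per sample on average. The sketch below applies that rule; the 50,000 active series figure is an invented example, and the 15 second interval matches `scrape_interval` in the configuration on this page.

```python
def prometheus_disk_bytes(retention_days, active_series, scrape_interval_s, bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Hypothetical deployment: 50,000 active series scraped every 15 seconds.
for days in (15, 30):
    gigabytes = prometheus_disk_bytes(days, 50_000, 15) / 1e9
    print(f"{days} days of retention needs roughly {gigabytes:.1f} GB")
```

The retention window itself is set with the `--storage.tsdb.retention.time` flag when starting Prometheus, for example `--storage.tsdb.retention.time=30d`.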

Next Steps

Production Deployment - Production setup
Troubleshooting - Common issues

Last updated 3 months ago
