Monitoring

Complete guide for monitoring NCN Network services.


Overview

Effective monitoring includes:

  • Metrics: Numerical data (latency, throughput, errors)

  • Logs: Event records and debugging info

  • Alerts: Notifications for issues

  • Dashboards: Visual status overview


Prometheus Setup

Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
sudo mv prometheus-*/prometheus /usr/local/bin/
sudo mv prometheus-*/promtool /usr/local/bin/

Configure Prometheus

Create /etc/prometheus/prometheus.yml:

Start Prometheus


Key Metrics

Gateway Metrics

Metric
Type
Description

ncn_requests_total

Counter

Total requests

ncn_requests_failed_total

Counter

Failed requests

ncn_request_duration_seconds

Histogram

Request latency

ncn_active_connections

Gauge

Active connections

ncn_compute_nodes_registered

Gauge

Registered compute nodes

Registry Metrics

Metric
Type
Description

ncn_validators_active

Gauge

Active validators

ncn_validations_total

Counter

Total validations

ncn_consensus_reached_total

Counter

Successful consensus

ncn_mempool_size

Gauge

Pending requests

ncn_p2p_peers

Gauge

Connected peers

Compute Metrics

Metric
Type
Description

ncn_tasks_completed_total

Counter

Completed tasks

ncn_task_duration_seconds

Histogram

Task execution time

ncn_gpu_utilization

Gauge

GPU usage

ncn_memory_used_bytes

Gauge

Memory usage


Grafana Dashboards

Install Grafana

Access Grafana

  • URL: http://localhost:3000

  • Default credentials: admin/admin

NCN Dashboard JSON

Create dashboard with panels:


Alerting

Alert Rules

Create /etc/prometheus/rules/ncn-alerts.yml:

Alertmanager Configuration

Create /etc/alertmanager/alertmanager.yml:


Log Aggregation

Using Loki + Promtail

Log Queries (LogQL)


Health Checks

HTTP Health Check

Custom Health Script

Create /opt/ncn/health-check.sh:

Systemd Health Integration


Uptime Monitoring

External Monitoring

Use services like:

  • Uptime Robot

  • Pingdom

  • Better Uptime

Configure to monitor:

  • https://api.ncn-network.io/health

  • Expected response code: 200

  • Check interval: 1 minute


Resource Monitoring

Node Exporter

GPU Monitoring (NVIDIA)


Dashboard Examples

Key Dashboard Panels

  1. Overview

    • Request rate

    • Error rate

    • Active nodes

  2. Performance

    • Latency percentiles

    • Throughput

    • Queue depth

  3. Resources

    • CPU usage

    • Memory usage

    • Disk I/O

    • Network I/O

  4. Validators

    • Active validators

    • Consensus success rate

    • Reputation distribution


Best Practices

  1. Retention

    • Metrics: 15-30 days

    • Logs: 90 days

    • Alerts history: 1 year

  2. Alerting

    • Alert on symptoms, not causes

    • Have actionable runbooks

    • Avoid alert fatigue

  3. Dashboards

    • Start with high-level overview

    • Drill-down capability

    • Include links to runbooks

  4. On-Call

    • Define escalation paths

    • Regular rotation

    • Post-incident reviews


Next Steps

Last updated