Monitoring
Complete guide for monitoring NCN Network services.
Overview
Effective monitoring includes:
Metrics: Numerical data (latency, throughput, errors)
Logs: Event records and debugging info
Alerts: Notifications for issues
Dashboards: Visual status overview
Prometheus Setup
Install Prometheus
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
sudo mv prometheus-*/prometheus /usr/local/bin/
sudo mv prometheus-*/promtool /usr/local/bin/Configure Prometheus
Create /etc/prometheus/prometheus.yml:
Start Prometheus
Key Metrics
Gateway Metrics
ncn_requests_total
Counter
Total requests
ncn_requests_failed_total
Counter
Failed requests
ncn_request_duration_seconds
Histogram
Request latency
ncn_active_connections
Gauge
Active connections
ncn_compute_nodes_registered
Gauge
Registered compute nodes
Registry Metrics
ncn_validators_active
Gauge
Active validators
ncn_validations_total
Counter
Total validations
ncn_consensus_reached_total
Counter
Successful consensus
ncn_mempool_size
Gauge
Pending requests
ncn_p2p_peers
Gauge
Connected peers
Compute Metrics
ncn_tasks_completed_total
Counter
Completed tasks
ncn_task_duration_seconds
Histogram
Task execution time
ncn_gpu_utilization
Gauge
GPU usage
ncn_memory_used_bytes
Gauge
Memory usage
Grafana Dashboards
Install Grafana
Access Grafana
URL: http://localhost:3000
Default credentials: admin/admin
NCN Dashboard JSON
Create dashboard with panels:
Alerting
Alert Rules
Create /etc/prometheus/rules/ncn-alerts.yml:
Alertmanager Configuration
Create /etc/alertmanager/alertmanager.yml:
Log Aggregation
Using Loki + Promtail
Log Queries (LogQL)
Health Checks
HTTP Health Check
Custom Health Script
Create /opt/ncn/health-check.sh:
Systemd Health Integration
Uptime Monitoring
External Monitoring
Use services like:
Uptime Robot
Pingdom
Better Uptime
Configure to monitor:
https://api.ncn-network.io/healthExpected response code: 200
Check interval: 1 minute
Resource Monitoring
Node Exporter
GPU Monitoring (NVIDIA)
Dashboard Examples
Key Dashboard Panels
Overview
Request rate
Error rate
Active nodes
Performance
Latency percentiles
Throughput
Queue depth
Resources
CPU usage
Memory usage
Disk I/O
Network I/O
Validators
Active validators
Consensus success rate
Reputation distribution
Best Practices
Retention
Metrics: 15-30 days
Logs: 90 days
Alerts history: 1 year
Alerting
Alert on symptoms, not causes
Have actionable runbooks
Avoid alert fatigue
Dashboards
Start with high-level overview
Drill-down capability
Include links to runbooks
On-Call
Define escalation paths
Regular rotation
Post-incident reviews
Next Steps
Production Deployment - Production setup
Troubleshooting - Common issues
Last updated
