# Monitoring

A guide to monitoring NCN Network services with Prometheus, Grafana, Alertmanager, and Loki.

***

## Overview

Effective monitoring includes:

* **Metrics**: Numerical data (latency, throughput, errors)
* **Logs**: Event records and debugging info
* **Alerts**: Notifications for issues
* **Dashboards**: Visual status overview

***

## Prometheus Setup

### Install Prometheus

```bash
# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
sudo mv prometheus-*/prometheus /usr/local/bin/
sudo mv prometheus-*/promtool /usr/local/bin/
```

### Configure Prometheus

Create `/etc/prometheus/prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'ncn-gateway'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics

  - job_name: 'ncn-registry'
    static_configs:
      - targets: ['localhost:50050']
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - /etc/prometheus/rules/*.yml
```

### Start Prometheus

```bash
prometheus --config.file=/etc/prometheus/prometheus.yml
```
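
Before starting or restarting Prometheus, it can help to validate the configuration and then confirm the server is ready. The sketch below assumes `promtool` is on `PATH` (it is installed in the step above) and that Prometheus listens on its default address, `localhost:9090`. The function names are illustrative and not part of Prometheus or NCN.

```bash
# Illustrative helpers; the function names are hypothetical.

# Validate the configuration file and the rule files it references.
check_prometheus_config() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  promtool check config "$config"
}

# Return 0 once Prometheus answers on its readiness endpoint.
# Assumes the default listen address unless a base URL is passed in.
prometheus_ready() {
  local base_url="${1:-http://localhost:9090}"
  curl -sf "${base_url}/-/ready" > /dev/null
}
```

A typical use is `check_prometheus_config && prometheus --config.file=/etc/prometheus/prometheus.yml`, followed by `prometheus_ready` from another shell once the server is up.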

***

## Key Metrics

### Gateway Metrics

| Metric                         | Type      | Description              |
| ------------------------------ | --------- | ------------------------ |
| `ncn_requests_total`           | Counter   | Total requests           |
| `ncn_requests_failed_total`    | Counter   | Failed requests          |
| `ncn_request_duration_seconds` | Histogram | Request latency          |
| `ncn_active_connections`       | Gauge     | Active connections       |
| `ncn_compute_nodes_registered` | Gauge     | Registered compute nodes |

### Registry Metrics

| Metric                        | Type    | Description          |
| ----------------------------- | ------- | -------------------- |
| `ncn_validators_active`       | Gauge   | Active validators    |
| `ncn_validations_total`       | Counter | Total validations    |
| `ncn_consensus_reached_total` | Counter | Successful consensus |
| `ncn_mempool_size`            | Gauge   | Pending requests     |
| `ncn_p2p_peers`               | Gauge   | Connected peers      |

### Compute Metrics

| Metric                      | Type      | Description         |
| --------------------------- | --------- | ------------------- |
| `ncn_tasks_completed_total` | Counter   | Completed tasks     |
| `ncn_task_duration_seconds` | Histogram | Task execution time |
| `ncn_gpu_utilization`       | Gauge     | GPU usage           |
| `ncn_memory_used_bytes`     | Gauge     | Memory usage        |
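
To spot-check any of the metrics in these tables without opening a dashboard, one option is to call the Prometheus HTTP API directly. This is a minimal sketch: `prom_query` is a hypothetical helper name, and `PROM_URL` assumes Prometheus runs on this host on its default port.

```bash
# Base URL of the Prometheus server (assumption: default port on this host).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Evaluate a PromQL expression as an instant query and print the raw JSON response.
prom_query() {
  local expr="$1"
  curl -fsS -G --data-urlencode "query=${expr}" "${PROM_URL}/api/v1/query"
}
```

For example, `prom_query 'ncn_validators_active'` returns the current gauge value, and `prom_query 'rate(ncn_requests_total[5m])'` returns the per-second request rate over the last five minutes.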

***

## Grafana Dashboards

### Install Grafana

```bash
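# Install prerequisites
sudo apt-get install -y apt-transport-https software-properties-common wget

# Import the Grafana signing key (apt-key is deprecated on current Debian/Ubuntu)
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

# Add the repository, then install
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana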

# Start Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
```

### Access Grafana

* URL: <http://localhost:3000>
* Default credentials: `admin`/`admin` (Grafana prompts for a new password on first login; set one before exposing the instance)

### NCN Dashboard JSON

Create a dashboard with the following panels:

```json
{
  "title": "NCN Network Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "rate(ncn_requests_total[5m])",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(ncn_requests_failed_total[5m])) / sum(rate(ncn_requests_total[5m])) * 100",
          "legendFormat": "Error %"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum by (le) (rate(ncn_request_duration_seconds_bucket[5m])))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Active Validators",
      "type": "stat",
      "targets": [
        {
          "expr": "ncn_validators_active"
        }
      ]
    },
    {
      "title": "Compute Nodes",
      "type": "stat",
      "targets": [
        {
          "expr": "ncn_compute_nodes_registered"
        }
      ]
    }
  ]
}
```
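
A dashboard definition like the one above can also be loaded through Grafana's HTTP API instead of the UI. The API expects the dashboard wrapped in an envelope, which the sketch below builds. `wrap_dashboard` and the file name `ncn-dashboard.json` are hypothetical, and the simplified JSON above may need extra fields (for example a data source and panel positions) before it renders as intended.

```bash
# Wrap a dashboard JSON document (read from stdin) in the envelope that
# Grafana's POST /api/dashboards/db endpoint expects.
wrap_dashboard() {
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat)"
}

# Usage (assumes Grafana on localhost:3000 and the default credentials):
#   wrap_dashboard < ncn-dashboard.json |
#     curl -fsS -X POST -H 'Content-Type: application/json' -u admin:admin \
#       --data-binary @- http://localhost:3000/api/dashboards/db
```

Setting `overwrite` to `true` replaces an existing dashboard with the same identity, which keeps repeated imports idempotent.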

***

## Alerting

### Alert Rules

Create `/etc/prometheus/rules/ncn-alerts.yml`:

```yaml
groups:
- name: ncn-critical
  rules:
  - alert: HighErrorRate
    expr: rate(ncn_requests_failed_total[5m]) / rate(ncn_requests_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate on NCN Gateway"
      description: "Error rate is {{ $value | humanizePercentage }}"

  - alert: GatewayDown
    expr: up{job="ncn-gateway"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "NCN Gateway is down"

  - alert: NoComputeNodes
    expr: ncn_compute_nodes_registered == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "No compute nodes registered"

- name: ncn-warning
  rules:
  - alert: HighLatency
    expr: histogram_quantile(0.99, rate(ncn_request_duration_seconds_bucket[5m])) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High latency on NCN Gateway"
      description: "P99 latency is {{ $value }}s"

  - alert: LowValidators
    expr: ncn_validators_active < 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low number of active validators"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage"
```
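
Rule files are worth validating before Prometheus loads them, because a syntax error blocks the reload. The sketch below checks the files with `promtool` and sends `SIGHUP` only when validation passes. `reload_rules` is a hypothetical helper, and signalling the process may require `sudo` depending on which user runs Prometheus.

```bash
# Validate the given rule files, then ask a running Prometheus to reload.
# Prometheus re-reads its configuration and rule files on SIGHUP.
reload_rules() {
  if promtool check rules "$@"; then
    pkill -HUP -x prometheus
  else
    echo "Rule validation failed; Prometheus was not reloaded" >&2
    return 1
  fi
}
```

For the file created above, the call would be `reload_rules /etc/prometheus/rules/*.yml`.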

### Alertmanager Configuration

Create `/etc/alertmanager/alertmanager.yml`:

```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical'

receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@ncn-network.io'
        smarthost: 'smtp.example.com:587'

  - name: 'critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx'
        channel: '#alerts-critical'
    pagerduty_configs:
      - service_key: 'xxx'
```
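
One way to confirm that routing behaves as configured is to post a synthetic alert to Alertmanager and watch which receiver fires. This is a sketch under assumptions: `NCNTestAlert` is a made-up alert name, the helper names are hypothetical, and Alertmanager is assumed to listen on `localhost:9093` as in `prometheus.yml`.

```bash
# Build a minimal Alertmanager v2 alert payload with the given severity.
# "NCNTestAlert" is a made-up name used only for this routing test.
build_test_alert() {
  local severity="${1:-warning}"
  printf '[{"labels":{"alertname":"NCNTestAlert","severity":"%s"},"annotations":{"summary":"Routing test"}}]\n' "$severity"
}

# Post the payload to Alertmanager's v2 API.
send_test_alert() {
  build_test_alert "$1" | curl -fsS -X POST -H 'Content-Type: application/json' \
    --data-binary @- "${ALERTMANAGER_URL:-http://localhost:9093}/api/v2/alerts"
}
```

With the configuration above, `send_test_alert critical` should reach the `critical` receiver and `send_test_alert warning` should fall through to `default`. Running `amtool check-config /etc/alertmanager/alertmanager.yml` first catches syntax errors in the file itself.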

***

## Log Aggregation

### Using Loki + Promtail

```yaml
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml  # use a persistent path in production so read offsets survive reboots

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: ncn-gateway
    static_configs:
      - targets:
          - localhost
        labels:
          job: ncn-gateway
          __path__: /var/log/ncn/gateway.log

  - job_name: ncn-registry
    static_configs:
      - targets:
          - localhost
        labels:
          job: ncn-registry
          __path__: /var/log/ncn/registry.log

  - job_name: ncn-compute
    static_configs:
      - targets:
          - localhost
        labels:
          job: ncn-compute
          __path__: /var/log/ncn/compute.log
```

### Log Queries (LogQL)

```logql
# Error logs
{job="ncn-gateway"} |= "error"

# Slow requests
{job="ncn-gateway"} | json | duration > 5s

# Failed validations
{job="ncn-registry"} |= "validation failed"
```
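
The same queries can be run from a shell against Loki's HTTP API, which is handy in scripts or when Grafana is unavailable. In this sketch `loki_query` is a hypothetical helper and `LOKI_URL` defaults to the address used in the Promtail configuration above. Without explicit `start` and `end` parameters, Loki searches the last hour.

```bash
# Run a LogQL query against Loki and print the raw JSON response.
# The second argument caps the number of returned log lines.
loki_query() {
  local query="$1" limit="${2:-100}"
  curl -fsS -G "${LOKI_URL:-http://loki:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "limit=${limit}"
}
```

For example, `loki_query '{job="ncn-gateway"} |= "error"' 20` fetches up to 20 recent gateway error lines.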

***

## Health Checks

### HTTP Health Check

```bash
# Gateway health
curl -f http://localhost:8080/health || echo "Gateway unhealthy"

# Expected response:
# {
#   "status": "healthy",
#   "version": "0.1.0",
#   "uptime_seconds": 3600
# }
```
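
`curl -f` only checks the HTTP status code. If a script should also act on the `status` field in the body, the field can be extracted without extra tooling. The helper below is a sketch: `health_status` is a hypothetical name, and the `sed` expression assumes the flat response shape shown above.

```bash
# Extract the value of the "status" field from a health response on stdin.
# Prints nothing when the field is absent.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage:
#   [ "$(curl -sf http://localhost:8080/health | health_status)" = "healthy" ]
```

For nested or more complex responses, a real JSON parser such as `jq` is the safer choice.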

### Custom Health Script

Create `/opt/ncn/health-check.sh`:

```bash
#!/bin/bash

status=0

# Check gateway
if ! curl -sf http://localhost:8080/health > /dev/null; then
  echo "Gateway unhealthy"
  status=1
fi

# Check registry (TCP port check only)
if ! nc -z localhost 50050; then
  echo "Registry unhealthy"
  status=1
fi

# Check compute (optional: warn, but do not fail, when it is not running)
if ! systemctl is-active --quiet ncn-compute; then
  echo "Warning: compute node not running"
fi

if [ "$status" -eq 0 ]; then
  echo "All required services healthy"
fi
exit "$status"
```

### Systemd Health Integration

```ini
[Service]
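# Runs after the main process starts; a non-zero exit marks the unit as failed.
# The script must be executable (chmod +x /opt/ncn/health-check.sh).
ExecStartPost=/opt/ncn/health-check.sh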
```
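
A service may need a few seconds after start before its health endpoint responds, so a single check in `ExecStartPost` can fail spuriously. One option is to wrap the check in a retry loop. The helper below is a sketch: `retry` is a hypothetical name, and the attempt count and delay are arbitrary values to tune.

```bash
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry 10 3 /opt/ncn/health-check.sh` allows roughly 30 seconds for the services to come up before reporting failure.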

***

## Uptime Monitoring

### External Monitoring

Use an external uptime service such as:

* Uptime Robot
* Pingdom
* Better Uptime

Configure the check as follows:

* URL: `https://api.ncn-network.io/health`
* Expected response code: 200
* Check interval: 1 minute
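
The same check can be reproduced from a shell, for example to test the endpoint before registering it with an external service. This is a sketch with hypothetical helper names. It complements external monitoring and does not replace it, since a probe running inside your own infrastructure cannot detect an outage of that infrastructure.

```bash
# Print the HTTP status code returned by a URL ("000" if the request fails).
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Succeed only when the endpoint answers with HTTP 200.
probe() {
  [ "$(http_status "$1")" = "200" ]
}
```

For example: `probe https://api.ncn-network.io/health || echo "Health endpoint check failed"`.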

***

## Resource Monitoring

### Node Exporter

```bash
# Install node exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
sudo mv node_exporter-*/node_exporter /usr/local/bin/

# Start the service (requires a systemd unit for node_exporter; the release tarball does not include one)
sudo systemctl start node_exporter
```
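
The `systemctl start` step relies on a unit file that the manual install does not create. A minimal unit could be generated as sketched below. The `node_exporter` service user and the helper name `render_node_exporter_unit` are assumptions, so adjust `User=` to match your hosts.

```bash
# Print a minimal systemd unit for node_exporter to stdout.
# The node_exporter user is an assumption; create it first or change User=.
render_node_exporter_unit() {
  cat <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
Wants=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage:
#   sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter
#   render_node_exporter_unit | sudo tee /etc/systemd/system/node_exporter.service > /dev/null
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now node_exporter
```

Once the service is running, `curl -s http://localhost:9100/metrics | head` should show exporter output on the port that the `node-exporter` scrape job targets.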

### GPU Monitoring (NVIDIA)

```bash
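# Run the DCGM exporter for GPU metrics (serves Prometheus metrics on port 9400;
# add a scrape job for localhost:9400 to prometheus.yml to collect them)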
docker run -d --gpus all \
  -p 9400:9400 \
  nvidia/dcgm-exporter
```

***

## Dashboard Examples

### Key Dashboard Panels

1. **Overview**
   * Request rate
   * Error rate
   * Active nodes
2. **Performance**
   * Latency percentiles
   * Throughput
   * Queue depth
3. **Resources**
   * CPU usage
   * Memory usage
   * Disk I/O
   * Network I/O
4. **Validators**
   * Active validators
   * Consensus success rate
   * Reputation distribution

***

## Best Practices

1. **Retention**
   * Metrics: 15-30 days
   * Logs: 90 days
   * Alerts history: 1 year
2. **Alerting**
   * Alert on symptoms, not causes
   * Have actionable runbooks
   * Avoid alert fatigue
3. **Dashboards**
   * Start with high-level overview
   * Drill-down capability
   * Include links to runbooks
4. **On-Call**
   * Define escalation paths
   * Regular rotation
   * Post-incident reviews

***

## Next Steps

* [Production Deployment](/nc/neurochainai-guides/deployment/production.md) - Production setup
* [Troubleshooting](/nc/neurochainai-guides/troubleshooting/troubleshooting.md) - Common issues

