Node Provider Maintenance Guide

Day-to-day responsibilities for keeping node machines healthy on the Internet Computer — monitoring, common maintenance tasks, scheduled outages, and peer support.

This guide is the operational handbook for an active node provider. It covers the recurring work — monitoring, common machine-level tasks, coordinating around data-center outages, and where to go for peer support — that keeps a fleet healthy between major lifecycle events. For the role as a whole, see Node Provider Documentation.

Troubleshooting

When something is wrong with a node, start with the Node Provider Troubleshooting guide. It indexes the deployment-error, unhealthy-node, networking, and NNS-proposal subguides.

Submitting NNS proposals

Many maintenance actions — onboarding a new machine, updating a node's IPv4 address, retiring a node, changing principals — are carried out through proposals to the Network Nervous System (NNS). When a proposal affects rewards, check the next minting date so you know which reward period the change will land in.

If a proposal you submitted is rejected or fails to execute, follow Troubleshooting Failed NNS Proposals.

Monitoring

You are expected to monitor your nodes continuously. Public dashboards (such as the IC Dashboard) expose per-node health, and the IC observability stack provides machine-readable metrics.

Several community-built tools make this easier:

  • Aviate Labs Node Monitor — turnkey email alerts for unhealthy nodes.
  • DIY Node Monitoring — a community-shared GitHub repository with scripts you can adapt.
  • Prometheus exporter for node status — exports node health in a Prometheus-compatible format so it can plug into your existing observability stack.

For more detail and links, see Node Provider Alerting Options.
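As a minimal sketch of how machine-readable health data can feed an existing observability stack, the script below fetches node statuses and writes them in the Prometheus text exposition format, suitable for node_exporter's textfile collector. The API URL, its JSON field names (`nodes`, `node_id`, `status`), and the status strings are assumptions based on the public IC API; verify them before relying on this.

```python
#!/usr/bin/env python3
"""Sketch: export IC node health as Prometheus textfile metrics.

Assumed: the public API endpoint, its JSON field names, and the
status strings below. Check the current API before relying on this.
"""
import json
import os
import urllib.request

API_URL = "https://ic-api.internetcomputer.org/api/v3/nodes"  # assumed endpoint
TEXTFILE = "/var/lib/node_exporter/textfile_collector/ic_nodes.prom"
HEALTHY = {"UP", "UNASSIGNED"}  # assumed status strings for the two healthy states

def main() -> None:
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        nodes = json.load(resp).get("nodes", [])

    lines = [
        "# HELP ic_node_healthy 1 if the node reports a healthy status.",
        "# TYPE ic_node_healthy gauge",
    ]
    for node in nodes:  # optionally filter to your own provider's nodes here
        healthy = 1 if node.get("status") in HEALTHY else 0
        lines.append(f'ic_node_healthy{{node_id="{node["node_id"]}"}} {healthy}')

    # Write to a temp file, then rename, so the collector never reads a partial file.
    with open(TEXTFILE + ".tmp", "w") as f:
        f.write("\n".join(lines) + "\n")
    os.replace(TEXTFILE + ".tmp", TEXTFILE)

if __name__ == "__main__":
    main()
```

Run it from cron on a machine you control (never on the node itself; see Permitted tools below), and let your existing Prometheus setup scrape and alert on the resulting gauge.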

Common maintenance tasks

The following tasks recur across the lifetime of a node. Each has its own detailed procedure:

Limited node provider console

Technicians interact with a running node through a deliberately restricted console. The console exposes only the operations a node provider needs — diagnostics, recovery initiation, registry-related queries — and nothing else.

Permitted tools

[!WARNING] For security and confidentiality reasons, no other software is permitted to run on a node alongside the replica. Do not install diagnostic tools, monitoring agents, or shells on the node itself.

Run troubleshooting tooling from a USB-booted Linux distribution or from a separate auxiliary machine where you have full administrative control. See Setting up an auxiliary machine below.

Scheduled data-center outages

When a data center announces planned downtime that will affect your nodes:

  1. Notify DFINITY in the Node Provider Matrix channel ahead of the outage.
  2. After the outage, verify that every affected node returns to a healthy state. The two acceptable states are:
    • Active in subnet — the node is healthy and serving traffic.
    • Awaiting subnet — the node is operational and ready to be assigned.
  3. If a node remains degraded, give it time to catch up — but confirm it has reached one of the two healthy states before considering the outage closed.
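To avoid closing an outage prematurely, you can poll until every affected node reports one of the two healthy states. The sketch below reuses the same assumed API endpoint and status strings as the monitoring sketch above; the node IDs are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch: poll until every affected node is healthy again after an outage.

Same assumptions as the monitoring sketch: the API endpoint and the
status strings are unverified placeholders.
"""
import json
import time
import urllib.request

API_URL = "https://ic-api.internetcomputer.org/api/v3/nodes"
HEALTHY = {"UP", "UNASSIGNED"}         # assumed: Active in subnet / Awaiting subnet
AFFECTED = {"node-id-1", "node-id-2"}  # hypothetical: the nodes behind the outage

def unhealthy() -> set[str]:
    """Return the affected nodes that are not yet in a healthy state."""
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        status = {n["node_id"]: n.get("status") for n in json.load(resp)["nodes"]}
    return {nid for nid in AFFECTED if status.get(nid) not in HEALTHY}

if __name__ == "__main__":
    while pending := unhealthy():
        print(f"Still waiting on: {', '.join(sorted(pending))}")
        time.sleep(300)  # nodes may need time to catch up; poll every 5 minutes
    print("All affected nodes are back in a healthy state.")
```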

Node rewards based on useful work

[!NOTE] The Internet Computer protocol can tolerate up to one third of nodes misbehaving, but providers are expected to keep their machines online and healthy regardless.

Automatic reward mechanisms tied to useful work are under development. Until they ship, the operational expectation is the same: keep nodes online and healthy. Once the new mechanisms are in effect, unhealthy nodes will be penalised automatically.

Subnet recovery

Occasionally a subnet will need recovery, and the recovery procedure will require participation from the providers running its nodes. When that happens, instructions are issued in the Node Provider Matrix channel.

Follow the dedicated Manual Node Recovery Guide when instructed. Enable notifications on the Matrix channel so you do not miss direct mentions during an active recovery.

General best practice

  1. Keep a separate diagnostic machine in the same rack as your nodes, so you can investigate problems without depending on the node itself or on remote access.
  2. Engage with peers in the Node Provider Matrix channel — most problems have been seen before.

Setting up an auxiliary machine for network diagnostics

An auxiliary machine sits next to your nodes in the rack and gives you a full Linux toolbox under your own administrative control, without touching the replica. Provision it as follows.

Hardware

  • Use any appropriately resourced server. There is no requirement for Gen-1 or Gen-2 hardware — this machine is not part of the network.
  • Apply physical security controls equivalent to those you apply to the nodes themselves.

Operating system and software

  1. Install a minimal, hardened Linux distribution. Ubuntu 22.04 LTS is a good default.
  2. Apply the latest security patches and firmware updates before placing the machine in service.

Network configuration

  • Assign an IPv6 address from the same range as your IC nodes so the diagnostic machine can talk to them on the data plane.
  • Apply a restrictive firewall — allow only the traffic you actively need.
  • Consider running the auxiliary machine behind a VPN except during active troubleshooting.
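Once the IPv6 address is assigned, it is worth confirming that the auxiliary machine can actually reach each node on the data plane. A minimal sketch, using the standard Linux `ping` utility; the node names and addresses are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch: confirm IPv6 reachability from the auxiliary machine to each node."""
import subprocess

# Hypothetical addresses -- replace with your nodes' real IPv6 addresses.
NODES = {
    "node-1": "2001:db8::10",
    "node-2": "2001:db8::11",
}

for name, addr in NODES.items():
    # Three ICMPv6 echo requests, 2-second per-packet timeout.
    result = subprocess.run(
        ["ping", "-6", "-c", "3", "-W", "2", addr],
        capture_output=True,
        text=True,
    )
    state = "reachable" if result.returncode == 0 else "UNREACHABLE"
    print(f"{name} ({addr}): {state}")
```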

Diagnostic tools

Install at minimum:

  • ping for basic reachability and round-trip latency
  • traceroute to trace the network path to a node
  • nmap for port and service scanning
  • tcpdump for packet capture and deep inspection
  • iperf for throughput measurement between two endpoints

Also configure any monitoring agents that simulate node-side traffic, so you can record a baseline of the network under normal conditions.
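A simple way to build such a baseline is to log round-trip latency to each node over time. The sketch below appends one CSV row per node per run and is meant to be scheduled via cron; the node list is a hypothetical placeholder, and the regex assumes the iputils `ping` summary line format.

```python
#!/usr/bin/env python3
"""Sketch: record a latency baseline from the auxiliary machine to each node."""
import csv
import datetime
import re
import subprocess

NODES = {"node-1": "2001:db8::10"}  # hypothetical; reuse your real node list
CSV_PATH = "latency_baseline.csv"

def avg_rtt_ms(addr: str) -> float | None:
    """Return the average RTT reported by ping, or None on failure."""
    result = subprocess.run(
        ["ping", "-6", "-c", "10", "-q", addr],
        capture_output=True, text=True,
    )
    # iputils summary line: "rtt min/avg/max/mdev = 0.1/0.2/0.3/0.05 ms"
    match = re.search(r"= [\d.]+/([\d.]+)/", result.stdout)
    return float(match.group(1)) if match else None

with open(CSV_PATH, "a", newline="") as f:
    writer = csv.writer(f)
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    for name, addr in NODES.items():
        writer.writerow([now, name, avg_rtt_ms(addr)])
```

Re-running this against a known-good node also doubles as the periodic self-check described under Maintenance below.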

Access control

  • Use strong, unique passwords; prefer SSH key-based authentication.
  • Disable root SSH login.
  • Review authentication and command logs regularly for anomalies.
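For a quick recurring check of the SSH settings above, a naive audit script can flag drift. This is a sketch only: it parses `sshd_config` line by line and ignores Include and Match blocks, so use `sshd -T` for the authoritative effective configuration.

```python
#!/usr/bin/env python3
"""Sketch: quick audit of sshd_config for the hardening settings above.

Naive line-based parsing; treat the output as a hint, not an
authoritative check (`sshd -T` prints the real effective config).
"""
from pathlib import Path

EXPECTED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",  # key-based authentication only
}

def audit(path: str = "/etc/ssh/sshd_config") -> None:
    found: dict[str, str] = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() in EXPECTED:
            found[parts[0].lower()] = parts[1].strip().lower()

    for key, want in EXPECTED.items():
        got = found.get(key, "<unset>")
        mark = "OK " if got == want else "FIX"
        print(f"[{mark}] {key}: {got} (expected {want})")

if __name__ == "__main__":
    audit()
```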

Maintenance

  • Update the operating system and tooling on a regular cadence.
  • Periodically re-run the diagnostic tools against a known-good node to confirm the auxiliary machine itself is still working.

Peer support: Node Provider Matrix channel

The Node Provider Matrix channel is the primary venue for maintenance-related questions, peer assistance, and incident reporting. Search the channel before posting — many issues have been resolved there before.

When using the channel:

  • Enable notifications so you receive direct mentions promptly, especially during incidents.
  • Include your node-provider name in your Matrix alias so other providers and DFINITY engineers can identify you quickly.

Always consult Node Provider Troubleshooting before asking for help.