How to Monitor SSD Health on Dedicated Servers: Using smartctl, smartctl_exporter, and Grafana for Proactive Disk Failure Detection


If you're planning to order a dedicated server from Hetzner — or already have — don't assume the hardware is in perfect condition, especially the SSDs. Even 'new' NVMe drives can be heavily worn or nearing failure. It's essential to check disk health immediately after provisioning and set up continuous monitoring from day one.

In this article, I’ll walk you through:

  • Manually checking SSD health with smartctl,
  • Setting up real-time monitoring with smartctl_exporter and Prometheus,
  • Visualizing key disk metrics in Grafana.

With this setup, you’ll be able to:

  • Catch issues before deploying production workloads,
  • Continuously monitor SSD health over time,
  • Collect clear, actionable evidence to request disk replacements before downtime hits.

Manual SSD Health Check Using smartctl

Before diving into automated monitoring, it’s essential to perform a one-time manual inspection of your SSDs using the smartctl utility from the smartmontools package.

What is smartmontools?

It's a handy set of tools that lets you pull S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data from your drives — things like temperature, wear level, power-on hours, and error stats.

Install smartmontools

On most Linux distributions:

# Debian/Ubuntu
sudo apt update && sudo apt install smartmontools

# CentOS/RHEL (on newer releases, dnf replaces yum)
sudo yum install smartmontools

# Arch
sudo pacman -S smartmontools
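
After installing, it's worth confirming which version you got; 7.0 or newer also gives you JSON output, which we'll use for scripting a bit later:

smartctl --version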

Listing Drives

Let’s see what drives are available:

sudo smartctl --scan

We have two NVMe drives here:

/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device

Get Drive Health Info

To inspect the first drive, run:

sudo smartctl -a /dev/nvme0

The first section contains general information about the device, such as model, firmware version, capacity, and supported features.
Keep an eye on the warning and critical temperature thresholds here. We'll use them later when setting up alerts.
Next, you'll see the SMART data section.

The key things to look at here are:

  • Temperature: 32–38°C — totally healthy.
  • Available Spare: 100% — this shows how much spare capacity the controller has left to replace worn-out cells. If it drops close to the threshold (usually 10%), the drive is at risk.
  • Percentage Used is just 6%, which means the drive is still in good shape and far from wearing out.
  • Media and Data Integrity Errors are at zero — a good sign that the SSD isn’t silently failing.
  • And finally, Critical Warning is 0x00, meaning the drive isn’t reporting any major issues.
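
If you want to script this spot check across all drives, smartctl can emit JSON with the -j flag (smartmontools 7.0+), which is much easier to parse than the human-readable report. Here's a minimal sketch, assuming jq is installed; the field names below match what recent smartmontools versions emit for NVMe devices, so verify them against your own -j output:

#!/bin/bash
# Print the key NVMe health fields for every drive found by --scan.
for dev in $(sudo smartctl --scan | awk '/nvme/ {print $1}'); do
  echo "=== $dev ==="
  sudo smartctl -a -j "$dev" | jq '{
    temperature:      .temperature.current,
    available_spare:  .nvme_smart_health_information_log.available_spare,
    percentage_used:  .nvme_smart_health_information_log.percentage_used,
    media_errors:     .nvme_smart_health_information_log.media_errors,
    critical_warning: .nvme_smart_health_information_log.critical_warning
  }'
done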

Extracting SSD Metrics and Configuring Prometheus Alerts

Key Metrics and Alerts Logic

So, what key metrics should we monitor, and at what points should we trigger alerts? The full list is below:

  • SMART overall-health self-assessment test result — if the status is anything other than PASSED, this should trigger an alert.

  • Critical Warning (0x00 means no warnings) — if this value becomes non-zero, it’s a cause for alarm.

  • Temperature — a crucial metric. You should alert if the temperature exceeds a threshold (typically around 70°C–80°C).

  • Available Spare and Available Spare Threshold — represent the remaining reserve capacity of the drive. An alert should be triggered when Available Spare falls below the threshold.

  • Percentage Used — indicates how much of the drive’s lifespan has been consumed. A warning is needed if this exceeds around 80–90%.

  • Unsafe Shutdowns — critical if the count increases, as it may point to power issues or improper shutdowns (an example rule for this is sketched after the main alert rules below).

  • Media and Data Integrity Errors and Error Information Log Entries — any non-zero value here should trigger an alert immediately.

Let's set up the metrics mentioned above for continuous monitoring with Prometheus. I'll use smartctl_exporter for this and build it from source (the project also publishes prebuilt binaries on its GitHub releases page if you'd rather skip the build), but first we need to install Go if it's not already installed:

wget https://go.dev/dl/go1.24.4.linux-amd64.tar.gz
sudo tar -C /usr/local -xzf go1.24.4.linux-amd64.tar.gz
echo 'export PATH=$PATH:/usr/local/go/bin' >> ~/.profile
source ~/.profile
go version

Install smartctl_exporter:

git clone https://github.com/prometheus-community/smartctl_exporter
cd smartctl_exporter
go build
sudo cp smartctl_exporter /usr/local/bin

Run the exporter to collect SMART metrics from both NVMe drives:

sudo smartctl_exporter \
  --smartctl.path="/usr/sbin/smartctl" \
  --smartctl.device="/dev/nvme0n1" \
  --smartctl.device="/dev/nvme1n1" \
  --web.listen-address=":9633"

Visit http://your_ip_is_here:9633:
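
You should see plain-text Prometheus metrics. Here's a trimmed, illustrative sample (exact metric names, labels, and values depend on the exporter version and your drives, so treat this as a sketch):

smartctl_device_smart_status{device="nvme0"} 1
smartctl_device_temperature{device="nvme0",temperature_type="current"} 35
smartctl_device_percentage_used{device="nvme0"} 6
smartctl_device_media_errors{device="nvme0"} 0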

Production security note

In this article, smartctl_exporter is exposed on 0.0.0.0:9633 without TLS or authentication for simplicity and demo purposes.

However, in production environments, it's strongly recommended to protect metrics endpoints. You can:

  • Bind to 127.0.0.1 and have Prometheus scrape locally.
  • Use iptables/firewalld/nftables to allow access only from Prometheus (see the example below).
  • Set up a reverse proxy (e.g., Nginx or Caddy) with HTTPS and basic auth.
  • Consider mTLS for stricter environments.

Leaving metrics endpoints open to the public is a serious security risk, as they may leak hardware or infrastructure details.
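
For example, here's a minimal iptables sketch for the firewall option above, assuming your Prometheus server sits at 203.0.113.10 (a placeholder address; substitute your own):

# Allow only the Prometheus host to reach the exporter, drop everyone else
sudo iptables -A INPUT -p tcp --dport 9633 -s 203.0.113.10 -j ACCEPT
sudo iptables -A INPUT -p tcp --dport 9633 -j DROP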

Systemd for smartctl_exporter

To keep smartctl metrics flowing continuously, let's set up a systemd service. Open a new unit file in your favorite editor:

sudo vim /etc/systemd/system/smartctl_exporter.service

Paste the following:

[Unit]
Description=smartctl_exporter for Prometheus
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/smartctl_exporter \
  --smartctl.path=/usr/sbin/smartctl \
  --smartctl.device=/dev/nvme0n1 \
  --smartctl.device=/dev/nvme1n1 \
  --web.listen-address=:9633
Restart=on-failure
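# smartctl needs elevated privileges to issue admin commands to NVMe devices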
User=root

[Install]
WantedBy=multi-user.target

Enable and start the service:

sudo systemctl daemon-reload
sudo systemctl enable smartctl_exporter
sudo systemctl start smartctl_exporter

Check it:

sudo systemctl status smartctl_exporter
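
As an extra sanity check, curl the endpoint locally and grep for one of the series we'll be alerting on:

curl -s http://localhost:9633/metrics | grep smartctl_device_smart_status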

All done — we’re now ready to collect SMART metrics with Prometheus.

Prometheus scraping SMART data and firing alerts

OK, let's configure Prometheus to start scraping the exporter. In your prometheus.yml, inside scrape_configs, add:

- job_name: 'smartctl'
  static_configs:
  - targets: ['your_ip_is_here:9633']
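  # Optional: rewrite the instance label to a static, friendly name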
  relabel_configs:
  - source_labels: [__address__]
    regex: '.*'
    target_label: instance
    replacement: 'smartctl'

Then, create a new file named smartctl_alerts.yml and paste the following alert rules into it:

groups:
  - name: smartctl_nvme_alerts
    rules:

      # Alert if NVMe wear level exceeds 80% (used lifespan)
      - alert: NVMEWearLevelHigh
        expr: smartctl_device_percentage_used >= 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High NVMe wear level on {{ $labels.device }} ({{ $value }}%)"
          description: "The drive {{ $labels.device }} has reached {{ $value }}% of its designed write endurance. Plan replacement soon."

      # Alert if current temperature exceeds safe limits (> 80°C)
      - alert: NVMEHighTemperature
        expr: smartctl_device_temperature{temperature_type="current"} > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High temperature on NVMe {{ $labels.device }} ({{ $value }}°C)"
          description: "The current temperature of {{ $labels.device }} exceeds the safe threshold (80°C). Check airflow and cooling."

      # Alert on SMART critical warning bit (e.g. power loss, overheating, etc.)
      - alert: NVMECriticalWarning
        expr: smartctl_device_critical_warning != 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "SMART critical warning for {{ $labels.device }}"
          description: "SMART reported a critical warning on device {{ $labels.device }}. Immediate attention required."

      # Alert on media errors (e.g. read/write failures)
      - alert: NVMEMediaErrors
        expr: smartctl_device_media_errors > 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Media errors detected on {{ $labels.device }}"
          description: "Media-level errors have been reported on {{ $labels.device }}. This may indicate degraded flash cells."

      # Alert if SMART overall status is failed (should be 1 if OK)
      - alert: NVMESmartStatusFailed
        expr: smartctl_device_smart_status != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "SMART health check failed on {{ $labels.device }}"
          description: "The SMART status of {{ $labels.device }} indicates a failure. Consider immediate replacement."

      # Alert if available spare blocks drop below threshold
      - alert: NVMESpareBlocksLow
        expr: smartctl_device_available_spare < smartctl_device_available_spare_threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Available spare blocks low on {{ $labels.device }}"
          description: "Spare capacity on {{ $labels.device }} dropped below the critical threshold. Replace the drive soon."

Make sure to include smartctl_alerts.yml in the rule_files section of your prometheus.yml:

rule_files:
  - "rules/other_rules.yml"
  - "rules/smartctl_alerts.yml"

Then reload Prometheus or restart the service. Once loaded, your SMART metrics will be available for querying via PromQL:
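
For example, to check the current temperature of every monitored drive:

smartctl_device_temperature{temperature_type="current"}

Or to pick out the hottest one:

max by (device) (smartctl_device_temperature{temperature_type="current"})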

To visualize the health and usage of your SSDs, you can create a custom Grafana dashboard with the following panels. These metrics are collected by smartctl_exporter and exposed to Prometheus.

Panel Title            PromQL Query Example                                       Panel Type
NVMe Temperature       smartctl_device_temperature{temperature_type="current"}   Gauge
SMART Health Status    smartctl_device_smart_status (1 = OK, 0 = FAIL)            Stat
Media Errors (Total)   smartctl_device_media_errors                               Time series
Error Log Entries      smartctl_device_num_err_log_entries                        Time series
Percentage Used        smartctl_device_percentage_used                            Bar gauge
Critical Warnings      smartctl_device_critical_warning                           Stat
Available Spare        smartctl_device_available_spare                            Gauge

This set of panels gives a comprehensive real-time view of SSD health and aging, making it easier to catch potential failures early.

That's all. You can find the full dashboard JSON and configuration files on GitHub.
