06 - Drive Health Monitoring (smartd)
smartd continuously monitors your drives for signs of failure and emails you when anything looks wrong. The philosophy is silence equals healthy — you only get emails when something needs attention.
What's monitored
| Drive | Device | Mount | Role |
|---|---|---|---|
| Samsung NVMe | /dev/nvme1n1 | / | OS drive |
| SK hynix NVMe | /dev/nvme0n1 | /data | Data drive |
| Samsung 990 PRO 1TB (USB, ASM2462 enclosure) | /dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 | /mnt/backup | Backup drive |
The backup drive is an NVMe SSD inside an ASMedia ASM2462 USB enclosure (it replaced the old Samsung T5 in May 2026 — see 05-backups.md). It's referenced by its stable /dev/disk/by-id/usb-...-0:0 path rather than /dev/sdX — the sdX name changes on USB re-enumeration. The -0:0 suffix means the whole disk (not a partition), which is mandatory here because of the -d sntasmedia flag below.
The off-site rotation drive (the demoted Samsung T5, label backup-offsite) is not monitored by smartd: it is normally disconnected (it lives at a family member's house) and is only plugged in for the manual off-site sync.
Note: Linux device names for NVMe drives can occasionally change between reboots (e.g., what was nvme0n1 becomes nvme1n1). If you ever see smartd errors about devices not found after a reboot, check lsblk and update /etc/smartd.conf accordingly.
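One way to confirm which name each NVMe drive received after a reboot is to match on the model string rather than the device name. The sketch below assumes lsblk's -d, -n and -o NAME,MODEL options; device_for_model is a hypothetical helper name, and the model substring in the usage line is only illustrative.

```shell
# Print /dev/<name> for every whole disk whose model contains a substring.
# Reads `lsblk -dn -o NAME,MODEL` output on stdin (NAME first, then MODEL).
device_for_model() {
  awk -v want="$1" '
    NF < 2 { next }                      # skip devices that report no model
    {
      name = $1
      model = $0
      sub(/^[^ ]+ +/, "", model)         # drop the NAME column, keep the model
      if (index(tolower(model), tolower(want)) > 0) print "/dev/" name
    }'
}

# Usage (illustrative model substring):
#   lsblk -dn -o NAME,MODEL | device_for_model "hynix"
```

If the printed name differs from the one in /etc/smartd.conf, update the config to match.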
Installation (fresh setup)
Install the smartmontools package with your distribution's package manager. It provides both smartctl (manual one-shot health checks) and smartd (the monitoring daemon).
Configuration
File: /etc/smartd.conf
The default file has a DEVICESCAN directive at the bottom. Comment that out and add explicit per-drive entries:
# DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/nvme0n1 -a -s S/../.././02 -m gabrielgabrie99@gmail.com
/dev/nvme1n1 -a -s S/../.././02 -m gabrielgabrie99@gmail.com
/dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 -d sntasmedia -a -m gabrielgabrie99@gmail.com
Note on the USB backup-drive entry:
- -d sntasmedia: the backup drive is an NVMe SSD behind an ASMedia ASM2462 USB↔NVMe bridge. Without this flag, smartctl/smartd only sees a generic USB mass-storage device and can't read the drive's real NVMe SMART data. -d sntasmedia tells it to use the ASMedia bridge's SCSI-translation path to talk NVMe through the enclosure. It works on the z2mini's smartmontools 7.5. (Other bridge chips have their own: -d sntjmicron for JMicron, -d sntrealtek for Realtek.)
- Referenced by /dev/disk/by-id/usb-...-0:0, not /dev/sdX: USB device names can change after a disconnect/reconnect cycle, which would break a /dev/sda-based config. The by-id path is stable. It's even more important here than the old UUID approach: -d sntasmedia needs the whole-disk device (the -0:0 suffix), not a filesystem/partition, so a /dev/disk/by-uuid/ path wouldn't work at all.
- No -s self-test schedule: USB drives don't get scheduled self-tests anyway (the protocol can be flaky), and in this case the ASM2462 bridge doesn't pass the NVMe self-test command (admin command 0x14) through to the drive, so a -s schedule would just produce errors. This is consistent with the "USB drives skip scheduled self-tests" rationale below.
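Because -d sntasmedia needs the whole-disk link, a small guard can reject partition links before they end up in smartd.conf. This is a sketch that assumes udev's usual convention of appending -partN to a by-id name for partitions; is_whole_disk_link is a hypothetical helper name.

```shell
# Succeeds for a whole-disk by-id link, fails for a partition link.
is_whole_disk_link() {
  case "$1" in
    *-part[0-9]*) return 1 ;;   # partition link, e.g. ...-0:0-part1
    *)            return 0 ;;   # whole-disk link, e.g. ...-0:0
  esac
}

dev=/dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0
if is_whole_disk_link "$dev"; then
  echo "ok: $dev names the whole disk"
else
  echo "refusing: $dev names a partition"
fi
```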
What each flag means
| Flag | Meaning |
|---|---|
| -a | Monitor all SMART attributes |
| -s S/../.././02 | Schedule daily Short self-test at 2 AM |
| -m <email> | Email address for alerts |
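The schedule is a regular expression matched against strings of the form TYPE/MONTH/DAY-OF-MONTH/DAY-OF-WEEK/HOUR. Each dot matches any single character, so ".." means "any month" or "any day of month" and "." means "any day of week." So S/../.././02 reads as "Short test, any month, any day of month, any day of week, hour 02."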
USB drives don't get scheduled self-tests (their entries omit -s) because SMART passthrough over a USB bridge can be flaky and self-tests can interfere with regular operation.
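To see which hours a -s expression would fire on, the pattern can be tried against sample strings outside smartd. This is a rough sketch, not smartd's own matcher: it assumes the smartmontools convention that the expression must match the whole T/MM/DD/d/HH string (type, month, day of month, day of week with Monday as 1, hour) and approximates that with grep -E -x. schedule_matches is a hypothetical helper name.

```shell
# $1: smartd -s regular expression, $2: candidate T/MM/DD/d/HH string
schedule_matches() {
  printf '%s\n' "$2" | grep -Eqx -e "$1"
}

# Build the string for a Short test in the current hour.
# date's %u is the weekday as 1-7 with Monday = 1.
now="S/$(date +%m/%d/%u/%H)"
if schedule_matches 'S/../.././02' "$now"; then
  echo "$now: a short test would be scheduled this hour"
else
  echo "$now: no short test this hour"
fi
```

The same helper can be used to sanity-check a stricter pattern, such as one limited to a particular weekday, before putting it in /etc/smartd.conf.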
Enable and start
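Enable the smartd service so it starts at boot, start it, then check it with sudo systemctl status smartd. The status should show "active (running)."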
How alerts work
smartd uses the system's sendmail command to send emails. On this system, sendmail is symlinked to msmtp, which relays to Gmail. See 07-email-relay.md for that setup.
When something concerning is detected (failing self-test, temperature threshold, read errors, a drop in available spare, etc.), smartd sends a warning email to the address given with -m.
You receive an email titled something like "SMART error (FailedTest) detected on host: z2mini" with details.
Testing email alerts
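To verify alerts work without waiting for a real failure, temporarily append -M test to each drive line in /etc/smartd.conf, for example:
/dev/nvme0n1 -a -s S/../.././02 -m gabrielgabrie99@gmail.com -M test
Then restart the smartd service so it rereads the file.
Within a couple of minutes, you should receive test emails (one per drive). After confirming they arrive, remove -M test and restart again; otherwise smartd sends test emails on every restart.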
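Adding and removing the flag by hand on several lines is easy to get wrong, so the edit can be previewed with sed first. This is a minimal sketch, assuming each drive entry sits on one line with no trailing # comment and no backslash continuation, and that sed supports -E. add_test_flag and strip_test_flag are hypothetical helper names; both print to stdout and leave the file untouched.

```shell
# Append "-M test" to every active directive line (comments and blank lines are skipped).
add_test_flag() {
  sed -E '/^[[:space:]]*(#|$)/!s/[[:space:]]*$/ -M test/' "$1"
}

# Remove the flag again once the test emails have arrived.
strip_test_flag() {
  sed -E 's/[[:space:]]+-M[[:space:]]+test//' "$1"
}

# Usage: preview the change before touching the real file
#   add_test_flag /etc/smartd.conf
```

After reviewing the output, write it back to /etc/smartd.conf with root privileges and restart the service, for example with sudo systemctl restart smartd.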
Manual one-shot health checks
You don't need smartd to check a drive once. Use smartctl:
# Full health report
sudo smartctl -a /dev/nvme0n1
# Just the headline health assessment
sudo smartctl -H /dev/nvme0n1
# Run a short self-test manually
sudo smartctl -t short /dev/nvme0n1
# View self-test results after it completes (~2 minutes)
sudo smartctl -l selftest /dev/nvme0n1
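For the USB backup drive (NVMe behind the ASM2462 enclosure), add -d sntasmedia and point smartctl at the stable by-id path:
sudo smartctl -a -d sntasmedia /dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0
-d sntasmedia is the ASMedia USB↔NVMe bridge translation, needed to read the 990 PRO's real NVMe SMART data through the enclosure. Without it smartctl just sees a generic USB mass-storage device.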
For other USB enclosures: a plain SATA-over-USB enclosure usually wants -d sat (SCSI/ATA Translation); NVMe-in-USB enclosures want one of the -d snt* variants — -d sntasmedia (ASMedia), -d sntjmicron (JMicron), -d sntrealtek (Realtek). Older Samsung T5s (plain SATA SSDs) typically don't need any -d flag at all.
Key SMART metrics to watch
| Metric | What it means | Action threshold |
|---|---|---|
| SMART overall-health | Overall pass/fail | Anything other than PASSED → investigate |
| Critical Warning | NVMe critical condition flag | Anything other than 0x00 → investigate |
| Available Spare | Reserve cells remaining | Below 100% means wear is starting |
| Percentage Used | Estimated lifetime used | Above 80% → plan replacement |
| Media and Data Integrity Errors | Detected data corruption | Above 0 → investigate immediately |
| Reallocated Sector Ct (SATA only) | Bad sectors remapped | Above 0 → drive is wearing |
| Unsafe Shutdowns | Counts power-loss events | High count means consider a UPS |
| Temperature | Current drive temp | Above warning threshold (typically 80°C+) → improve cooling |
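The NVMe thresholds in the table can also be checked mechanically. The sketch below reads smartctl -a output for an NVMe drive on stdin and prints one line per threshold crossed; it assumes each field label from the table appears at the start of a line followed by a colon, as in typical smartctl NVMe output. check_nvme_health is a hypothetical helper name.

```shell
# Reads `smartctl -a` output on stdin; prints findings and exits non-zero if any.
check_nvme_health() {
  awk -F': *' '
    /^SMART overall-health/ && $2 != "PASSED" { print "overall health: " $2; bad = 1 }
    /^Critical Warning:/ && $2 != "0x00"      { print "critical warning: " $2; bad = 1 }
    /^Available Spare:/ && $2 + 0 < 100       { print "available spare below 100%: " $2; bad = 1 }
    /^Percentage Used:/ && $2 + 0 > 80        { print "percentage used: " $2; bad = 1 }
    /^Media and Data Integrity Errors:/ {
      n = $2; gsub(/,/, "", n)                # counters may carry thousands separators
      if (n + 0 > 0) { print "media errors: " n; bad = 1 }
    }
    END { exit bad + 0 }
  '
}

# Usage:
#   sudo smartctl -a /dev/nvme0n1 | check_nvme_health && echo "nothing to report"
```

Silence plus a zero exit status means none of the listed thresholds were crossed; Unsafe Shutdowns and Temperature are left out because the table gives no fixed numeric cutoff for them.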
Routine monitoring
Even with email alerts, do a quick manual check monthly:
sudo smartctl -a /dev/nvme0n1 | grep -E "overall-health|Available Spare|Percentage Used|Media and Data"
sudo smartctl -a /dev/nvme1n1 | grep -E "overall-health|Available Spare|Percentage Used|Media and Data"
sudo smartctl -a -d sntasmedia /dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 | grep -E "overall-health|Available Spare|Percentage Used|Media and Data"
(The by-id path stays valid across reconnects, unlike /dev/sdX, and the -d sntasmedia flag is required to read the enclosure's NVMe SMART data. Matching on overall-health rather than PASSED keeps the result line visible even if a drive reports a failure.)
Look for:
- PASSED (overall health)
- Available Spare: 100% (haven't started wearing)
- Percentage Used: 0% (or low number)
- Media and Data Integrity Errors: 0
Troubleshooting
No email alerts arriving:
- Verify smartd is running: sudo systemctl status smartd
- Test email separately: echo test | sudo msmtp gabrielgabrie99@gmail.com
- Check sendmail symlink: ls -la /usr/sbin/sendmail (should point to msmtp)
- Check msmtp logs: sudo cat /var/log/msmtp.log
smartd alerting about "open() of ATA device failed: No such device":
The drive smartd is monitoring isn't visible. Two common causes:
- USB drive disconnected: check lsblk. If the drive is gone, see 05-backups.md troubleshooting.
- Device name changed: if /etc/smartd.conf ever got switched to a /dev/sdX reference and the drive came back as a different sdX after a reconnect, smartd is looking in the wrong place. Always reference the USB backup drive by its stable /dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 path (with -d sntasmedia). A /dev/disk/by-uuid/ path would not work here: -d sntasmedia needs the whole-disk device, not a filesystem.
smartd fails to start:
Common cause: a device listed in /etc/smartd.conf no longer exists (e.g., USB drive unplugged). Either reconnect the drive or comment out that line.
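To find the offending entry quickly, the device paths in the config can be checked for existence. This is a sketch that assumes every drive entry starts with an absolute path in its first field, as in the config above; missing_smartd_devices is a hypothetical helper name.

```shell
# Print each device path listed in a smartd config that does not currently exist.
# Comment lines and DEVICESCAN are ignored because their first field is not a path.
missing_smartd_devices() {
  conf="${1:-/etc/smartd.conf}"
  awk '$1 ~ /^\// { print $1 }' "$conf" | while IFS= read -r dev; do
    [ -e "$dev" ] || printf '%s\n' "$dev"
  done
}

# Usage:
#   missing_smartd_devices            # checks /etc/smartd.conf
```

Any path it prints is a drive smartd cannot open at startup; reconnect that drive or comment out its line before starting the service again.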
"Invalid Field in Command" entries in error logs:
This is normal, especially on Samsung OEM drives. The OS sends commands the drive doesn't support, and the drive responds with "I don't recognize that." This is not a hardware error; only watch Media and Data Integrity Errors for actual data corruption.
Files and locations
| Purpose | Path |
|---|---|
| Configuration | /etc/smartd.conf |
| Service unit | /lib/systemd/system/smartd.service |
| Logs | journalctl -u smartd |