06 - Drive Health Monitoring (smartd)

smartd continuously monitors your drives for signs of failure and emails you when anything looks wrong. The philosophy is silence equals healthy — you only get emails when something needs attention.

What's monitored

Drive	Device	Mount	Role
Samsung NVMe	`/dev/nvme1n1`	`/`	OS drive
SK hynix NVMe	`/dev/nvme0n1`	`/data`	Data drive
Samsung 990 PRO 1TB (USB, ASM2462 enclosure)	`/dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0`	`/mnt/backup`	Backup drive

The backup drive is an NVMe SSD inside an ASMedia ASM2462 USB enclosure (it replaced the old Samsung T5 in May 2026 — see 05-backups.md). It's referenced by its stable /dev/disk/by-id/usb-...-0:0 path rather than /dev/sdX — the sdX name changes on USB re-enumeration. The -0:0 suffix means the whole disk (not a partition), which is mandatory here because of the -d sntasmedia flag below.

The off-site rotation drive (the demoted Samsung T5, label backup-offsite) is not monitored by smartd. It is normally disconnected (it lives at a family member's house) and is only plugged in for the manual off-site sync.

Note: Linux device names for NVMe drives can occasionally change between reboots (e.g., what was nvme0n1 becomes nvme1n1). If you ever see smartd errors about devices not found after a reboot, check lsblk and update /etc/smartd.conf accordingly.
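One way to see which stable name currently points at which kernel device is to walk /dev/disk/by-id. This is a minimal sketch, assuming GNU readlink is available; the function name list_whole_disks and its optional directory argument are illustrative and not part of smartmontools.

```shell
# Print "stable-name -> kernel device" for every whole-disk link in a by-id directory.
# list_whole_disks is a hypothetical helper; the directory argument exists only so the
# function can be tried against a scratch directory.
list_whole_disks() (
    dir=${1:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue                            # skip anything that is not a symlink
        case ${link##*/} in *-part[0-9]*) continue ;; esac    # skip partition links
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
)

list_whole_disks
```

Comparing the output before and after a reboot shows whether nvme0n1 and nvme1n1 have swapped, and the whole-disk usb-... entry is the form the backup drive's smartd.conf line needs.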

Installation (fresh setup)

sudo apt install -y smartmontools

This installs both smartctl (manual one-shot health checks) and smartd (the monitoring daemon).

Configuration

File: /etc/smartd.conf

The default file has a DEVICESCAN directive at the bottom. Comment that out and add explicit per-drive entries:

# DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

/dev/nvme0n1 -a -s S/../.././02 -m gabrielgabrie99@gmail.com
/dev/nvme1n1 -a -s S/../.././02 -m gabrielgabrie99@gmail.com
/dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 -d sntasmedia -a -m gabrielgabrie99@gmail.com

Note on the USB backup-drive entry:

-d sntasmedia — the backup drive is an NVMe SSD behind an ASMedia ASM2462 USB↔NVMe bridge. Without this flag, smartctl / smartd only sees a generic USB mass-storage device and can't read the drive's real NVMe SMART data. -d sntasmedia tells it to use the ASMedia bridge's SCSI-translation path to talk NVMe through the enclosure. It works on the z2mini's smartmontools 7.5. (Other bridge chips have their own: -d sntjmicron for JMicron, -d sntrealtek for Realtek.)
Referenced by /dev/disk/by-id/usb-...-0:0, not /dev/sdX — USB device names can change after a disconnect/reconnect cycle, which would break a /dev/sda-based config. The by-id path is stable. It's even more important here than the old UUID approach: -d sntasmedia needs the whole-disk device (the -0:0 suffix), not a filesystem/partition, so a /dev/disk/by-uuid/ path wouldn't work at all.
No -s self-test schedule. USB drives don't get scheduled self-tests (see the note under the flag table), and the ASM2462 bridge doesn't pass the NVMe Device Self-test command (admin opcode 0x14) through to the drive, so a -s schedule would only produce errors.

What each flag means

Flag	Meaning
`-a`	Turn on smartd's default set of checks (health status, attribute changes, error and self-test logs)
`-s S/../.././02`	Schedule daily Short self-test at 2 AM
`-m <email>`	Email address for alerts

The schedule format is TYPE/MONTH/DAY-OF-MONTH/DAY-OF-WEEK/HOUR. Each field is a regular expression, so a dot matches any single character. S/../.././02 therefore reads as "Short test, any month, any day of month, any day of week, hour 02."
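To make the field positions concrete, here is a small sketch that splits a schedule string on its slashes. It follows the T/MM/DD/d/HH order given in smartd.conf(5) (type, month, day of month, weekday from 1 for Monday to 7 for Sunday, hour). The helper name explain_schedule is made up for illustration, and it assumes a simple single-pattern schedule with no "/" inside a regex group.

```shell
# Split a smartd "-s" schedule into its five fields.
# Field order per smartd.conf(5): T/MM/DD/d/HH (type, month, day of month, weekday, hour).
# explain_schedule is a hypothetical helper name, not a smartmontools command.
explain_schedule() (
    IFS=/ read -r type month day weekday hour <<EOF
$1
EOF
    printf 'type=%s month=%s day-of-month=%s weekday=%s hour=%s\n' \
        "$type" "$month" "$day" "$weekday" "$hour"
)

explain_schedule 'S/../.././02'
# prints: type=S month=.. day-of-month=.. weekday=. hour=02
```

A weekly long test on Saturdays at 3 AM would be written L/../../6/03 under the same layout.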

USB drives don't get scheduled self-tests (omitted -s) because the protocol can be flaky and self-tests can interfere with regular operation.

Enable and start

sudo systemctl enable smartd
sudo systemctl restart smartd
sudo systemctl status smartd

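The status output should show "active (running)". On some Debian and Ubuntu releases the unit is packaged as smartmontools.service with smartd as an alias; if systemctl enable refuses to operate on the alias, use smartmontools as the unit name instead.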

How alerts work

smartd sends alerts through the system's mail command, which hands the message to sendmail. On this system, sendmail is symlinked to msmtp, which relays to Gmail. See 07-email-relay.md for that setup.

When something concerning is detected (failing self-test, temperature threshold, read errors, available spare drops, etc.), smartd runs the equivalent of:

echo "<details>" | mail -s "SMART error..." gabrielgabrie99@gmail.com

You receive an email titled something like "SMART error (SelfTest) detected on host: z2mini" with details.

Testing email alerts

To verify alerts work without waiting for a real failure, temporarily add -M test to each line:

/dev/nvme0n1 -a -s S/../.././02 -m gabrielgabrie99@gmail.com -M test

Then restart:

sudo systemctl restart smartd

Within a couple of minutes, you should receive test emails (one per drive). After confirming they arrive, remove -M test to stop getting test emails on every restart:

sudo nano /etc/smartd.conf
sudo systemctl restart smartd
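If editing each line by hand gets tedious, the flag can be added and stripped with sed. The following is only a sketch: with_test_flag and without_test_flag are hypothetical helper names, they print to stdout instead of editing in place, and they assume GNU sed and device lines without trailing "# comment" text or backslash continuations.

```shell
# Print a copy of a smartd.conf with " -M test" appended to each active device line.
# Comment lines, blank lines, and lines that already carry the flag pass through unchanged.
# Hypothetical helper; assumes GNU sed and one directive line per device.
with_test_flag() {
    sed -E '/^[[:space:]]*(#|$)/!{/-M test/!s/[[:space:]]*$/ -M test/;}' "$1"
}

# Reverse of the above: print the file with any " -M test" removed again.
without_test_flag() {
    sed -E 's/[[:space:]]+-M test([[:space:]]|$)/\1/' "$1"
}

# Example (review the output before copying it over the real file):
#   with_test_flag /etc/smartd.conf > /tmp/smartd.conf.test
#   sudo cp /tmp/smartd.conf.test /etc/smartd.conf && sudo systemctl restart smartd
```

Writing to a scratch file first avoids truncating /etc/smartd.conf while sed is still reading it.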

Manual one-shot health checks

You don't need smartd to check a drive once. Use smartctl:

# Full health report
sudo smartctl -a /dev/nvme0n1

# Just the headline health assessment
sudo smartctl -H /dev/nvme0n1

# Run a short self-test manually
sudo smartctl -t short /dev/nvme0n1

# View self-test results after it completes (~2 minutes)
sudo smartctl -l selftest /dev/nvme0n1

For the USB backup drive (NVMe behind the ASM2462 enclosure):

# whatever lsblk shows the enclosure as (often /dev/sdb)
sudo smartctl -a -d sntasmedia /dev/sdb

-d sntasmedia is the ASMedia USB↔NVMe bridge translation — needed to read the 990 PRO's real NVMe SMART data through the enclosure. Without it smartctl just sees a generic USB mass-storage device.

For other USB enclosures: a plain SATA-over-USB enclosure usually wants -d sat (SCSI/ATA Translation); NVMe-in-USB enclosures want one of the -d snt* variants — -d sntasmedia (ASMedia), -d sntjmicron (JMicron), -d sntrealtek (Realtek). Older Samsung T5s (plain SATA SSDs) typically don't need any -d flag at all.

Key SMART metrics to watch

Metric	What it means	Action threshold
`SMART overall-health`	Overall pass/fail	Anything other than PASSED → investigate
`Critical Warning`	NVMe critical condition flag	Anything other than 0x00 → investigate
`Available Spare`	Reserve cells remaining	Below 100% means wear is starting
`Percentage Used`	Estimated lifetime used	Above 80% → plan replacement
`Media and Data Integrity Errors`	Detected data corruption	Above 0 → investigate immediately
`Reallocated_Sector_Ct` (SATA only)	Bad sectors remapped	Above 0 → drive is wearing
`Unsafe Shutdowns`	Counts power-loss events	High count means consider a UPS
`Temperature`	Current drive temp	Above warning threshold (typically 80°C+) → improve cooling
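The NVMe rows of this table can be checked mechanically. The sketch below reads smartctl -a output on stdin and prints a line for each threshold that is crossed. check_nvme_health is a hypothetical helper name, and the field labels assume the layout smartctl prints for its NVMe SMART/Health Information section; the 80% and 0-error cutoffs are simply the ones from the table.

```shell
# Read `smartctl -a` output for an NVMe drive on stdin and flag the thresholds from the table.
# check_nvme_health is a hypothetical helper; field names assume smartctl's NVMe
# "SMART/Health Information" layout. Exit status is 1 if any WARN line was printed.
check_nvme_health() {
    awk -F: '
        { val = $2; gsub(/[ \t%,]/, "", val) }   # value with spaces, "%" and "," removed
        $1 ~ /^SMART overall-health/ && val != "PASSED" { print "WARN overall-health: " val; bad = 1 }
        $1 == "Critical Warning" && val != "0x00"        { print "WARN Critical Warning: " val; bad = 1 }
        $1 == "Available Spare" && val + 0 < 100         { print "NOTE Available Spare: " val "%" }
        $1 == "Percentage Used" && val + 0 > 80          { print "WARN Percentage Used: " val "%"; bad = 1 }
        $1 == "Media and Data Integrity Errors" && val + 0 > 0 {
            print "WARN Media and Data Integrity Errors: " val; bad = 1
        }
        END { if (!bad) print "OK"; exit bad }
    '
}

# Example:
#   sudo smartctl -a /dev/nvme0n1 | check_nvme_health
```

A spare level below 100% is reported as a NOTE rather than a WARN, since the table treats it as early wear and not as a reason to act.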

Routine monitoring

Even with email alerts, do a quick manual check monthly:

sudo smartctl -a /dev/nvme0n1 | grep -E "overall-health|Available Spare|Percentage Used|Media and Data"
sudo smartctl -a /dev/nvme1n1 | grep -E "overall-health|Available Spare|Percentage Used|Media and Data"
sudo smartctl -a -d sntasmedia /dev/sdb | grep -E "overall-health|Available Spare|Percentage Used|Media and Data"

(Use whatever lsblk shows the USB enclosure as in place of /dev/sdb; the -d sntasmedia flag is required to read its NVMe SMART data.)

Look for:
- PASSED (overall health)
- Available Spare: 100% (haven't started wearing)
- Percentage Used: 0% (or low number)
- Media and Data Integrity Errors: 0
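The monthly check can also be wrapped in a loop over all three drives. This is a rough sketch under a few assumptions: the DRIVES list and monthly_check name are invented for illustration, the SMARTCTL variable exists only so the command can be swapped out when trying the function, and it assumes smartctl accepts the by-id path directly when -d sntasmedia is given (as smartd does in this setup). It matches on overall-health rather than PASSED so that a failing drive still prints its status line.

```shell
# One "path [smartctl flags]" entry per line. Hypothetical list mirroring /etc/smartd.conf.
DRIVES='/dev/nvme0n1
/dev/nvme1n1
/dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 -d sntasmedia'

# Print the headline SMART lines for every drive in DRIVES.
# SMARTCTL may be set to override the command (default: sudo smartctl).
monthly_check() {
    printf '%s\n' "$DRIVES" | while read -r dev flags; do
        printf '== %s\n' "$dev"
        # $flags is left unquoted on purpose so "-d sntasmedia" splits into two arguments.
        ${SMARTCTL:-sudo smartctl} -a $flags "$dev" </dev/null |
            grep -E 'overall-health|Available Spare:|Percentage Used|Media and Data' ||
            echo 'no matching lines (could not read SMART data?)'
    done
}

# Example:
#   monthly_check
```

Matching "Available Spare:" with the trailing colon keeps the separate Available Spare Threshold line out of the summary.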

Troubleshooting

No email alerts arriving:

Verify smartd is running: sudo systemctl status smartd
Test email separately: echo test | sudo msmtp gabrielgabrie99@gmail.com
Check sendmail symlink: ls -la /usr/sbin/sendmail (should point to msmtp)
Check msmtp logs: sudo cat /var/log/msmtp.log
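The symlink check from the list above can be scripted so it gives a clear yes or no. A minimal sketch, assuming GNU readlink; sendmail_target is a hypothetical helper name and the optional path argument is only there for trying it on a scratch link.

```shell
# Report whether the sendmail path resolves to a binary named msmtp.
# sendmail_target is a hypothetical helper; returns 0 on a match, 1 otherwise.
sendmail_target() (
    path=${1:-/usr/sbin/sendmail}
    target=$(readlink -f "$path") || { echo "unresolvable: $path"; exit 1; }
    case ${target##*/} in
        msmtp) echo "ok: $path -> $target" ;;
        *)     echo "unexpected: $path -> $target"; exit 1 ;;
    esac
)

# Example:
#   sendmail_target
```

An "unexpected" result usually means another mail transfer agent owns the sendmail path, which is worth resolving before debugging smartd itself.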

smartd alerting about "open() of ATA device failed: No such device":

The drive smartd is monitoring isn't visible. Two common causes:

USB drive disconnected — check lsblk. If the drive is gone, see 05-backups.md troubleshooting.
Device name changed — if /etc/smartd.conf ever got switched to a /dev/sdX reference and the drive came back as a different sdX after a reconnect, smartd is looking in the wrong place. Always reference the USB backup drive by its stable /dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 path (with -d sntasmedia). A /dev/disk/by-uuid/ path would not work here — -d sntasmedia needs the whole-disk device, not a filesystem.

smartd fails to start:

sudo journalctl -u smartd --since "10 minutes ago"

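Common cause: a device listed in /etc/smartd.conf no longer exists (e.g., USB drive unplugged). Either reconnect the drive or comment out that line. Adding -d removable to the entry should also let smartd start while the drive is absent; smartd.conf(5) notes it can be combined with other -d types.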
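To find the offending entry quickly, the device paths in the config can be tested for existence. The sketch below is an assumption-laden helper, not a smartmontools feature: missing_devices is a made-up name, and it treats the first word of each active line as the device path, ignoring DEVICESCAN and backslash-continued lines.

```shell
# Print every device path from a smartd.conf-style file that does not exist right now.
# missing_devices is a hypothetical helper; the first field of each active line is
# taken as the device path. Comments, blank lines, and DEVICESCAN are skipped.
missing_devices() (
    conf=${1:-/etc/smartd.conf}
    sed -E -e 's/#.*//' -e '/^[[:space:]]*$/d' "$conf" |
    while read -r dev _rest; do
        [ "$dev" = "DEVICESCAN" ] && continue
        [ -e "$dev" ] || printf 'missing: %s\n' "$dev"
    done
)

# Example:
#   missing_devices /etc/smartd.conf
```

No output means every listed device is currently present, so the startup failure has some other cause and the journal is the next place to look.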

"Invalid Field in Command" entries in error logs:

This is normal, especially on Samsung OEM drives. The OS sends commands the drive doesn't support, the drive responds with "I don't recognize that." Not a hardware error — only watch Media and Data Integrity Errors for actual data corruption.

Files and locations

Purpose	Path
Configuration	`/etc/smartd.conf`
Service unit	`/lib/systemd/system/smartd.service`
Logs	`journalctl -u smartd`