09 — Routine Maintenance¶

What to do, how often, and why. Spending 15 minutes a month here saves hours of debugging later.

Monthly checklist (15 minutes)¶

First Saturday of every month is a good cadence.

1. Update the system¶

sudo apt update
sudo apt upgrade -y

If kernel updates were installed, reboot:

sudo reboot
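If you'd rather not guess whether the upgrade pulled in a kernel, Ubuntu normally drops a flag file when a package asks for a reboot. A minimal sketch, assuming the stock `/var/run/reboot-required` mechanism is present on this install; the function name `reboot_needed` is made up for illustration:

```shell
# Report whether Ubuntu has flagged a pending reboot.
# The flag path is a parameter so the function can be tried against a test file.
reboot_needed() {
    local flag="${1:-/var/run/reboot-required}"
    if [ -f "$flag" ]; then
        echo "reboot required"
        # The companion .pkgs file, when present, lists the packages that asked for it.
        [ -f "$flag.pkgs" ] && sort -u "$flag.pkgs"
        return 0
    fi
    echo "no reboot needed"
    return 1
}
```

Run `reboot_needed && sudo reboot` after the upgrade; if the flag is absent, nothing happens.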

Wait 30 seconds, SSH back in, verify everything came up:

sudo systemctl status smbd tailscaled smartd
df -h

2. Check backup health¶

tail -30 /mnt/backup/backup.log

Look for:

- Recent successful runs (yesterday's date should be present)
- "Backup completed" messages
- No "ERROR" or "FAILED" entries
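The same eyeball check can be turned into a pass/fail test. This is a rough sketch that assumes the log contains exactly the strings quoted above ("Backup completed", "ERROR", "FAILED"); it does not check dates, because the log's timestamp format isn't shown on this page. `backup_log_ok` is a hypothetical helper name:

```shell
# Scan the last N lines of a backup log for the signals listed above.
# Usage: backup_log_ok <logfile> [lines]
backup_log_ok() {
    local log="$1" lines="${2:-30}" recent
    recent=$(tail -n "$lines" "$log") || return 2

    # Any ERROR or FAILED entry inside the window counts as a problem.
    if printf '%s\n' "$recent" | grep -Eq 'ERROR|FAILED'; then
        echo "problem: ERROR/FAILED entries in last $lines lines of $log"
        return 1
    fi
    # At least one completion message must be present.
    if ! printf '%s\n' "$recent" | grep -q 'Backup completed'; then
        echo "problem: no 'Backup completed' in last $lines lines of $log"
        return 1
    fi
    echo "ok: $log"
}
```

Usage: `backup_log_ok /mnt/backup/backup.log`. A non-zero exit means the log deserves a proper read.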

ls /mnt/backup/daily/

Should show recent daily snapshots. Newest should be from today or yesterday.

df -h /mnt/backup

Backup drive shouldn't be more than ~80% full. If it is, see 05-backups.md about pruning.
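To make the ~80% rule checkable rather than something you read off `df -h`, here is a small sketch. It parses the portable `df -P` output; the function names and the default limit of 80 are illustrative choices, not part of the existing backup script:

```shell
# Print the usage percentage (digits only) of the filesystem holding a path.
mount_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage is at or below the limit (default 80), fail otherwise.
check_usage() {
    local path="$1" limit="${2:-80}" pct
    pct=$(mount_usage_pct "$path")
    if [ -z "$pct" ]; then
        echo "could not read usage for $path"
        return 2
    fi
    if [ "$pct" -gt "$limit" ]; then
        echo "WARN: $path is ${pct}% full (limit ${limit}%)"
        return 1
    fi
    echo "ok: $path is ${pct}% full"
}
```

Usage: `check_usage /mnt/backup 80`.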

3. Check drive health¶

sudo smartctl -a /dev/nvme0n1 | grep -E "PASSED|Available Spare|Percentage Used|Media and Data"
sudo smartctl -a /dev/nvme1n1 | grep -E "PASSED|Available Spare|Percentage Used|Media and Data"
# Backup drive: 990 PRO behind the ASM2462 USB-NVMe enclosure — needs -d sntasmedia and the
# stable by-id path (the /dev/sdX name is NOT stable; it re-enumerates after a USB reconnect)
sudo smartctl -a -d sntasmedia /dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0 | grep -E "PASSED|Available Spare|Percentage Used|Media and Data"

What you want to see:

- PASSED
- Available Spare: 100%
- Percentage Used: 0% (or low; it climbs slowly over years)
- Media and Data Integrity Errors: 0

(All three drives are NVMe now — the backup drive is a 990 PRO in a USB enclosure — so there are no SATA-style Reallocated_Sector_Ct / Wear_Leveling attributes to watch anymore.)
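If you want a yes/no answer instead of reading the grep output, the two hard signals (PASSED and zero media errors) can be checked by a filter. This is a sketch that assumes the usual `smartctl -a` NVMe line labels; `nvme_health_ok` is a hypothetical name, and Available Spare / Percentage Used are deliberately left out because this page gives no hard threshold for them:

```shell
# Read `smartctl -a` output for an NVMe drive on stdin.
# Exit 0 only if the health check reports PASSED and media errors are 0.
nvme_health_ok() {
    awk -F: '
        /overall-health/                   { if ($2 ~ /PASSED/) passed = 1 }
        /^Media and Data Integrity Errors/ { gsub(/[ ,]/, "", $2); media = $2 + 0; seen = 1 }
        END {
            if (!passed) { print "FAIL: health check did not report PASSED"; exit 1 }
            if (!seen)   { print "FAIL: media error count not found"; exit 1 }
            if (media != 0) { print "FAIL: media errors = " media; exit 1 }
            print "ok"
        }'
}
```

Usage: `sudo smartctl -a /dev/nvme0n1 | nvme_health_ok`. For the backup drive, keep the `-d sntasmedia` option and the by-id path from the commands above.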

If anything is concerning, see 06-drive-monitoring.md.

4. Check temperatures¶

sudo smartctl -a /dev/nvme0n1 | grep -i temperature
sudo smartctl -a /dev/nvme1n1 | grep -i temperature

Drives should be under 60°C at idle. Above 70°C suggests cooling issues (dust in fan, blocked vents, ambient temperature too high).
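The two thresholds can be applied mechanically. A sketch, assuming the composite reading appears on a line starting with `Temperature:` in the `smartctl -a` output; the label "warm" for the 60 to 70°C band is my own, since the text above only names the two ends:

```shell
# Extract the composite temperature (whole degrees C) from smartctl NVMe output on stdin.
nvme_temp_c() {
    awk '/^Temperature:/ { print $2; exit }'
}

# Classify a temperature against the 60 / 70 degree thresholds above.
temp_verdict() {
    local t="$1"
    if [ "$t" -gt 70 ]; then
        echo "hot"
    elif [ "$t" -ge 60 ]; then
        echo "warm"
    else
        echo "ok"
    fi
}
```

Usage: `temp_verdict "$(sudo smartctl -a /dev/nvme0n1 | nvme_temp_c)"`.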

5. Check for filesystem errors¶

sudo journalctl -p err --since "30 days ago" | head -30

A few entries from boot are normal. Any entries about I/O errors, filesystem corruption, or repeated failures need attention.

6. Check for USB disconnect events on the backup drive¶

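# dmesg only covers the current boot, so it starts empty after the reboot in step 1; the journal keeps earlier boots
sudo journalctl _TRANSPORT=kernel --since "30 days ago" | grep -i "usb disconnect"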
sudo grep -i "auto-remounted" /mnt/backup/backup.log

A few disconnects per year is normal background noise. If you're seeing them weekly or daily, the cable, port, or drive is having issues — investigate before it becomes a real failure.

The "auto-remounted" log entries (from the backup script) tell you how often the resilience auto-remount feature is firing — same signal, just from inside the backup workflow.

7. Check the system-state backup is current¶

ls -la /mnt/backup/system-state/system-config-*.tar.gz | tail -3

Newest archive should be from today or yesterday.
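"Today or yesterday" can be tested by modification time rather than by reading the listing. A minimal sketch using `find -mmin`; the helper name and the 48-hour default window are assumptions chosen to approximate that rule:

```shell
# Succeed if <dir> holds a file matching <pattern> modified within the
# last <hours> hours (default 48, roughly "today or yesterday").
has_recent_file() {
    local dir="$1" pattern="$2" hours="${3:-48}" hit
    hit=$(find "$dir" -maxdepth 1 -type f -name "$pattern" -mmin "-$((hours * 60))" | head -n 1)
    if [ -n "$hit" ]; then
        echo "ok: $hit"
    else
        echo "STALE: nothing matching $pattern in $dir within ${hours}h"
        return 1
    fi
}
```

Usage: `has_recent_file /mnt/backup/system-state 'system-config-*.tar.gz'`. Quote the pattern so the shell doesn't expand it before `find` sees it.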

Quarterly checklist (every 3 months)¶

Verify backups are actually readable¶

Backups that exist but can't be restored aren't backups. Test by restoring something:

# Pick a file and restore it to a test location
cp /mnt/backup/daily/<recent-date>/some-file /tmp/test-restore
md5sum /data/files/some-file /tmp/test-restore

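The MD5 sums should match, provided the live file hasn't changed since that snapshot was taken. If they differ for a file you know is unchanged, something is wrong with the backup.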
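The comparison can be wrapped so that it gives a clear verdict and exit code. This sketch swaps in `sha256sum` for `md5sum` (either one detects corruption; the swap is my choice, not the page's), and `verify_restore` is a hypothetical name:

```shell
# Compare a restored file against the live original by SHA-256.
# Exit 0 on match, 1 on mismatch, 2 if either file can't be read.
verify_restore() {
    local original="$1" restored="$2" a b
    a=$(sha256sum < "$original") || return 2
    b=$(sha256sum < "$restored") || return 2
    if [ "$a" = "$b" ]; then
        echo "match"
    else
        echo "MISMATCH: $restored differs from $original"
        return 1
    fi
}
```

Usage: `verify_restore /data/files/some-file /tmp/test-restore`. Pick a file you haven't edited since the snapshot, otherwise a mismatch tells you nothing.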

Rotate the off-site backup drive¶

(Once you have one — currently planned, not yet implemented.)

Bring the off-site drive home, sync it, take it back. Set a calendar reminder so you actually do this.

Rotate Gmail app password¶

Do this if the password may have been exposed anywhere unexpected (logs you've shared, screenshots, etc.). See 07-email-relay.md.

Annual checklist (once a year)¶

Audit installed packages¶

apt-mark showmanual

Are there things installed that you no longer need? Removing them keeps the attack surface small:

sudo apt remove --purge <package>
sudo apt autoremove

Review the AI context document¶

The z2mini-context-for-ai.md file should still accurately describe your current setup. If you've made changes that aren't in there, update it.

Review and prune old SSH/Tailscale devices¶

In the Tailscale admin panel (https://login.tailscale.com/admin), remove devices you no longer use. Old devices that retain access are a security gap.

Check for OS upgrade availability¶

Ubuntu 24.04 LTS is supported until 2029 (free) or 2034 (with Ubuntu Pro). Major version upgrades aren't urgent but eventually you'll want to plan one.

do-release-upgrade -c

Tells you whether a new LTS is available.

When you make config changes¶

Any time you change something significant on the Z2:

1. Make the change
2. Test it works
3. Run a backup manually so the change is captured: ~/scripts/backup-files.sh
4. Update the relevant doc in /docs/ so your documentation matches reality
5. Update the AI context document if the change affects ports, paths, services, or design decisions

The discipline of keeping docs in sync is the biggest payoff over time. Future-you will thank present-you.

When you bring up a new service¶

A new self-hosted service isn't "done" when the container starts — it's done when it's visible and documented. The full checklist lives in z2mini-context-for-ai.md → Checklist: bringing up a new service; the short version:

- Port not already reserved; bound to 127.0.0.1:<port>, never 0.0.0.0 or the tailnet IP; add a Caddyfile block (<svc>.z2mini.gabrielgabrie.com { reverse_proxy 127.0.0.1:<port> }), see 17-caddy.md
- /data/docker/<service>/ layout, compose mirrors upstream, numbered NN-<service>.md doc written and added to mkdocs.yml
- Port / container / paths / decision row added to both this doc set and 10-system-reference.md
- Backup story written into the service's doc (rsync vs pg_dump vs SQLite .backup)
- Homepage tile added: edit /data/docker/homepage/config/services.yaml (icon + href + server: my-docker + container:), add a widget: block only if Homepage ships one. See 13-homepage.md → Adding a new service tile. Beszel auto-discovers the container, so there is nothing to do there.

The Homepage tile is the step that's been forgotten before (Radicale and Vaultwarden sat dashboard-invisible for days). If you only remember one item from this list, remember that one.
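The "bound to 127.0.0.1, never 0.0.0.0" rule from the checklist is easy to get wrong in a compose file and easy to verify afterwards. A sketch that filters `docker ps` output for wildcard bindings; the function name is made up, and it only catches all-interface bindings (IPv4 or IPv6), not a binding to the tailnet IP:

```shell
# Read "name<TAB>ports" lines on stdin, as printed by
#   docker ps --format '{{.Names}}\t{{.Ports}}'
# Print every container that publishes a port on all interfaces.
# Exit 1 if any were found, 0 if the list is clean.
wildcard_bindings() {
    awk -F'\t' '
        $2 ~ /(^|, )(0\.0\.0\.0|\[::\]|::):[0-9]/ { print $1 ": " $2; found = 1 }
        END { exit found + 0 }'
}
```

Usage: `docker ps --format '{{.Names}}\t{{.Ports}}' | wildcard_bindings`. Whether the Caddy container itself shows up depends on how its ports are published, which this page doesn't spell out; every other service should come back clean.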

Useful commands you'll reach for repeatedly¶

Service management¶

sudo systemctl status <service>          # check state
sudo systemctl restart <service>         # restart
sudo systemctl stop <service>            # stop
sudo systemctl start <service>           # start
sudo systemctl enable <service>          # start at boot
sudo systemctl disable <service>         # don't start at boot

sudo journalctl -u <service> --since "1 hour ago"   # recent logs
sudo journalctl -u <service> -f          # live tail

Common services on this system: smbd, tailscaled, smartd, cron, systemd-networkd, wpa_supplicant.

Docker (almost everything else runs here)¶

docker ps                                                        # running containers
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'   # cleaner, audience-friendly view
docker stats --no-stream                                         # per-container CPU/RAM right now
docker logs -f --tail 50 <container>                             # live-tail a container's log (Ctrl-C to stop)

Each stack lives in /data/docker/<stack>/ — stacks are immich, navidrome, homepage, beszel, radicale, vaultwarden (plus icloudpd, a one-shot tool). From inside a stack directory:

cd /data/docker/immich
docker compose ps                              # this stack's containers + health
docker compose logs -f immich-server           # live-tail one service (Ctrl-C to stop)
docker compose pull && docker compose up -d    # update images, recreate changed containers
docker compose restart <service>               # restart one service
docker compose down                            # stop the whole stack
docker compose config --services               # list this stack's service names

Full container ↔ port ↔ path map: 10-system-reference.md.

Disk and filesystem¶

df -h                          # disk usage by mount
du -sh /path/to/dir            # size of a directory
ncdu /path/to/dir              # interactive disk usage explorer
lsblk                          # block devices and partitions
mount | grep <something>       # what's mounted where

Process and resource monitoring¶

htop                           # interactive process viewer (q to quit)
btop                           # prettier live monitor — CPU/RAM/disk/net graphs (q to quit)
free -h                        # memory usage
uptime                         # system load + how long it's been up
ps aux | grep <something>      # find a process
nvidia-smi                     # GPU usage (Quadro T2000) — what Immich's ML is doing

Network¶

ip addr show                   # all network interfaces and IPs
networkctl status              # systemd-networkd status
tailscale status               # Tailscale state
sudo iw dev wls3f3 link        # Wi-Fi signal/quality
ping <host>                    # connectivity test

Logs¶

tail -f /var/log/syslog                     # live system log
sudo journalctl -p err --since "1 day ago"  # errors in last day
sudo journalctl -b                          # since last boot

Tailscale (the network layer)¶

tailscale status                  # every device on the tailnet + connection type (direct vs relayed)
tailscale ip -4                   # this machine's tailnet IP (100.67.235.68)
tailscale ping <device>           # test direct connectivity to another tailnet device
tailscale netcheck                # NAT / relay diagnostics
tailscale dns status              # what resolvers Tailscale is using (should list 1.1.1.1 / 8.8.8.8)
# NOTE: `tailscale serve` is no longer used — the HTTPS front door is the Caddy
# reverse proxy now (docker compose -f /data/docker/caddy/docker-compose.yml ps).

System identity ("what is this box?")¶

fastfetch                         # one-screen summary: OS, kernel, uptime, CPU, RAM, disk
hostnamectl                       # hostname, OS, hardware model, kernel
lsb_release -a                    # Ubuntu release
uname -a                          # kernel version
lsblk                             # drives & partitions (OS / data / backup)

tmux — split panes & jobs that outlive your SSH session¶

Useful for running two things at once, and for long jobs that must survive a disconnect (the icloudpd migration loop runs this way).

tmux                              # start a session
tmux attach                       # re-attach after a disconnect (alias: tmux a)
tmux ls                           # list running sessions

Inside tmux the prefix key is Ctrl-b — press and release it, then:

Keys	Action
`Ctrl-b` then `"`	split pane — top / bottom
`Ctrl-b` then `%`	split pane — left / right
`Ctrl-b` then arrow	move between panes
`Ctrl-b` then `d`	detach (session keeps running in the background)
`Ctrl-b` then `x`	close the current pane

Showing the server to someone¶

A ~60-second walkthrough — every command is read-only, safe to run in front of anyone:

fastfetch                                                   # 1. what the box is (HP Z2 Mini, Ubuntu Server)
uptime                                                      # 2. how long it's been running, untouched
btop                                                        # 3. live load — barely breaks a sweat (q to quit)
docker ps --format 'table {{.Names}}\t{{.Status}}'          # 4. every service it runs
cd /data/docker/immich && docker compose logs -f immich-server   # 5. watch it work — now open the Immich app and scroll; log lines stream live (Ctrl-C to stop)
tailscale status                                            # 6. how it's reachable — private mesh, nothing exposed to the internet
docker compose -f /data/docker/caddy/docker-compose.yml ps  # 7. the single HTTPS front door (Caddy → <svc>.z2mini.gabrielgabrie.com)
du -sh /data/docker/immich/library                          # 8. the scale of it (~500 GB of photos)

Then open the dashboards in a browser to tie it together: Homepage at https://home.z2mini.gabrielgabrie.com, Beszel at https://beszel.z2mini.gabrielgabrie.com. (For the split-pane "mission control" version, run tmux first and put btop in one pane and the docker compose logs -f in another.)

Keep tours read-only. Don't cat .env files on screen — they hold DB passwords, API tokens, and signing keys. Don't run sudo anything you haven't rehearsed.

If something is wrong and you don't know what¶

A standard diagnostic sweep:

# What's the current state?
sudo systemctl status smbd tailscaled smartd cron

# Anything failing?
sudo systemctl --failed

# Recent errors?
sudo journalctl -p err --since "1 day ago" | head -50

# Disk OK?
df -h

# Memory OK?
free -h

# Load OK?
uptime

# Drives OK?
sudo smartctl -H /dev/nvme0n1   # smartctl accepts only one device per invocation
sudo smartctl -H /dev/nvme1n1
sudo smartctl -H -d sntasmedia /dev/disk/by-id/usb-ASMT_2462_NVME_2504178506CB-0:0   # backup drive (USB-NVMe enclosure)

If something pops out as obviously wrong, dig into that. If everything looks fine but you're seeing a problem, the issue is probably client-side (your laptop, your Wi-Fi, your iPhone) rather than on the Z2.