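smartctl: disk health checks and early warning of failing disks

One-sentence definition

smartctl reads SMART (Self-Monitoring, Analysis and Reporting Technology) data from hard disks and SSDs: temperature, cumulative read/write volume, error counters, and remaining life. In production it is used to catch failing disks early, instead of finding out only after a disk has died.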

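Typical scenarios

A node's dmesg reports I/O errors and you need to judge whether the disk is failing
Periodic SMART self-tests (for example a weekly cron scan)
Baseline test for the disks in newly installed machines
Monitoring integration (Prometheus smartctl_exporter)
SSD life estimation (data written, remaining spare)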

Install

apt install -y smartmontools                # Ubuntu / Debian
yum install -y smartmontools                # CentOS / RHEL

smartmontools includes:

smartctl: the command-line tool (the one you use most)
smartd: the background daemon for scheduled checks and alerting

1. Viewing a disk's SMART information

smartctl -i /dev/sda                        # basic information
smartctl -H /dev/sda                        # overall health
smartctl -a /dev/sda                        # **all** (everything; the most commonly used)

`-i` basic information

$ smartctl -i /dev/sda
smartctl 7.3 (build ...)
...

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 1TB
Serial Number:    S5XXXXX...
LU WWN Device Id: 5 002538 ...
Firmware Version: SVT02B6Q
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    Solid State Device                ← SSD
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    2026-05-27 14:30:00 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

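Key fields:

Field	Meaning
`Device Model`	Vendor model
`Serial Number`	Serial number (important: used for RMA and replacement records)
`Firmware Version`	Firmware version
`User Capacity`	Capacity
`Rotation Rate`	`Solid State Device` = SSD; `7200 rpm` = HDD
`SMART support is: Enabled`	Must be Enabled for the commands that follow to work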

`-H` overall health

$ smartctl -H /dev/sda
SMART overall-health self-assessment test result: PASSED
                                                  ^^^^^^
                                                  good

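Result	Meaning
`PASSED`	The disk considers itself healthy overall
`FAILED!`	The disk is about to die: replace it immediately

PASSED does not mean the disk is problem-free. Check the individual attributes.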
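For scripting, the exit status of smartctl carries more detail than the PASSED / FAILED text. It is a bitmask documented under EXIT STATUS in smartctl(8). The sketch below decodes it; `decode_smartctl_status` is a hypothetical helper name, and the wording of each message is a paraphrase of the man page, not its exact text.

```shell
# Decode the smartctl exit status bitmask (bits 0-7, see smartctl(8) EXIT STATUS).
# The function body runs in a subshell so its variables do not leak.
decode_smartctl_status() (
  status=$1
  bit=0
  for meaning in \
    "command line did not parse" \
    "device open failed" \
    "a SMART or ATA command failed, or a checksum error occurred" \
    "SMART status reports DISK FAILING" \
    "a prefail attribute is at or below its threshold" \
    "an attribute was at or below its threshold in the past" \
    "the device error log contains errors" \
    "the self-test log contains errors"
  do
    # Print the meaning of every bit that is set in the status
    if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
      echo "bit $bit: $meaning"
    fi
    bit=$((bit + 1))
  done
)
```

Typical use would be `smartctl -H /dev/sda >/dev/null; decode_smartctl_status $?`. A status of 0 prints nothing.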

`-a` all information

$ smartctl -a /dev/sda
... (basic info + health + SMART attributes + error log + self-test log + ...)

The output is long; the sections below go through it piece by piece.

2. SMART Attributes: spotting signs of a failing disk

$ smartctl -A /dev/sda                      # attributes only
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always   -           0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always   -           5000
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always   -           50
177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always   -           50
194 Temperature_Celsius     0x0022   100   050   000    Old_age   Always   -           45
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Always   -           0
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always   -           1000000000
^^^ ^^^^^^^^^^^^^^^^^^^^^^                                                            ^^^^^^^^^^^^
ID   attribute name                                                                    raw value (the key number)

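Column reference:

Column	Meaning
`ID#`	Attribute ID (common IDs such as 5, 9, 194, 197 are widely shared, but many are vendor-specific)
`ATTRIBUTE_NAME`	Attribute name
`VALUE`	Current normalized value (1-253, usually starting at 100 or 200; higher is better)
`WORST`	Worst normalized value ever recorded
`THRESH`	Threshold (VALUE <= THRESH means the attribute has failed)
`TYPE`	`Pre-fail`: critical, failure predicts imminent disk loss / `Old_age`: informational wear and usage indicator
`WHEN_FAILED`	`-` never failed; `FAILING_NOW` failing now; `In_the_past` failed at some point in the past
`RAW_VALUE`	Raw, vendor-defined counter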
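The threshold rule in the table can be checked mechanically. smartctl treats an attribute as failed when its normalized VALUE is at or below THRESH; a THRESH of 000 means the attribute cannot fail this way. A minimal sketch, assuming the standard 10-column `smartctl -A` layout shown above (`attrs_at_or_below_thresh` is a hypothetical helper name):

```shell
# Reads `smartctl -A` output on stdin and prints every attribute whose
# normalized VALUE (column 4) is at or below a non-zero THRESH (column 6).
attrs_at_or_below_thresh() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) {
         print $1, $2, "VALUE=" $4, "THRESH=" $6
       }'
}
```

Usage would be `smartctl -A /dev/sda | attrs_at_or_below_thresh`; empty output means no attribute is at or below its threshold.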

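Key attributes you must check

5 Reallocated_Sector_Ct

Number of reallocated sectors: bad blocks that the disk has transparently replaced with spares.

RAW_VALUE	Meaning
0	Perfect
A few	Normal aging
Dozens	Watch closely
Hundreds or more	The disk is failing; replace immediately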

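197 Current_Pending_Sector

Sectors pending reallocation: a read failed, but the disk has not yet confirmed the sector as bad.

RAW_VALUE 0       OK
RAW_VALUE > 0     **Warning**: read errors are happening right now

Investigate any value above 0.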

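198 Offline_Uncorrectable

Sectors that could not be corrected during offline scans: blocks that are truly bad.

RAW_VALUE 0       OK
RAW_VALUE > 0     **The disk is degrading: back up immediately and replace it**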

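9 Power_On_Hours

Power-on hours: how long the disk has been in service.

5000 hours ≈ 7 months of 24x7 operation
40000+ hours ≈ 4-5 years of 24x7 operation

Old disk + high hours → be cautious about using it for critical data.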

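194 Temperature_Celsius

Temperature.

< 40°C: normal
40-50°C: warm
50-60°C: too hot; sustained operation in this range shortens disk life
> 60°C: dangerous

High readings usually point to machine-room cooling or server fan problems.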

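177 Wear_Leveling_Count (SSD)

SSD wear-leveling count. The RAW value is the erase count (average erase cycles per block).

VALUE (normalized, counts down from 100):

100       New drive
50        Half of the rated life used
10        Near end of life
<5        **SSD life is almost exhausted**

The vendor's guaranteed TBW (Terabytes Written) roughly corresponds to this VALUE reaching 0.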

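241 Total_LBAs_Written (SSD)

Total number of LBAs written, from which you compute the data written.

RAW × 512 bytes / 1024^4 = data written (TiB)

# Example: 1000000000 LBAs × 512 / 1024^4 ≈ 0.47 TiB written

This assumes the drive counts 512-byte LBAs. Some vendors report attribute 241 in other units, so check the drive's documentation.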

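Compare against the vendor's TBW rating (vendors quote decimal TB; 1 TiB ≈ 1.1 TB):

Samsung 870 EVO 1TB: 600 TBW
Other models: take the TBW or DWPD figure from the vendor datasheet (enterprise SSDs are usually rated far higher than consumer ones; HDD families such as WD Red Pro carry a yearly workload rating rather than a TBW)

Approaching the limit → plan a replacement.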
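The RAW-to-TiB formula above is easy to wrap in a helper. This is a minimal sketch; `lbas_to_tib` is a hypothetical name, and the 512-byte default is an assumption that does not hold for every drive.

```shell
# Convert the RAW_VALUE of attribute 241 to TiB written.
# $1 = LBAs written, $2 = bytes per LBA (assumed 512 when omitted)
lbas_to_tib() {
  awk -v lbas="$1" -v size="${2:-512}" \
    'BEGIN { printf "%.2f\n", lbas * size / (1024 ^ 4) }'
}
```

For the example in the text, `lbas_to_tib 1000000000` prints 0.47.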

Spotting signs of a failing disk at a glance

$ smartctl -A /dev/sda | grep -E "Reallocated|Pending|Uncorrectable"
  5 Reallocated_Sector_Ct ... 0           ← all 0, healthy
197 Current_Pending_Sector ... 0
198 Offline_Uncorrectable ... 0

$ smartctl -A /dev/sdb | grep -E "Reallocated|Pending|Uncorrectable"
  5 Reallocated_Sector_Ct ... 320          ← 320 sectors remapped: warning
197 Current_Pending_Sector ... 5           ← 5 sectors pending remap: a problem
198 Offline_Uncorrectable ... 12           ← bad blocks: replace immediately

Investigate any Reallocated / Pending / Uncorrectable value above 0, and replace the disk if the numbers keep rising.
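The grep above still needs a human to read the numbers. A small sketch that turns the same rule into an exit code, assuming the standard 10-column `smartctl -A` layout (`bad_sector_report` is a hypothetical helper name):

```shell
# Reads `smartctl -A` output on stdin. Prints a WARN line for attributes
# 5, 197 and 198 when their RAW_VALUE (column 10) is above 0, and exits
# with status 1 if any of them is non-zero.
bad_sector_report() {
  awk '($1 == 5 || $1 == 197 || $1 == 198) && NF >= 10 {
         if (($10 + 0) > 0) { print "WARN", $2, $10; bad = 1 }
       }
       END { exit (bad ? 1 : 0) }'
}
```

In a cron job this could be used as `smartctl -A /dev/sda | bad_sector_report || echo "check /dev/sda"`. Tracking whether the numbers rise over time still needs stored history or a monitoring system.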

3. NVMe disk attributes

NVMe uses a different set of health fields (not the ATA SMART attribute table):

$ smartctl -a /dev/nvme0n1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00                  ← 0 = no warning
Temperature:                        45 Celsius
Available Spare:                    100%                  ← spare blocks remaining
Available Spare Threshold:          10%                   ← warning below 10%
Percentage Used:                    5%                    ← **5% of rated life used, about 95% left**
Data Units Read:                    1,234,567 [632 GB]
Data Units Written:                 2,345,678 [1.20 TB]
Host Read Commands:                 12,345,678
Host Write Commands:                23,456,789
Controller Busy Time:                5,000
Power Cycles:                       10
Power On Hours:                     500
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0                     ← data error count
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

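Key NVMe fields:

Field	Meaning	Abnormal when
`Critical Warning`	Warning bitmap	Any non-zero value needs investigation
`Available Spare`	Percentage of spare blocks remaining	Below `Available Spare Threshold` (10% here)
`Percentage Used`	Share of rated life consumed (can exceed 100%)	> 80%: plan a replacement
`Data Units Written`	Total data written (one unit = 1000 × 512 bytes)	Compare with the vendor TBW
`Media and Data Integrity Errors`	Data integrity errors	> 0 is a warning
`Unsafe Shutdowns`	Number of unclean power losses	Occasional is normal; frequent suggests a power problem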
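The checks in the table can be applied to the text output directly. The sketch below assumes the field labels appear exactly as in the sample output above; `nvme_health_warnings` is a hypothetical helper name, and the 80% limit is the rule of thumb from the table, not a value from the NVMe specification.

```shell
# Reads `smartctl -a /dev/nvmeXnY` text on stdin and prints one WARN line
# per finding. $1 = Percentage Used limit (defaults to 80).
nvme_health_warnings() {
  awk -v limit="${1:-80}" '
    /^Critical Warning:/ && $NF != "0x00"  { print "WARN critical_warning=" $NF }
    /^Available Spare:/                    { spare = $NF + 0; seen = 1 }
    /^Available Spare Threshold:/          { thresh = $NF + 0 }
    /^Percentage Used:/ && ($NF + 0) > (limit + 0) { print "WARN percentage_used=" $NF }
    /^Media and Data Integrity Errors:/    {
      errs = $NF; gsub(/,/, "", errs)      # strip thousands separators
      if (errs + 0 > 0) print "WARN media_errors=" errs
    }
    END { if (seen && spare < thresh) print "WARN available_spare=" spare "%" }
  '
}
```

Usage would be `smartctl -a /dev/nvme0n1 | nvme_health_warnings`. For anything beyond a quick check, the JSON output of `smartctl -j` is a sturdier thing to parse than text.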

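4. Self-tests with `-t`

Ask the disk to test itself:

smartctl -t short /dev/sda                   # short self-test (1-2 minutes)
smartctl -t long /dev/sda                    # long self-test (several hours)
smartctl -t offline /dev/sda                  # offline self-test
smartctl -X /dev/sda                          # abort the running self-test

The self-test runs asynchronously inside the disk, so the command returns immediately. To check progress: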

$ smartctl -c /dev/sda | grep -A1 "Self-test execution status"
Self-test execution status:      ( 243) Self-test routine in progress...
                                        30% of test remaining.

When it finishes, read the result:

$ smartctl -l selftest /dev/sda
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error  00%      5000             -
# 2  Extended offline    Completed without error  00%      4500             -
                          ^^^^^^^^^^^^^^^^^^^^^^^
                          PASS

`Completed without error` means OK. Statuses such as `Completed: read failure` mean the disk has a problem.
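To check the self-test log from a script, it is enough to look for entries whose status is anything other than `Completed without error`. A minimal sketch, assuming the ATA log layout shown above (`selftest_problems` is a hypothetical helper name). Note that it also reports tests that are still running or were aborted, since those are not clean completions either.

```shell
# Reads `smartctl -l selftest` output on stdin and prints every log entry
# (lines starting with "# <number>") that did not complete without error.
selftest_problems() {
  awk '/^# *[0-9]+ / && !/Completed without error/ { print }'
}
```

Usage would be `smartctl -l selftest /dev/sda | selftest_problems`; empty output means every logged test completed cleanly.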

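Periodic self-tests (recommended in production)

Using the smartd daemon:

# /etc/smartd.conf
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m alerts@example.com
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
                            short self-test daily at 02:00, long self-test every Saturday at 03:00

systemctl enable --now smartd

A short self-test takes a few minutes. A long self-test runs in the background and normally does not block application reads and writes, though latency can rise a little while it runs.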

Or use cron:

# /etc/cron.d/smart-check
0 2 * * * root smartctl -t short /dev/sda > /dev/null
0 3 * * 6 root smartctl -t long /dev/sda > /dev/null

5. Reading the error log

$ smartctl -l error /dev/sda
SMART Error Log Version: 1
No Errors Logged                            ← healthy

# or
SMART Error Log Version: 1
ATA Error Count: 5                          ← 5 errors
        ...
Error 5 [4] occurred at disk power-on lifetime: 1234 hours
  When the command that caused the error occurred, the device was active or idle.
  Error: UNC at LBA = 0x12345678 = 305419896
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                            LBA of the bad sector
        ...

UNC = Uncorrectable: the data could not be read. This is a bad block.

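# Find the file that owns the LBA (tedious but possible; ext2/3/4 only)
# icheck takes a filesystem block number, not a disk LBA:
#   fs_block = (LBA - partition start sector) * 512 / filesystem block size
$ debugfs /dev/sda3
debugfs:  icheck <fs_block>         # find the inode
debugfs:  ncheck <inode>            # find the file name

If a file is damaged, check whether it can be restored from backup.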
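The LBA in the SMART error log counts sectors from the start of the whole disk, while debugfs `icheck` expects a block number inside the filesystem, so the value has to be converted first. A minimal sketch of that arithmetic, assuming 512-byte logical sectors (`lba_to_fs_block` is a hypothetical helper name, and the partition start of 2048 used below is an illustrative value, not taken from the example disk):

```shell
# Convert a whole-disk LBA to a filesystem block number.
# $1 = LBA from the error log
# $2 = partition start sector (e.g. from /sys/block/sda/sda3/start)
# $3 = filesystem block size in bytes (e.g. from `tune2fs -l`), default 4096
lba_to_fs_block() (
  lba=$1
  part_start=$2
  block_size=${3:-4096}
  echo $(( (lba - part_start) * 512 / block_size ))
)
```

With a partition starting at sector 2048 and 4096-byte blocks, `lba_to_fs_block 305419896 2048 4096` prints 38177231, which is the number to hand to `icheck`.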

6. Real-world scenarios

Scenario 1: a node's dmesg reports I/O errors and you need to assess the disk

$ dmesg -T | grep -iE "i/o|sector|sd" | tail -5
[Mon May 27 14:30:00 2026] sd 0:0:0:0: [sda] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon May 27 14:30:00 2026] sd 0:0:0:0: [sda] tag#5 Sense Key : Medium Error [current]
[Mon May 27 14:30:00 2026] sd 0:0:0:0: [sda] tag#5 Add. Sense: Unrecovered read error
[Mon May 27 14:30:00 2026] sd 0:0:0:0: [sda] tag#5 CDB: Read(10) 28 00 00 12 34 56 00 00 08 00
[Mon May 27 14:30:00 2026] blk_update_request: I/O error, dev sda, sector 1193046

→ sda has a sector that cannot be read. Run smartctl right away:

$ smartctl -a /dev/sda | grep -E "Reallocated|Pending|Uncorrectable|overall-health"
SMART overall-health self-assessment test result: PASSED
  5 Reallocated_Sector_Ct ... 320           ← 320 remapped
197 Current_Pending_Sector ... 5            ← 5 pending remap
198 Offline_Uncorrectable ... 12            ← 12 bad blocks

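# Health still says PASSED, but the disk is in bad shape
# Immediately:
# 1. Back up the data
# 2. Drain the node / migrate the PVs away
# 3. Contact the IDC / cloud provider to replace the disk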
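When the kernel log is noisy, it helps to reduce it to the list of devices and sectors that actually failed. A minimal sketch, assuming the kernel prints the `I/O error, dev <name>, sector <n>` wording shown in the dmesg excerpt above (`io_error_sectors` is a hypothetical helper name; other kernel versions may word the message differently):

```shell
# Reads kernel log text on stdin and prints unique "device sector" pairs
# taken from block-layer I/O error lines.
io_error_sectors() {
  sed -n 's|.*I/O error, dev \([^,]*\), sector \([0-9][0-9]*\).*|\1 \2|p' | sort -u
}
```

Usage would be `dmesg | io_error_sectors`. Repeated errors on the same few sectors point at bad blocks, while errors scattered across many sectors are a hint to also check cabling and the controller.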

Scenario 2: checking all nodes in bulk

for h in m1 m2 m3 m4 m5; do
  echo "=== $h ==="
  ssh "$h" 'smartctl -H /dev/sda 2>/dev/null | grep "overall-health"'
done

Ansible or a monitoring system is a cleaner way to do this.

Scenario 3: monitoring integration (Prometheus)

# Install smartctl_exporter
# --privileged is needed so the container can access the disks
docker run -d \
  --name smartctl-exporter \
  --privileged \
  -v /:/host:ro \
  -p 9633:9633 \
  prometheuscommunity/smartctl-exporter

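Prometheus scrapes :9633/metrics. Metric and label names vary between exporter versions, so check them against your own /metrics output:

# Reallocated sector count is rising (this is a gauge, so use delta() rather than rate())
delta(smartctl_device_attribute{attribute_name="Reallocated_Sector_Ct",attribute_value_type="raw"}[1h]) > 0

# Temperature too high
smartctl_device_temperature{temperature_type="current"} > 50

# SSD life (NVMe Percentage Used)
smartctl_device_percentage_used > 80

Then wire the alerts into Alertmanager.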

Scenario 4: estimating SSD life

$ smartctl -a /dev/sda | grep -E "Power_On_Hours|Total_LBAs_Written|Wear"
  9 Power_On_Hours          ... 8000
177 Wear_Leveling_Count      ... 80           ← VALUE 80: 20% used
241 Total_LBAs_Written        ... 500000000

# Compute the data written
echo "scale=2; 500000000 * 512 / 1024 / 1024 / 1024 / 1024" | bc
# about 0.23 TiB written

# 8000 hours ≈ 333 days
# 0.23 TiB / 333 days ≈ 0.7 GiB/day
# Vendor rating = 600 TBW (treated as 600 TiB here)
# Remaining life = (600 - 0.23) TiB × 1024 GiB/TiB / 0.7 GiB/day / 365 ≈ 2400 years

# → no need to worry about this drive's endurance

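# Database disk
Total_LBAs_Written = 2147483648         # about 2.1 billion LBAs × 512 B = 1 TiB
Power_On_Hours = 8000                    # about 1 year
Wear_Leveling_Count = 30                 # VALUE 30: 70% used

# 1 TiB written per year against a 600 TBW rating
# → about 600 years of theoretical life

# But a wear VALUE of 30 (70% used) does not match that at all
# → host writes understate the real wear (write amplification, or an optimistic vendor rating); the drive really is 70% used
# → 70% used in about 1 year leaves roughly 5 months at this rate
# → plan the replacement well within 1 year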

Note that the Wear_Leveling_Count VALUE and the vendor TBW do not map onto each other exactly; go with whichever is more pessimistic.
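Both estimates from this scenario can be computed side by side so the more pessimistic one is obvious. The sketch below is a linear extrapolation only, assuming 512-byte LBAs, a rated endurance given in TiB, and a wear VALUE that counts down from 100; `tbw_years_left` and `wear_years_left` are hypothetical helper names.

```shell
# Years left according to host writes.
# $1 = Total_LBAs_Written RAW, $2 = Power_On_Hours RAW, $3 = rated endurance in TiB
tbw_years_left() {
  awk -v lbas="$1" -v hours="$2" -v rated="$3" 'BEGIN {
    if (lbas <= 0 || hours <= 0) { print "n/a"; exit }
    written = lbas * 512 / (1024 ^ 4)      # TiB written so far
    per_day = written / (hours / 24)       # average TiB written per day
    printf "%.1f\n", (rated - written) / per_day / 365
  }'
}

# Years left according to the normalized wear VALUE.
# $1 = Wear_Leveling_Count VALUE, $2 = Power_On_Hours RAW
wear_years_left() {
  awk -v value="$1" -v hours="$2" 'BEGIN {
    used = 100 - value                     # percent of rated life consumed
    if (used <= 0 || hours <= 0) { print "n/a"; exit }
    printf "%.2f\n", (value / used) * hours / 8760
  }'
}
```

For the database disk above, `wear_years_left 30 8000` prints 0.39, roughly five months, while the host-write estimate is in the hundreds of years. Plan around the smaller number.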

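7. Anti-patterns

Anti-pattern 1: trusting `-H` PASSED on its own

$ smartctl -H /dev/sda
... PASSED

PASSED does not mean the disk is healthy. Many signs of a failing disk never trigger FAILED until the disk can no longer store data.

Always read the attributes, especially Reallocated / Pending / Uncorrectable.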

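Anti-pattern 2: assuming SMART shows everything

SMART exposes only a predefined set of attributes. Some failure modes are invisible to it:

Controller failure
Interface cable problems
Power problems
Firmware bugs

For these, dmesg and I/O errors are more direct evidence than SMART.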

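Anti-pattern 3: fearing that a self-test will hang the node

$ smartctl -t long /dev/sda
# the self-test takes several hours
# it adds a small amount of extra I/O on the disk and does not block applications

Short and long self-tests run asynchronously inside the disk and do not stall I/O, so run them freely.

An offline test (`-t offline`) is different: some disks refuse other I/O while it runs. In production, stick to short / long.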

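Anti-pattern 4: pointing smartctl at the RAID device

$ smartctl -a /dev/md0
# a RAID device has no SMART data

Use mdadm for software RAID and the vendor tool (megacli / storcli) for hardware RAID. For the physical disks inside the array:

# Software RAID: query the member disks
smartctl -a /dev/sda
smartctl -a /dev/sdb

# Hardware RAID: query the member disks through the controller
smartctl -a -d megaraid,N /dev/sda        # N = physical disk ID on the LSI/MegaRAID controller
smartctl -d sat -a /dev/sda                # SAT (SCSI-to-ATA translation) pass-through for SATA disks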

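Anti-pattern 5: running smartctl inside a virtual machine

$ smartctl -a /dev/vda
/dev/vda: Unable to detect device type
Please specify device type with the -d option.
                                           # or a similar error

Virtualized disks usually expose no SMART data because the hypervisor hides it.

→ smartctl is only useful on physical / bare-metal machines.
→ In the cloud or in a VM, use the disk health metrics the provider exposes (CloudWatch / Alibaba Cloud monitoring).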

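Anti-pattern 6: using the wrong metric to judge SSD life

Power_On_Hours       = 1 year
Total_LBAs_Written   = 100 TiB
Wear_Leveling_Count  = 50  ← 50% used

Which of the three should you use?

Vendor commitment	Key metric
TBW (Terabytes Written)	Total_LBAs_Written
DWPD (Drive Writes Per Day)	Total_LBAs_Written converted to bytes, divided by days powered on and by drive capacity
General wear	Wear_Leveling_Count

Go with the most pessimistic one. If the three disagree widely, check the drive's spec sheet.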
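DWPD is the one metric in the table that needs a calculation: bytes written per day divided by the drive capacity. A minimal sketch, assuming 512-byte LBAs and that the drive was powered on for its whole service life (`dwpd` is a hypothetical helper name):

```shell
# Average drive writes per day so far.
# $1 = Total_LBAs_Written RAW, $2 = Power_On_Hours RAW, $3 = capacity in bytes
dwpd() {
  awk -v lbas="$1" -v hours="$2" -v cap="$3" 'BEGIN {
    if (hours <= 0 || cap <= 0) { print "n/a"; exit }
    printf "%.2f\n", (lbas * 512) / (hours / 24) / cap
  }'
}
```

The capacity is the `User Capacity` byte count from `smartctl -i`. Compare the result with the DWPD figure in the vendor datasheet, which is normally quoted over the warranty period.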

8. Related commands

dmesg: disk errors show up here first
lsblk: list disks / find device names
iostat: identify slow disks
systemctl: manage smartd.service
mdadm: software RAID health
megacli / storcli: hardware RAID vendor tools
nvme: NVMe-specific tool (nvme smart-log /dev/nvme0)