ZFS Disk Replacement Log

What Happened

The file server suddenly felt very slow, so I logged in to the server to check the state of ZFS.

# zpool status
  pool: fspool
 state: ONLINE
  scan: scrub repaired 0 in 18h19m with 0 errors on Mon Nov  3 21:04:18 2014
config:

    NAME        STATE     READ WRITE CKSUM
    fspool      ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        ada0    ONLINE       0     0     0
        ada1    ONLINE       0     0     0
        ada2    ONLINE       0     0     0
        ada3    ONLINE       0     0     0
        ada5    ONLINE       0     0     0

errors: No known data errors

The zpool looks healthy.

Then I looked through /var/log/messages and was startled to find that one of the disks in the server was already reporting SMART errors.

# cat /var/log/messages
Nov  6 15:16:45 hfs3 smartd[1373]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Nov  6 15:16:45 hfs3 smartd[1373]: Device: /dev/ada1, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Nov  6 15:46:45 hfs3 smartd[1373]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Nov  6 15:46:45 hfs3 smartd[1373]: Device: /dev/ada1, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Nov  6 16:16:45 hfs3 smartd[1373]: Device: /dev/ada1, FAILED SMART self-check. BACK UP DATA NOW!
Nov  6 16:16:45 hfs3 smartd[1373]: Device: /dev/ada1, Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
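
Log entries like the ones above are easy to miss in a busy /var/log/messages. As a minimal sketch, the helper below summarises which devices smartd has flagged. It assumes the smartd message format shown above; the function name smartd_failed_devices is made up for this example and is not part of smartmontools.

```shell
# Sketch: list devices that smartd has flagged as failing in a syslog file.
# Reads log text on stdin so it can be tried on a saved copy of the log.
# smartd_failed_devices is a hypothetical helper name.
smartd_failed_devices() {
    awk '
        # Only count the "FAILED SMART self-check" lines written by smartd.
        /smartd/ && /FAILED SMART self-check/ {
            for (i = 1; i < NF; i++) {
                if ($i == "Device:") {
                    dev = $(i + 1)
                    sub(/,$/, "", dev)   # strip the trailing comma
                    count[dev]++
                }
            }
        }
        END {
            # One line per device: device name, number of failed reports.
            for (dev in count) print dev, count[dev]
        }
    ' | sort
}

# Possible usage on the server (not run here):
#   smartd_failed_devices < /var/log/messages
```

On the log excerpt above this would report /dev/ada1 with three failed self-check reports.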

Yikes, the disk is about to die. Let me confirm with smartctl.

# smartctl -a /dev/ada1
smartctl 6.0 2012-10-10 r3643 [FreeBSD 9.1-RELEASE amd64] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST32000542AS
Serial Number:    5XW1VR5F
LU WWN Device Id: 5 000c50 02e7e8b9f
Firmware Version: CC34
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5900 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Nov  6 20:32:47 2014 CST

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/213915en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (  633) seconds.
Offline data collection
capabilities:            (0x73) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 453) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x103f) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       204814378
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   003   003   036    Pre-fail  Always   FAILING_NOW 3974
  7 Seek_Error_Rate         0x000f   088   060   030    Pre-fail  Always       -       681344031
  9 Power_On_Hours          0x0032   062   062   000    Old_age   Always       -       34092
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       73
183 Runtime_Bad_Block       0x0032   099   099   000    Old_age   Always       -       1
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   056   045    Old_age   Always       -       36 (Min/Max 35/42)
194 Temperature_Celsius     0x0022   036   044   000    Old_age   Always       -       36 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   046   027   000    Old_age   Always       -       204814378
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       225292509545553
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4046415663
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1941661744

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

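Sure enough, the SMART attribute Reallocated_Sector_Ct on disk ada1 is failing: its normalized value (003) has dropped below the threshold (036). I went online right away and ordered a replacement disk.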
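
Picking the failing row out of the attribute table by eye is error-prone. The sketch below filters smartctl attribute output for rows that are flagged in the WHEN_FAILED column or whose normalized VALUE is at or below a non-zero THRESH. It assumes the ten-column table layout shown above; failing_smart_attrs is a hypothetical name chosen for this example.

```shell
# Sketch: print SMART attributes that look failed in "smartctl -a" output.
# Reads the smartctl text on stdin. failing_smart_attrs is a hypothetical helper.
failing_smart_attrs() {
    awk '
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        $1 ~ /^[0-9]+$/ && NF >= 10 && $3 ~ /^0x/ {
            value  = $4 + 0
            thresh = $6 + 0
            # Flagged by the drive, or normalized value at/below a real threshold.
            if ($9 != "-" || (thresh > 0 && value <= thresh))
                printf "%s %s value=%d thresh=%d raw=%s\n", $1, $2, value, thresh, $10
        }
    '
}

# Possible usage on the server (not run here):
#   smartctl -a /dev/ada1 | failing_smart_attrs
```

Attributes with a threshold of 000 are skipped by the value check, since they are informational and never trip.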

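When SMART reports an error, the data on the disk is not necessarily lost right away, but the disk usually does not have long to live. Fortunately the data on this disk is still intact, and even if it were not, the pool is protected by raidz, so it should hold out for a few more days.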


Replace HDD

The new disk has finally arrived. Time to swap it in.

First, take the faulty disk offline.

# zpool offline fspool ada1
# zpool status
  pool: fspool
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Online the device using 'zpool online' or replace the device with
    'zpool replace'.
  scan: scrub repaired 0 in 18h19m with 0 errors on Mon Nov  3 21:04:18 2014
config:

    NAME                      STATE     READ WRITE CKSUM
    fspool                    DEGRADED     0     0     0
      raidz1-0                DEGRADED     0     0     0
        ada0                  ONLINE       0     0     0
        11298317341861346220  OFFLINE      0     0     0  was /dev/ada1
        ada2                  ONLINE       0     0     0
        ada3                  ONLINE       0     0     0
        ada5                  ONLINE       0     0     0

errors: No known data errors

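fspool is now in the DEGRADED state. This means the pool still works but is at risk: a raidz1 vdev that is already missing one disk cannot survive another disk failure.
The entry that used to be ada1 is now shown as the numeric identifier 11298317341861346220 (the device's GUID). Note this value down; it is needed for the replacement step.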
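
Copying a 20-digit identifier by hand invites typos. As a rough sketch, the helper below pulls it out of zpool status output by looking for rows that carry a "was <path>" note. It assumes the column layout shown above; missing_vdev_ids is a hypothetical name used only in this example.

```shell
# Sketch: print "<guid> <state> <old path>" for devices that zpool status
# now lists by numeric identifier. Reads zpool status text on stdin.
# missing_vdev_ids is a hypothetical helper.
missing_vdev_ids() {
    awk '
        # Rows look like:
        #   11298317341861346220  OFFLINE  0  0  0  was /dev/ada1
        # The identifier is matched and printed as text, never as a number,
        # because it is too large for exact numeric handling in awk.
        $1 ~ /^[0-9]+$/ && $6 == "was" {
            print $1, $2, $7
        }
    '
}

# Possible usage on the server (not run here):
#   zpool status fspool | missing_vdev_ids
```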

Pull the faulty disk right away and put the new one in. Then use smartctl to confirm that the new disk is OK.

# smartctl -a /dev/ada1
smartctl 6.0 2012-10-10 r3643 [FreeBSD 9.1-RELEASE amd64] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA DT01ACA300
Serial Number:    84MBWPHGS
LU WWN Device Id: 5 000039 ff4e1970c
Firmware Version: MX6OABB0
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Nov  6 20:46:33 2014 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x80) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (21791) seconds.
Offline data collection
capabilities:            (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 364) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       0
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0002   214   214   000    Old_age   Always       -       28 (Min/Max 25/28)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay

It looks OK. The next step is to bring the new disk into the pool, using the zpool replace command.

# zpool replace fspool 11298317341861346220 ada1

The 11298317341861346220 above is the value noted down earlier.
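
zpool replace expects the new device to be at least as large as the one it replaces. Here that is clearly the case (3.00 TB replacing 2.00 TB), but a rough pre-check can be sketched from the "User Capacity" lines in the two smartctl outputs above. The helper names capacity_bytes and big_enough are hypothetical, and this compares raw drive capacity only, which is an approximation of what ZFS actually checks.

```shell
# Sketch: compare drive capacities taken from smartctl output.
# capacity_bytes and big_enough are hypothetical helper names.

# Read smartctl text on stdin and print the capacity in bytes, e.g.
#   "User Capacity:    3,000,592,982,016 bytes [3.00 TB]" -> 3000592982016
capacity_bytes() {
    sed -n 's/^User Capacity:[[:space:]]*\([0-9,]*\) bytes.*/\1/p' |
        head -n 1 |
        tr -d ','
}

# big_enough OLD_BYTES NEW_BYTES: succeeds if the new drive is not smaller.
big_enough() {
    [ "$2" -ge "$1" ]
}

# Possible usage on the server (not run here), with saved smartctl output
# in the hypothetical files old.txt and new.txt:
#   old=$(capacity_bytes < old.txt)
#   new=$(capacity_bytes < new.txt)
#   big_enough "$old" "$new" && echo "new disk is large enough"
```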

Check the zpool status again.

# zpool status
  pool: fspool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov  6 20:47:24 2014
        550G scanned out of 7.44T at 144M/s, 13h56m to go
        107G resilvered, 7.22% done
config:

    NAME                        STATE     READ WRITE CKSUM
    fspool                      DEGRADED     0     0     0
      raidz1-0                  DEGRADED     0     0     0
        ada0                    ONLINE       0     0     0
        replacing-1             OFFLINE      0     0     0
          11298317341861346220  OFFLINE      0     0     0  was /dev/ada1/old
          ada1                  ONLINE       0     0     0  (resilvering)
        ada2                    ONLINE       0     0     0
        ada3                    ONLINE       0     0     0
        ada5                    ONLINE       0     0     0

errors: No known data errors

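Good, the resilver is under way. It looks like it will take until tomorrow before the pool goes from DEGRADED back to ONLINE. ZFS keeps working during this time, so the server can stay in use.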
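
Rather than rereading the full status every so often, the progress lines can be condensed into one line. The sketch below assumes the scan section format shown above ("... 13h56m to go" and "... 7.22% done"); other ZFS versions print this section differently, so treat the parsing as an assumption. resilver_progress is a hypothetical helper name.

```shell
# Sketch: condense the resilver lines of "zpool status" into one line.
# Reads zpool status text on stdin; prints nothing if no resilver is running.
# resilver_progress is a hypothetical helper.
resilver_progress() {
    awk '
        /resilver in progress/ { active = 1 }
        # e.g. "550G scanned out of 7.44T at 144M/s, 13h56m to go"
        active && / to go/ { eta = $(NF - 2) }
        # e.g. "107G resilvered, 7.22% done"
        active && /% done/ {
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) pct = $i
        }
        END {
            if (active) printf "%s done, %s to go\n", pct, eta
        }
    '
}

# Possible polling loop on the server (not run here):
#   while zpool status fspool | grep -q "resilver in progress"; do
#       zpool status fspool | resilver_progress
#       sleep 600
#   done
```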

