Friday, May 6, 2016

smartctl and Samsung 850 EVO

I have 3 small servers at home for performance testing. Each is an Intel NUC5i3RYH with 8G of RAM and a Core i3. They work quietly under my desk. I have been collecting results for MongoDB and MySQL to understand storage engine performance and efficiency. I use them for single-threaded workloads to learn when storage engines sacrifice too much performance at low concurrency to make things better at high concurrency.

Each NUC has one SATA disk and one SSD. Most tests use the SSD because the disk has the OS install and I don't want to lose that install when too much testing makes the disk unhappy. My current SSD is the 120G Samsung 850 EVO, and one of them became sick:
[3062127.595842] attempt to access beyond end of device
[3062127.595847] sdb1: rw=129, want=230697888, limit=230686720
[3062127.595850] XFS (sdb1): discard failed for extent [0x7200223,8192], error 5
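
The numbers in that message are self-consistent if want and limit are counts of 512-byte sectors, which is the usual unit for this kernel message: the partition is about 110 GiB (reasonable for a 120G SSD) and the discard ran a few MiB past its end. A quick sketch of that arithmetic, under that sector-size assumption:

# Sanity check of the dmesg numbers above, assuming (as is typical for this
# message) that want and limit are counts of 512-byte sectors.
SECTOR = 512
want = 230697888     # sector the request tried to reach
limit = 230686720    # size of sdb1 in sectors

print("partition size: %.1f GiB" % (limit * SECTOR / 2.0**30))      # ~110.0 GiB
print("overshoot: %d sectors, %.1f MiB" %
      (want - limit, (want - limit) * SECTOR / 2.0**20))            # 11168 sectors, ~5.5 MiB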

Other error messages were amusing.
[2273399.254789] Uhhuh. NMI received for unknown reason 3d on CPU 3.
[2273399.254818] Do you have a strange power saving mode enabled?
[2273399.254840] Dazed and confused, but trying to continue

What does smartctl say? I am interested in Wear_Leveling_Count. The raw value is 1656. If that means what I think it means, then this device can go to 2000 thanks to 3D TLC NAND (aka 3D V-NAND). The normalized VALUE is 022, and it counts down from 100 to 0, so this device is about 80% done and Wear_Leveling_Count might reach 2000. I created a new XFS filesystem on the device, rebooted the server and restarted my test. I don't think I need to replace this SSD today.

sudo smartctl -a /dev/sdb1
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4323
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       42
177 Wear_Leveling_Count     0x0013   022   022   000    Pre-fail  Always       -       1656
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   055   049   000    Old_age   Always       -       45
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       9
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       365781411804
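
Attribute 177 supports a back-of-the-envelope endurance estimate. The sketch below assumes the normalized VALUE counts down from 100 to 0, the raw value is the average program/erase cycle count so far, and LBAs are 512 bytes; that is a plausible reading of the output above, not something confirmed by Samsung.

# Rough endurance estimate from the smartctl output above (assumptions noted
# in the text: VALUE counts down from 100, RAW_VALUE is average P/E cycles,
# LBAs are 512 bytes).
value = 22                    # normalized VALUE for 177 Wear_Leveling_Count
raw_cycles = 1656             # RAW_VALUE for 177 Wear_Leveling_Count
lbas_written = 365781411804   # RAW_VALUE for 241 Total_LBAs_Written

consumed = (100 - value) / 100.0        # fraction of rated endurance used, ~0.78
est_total = raw_cycles / consumed       # cycles expected when VALUE hits 0, ~2120
tb_written = lbas_written * 512 / 1e12  # host writes so far, ~187 TB

print("endurance consumed: %.0f%%" % (consumed * 100))
print("estimated rated P/E cycles: %.0f" % est_total)
print("host TB written: %.0f" % tb_written)

Under those assumptions the drive has a few hundred P/E cycles of headroom left, which is consistent with not replacing it yet.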

5 comments:

  1. Mark,

    Usually I see drives perform for quite a while after the wear level indicator hits zero. I also have a bunch of NUCs with inexpensive SSDs that I use for testing :)

    Replies
    1. Does performance get a lot worse as the device reaches its advertised limit?

  2. Frankly, I did not measure. What would you physically expect to cause a slowdown, unless there are correctable errors which require remapping?

    Replies
    1. Would be great if vendors were willing to make a statement about that.

  3. Just read about SSDs here (there are more articles about this stress test):
    http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead

    Those are a few generations behind, and newer drives are better and last longer. But still, this kind of memory has its internal limitations and will break at some point.

