Friday, May 6, 2016

smartctl and Samsung 850 EVO

I have 3 small servers at home for performance testing. Each is an Intel NUC5i3RYH with 8G of RAM and a Core i3. They work quietly under my desk. I have been collecting results for MongoDB and MySQL to understand storage engine performance and efficiency. I use them for single-threaded workloads to learn when storage engines sacrifice too much performance at low concurrency to make things better at high concurrency.

Each NUC has one SATA disk and one SSD. Most tests use the SSD because the disk has the OS install and I don't want to lose that install when too much testing makes the disk unhappy. My current SSD is the 120G Samsung 850 EVO, and one of them became sick:
[3062127.595842] attempt to access beyond end of device
[3062127.595847] sdb1: rw=129, want=230697888, limit=230686720
[3062127.595850] XFS (sdb1): discard failed for extent [0x7200223,8192], error 5
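
The numbers in that message are self-consistent if want and limit are counts of 512-byte sectors, which is the usual unit for this kernel message: the partition is about 110 GiB (reasonable for a 120G SSD) and the discard ran a few MiB past its end. A quick sketch of that arithmetic, under that sector-size assumption:

# Sanity check of the dmesg numbers above, assuming (as is typical for this
# message) that want and limit are counts of 512-byte sectors.
SECTOR = 512
want = 230697888     # sector the request tried to reach
limit = 230686720    # size of sdb1 in sectors

print("partition size: %.1f GiB" % (limit * SECTOR / 2.0**30))      # ~110.0 GiB
print("overshoot: %d sectors, %.1f MiB" %
      (want - limit, (want - limit) * SECTOR / 2.0**20))            # 11168 sectors, ~5.5 MiB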

Other error messages were amusing.
[2273399.254789] Uhhuh. NMI received for unknown reason 3d on CPU 3.
[2273399.254818] Do you have a strange power saving mode enabled?
[2273399.254840] Dazed and confused, but trying to continue

What does smartctl say? I am interested in Wear_Leveling_Count. The raw value is 1656. If that means what I think it means, then this device can go to 2000 thanks to 3D TLC NAND (aka 3D V-NAND). The normalized VALUE is 022, and it counts down from 100 to 0, so this device is about 80% done and Wear_Leveling_Count might reach 2000. I created a new XFS filesystem on the device, rebooted the server and restarted my test. I don't think I need to replace this SSD today.

sudo smartctl -a /dev/sdb1
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       4323
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       42
177 Wear_Leveling_Count     0x0013   022   022   000    Pre-fail  Always       -       1656
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   055   049   000    Old_age   Always       -       45
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       9
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       365781411804
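
Attribute 177 supports a back-of-the-envelope endurance estimate. The sketch below assumes the normalized VALUE counts down from 100 to 0, the raw value is the average program/erase cycle count so far, and LBAs are 512 bytes; that is a plausible reading of the output above, not something confirmed by Samsung.

# Rough endurance estimate from the smartctl output above (assumptions noted
# in the text: VALUE counts down from 100, RAW_VALUE is average P/E cycles,
# LBAs are 512 bytes).
value = 22                    # normalized VALUE for 177 Wear_Leveling_Count
raw_cycles = 1656             # RAW_VALUE for 177 Wear_Leveling_Count
lbas_written = 365781411804   # RAW_VALUE for 241 Total_LBAs_Written

consumed = (100 - value) / 100.0        # fraction of rated endurance used, ~0.78
est_total = raw_cycles / consumed       # cycles expected when VALUE hits 0, ~2120
tb_written = lbas_written * 512 / 1e12  # host writes so far, ~187 TB

print("endurance consumed: %.0f%%" % (consumed * 100))
print("estimated rated P/E cycles: %.0f" % est_total)
print("host TB written: %.0f" % tb_written)

Under those assumptions the drive has a few hundred P/E cycles of headroom left, which is consistent with not replacing it yet.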

5 comments:

  1. Mark,

    Usually I see drives perform for quite a while after the wear level indicator hits zero. I also have a bunch of NUCs with inexpensive SSDs that I use for testing :)

    Replies
    1. Does performance get a lot worse as the device reaches its advertised limit?

  2. Frankly, I did not measure. What would you physically expect to cause a slowdown, unless there are correctable errors which require remapping?

    Replies
    1. Would be great if vendors were willing to make a statement about that.

  3. Just read about SSDs here (there are more articles about this stress test):
    http://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead

    Those are a few generations behind, and newer drives are better and last longer. But still, this kind of memory has its internal limitations and will break at some point.

