EMC VNX2 and Hot Sparing

VNX2 has changed the way hot spares work. This change has the potential to affect the performance of next-generation EMC devices, for better and worse, as well as the administration of these devices. The information below defines how hot sparing works on VNX2 devices, describes the rules used to select spares, and discusses the impact for administrators.

Keep in mind that these changes apply to the VNX2 devices: the VNX8000, VNX7600, VNX5800, VNX5600, VNX5400, and VNX5200. These devices run the MCx code (release 05.33), also known as "multi-core optimization," as defined by EMC.

Here are the primary points you need to know.

VNX2 no longer utilizes hot-spare designations.

Sparing now draws from unassigned drives and is considered permanent. There is no equalization after the failed drive is replaced; the unassigned drive permanently becomes the replacement. In other words, users will no longer designate specific drives as "hot spares": the MCx code treats any unconfigured drive in the array as an available spare, and as long as such a drive is available, sparing will occur. We found this discussion thread on EMC's site to be particularly helpful: https://community.emc.com/thread/184504
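To make the behavioral change concrete, here is a minimal sketch of what "permanent sparing with no equalization" means. It is a toy Python model with hypothetical drive IDs and data structures, not an EMC API or CLI: the unbound drive that takes over simply becomes the RAID group member, and the later physical replacement just joins the unbound pool.

```python
# Minimal model of MCx-style permanent sparing (hypothetical drive IDs, not an EMC API).

raid_group = {"RG_5": ["DRV_A", "DRV_B", "DRV_C"]}   # RAID group members
unbound = ["DRV_X", "DRV_Y"]                         # any unconfigured drive is a spare candidate

def fail_drive(rg_name, failed_id):
    """On failure, an unbound drive permanently replaces the failed member."""
    spare = unbound.pop(0)                           # selection rules are covered in the Appendix
    members = raid_group[rg_name]
    members[members.index(failed_id)] = spare        # the spare becomes a permanent member
    return spare

def insert_replacement(new_id):
    """The physically replaced drive does not equalize back; it simply joins the unbound pool."""
    unbound.append(new_id)

fail_drive("RG_5", "DRV_B")         # RG_5 now contains DRV_X permanently
insert_replacement("DRV_B_NEW")     # the new drive is just another available spare
print(raid_group, unbound)          # {'RG_5': ['DRV_A', 'DRV_X', 'DRV_C']} ['DRV_Y', 'DRV_B_NEW']
```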

However, there are potential issues.

One of the potential issues is that a drive can spare out to a slower drive. The MCx code does not take the form factor or RPM of the drive into account, so if a lower-RPM drive is the only unassigned drive available, a 15K RPM drive can and will spare out to a 7.2K RPM drive. This puts your performance at risk, and it places more management responsibility on the administrator than before: administrators will be expected to monitor regularly for drives that have failed over to slower drives. With current-generation VNX devices, you know the destination to which your drives fail over. With the new VNX, you may not be sure where or when they are failing over, and it is up to the administrator to determine whether a drive has failed over to a slower one.
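Because MCx ignores RPM when it picks a spare, a periodic check along the lines of the sketch below may help. It assumes a drive inventory exported from the array into a simple list of records; the field names are illustrative, not an EMC export format. It flags any RAID group whose members no longer share a single rotational speed.

```python
# Sketch: flag RAID groups whose members no longer share one rotational speed.
from collections import defaultdict

# Hypothetical inventory export, one entry per bound drive (field names are illustrative).
inventory = [
    {"raid_group": "RG_0", "drive": "1_0_4", "rpm": 15000},
    {"raid_group": "RG_0", "drive": "1_1_2", "rpm": 7200},    # slower drive spared in
    {"raid_group": "RG_1", "drive": "0_0_5", "rpm": 15000},
]

speeds = defaultdict(set)
for d in inventory:
    speeds[d["raid_group"]].add(d["rpm"])

for rg, rpms in sorted(speeds.items()):
    if len(rpms) > 1:
        print(f"WARNING: {rg} mixes drive speeds {sorted(rpms)} - review after any sparing event")
```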

The way RAID groups are configured has also changed.

In the past, the Bus, Enclosure & Disk (B.E.D.) address was used to identify the members of a RAID group. On VNX2 devices, serial numbers are used instead to determine which drives are associated with a particular RAID group. This new methodology allows for a feature called "drive mobility": a drive can be physically moved to another slot, and it will still be recognized as part of its RAID group and continue processing, no matter where it is placed within the array. This kind of flexibility works well in certain instances, for example when trying to balance or rebalance drives, but it fails miserably when the process is not well planned or followed.
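The shift from B.E.D. addressing to serial numbers can be pictured as changing the key used to track membership. The sketch below is again a hypothetical Python model, not EMC code, showing why a drive moved to a different slot still resolves to the same RAID group when membership is keyed on its serial number.

```python
# Sketch: membership keyed by serial number, so a slot move does not change the RAID group.

membership = {              # serial number -> RAID group (hypothetical values)
    "Z1X4ABC": "RG_0",
    "Z1X4DEF": "RG_0",
}

slots = {                   # current physical location, Bus_Enclosure_Disk -> serial number
    "0_1_7": "Z1X4ABC",
    "1_0_3": "Z1X4DEF",
}

def raid_group_for_slot(bed):
    """Resolve a slot to its RAID group through the drive's serial number."""
    return membership.get(slots[bed])

print(raid_group_for_slot("0_1_7"))   # RG_0
slots["2_0_0"] = slots.pop("0_1_7")   # physically move the drive to another slot
print(raid_group_for_slot("2_0_0"))   # still RG_0 - membership travels with the serial number
```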

In conclusion, it is difficult to predict how these changes will play out in production. From an engineering standpoint, they are complicated with regard to OS and code implementation. If you are considering an upgrade to a VNX2 model, you will want to take the potential performance issues described above into account. You will need significant planning from a certified engineer to make sure you are not putting your data at risk. Reliant is available to help. Contact us at StorageConsultant@Reliant-Technology.com


Appendix
 
If the only drive available is a slower drive, this will change the performance characteristics of that RAID group.
 
Sparing Rules (a sketch of this selection order follows the note below):
1. Type (SSD; SAS; NLSAS)
• MCx will look at every drive in the array, on all of the buses, and gather all of the 'like' drives into a bucket. For example, if the failing/failed drive is a SAS drive, it will find all of the SAS drives in the system.
 
2. Bus
• MCx will then take the drives found in step 1 and determine which ones are on the same physical bus as the failing/failed drive.
 
3. Size (used capacity)
• MCx then takes the drives found in step 2 and looks for drives of exactly the same size*. If no drives of the exact size are available, larger drives will be considered.
 
4. Enclosure
• The last step is to check whether any of the drives from step 3 are in the same physical enclosure as the failing/failed drive.
 
* Note that in the case of the vault drives (0.0.0 - 0.0.3), only the space used for array LUNs is taken into account, as the 300 GB vault area does not rebuild to spare drives. Instead, this area rebuilds only once the failed drive is physically replaced.
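A compact way to read these four rules is as successive filters over the unbound drive population. The sketch below is only our interpretation of that order, in Python with hypothetical drive records, not EMC's implementation; in particular, it treats the bus and enclosure steps as preferences that fall back to the wider candidate set when no match exists, which the rules above do not spell out.

```python
# Sketch: the four sparing rules read as successive filters (our interpretation, hypothetical data).

def prefer(candidates, predicate):
    """Narrow to drives matching the predicate, but keep the full set if none match."""
    narrowed = [d for d in candidates if predicate(d)]
    return narrowed or candidates

def select_spare(failed, unbound):
    # 1. Type: only drives of the same type (SSD / SAS / NL-SAS) are candidates at all.
    candidates = [d for d in unbound if d["type"] == failed["type"]]
    if not candidates:
        return None                                  # no like drive, no sparing
    # 2. Bus: prefer drives on the same physical bus as the failing/failed drive.
    candidates = prefer(candidates, lambda d: d["bus"] == failed["bus"])
    # 3. Size: prefer the exact used capacity; otherwise fall back to larger drives.
    exact = [d for d in candidates if d["size_gb"] == failed["size_gb"]]
    candidates = exact or [d for d in candidates if d["size_gb"] > failed["size_gb"]]
    if not candidates:
        return None
    # 4. Enclosure: prefer a drive in the same enclosure as the failing/failed drive.
    candidates = prefer(candidates, lambda d: d["enclosure"] == failed["enclosure"])
    return candidates[0]

failed = {"type": "SAS", "bus": 1, "enclosure": 0, "size_gb": 600}
unbound = [
    {"id": "1_1_10", "type": "SAS",    "bus": 1, "enclosure": 1, "size_gb": 600},
    {"id": "0_0_9",  "type": "NL-SAS", "bus": 0, "enclosure": 0, "size_gb": 2000},
]
print(select_spare(failed, unbound))   # picks the 600 GB SAS drive on bus 1
```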