How to Recover Data from NetApp® Server with WAFL Inconsistent Error
The client organization is a large global manufacturing company based in California, with a presence in over 50 countries. Its subsidiary in India had deployed NetApp® Fabric-Attached Storage (FAS 2020) as a Network Attached
NetApp® FAS systems are widely used because they combine performance and flexibility with technologies built for efficiency. FAS systems facilitate data management and can respond quickly to storage needs across flash, disk, and cloud.
The FAS2000 Series NAS used in this particular case was an old setup running the ONTAP® 7.3.7 (Data ONTAP 7G) operating system, with a total of 12 WD® SATA HDDs configured using NetApp® RAID-DP technology.
RAID-DP is a standard feature of the Data ONTAP operating system, which implements double-parity RAID 6 to prevent data loss when two drives fail. Each of the 12 HDDs in the RAID, including the one spare hard drive, had 1 TB of storage capacity. Out of the total 12 TB ‘aggregate’ capacity, approx. 6.23 TB of the storage space was used for storing company data along with critical information of one of its key customers.
One of the hard drives in the RAID failed suddenly, which put the setup into a degraded state. To address this issue, the storage administrator manually removed the failed disk and replaced it with the spare hard drive.
As a result, the RAID appeared to return to its normal functional state. However, about two days after this initial fix for the disk failure and performance degradation, the RAID hit a major problem: it went into a series of reboot cycles that repeated every 10 minutes.
This was likely due to the failure of multiple hard drives: RAID-DP with double parity can withstand the failure of up to 2 drives without data loss, but it fails and goes offline if more than 2 drives fail.
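The failure-tolerance rule described above can be expressed as a short sketch. This is an illustration only, not NetApp code: the `ArrayState` and `classify_array` names are hypothetical, and the model assumes a single RAID group whose tolerance equals its number of parity drives.

```python
from enum import Enum


class ArrayState(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    FAILED = "failed"


def classify_array(failed_drives: int, parity_drives: int = 2) -> ArrayState:
    """Classify a parity-protected RAID group by its number of failed drives.

    parity_drives=2 models a double-parity scheme such as RAID-DP (assumption:
    the group survives exactly as many failures as it has parity drives).
    """
    if failed_drives < 0:
        raise ValueError("failed_drives must be non-negative")
    if failed_drives == 0:
        return ArrayState.NORMAL
    if failed_drives <= parity_drives:
        # Data is still reconstructible from the surviving drives and parity.
        return ArrayState.DEGRADED
    # More failures than parity drives: the group cannot be reconstructed.
    return ArrayState.FAILED


if __name__ == "__main__":
    for failed in range(4):
        print(failed, classify_array(failed).value)
```

Under this model, the 3 nonfunctional drives found later in the lab would put a double-parity group in the failed state.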
Reportedly, the time the server stayed on between reboot cycles was too short for the RAID to resync, so each reboot resulted in an ‘unclean’ shutdown. After about 24 hours of these reboot cycles, the RAID went into a permanently degraded (failed) state. The single aggregate volume in the RAID setup also turned “WAFL inconsistent” (corrupt) due to file system corruption.
Failure of the RAID made the NetApp® storage inaccessible, with the potential risk of losing critically important data unless the RAID could be reconstructed to allow data recovery operations.
Recover data from crashed NetApp® NAS server in ‘WAFL inconsistent’ state.
Data Recovery Challenges
- High complexity and time associated with rectifying WAFL Inconsistent state
- No access to client server domain for performing data recovery
Data Recovery Approach
Stellar® constituted a dedicated team of data care experts to execute this NAS data recovery project. This team employed the following steps:
1. In-lab Diagnosis of NetApp® Fabric-Attached Storage Model FAS 2020
- Stellar® data care experts began with examining the hard drives. They removed all hard drives from the NAS box and checked them individually, which revealed that 3 of the hard drives were nonfunctional.
- As surmised earlier in this case study, failure of more than 2 drives had resulted in failure of RAID.
- The data care team sought tampering permission from the client, after which they opened up these failed hard drives inside a Class 100 clean room lab. It was found that these hard drives had failed due to crashed heads.
- Further, one of these hard drives had light visible scratches on the coating of the platter surface, which could result in permanent loss of the data stored on the affected sectors of the drive.
2. Head Assembly Transplant and Drive Cloning
- Next, the data care experts transplanted new head assemblies onto these failed hard drives to restore them to a functional state and allow disk cloning.
- The team was able to clone two of the hard disk drives completely, while only 50% of the third hard drive could be cloned because of the scratches observed earlier on the platter coating.
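A partial clone of this kind can be illustrated with a minimal sketch of sector-by-sector copying that skips unreadable sectors. This is not the tooling used in the lab: `clone_with_skips` and `FlakyDisk` are hypothetical names, the 512-byte sector size is an assumption, and the damaged drive is simulated in memory.

```python
import io
from typing import BinaryIO, List

SECTOR_SIZE = 512  # assumed sector size for this sketch


class FlakyDisk(io.BytesIO):
    """Hypothetical in-memory stand-in for a drive with unreadable sectors."""

    bad_sectors = frozenset()

    def read(self, size=-1):
        # Simulate a read error when the current position is in a bad sector.
        if self.tell() // SECTOR_SIZE in self.bad_sectors:
            raise OSError("unreadable sector")
        return super().read(size)


def clone_with_skips(src: BinaryIO, dst: BinaryIO, total_sectors: int,
                     sector_size: int = SECTOR_SIZE) -> List[int]:
    """Copy src to dst sector by sector; zero-fill and record unreadable sectors."""
    unreadable = []
    for sector in range(total_sectors):
        offset = sector * sector_size
        try:
            src.seek(offset)
            data = src.read(sector_size)
        except OSError:
            data = b""
            unreadable.append(sector)
        # Pad failed or short reads so every offset in the clone stays aligned.
        data = data.ljust(sector_size, b"\x00")
        dst.seek(offset)
        dst.write(data)
    return unreadable


if __name__ == "__main__":
    source = FlakyDisk(bytes(range(256)) * 8)  # 4 sectors of sample data
    source.bad_sectors = frozenset({2})
    clone = io.BytesIO()
    skipped = clone_with_skips(source, clone, total_sectors=4)
    print("unreadable sectors:", skipped)
```

Keeping the clone the same size as the source, with zero-filled gaps, preserves the offsets that a RAID reconstruction depends on.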
3. WAFL Inconsistency Check and Repair
- Stellar® data care experts reinstalled all the constituent hard drives (including the 3 clone drives) inside the NAS box.
- The NAS couldn’t reboot, as the root aggregate had been marked WAFL inconsistent in this case.
- The aggregate was then restricted to allow the system to mark it as a native aggregate, which is an essential step before running WAFL_check on inconsistent aggregates.
- For this step, the team had to boot the filer in Maintenance Mode, run the aggregate restrict command, and reboot the filer to the Special Boot Menu.
- Next, the team ran WAFL_check on the storage to correct the inconsistencies, working from a laptop PC connected to the console port on the filer.
- The WAFL_check process checked a host of parameters, including the mean size of files in the file system, the number of inodes, the data layout, the size of the aggregate, the number of inconsistencies, CPU speed, and system memory.
- It took about 4- weeks to repair WAFL inconsistency, with the maximum time spent on rebuilding the inodes*.
*An inode is a data structure in UNIX that contains important information pertaining to files within a file system.
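To make the footnote concrete, here is a minimal sketch that reads some of the metadata a file system keeps in an inode, using Python's standard `os.stat` call. It shows the generic UNIX concept only; it does not reflect the on-disk layout of WAFL inodes, and `describe_inode` is a hypothetical helper name.

```python
import os
import stat
import tempfile


def describe_inode(path: str) -> dict:
    """Return a few of the metadata fields the file system keeps for a file."""
    info = os.stat(path)
    return {
        "inode_number": info.st_ino,
        "size_bytes": info.st_size,
        "hard_links": info.st_nlink,
        "is_regular_file": stat.S_ISREG(info.st_mode),
        "permissions": stat.filemode(info.st_mode),
        "modified_epoch": info.st_mtime,
    }


if __name__ == "__main__":
    # Create a throwaway file so the example has something to inspect.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        handle.write(b"sample data")
        temp_path = handle.name
    try:
        for field, value in describe_inode(temp_path).items():
            print(f"{field}: {value}")
    finally:
        os.remove(temp_path)
```

Note that the file name is not among these fields: names live in directory entries that point to inodes, so damaged inodes can leave data unreachable even when the data blocks are intact.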
4. Accessing Domain-Restricted Server to Recover the Data
- After switching on the NetApp® storage in normal mode and entering the client-supplied user credentials, the team found that the server had restricted access; it could only be accessed from within its native domain-controlled environment.
- Stellar® data care experts monitored the WAFL credential cache statistics to determine the entries available in the WAFL credential cache along with their access rights pertaining to the server.
- Using this information, the team was able to recreate the policies so as to gain access to the NAS for performing data recovery.
- Next, the data care experts used proprietary data recovery software to run a file signature-based scan on the NetApp® storage. The software located a total of 7.4 TB of lost data, of which the team recovered the 3 TB that was specifically required by the client.
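The idea behind a file signature-based scan can be sketched as follows. This is not the proprietary software mentioned above: `scan_signatures` is a hypothetical function, the raw image is assumed to fit in memory, and only a handful of well-known header signatures (magic numbers) are included.

```python
from typing import Dict, List, Tuple

# A few well-known file header signatures (magic numbers).
SIGNATURES: Dict[str, bytes] = {
    "pdf": b"%PDF-",
    "png": b"\x89PNG\r\n\x1a\n",
    "zip": b"PK\x03\x04",
    "jpeg": b"\xff\xd8\xff",
}


def scan_signatures(image: bytes,
                    signatures: Dict[str, bytes] = SIGNATURES
                    ) -> List[Tuple[int, str]]:
    """Return (offset, file_type) for every signature occurrence in a raw image."""
    hits = []
    for file_type, magic in signatures.items():
        start = image.find(magic)
        while start != -1:
            hits.append((start, file_type))
            # Continue searching just past the current match.
            start = image.find(magic, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Build a small synthetic "disk image" with two embedded file headers.
    sample = b"\x00" * 100 + b"%PDF-1.4" + b"\x00" * 50 + b"PK\x03\x04"
    for offset, file_type in scan_signatures(sample):
        print(f"{file_type} header found at byte offset {offset}")
```

A scan like this finds candidate file starts without relying on file system metadata, which is why it remains usable when that metadata is damaged. A real tool would also need to determine where each file ends and to validate its contents.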
Stellar® data care experts successfully recovered requisite data from NetApp® Fabric-Attached Storage (FAS 2020). The entire data recovery project — from job intake and assignment to execution and final closure — was completed within the committed time. The data was recovered intact, with 100% integrity in its original form, as verified by the client organization. The quick turnaround and quality of service helped the client organization to quickly recover from the downtime and reinstate normal business operations.
“Our Net App Server was creating WAFL Inconsistent error. It’s containing important data and we were not able to access that data. It was data lost situation and we contacted stellar to recover data. The complete data recovery process was very transparent and the team is very professional. We got 100% data from the server within estimated time. I recommend stellar® for any data recovery service.”