SAN Data Recovery from a RAID 5 Failure


For modern data centers, SAN (Storage Area Network) systems are the backbone of business operations. A single SAN system can support everything from payroll and billing to mission-critical databases simultaneously for multiple users.

But what happens when a SAN, stacked with dozens of enterprise hard drives, suffers a catastrophic multi-disk failure?

That’s the case that recently landed on our desks at Stellar Data Recovery – Gurugram. And like most cases of enterprise SAN data recoveries, it was a complicated and high-pressure situation.

The system in question was a dual-bay enterprise SAN packed with 48 drives (24 in each bay), deployed by a data center to support accounting databases for five corporate clients.

Every minute of downtime posed a risk of lost revenue, broken contracts, and, most of all, a major blow to the trust of the clients and thousands of end users.

The SAN at a Glance (Not Just Another RAID Array)

This was not a simple file server. The SAN was a high-density storage platform with two 24-bay enclosures. Each bay paired SSDs for the system pool with 1.8 TB 2.5″ SAS HDDs for the data pool (a rough capacity estimate follows the breakdown below).

Storage Breakdown

  • Total drives: 48
  • Total bays: 2
  • Drives per bay: 24
    • OS/system pool per bay: 4 SSDs (configured in RAID 0)
    • Data pool per bay: 20 HDDs (split into several LUNs and configured in RAID 5)
    • LUNs: Each bay’s data pool was organized into 4–6 LUNs. The affected bay had 4 LUNs, two of which were in scope for this recovery:
      • LUN-A (~5 TB): Hosted 4 virtual disks (for VMs/databases)
      • LUN-B (~10 TB): Hosted 25 virtual disks
      • Other LUNs: Not in scope for this recovery
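
For a rough sense of scale, the arithmetic below estimates raw versus usable capacity for the data pools. The exact RAID group layout inside each pool was not disclosed beyond "RAID 5", so treating each 20-drive pool as a single RAID 5 group is an assumption made purely for illustration.

```python
# Rough capacity estimate for the data pools described above.
# Assumption (illustration only): each bay's 20-drive data pool behaves
# like one RAID 5 group, which gives up one drive's worth of capacity
# to parity.

DRIVE_TB = 1.8            # capacity of each 2.5" SAS HDD
DATA_DRIVES_PER_BAY = 20
BAYS = 2

raw_tb = DRIVE_TB * DATA_DRIVES_PER_BAY * BAYS            # 72.0 TB raw
usable_tb = DRIVE_TB * (DATA_DRIVES_PER_BAY - 1) * BAYS   # 68.4 TB usable (n-1 drives hold data)

print(f"Raw data-pool capacity: {raw_tb:.1f} TB")
print(f"Usable with one RAID 5 group per bay: {usable_tb:.1f} TB")
```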

Overview of the SAN Failure

The SAN’s design relied on RAID 5 arrays (with parity). But as every seasoned IT pro knows, RAID is not a backup (especially not RAID 5). Here’s what happened.

  • First failure: One HDD suffered physical damage and stopped working. The SAN’s controller started an automatic rebuild, pulling data from the surviving drives and using parity to reconstruct the lost disk onto a hot spare (a minimal sketch of this parity math follows the list).
  • Second failure (within 2 hours): While the system was still rebuilding, a second drive failed. This was alarming, as RAID 5 cannot handle two simultaneous disk failures, and the entire setup was at risk.
  • Third failure (10 hours later): Yet another drive dropped out. It further fragmented the data and made any software-based recovery impossible.
  • Fourth failure: The client later reported a fourth disk in the same pool had also suffered physical damage.
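
The automatic rebuild mentioned in the first failure above works because RAID 5 keeps XOR parity: any single missing strip can be recomputed from the surviving members. The snippet below is a minimal, generic illustration of that property, not the SAN vendor's actual rebuild code; it also shows why losing a second member mid-rebuild is fatal to the array.

```python
# Minimal illustration of RAID 5 parity: parity = XOR of all data strips,
# so any one missing strip can be rebuilt from the survivors.
# Generic sketch only -- not the controller's rebuild logic.

def xor_strips(strips):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(strips[0]))
    for strip in strips:
        for i, b in enumerate(strip):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]          # data strips in one stripe
parity = xor_strips(data)                   # stored on the parity member

# One strip lost: rebuild it from the survivors plus parity.
rebuilt = xor_strips([data[0], data[2], parity])
assert rebuilt == data[1]

# Two strips lost: the single parity equation no longer has enough
# inputs, which is exactly what the second drive failure caused here.
```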

At this point, the two affected LUNs went offline. Applications stopped, and SQL databases (including the client’s own accounting system) could not mount.

End users, from accountants to customer service teams, were left staring at error messages.

Estimated Business Impact

  • Number of affected users: Over 1,000 across multiple organizations
  • Critical datasets at risk: Three core SQL Server databases (1.3 TB, 700 GB, & 1 TB in size—fragmented across multiple virtual disks)
  • Potential loss: Delays in processing financial transactions, claims, and account reconciliations for thousands of downstream customers.

The Unique Technical Challenges of This SAN Data Recovery

What made this case particularly challenging was the use of thin provisioning and sparse allocation in the SAN setup. These features are designed to save storage space but are notorious for complicating recovery, which is exactly what they did in this case.

A thin pool (sometimes called sparse format) is a way of managing storage where the SAN doesn’t allocate all physical space up front. Instead, it only uses space as data is actually written. This is called thin provisioning.
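
As a rough mental model (not the vendor's on-disk format), a thin pool can be pictured as a map that hands out physical chunks only when a virtual address is first written. The toy class below is a hypothetical sketch of that behavior:

```python
# Toy model of thin provisioning: physical chunks are allocated only when
# a virtual offset is first written. Illustrative only -- real SAN thin
# pools use proprietary on-disk allocation tables.

CHUNK_SIZE = 64 * 1024            # 64 KB chunks, as in this SAN

class ThinPool:
    def __init__(self):
        self.chunk_map = {}       # virtual chunk index -> physical chunk index
        self.next_physical = 0    # next free physical chunk

    def allocate_for_write(self, virtual_offset):
        """Return the physical chunk backing this offset, allocating on first write."""
        chunk = virtual_offset // CHUNK_SIZE
        if chunk not in self.chunk_map:        # allocate on first write only
            self.chunk_map[chunk] = self.next_physical
            self.next_physical += 1
        return self.chunk_map[chunk]

pool = ThinPool()
pool.allocate_for_write(0)                 # touches virtual chunk 0
pool.allocate_for_write(10 * 1024**3)      # a write 10 GB in still costs just one more chunk
print(len(pool.chunk_map))                 # 2 chunks allocated, despite the 10 GB span
```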

In our case, the SAN split physical storage into large blocks, or “chunks,” each starting at 64 KB in size. Every time new data was written, the SAN would allocate a new chunk. This is different from typical file systems like NTFS, which organize files in much smaller “clusters” (usually 4 KB, or as small as 512 bytes).

Because of these different block sizes, the start and end of actual files (or database records) did not line up neatly with where each storage chunk began and ended.
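
The small calculation below shows why. The 64 KB chunk size is taken from this case; the volume offset and cluster number are hypothetical values chosen only to demonstrate how a file's first cluster typically lands partway into a chunk rather than at a chunk boundary.

```python
# Why file starts land mid-chunk: the NTFS volume inside a virtual disk
# sits at some byte offset within the LUN, and clusters (4 KB) are much
# smaller than chunks (64 KB), so file boundaries and chunk boundaries
# drift apart. The offsets below are hypothetical.

CHUNK_SIZE = 64 * 1024        # SAN chunk size in this case
CLUSTER_SIZE = 4 * 1024       # typical NTFS cluster size

volume_start = 1_048_576 + 31_744     # hypothetical partition offset inside the LUN
file_first_cluster = 2_500            # hypothetical starting cluster of a file extent

lun_offset = volume_start + file_first_cluster * CLUSTER_SIZE
chunk_index = lun_offset // CHUNK_SIZE
offset_in_chunk = lun_offset % CHUNK_SIZE

# The file begins tens of kilobytes into a chunk, not at byte 0 of one.
print(chunk_index, offset_in_chunk)
```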

During manual data extraction, when we examined the data in these chunks, we often found that the beginning of a file (or a fragment of one) sat in the middle of a chunk, not at the very start.

This “misalignment” is called a non-aligned offset.

In this case of SAN data recovery, it meant we could not simply pull out whole files from each chunk. Instead, we had to reassemble files and records by tracing exactly where every part started and ended.
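
Conceptually, that reassembly looks like the sketch below: walk each extent of a file chunk by chunk and stitch the recovered pieces back together in logical order. The function and data shapes are hypothetical simplifications, not the actual scripts used in the lab.

```python
# Conceptual sketch of reassembly across chunk boundaries. Each extent is
# walked one chunk at a time and the pieces are concatenated in order.
# Hypothetical simplification of the real tooling.

CHUNK_SIZE = 64 * 1024

def read_extent(chunks, lun_offset, length):
    """Read `length` bytes starting at `lun_offset` from recovered chunks.

    `chunks` maps a chunk index to that chunk's recovered bytes.
    """
    pieces = []
    while length > 0:
        idx = lun_offset // CHUNK_SIZE
        start = lun_offset % CHUNK_SIZE
        take = min(CHUNK_SIZE - start, length)     # stop at the chunk boundary
        pieces.append(chunks[idx][start:start + take])
        lun_offset += take
        length -= take
    return b"".join(pieces)
```

A file is then rebuilt by calling this for every extent in its run list and concatenating the results in sequence.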

Stellar’s Data Recovery Approach

Given the complexity, our first decision was to bring the entire SAN to our Class 100 cleanroom lab. Here’s why.

  • Proprietary mapping tables (thin pool allocation, chunk-to-physical mapping, parity history) are stored only on the controller. Without them, reconstructing thin-provisioned storage is guesswork (a conceptual sketch of such a mapping entry follows this list).
  • Three of the four failed disks needed advanced cleanroom repairs to be readable at all. The unusually high number of physically damaged drives was suspicious, and we could not rule out further damage elsewhere in the array, so we deemed it necessary to bring the entire setup to our lab.
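
To make the first point concrete, the controller's allocation metadata can be thought of as a table of entries along the lines of the sketch below. The field names are hypothetical; the real structures are proprietary and vendor-specific, which is precisely why the original controllers had to come to the lab with the drives.

```python
# Hypothetical shape of one thin-pool allocation entry. Real controller
# metadata is proprietary; this only illustrates why the controllers
# themselves were needed to interpret the pool.

from dataclasses import dataclass

@dataclass
class ChunkMapping:
    lun_id: int             # which LUN this virtual chunk belongs to
    virtual_chunk: int      # chunk index as seen by the LUN
    physical_disk: int      # RAID member holding the data
    physical_offset: int    # byte offset on that member
    generation: int         # rebuild/parity history marker

# Without the controller's table of such entries, mapping LUN offsets
# back to physical disk locations is guesswork.
```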

Step-by-Step SAN Data Recovery Process

  1. We imaged each drive in a write-blocked, forensically sound manner.
  2. For the three drives that suffered severe physical damage, we used our cleanroom to repair and recover readable data. For this, we performed R/W head & actuator arm swaps, firmware-level repairs, and microcode patching as needed.
  3. We connected the original SAN controllers in our lab to extract all the important metadata.
  4. Using the extracted metadata, we recreated the RAID 5 arrays in our server lab (a generic sketch of this address mapping follows the list).
  5. We analyzed the 64 KB chunk structure and carefully mapped how it lined up with the 4 KB clusters used by the NTFS file system inside the virtual machines.
  6. We wrote custom scripts to piece together files and SQL database records that were split across chunk boundaries.
  7. We rebuilt and extracted the required SQL Server databases (MDF, NDF, and LDF files).
  8. We loaded them in a secure test environment and ran consistency checks (DBCC CHECKDB) on every recovered database.
  9. We provided the client with all recovered data, a full recovery report, and detailed documentation of every step for their records.
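
Recreating the arrays in step 4 comes down to mapping each logical data strip back onto the right member disk and stripe row. The sketch below shows that mapping for a generic left-symmetric RAID 5 layout (the Linux md default), used here only as an illustration; the real array's member order, rotation, and stripe size came from the controller metadata and are not reproduced here.

```python
# Generic left-symmetric RAID 5 address mapping, shown as an illustration
# of what "recreating the array" involves. The production array's actual
# layout parameters came from the controller metadata.

def raid5_locate(n_disks, unit_index):
    """Map logical data strip `unit_index` to (member disk, stripe row, parity disk)."""
    data_per_stripe = n_disks - 1
    stripe = unit_index // data_per_stripe                    # stripe row on every member
    d = unit_index % data_per_stripe                          # position within that row
    parity_disk = (n_disks - 1 - stripe % n_disks) % n_disks  # parity rotates backwards
    disk = (parity_disk + 1 + d) % n_disks                    # data follows parity, wrapping
    return disk, stripe, parity_disk

# Example: in a 6-member group, where does logical strip 7 live?
print(raid5_locate(6, 7))
```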

Lessons Learned and Recommendations

  • RAID is not the same as a backup. Especially not RAID 5, which is vulnerable to exactly this kind of multi-disk failure. For critical data, RAID 6 or even erasure coding is safer.
  • Thin provisioning is complex. While it saves space, it complicates both everyday management and disaster recovery.
  • Controller metadata is crucial, so always preserve it if possible.

For organizations managing critical enterprise storage, partnering with a trusted specialist like Stellar Data Recovery can significantly reduce risk during high-impact SAN failures.

Not every data recovery provider is equipped to handle multi-layered failures like these, especially when thin provisioning, offset issues, and controller-dependent mapping are involved. At Stellar, it’s our blend of deep technical capability, vast donor drive library, and custom scripting that makes the impossible possible.

If you’re responsible for enterprise data and your SAN goes dark, don’t panic. Contact Stellar Data Recovery for SAN, NAS, and DAS server recovery. We understand the stakes, and we have the engineering muscle to bring your data back.


Read More Case Studies

  • Healthcare Services Provider – Ransomware and DIY Disaster: Full Recovery From Linux-Based RAID 5 Server
  • Leading Media Production House – How 58TB of Critical Video Footage Was Successfully Recovered from a RAID 0 NAS
  • Corporate User – RAID 1 Recovery After Physical Damage: Crucial SQL Data at Stake
  • Corporate User – Data Recovery From Hacked RAID-5 Server and NAS Box