Every few months, the “hardware RAID vs. software RAID” argument pops back up — because someone, somewhere, just had a storage incident and is now learning things… rapidly.
Sometimes the storage stack looks like a wedding cake designed by three different teams across ten years. And one day, it becomes your problem.
Okay, we’re not here to judge. Most of us learn from mistakes. So here’s what we found about the hardware vs. software RAID question.
RAID reduces downtime when a drive fails. It does not protect you from logical corruption, deletion, ransomware, operator mistakes, or “the wrong disk got wiped” moments. That work belongs to backups. Mix those two jobs up, and the bill eventually arrives as data loss.
With that truth out of the way, let’s move to the theory.
RAID architecture in servers combines multiple disks into a single logical unit and uses different technologies to distribute data among them, providing different levels of redundancy and performance.
A RAID controller manages a redundant array of independent disks, controlling the entire system's data distribution, redundancy, and fault tolerance. It combines the disks into one unit that the operating system can use.
RAID controllers come in two broad types: hardware RAID cards installed in the server, or software RAID configurations that use the host CPU for control.
Hardware RAID is a separate controller that manages the array itself. It often has a write-back cache and power protection (BBU/CacheVault). For the OS, it appears as one or more logical/virtual disks.
Software RAID is an array assembled by the OS or FS stack (mdadm, ZFS, Storage Spaces, etc.). Metadata is usually stored on the disks themselves, while logic and recovery are handled by the OS.
FakeRAID, or BIOS RAID, is a setup where the configuration and metadata are set in the BIOS/UEFI (or on the chipset or controller), but processing is often handled by the OS driver. This often leads to problems with portability and diagnostics.
HBA (Host Bus Adapter) is an adapter that doesn’t manage the array and doesn’t try to be smarter than the OS. Its task is to connect SAS/SATA disks (often via an expander) and pass them up as separate devices.
This gives software RAID, ZFS, or Storage Spaces direct access to the disks and their signals (errors, disk identity).
HBAs are common in teams that run ZFS pools, Ceph or other SDS solutions, or mdadm RAID, and in any scenario where redundancy and write policies must live in the OS/FS stack.
IT mode (Initiator Target mode) is a mode or firmware (more common in the LSI and Broadcom ecosystems) in which a RAID card stops behaving like a RAID controller and acts as an HBA. It passes disks through as-is, without building arrays at the controller level.
You keep using the existing hardware while gaining the transparency that ZFS/SDS needs.
Passthrough is a general term for passing a device up as-is, without processing or abstraction at the intermediate layer. In the context of disks and RAID, this means that the OS (or VM) accesses the physical disk directly, rather than a virtual disk assembled by the controller.
JBOD (Just a Bunch Of Disks), in the most general sense, is simply a set of disks without RAID logic. That means there is no mirroring/parity, and no array presented as a single object.
However, JBOD can still look like RAID to the OS and validators, depending on how the device is classified by the driver. If the controller driver returns devices as RAID class (in Windows, this is often reflected in BusType), the platform may assume that the disks are not “raw” and reject them. Some stacks specifically require “physical SATA/SAS/NVMe devices” and cut off anything that looks like RAID abstraction, even if they come one disk at a time.
Hardware RAID is not only a RAID level, but also firmware, cache, and write policies that change the behavior of the system in case of failures. Typically, inside the controller, there is a RAID engine/ASIC that calculates parity and manages queues, and firmware that decides when to consider a write complete and how to recover from errors.
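The parity math the RAID engine performs is worth seeing once. The following is a toy illustration (not any controller’s firmware, and simplified to a single stripe): RAID-5-style parity is a byte-wise XOR of the data blocks, which is why any one missing block can be rebuilt from the survivors plus parity.

```python
# Illustration only: RAID-5-style parity. One lost data block can be
# rebuilt by XOR-ing the surviving blocks with the parity block.

def parity(blocks: list[bytes]) -> bytes:
    """XOR all blocks together byte-by-byte to produce the parity block."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

def rebuild(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Recover the missing block: XOR of the survivors and the parity."""
    return parity(surviving + [parity_block])

# Three "disks" holding one stripe each, plus a parity disk.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
p = parity([d0, d1, d2])

# Disk holding d1 dies; its contents come back from the rest of the stripe.
recovered = rebuild([d0, d2], p)
assert recovered == d1
```

This is also why RAID-5/6 writes are expensive without a cache: every small write forces the engine to read, recompute, and rewrite parity.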
Write-back cache speeds up writes because the controller confirms I/O immediately after data enters DRAM and flushes it to disk a little later. This provides a noticeable performance gain on RAID-5/6 and small sync-writes, but in the event of a power loss, completed and confirmed writes may not physically be on the disks.
That’s why we have BBU or CacheVault. Their task is to protect the controller cache during a power failure — either by keeping the DRAM powered or by transferring the data to flash or NAND in time.
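The failure mode is easier to reason about as a toy model. This sketch is not any vendor’s firmware; it only shows the contract: write-back means “acknowledged from DRAM, durable later,” and the BBU decides whether that gap is survivable.

```python
# Toy model of a write-back cache (illustration, not real firmware):
# writes are acknowledged from DRAM before reaching disk, and a
# BBU/CacheVault decides whether the cache survives a power cut.

class Controller:
    def __init__(self, has_bbu: bool):
        self.has_bbu = has_bbu
        self.cache = {}   # acknowledged but not yet on disk
        self.disk = {}    # durable storage

    def write(self, lba: int, data: str) -> str:
        self.cache[lba] = data
        return "ack"      # confirmed from DRAM, not from the platters

    def flush(self):
        self.disk.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        if self.has_bbu:
            self.flush()         # protected cache reaches disk
        else:
            self.cache.clear()   # acknowledged writes simply vanish

unprotected = Controller(has_bbu=False)
unprotected.write(0, "invoice-2024")
unprotected.power_loss()
assert 0 not in unprotected.disk   # acked write is gone

protected = Controller(has_bbu=True)
protected.write(0, "invoice-2024")
protected.power_loss()
assert protected.disk[0] == "invoice-2024"
```

The application in the unprotected case received an “ack” for data that no longer exists, which is exactly the scenario BBU/CacheVault is there to close.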
The OS sees only virtual devices. Instead of /dev/sdX for each disk, it sees Virtual Disk/Logical Drive — a single block device that hides the array composition, disk order, and controller metadata.
Because of this abstraction, some telemetry may be distorted: SMART, serial numbers, temperature, and detailed media errors. Visibility depends on the controller model, driver, and passthrough mode.
In practice, this often shows up in monitoring: “everything is OK in the array,” but it’s more difficult to understand which disk is generating errors.
If the controller or firmware fails, recovery may require a compatible controller/firmware, sometimes even the same vendor generation. This is where vendor lock-in becomes a procurement problem during an incident.
The controller also constantly performs background tasks that can manifest as performance degradation or unexpected issues. Rebuilds, patrol reads, and consistency checks all generate I/O and sweep across every disk in the array.
Software RAID configurations can live at the OS level or at the file-system level.
OS-level is the classic approach, like Linux md/mdadm: the OS combines several devices into a single md device, on top of which you put ext4/xfs and use it like a “normal disk.”
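Underneath an md device, the mapping from logical offsets to physical disks is plain arithmetic. The sketch below is not mdadm’s actual metadata format, just the striping math for a RAID-0 layout, with a chunk size picked for illustration.

```python
# Striping arithmetic behind a RAID-0 md device (illustrative only,
# not mdadm's on-disk format). A logical offset maps to a
# (disk, physical offset) pair based on the chunk size.

CHUNK = 64 * 1024   # chunk size chosen for illustration
DISKS = 4

def map_offset(logical: int) -> tuple[int, int]:
    chunk_index = logical // CHUNK
    within = logical % CHUNK
    disk = chunk_index % DISKS          # chunks rotate across disks
    physical = (chunk_index // DISKS) * CHUNK + within
    return disk, physical

# First chunk lands on disk 0, the second on disk 1, and so on;
# after one pass over all disks, the layout wraps back to disk 0.
assert map_offset(0) == (0, 0)
assert map_offset(CHUNK) == (1, 0)
assert map_offset(4 * CHUNK) == (0, CHUNK)
```

Mirroring and parity levels complicate the write path, but the same idea holds: the OS owns the mapping, which is why md metadata travels with the disks.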
Windows Storage Spaces is similar in concept: there’s a storage pool, from which virtual disks with mirror/parity are cut.
FS-level means ZFS (and partially btrfs): a single mechanism covers RAID-like redundancy, volume management, and file-system features.
ZFS prefers direct access to disks because it needs control over data paths and error signals. ZFS stores checksums, verifies blocks, and — if redundancy is available — can “self-repair” by reading the correct copy and rewriting the damaged one.
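The self-repair idea can be sketched in a few lines. This is a conceptual model, not ZFS’s real checksum algorithm or on-disk layout: every block carries a checksum, and on read, a copy that fails verification is rewritten from a copy that passes.

```python
# Conceptual sketch of self-healing reads on a mirror (not ZFS's
# actual checksums or layout): store a checksum per block, verify on
# read, and repair any copy that fails verification.
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Mirror:
    def __init__(self, data: bytes):
        self.copies = [data, data]        # two-way mirror
        self.expected = checksum(data)    # checksum stored with metadata

    def corrupt(self, side: int):
        self.copies[side] = b"garbage!"   # silent on-disk corruption

    def read(self) -> bytes:
        for copy in self.copies:
            if checksum(copy) == self.expected:
                # Self-heal: rewrite every copy from the verified one.
                self.copies = [copy] * len(self.copies)
                return copy
        raise IOError("no valid copy left")

m = Mirror(b"important block")
m.corrupt(0)
data = m.read()
assert data == b"important block"
assert m.copies[0] == b"important block"  # damaged copy was rewritten
```

Note what this requires: the software must see real read errors and real devices. A hardware RAID layer that silently “fixes” or hides errors underneath defeats the mechanism.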
Software RAID and ZFS tend to shift recovery toward disk portability and metadata that lives with the pool.
Common recovery expectation: if the server dies, you move disks (or an HBA) to another host and import the array/pool.
The nuance is that “portable” still requires compatible controllers/HBAs/drivers, consistent expectations about how the storage stack identifies devices, and clear runbooks.
Modern storage platforms often want raw disks because they are responsible for distribution, fault tolerance, and integrity checks.
A practical rule for Proxmox and ZFS is to avoid placing ZFS on top of hardware RAID with its own cache management and write structure. Instead, use HBA or IT mode, where the controller simply provides the disks as they are.
The more layers control the disks, the more difficult it is to predict failure, and the more expensive recovery becomes. Let’s talk about the storage stack that keeps getting roasted in the threads.
The canonical shape is a software layer such as ZFS or LVM stacked on top of a hardware RAID virtual disk. It can look reasonable: more capacity, more flexibility, fewer moving parts “visible” to the OS. But one failure at the bottom can break everything above it, and recovery becomes a guessing game: which layer knows the truth right now?
More layers mean more weird failure modes, longer downtime, and higher odds of making things worse while trying to fix them.
A practical rule that survives contact with reality: one layer should own redundancy. Either the controller owns it (and you accept controller-centric recovery), or the OS/filesystem stack owns it (and you design for raw-disk visibility).
The “hardware vs. software RAID” conflict rarely shows up as a clean failure. More often, if your storage stack is heading into “layer cake” territory, it appears through warnings, validation errors, or operational edge cases that suddenly become production blockers.
Pick the approach you can monitor and repair with your actual team, and don’t stack two layers that both try to be smart about caching, ordering, and redundancy. Always ask yourself, “What happens when this breaks, and how ugly is recovery?”
If you don’t have a very specific constraint, default to raw disks with a software stack. Choose this if you care about data integrity, predictable recovery, and long-term operability (especially on Proxmox/TrueNAS/Linux storage boxes). Use HBA/IT mode/passthrough controller mode. OpenZFS explicitly recommends using an HBA instead of a RAID controller for reliability reasons.
Pick mdadm (OS-level software RAID) if you want simple, portable RAID without ZFS’s “whole stack” approach (e.g., ext4/xfs on top), and you still want to avoid controller lock-in. mdadm arrays carry metadata on the drives, so you can usually move the disks to another Linux box and reassemble the array (with --assemble/--scan, plus mdadm.conf hygiene). This option best fits general Linux servers, straightforward redundancy (RAID1/10), and environments where snapshots or checksumming aren’t the priority.
Pick hardware RAID if you need one of these:
- a protected write-back cache (BBU/CacheVault) for RAID-5/6 or sync-write-heavy workloads;
- a platform or OS that expects a single virtual disk and can’t run a software RAID stack;
- parity calculation offloaded from the host CPU to the controller’s RAID engine.
Whatever you choose, you still need tested restores and off-site copies.