mdadm RAID6 causing SATA errors

netthier · Posted at 11/15/2022 18:47:00

Last edited by netthier on 11/15/2022 19:11

I have 4 HDDs connected via the included SATA cables:

# lsblk
NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda    8:0    0 10.9T  0 disk
sdb    8:16   0 10.9T  0 disk
sdc    8:32   0 10.9T  0 disk
sdd    8:48   0 10.9T  0 disk

I ran the long SMART test on all drives and no errors were detected.
After creating a RAID array using

mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
The initial resync begins, but gets stuck after a few seconds.
dmesg logs are available here: https://paste.debian.net/hidden/b161164f/
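
For anyone trying to confirm the same stall, here is a rough sketch that samples the md resync position twice and reports whether it moved. It assumes the array is /dev/md0 and that the kernel exposes /sys/block/md0/md/sync_completed. The function name resync_stalled and the 10-second default are made up for illustration, not part of mdadm.

```shell
# Return 0 (stalled) if the md sync position does not change over an interval.
# $1: path to a sync_completed file (assumed: /sys/block/md0/md/sync_completed)
# $2: seconds to wait between the two samples (hypothetical default: 10)
resync_stalled() {
    file=$1
    interval=${2:-10}
    first=$(cat "$file") || return 2
    sleep "$interval"
    second=$(cat "$file") || return 2
    # "none" is assumed to mean no resync is running, so nothing can be stalled.
    [ "$first" != "none" ] && [ "$first" = "$second" ]
}

# Example (assumes /dev/md0):
#   resync_stalled /sys/block/md0/md/sync_completed 30 && echo "resync appears stuck"
```

A stalled result here only says the position did not advance; the dmesg output is still needed to see why.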

Is this a problem with the SATA cables or the on-board SATA controller? Or maybe a software issue?
I will try swapping out the cables later today to see if anything changes.

Both the board and the drives are powered by an external ATX PSU, so I doubt it's a power issue.

Note: last commit in the kernel directory is a95cf55eb5292c66c62fef90bd8d4abb5d776d17
Latest bundle I applied is v1.0.4a

netthier · Posted at 12/10/2022 21:02:02

Adding a PCIe SATA controller and connecting the drives to it worked. It seems the on-board controller is not suited for RAID.

nodisknofun · Posted at 12/12/2022 03:34:30

Hey there,

I did the same with BTRFS RAID and got errors similar to netthier's. It would be nice if someone could dig into the kernel source to fix this.

Greetings,
NoDiskNoFun

emcginnis · Posted at 12/13/2022 08:51:50

I can recreate this issue by writing to multiple SATA drives at the same time, which means it isn't related to mdadm RAID or BTRFS RAID and is instead likely an issue with the SATA port multiplier.
Example using dcfldd (https://linux.die.net/man/1/dcfldd):

sudo dcfldd if=/dev/urandom of=/dev/sda1 of=/dev/sdb1 count=100000

This freezes after transferring ~1.5GB of data.
The kernel shows the following errors:

[ 230.178075] ata1.00: failed to read SCR 1 (Emask=0x40)
[ 230.178204] ata1.01: failed to read SCR 1 (Emask=0x40)
[ 230.178251] ata1.02: failed to read SCR 1 (Emask=0x40)
[ 230.178289] ata1.03: failed to read SCR 1 (Emask=0x40)
[ 230.178335] ata1.04: failed to read SCR 1 (Emask=0x40)
[ 230.178370] ata1.05: failed to read SCR 1 (Emask=0x40)
[ 230.178406] ata1.06: failed to read SCR 1 (Emask=0x40)
[ 230.178441] ata1.07: failed to read SCR 1 (Emask=0x40)
[ 230.178476] ata1.08: failed to read SCR 1 (Emask=0x40)
[ 230.178510] ata1.09: failed to read SCR 1 (Emask=0x40)
[ 230.178544] ata1.10: failed to read SCR 1 (Emask=0x40)
[ 230.178576] ata1.11: failed to read SCR 1 (Emask=0x40)
[ 230.178609] ata1.12: failed to read SCR 1 (Emask=0x40)
[ 230.178642] ata1.13: failed to read SCR 1 (Emask=0x40)
[ 230.178675] ata1.14: failed to read SCR 1 (Emask=0x40)
[ 230.178731] ata1.01: exception Emask 0x100 SAct 0x1100000 SErr 0x0 action 0x6 frozen
[ 230.178774] ata1.01: failed command: WRITE FPDMA QUEUED
[ 230.178838] ata1.01: cmd 61/40:a0:00:08:00/05:00:00:00:00/40 tag 20 ncq dma 688128 out
[ 230.178838] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 230.178878] ata1.01: status: { DRDY }
[ 230.178910] ata1.01: failed command: WRITE FPDMA QUEUED
[ 230.178969] ata1.01: cmd 61/40:c0:40:0d:00/05:00:00:00:00/40 tag 24 ncq dma 688128 out
[ 230.178969] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 230.179002] ata1.01: status: { DRDY }
[ 230.179039] ata1.03: exception Emask 0x100 SAct 0xfeefffff SErr 0x0 action 0x6 frozen

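To see at a glance which devices behind the port multiplier are hitting these errors, something like the following could be used. It is only a sketch: summarize_ata_errors is a made-up helper name, and it only counts the "exception" and "failed command" lines in the format shown above, nothing else.

```shell
# Count "exception" and "failed command" lines per ATA link in kernel log text.
# Reads log text on stdin and prints "<link> <count>", one link per line.
summarize_ata_errors() {
    grep -oE 'ata[0-9]+\.[0-9]+: (exception|failed command)' |
        cut -d: -f1 |
        sort |
        uniq -c |
        awk '{ print $2, $1 }'
}

# Example:
#   sudo dmesg | summarize_ata_errors
```

If several links under the same port (ata1.01, ata1.03, ...) show up together, that is consistent with the problem sitting at the port multiplier rather than at a single drive.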

emcginnis · Posted at 12/13/2022 09:38:58

Disabling FBS (FIS-based switching) fixes this issue, because the driver then falls back to CBS (command-based switching).
This is not a great workaround, though, since CBS is very slow. With CBS the host can only have a transaction outstanding to one device at a time, whereas FBS lets the host interleave transactions between the devices on the port multiplier.

To disable FBS and get functional SATA ports, comment out the following in drivers/ata/ahci_platform.c:

if (of_device_is_compatible(dev->of_node, "rockchip,rk-ahci"))
	hpriv->flags |= AHCI_HFLAG_YES_FBS;
