Firefly Open Source Community


[Linux] mdadm RAID6 causing SATA errors

Posted at 11/15/2022 18:47:00  1#
  • Type: Self-Compiled Firmware
  • SDK Package Name: rk3588_linux_bsp_release_20221012_v1.0.2a.xml
  • Last Commit: 0000-00-00 00:00:00
  • Modification Content: Added CONFIG_MD=y CONFIG_BLK_DEV_MD=y CONFIG_DM_RAID=y config
  • Log: dmesg.zip
Problem description and steps to reproduce:
Last edited by netthier on 11/15/2022 19:11

I have 4 HDDs connected via the included SATA cables:
  # lsblk
  NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
  sda            8:0    0  10.9T  0 disk
  sdb            8:16   0  10.9T  0 disk
  sdc            8:32   0  10.9T  0 disk
  sdd            8:48   0  10.9T  0 disk
I ran the long SMART test on all drives and no errors were detected.
I then created a RAID 6 array using:
  mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
The initial resync begins but gets stuck after a few seconds.
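One way to confirm that the resync really is stuck, rather than just slow, is to sample the resync position from /proc/mdstat twice and compare. The sketch below assumes the usual mdstat progress line of the form `resync = 0.1% (done/total) ...`; the function names `resync_position` and `resync_stalled` are made up for illustration.

```shell
# Print the number of blocks already processed by a running
# resync/recovery/reshape/check, read from /proc/mdstat text on stdin.
# Prints nothing when no such operation is in progress.
resync_position() {
    sed -n -E 's/.*(resync|recovery|reshape|check) *= *[0-9.]+% \(([0-9]+)\/[0-9]+\).*/\2/p'
}

# Succeed when two non-empty samples are identical, which suggests
# that the operation did not advance between them.
resync_stalled() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage on a system with a single md array (the interval is arbitrary):
#   before=$(resync_position < /proc/mdstat)
#   sleep 30
#   after=$(resync_position < /proc/mdstat)
#   resync_stalled "$before" "$after" && echo "resync appears stuck at block $after"
```

With more than one array resyncing, `resync_position` prints one line per array, so the comparison would need to be done per array.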
dmesg logs are available here: https://paste.debian.net/hidden/b161164f/

Is this a problem with your SATA cables or the on-board SATA controller? Or maybe a software issue?
I will try swapping out the cables later today to see if anything changes.

Both the board and the drives are powered by an external ATX PSU, so I doubt it's a power issue.

Note: last commit in the kernel directory is a95cf55eb5292c66c62fef90bd8d4abb5d776d17
Latest bundle I applied is v1.0.4a



Posted at 12/10/2022 21:02:02  2#
Adding a PCIe SATA controller and connecting the drives to it worked, so it seems the on-board controller is not suited for RAID.

Posted at 12/12/2022 03:34:30  3#
Hey there,

I did the same with Btrfs RAID and got errors similar to netthier's. It would be nice if someone dug into the kernel source to fix this.

Greetings,
NoDiskNoFun

Posted at 12/13/2022 08:51:50  4#
I can reproduce this issue by writing to multiple SATA drives at the same time, which means it isn't related to RAID or Btrfs RAID and is instead likely an issue with the SATA port multiplier.
Example using dcfldd (https://linux.die.net/man/1/dcfldd):
  sudo dcfldd if=/dev/urandom of=/dev/sda1 of=/dev/sdb1 count=100000
This freezes after transferring ~1.5GB of data.
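For anyone without dcfldd, roughly the same concurrent-write load could be produced with plain dd running in the background, one process per target. This is only a sketch under assumptions: it relies on GNU dd (for `iflag=fullblock`), the function name `parallel_write` is made up, and the size is arbitrary. It overwrites whatever it is pointed at, so only use scratch partitions or files.

```shell
# Write COUNT_MIB MiB of random data to every given target at the same time.
# WARNING: destructive, the targets are overwritten.
# Usage: parallel_write COUNT_MIB TARGET...
parallel_write() {
    count_mib=$1
    shift
    pids=""
    for target in "$@"; do
        # One background dd per target so the writes overlap.
        dd if=/dev/urandom of="$target" bs=1M count="$count_mib" \
            iflag=fullblock conv=fsync 2>/dev/null &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        # Collect each writer's exit status; any failure fails the whole run.
        wait "$pid" || status=1
    done
    return "$status"
}

# Example matching the two partitions used above (run as root):
#   parallel_write 2048 /dev/sda1 /dev/sdb1
```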
When the write freezes, the kernel logs the following errors:
  [  230.178075] ata1.00: failed to read SCR 1 (Emask=0x40)
  [  230.178204] ata1.01: failed to read SCR 1 (Emask=0x40)
  [  230.178251] ata1.02: failed to read SCR 1 (Emask=0x40)
  [  230.178289] ata1.03: failed to read SCR 1 (Emask=0x40)
  [  230.178335] ata1.04: failed to read SCR 1 (Emask=0x40)
  [  230.178370] ata1.05: failed to read SCR 1 (Emask=0x40)
  [  230.178406] ata1.06: failed to read SCR 1 (Emask=0x40)
  [  230.178441] ata1.07: failed to read SCR 1 (Emask=0x40)
  [  230.178476] ata1.08: failed to read SCR 1 (Emask=0x40)
  [  230.178510] ata1.09: failed to read SCR 1 (Emask=0x40)
  [  230.178544] ata1.10: failed to read SCR 1 (Emask=0x40)
  [  230.178576] ata1.11: failed to read SCR 1 (Emask=0x40)
  [  230.178609] ata1.12: failed to read SCR 1 (Emask=0x40)
  [  230.178642] ata1.13: failed to read SCR 1 (Emask=0x40)
  [  230.178675] ata1.14: failed to read SCR 1 (Emask=0x40)
  [  230.178731] ata1.01: exception Emask 0x100 SAct 0x1100000 SErr 0x0 action 0x6 frozen
  [  230.178774] ata1.01: failed command: WRITE FPDMA QUEUED
  [  230.178838] ata1.01: cmd 61/40:a0:00:08:00/05:00:00:00:00/40 tag 20 ncq dma 688128 out
  [  230.178838]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
  [  230.178878] ata1.01: status: { DRDY }
  [  230.178910] ata1.01: failed command: WRITE FPDMA QUEUED
  [  230.178969] ata1.01: cmd 61/40:c0:40:0d:00/05:00:00:00:00/40 tag 24 ncq dma 688128 out
  [  230.178969]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
  [  230.179002] ata1.01: status: { DRDY }
  [  230.179039] ata1.03: exception Emask 0x100 SAct 0xfeefffff SErr 0x0 action 0x6 frozen
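To see at a glance which devices behind the port multiplier had commands fail, a log like the one above can be reduced to a per-device count. This is a minimal sketch that only looks at the "failed command" lines in the format shown; the function name `summarize_ata_failures` is made up.

```shell
# Read kernel log text on stdin and print "<device> <count>" for every
# ataX.YY device that has at least one "failed command" line.
summarize_ata_failures() {
    awk '
        /ata[0-9]+\.[0-9]+: failed command:/ {
            # Find the "ataX.YY:" field and strip the trailing colon.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^ata[0-9]+\.[0-9]+:$/) {
                    dev = $i
                    sub(/:$/, "", dev)
                    count[dev]++
                }
            }
        }
        END {
            for (dev in count) print dev, count[dev]
        }
    ' | sort
}

# Usage on a live system (dmesg may need root):
#   dmesg | summarize_ata_failures
```

If several devices show failures at the same timestamp, as in the excerpt above, that points at the shared link or controller rather than at a single drive.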





Posted at 12/13/2022 09:38:58  5#
Disabling FBS (FIS-based switching) fixes this issue by falling back to CBS (command-based switching).
But this is not a great workaround, since CBS is much slower: with CBS the host can only have a transaction in flight to one device at a time, whereas FBS lets the host interleave transactions between the devices on the port multiplier.

To disable FBS and get functional SATA ports, comment out the following in drivers/ata/ahci_platform.c:

        if (of_device_is_compatible(dev->of_node, "rockchip,rk-ahci"))
                hpriv->flags |= AHCI_HFLAG_YES_FBS;