Firefly Open Source Community


[Linux] mdadm RAID6 causing SATA errors


Posted at 11/15/2022 18:47:00 | Views: 2264 | Replies: 5 | 1#
  • Type: Self-Compiled Firmware
  • SDK Package Name: rk3588_linux_bsp_release_20221012_v1.0.2a.xml
  • Last Commit: 0000-00-00 00:00:00
  • Modification Content: Added CONFIG_MD=y CONFIG_BLK_DEV_MD=y CONFIG_DM_RAID=y config
  • Log: dmesg.zip
Problem description and steps to reproduce:
Last edited by netthier on 11/15/2022 19:11

I have 4 HDDs connected via the included SATA cables:
    # lsblk
    NAME         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda            8:0    0  10.9T  0 disk
    sdb            8:16   0  10.9T  0 disk
    sdc            8:32   0  10.9T  0 disk
    sdd            8:48   0  10.9T  0 disk
I ran the long SMART test on all drives and no errors were detected.
After creating a RAID array using
    mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sda /dev/sdb /dev/sdc /dev/sdd
the initial resync begins, but gets stuck after a few seconds.
dmesg logs are available here: https://paste.debian.net/hidden/b161164f/
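
One way to confirm that the resync is really stalled rather than just slow is to sample the progress line in /proc/mdstat twice. A minimal sketch, assuming the usual mdstat layout in which a rebuilding array shows a line such as "resync =  0.1% (...)"; the function name resync_progress is made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: print the sync action and percentage found in
# /proc/mdstat-style text on stdin, e.g. "resync 0.1%".
# Assumes the progress line looks like "... resync =  0.1% (done/total) ...".
resync_progress() {
    awk 'match($0, /(resync|recovery) *= *[0-9.]+%/) {
             s = substr($0, RSTART, RLENGTH)
             gsub(/[ =]+/, " ", s)   # collapse " =  " into a single space
             print s
         }'
}
```

Running `resync_progress < /proc/mdstat` twice, a minute or so apart, and getting the same percentage both times would suggest the resync has stopped making progress.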

Is this a problem with your SATA cables or the on-board SATA controller? Or maybe a software issue?
I will try swapping out the cables later today to see if anything changes.

Both the board and the drives are powered by an external ATX PSU, so I doubt it's a power issue.

Note: last commit in the kernel directory is a95cf55eb5292c66c62fef90bd8d4abb5d776d17
Latest bundle I applied is v1.0.4a



Attachment: dmesg.zip (2.14 KB)

Posted at 12/10/2022 21:02:02 | 2#
Adding a PCIe SATA controller and connecting the drives to it worked. It seems the on-board controller is not suited for RAID.

Posted at 12/12/2022 03:34:30 | 3#
Hey there,

I did the same with Btrfs RAID and got errors similar to netthier's. It would be nice if someone could dig into the kernel source to fix this.


Greetings,
NoDiskNoFun

Posted at 12/13/2022 08:51:50 | 4#
I can reproduce this issue by writing to multiple SATA drives at the same time, which means it isn't specific to md RAID or Btrfs RAID and is instead likely an issue with the SATA port multiplier.
Example using dcfldd (https://linux.die.net/man/1/dcfldd):
    sudo dcfldd if=/dev/urandom of=/dev/sda1 of=/dev/sdb1 count=100000
This freezes after transferring ~1.5 GB of data.
The kernel shows the following errors:
    [  230.178075] ata1.00: failed to read SCR 1 (Emask=0x40)
    [  230.178204] ata1.01: failed to read SCR 1 (Emask=0x40)
    [  230.178251] ata1.02: failed to read SCR 1 (Emask=0x40)
    [  230.178289] ata1.03: failed to read SCR 1 (Emask=0x40)
    [  230.178335] ata1.04: failed to read SCR 1 (Emask=0x40)
    [  230.178370] ata1.05: failed to read SCR 1 (Emask=0x40)
    [  230.178406] ata1.06: failed to read SCR 1 (Emask=0x40)
    [  230.178441] ata1.07: failed to read SCR 1 (Emask=0x40)
    [  230.178476] ata1.08: failed to read SCR 1 (Emask=0x40)
    [  230.178510] ata1.09: failed to read SCR 1 (Emask=0x40)
    [  230.178544] ata1.10: failed to read SCR 1 (Emask=0x40)
    [  230.178576] ata1.11: failed to read SCR 1 (Emask=0x40)
    [  230.178609] ata1.12: failed to read SCR 1 (Emask=0x40)
    [  230.178642] ata1.13: failed to read SCR 1 (Emask=0x40)
    [  230.178675] ata1.14: failed to read SCR 1 (Emask=0x40)
    [  230.178731] ata1.01: exception Emask 0x100 SAct 0x1100000 SErr 0x0 action 0x6 frozen
    [  230.178774] ata1.01: failed command: WRITE FPDMA QUEUED
    [  230.178838] ata1.01: cmd 61/40:a0:00:08:00/05:00:00:00:00/40 tag 20 ncq dma 688128 out
    [  230.178838]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    [  230.178878] ata1.01: status: { DRDY }
    [  230.178910] ata1.01: failed command: WRITE FPDMA QUEUED
    [  230.178969] ata1.01: cmd 61/40:c0:40:0d:00/05:00:00:00:00/40 tag 24 ncq dma 688128 out
    [  230.178969]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
    [  230.179002] ata1.01: status: { DRDY }
    [  230.179039] ata1.03: exception Emask 0x100 SAct 0xfeefffff SErr 0x0 action 0x6 frozen
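
To see at a glance which devices behind the port multiplier are affected, the failed commands in a log like the one above can be tallied per ATA link. A rough sketch, assuming kernel messages of the form "ataN.MM: failed command: ..." as in the excerpt; count_failed_commands is a name invented here:

```shell
#!/bin/sh
# Hypothetical helper: count "failed command" lines per ATA link
# (e.g. ata1.01) in kernel log text read from stdin.
count_failed_commands() {
    awk '/ata[0-9]+\.[0-9]+: failed command:/ {
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+\.[0-9]+:$/) {
                     sub(/:$/, "", $i)   # drop the trailing colon
                     n[$i]++
                 }
         }
         END { for (link in n) print link, n[link] }' | sort
}
```

It can be fed live output with `dmesg | count_failed_commands`. Given only the excerpt above it prints `ata1.01 2`, since the excerpt is cut off before any failed command for ata1.03 is listed.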





Posted at 12/13/2022 09:38:58 | 5#
Disabling FBS (FIS-based switching) fixes this issue, because the driver then falls back to CBS (command-based switching).
But this is not a great workaround, since CBS is much slower. With CBS the host can only have commands outstanding to one device at a time, whereas FBS lets the host interleave transactions between all devices on the port multiplier.

To disable FBS and get functional SATA ports, comment out the following in drivers/ata/ahci_platform.c:

    if (of_device_is_compatible(dev->of_node, "rockchip,rk-ahci"))
            hpriv->flags |= AHCI_HFLAG_YES_FBS;
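
After rebuilding the kernel, it may be worth checking which switching mode the driver actually ended up with. A small sketch, assuming the AHCI driver logs "FBS is enabled" or "FBS is disabled" when a port multiplier is attached (mainline libahci does; whether the Rockchip BSP kernel prints the same text is an assumption); fbs_state is a made-up name:

```shell
#!/bin/sh
# Hypothetical helper: report the last FBS state mentioned in kernel log
# text on stdin. Assumes the messages "FBS is enabled" / "FBS is disabled";
# prints "not reported" when neither message is present.
fbs_state() {
    awk '/FBS is enabled/  { state = "enabled" }
         /FBS is disabled/ { state = "disabled" }
         END { print (state == "" ? "not reported" : state) }'
}
```

Used as `dmesg | fbs_state`. With the two lines above commented out, anything other than "enabled" would be consistent with the port running in CBS mode.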

Posted at 6/21/2024 01:04:27 | 6#
Hi everyone. I had the same problem on four 3588ITX boards. They were very unstable with SATA drives. The problem was the firmware of the JMB575. After flashing the EEPROM next to the JMB575 with working firmware, everything is OK now.