Fix for abysmal network (and PCIe) performance


  • Hello,

    I've been spending several days trying to figure out why network performance was so low on the dev kit (Q80-26). I noticed that even as little as 1 Gbps of traffic over the LAN port was enough to keep several cores busy in ksoftirqd, which made no sense. Today I needed to run some tests with a 100G NIC, so I started to play with it, only to notice exactly the same performance limitation as on the on-board port: ~1 Mpps in each direction, no more, with possibly all cores at full throttle.

    The kernel was up to date for Ubuntu 22.04 (5.15.0-78), and as usual on Ubuntu the perf tool doesn't work, so I rebuilt it and could see that almost all the CPU time was spent on an "msr daif, x3" instruction in the kernel, in arm_smmu_cmdq_issue_cmdlist(). I tried to re-enable bypass by passing "arm_smmu_v3.disable_bypass=n" to the kernel, but it had no effect. I suspected it was ignored by 5.15, so I upgraded to 6.2 (which lacks support for the Intel ice board that is present in 5.15). I also had to switch from the Intel board to a Mellanox ConnectX-6. It faced exactly the same problem and started spewing AER errors due to ASPM. I disabled ASPM, which silenced all the messages, but the performance remained abysmal at exactly the same level. Finally I found that arm64 supports "iommu.passthrough=1", and that was it: now I'm doing 9.2 Mpps in both directions! Oh, and pulling 25 Gbps of HTTP traffic out of it now takes 1.2% of CPU, i.e. just one core. Much better :-)
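
    In case it helps someone reproduce this kind of diagnosis, here is roughly how the profiling step can be done (a sketch only, assuming you have a perf binary matching your running kernel; the 10-second duration is arbitrary):

        # sample all CPUs system-wide while the traffic is running
        sudo perf record -a -g -- sleep 10
        sudo perf report --sort symbol

        # or watch the hot symbols live (e.g. arm_smmu_cmdq_issue_cmdlist)
        sudo perf top -g

    If a single SMMU-related symbol dominates the profile while the cores are saturated, you are very likely hitting the same issue.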

    For those interested, and who have to deal with PCIe devices (mainly network), just be aware that you'll absolutely need to enable IOMMU passthrough or the device will basically be unusable. Even the NVMe SSDs now deliver 2.6 GB/s versus ~550 MB/s before! To save time for those experimenting, here's the GRUB cmdline I'm using (in /etc/default/grub on Ubuntu 22.04):

    GRUB_CMDLINE_LINUX_DEFAULT="mitigations=off net.ifnames=0 biosdevname=0 arm_smmu_v3.disable_bypass=n pcie_aspm=off iommu.passthrough=1"
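
    To apply and verify the change, the usual Ubuntu procedure is enough (nothing specific to this board):

        sudo update-grub && sudo reboot
        # after reboot, confirm the options were taken into account
        cat /proc/cmdline
        # the kernel should also report the default IOMMU domain as passthrough
        sudo dmesg | grep -i 'default domain type'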

    In the hope it helps others facing similar trouble...



  • Thank you very much, awesome work! 💯 

    I am considering using this platform to build an all-NVMe/flash SAN. Same as you, I'm going the Mellanox route with 25 Gbps interfaces.

    I was going to use a PCI Express to M.2 splitter in one of the x16 slots and the Mellanox in the other x16 slot, and maybe a few other M.2 drives in the x4.

    Do you have an opinion about the soundness of this plan? Also, have you played around with IRQ handling?




    This did not fix everything for me. I use an x16 splitter to 4 x4 lanes for M.2 drives (this thing: https://www.asus.com/nl/motherboards-components/motherboards/accessories/hyper-m-2-x16-gen-4-card/ ).

    In the BIOS -> Platform manager -> PCIe Root complex -> Root complex 1 (that is the upper x16 slot on the motherboard), I've set it to 4x4x4x4, which makes all the disks show up, by the way. I've set the speeds for all of them to GEN4 (which you can do in the same screen). However I get sub-optimal performance: my disk is able to write at 5000 MB/s, but I only manage 1000 MB/s.

    lspci -vvv shows:

    000d:04:00.0 Non-Volatile memory controller: Sandisk Corp Western Digital WD Black SN850X NVMe SSD (rev 01) (prog-if 02 [NVM Express])
        LnkCap: Port #0, Speed 16GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
        LnkSta: Speed 8GT/s (downgraded), Width x4
     
    Only the first of the 4 M.2 slots runs at the full 16GT/s (not downgraded).

    If you set it to GEN4, you have to disable ASPM (pcie_aspm=off), or else it will spew errors.
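
    For reference, the link downgrade and the ASPM situation can be checked from a running system (the 000d:04:00.0 address is taken from the lspci output above; adjust it for your own devices):

        # LnkSta shows 'downgraded' when the link trains below what LnkCap advertises
        sudo lspci -vv -s 000d:04:00.0 | grep -E 'LnkCap:|LnkSta:'
        # the kernel's current ASPM policy, if ASPM support is compiled in
        cat /sys/module/pcie_aspm/parameters/policy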

    I also "enabled" arm_smmu_v3.disable_bypass=n and iommu.passthrough=1, but for me this doesn't bring the performance up to what you're seeing.
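
    One thing that may be worth double-checking is whether the passthrough actually took effect for these devices: on reasonably recent kernels each IOMMU group exposes its domain type in sysfs, and the groups holding the NVMe drives should report "identity" once iommu.passthrough=1 is active (generic paths, nothing board-specific):

        for g in /sys/kernel/iommu_groups/*; do
            echo "group $(basename $g): $(cat $g/type) - $(ls $g/devices)"
        done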

    If anyone has got this working, I'm all ears :)


    @Jose Luis Pedrosa I have no particular opinion; I'm mostly into networking and not at all into storage, so my experience there doesn't count. Regarding IRQs, I initially tried a bit, but that was not my problem here.

     


    @Derek den Haas I don't know what the capabilities of the PCIe controller in the SoC are regarding splitting; maybe it cannot sustain Gen4 for too many devices and has to downgrade, I really don't know. What's important for your performance issue, however, is to check whether you're seeing CPU saturation or not. If the CPU is saturated, you may be facing something comparable to the IOMMU issue, which makes the CPU stall during transfers. However, if your CPUs are twiddling their thumbs, or if the performance remains the same when you disable half of them, it clearly means you're not facing this problem but are more exposed to some bus performance issue. You may also want to see if playing with the PCIe MaxPayload settings changes anything for you, as a large value (e.g. 4k) may be effective for storage.
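
    A rough way to check both points from the shell (mpstat comes from the sysstat package; the lspci selector is just an example, point it at your NVMe or NIC):

        # per-CPU utilisation while the disk test is running; look for cores pegged in %sys or %soft
        mpstat -P ALL 1
        # supported (DevCap) vs. currently programmed (DevCtl) maximum payload size
        sudo lspci -vv -s 000d:04:00.0 | grep MaxPayload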

     


  • Hi Willy,

    I checked the Ampere Altra Max manual, and it should be able to split into 4x4x4x4 per root complex, in Gen4 mode:
    https://d1o0i0v5q5lp8h.cloudfront.net/ampere/live/assets/documents/Altra_Rev_A1_DS_v1.30_20220728.pdf
    https://amperecomputing.com/en/briefs/ampere-altra-family-product-brief

    The CPU is almost idle when doing I/O tests (dd / fio). I will check the MaxPayload setting and see whether it helps. Since I only need 4 NVMe drives, I also ordered 4 separate PCIe to M.2 (Gen4) cards, to see whether I get full performance when populating 4 PCIe ports on the motherboard with default settings on the root complexes.
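
    For comparison, a typical raw sequential-read fio invocation looks like this (the device name is a placeholder; note that a --rw=write run against a raw device destroys its contents, so do write tests only on a disposable disk or a file):

        sudo fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=1M \
            --iodepth=32 --ioengine=libaio --direct=1 --runtime=30 --time_based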

    P.S.: using Linux 6.4.latest (though we have 6.5 now).

    P.S.: the "BIOS" also suggests it should be able to reach GEN4 speeds (since you can assign it, I guess per 2 lanes, since there are 8 boxes).

